Terminology Glosses: Text expansion and Million Short

A few years ago, while reading the Italian translation of Lapidarium, a book by Polish journalist and writer Ryszard Kapuściński (1932-2007), I underlined this sentence: “The description of the boundless and wide-open Russian landscape demanded a broader phrasing.” Even though I don’t remember much of the book besides the elegant prose and the poetic descriptions, my mind goes back to that image every time someone tells me that Russian translations are longer than the original texts.

Text expansion

The phenomenon of text expansion is rather common in translation. An online search shows that authors of various documents on the topic generally agree that, for documents translated from English into Slavic languages, the expansion rate ranges from 10% to 15%, and sometimes more, depending on the language, with a peak of 20% to 30% for Polish. (I have used data from the table published at www.andiamo.co.uk/resources/expansion-and-contraction-factors.) The opposite phenomenon, called text contraction, occurs when translating into languages such as Chinese, which belongs to the Sino-Tibetan family.
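As a rough illustration of what these percentages mean in practice, the sketch below projects the length of a translation from the length of its source. The function name and the 1,000-word source are assumptions made for the example; only the rates come from the figures quoted above.

```python
def projected_length(source_words: int, rate_percent: float) -> int:
    """Estimate target length in words for a given expansion rate in percent.

    A negative rate models text contraction.
    """
    return round(source_words * (100 + rate_percent) / 100)


# Hypothetical 1,000-word English source, using the rates quoted above.
source_words = 1000
for label, rate in [("Slavic, low", 10), ("Slavic, high", 15), ("Polish, peak", 30)]:
    print(f"{label}: about {projected_length(source_words, rate)} words")
```

Actual rates vary with the text type and the translator, so a projection of this kind is only a planning aid, for instance when estimating layout space.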

Text expansion and contraction happen for several reasons. Language structure is an important one: some languages are syllabic and have longer words than the source language, whereas others may combine several source language words into one target language word. Some texts are also more descriptive in nature than others, and their translation may require more words. For instance, a technical text is by definition specific, whereas a work of literature may require longer wording for an effective rendering. The writing style of the translator also has an impact on the length of the target text.

Text expansion is an interesting term, and a challenge for terminologists, mainly because it is polysemic: a single form that carries different meanings. When I entered “text expansion” in Google, the first results referred to software tools, which PC World explains as follows: “These programs let you insert commonly used phrases and other chunks of text using shortcuts so you don’t have to type out every character.” However, the definition I was looking for was rather “the increase in length a text undergoes when translated from a language into another.” A good fit for our ideal termbase, “text expansion” will be the subject of two entries in English. Both of them will have it as their term, but their definition, domain, source of term, context and so on will differ. The use of precise metadata is also recommended. Metadata are key elements in a termbase, as they allow terminologists to classify a term correctly, not only to give the right indications to authors and translators, but also for retrieval purposes: the more accurate the metadata, the easier it is to retrieve specific sets of terms when the need arises.
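The two-entry approach can be sketched as a small data structure. This is a minimal illustration rather than the format of any real termbase tool: the field names, the domain labels and the lookup helper are all assumptions, and the second definition paraphrases the PC World description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermEntry:
    term: str
    definition: str
    domain: str   # metadata used for classification and retrieval
    source: str   # where the term or definition was found


# Two entries share the same term but differ in definition and metadata.
termbase = [
    TermEntry(
        term="text expansion",
        definition="the increase in length a text undergoes when translated "
                   "from a language into another",
        domain="translation",          # assumed domain label
        source="author's definition",  # assumed source label
    ),
    TermEntry(
        term="text expansion",
        definition="inserting commonly used phrases and other chunks of text "
                   "using shortcuts",
        domain="software",             # assumed domain label
        source="PC World",
    ),
]


def lookup(entries, term, domain=None):
    """Return entries for a term, optionally narrowed by domain metadata."""
    return [
        e for e in entries
        if e.term == term and (domain is None or e.domain == domain)
    ]


print(len(lookup(termbase, "text expansion")))
print(lookup(termbase, "text expansion", domain="translation")[0].definition)
```

Searching by term alone returns both entries; adding the domain narrows the result to the sense a translator actually needs, which is the practical payoff of accurate metadata.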

Million Short

When I do my research in a browser, the search engine frequently shows the usual websites on the first page. Every time that happens, I also switch to a different search engine. Million Short, still in its beta version, is able to remove the first 100, 1,000, 10,000, 100,000 or even one million websites from the search results, depending on the option the user selects, thus “skipping the popular websites that are being pushed our way in favor of new ones with original content,” as millionshort.com puts it. Recently, I used Million Short to run a search for machine translation (MT) online, and this is the sorted list of results: Google Translate, Microsoft Translator, DeepL, Tilde, SystranNet, WorldLingo Free Translation online, Babylon NG and a few others.
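The idea of removing the most popular websites can be pictured as a filter over ranked results. The sketch below is not Million Short's actual implementation, which is not described here; the function, the popularity ranks and the domains are all invented for illustration.

```python
def skip_popular(results, popularity_rank, top_n):
    """Drop results whose site is among the top_n most popular sites.

    results: list of (domain, title) pairs in search order.
    popularity_rank: dict mapping domain -> rank (1 = most popular).
    Sites with no known rank are kept.
    """
    return [
        (domain, title)
        for domain, title in results
        if popularity_rank.get(domain, top_n + 1) > top_n
    ]


# Hypothetical data: ranks and domains are made up for the example.
popularity_rank = {"bigsite.example": 3, "midsite.example": 5000}
results = [
    ("bigsite.example", "Popular page"),
    ("midsite.example", "Less popular page"),
    ("tiny.example", "Obscure page"),
]

print(skip_popular(results, popularity_rank, top_n=100))
print(skip_popular(results, popularity_rank, top_n=10_000))
```

Raising the threshold removes more of the well-known sites, so less visible pages move up the list.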

In the last few years, MT has made enormous progress. Some companies already use neural MT and machine learning for better results. We also know that literature represents a bigger challenge for machine translation than a technical text, where larger corpora are available and a lot revolves around standardized terminology. Out of curiosity, though, I tried to translate the sentence by Kapuściński I mentioned above with all of those tools. This is my source language text, translated into Italian by Vera Verdiani: “La descrizione dello sconfinato e disteso paesaggio russo richiedeva un periodare più ampio.” In addition to being a sentence taken from a literary text, it has the peculiarities of being itself a translation and of containing one polysemic word, “disteso,” and the seldom-used, very specific verb “periodare” (“to craft sentences”), which is a false friend of the English word “period.” The results were not unexpected. They were as follows:

Example 1, DeepL:

“The description of the boundless and lying Russian landscape required a longer period.”

Example 2, Google Translate:

“The description of the boundless and relaxed Russian landscape required a broader period.”

Example 3, Microsoft Translator:

“The description of the endless and sprawling Russian landscape required a broader periodate.”

Example 4, SystranNet and Babylon NG:

“The description of the boundless and distended Russian landscape required a periodare wider.”

Example 5, WorldLingo Free Translation online:

“The description of the sconfinato and extended Russian landscape demanded to periodare more wide one.”

Even though many of them selected one of the possible nuances of disteso, all of them have a problem with periodare. Unfortunately, I do not have the original Polish sentence, and I could not try Tilde, the MT software that specializes in Slavic languages. Also, going from Italian to English, I was expecting the text to contract. Instead, I got a very similar morphosyntactic organization of the text in both languages for all the tools, which implied, in this case, a very similar number of words.
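The observation about length can be checked with a simple word count. The sketch below counts whitespace-separated tokens, which is a simplification of what a word is, and compares each output with the Italian source; the function names and the shortened tool labels are assumptions, while the sentences are the ones quoted above.

```python
source = ("La descrizione dello sconfinato e disteso paesaggio russo "
          "richiedeva un periodare più ampio.")

outputs = {
    "DeepL": "The description of the boundless and lying Russian landscape "
             "required a longer period.",
    "Google Translate": "The description of the boundless and relaxed Russian "
                        "landscape required a broader period.",
    "Microsoft Translator": "The description of the endless and sprawling "
                            "Russian landscape required a broader periodate.",
    "SystranNet / Babylon NG": "The description of the boundless and distended "
                               "Russian landscape required a periodare wider.",
    "WorldLingo": "The description of the sconfinato and extended Russian "
                  "landscape demanded to periodare more wide one.",
}


def word_count(text):
    """Count whitespace-separated tokens (a rough proxy for words)."""
    return len(text.split())


def expansion_rate(source_text, target_text):
    """Percentage change in word count from source to target."""
    src = word_count(source_text)
    return (word_count(target_text) - src) / src * 100


print(f"Source: {word_count(source)} words")
for tool, text in outputs.items():
    rate = expansion_rate(source, text)
    print(f"{tool}: {word_count(text)} words ({rate:+.1f}%)")
```

By this count, the source and the first four outputs have 13 words each, while the WorldLingo output has 15, so none of the tools produced a shorter text.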
