Toward global machine translation

Arvi Hurskainen

Arvi Hurskainen is professor emeritus at the University of Helsinki, Finland. He has been working on rule-based language technology since 1985, specializing in morphologically complex languages such as Swahili and Finnish, and has developed a number of applications in these languages.

Machine translation (MT) technology can be divided broadly into two very different approaches. The older technology relies on close integration of human knowledge about language on the one hand and the calculation speed and power of the computer on the other. For a long time, this approach was considered the only feasible one. However, the advent of the internet and the increased power of computers made a very different approach to MT possible. This development led to various statistical approaches, in which machine learning is key. The latest versions of the statistical approaches are known as neural machine translation. While it is difficult to fully understand how such a system works, its development currently absorbs almost all resources in the study of machine translation.

The current situation is very unfortunate, because only about 1% of the world’s languages fulfill even the minimal criteria for statistical or neural machine translation. This trend pushes the majority of languages into a marginal position, because a language that is not computerized, as the customary expression runs, is not respected.

Most languages are doubly disadvantaged. The current trend in MT does not suit them, as they do not have the needed language resources such as parallel text corpora. The other disadvantage is that they are not commercially attractive, which means that market forces push them into the margins.

However, those languages can still be included in the language technology market. The rule-based approach to machine translation does not exclude any language. A language can be described computationally even on the basis of speech recorded in a field situation without the internet.

Grand aim of language technology

The grand aim of language technology is to develop a system where two language users with no mutual language competence can communicate through an MT system. Ultimately, this system would join all people into a single communication network with no language competence barriers.

Currently we are still very far from this aim. Attempts to construct such multilingual translation networks often include mere tens of languages, with further languages involved at varying degrees of coverage and accuracy. There is often much enthusiasm among the system developers, but in this they behave as though they were the leaders of religious groups. Each group leader is convinced that his (they are often male) message is the only one that leads to salvation. These leaders go on mission tours and try to win followers for their flock. I myself, having worked for decades in language technology on such languages as Finnish, English, Swahili and other Bantu languages, have been approached by these sect leaders, but so far with poor results. Not only do I have my own approach; I also see this sect approach as unnecessary.

Interlingua as mediator

Why should we force all language technology developers into a single approach if we can include all approaches in a joint global translation system? By using a clearly defined interlingua as a mediator between two languages, we can construct a global language translation system where no language and no approach is excluded.

The idea of using an interlingua is in no way new. What is new is the use of a normalized version of English as the interlingua, instead of Esperanto or other invented languages. This approach is also independent of translation methods. The only thing that matters is the capability to produce high-quality translation between the non-English language and this normalized English, in both directions. A good translation result may be produced by any system, including statistical and neural methods.
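The method-independence of this design can be sketched in a few lines of Python. Each language contributes two translators, one into and one out of the normalized English, and any language pair is served by composing them. The names `Translator`, `PivotHub` and `lookup`, and the one-word toy tables, are hypothetical placeholders for illustration and do not belong to any existing system.

```python
from typing import Callable, Dict

# A translator is any function from text to text. How it works internally
# (rules, statistics, neural network) is irrelevant to the hub.
Translator = Callable[[str], str]


class PivotHub:
    """Routes translation between registered languages via the interlingua."""

    def __init__(self) -> None:
        self._to_pivot: Dict[str, Translator] = {}
        self._from_pivot: Dict[str, Translator] = {}

    def register(self, lang: str, to_pivot: Translator, from_pivot: Translator) -> None:
        self._to_pivot[lang] = to_pivot
        self._from_pivot[lang] = from_pivot

    def translate(self, text: str, source: str, target: str) -> str:
        pivot_text = self._to_pivot[source](text)      # language A -> interlingua
        return self._from_pivot[target](pivot_text)    # interlingua -> language C


def lookup(table: Dict[str, str]) -> Translator:
    """Toy stand-in for a real translation system: a fixed lookup table."""
    return lambda text: table[text]


hub = PivotHub()
hub.register("swa", lookup({"Habari": "Hello"}), lookup({"Hello": "Habari"}))
hub.register("fin", lookup({"Hei": "Hello"}), lookup({"Hello": "Hei"}))
print(hub.translate("Habari", "swa", "fin"))  # prints "Hei"
```

Adding a new language to such a hub requires only its own pair of translators; nothing already registered has to change.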

Interlingua can have various levels of abstraction

The big players in MT use the normal surface form of English as interlingua in multilingual translation. This is the simplest solution, because translation systems between a language and English are needed in any case. It is straightforward to use that version of English as interlingua.

English as it is used in the media today is far from an ideal interlingua. It contains frequent ambiguities, which are due to omitting clause boundary marks and even leaving out relative pronouns and other subclause markers. Also, the use of the gerund form of verbs instead of the infinitive causes unnecessary ambiguity.

When translation is done from a non-English language into English, these anomalies can be avoided. If the source language explicitly encodes features that would be ambiguous in normal English, these codes can be used to avoid ambiguity when translating from the interlingua into a third language, language C.

Consider the sentences below:

(a) I would argue that the modified English described above is the minimum requirement for interlingua.

(b) I would argue, that the modified English, which was described above, is the minimum requirement for interlingua.

The first sentence is written following standard English writing style. The latter is the same sentence translated from Swahili to English. This version has commas and it also has relative pronouns. These seemingly small differences make a huge difference in machine translation, especially if we further translate the sentence into a third language such as Finnish.

Swahili requires that relative pronouns are overtly encoded and that clause boundaries are marked by commas. Not surprisingly, Finnish does the same. Why then should we lose all this critical information in the interlingua by trying to imitate “fluent” English, and then face huge ambiguity problems that have to be solved in any case?

When we translate sentences (a) and (b) above into Finnish with Salama, we get different translations.

(a) Minä väittäisin, että modifioitu englanti kuvaili ylempänä on vähimmäisvaatimus interlingualle.

(b) Minä väittäisin, että modifioitu englanti, joka kuvailtiin ylempänä, on vähimmäisvaatimus interlingualle.

In sentence (a) the information pertaining to the relative clause was lost, but in sentence (b) it was retained. Next, we test what happens when we translate the same sentence from Swahili to English.

Ningetoa hoja, kwamba Kiingereza kilichotoholewa, kilichoelezwa hapo juu, ni haja ya msingi kwa interlingua.

I would argue, that modified English, which was explained here above, is the basic requirement for interlingua.

The translation of the English sentence above to Finnish:

Minä väittäisin, että modifioitu englanti, joka selitettiin tässä ylempänä, on perusvaatimus interlingualle.

We see that the translation from Swahili to Finnish via interlingua is better than the translation from standard English to Finnish. This is due to the fact that the structure of the sentence components is retained throughout the translation process.

Linguistic tags inherited from source language

In addition to using the modified English as described above, we also have other means of improving the translation process. When we use a rule-based approach in translation, we have a large array of linguistic tags, which can be carried through the translation process.

This is not, however, a simple thing, because in order to translate from English into the third language, we first need to analyze the interlingua. Analyzers normally analyze only plain text, and tags attached to words disrupt the whole process.

It is, however, possible to construct an analyzer that accepts surface words with tags attached to them. When the analysis is done, the codes can be separated into individual codes. The result then contains both the codes inherited from the source text and the codes added by the new analyzer of English.

Normally the disambiguation of English is a very heavy process requiring a number of complex rules. Here the situation is very different, because we can compare the inherited tags with the tags inserted by the English analyzer. The disambiguation rules needed in this approach are normally straightforward, requiring only a check on the co-occurrence of old and new tags. Even the syntactic mapping of English can be bypassed, because the inherited syntactic tags also suit English text with minor modifications. The main modification concerns the direction codes < > in the tags, which show the direction of the head; these must be adjusted because the word order of Swahili and English is fundamentally different.

For example, consider a sentence extracted from the test data of the 2017 Conference on Machine Translation (WMT17):

Trump says he was being sarcastic when claiming Obama was the founder of ISIS.

The sentence has ambiguities that are difficult to resolve. The Swahili version of the sentence is:

Trump anasema, kwamba alikuwa mwenye kejeli alipodai, kwamba Obama alikuwa mwanzilishi wa ISIS.

Now we translate the sentence from Swahili to English, using a version of Salama that attaches inherited linguistic tags to the translated words.

Trump+n+hum+@subj says+v+pr:na+@fmainvtr-obj, that+conj+@cs he+pron+@subj was+v+past+@fmainvintr sarcastic+adj+@nadj when+conj+@advl he+pron+@subj claimed+v+past+@fmainvtr-obj, that+conj+@cs Obama+n+hum+@subj was+v+past+@fmainvintr the founder+n+hum+@p of+gen-con+@gcon Isis+@nh.

The reading contains tags for part of speech, TAM markers, syntax and semantics. This text form is then analyzed with an English analyzer that accepts codes attached to surface words. After analysis, the codes are detached from the words, so that they can be handled using normal routines.
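As a minimal sketch of the detaching step, the Python snippet below splits the `word+tag+tag` notation of the tagged sentence into surface words and tag lists. It is not Salama's implementation; the names `TaggedToken` and `split_tagged` are hypothetical, and the sketch assumes that `+` never occurs inside a surface word and that only commas and periods are glued to the end of a token.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    surface: str      # the English word as it appears in the text
    tags: List[str]   # tags inherited from the source-language analysis
    trailing: str     # punctuation glued to the end of the token, if any


def split_tagged(text: str) -> List[TaggedToken]:
    """Detach '+'-joined tags from each whitespace-separated token."""
    tokens = []
    for chunk in text.split():
        core = chunk.rstrip(",.")          # peel off sentence punctuation
        trailing = chunk[len(core):]
        surface, *tags = core.split("+")   # first field is the word itself
        tokens.append(TaggedToken(surface, tags, trailing))
    return tokens


sample = "Trump+n+hum+@subj says+v+pr:na+@fmainvtr-obj, that+conj+@cs he+pron+@subj"
for token in split_tagged(sample):
    print(token.surface, token.tags, repr(token.trailing))
```

A word that carries no tags, such as "the" in the example sentence, simply comes out with an empty tag list. The full analyzer output for the example sentence is shown next.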

"<*trump>"
    "trump" n hum @subj PROPN
"<says>"
    "say" v pr:na @fmainvtr-obj N PL
    "say" v pr:na @fmainvtr-obj V PRES SG3
"<,>"
    ", COMMA
"<that>"
    "that" conj @cs N Heur
"<he>"
    "he" pron @subj PRON MALE SG3
"<was>"
    "be" v past @fmainvintr V PAST SG3
    "be" v past @fmainvintr V PAST SG1
"<sarcastic>"
    "sarcastic" adj @nadj A
"<when>"
    "when" conj @advl QUEST WH
"<he>"
    "he" pron @subj PRON MALE SG3
"<claimed>"
    "claim" v past @fmainvtr-obj V PAST
    "claim" v past @fmainvtr-obj V EN
"<,>"
    ", COMMA
"<that>"
    "that" conj @cs N Heur
"<*obama>"
    "obama" n hum @subj PROPN
"<was>"
    "be" v past @fmainvintr V PAST SG3
    "be" v past @fmainvintr V PAST SG1
"<the>"
    "the" DET
"<founder>"
    "founder" n hum @p N SG
"<of>"
    "of" gen-con @gcon PREP
"<*isis @nh>"
    "isis" @nh PROPN
"<.$>"
    ". **CLB

In the previous reading, we have two sets of codes: those inherited from Swahili (lower case) and those added by the English analyzer (upper case). The disambiguation of English text becomes remarkably easier when we have the encoding of both languages at hand. The reading in which the analyzer's codes agree with the inherited codes is probably the correct choice.
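The agreement check can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the actual disambiguation rules: the `COMPATIBLE` table mapping inherited tags to analyzer tags is a hypothetical fragment made up from the tags visible in the example, and the function names are invented.

```python
from typing import Dict, List, Set

# Hypothetical correspondence between inherited (lower-case) tags and the
# English analyzer's (upper-case) tags. A real table would be much larger.
COMPATIBLE: Dict[str, Set[str]] = {
    "v": {"V"},
    "n": {"N", "PROPN"},
    "past": {"PAST"},
    "pr:na": {"PRES"},
    "pron": {"PRON"},
    "adj": {"A"},
}


def agreement(inherited: List[str], reading: List[str]) -> int:
    """Count the inherited tags that have a compatible tag in the reading."""
    reading_set = set(reading)
    return sum(1 for tag in inherited if COMPATIBLE.get(tag, set()) & reading_set)


def choose_reading(inherited: List[str], readings: List[List[str]]) -> List[str]:
    """Return the reading that agrees best; on a tie the first one is kept."""
    return max(readings, key=lambda reading: agreement(inherited, reading))


# "says": inherited verb tags rule out the plural-noun reading.
print(choose_reading(["v", "pr:na"], [["N", "PL"], ["V", "PRES", "SG3"]]))
# "claimed": the inherited past tense rules out the participle (EN) reading.
print(choose_reading(["v", "past"], [["V", "PAST"], ["V", "EN"]]))
```

In this sketch the two readings of "was" (SG3 and SG1) tie, because the inherited tags carry no person information; a choice like that would still need a contextual rule, for instance one that looks at the subject.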

Syntactic coding is inherited from the source text, and no new encoding is needed in this phase.

The text after light disambiguation is displayed below. Redundant codes are removed and the remaining codes are converted into upper-case.

"<*trump>"
    "trump" @SUBJ HUM PROPN
"<says>"
    "say" @FMAINVtr-obj V PRES SG3
"<,>"
    ", COMMA
"<that>"
    "that" @CS CONJ CS
"<he>"
    "he" @SUBJ PRON MALE SG3
"<was>"
    "be" @FMAINVintr V PAST SG3
"<sarcastic>"
    "sarcastic" @PCOMPL A
"<when>"
    "when" @ADVL QUEST WH
"<he>"
    "he" @SUBJ PRON MALE SG3
"<claimed>"
    "claim" @FMAINVtr-obj V PAST
"<,>"
    ", COMMA
"<that>"
    "that" @CS PRON DEM SG
"<*obama>"
    "obama" @SUBJ HUM PROPN
"<was>"
    "be" @FMAINVintr V PAST SG3
"<the>"
    "the" @DN DET
"<founder>"
    "founder" @P N SG
"<of>"
    "of" @GCON PREP
"<*isis>"
    "isis" @P PROPN
"<.$>"
    ". **CLB

The translation can now be done using normal translation routines:

Trump sanoo, että hän oli sarkastinen, kun hän väitti, että Obama oli Isisin perustaja.

Semantic encoding

Semantic tags inherited from the source text are especially valuable in translation, because semantic information usually cannot be inferred from the form of the words. Such semantic features as humanness, animacy, sex, time, place and type of proper name (male, female, title, organization) can be encoded in the analysis of the source language and then be transferred to the interlingua and the third language. News texts in a local language contain large numbers of proper names that are not known in English or in a third language. Inherited semantic tags help in translation, especially if such semantic distinctions have a bearing on translation into the target language.

Language-type-specific encoding

If language A and language C have similar language structures, it sounds far-fetched to translate between them via an interlingua. For example, two Bantu languages normally have almost identical noun class structures. It may be tempting to translate between them directly instead of using an interlingua in between.

However, it is also possible to retain the noun class information of language A and use it when translating via the interlingua to language C. A large proportion of the nouns in languages A and C belong to the same noun class, and they need no new encoding. If the class of a noun in language A is not the same in language C, such cases must be treated separately.
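One way to picture this is an inherited default with a separate exception list, sketched below in Python. The function name `target_noun_class`, the table `CLASS_EXCEPTIONS`, the lemmas and the class labels are all invented placeholders, not real data for any language pair.

```python
from typing import Dict

# Hypothetical exception table: nouns whose class in language C differs from
# their class in language A. The entry is an invented placeholder.
CLASS_EXCEPTIONS: Dict[str, str] = {
    "noun_x": "5/6",
}


def target_noun_class(lemma_a: str, class_a: str) -> str:
    """Reuse the class inherited from language A unless the noun is a listed exception."""
    return CLASS_EXCEPTIONS.get(lemma_a, class_a)


print(target_noun_class("noun_y", "7/8"))  # inherited class is kept: 7/8
print(target_noun_class("noun_x", "7/8"))  # exception overrides it: 5/6
```

Only the exceptions need to be listed by hand; every other noun keeps the class tag carried through the interlingua.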

Also, linguistic gender is a dominant feature in many languages. However, there is seldom a direct word-level mapping of gender from one language to another.

Suitability of the system and capacity constraints

It is evident that the proposed translation system favors rule-based approaches, because they encode linguistic information comprehensively and accurately. Statistical and neural translation systems do not. If they do try to make linguistic information explicit, its coverage and reliability are so low that it is of little use in translation. However, if no modification of the interlingua is expected, those systems also work.

A comprehensive MT system often requires a large number of sequenced processes and handling of tens of thousands of rules. When we add to this the second translation round from interlingua to language C, much more computing is needed. However, the capacity requirements depend very much on the comprehensiveness of the translation system. If the vocabulary is limited to the vocabulary used in prose texts and conversation, capacity problems do not occur. Also, domain-specific approaches can be used for reducing capacity requirements.

Conclusion

I have suggested that a global MT system can be constructed using an open approach where the emphasis is on high-quality translation between a non-English language and an interlingua, which is a modified version of English. I have demonstrated the feasibility of this approach by testing it between two languages, Swahili and Finnish, both morphologically complex. In addition to using the modified version of English as the interlingua, it is also possible to use a selection of linguistic tags from the source language to aid the further translation process. The system allows any translation method to be included. The main requirement is a high-quality translation result between the local language and the interlingua.
