Quality Swahili machine translation

By Arvi Hurskainen September 26, 2012

Africa is a continent where computational language applications are still largely missing. There are publicly available spelling checkers for some languages, but no more advanced applications such as machine translation (MT).

While thousands of researchers work from year to year developing MT systems for various languages, the almost total neglect of African languages requires explanation.

There are two major approaches to MT: statistical and rule-based (RBMT). Statistical methods have a number of advantages, including minimal human work, increasing amounts of source material for compiling parallel corpora, and continually increasing power of computers. It would seem that statistical methods would win the battle, at least in the freely available MT implementations. Statistical methods already produce fairly good translations between languages with similar structures. There are two publicly available translation systems that include several language pairs: Google Translate and Microsoft Translator. On the basis of their translation results one can conclude that both systems rely heavily on statistical methods.

I will demonstrate how these two systems translate the following English sentence into German: My three clever and skilled children, who studied abroad, have been promoted by their respective bosses. The results are Meine drei kluge und kompetente Kinder, die im Ausland studiert, haben von ihren jeweiligen Vorgesetzten befördert worden (Google) and Meine drei kluge und geschickte Kinder, die im Ausland studiert, haben von ihrer jeweiligen Vorgesetzten gefördert (Microsoft).

The translation by Google Translate is better, but not fully correct. Both translations fail to render the tense of the relative clause correctly. The result is even less satisfactory when the translation is done from English to Finnish: Nämä kolme fiksu ja taitava lapsia, jotka opiskelivat ulkomailla, on edistetty niiden pomoja (Google) and Kolme kekseliäs ja koulutettujen lasteni, joka opiskeli ulkomailla, on on edistettävä niiden vastaavien kenraalit määrää (Microsoft). The correct translation is Minun kolme viisasta ja taitavaa lastani, jotka opiskelivat ulkomailla, ovat saaneet ylennyksen esimiehiltään. Both translations fail to select the correct case, and the selection of the correct gloss also fails.

For translation from English to Swahili, only Google Translate includes this language pair. Its output is: Yangu watoto watatu wajanja na wenye ujuzi, ambaye alisoma nje ya nchi, wamekuwa kukuzwa na wakubwa wao.

The result reveals several major problems in translating between this language pair. Generally, Google Translate performs better from Swahili to English than the other way round, especially in frequent short phrases. However, it fails to recognize most Swahili idioms and other multiword expressions. Consider these examples, with the correct translations in parentheses:

Ndobi alipiga pasi. Launderer was a passport (Launderer ironed).

Kijana alichapa kazi. The boy had a good job (The youth worked hard).

Askari alipiga bunduki. The soldiers had a gun (The soldier fired/shot).

Akapigwa faini. Akapigwa fine (He/she was fined).

Gazeti lilipigwa chapa. Lilipigwa printed newspaper (Newspaper was printed).

Watoto walipiga kelele. Children cried (Children made noise).

Watoto walilia. Children were crying (Children cried).

Alipunga hewa. Alipunga air (He/she rested).

Nilipunga mkono. Nilipunga hand (I waved my hand).

Wajenzi walifyatua matofali. Builders were fired brick (Builders laid bricks).

Gazeti lililopitwa na wakati. Newspaper lililopitwa time (Outdated newspaper).

Habari ilipokelewa na mikono miwili. Information was received with both hands (Information was received with pleasure).

Viongozi waliponda mali. Officials were to crush material (Leaders wasted property).

However, Google Translate deals correctly with some frequently occurring multiword expressions, such as Potelea mbali na vitu vyako (Perish with your things), Punde si punde akasema (Thereafter he said) and Wabunge walipiga makofi (Legislators clapped their hands).
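One way a rule-based system might treat such expressions is to look up the verb and its complement together before glossing the words one by one. The following is a minimal sketch under stated assumptions, not the method of any system named here: the table MWE_GLOSSES and the function gloss_pair are hypothetical, the input is assumed to be already reduced to lemmas by an earlier analysis step, and the glosses are the corrected translations from the examples above.

```python
# Hypothetical multiword-expression table keyed on (verb lemma, noun lemma).
# The glosses follow the corrected translations in the examples above.
MWE_GLOSSES = {
    ("piga", "pasi"): "iron",
    ("chapa", "kazi"): "work hard",
    ("piga", "bunduki"): "fire/shoot",
    ("piga", "faini"): "fine",
    ("piga", "chapa"): "print",
    ("piga", "kelele"): "make noise",
    ("punga", "hewa"): "rest",
    ("punga", "mkono"): "wave one's hand",
}


def gloss_pair(verb_lemma, noun_lemma, word_glosses):
    """Return the idiom gloss if the pair is listed, else a word-by-word gloss.

    word_glosses is an assumed single-word dictionary; words missing from it
    are passed through untranslated.
    """
    key = (verb_lemma, noun_lemma)
    if key in MWE_GLOSSES:
        return MWE_GLOSSES[key]
    verb = word_glosses.get(verb_lemma, verb_lemma)
    noun = word_glosses.get(noun_lemma, noun_lemma)
    return f"{verb} {noun}"


# The idiom is found as a unit, so "pasi" is never glossed on its own.
print(gloss_pair("piga", "pasi", {}))
# An unlisted pair falls back to word-by-word glossing.
print(gloss_pair("soma", "kitabu", {"soma": "read", "kitabu": "book"}))
```

The point of the design is the order of operations: the pair is checked before the individual words, so a frequent verb such as piga does not receive a literal gloss when it is part of a listed expression.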

The survey of current MT systems reveals that except for Swahili, none of the truly African languages is found in the lists of language pairs. This is due to several factors. One is that the market for MT applications for African languages is considered marginal. Another reason may be the large number of languages in many African countries; the selection of some languages for development may cause political schisms. The scarcity of skilled researchers who master these languages may also contribute to the state of affairs.

However, the more fundamental reason may be that statistical translation systems do not give satisfactory results. Constructing such systems requires large parallel corpora and huge dictionaries. Another major obstacle is the big structural difference between the source language (SL) and the target language (TL). The examples above show that the translation of even very simple sentences fails.

Below I shall discuss problems in MT from Swahili (SL) to English (TL) and suggest solutions to them. In the RBMT approach the translation procedure consists of the following steps: analysis of the SL; morphological disambiguation of the SL; syntactic mapping of the SL; isolation of multi-word expressions in the SL; adding the lexical glosses of the TL; adding morphological tags of the TL; semantic disambiguation of the TL; constructing surface forms of the TL; controlling word order in the TL; and finally producing clean text in the TL.

The success of the RBMT depends on how well the lexical gloss plus the grammatical information inherited from the analysis of the SL provides material for constructing the correct surface form in the TL. The superb advantage of the RBMT method is that information anywhere in the sentence can be used for constructing the surface form.

In translating from English to Swahili one encounters the problem of noun classes. Swahili has a total of 15 noun classes, and these play a dominant role in word formation through congruence rules. If the noun head is present in the clause, the corresponding congruence can be constructed. But if the noun is replaced by a pronoun in English, which has no noun class system, how is it possible to construct the correct form of the pronoun in MT? In Swahili, pronouns must be marked with noun class markers.

Another major challenge is the construction of the verb form, which is totally different in English and Swahili. English uses auxiliary verbs and a couple of inflectional forms of main verbs in constructing verb forms expressing tense, aspect or mood (TAM). In Swahili, the subject, the relative referent and in certain cases the object are also marked in the verb, and a number of verb extensions are used for deriving new meanings for the verb. When all this is combined with 15 noun classes, each verb may have millions of surface forms, and only one of them is correct in each case. Verbs contain a subject prefix (Sub), a relative prefix (Rel) and an object prefix (Obj), all of which require one of the 15 noun class markers of the corresponding prefix set, depending on the class of the referent. The basic structure of the Swahili verb is Sub + TAM + Rel + Obj + stem + extensions + final vowel. The slot for extensions may hold up to four extensions simultaneously. The information needed for constructing the appropriate affixes is, if present at all, scattered across the sentence. If the referents for the subject, relative and object are nouns that appear in the sentence, the correct prefix forms can be constructed. If the referents are pronouns, the formation of correct prefix forms is more difficult.
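The slot structure Sub + TAM + Rel + Obj + stem + extensions + final vowel can be illustrated with a minimal sketch. The function build_verb and its parameter names are hypothetical, the morpheme segmentation of the example verbs is an illustrative assumption, and the sketch only concatenates slots: it ignores the phonological adjustments that a real generator needs and does not choose the prefixes, which is the hard part described above.

```python
def build_verb(sub, tam, stem, final_vowel="a", rel="", obj="", extensions=()):
    """Concatenate slots as Sub + TAM + Rel + Obj + stem + extensions + final vowel.

    Empty strings stand for unused slots. Only plain concatenation is done;
    no sound changes at morpheme boundaries are modelled.
    """
    if len(extensions) > 4:
        raise ValueError("the extension slot holds at most four extensions")
    return "".join([sub, tam, rel, obj, stem, *extensions, final_vowel])


# Subject prefix a, TAM me, object prefix ni, applicative extension i.
print(build_verb(sub="a", tam="me", obj="ni", stem="sababish", extensions=("i",)))
# prints "amenisababishia"

# Subject prefix wa, TAM li, relative prefix o, no object prefix.
print(build_verb(sub="wa", tam="li", rel="o", stem="jifunz"))
# prints "waliojifunza"
```

Even this toy version shows why the number of forms grows so quickly: every optional slot multiplies the possibilities, and three of the slots depend on the noun class of a referent that may be elsewhere in the sentence.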

There are two ways to produce class markers. In one method, the surface form of the class prefix is produced directly. In the other method, unique class tags are produced first and later converted into surface forms. The latter method is preferable because class tags are unique and do not contain ambiguity, as real prefixes might. For example, if a noun for human being is marked in Swahili as m wa tu and a noun for tree is marked as m mi ti, the surface form of the singular prefix is m in both, and the prefix alone does not reveal whether m represents class 1 or class 3. But if the markings are 1SG 2PL tu and 3SG 4PL ti, there is no ambiguity, and the tags can be used for inserting correct class prefixes on adjectives, numerals, pronouns and verbs, and finally be converted into the correct surface form. The use of class tags is particularly important in conjunction with nouns because the surface forms contain a lot of ambiguity, and the nouns control the congruence of prefixes and of all members of the noun phrase.
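The difference between the two methods can be shown with a small sketch. The names NOUN_PREFIX, LEXICON, noun_form and tags_for_prefix are hypothetical, and the table covers only the four classes mentioned in the example; it is meant to show why tags are unambiguous where prefixes are not, not how any particular system stores its lexicon.

```python
# Class tag -> surface noun prefix, for the four classes in the example only.
NOUN_PREFIX = {"1SG": "m", "2PL": "wa", "3SG": "m", "4PL": "mi"}

# Lexicon entries in the style "1SG 2PL tu": (singular tag, plural tag) per stem.
LEXICON = {"tu": ("1SG", "2PL"), "ti": ("3SG", "4PL")}


def noun_form(stem, plural=False):
    """Return (class tag, surface form) for a stem listed in LEXICON."""
    singular_tag, plural_tag = LEXICON[stem]
    tag = plural_tag if plural else singular_tag
    return tag, NOUN_PREFIX[tag] + stem


def tags_for_prefix(prefix):
    """Return every class tag that a surface prefix could stand for."""
    return sorted(tag for tag, p in NOUN_PREFIX.items() if p == prefix)


print(noun_form("tu"))               # ('1SG', 'mtu')
print(noun_form("ti", plural=True))  # ('4PL', 'miti')
print(tags_for_prefix("m"))          # ['1SG', '3SG']
```

Going from tag to prefix is a simple lookup, while going from the prefix m back to a class gives two candidates. Keeping the tag until the last step means that agreement on adjectives, numerals, pronouns and verbs never has to guess.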

A simple example shows the problems in producing the surface form of the verb in translating from English to Swahili: The boy who was my friend has caused me a lot of trouble. Google Translate comes up with Kijana ambaye alikuwa rafiki yangu imesababisha yangu mengi ya matatizo, while the correct translation is Mvulana ambaye alikuwa rafiki yangu amenisababishia matatizo mengi. Google Translate fails to identify the correct class of the subject prefix, and it omits the object prefix. It also fails to add the applicative suffix i.

RBMT English to Swahili

I shall go briefly through the phases in RBMT using our initial example. The examples were processed using Swahili Language Manager (SALAMA), an environment for developing computational language applications. SALAMA (www.njas.helsinki.fi/salama) includes a mature MT system for the language pair Swahili and English.

The sentence is analyzed with en-fdg of Connexor and modified to be suitable for further processing (Figure 1).

The glosses of Swahili are inserted. For nouns, the noun class tags for singular and plural are inserted in front of the noun stem. Note that some words have more than one reading. When the readings are near-synonyms, such as in books and students, the most obvious reading is the first one in the cohort (Figure 2).

After disambiguation the sentence is as in Figure 3. The correct form of each noun, singular or plural, is also selected, using the tags inherited from the SL.

In the next phase, we add tags to words such as verbs, pronouns, adjectives and numerals. The noun bosi belongs to class pair 5/6, but because it means a human being, its modifiers take the prefix of class pair 1/2. Rule writing for adding tags is counterintuitive because the word order is that of English and not of Swahili. Note that the tags are at the end of the reading, in the order in which they will appear in the final word form (Figure 4).
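The treatment of bosi can be sketched as a small rule that separates the class of the noun itself from the class used for agreement. This is an illustrative assumption about how such a rule might look, not the rule used in SALAMA: the function concord_tag and the is_human flag are hypothetical, and the tags 5SG and 6PL extend the 1SG 2PL notation used earlier.

```python
def concord_tag(noun_tag, is_human):
    """Return the class tag that modifiers of a noun should agree with.

    noun_tag is the noun's own class tag, such as "5SG" or "6PL".
    is_human is an assumed lexical feature marking nouns for human beings.
    """
    number = noun_tag[-2:]  # "SG" or "PL"
    if is_human:
        # Nouns for humans take class pair 1/2 agreement whatever their own class.
        return "1SG" if number == "SG" else "2PL"
    return noun_tag


# A plural class 6 noun for humans: its modifiers take class 2 agreement.
print(concord_tag("6PL", is_human=True))   # 2PL
# A noun that is not human keeps its own class for agreement.
print(concord_tag("4PL", is_human=False))  # 4PL
```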

The tags will be shifted to the left as part of the word stem (Figure 5). The plus mark indicates the morpheme boundary. The Swahili noun for children governs the concordance of the noun phrase, as well as the subject prefix and relative prefix of the verb jifunzA. It also governs the subject prefix of the Swahili verb for have been promoted that comes after the relative clause. It must also be noted that the adjective erevu takes a prefix but mahiri, a loan word, leaves it out. There are two ways of forming relative structures. One can use the pronoun amba plus the appropriate class marker, in which case the relative is not marked in the verb. Alternatively, some TAM forms allow the use of the relative marker in the verb. This solution is used in the example sentence, and amba is left out.

The morpheme tags will be converted to surface form (Figure 6). The precise form of the affix is affected by the phonetic quality of the word stem. In order to apply rules for changing word order, the text is put on one line. Also, redundant tags are removed (Figure 7). When reordering rules are applied, the text is as in Figure 8.
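The conversion of a morpheme tag into a surface prefix, including its dependence on the first sound of the stem, can be sketched as follows. The input format "2PL+erevu", the table PREFIX and the function realize are assumptions made for illustration, and only one adjustment is modelled (the vowel a of the prefix is dropped before a stem beginning with e); real conversion needs many more rules and the full set of classes.

```python
# Class tag -> prefix for adjectives and numerals. Only the class 2 entry
# needed for the example sentence is listed.
PREFIX = {"2PL": "wa"}


def realize(tagged):
    """Convert 'TAG+stem' into a surface form, with the plus mark as boundary."""
    tag, stem = tagged.split("+", 1)
    prefix = PREFIX[tag]
    # Single illustrative adjustment: wa + erevu gives werevu, not waerevu.
    if prefix.endswith("a") and stem.startswith("e"):
        prefix = prefix[:-1]
    return prefix + stem


print(realize("2PL+erevu"))  # werevu
print(realize("2PL+tatu"))   # watatu
```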

The sentence put into word-per-line format reveals the changed word order (Figure 9). After final cleaning, the translated sentence is Watoto wangu werevu na mahiri watatu, waliojifunza ng’ambo, wameendelezwa na mabosi wao husika, which is correct.
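The effect of the reordering rules on the noun phrase can be imitated with a stable sort over word categories. This is a sketch only: the category labels and the ranking in NP_RANK are assumptions read off the final sentence above (noun, possessive, adjectives, numeral), not the actual reordering rules, and the conjunction na is simply ranked with the adjectives it joins.

```python
# Assumed target position of each category inside the Swahili noun phrase.
NP_RANK = {"N": 0, "POSS": 1, "ADJ": 2, "CC": 2, "NUM": 3}

# Surface forms still in English word order: my three clever and skilled children.
english_order = [
    ("wangu", "POSS"),
    ("watatu", "NUM"),
    ("werevu", "ADJ"),
    ("na", "CC"),
    ("mahiri", "ADJ"),
    ("watoto", "N"),
]


def reorder_np(words):
    """Reorder (word, category) pairs by rank; equal ranks keep their order."""
    ordered = sorted(words, key=lambda pair: NP_RANK[pair[1]])
    return [word for word, _ in ordered]


print(" ".join(reorder_np(english_order)))
# prints "watoto wangu werevu na mahiri watatu"
```

Because Python's sort is stable, the sequence werevu na mahiri stays together in its original order while the head noun moves to the front and the numeral to the end.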

Towards hybrid methods

With the previous example, I have attempted to demonstrate some fundamental problems in developing MT applications. Although the words used in the example occur frequently in texts (in the top 12% of the frequency list of 30,000 lexical words), translation by Google Translate fails. For a language such as Swahili, with millions of possible verb forms for each verb, it is practically impossible to find translation examples for each case in the corpus. The same concerns exist for complex noun phrases. The concordance and word order in them follow strict rules, but it is impossible to figure out the rules without having analyzed the sentence first.

Although statistical methods dominate the MT market, it is hard to see significant advances in developing MT applications between African languages and English or French using these methods. One argument for using statistical methods is that they are quick, cheap and do not require much expertise. They are quick and fun to a certain degree, but improving their performance significantly is very tedious.

It is also claimed that RBMT methods are expensive because they require a lot of manpower and high expertise, both in the linguistics of the languages concerned and in programming skills. That is true, but at the same time we have thousands of human translators doing the dull job of translating all sorts of texts manually. Why not invest in developing high-quality MT applications and make their jobs easier?

It is likely that development will lead to a kind of hybrid solution, where rule-based methods form the basis of translation, and statistical methods are used to resolve cases for which rules cannot be written and to enhance the fluency of translation. For example, SYSTRAN, one of the leading vendors on the market, has ended up with this solution.