Localization for the long tail in Africa

By David Filip & Jama Musse Jama September 20, 2013

According to the latest Ethnologue statistics, Africa currently has 2,146 living languages. However, out of these, only 225 have been institutionalized in some form or shape, and 501 are on the way to joining them. Altogether, just over a third are now on course to become relevant for international multilingual efforts within this decade.

Really, the situation is not at all bad. In fact, with a median of 27,000 speakers per language, Africa is safely above the world median of 7,000 speakers per language. Also, fewer than 10% of languages worldwide (682 out of 7,105) are institutionalized, so Africa is above the global rate with 10.5% of its languages institutionalized.
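The shares quoted above can be rechecked from the raw counts. The following is a minimal sketch using only the figures given in the text; the variable and function names are illustrative, and "developing" is used here as a label for the 501 languages described as on their way to institutionalization.

```python
# Counts as quoted in the article (Ethnologue figures cited by the authors).
africa_living = 2146
africa_institutional = 225
africa_developing = 501   # "on the way to join them"
world_living = 7105
world_institutional = 682

def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

africa_inst_share = share(africa_institutional, africa_living)
africa_on_course_share = share(africa_institutional + africa_developing, africa_living)
world_inst_share = share(world_institutional, world_living)

print(africa_inst_share)       # 10.5
print(africa_on_course_share)  # 33.8
print(world_inst_share)        # 9.6
```

The 33.8% figure is the "just over a third" of African languages that are institutionalized or on course to be.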

Let us take a macro look at Africa and try to extract a practical number of languages most relevant for commercial localization efforts. PanAfriL10n or PAL, the African localization wiki, provides a valid critique and discussion of the Ethnologue’s notion of a living language, especially in the African context. It is clear that many of the Ethnologue’s listed living languages, and not only in Africa, could be classified as dialects and, for reasons of practicality, in many cases should be. Updates to PAL’s list of major African languages apparently stopped in 2007, although the public still has the option to propose another major language for the list along with an explanation. The PAL list in its latest version proposes 93 major languages, having diverged over time from the underlying list and profiles of 82 languages originally compiled by the African Studies Center of Michigan State University (1979-1997).

Clearly, if we are looking at pro bono and prosumer efforts, which is often what is needed for the long tail, we must be concerned with the total of institutional, developing and vigorous languages. Scholars interested in cultural diversity will, on the other hand, be looking mainly at dying and threatened languages, as there might be only a short window for documenting most of them. This is basically by definition, as the dying languages only have speakers who are no longer able to have children, so these languages will die out with those individuals (Figure 1).

For commercial localization efforts, the 23 highest priority languages (or sometimes dialect groupings) were listed originally by Michigan State University. Due to developments in the course of the twenty-first century (mainly constitutional institutionalization of languages in South Africa), two of the entries split into multiple institutional languages that are weaker than the originally suggested macro language, but which still deserve a place at least in the wider list of major African languages. As a result, the adapted list now has 29 entries (Figure 2). The last two entries each have well under one million primary speakers, but still deserve their place on the high priority list due to their large numbers of second language speakers. This makes each of them effectively an important vehicular language that plays a supercentral role for a number of other languages, on top of its strong institutionalization in its home countries. Arabic obviously plays a special cultural and religious role, despite most of the 200 million worldwide speakers living outside the African continent. Zulu is the clear indigenous supercentral leader with second language speakers at 150% of mother tongue speakers. The future of the Fulfulde and Mandingo clusters is unclear; these could increase their importance if consolidation efforts prevailed over the general push to institutionalize the constituent dialects as languages.

Case study: Somali

Somali, number seven on the list of top African languages, belongs to the Afro-Asiatic language family and it is one of the major East Cushitic languages of the Omo-Tana group. It comes in three major varieties: Af-Soomaali, also known as Standard Somali or Common Somali (based mainly on the Northern Somali dialect); Benaadir (mutually intelligible with standard and Northern Somali) and Af-Maay or Maay (mutually unintelligible with the previous two, or very difficult to understand). Somali dialects are spoken by a population estimated at 16.5 million, mainly living in the Horn of Africa (Somalia, Somaliland, Djibouti, Kenya, Ethiopia) and also widespread in Europe, the Middle East and North American countries, where the Somali diaspora recently resettled. Somali is one of the best documented of the Cushitic languages, with academic studies dating from before 1900.

In the past, Somali had been written with a number of different scripts, including an Arabic-based script known as Wadaad’s writing; a Latin-based alphabet; and the three main indigenous alphabets devised by local clan leaders in the course of the twentieth century: Borama, Osmanya and Kaddare.

Invented in 1933 by Sheikh Abdirahmaan Sheikh Nur of the Borama district in Somaliland, the Borama script, also called Gadabuursi, was known and used by a small circle of the Sheikh’s associates. The script has no character for the glottal stop and comprises 21 consonants and seven vowels. Figure 3 is an example of printed lyric poetry in Gadabuursi-written Somali. Osmanya, invented in 1922 by Osman Yusuf Kenadid, gained more acceptance than the Borama script and produced a larger body of literature, yet it still lost the competition for mass usage to the Arabic and Latin scripts. Interestingly, Osmanya has been standardized in Unicode since version 4.0 in 2003, although it hasn’t really been in use since 1972.

In 1972, the Latin alphabet was officially and finally adopted and at the same time Somali was made the sole official language of the Republic of Somalia, officially the home of about half (8.3 million) of the worldwide population of Somali speakers. As of 2004, Arabic, English and Italian were established as national working languages along with Somali, which reflects the weight of two important African phenomena: Islam and postcolonial influences. Maay, Swahili and Oromo have not been institutionalized within the Republic of Somalia despite significant numbers of speakers.

Somali is one of the best-established African languages, a fully institutionalized language well over the regional average of less than 400 thousand speakers (and also well over the European average of more than five million speakers), yet it is not available on Google Translate or Bing Translator, nor is it an option on Facebook or in any other widespread commercial application. Somali terminology does not exist on the Microsoft language portal, for example. However, Jama has built a significant Somali corpus by crawling online Somali resources and has made a Somali spellchecker available through his own open source editor and through browsers such as Firefox.

As with Cushitic languages in general, Somali is an accentual tone language and has distinctive pitch contrasts, mainly in nouns. Tone can distinguish lexical items, as well as gender, number and case. For instance, gender can change according to the accentual tone on the vowel. The difference between inan (boy) and inán (girl) is in the tonal accent on the a. Similarly, dameer (female donkey) is pronounced dameér, differentiating it from dameer (male donkey). In the case of number, consider ardey (student) versus ardéy (students), and mádax (head) versus madáx (heads). Finally, the accentual tone may also determine completely distinct lexical items. For instance, daan (cliff, edge, block) versus dàan (lower jaw; cheek).

Somali has rich inflectional and derivational morphology. The main inflectional rules for the noun include complex pluralization patterns and two-gender distinction. Somali nouns can be formed through a process of derivation, adding morphemes, mainly suffixes, to a base that can be a noun or verb. Consider the example of pluralization through suffixation by adding o: dameer (male donkey, singular) becomes dameero (donkeys, plural). Likewise, albaab (door, masculine singular) becomes albaabbo (doors, plural). In both of these cases, the nouns also become grammatically feminine, influencing noun-verb agreement.

Somali’s rich morphology is a good reason for statistical machine translation developers to keep away. Yet they did not ignore Arabic or Russian, as these are supercentral languages of global importance. Nor, however, have Czech and Slovak been ignored, with only ten and five million speakers respectively. So we are back to market motivations: if your region is not rich or stable enough to generate sufficient market forces, the development to computer and localization readiness will need to remain in the hands of prosumers for at least a little while longer. Yet corporations do understand the importance of land grab, so as a native speaker you can volunteer as a Google localizer into Somali, and mobile device systems are being localized into Somali. Somali is a good example of a language that could have a big future in cyberspace, but has not yet made it.

Activities to make Somali localization ready

There have been some recent information and communication technology (ICT) developments for the Somali language. Every modern language needs applications that meet the current state of ICT, such as building electronic dictionaries and computer-aided translation (CAT) tools as well as language learning materials. In order to build such applications, we need to have developed base resources, like stemmers and analyzers. Developing such language resources usually happens with well-described languages with longstanding written tradition boasting large numbers of speakers. One of the main challenges for the languages with less written tradition is the lack of data for statistical approaches.

We present a finite-state morphological analyzer for Somali, and compare two different methods of building such an analyzer. Both methods were developed recently by Jama and can be used in comparable ways for creating Somali text concordances, word frequency data and text comparisons. The first method is a rule-based morphological analyzer and the second one uses inductive logic programming to “learn” the morphology rules from the input corpus data. The initial rationale behind the development of these methods was to build an accurate word list for a Somali spellchecking dictionary, and now both are being improved as advanced tools for text mining applications, such as creation of concordances, CAT applications, and computer-assisted poetry writing in Somali. No such tools previously existed for Somali. To analyze Somali prose and produce frequency data and so on, we used two different books: Waasuge iyo Warsame: socdaalkii 30ka maalmood, a short novel written by Xuseen Sheekh Axmed Kaddare in 1983, and Aanadii Negeeye, a long novel written by Ibraahin Yuusuf Axmed “Hawd” in 2007.

SomMorph is a web-based application that develops noun and verb derivatives of Somali word base forms according to the rules defined in the accessible official sources published since the 1960s. These rules define different ways to recognize words derived from the base words occurring in the vocative case, in different gender and grammatical number, with pronouns, different inflections of verbs in indicative and other moods. Working toward the first spellchecking dictionary, redsea-online.com (the website of the Somali cultural foundation run by Jama) has collected nearly 54,000 unique terms with its web crawler and through user suggestions. They have been independently confirmed as either Somali word base forms or technical terminology (foreign terms adopted and phonetically assimilated, or Somalized). Words that users put through the spellchecker as well as words harvested in regular crawls of Somali language websites were reviewed and evaluated for inclusion in a growing Somali dictionary. These words were used later as input for SomMorph to produce other grammatical forms such as vocative case, gender derivatives and so on. Application of the rules produced a corpus of over 1.4 million Somali words. The result is a very rich and structured database, where each word points to its root.
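The general idea of expanding base forms by rule while keeping a pointer back to the root can be sketched as follows. This is a toy illustration only: the two rules merely reproduce the plural examples quoted earlier in the article (dameer to dameero, albaab to albaabbo), and the function names, the lexicon and its structure are hypothetical. The actual SomMorph rule set is not given in the text.

```python
# Toy rule-based generator: each rule maps a base form to one derived form.
# These rules are hypothetical and only cover the two examples in the article.

def plural_o(base):
    """Hypothetical rule: form a plural by appending -o."""
    return base + "o"

def plural_doubled_o(base):
    """Hypothetical rule: double the final consonant, then append -o."""
    return base + base[-1] + "o"

# Hypothetical lexicon: base form -> rules assumed to apply to it.
LEXICON = {
    "dameer": [plural_o],
    "albaab": [plural_doubled_o],
}

def build_form_index(lexicon):
    """Expand every base form and map each surface form back to its root."""
    index = {}
    for root, rules in lexicon.items():
        index[root] = root  # the base form points to itself
        for rule in rules:
            index[rule(root)] = root
    return index

FORM_TO_ROOT = build_form_index(LEXICON)
print(FORM_TO_ROOT["albaabbo"])  # albaab
print(FORM_TO_ROOT["dameero"])   # dameer
```

With a real rule set and tens of thousands of base forms, the same form-to-root mapping would grow into the kind of structured database the paragraph describes.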

SomIMorph, on the other hand, is a text analyzer that uses inductive logic programming to identify the root of a Somali word through a learning process. This can be applied for instance in SomConcor. SomConcor is a web-based application that produces a collated list of the principal words contained in a text. The entry point is the list of the words most frequently used in the text. The user can also query a concordance for a given word list. For each selected word, the result is a list of citations where the queried words are used in all possible grammatical forms. Other features of SomConcor include a distance or proximity search, which means that users can look for a pair of words within a given distance from each other. For each identified word, the user can go back to its root, and the root can also be expanded to show all of its derivatives (lemmatization). With the wildcard search, the user can look for words containing certain parts using the wild card symbols (*, ?), for instance, a query for cab* will find all words starting with cab.
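The wildcard, proximity and concordance features described above can be approximated in a few lines. This is a minimal sketch, not SomConcor's implementation: the function names are illustrative, and the sample input is just a list of Somali words that appear in this article, not a grammatical sentence.

```python
import fnmatch
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())

def wildcard_search(tokens, pattern):
    """Return the distinct tokens matching a pattern with * and ? wildcards."""
    return sorted({t for t in tokens if fnmatch.fnmatchcase(t, pattern)})

def proximity_search(tokens, first, second, max_distance):
    """Return position pairs where the two words lie within max_distance tokens."""
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return [(i, j) for i in first_positions for j in second_positions
            if 0 < abs(i - j) <= max_distance]

def concordance(tokens, word, width=2):
    """Return keyword-in-context lines for every occurrence of word."""
    lines = []
    for i, t in enumerate(tokens):
        if t == word:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [t.upper()] + right))
    return lines

# Placeholder input: words taken from this article, in arbitrary order.
TOKENS = tokenize("cabbayaa dameer albaab gobannimo xornimo dameero")

print(wildcard_search(TOKENS, "cab*"))                       # ['cabbayaa']
print(proximity_search(TOKENS, "gobannimo", "xornimo", 1))   # [(3, 4)]
print(concordance(TOKENS, "albaab", width=1))                # ['dameer ALBAAB gobannimo']
```

A production concordancer would work on lemmatized text so that a query matches all grammatical forms of a word; here only exact tokens are compared.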

SomISearch is an intelligent search engine that can search for words in all their inflection forms and also for their synonyms. It is a concept search system. For instance, one can search the word gobannimo, which means freedom, and the engine will also search for xornimo, which means independence, a near synonym of gobannimo. The user can search for cabbayaa (the third person present continuous tense of the Somali verb that means to drink) and the engine will query all derivatives of that verb. The result also contains a list of forms of poetry used in this concept. SomISearch also implements the above described wildcard and proximity search.
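Query expansion of this kind can be sketched as a lookup over two tables: one from surface form to root, one from root to near synonyms. The tables below are hypothetical and hold only words mentioned in this article; how SomISearch actually stores or derives this information is not described in the text.

```python
# Hypothetical lookup tables built from words mentioned in the article.
FORM_TO_ROOT = {
    "dameer": "dameer",
    "dameero": "dameer",
    "gobannimo": "gobannimo",
    "xornimo": "xornimo",
}
NEAR_SYNONYMS = {
    "gobannimo": {"xornimo"},
    "xornimo": {"gobannimo"},
}

def expand_query(word):
    """Return the query word plus all known forms of its root and its synonyms."""
    root = FORM_TO_ROOT.get(word, word)
    roots = {root} | NEAR_SYNONYMS.get(root, set())
    forms = {form for form, r in FORM_TO_ROOT.items() if r in roots}
    return sorted(forms | {word})

def concept_search(documents, word):
    """Return indexes of documents containing any expanded query term."""
    terms = set(expand_query(word))
    return [i for i, doc in enumerate(documents) if terms & set(doc.split())]

print(expand_query("gobannimo"))  # ['gobannimo', 'xornimo']
print(expand_query("dameero"))    # ['dameer', 'dameero']
```

Searching for one form therefore retrieves documents that contain any related form or near synonym, which is the behavior the paragraph calls concept search.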

e-Qaamuus is a portable version of the Redsea Somali Dictionary. It has been available on Android since December 2012 and the iPhone version is in the making. The dictionary data produced by Redsea are also feeding the Somali version of the Pan-African Living Dictionary Online (PALDO), a platform-building project supported by the African Academy of Languages at the African Union Commission (ACALAN) and UNESCO, which is meant to become the Africa-specific successor project of Kamusi. Kamusi (Swahili for dictionary) has had global ambitions, yet is currently able to serve only as a comprehensive bilingual Swahili-English dictionary: these two languages both have almost 61,000 terms on Kamusi, while other languages struggle around their first hundred “terms” or are missing. Nevertheless, PALDO seems to be imminently targeting Kinyarwanda and Somali, so it is worth watching for further developments.

As one of the results of the previously mentioned efforts, official spellcheckers based on the Hunspell library are available for StarOffice, OpenOffice, Mozilla and other free and open source software (FOSS) applications. This has also helped to develop and stabilize official Somali IT terminology.

Takeaways for corporate and LSP practitioners

The landscape of African language development projects is fragmented and it is hard to figure out which projects are living and active. ACALAN is a good contact point to find out about available language resources, especially for African central and supercentral languages. Watch the Kamusi and PALDO projects, which aim to deliver multilingual African online dictionaries. Consult PanAfriL10n.org for information (localization profiles) on the top 93 African languages. Use redsea-online.com for Somali localization resources.

Virtaal (virtaal.translatehouse.org, based in South Africa) provides African and FOSS localizers with an open and standards-based CAT tool. Similarly, TRF-backed webpage trommons.org provides access to open source technology and localization infrastructure based on Localisation Research Centre (LRC) and Centre for Next Generation Localisation (CNGL) research and development. Many of the served nonprofits target African causes, while Virtaal has been used for key African FOSS localization projects such as Mozilla Firefox translations. Translators without Borders engages massively in African pro bono localizations.

Latin and Arabic scripts are the two most important scripts that most African languages use in their writing systems. Issues with extended characters and diacritics in African variants of the Latin script have been overcome by the widespread use of the Unicode standard. The Arabic script of course brings all the well-known bidirectional challenges, but nothing that the Unicode Bidirectional Algorithm could not handle.
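Two of the points above can be illustrated with Python's standard unicodedata module. The sketch assumes nothing beyond the standard library; the helper names are illustrative. It shows why accented Latin text should be normalized before comparison, and how each character carries a bidirectional class that the Unicode Bidirectional Algorithm uses. The module only exposes the per-character classes, not the full algorithm.

```python
import unicodedata

# "madáx" (heads) from the Somali examples, encoded in two different ways.
precomposed = "mad\u00e1x"   # á as a single code point (U+00E1)
combining = "mada\u0301x"    # a followed by U+0301 COMBINING ACUTE ACCENT

def same_text(a, b):
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]

print(precomposed == combining)           # False: the raw code points differ
print(same_text(precomposed, combining))  # True: equal after normalization
print(bidi_classes("a\u0628"))            # ['L', 'AL']: Latin a, ARABIC LETTER BEH
```

Spellcheckers, concordancers and search tools that skip the normalization step will treat the two encodings of the same word as different words.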

Two regionally important scripts are the Ethiopic script Ge’ez, and N’Ko. The Ethiopic syllabary (it encodes syllables rather than letters) is used by several languages in Ethiopia and Eritrea, most importantly Amharic. N’Ko, unlike Ge’ez, is a modern script invented in the late 1940s by Guinean writer and linguist Solomana Kante. N’Ko is used to write several West African languages of the Mande group. Because Kante had spent several years trying to encode his mother tongue with the Latin and later the Arabic script, he took the best features of both scripts to create the independent N’Ko alphabet. N’Ko has existed in Unicode since version 5.0 in 2006 and Ge’ez since version 3.0 in 1999. Other scripts are important for particular languages only or historically.
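Whether a given runtime knows these scripts can be checked by looking up a sample character from each Unicode block. This is a small sketch using Python's standard unicodedata module; the sample characters and the helper name are arbitrary choices for illustration, and Osmanya is included because of its role in the Somali case study.

```python
import unicodedata

# One sample character per script, chosen arbitrarily for illustration.
SAMPLE_CHARS = {
    "Osmanya": "\U00010480",   # first code point of the Osmanya block
    "N'Ko": "\u07ca",          # a letter from the N'Ko block
    "Ethiopic": "\u1200",      # first code point of the Ethiopic block
}

def describe(ch):
    """Return the code point, Unicode name and bidirectional class of ch."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "bidi": unicodedata.bidirectional(ch),
    }

INFO = {script: describe(ch) for script, ch in SAMPLE_CHARS.items()}

for script, info in INFO.items():
    print(script, info["code_point"], info["name"], info["bidi"])
```

The bidirectional class in the output shows that N'Ko is written right to left, so it raises the same layout considerations as the Arabic script, while Osmanya and Ethiopic run left to right.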

Although the African people have troubled relationships with postcolonial languages, and ACALAN together with the United Nations Economic Commission for Africa have made the strategic decision to support the development of indigenous supercentral languages rather than the postcolonial languages, the importance of the regional postcolonial languages as pivot and reference languages should not be underestimated. The cultural relationships on the postcolonial axis are still determining whether or not there will be enough qualified translators for a given European-African language pair. Also, as China bids for world dominance of natural resources, Chinese is becoming an important source and pivot language for African localization and business communications.