Developing MT Benchmarks for Under-Resourced Languages

Machine translation (MT) systems have little difficulty translating between languages like Spanish and English or English and Chinese. For languages with limited resources, however, it can be difficult to collect enough intelligible data to develop high-quality MT. An international team of researchers recently won the 2021 Wikimedia Foundation Research Award of the Year for a project exploring how MT systems can be developed for under-resourced languages, with a specific focus on languages indigenous to Africa.

“The natural language processing (NLP) community has awakened to the fact that it has a diversity crisis in terms of limited geographies and languages,” reads the report. “Language prevalence in societies is directly bound to the people and places that speak this language. Consequently, resource-scarce languages in an NLP context reflect the resource scarcity in the society from which the speakers originate.”

The researchers first explored how the internet has left indigenous African languages behind, often relegating them to the remote fringes of the World Wide Web. Some major languages spoken across Africa have a robust online presence: Wikipedia alone has hundreds of thousands of articles written in Egyptian Arabic and a little more than 90,000 in Afrikaans. Many languages native to the continent, however, have been neglected, which severely limits the amount of data available for them. Despite having 4.2 million speakers across East Africa, Luo has no Wikipedia entries at all.

Beyond the lack of data, the researchers noted a shortage of specialists in MT and NLP in Africa. In response, they recruited 400 participants from across the continent to help develop datasets for local languages, a project that, according to the study, was still ongoing at the time of publication. A group of participants at a university in Nigeria, for example, has been training models on religious stories and undergraduate theses that have been translated into Yoruba and Igbo.

With this large pool of participants, the team was able to develop and publish MT benchmarks for 39 African languages classified as low-resource. The researchers also note that the study marks the first time human evaluation of MT systems has been conducted for many of these languages, a feat made possible by the “sheer volume and diversity” of languages represented by the 400 participants.

Andrew Warner
Andrew Warner is a writer from Sacramento. He received his B.A. in linguistics and English from UCLA and is currently working toward an M.A. in applied linguistics at Columbia University. His writing has been published in Language Magazine, Sactown Magazine, and The Takeout.


MultiLingual Media LLC