Machine translation for low-resource languages: A look at Lesan

A team of researchers based in Ethiopia has developed and launched a new machine translation (MT) system specialized for low-resource languages.

Lesan, which was presented at the 35th Conference on Neural Information Processing Systems earlier this month, is an MT system that currently translates between English, Amharic, and Tigrinya. In their presentation, the researchers demonstrated that Lesan outperforms leading MT systems like Google Translate on these language pairs. They hope that their efforts will expand the accessibility of information available online in these languages.

“We want to make sure that everyone has equal access to information to help them understand the world,” Lesan’s website reads. “Our technology will further open up access to educational resources on the internet so that children, youth, and adults can enjoy lifelong learning. These are the reasons why we started Lesan.”

While systems like Google Translate can be highly accurate in translating between pairs of high-resource languages (that is, languages such as English, Spanish, or German that have been heavily researched and have a large, readily accessible corpus of data), state-of-the-art systems tend to perform significantly worse on low-resource languages.

Lesan in action (Source: Teka Hadgu et al., 2021).

Languages like Amharic and Tigrinya, for instance, tend to have fairly small online presences despite being spoken by a combined population of roughly 60 million people. As the researchers note, Wikipedia has only around 15,000 entries written in Amharic and fewer than 250 in Tigrinya. This lack of freely accessible data is why they are classified as low-resource languages. By contrast, there are more than six million Wikipedia entries written in English, spanning a total of 17 billion words.

To work around the lack of online data in these languages, the researchers supplemented online sources with texts from offline sources like books, newspapers, and magazines. This allowed them to build a more thorough and representative corpus of text in the languages, creating a system that consistently outperformed better-known MT systems in human evaluations conducted by experts in the respective languages.

“Unfortunately, millions of people cannot access [the information on Wikipedia] because it’s not available in their language,” the researchers write. “In future work, we would like to leverage Lesan’s MT system to empower human translators towards our mission of opening up the Web’s content to millions of people in their language.”

Andrew Warner
Andrew Warner is a writer from Sacramento. He received his B.A. in linguistics and English from UCLA and is currently working toward an M.A. in applied linguistics at Columbia University. His writing has been published in Language Magazine, Sactown Magazine, and The Takeout.

