A team of researchers based in Ethiopia has developed and launched a new machine translation (MT) system specialized for low-resource languages.
Lesan, which was presented at the 35th Conference on Neural Information Processing Systems earlier this month, is an MT system that currently allows users to translate between English, Amharic, and Tigrinya. In their presentation, the researchers showed that Lesan outperforms leading MT systems like Google Translate on these language pairs. They hope that their efforts will expand access to online information in these languages.
“We want to make sure that everyone has equal access to information to help them understand the world,” Lesan’s website reads. “Our technology will further open up access to educational resources on the internet so that children, youth, and adults can enjoy lifelong learning. These are the reasons why we started Lesan.”
Systems like Google Translate can be highly accurate when translating between pairs of high-resource languages, that is, languages such as English, Spanish, or German that have been heavily researched and have a large, readily accessible corpus of data. These state-of-the-art systems tend to perform significantly worse, however, on low-resource languages.
Languages like Amharic and Tigrinya, for instance, tend to have fairly small online presences despite being spoken by a combined population of roughly 60 million people. As the researchers note, Wikipedia has only around 15,000 entries written in Amharic and fewer than 250 in Tigrinya. This lack of freely accessible data is why they are classified as low-resource languages. By contrast, there are more than six million Wikipedia entries written in English, totaling 17 billion words.
To work around the lack of online data in these languages, the researchers supplemented online sources with texts drawn from offline sources such as books, newspapers, and magazines. This allowed them to build a more thorough and representative corpus, producing a system that consistently outperformed better-known MT systems in human evaluations conducted by experts in the respective languages.
“Unfortunately, millions of people cannot access [the information on Wikipedia] because it’s not available in their language,” the researchers write. “In future work, we would like to leverage Lesan’s MT system to empower human translators towards our mission of opening up the Web’s content to millions of people in their language.”