Nonprofits

Translation Fights Back

The TICO-19 initiative focuses on translating COVID-19 information, often for under-resourced languages

Eric Paquin

Eric Paquin is the chief technology officer at Translators without Borders (TWB), overseeing the organization's language technology development and innovation. He joined TWB in February 2020, bringing over 20 years of experience from the localization and tech industry. Paquin is a French Canadian who has put down roots in Ireland.

A year after COVID-19 emerged, the world is still trying to find ways to quell the pandemic. Each industry looks at the problem from its own perspective, and the translation industry is no exception. As the pandemic spread across the globe, the translation industry looked at how language and language technology could address a global crisis of information. Thus, the Translation Initiative for COVID-19 (TICO-19) was born. TICO-19 launched at the onset of the crisis and saw translators, technologists, and researchers from Translators without Borders (TWB), Amazon, Appen, Carnegie Mellon University (CMU), Facebook, George Mason University, Google, Johns Hopkins University, Microsoft, and Translated join forces, using language technology to make COVID-19 information available in as many languages as possible.

Fighting back with language technology

Wired magazine has declared that COVID-19 is history’s biggest translation challenge.

The TICO-19 effort marks a unique collaboration between public and private entities that came together shortly after the pandemic was declared. The focus of TICO-19 is to enable the translation of COVID-19-related content into a wide range of languages, many of which are underserved by commercial language technology.

If you are lucky enough to have access to information in your own language, it still might be a challenge to find reliable information or understand new terminology that has been emerging across nations. Who had even heard of social distancing this time last year? As the term gained popularity, the question became: how do you convey the concept of physical distancing across not only languages, but cultures?

For example, TWB research found that phrases like “keep away from people” were far better understood in African languages like Swahili than the direct translation of “social distancing” (see https://www.devex.com/news/how-do-you-say-social-distancing-in-swahili-96856).

Some would be tempted to turn to machine translation (MT) to get some of the meaning behind those new terms. But while machines might get a translation correct in some contexts, the meaning in the context of the COVID-19 crisis may be literally lost in translation.

Human translators, on the other hand, do realize that words may be new or may be translated differently in the context of the crisis. But as a translator, how do you access the right terminology? In commercially viable languages, you may have access to certain resources like translation memories (TMs) that have been developed early on. Eventually, machine translations also catch up.

However, if you are looking for information in less commercially viable languages, you might struggle to find the right resources. This is why TICO-19 members are working together to develop efficient and scalable language technology for 37 languages and counting, including languages that are currently under-resourced by technology, like Dari, Dinka, Hausa, Luganda, Pashto, and Zulu.

“Language technology is a powerful tool that can help people communicate more consistently, quickly, and confidently about global issues like COVID-19. Yet many languages don’t have the necessary data needed to build this innovative technology,” explained Grace Tang, TWB’s head of special projects. “We’re excited that industry leaders recognize this gap, and are working with us to develop technology that can help everyone communicate about COVID-19, no matter what language they speak.”

How do you say quarantine in Hausa, for example? ƙeɓancewa sabida magani.

Ensuring COVID-19 terminology is globally equitable and accessible

The translated content focuses on key COVID-19 terminology, ensuring COVID-19 information is more globally accessible and equitable.

TWB brings language technology expertise to TICO-19, particularly for marginalized languages. Its language equality initiative, Gamayun, uses advanced language technology to improve two-way communication in marginalized languages. The ultimate goal is to enable everyone to give and receive information in the language and format they understand.

TWB’s TICO-19 involvement builds on previous Gamayun experience. Through this initiative, TWB has built language datasets and machine translation engines for Rohingya, Tigrinya, Kanuri, Congolese Swahili, and other low-resource languages.

Sourcing the right content and the right translations

Generally, translators working in marginalized languages have far fewer resources with which to translate. This makes their jobs more challenging, which in turn limits the amount of resources that can be created in those languages. To produce helpful tools, we had to ensure the relevance of the content and the quality of the translations.

We created the TICO-19 benchmark, a dataset specialized in the medical domain, which is intended to track the quality of current machine translation systems. The set includes data for very-low-resource languages and was produced with three criteria in mind: diversity, relevance, and quality.

First, we sampled from a variety of public sources containing COVID-19-related content, representing different domains. We took special care to diversify the domains and sources of the data, selecting PubMed articles on COVID-19, an English-Haitian Creole dataset from CMU built during the humanitarian response to the 2010 earthquake, and some COVID-19 articles from Wikipedia, Wikinews, Wikivoyage, and Wikisource.

Second, to make our content relevant for relief organizations, we chose the languages to translate into based on the requests from relief organizations on the ground, and the humanitarian priorities of TWB.

Third, we established a stringent quality assurance process to ensure that the content was translated according to the highest industry standards. Translations were checked and revised so that the quality assurance rating was above 95% in every language before any subsequent edits. Some low-resource languages like Somali, Dari, Khmer, Amharic, Tamil, and Marathi required several rounds of translation to reach acceptable performance.

The collective effort resulted in a collection of translation memories and technical glossaries that language service providers (LSPs), translators, and volunteers can use to expedite their work while ensuring consistency and accuracy.

To help MT practitioners advance medical and humanitarian machine translation, as well as other natural language processing (NLP) applications, we also provided monolingual and bilingual datasets and an open-source, multilingual benchmark set specialized in the medical domain. The benchmark, which includes data for very-low-resource languages, is intended to track the quality of current machine translation systems and thus enable future research in the area.
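For readers curious about what tracking MT quality against a benchmark looks like in practice, here is a minimal sketch in Python. It assumes a simple word-overlap F1 score as a stand-in for an established metric such as BLEU; the function names and sample sentences are hypothetical and are not part of the TICO-19 release.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """Word-overlap F1 between one MT output and one reference translation.

    This is an illustrative stand-in for a real MT metric, not the
    scoring method used by the TICO-19 team.
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    # Count the words the two sentences share, respecting repetitions.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def benchmark_score(hypotheses, references):
    """Average the per-sentence score over a whole test set."""
    if not references or len(hypotheses) != len(references):
        raise ValueError("need one hypothesis per reference sentence")
    scores = [unigram_f1(h, r) for h, r in zip(hypotheses, references)]
    return sum(scores) / len(scores)


# Hypothetical MT outputs scored against hypothetical reference translations.
mt_output = ["wash your hands", "stay home"]
references = ["wash your hands", "remain indoors"]
score = benchmark_score(mt_output, references)
```

Running the same test set through successive versions of an MT system, and comparing the resulting scores, is one way a fixed benchmark can show whether quality for a given language is improving.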

The collaborators forming TICO-19 have made test and development data available to AI and MT researchers in 35 different languages: nine high-resourced, pivot languages, and 26 lesser-resourced languages, in particular languages of Africa, South Asia, and Southeast Asia, whose populations may be the most vulnerable to the spread of the virus.

The same data is translated into all of the languages represented, meaning that testing or development can be done for any pairing of languages in the set. Furthermore, the team has converted the test and development data into translation memories that can be used by localizers from and to any of the languages.
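The any-pairing property described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the data is stored as one list of sentences per language, with matching positions across languages; the layout, the function names, and the sample sentences are hypothetical placeholders, not the actual TICO-19 files.

```python
from itertools import permutations

# Hypothetical multi-parallel data: index i holds the same sentence
# in every language. Placeholder sentences, not TICO-19 content.
parallel = {
    "en": ["Wash your hands.", "Stay at home."],
    "fr": ["Lavez-vous les mains.", "Restez chez vous."],
    "es": ["Lávese las manos.", "Quédese en casa."],
}


def language_pairs(data):
    """Return every ordered (source, target) pairing of the languages."""
    return list(permutations(sorted(data), 2))


def aligned_segments(data, src, tgt):
    """Pair the sentences of two languages by their shared position."""
    if len(data[src]) != len(data[tgt]):
        raise ValueError("languages must contain the same number of sentences")
    return list(zip(data[src], data[tgt]))


pairs = language_pairs(parallel)
fr_es = aligned_segments(parallel, "fr", "es")
```

Because no language has to act as the fixed source, a French-Spanish segment list falls out of the data as easily as an English-French one. Applied to 35 languages, the same pairing logic would give 35 × 34 = 1,190 translation directions.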

Readiness for future crises

Enabling efficient and accurate communication through translation for the majority of the world’s languages, and particularly for the most vulnerable ones, remains a long road. With this effort, we addressed only a fraction of the needs for a fraction of the world’s languages. Nevertheless, we hope that the MT resources we have released will pave the way and have an immediate impact on the languages we covered and the communities that speak them. Our effort has opened up possibilities and communication channels that will allow the MT research community, both academic and commercial, to be better prepared for the next crisis.

But it does not stop here. The initiative also created opportunities for further collaborations. As more organizations and people offer their help, we are hoping to see the data expanded to even more languages with the same level of quality, producing more TMs and creating machine translation language models for low-resource languages.

The research aspect is also being expanded to include voice technologies. Many of the communities we help have a low level of literacy, and the only way to effectively communicate information is through voice. We are hoping that building voice datasets for these languages will help in the creation of language models and bring us closer to TWB’s long-term goal of achieving real-time, two-way communication through language technologies.

Be part of the effort

The initiative developed translated datasets for approximately 3,100 key COVID-19 terms and phrases. The resulting datasets, machine translation engines, and translation memories are publicly accessible through TICO-19’s GitHub and TWB’s online language data portal to make sure this specialized content can inform future machine translation initiatives.

If you are a professional translator and have already produced COVID-19-related content, you can share your translation memory and we will combine and release it with ours. Similarly, if you have compiled terminologies with COVID-19 terms, or if you find errors in our published terminologies, reach out and we will update them accordingly.

If you’re interested, you can also volunteer with Translators without Borders (TWB) as long as you are fluent in at least one language other than your native language. Whether you are interested in translating medical texts or translating for crisis response, there are engaging projects available to suit all preferences. Professional translators are especially encouraged to apply. Find out more at www.translatorswithoutborders.org.