Machine translation for less-resourced languages

We have been witnesses to the evolution from human to machine-assisted translation. This is a propitious time for rapid advances in language technologies. We now have enough computing power to support the complex algorithms that drive statistical machine translation (SMT) and powerful open-source tools like Moses. In recent years, SMT has become a major developmental breakthrough by providing a cost-efficient and fast way to build and use machine translation (MT) systems. 

However, even recent advances fall short of fulfilling expectations regarding MT systems. SMT works by statistically comparing parallel corpora in two languages and calculating the probabilities that are used to generate the most likely translation. The quality of an SMT system therefore depends largely on the size of the training data, and most of the available parallel data has been produced for the major languages. As a result, SMT systems for the most widely used languages are of much better quality than systems for less-resourced languages.
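As a rough illustration of this statistical idea, the sketch below estimates translation probabilities by relative frequency from a tiny, invented set of aligned phrase pairs. The data and the function names are hypothetical. Real toolkits derive phrase pairs from word alignments over millions of sentences and combine several models, so this should be read as a simplified picture rather than a description of any actual system.

```python
from collections import Counter, defaultdict

def estimate_translation_probs(aligned_pairs):
    """Relative-frequency estimate of p(target | source) from aligned phrase pairs."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    probs = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        probs[src][tgt] = n / source_counts[src]
    return probs

def most_likely(probs, src):
    """Return the target phrase with the highest estimated probability."""
    return max(probs[src], key=probs[src].get)

# Invented toy data: (English phrase, Latvian phrase) pairs.
pairs = [
    ("house", "māja"), ("house", "māja"), ("house", "nams"),
    ("book", "grāmata"),
]
probs = estimate_translation_probs(pairs)
best = most_likely(probs, "house")
```

With so few observations, the estimate for "book" rests on a single example. This is the data sparseness problem in miniature: the less parallel data there is, the less reliable the probabilities become.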

This quality gap is further broadened by the linguistic structure of many smaller languages. Languages such as Latvian, Lithuanian and Estonian, to name just a few, have a complex morphological structure and free word order. To learn this additional complexity from corpus data by statistical methods, proportionally much larger volumes of training data are needed than for languages with a simpler linguistic structure.

Another drawback preventing wider implementation of MT is its general nature. Although free web translators provide reasonable quality for many language pairs, they perform poorly for domain- and user-specific texts. Current free systems cannot be adjusted for particular terminology and style requirements. For example, Google Translate currently provides MT for more than 50 languages. However, for smaller languages such as Latvian or Estonian, translation quality is quite poor, particularly for domain-specific texts.

While large languages have the benefit of large markets that successfully amortize investments in proprietary systems, smaller languages also suffer from smaller consumer markets and lower overall translation volumes. Many producers of goods and services supply content mostly in larger languages because the cost of human translation into smaller languages is prohibitively high and the quality of existing MT solutions is insufficient. In the localization and translation industry, huge pools of parallel texts in a variety of industry formats have been accumulated, but this data has not yet been fully utilized in modern MT. At the same time, this industry is experiencing unrelenting pressure on efficiency and performance. Clients expect more to be translated in nearly real time at lower prices.

Presently, integration of MT in localization services is in its early stages and is mostly the realm of large agencies working with the large languages. The cost of developing specialized MT solutions is prohibitive to most players in the localization and translation industry, while the quality and confidentiality afforded by the free generic MT offerings are not sufficient to reap substantial efficiency gains in the professional localization industry setting. 

MT has been a puzzle in the area of natural language processing since its inception in the 1940s. Historically, three main MT strategies have been prominent: direct, interlingual and transfer. The rules-based transfer MT strategy with a rich translation lexicon has returned good translation results and found its application in many commercial MT systems, such as SYSTRAN, PROMT and others. However, this strategy requires immense investments of time and human resources to incorporate new language pairs or to enhance translation quality.

The more competitive SMT approach has been gaining ever-growing traction since the first research results in the late 1980s with the Candide project at IBM for an English-to-French translation system. The SMT strategy, first suggested in 1949 by Warren Weaver and then abandoned for several decades until the late 1980s, has proven to be a fruitful approach to fostering the development of MT. Cost-effectiveness and translation quality are the main reasons the SMT paradigm has become the dominant current framework for MT theory and practice. With the advent of the web, the wide availability of data in digital form and the falling cost of computing power have made this approach the most potent. In a majority of cases, SMT research and development activities have focused on a dozen widely used languages, creating a technological gap for “smaller” under-resourced languages.

The rules-based approach requires a large investment in developing elaborate supporting tools such as morphology analyzers, syntactic parsers, extensive dictionaries, and a complex set of hundreds of interrelated rules for analyzing, transferring and generating output sentences.


Platform for SMT development

Building an MT system is a complicated task that requires expert knowledge and the necessary infrastructure. We thought that it would be possible to create an online MT factory that would simplify MT development for smaller languages and specific domains. This concept started our work on a cloud-based platform. 

What are the characteristics of a good platform for SMT development? It must be easy to use, with no complicated code. It must process and store your data, taking care of a variety of formats and alignment of parallel sentences. SMT training should be a few mouse clicks away, and then you should be able to access and use the results in a familiar tool whenever and wherever you need to translate. It would be great to have someone do the heavy lifting and collect additional corpora to improve quality.

To continue our commitment to reducing the technology gap for small languages, we became involved in the European Union-supported LetsMT! project. In its quest for both language diversity and facilitating cross-border communication, the European Union (EU) is supporting projects that can use easily accessible, affordable technologies to bridge the language divide in a multilingual world. The LetsMT! consortium includes the project coordinator Tilde, the Universities of Edinburgh, Zagreb, Copenhagen and Uppsala, the localization company Moravia and the semantic technology company SemLab.

The aim of LetsMT! is to take advantage of the huge potential of existing open-source SMT technologies to develop an online collaborative platform for sharing data and building MT systems (Figure 1).

LetsMT! provides a simple interface to step you through the process of creating your own MT engine. The volume of open-source parallel resources is limited, which is a critical problem for SMT, since translation systems trained on data from a particular domain, such as parliamentary proceedings, will perform poorly when used to translate texts from a different domain, such as news articles. To attack the most difficult problem for small languages, the lack of training resources, the core concept of the platform is to share resources. The platform provides a repository of data collected by project partners and by its users. You take your data, supplement it with the data already provided on the platform and generate your MT engine (Figure 2).

Although the project is endeavoring to accumulate large collections of parallel texts in a variety of industry formats, languages and domains, the most successful data collection effort is the online repository of translation memory (TM) data by the TAUS Data Association (TDA). TAUS is a visionary pioneer in the give-and-get approach to data collection. To further advance the benefits provided by the TDA, LetsMT! is working on the integration of its MT generation services with the TDA data repository.

The uniqueness of the LetsMT! platform lies in its commitment to the privacy of data. Although you may use publicly available data and share your own, other users cannot see the content of the corpora or download them from the platform. The data can be used only for the generation of MT engines. If you choose not to share your data, it cannot be seen or used by other LetsMT! users. All categories of users (public organizations, private companies and individuals) can upload their proprietary resources to the repository and create a tailored SMT system trained on these resources. The latter can be shared with other users, who can then make further use of them.

To ensure scalability of the entire system, LetsMT! is hosted in the Amazon Web Services infrastructure, which provides easy access to on-demand computing resources. LetsMT! services for translating texts can be used in several ways: directly through the LetsMT! web portal, through a widget on a user’s web page, through browser plug-ins, or through integration in computer-assisted translation (CAT) tools and different online and offline applications. Localization and translation industry businesses and translation professionals can access LetsMT! services in their production environments, typically involving various CAT tools.

To create our own English-Latvian system, we used the Giza++ and Moses SMT toolkits for data alignment, training of SMT models and translation (decoding). The total size of the English-Latvian parallel data used to train the translation model was 5.37 million sentence pairs.

The parallel corpus includes publicly available DGT-TM (1.06 million sentences) and OPUS EMEA (0.97 million sentences) corpora, as well as a proprietary localization corpus (1.29 million sentences) obtained from TMs that were created during localization of interface and user assistance materials for software and user manuals of IT&T appliances. To increase word coverage, word and phrase translations were included from bilingual dictionaries (0.51 million units). We also used a larger selection of parallel data that was automatically extracted from a comparable web corpus (0.9 million sentences) and from literary works (0.66 million sentences).

The monolingual corpus was prepared from news articles from the web and the monolingual part of the parallel corpora. The total size of the Latvian monolingual corpus was 391 million words.

Since Latvian belongs to the class of highly inflected languages with a complex morphology, numerous inflectional forms of words increase data sparseness. For this reason, the SMT system was extended within the Moses framework by integrating morphologic knowledge. We introduced an additional language model over disambiguated morphologic tags in the English-Latvian system. The tags contain morphologic properties generated by a statistical morphology tagger. The resulting system achieved a BLEU score of 35.0 in automated evaluation.
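To give a feel for what a language model over morphologic tags contributes, here is a minimal sketch of a bigram model with add-one smoothing that scores tag sequences. It is not the system described above: the tag names, the training sequences and the smoothing method are all assumptions made for illustration, and in a real decoder such a score would be only one feature among several.

```python
import math
from collections import Counter

def train_tag_bigram_lm(tag_sequences):
    """Count tag bigrams and their left contexts, with sentence boundary markers."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for tags in tag_sequences:
        padded = ["<s>"] + list(tags) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)

def log_prob(tags, model):
    """Add-one smoothed log-probability of a tag sequence."""
    bigrams, contexts, vocab_size = model
    padded = ["<s>"] + list(tags) + ["</s>"]
    return sum(
        math.log((bigrams[(p, c)] + 1) / (contexts[p] + vocab_size))
        for p, c in zip(padded, padded[1:])
    )

# Hypothetical tags: part of speech with gender and case.
training = [
    ["adj-fem-nom", "noun-fem-nom"],
    ["adj-masc-nom", "noun-masc-nom"],
    ["adj-fem-nom", "noun-fem-nom", "verb"],
]
model = train_tag_bigram_lm(training)
agreeing = log_prob(["adj-fem-nom", "noun-fem-nom"], model)
clashing = log_prob(["adj-fem-nom", "noun-masc-nom"], model)
```

In this toy setup the sequence in which adjective and noun agree in gender scores higher than the one in which they clash. Because there are far fewer distinct tags than distinct word forms, such a model can learn agreement patterns from much less data than a model over surface words.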


MT for localization

The localization industry is experiencing unrelenting pressure to provide more efficient services, particularly because volumes of texts that need to be translated are growing at a greater rate than the availability of human translation, and translation results are expected in real time.

For several decades, the most widely used CAT tools in the localization industry have been TM systems. Since TMs contain fragments of previously translated texts, they can significantly improve the efficiency of localization work in cases when new text is similar to previously translated texts. However, if a text is from a different domain than the TM, or is in the same domain but comes from a customer using different terminology, the benefit is minimal.

Developers of CAT tools have recognized the benefits from integrating MT into their TM systems (Figure 3). For example, SDL Trados Studio 2009 supports three MT engines: SDL Enterprise Translation Server, Language Weaver, and Google Translate. ESTeam Translator and Kilgray’s memoQ are other systems that also support the integration of MT. 

Although the idea to use MT to optimize the localization process is not new, it has not been explored widely in the research community. Different aspects of post-editing and machine translatability have been researched since the 1990s. Increasing the efficiency of the translation process without degradation of quality is the most important goal for a service provider.

In recent years, several productivity tests have been performed in translation and localization industry settings at Microsoft, Adobe and Autodesk. At Microsoft, an SMT system trained by Microsoft Research on Microsoft technical-domain data was used for the localization of Office Online 2007 into three languages: Spanish, French and German. By applying MT to all new words, a productivity growth of 5%-10% on average was obtained.

In experiments performed by Adobe, about 200,000 words of new text were localized using rule-based MT for translation into Russian (PROMT) and SMT for Spanish and French (Language Weaver). The authors reported an increase in translators’ daily output of 22%-51% (Flournoy and Duran, 2009). At Autodesk, a Moses SMT system was evaluated for translation from English into French, Italian, German and Spanish by three translators in each language pair (Plitt and Masselot, 2010). To measure translation time, a special workbench was created to capture keyboard and pause times for each sentence. Plitt and Masselot reported that although all translators worked faster when using MT, the gain varied from 20% to 131%. They concluded that MT allowed translators to improve their throughput by 74% on average.

Tilde also evaluated the LetsMT! platform by using an English-Latvian SMT system integrated with TM in a localization workflow. We chose a simple method and measured the change in performance of translators working with and without MT, using a plug-in for SDL Trados 2009 integrated with the platform. A quality assessment was also performed according to a standard internal quality assessment procedure.

Evaluation scenarios and results

We wanted to evaluate how usable our English-Latvian MT system was for localization tasks and based our evaluation on a measurement of translation performance. Performance was calculated as the number of words translated per hour. 

For the evaluation, two test scenarios were employed: a baseline scenario with TM only, and an MT scenario with a combination of TM and MT. The baseline scenario established the productivity baseline of the current translation process using SDL Trados Studio 2009 where texts are translated unit by unit. The MT scenario measured the impact of MT in the translation process when translators are provided with not only matches from a TM (as in the baseline scenario), but also with MT suggestions for every translation unit that does not have a 100% match in TM. Suggestions coming from the MT were clearly marked for translators to treat them carefully. 
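The logic of the MT scenario can be sketched in a few lines. This is only an illustration under stated assumptions: the similarity measure is the character-based ratio from Python's difflib, not the matching algorithm of any particular CAT tool, and `best_tm_match`, `suggestions` and `fake_mt` are hypothetical names, with `fake_mt` standing in for a call to a real MT service.

```python
from difflib import SequenceMatcher

def best_tm_match(segment, tm):
    """Return (score, source, target) for the most similar TM entry, or None."""
    best = None
    for source, target in tm.items():
        score = SequenceMatcher(None, segment, source).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    return best

def suggestions(segment, tm, machine_translate):
    """Offer the best TM match, plus a marked MT suggestion unless the match is exact."""
    result = []
    match = best_tm_match(segment, tm)
    if match is not None:
        result.append(("TM", round(match[0] * 100), match[2]))
    if match is None or match[0] < 1.0:
        result.append(("MT", None, machine_translate(segment)))
    return result

def fake_mt(text):
    """Stand-in for a real MT engine (hypothetical)."""
    return "[MT] " + text

# Illustrative one-entry translation memory.
tm = {"Click the Save button.": "Noklikšķiniet uz pogas Saglabāt."}
exact = suggestions("Click the Save button.", tm, fake_mt)
fuzzy = suggestions("Click the Open button.", tm, fake_mt)
```

For the exact match only the TM suggestion is returned. For the changed sentence the translator sees both the fuzzy TM match and an MT suggestion, each labeled with its origin so that MT output can be treated with extra care.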

Typically, translators trust suggestions coming from a TM, and they make only small changes if a TM suggestion is not a 100% match. Translators usually do not double-check the terminology, spelling and grammar of TM suggestions, relying on the fact that TMs should contain quality data. However, translators must pay particularly careful attention to suggestions coming from MT, as they may be inaccurate or ungrammatical.

In both scenarios, translators were allowed to use whatever external resources were needed (such as dictionaries), just as during their regular work. Five translators with different levels of experience and average productivity expectations were involved in the evaluation.  

The quality of each translation was evaluated by a professional editor in the standard quality assurance process of the service provider. The editor was not made aware whether the text was translated using the baseline scenario or the MT scenario. An error score was calculated for every translation task. There are 15 different error types grouped in four error classes: accuracy, language quality, style and terminology. Different error types influence the error score differently because errors have a different weight depending on the severity of an error type. For example, errors of comprehensibility — an error that obstructs the user from understanding the information — have a weight of 3, while errors of omissions/unnecessary additions have a weight of 2. Depending on the error score, the translation is assigned a translation quality grade: Superior, Good, Mediocre, Poor or Very Poor.
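A weighted error score of this kind can be sketched as follows. Only the two weights named in the text are taken from the procedure described. The grade boundaries are hypothetical values chosen for illustration, since the actual thresholds are not stated here, and the sketch assumes that the score is a plain weighted sum with no normalization for text length.

```python
# Weights for the two error types named in the text; other types would be added the same way.
ERROR_WEIGHTS = {
    "comprehensibility": 3,
    "omission_or_addition": 2,
}

# Hypothetical grade boundaries as (upper bound of error score, grade).
GRADE_BOUNDS = [(10, "Superior"), (30, "Good"), (50, "Mediocre"), (70, "Poor")]

def error_score(error_counts, weights=ERROR_WEIGHTS):
    """Sum the error counts, each multiplied by the weight of its error type."""
    return sum(weights[error_type] * n for error_type, n in error_counts.items())

def grade(score, bounds=GRADE_BOUNDS):
    """Map an error score to a quality grade using the given boundaries."""
    for upper, name in bounds:
        if score <= upper:
            return name
    return "Very Poor"

counts = {"comprehensibility": 2, "omission_or_addition": 3}
score = error_score(counts)  # 2*3 + 3*2 = 12
```

Because severe error types carry higher weights, a few comprehensibility errors can lower the grade as much as a larger number of minor ones.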

The test set for the evaluation was created by selecting documents in the IT domain from tasks that had not been translated by translators in the organization before the SMT engine was built. This ensures that TMs do not contain all segments of texts used for testing. Documents for translation were selected from the incoming work pipeline if they contained 950-1,050 adjusted words each. Each document was split in half: the first half was translated as described in the baseline scenario, and the second half was translated using MT. The project manager ensured that each part of a single document was translated by a different translator so that the results would not be affected by familiarity with a translated document.

Altogether, 54 documents were translated. Every document was entered in the translation project tracking system as a separate translation task. Although a general-purpose SMT system was used, it was trained using specific vendor TMs as a significant source of parallel corpora. Therefore, the SMT system may be considered slightly biased to a specific IT vendor or a vendor-specific narrow IT domain. The test set contained texts from this vendor and another vendor whose TMs were not included in the training of the SMT system. We will identify these texts as narrow IT domain and broad IT domain for easier reference in the following sections. Approximately one-third of the texts translated in each scenario were in the broad IT domain.

The results were analyzed for 46 translation tasks (23 tasks in each scenario) by analyzing average values for translation performance (translated words per hour) and an error score for the translated texts. Usage of MT suggestions in addition to the TMs increased productivity of the translators on average from 550 to 731 words per hour (32.9% improvement). There were significant performance differences in the various translation tasks; the standard deviation of productivity in the baseline and MT scenarios was 213.8 and 315.5 words per hour, respectively. At the same time, the error score increased for all translators. Although the average error score rose from 20.2 to 28.6 points, it still remained within the quality evaluation grade Good. We have not made a detailed analysis of the reasons for the error score increase, but a possible explanation could be a higher rate of error in translated segments originating from MT than in translations made from scratch. There were significant differences in the results of different translators, ranging from a performance increase of 64% to a decrease of 5% for one of the translators. Analysis of these differences requires further study, but most likely they are caused by the working patterns and skills of individual translators.
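The reported percentages follow from simple arithmetic on the averages, which the short sketch below reproduces. The helper name `percent_change` is ours. The relative change in error score is a derived figure rather than one stated above.

```python
def percent_change(baseline, new):
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100

# Average productivity in words per hour: baseline (TM only) versus TM plus MT.
productivity_gain = percent_change(550, 731)

# Average error score in points for the same two scenarios.
error_increase = percent_change(20.2, 28.6)

print(f"Productivity change: {productivity_gain:.1f}%")
print(f"Error score change: {error_increase:.1f}%")
```

Rounded to one decimal place, the productivity change comes to 32.9% and the error score change to about 41.6%. In relative terms the error score therefore grew somewhat more than productivity did, even though the grade stayed the same.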


Synchronicity and challenges

There are several other exciting developments increasing the quality and accessibility of MT for less-resourced languages. Among those that promise to improve quality is the EU-funded research project ACCURAT, which aims to find, analyze and evaluate innovative methods to acquire more corpora from the web. The premise is to use not parallel (word-for-word) translations but texts in different languages with similar content, and to research how to extract parallel data from these comparable corpora. This could help get more data in smaller languages and narrow domains for MT development.

Another valuable initiative is the establishment of the META-NET network. Recognizing that data is the equivalent of other natural resources, the EU has committed to involve language research institutions and commercial entities in a network to find, catalog, standardize and make publicly available those language resources that are currently being held in companies and universities across Europe. These initiatives are significant steps to help diminish the digital divide between small languages and large, and make the world of knowledge universally accessible.

For the industry as a whole, it would be useful to address several problems that are detrimental to the continued progress of MT. Development costs would fall if, in the name of interoperability, the APIs of different systems were developed in a standardized way. The other challenge in data-driven MT development is the population of the web with raw machine-translated output. When such output is automatically collected and included in the corpora used for MT development, it greatly diminishes the quality of the resulting MT system. Therefore, we would like to rally the industry behind an effort to tag machine-translated content.

We envision a world where speakers of small languages will have the same access to information and services as speakers of large languages, no matter where in the world they are and no matter where the necessity arises. Translation will be available on demand in real time, delivered on your smart phone, TV, refrigerator and work desk, where MT will merge seamlessly with voice technologies. We intend to continue to be a part of this exciting frontier of technology.