Case study: Implementing Moses

Machine translation (MT) has been on the agenda since the first days of computers. The first experimental MT system, presented in 1954, had a tiny vocabulary of 250 words and knew only six rules of grammar. It soon became clear that language is far more complicated than it appears, and that there is a huge gap between an idea and its practical application.

Until recently, a belief was widespread among translation professionals that the quality of MT, after reaching a certain level, had stopped improving. The existing grammar- and rule-based machine translation (RBMT) systems had approached the limit of their capability. The highest quality that MT could achieve could not get over a certain threshold of editability and usefulness. What MT produced was not a workable text but a useless set of words; it was easier to translate the original text anew than to edit the garbage produced by MT.

Recently, however, we have heard that the situation is close to a dramatic turning point. Proponents of this view argue that MT is close to a breakthrough and is getting over the previously impregnable barrier of usefulness for real-life application. Indirectly, this is confirmed by the growing number of large projects with significant investments in MT technology, of which Google Translate is the most famous, as well as the increase of public research and reports on MT.

Claims of a “breakthrough” in MT industry are summed up graphically in Figure 1, showing statistical machine translation (SMT) implementations breaking the ice of RBMT stagnation into “near human” quality.

According to the partisans of MT deployment, a number of factors have now drastically changed the situation with the actual application of MT technology. First, processing power increased significantly, making it possible to quickly process previously unimaginable amounts of data. Second, research organizations and companies in the localization industry have accumulated huge databases of correct translation pairs over the years, translated, tested and edited by humans. A considerable contribution to the development of MT was also made by the budgets of anti-terrorist efforts in the United States. Around the same time, the first SMT systems appeared, capable of analyzing databases of human translation and using them to improve the quality of output text. Soon, Google implemented statistics-based technology, and its Google Translate system greatly helped to increase the popularity of MT. When an approach that allowed quick and accurate editing of MT texts was developed, some translation and localization service providers began integrating MT into their production processes. It became clear to industry experts that MT was no longer an idle fantasy or a matter of the distant future, but a possibility that deserves careful study, at least. Meanwhile, the technology did not stand still, and statistical systems were joined by hybrid ones, in which statistical analysis of data is combined with a set of grammatical rules.

Is it true that the plateau of quality has been left behind? If so, radical changes await the translation industry. The technology about to be introduced threatens to completely restructure the process of rendering services, both for individual translators and for companies providing such services, for whom everything will be affected: the environment of service provision, customer demands, the internals of the technical process and pricing. Taking up this work, we set a practical goal: to understand the potential of modern MT technologies and whether further research in this area is worth the effort. Will automated translation be able to improve performance and speed without losing quality? Where and in what areas should MT be applied, and, if it should be, how should the process change? Ultimately, of course, we were interested in the feasibility and economic viability of integrating MT into the business processes of translation companies.

To stage the experiment that was supposed to answer these questions, we chose several commercially available MT systems: PROMT 8, which we use for a number of pilot and real projects; PROMT 9; and Moses, an SMT-based system. We chose Moses not only because this product is open and distributed free of charge, but also because it served as a prototype and starting point for many currently successful commercial MT systems.


First steps with Moses

We downloaded the system from the developer’s site, installed it and set the minimum configuration required for the system to work. However, before you can start working with Moses, it has to be trained. Simply put, an SMT system learns by analyzing an extensive set of texts and their translations. Comparing the translation with the original, the system collects statistics on the most probable translation options, which allows it to make fewer mistakes. The more original/translation pairs are available to the system, the better the statistics and the higher the quality of translation.

Large text corpora that can be used to train MT systems are stored and provided on a subscription basis by the TAUS Data Association (TDA). From the TDA website, we downloaded corpora of English-to-Russian translations on three topics: computer software, computer hardware and legal documentation. In addition, we produced our own corpus of translations from English into Russian commissioned by Microsoft, based on the database we have accumulated over many years of work. This corpus includes texts on information technology. Finally, we created another small corpus of our own translations on highly specialized topics, which we needed in order to compare how Moses works on corpora of different sizes.

Before the corpora could be used for MT system training, the texts had to be cleared of items that could interfere with the analysis and reduce the quality of future translations. Translated texts often have complex formatting, such as RTF, HTML or XML. In these formats, data on text layout are stored as special markup tags. Tags clog up the text and hinder training the MT system, so we had to remove them.
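As a rough illustration, HTML/XML-style markup can be stripped with a couple of regular expressions (a sketch of our own cleanup step, not a Moses utility; function names are ours, and RTF would require a dedicated parser):

```python
import re

TAG = re.compile(r"<[^>]+>")                 # HTML/XML tags
ENTITY = re.compile(r"&[a-zA-Z]+;|&#\d+;")   # character entities such as &amp;

def strip_markup(line):
    """Remove markup that would otherwise pollute SMT training data."""
    line = TAG.sub(" ", line)
    line = ENTITY.sub(" ", line)
    return " ".join(line.split())            # normalize whitespace
```

A plain-text line passes through unchanged, which lets the same pass run over an entire mixed corpus.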

In addition, encoding errors occur from time to time in the process of translation, turning a segment of translated text into a meaningless set of characters. It can also happen that a segment of source text corresponds to an empty translation string. These segments also had to be deleted. Finally, we removed untranslated strings, nontextual strings (numbers, dates and so on) and equality signs (=) at the beginning of a line (as it turned out, this character prevents text processing in Microsoft Excel). The total amount of information “garbage” that had to be removed for Moses training purposes made up 10-20% of the corpora. Currently, the TDA community is working to clean up its corpora, but at the time of the experiment the texts were available only in “dirty” form.
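The rules described above can be sketched as a simple filter over segment pairs (our own illustrative heuristics; the production pipeline handled more special cases than this):

```python
import re

def clean_pair(source, target):
    """Return a cleaned (source, target) pair, or None if the segment
    should be dropped from the training corpus."""
    src, tgt = source.strip(), target.strip()
    # a leading '=' breaks later processing in Microsoft Excel, so strip it
    src, tgt = src.lstrip("= "), tgt.lstrip("= ")
    if not src or not tgt:                           # empty translation
        return None
    if src == tgt:                                   # untranslated string
        return None
    if not re.search(r"[^\W\d_]", src):              # nontextual: numbers, dates
        return None
    if "\ufffd" in src + tgt:                        # encoding-error garbage
        return None
    return src, tgt
```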

Each corpus of text constituted a separate translation database. The goal of the first experiment we conducted was to assess the quality of translation that Moses produces in a standard configuration, without advanced setup. We trained the system on the corpora of text provided by TDA and gave it a test text to translate. The text consisted of ten expanded sentences containing general (conceptual) descriptions of certain IT environment capabilities.

Then we performed a more thorough configuration of the system to improve productivity and quality of translation, and trained the system on our corpus of IT texts. We submitted the same test text, and analyzed the quality of translation delivered by Moses after additional setup.

Finally, using all the same test text, we assessed the abilities of several freely available MT systems: Bing Translator, available on the Microsoft website, PROMT (using “Computers” as the domain), SYSTRAN and Google Translate. We compared translations made by these systems with the results produced by Moses. For comparison, we used translations of the same source text from English into Russian — this is the direction we were interested in, first and foremost — as well as from English into several other languages. The results produced by MT systems were compared with the published benchmark human translations.

Table 1 sums up the results, reflecting the difference between the quality of MT and human translation as the percentage of each sentence that needed to be edited. The higher the percentage, the worse the quality of MT. As we see, Moses outright beat PROMT and was almost tied with Microsoft’s Bing Translator.
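The edit percentage can be illustrated with a word-level edit distance (a sketch in the spirit of edit-distance metrics such as TER, not necessarily the exact formula behind Table 1):

```python
def edit_percentage(mt, reference):
    """Word-level Levenshtein distance between MT output and a human
    reference, as a percentage of the reference length."""
    a, b = mt.split(), reference.split()
    # classic dynamic-programming edit distance over words
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (wa != wb)))  # substitution
        prev = cur
    return 100.0 * prev[-1] / max(len(b), 1)
```

A perfect translation scores 0%; changing one word in a three-word sentence scores about 33%.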

The system had to be rebuilt with a specific language model. A language model is a set of phrases with indicators of how frequently each is used in a given language. Based on these frequency data, the statistical system chooses between possible options in the translation. Our task was to find versions of the language models compatible with the Moses source code. We tried different versions to test whether the resulting build worked at all and, if it did, how well. We combined many different versions and distributions of both Moses and the language models with different distributions of the Linux operating system.
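To make the idea concrete, here is a toy frequency model in the spirit of what language modeling toolkits build (real toolkits such as SRILM and IRSTLM add smoothing, backoff and log-probabilities; all names here are our own):

```python
from collections import Counter

def bigram_model(corpus):
    """Collect raw bigram frequencies from a tokenized corpus."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

def score(counts, sentence):
    """Sum of bigram frequencies: a higher score means the phrase
    looks more 'fluent' according to the model."""
    words = sentence.split()
    return sum(counts[bg] for bg in zip(words, words[1:]))
```

Given a corpus where "save the file" occurs and "save a file" does not, the model steers the decoder toward the former when both are possible translations.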

We had to choose between two language modeling toolkits, SRI Language Modeling (SRILM) and that of the Istituto per la Ricerca Scientifica e Tecnologica (IRST). SRILM proved difficult to configure and compile, and transferring it to other computers required serious changes to the configuration of the entire system. The IRST model was less quirky and easier to use; virtually none of its versions caused errors during compilation. Thus, we decided that IRST was the best option for us. A number of utilities (tools) have been developed for Moses that facilitate the use of the system and extend its capabilities, but they also often cause compatibility problems. Therefore, we sought to find the optimal combination of different versions of components and utilities.

In the process of Moses system setup, we experimented with some of its linguistic parameters, and investigated how the language model affected the result of translation. As a result, we selected a configuration that was not only stable and facilitated using the system, but also greatly enhanced the quality of the text output.

We conducted a number of studies in training Moses. To expedite the training process, we prepared an AMD64 computer with 16 GB of RAM and installed and configured 64-bit Ubuntu Linux on it. With the transition to this configuration, the speed of training and translation increased significantly. While the old configuration (32-bit architecture with 2 GB of RAM) took more than 11 hours to train the system on TDA data, the new one took only 3.5 hours. Another benefit of the new configuration is that it allows the load to be distributed across multiple processors in parallel, further increasing the speed.

Generally, there are two ways of training an MT system. The first is semi-automatic training using a script: the system needs to be told which files to work with, where to find them and what to do with them. The second option is fully automatic training, in which scripts on the server process the corpora of texts automatically, then configure and start a new translation service accessible through a web interface. The advantage of the first option is that it allows you to upload into the system not a monolithic language database but a set of blocks (dictionaries) that you can connect, disconnect and combine. Therefore, even if we fail to fully automate the process of training on large volumes of text, the option of semi-automatic training and processing remains quite acceptable. After all, you can only run a finite number of translation services on the same computer, even a very powerful one, which means that the possibilities provided by a fully automated process are not infinite.
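For reference, the semi-automatic path boils down to a handful of script invocations from the standard Moses and IRSTLM distributions (paths, language codes and file names here are illustrative; exact flags follow the documentation of the versions in use and may differ in yours):

```shell
MOSES=~/mosesdecoder    # illustrative install path

# Tokenize both sides of the parallel corpus
$MOSES/scripts/tokenizer/tokenizer.perl -l en < train.en > train.tok.en
$MOSES/scripts/tokenizer/tokenizer.perl -l ru < train.ru > train.tok.ru

# Lowercase (Moses trains on lowercase text)
$MOSES/scripts/tokenizer/lowercase.perl < train.tok.en > train.low.en
$MOSES/scripts/tokenizer/lowercase.perl < train.tok.ru > train.low.ru

# Drop empty segments and pairs longer than 80 tokens
$MOSES/scripts/training/clean-corpus-n.perl train.low en ru train.clean 1 80

# Build a 3-gram target-side language model with IRSTLM
build-lm.sh -i train.clean.ru -n 3 -o train.lm.gz

# Train the translation model on the cleaned corpus
$MOSES/scripts/training/train-model.perl -root-dir work \
    -corpus train.clean -f en -e ru -lm 0:3:$PWD/train.lm.gz
```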


Moses system: advantages, disadvantages

When testing Moses, we identified a number of significant technical challenges that currently impede the widespread adoption of MT in the business process of translation. First, in the corpora of translated text, some segments contain large paragraphs instead of individual sentences, with a total length of more than 255 characters. Moses can build a language model for such long lines, but it does not know how to build a match table of source/translation pairs for them. To resolve the problem, the long strings can be broken down into sentences. However, this only makes sense if the source and target strings contain the same number of sentences, so that the system can match the pairs.
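The splitting rule can be sketched as follows (an illustration with a naive sentence splitter and names of our own; real tooling must also handle abbreviations and language-specific punctuation):

```python
import re

SENT_END = re.compile(r"(?<=[.!?])\s+")   # split after sentence-ending punctuation

def split_long_pair(source, target, max_len=255):
    """Split an over-long segment pair into sentence pairs, but only when
    both sides contain the same number of sentences; otherwise the
    alignment between the halves cannot be trusted and the pair is dropped."""
    if len(source) <= max_len and len(target) <= max_len:
        return [(source, target)]
    src_sents = SENT_END.split(source)
    tgt_sents = SENT_END.split(target)
    if len(src_sents) == len(tgt_sents):
        return list(zip(src_sents, tgt_sents))
    return []   # sentence counts differ: cannot pair safely
```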

Fragmentation into segments presents other challenges as well. The system perceives punctuation marks as letters. Therefore, if the language model is built on long strings consisting of several sentences, it can contain phrases made up of the end of one sentence and the beginning of the next. Such phrases carry no meaning and are “information noise” that reduces the quality of translation. Breaking long strings down into shorter fragments may also be useful here.

Another difficulty lies in the fact that Moses, in its basic configuration, processes and produces only words consisting of lowercase letters. Restoring capital letters at the beginning of sentences, for example, requires additional processing of the text. Fortunately, compared to the total post-editing of MT, the restoration of the correct letter case takes relatively little time.
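A crude recasing pass can be sketched as follows (only an illustration with names of our own; practical recasers, including the one distributed with Moses, are themselves statistical models trained on cased text):

```python
import re

def restore_case(text, names=()):
    """Recapitalize sentence starts in all-lowercase MT output, and
    restore the casing of known proper names passed in `names`."""
    def cap(m):
        return m.group(1) + m.group(2).upper()
    # capitalize the first letter of the text and after ., ! or ?
    text = re.sub(r"(^|[.!?]\s+)(\w)", cap, text)
    for name in names:
        text = re.sub(r"\b%s\b" % re.escape(name.lower()), name, text)
    return text
```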

When translating texts containing markup tags (RTF, XML, HTML), we need to ensure that the system at least does not damage the tags and, ideally, puts them in the right places in the translation. A solution to this problem has not been found so far. We may need a tool for preliminary preparation of the original text and a related tool for processing the translated text.

Finally, the training corpus of text often contains ambiguous or obsolete terms and even barbarisms. Therefore, for SMT systems we will need tools for automated terminology control. In RBMT systems, this problem is partly solved by creating a thematic glossary, but even in this area the problem of ambiguous terms remains open.

Having tested Moses in action, we identified its main strengths and weaknesses. Chief among the weaknesses is that although Moses, like similar open-source products, is free of charge, making the system work really well requires deep knowledge and serious effort. The absence of clear and detailed information on configuring the system aggravates the situation. The documentation describes only the basic components, without which work is impossible, and only broadly identifies the areas for further customization. It takes several hours of effort just to make the program work, and the transition to a 64-bit platform alone took more than a day, not including the preparation of the operating system.

All of the above applies to Linux, the operating system on which Moses is built. It is an open and very flexible operating system that can be configured in many different ways, which is both its advantage and its drawback: you never know in advance whether a particular application, including Moses components, will be compatible. Naturally, the developers of Moses could not take all these factors into account when preparing the documentation, and a huge amount of information has remained behind the scenes or between the lines.

The high complexity of the setup process is the price of Moses being freeware. In its original configuration the system is not exactly broken, but it shows significantly worse results than after complicated refinement. To the credit of our experts, they grasped the intricacies of the system settings and in a short time managed to achieve very high-quality translations, comparable to the world’s leading systems.

The advantage of Moses is that even in its basic configuration, it provides almost the same quality of translation as the brand-name hybrid statistics-based MT systems, whose development took a great deal of time and money. And Moses provides this high quality after relatively short configuration and training. However, for complex language pairs such as Russian-English, the quality levels of all MT systems are much closer to each other than to human translation.

This last fact vividly illustrates that the mirage of MT remains a mirage, even though it has recently gained exceptional brightness and color. It has never seemed so convincingly close to actual implementation, and this can lead to serious misjudgment of the technology's real applicability and of the investment required to implement it in real projects, where the return on investment in promising research developments must be analyzed.

As for the available bodies of text, they are suitable for training Moses but require preprocessing. However, the TMX format in which the corpora are stored could not be processed directly, either to prepare the content for training Moses or to use it as a translation memory. Existing TMX editors are expensive, and the choice is small. Besides, they have problems supporting various languages and even the TMX format itself; text cleaning and export into the right format are missing or do not work correctly. Therefore, to prepare training data for Moses, we were forced to develop our own TMX editor.
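For illustration, the core of such a tool, extracting aligned segment pairs from TMX content, can be sketched with a standard XML parser (function names are our own; a real editor also handles inline tags, encodings and format variants):

```python
import xml.etree.ElementTree as ET

# the xml:lang attribute lives in the reserved XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_text, src_lang="en", tgt_lang="ru"):
    """Extract (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):          # one translation unit per pair
        segs = {}
        for tuv in tu.findall("tuv"):   # one variant per language
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang.split("-")[0]] = seg.text.strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs
```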


Results and conclusions

The experiment with Moses demonstrated that statistical systems improve the quality of MT compared to the level we had observed until recently. We experienced the benefits of the SMT method and its limitations. In addition, we obtained more information on the areas in which particular MT methods, or combinations thereof, produce the best results; we also found configuration settings and refined methods of training SMT systems such as Moses that substantially improve the output text. We concluded that further research into the areas of real applicability of this technology is required, not just a routine evaluation meant to dispel another myth about its practical applicability. The threshold of real applicability is really close, although the range of applicability is still narrow and the entry barriers are high.

We found that the weakness of the SMT method is terminology: the more diverse the corpus of text on which the system is trained, the more often it is confronted with several competing translations of a term, and it often makes the wrong choice. Most experts, however, are inclined to believe that the future belongs to SMT: in the last five years SMT and especially hybrid systems have developed very quickly. However, RBMT systems do not stand still either, as evidenced by the new PROMT 9, which far surpasses the previous version. The most promising option now seems to be a hybrid system combining the advantages of both technologies, one that additionally uses methods of analysis and synthesis of grammatical structures and terminology processing appropriate to the language and subject domain. Among the widely available hybrid statistical systems today, those of Microsoft and Google are the leaders.

Practice has shown that in the training of SMT systems, the subject domain of the corpus of texts should be made as narrow as possible. Even the volume of a translation database can be sacrificed for the sake of “purity” of the text, even though the system learns more effectively on a large corpus. The optimum size of a single-language corpus is at least one million words; on a corpus of less than 100,000 words, Moses produces very poor quality. On the other hand, training a grammar-based system requires no less effort: to achieve decent translation quality, a full glossary must be created for it. It is very important to specify the grammatical properties (attributes) of the glossary terms, and this laborious manual work may take a long time. However, even a semi-automatic import of an external glossary, for example into PROMT, in which grammatical attributes are often assigned incorrectly, significantly improves translation.

Thus, if we compare an SMT system trained on highly specialized corpora of text and complemented by hybrid technology and optimization tools to an RBMT system trained on a full glossary, the results are roughly the same. This means that we have a choice: if there is a large body of consistent translations but no time to prepare a glossary, it is better to use statistical hybrid technology; if there is no large corpus of text but a glossary is available or you have time to prepare it, use the grammatical system. If you have both, the choice is determined by finer details: the quality of a particular MT technology for a specific language pair, the availability of hardware resources, timeframes, the budget for post-editing and so on.

With regard to Moses, we can say with confidence that it is really suitable for practical application. Moreover, in two months of work we achieved quality only slightly inferior to the results of the world’s largest MT systems, whose development took years. This shows the enormous potential of Moses and the efficiency of software development on open-source principles, and demonstrates the economic feasibility of deploying Moses. We expect that in the future, companies adopting Moses will require the services of highly skilled professionals who can quickly and efficiently train the system. Also, for most languages and most texts, end-to-end post-editing of the MT output will be required. This work demands special skills, and the experts to do it have to be trained in advance. However, the cost will be lower. The main point is clear: MT really does have a future, and in some cases its implementation can bring real benefits to companies.