In 1978, Kurzweil Computer Products launched the first commercial optical character recognition (OCR) product. It received mixed reviews: it handled clean, published Courier text reasonably well, but initial quality on a mix of typefaces was only around 90%, which was too low for general acceptance; it was quicker and cheaper to transcribe from scratch than to post-edit the OCRed text. By 1990, overall quality for commercial OCR products had reached 97%, which was still too low, but by 1999 it had reached 99%, the tipping point at which general use and acceptance of OCR became a given. We are approaching a similar point with machine translation (MT).
The great advances in MT over the past decade resulted from the timely coming together of four main technologies: first, Big Data, in the form of very large parallel multilingual corpora; second, the alignment of bilingual corpora at the sentence or segment level; third, the application of Bayesian probabilities, in the form of Hidden Markov Models (HMMs), to work out the word and phrase alignments for each segment; and fourth, the use of those word and phrase alignments to “decode” new text.
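For readers who like to see the idea in code, the toy sketch below illustrates the underlying noisy-channel idea behind this pipeline: candidate translations are scored by combining a translation model learned from aligned data with a target-language model, and “decoding” picks the highest-scoring candidate. The phrase table, language model and candidate list are all invented for illustration; real decoders search a vastly larger space.

```python
import math

# Toy translation model p(source_phrase | target_phrase); in a real system
# these values are estimated from aligned bilingual corpora. Invented here.
phrase_table = {
    ("maison", "house"): 0.8, ("maison", "home"): 0.2,
    ("la", "the"): 0.9, ("la", "it"): 0.1,
}

# Toy target language model p(target_sentence). Also invented.
language_model = {
    "the house": 0.05, "the home": 0.02,
    "it house": 0.0001, "it home": 0.0001,
}

def score(source_tokens, target_tokens):
    """Noisy-channel score: log p(source | target) + log p(target)."""
    logp = math.log(language_model.get(" ".join(target_tokens), 1e-9))
    for s, t in zip(source_tokens, target_tokens):
        logp += math.log(phrase_table.get((s, t), 1e-9))
    return logp

# "Decoding" reduced to picking the best of a handful of enumerated candidates.
source = ["la", "maison"]
candidates = [["the", "house"], ["the", "home"], ["it", "house"], ["it", "home"]]
print(max(candidates, key=lambda c: score(source, c)))  # ['the', 'house']
```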
This statistical machine translation (SMT) approach originated, like so much good work in this field, from IBM research and was first presented as a concept at the 1996 COLING conference in Copenhagen. These ideas were taken up by the European Union (EU) funded Moses project and have been seminal in breaking down the language barrier on the Internet.
SMT has brought very important advances to MT. Whereas previous attempts at MT could only deal with very restricted language pairs and controlled source-language text, SMT is totally unconstrained in this respect: given enough aligned segment data, it can attempt any language pair.
Almost all of us have used SMT, whether through Google Translate or one of the multitude of spinoff SMT engines. The list of contributors in this field is long, in terms of both academic institutions and researchers, but special note must go to the EU and to Philipp Koehn, Franz Josef Och and Daniel Marcu, who published the seminal paper “Statistical Phrase-Based Translation” in 2003, as well as to Hermann Ney for his original work on the HMM approach to word alignment. Special note must also go to Professor Andy Way and his team from Dublin City University (DCU). DCU has become one of the foremost SMT centers of excellence worldwide.
The EU-funded EuroMatrix and EuroMatrixPlus projects have greatly improved the overall performance and quality of SMT, with recent improvements to the decoder footprint from Marcin Junczys-Dowmunt. Franz Josef Och became the brains behind Google Translate, which has arguably become the best-known and most iconic face of SMT. Daniel Marcu established Language Weaver, while Philipp Koehn helped establish Asia Online as one of the most famous commercial SMT companies.
Great improvements have been made to the basic SMT models in use, but the Bayesian and HMM concepts remain at the core of the alignment process. Improvements have been made to the alignment model, concentrating on phrase alignment as well as hierarchical phrase alignment. Nevertheless, the core concept of SMT continues to be “guessing”, based on the Bayesian probability model, at the most probable alignments between the source and target languages at the word and/or phrase level.
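To make that “guessing” concrete, here is a minimal sketch of the expectation-maximization loop of IBM Model 1, the simplest of the word-alignment models in this family. The two-sentence corpus is invented and the code is a teaching sketch, not the GIZA++ implementation used by Moses.

```python
from collections import defaultdict

def train_ibm_model1(bitext, iterations=20):
    """Estimate word-translation probabilities t(f|e) with EM (IBM Model 1).
    bitext is a list of (source_tokens, target_tokens) sentence pairs."""
    src_vocab = {w for src, _ in bitext for w in src}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform start for t(f|e)
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for f_sent, e_sent in bitext:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in count.items():      # M-step: re-estimate t(f|e)
            t[(f, e)] = c / total[e]
    return t

# Invented two-pair "corpus": the co-occurrence pattern alone is enough for EM
# to concentrate probability on the correct word pairings.
bitext = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
t = train_ibm_model1(bitext)
print(round(t[("maison", "house")], 2))  # high probability for the right pairing
```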
The fundamental premise of SMT is that we do not have access to the kinds of bilingual lexicons we would like, so the only option is to work out statistically the most probable word and phrase alignments.
The flying FALCON
One big problem with SMT is the need for a large amount of data to make it accurate. Smaller companies may not have the linguistic assets or the research capacity needed for their own engines unless they collaborate. The FALCON project attempts to address this by bringing together a collection of European language technology developers and academics. It is an EU-funded Seventh Framework Programme (FP7) project comprising Trinity College Dublin (TCD), DCU, Easyling, Interverbum and XTM International. FALCON stands for Federated Active Linguistic data CuratiON, and is largely the brainchild of David Lewis, Research Fellow at TCD. FALCON initially had the following goals:
To establish a formal standard model for Linked Language and Localisation Data (L3Data) as a federated platform for data sharing based on a Resource Description Framework (RDF) metadata schema.
To integrate Easyling’s Skawa proxy-based website translation solution, Interverbum’s TermWeb web-based advanced terminology management and XTM’s web-based translation management and computer assisted translation products in one seamless platform.
To improve SMT performance by drawing on the L3Data federated model as an integral part of the project, and to integrate the DCU SMT engine with XTM.
The FALCON project started in October 2013 and is scheduled to run for two years, ending in September 2015.
FALCON will provide a mechanism for the controlled sharing and reuse of language resources, combining open corpora from public bodies with richly annotated output from commercial translation projects. Federated access control will enable sharing and reuse of commercial resources while respecting business partnerships, client relationships and competitive and licensing concerns.
You can think of the L3Data aspect of FALCON as a distributed, federated database that points to the available domain-specific training and terminology data, subject to commercial restrictions on private data, and that can be used to build custom SMT engines on the fly. In the world of the internet, only a distributed, federated linked-data store can achieve this. FALCON will use the highly flexible RDF together with a SPARQL (SPARQL Protocol and RDF Query Language) database. Using the semantic web concept, FALCON will provide a fast and efficient mechanism for sharing translation memory and terminology data for specific domains.
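The sketch below, using the Python rdflib library, shows the general RDF-plus-SPARQL pattern such a federated catalogue relies on. The l3data namespace and property names are hypothetical stand-ins; the actual L3Data schema is defined by the project.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

L3 = Namespace("http://example.org/l3data#")  # hypothetical, not the real schema

# Describe a translation memory resource with RDF triples.
g = Graph()
tm = URIRef("http://example.org/resources/tm-automotive-de-en")
g.add((tm, RDF.type, L3.TranslationMemory))
g.add((tm, L3.domain, Literal("automotive")))
g.add((tm, L3.sourceLanguage, Literal("de")))
g.add((tm, L3.targetLanguage, Literal("en")))
g.add((tm, L3.accessPolicy, Literal("partners-only")))

# A federated catalogue of such descriptions can then be queried with SPARQL
# to find resources suitable for training a domain-specific engine.
results = g.query("""
    PREFIX l3: <http://example.org/l3data#>
    SELECT ?res WHERE {
        ?res a l3:TranslationMemory ;
             l3:domain "automotive" ;
             l3:targetLanguage "en" .
    }
""")
for row in results:
    print(row.res)
```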
As Don DePalma of Common Sense Advisory describes eloquently in a Global Watchtower blog post entitled “Building the Localization Web,” this will potentially allow smaller language service providers to have access to a much broader range of linguistic assets than would otherwise be the case. A federated, distributed L3Data store will allow for a flexible and very scalable model, without the limitations and restrictions associated with centralized repositories.
The improvements to SMT foreseen at the start of the FALCON project were to involve continuous dynamic retraining of the SMT engine with real-time feedback of post-edited output. They also involve named entity recognition (NER) to protect personal and product names, for example, from being accidentally processed by the SMT engine. For instance, President Bush should never be translated as President Small Shrub, and NER ensures that it is not.
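A common way to implement this protection, sketched below under the assumption that an NER component supplies the spans, is to mask recognized names with placeholders before decoding and restore them afterwards. The hard-coded name list stands in for real NER output.

```python
# Stand-in for real NER output: in FALCON an NER component would supply these
# spans; the hard-coded list is purely for illustration.
PROTECTED_NAMES = ["President Bush", "XTM International"]

def mask_entities(sentence):
    """Replace protected names with placeholders before sending text to the
    SMT engine, returning the masked sentence and the restore mapping."""
    mapping = {}
    for i, name in enumerate(PROTECTED_NAMES):
        if name in sentence:
            token = f"__NE{i}__"
            sentence = sentence.replace(name, token)
            mapping[token] = name
    return sentence, mapping

def unmask_entities(translated, mapping):
    """Put the original names back into the MT output."""
    for token, name in mapping.items():
        translated = translated.replace(token, name)
    return translated

masked, mapping = mask_entities("President Bush visited the plant.")
# masked == "__NE0__ visited the plant."  (this is what the engine sees)
# after decoding: unmask_entities(mt_output, mapping) restores the name
```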
The project also aims to provide an optimal segment post-editing sequence that will offer maximum benefit for the continuous retraining of the SMT engine. Another of its goals is the integration of terminology into the SMT chain by forcing the SMT engine to use existing terminology where it is identified (so-called “forced decoding”) rather than relying on statistical probabilities for the translation. Lastly, it aims at active translation memory (TM) and terminology resource curation through an L3Data RDF database built as part of the FALCON project.
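Forced decoding of terminology is often implemented by marking up the decoder input. Moses-style decoders, for example, accept XML markup specifying the required translation for a span (enabled with a switch such as -xml-input; exact attribute and option names vary by version). The glossary below is invented; treat this as a sketch of the pattern rather than a definitive recipe.

```python
# Invented glossary of approved terminology.
glossary = {"Drehmoment": "torque", "Zylinderkopf": "cylinder head"}

def inject_terminology(source_sentence, glossary):
    """Wrap known terms in XML markup so the decoder is forced to use the
    approved translation instead of its own statistical guess."""
    for term, translation in glossary.items():
        if term in source_sentence:
            source_sentence = source_sentence.replace(
                term, f'<term translation="{translation}">{term}</term>')
    return source_sentence

print(inject_terminology("Das Drehmoment wird nach Vorgabe geprüft.", glossary))
# -> Das <term translation="torque">Drehmoment</term> wird nach Vorgabe geprüft.
```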
Apart from the L3Data store, which is in its own right an important step forward in establishing a federated way of holding relevant data, and the work on curation and an optimal post-editing sequence, these improvements build on existing advances in SMT. Nevertheless, their integration into a production workflow based around XTM represents an important incremental step forward in automating and consolidating these techniques.
The initial SMT engine for FALCON was going to be OpenMaTrEx (www.openmatrex.org) from DCU. OpenMaTrEx was an adaptation of the Moses SMT engine with an added twist: it introduced the concept of “marker” or function words to assist in phrase alignment. All languages use a closed set of around 230 function words (prepositions, conjunctions, pronouns, ordinals and the like, such as if, but, above, over, under and first) to delineate phrases and subsegments within sentences. This was an interesting avenue of experimentation that in the end did not provide the hoped-for improvement in alignment, but the concept was nevertheless very sound from a linguistic point of view.
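The marker hypothesis is easy to illustrate, as in the simplified sketch below: closed-class function words signal the start of a new chunk, and those chunks become candidate sub-segments for alignment. The marker list here is a tiny invented subset, nothing like OpenMaTrEx’s full inventory.

```python
# Tiny invented subset of closed-class marker words.
MARKERS = {"if", "but", "above", "over", "under", "the", "a", "in", "on", "and"}

def marker_chunks(tokens):
    """Split a token list into sub-segments that begin at marker words."""
    chunks, current = [], []
    for tok in tokens:
        if tok.lower() in MARKERS and current:
            chunks.append(current)
            current = []
        current.append(tok)
    if current:
        chunks.append(current)
    return chunks

print(marker_chunks("the cat sat on the mat and purred".split()))
# -> [['the', 'cat', 'sat'], ['on'], ['the', 'mat'], ['and', 'purred']]
```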
During the initial investigations into improving phrase handling, the FALCON project team stumbled across BabelNet (www.babelnet.org). Its implications were not apparent during the initial design phase; it was only while investigating ways of improving word and phrase alignment that its significance became clear. An initial review of the BabelNet dataset and API proved a revelation.
BabelNet to the rescue
BabelNet is a marvelous project funded by the European Research Council as part of the MultiJEDI (Multilingual Joint word sensE DIsambiguation) project. It is a multilingual lexicalized semantic network and ontology. So far, so good. What is impressive about BabelNet is its sheer size, quality and scope: BabelNet 2.5 contains 9.5 million entries across 50 languages. This is truly Big Lexical Data. Roberto Navigli and his team at the Sapienza Università di Roma plan something even more remarkable: BabelNet 3.0 is to cover more than 13 million entries across 263 languages, increasing the size, breadth and depth of BabelNet’s semantic data still further (Figure 1).
By trawling through Princeton’s remarkable WordNet lexical resource for the English language, and then through Wiktionary, Wikipedia and additional resources on the Internet, BabelNet has produced a veritable multilingual parallel treasure trove. Its richness also allows for word sense disambiguation (WSD) for homographs, one of the trickiest problems for MT in general and SMT in particular.
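As a small monolingual illustration of WSD, the sketch below uses NLTK’s implementation of the classic Lesk algorithm over Princeton WordNet; BabelNet does the same job multilingually and at far greater scale. It assumes the NLTK WordNet data has been downloaded.

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.wsd import lesk

# Lesk chooses the WordNet sense whose gloss overlaps most with the context,
# so the same surface word "bank" can resolve to different senses.
financial = lesk("I deposited the cheque at the bank".split(), "bank", pos="n")
river = lesk("We sat on the grassy bank of the river".split(), "bank", pos="n")

for sense in (financial, river):
    print(sense, "-", sense.definition() if sense else "no sense found")
```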
The BabelNet API makes it easy to produce bilingual dictionaries. It does not take a great deal of imagination to work out what effect the addition of extremely large-scale dictionaries can have on the accuracy of SMT engines. Even just adding the dictionary data to the training data for a Moses-based SMT engine has a significant effect on accuracy and quality scores.
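The simplest form of that integration can be sketched in a few lines: dictionary entries are appended as extra one-line “sentence pairs” to the aligned files a Moses-style training pipeline consumes (line N of the source file parallel to line N of the target file). File names and entries are invented.

```python
# Invented bilingual dictionary entries (German -> English).
dictionary = [("Zylinderkopf", "cylinder head"), ("Drehmoment", "torque")]

# Append each entry as an aligned one-line segment pair to the training corpus.
with open("corpus.de", "a", encoding="utf-8") as src_file, \
     open("corpus.en", "a", encoding="utf-8") as tgt_file:
    for de, en in dictionary:
        src_file.write(de + "\n")
        tgt_file.write(en + "\n")
```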
Big lexical data has the potential to remove the “blindfold” under which SMT has operated to date, significantly improving both accuracy and performance through bilingual dictionaries and word sense disambiguation.
BabelNet will continue to grow in size and scope over the next few years adding further online dictionary data such as IATE (http://iate.europa.eu/) and other multilingual open data resources. BabelNet now forms an integral part of the FALCON project, helping improve word alignment for the DCU SMT engine.
The future
There is still much work to be done. The Moses GIZA++ word aligner is not optimized for dictionary input and has no direct mechanism for WSD. The Berkeley Aligner can take dictionary input, since it is designed for both supervised and unsupervised operation, but it is aimed primarily at word rather than phrase alignment. Much research work remains, but the fundamentals of SMT have now shifted significantly. BabelNet in its current form does not tackle function words, but it is relatively simple to “harvest” their bilingual equivalents from existing internet resources. Function words can then be used to assist with subsegment and phrase alignment in the manner foreseen by OpenMaTrEx.
The SMT team at DCU, Trinity College and the rest of the FALCON team will be working on adapting existing open source software such as Moses and the Berkeley, Apache and Stanford tools to take maximum advantage of BabelNet.
Many other aspects of SMT, such as morphology and differences in word order between languages, remain to be fully resolved in the open source domain, but the basic building blocks for truly effective machine translation are now in place. Just as search engines revolutionized the way we access data on the internet using methods unforeseen in the early 1990s, SMT is well on the way to becoming our primary mode of translation. We already use SMT to get the gist of a web page or email in a language we do not understand.
Human endeavor is always based on incremental improvements. Just as OCR reached its tipping point in the late 1990s, so SMT will become the predominant tool for translation within the next five years. Just as translation memory, terminology tools and integrated translation management systems have helped to automate and reduce translation and, more significantly, project management costs, integrated and automated quality SMT will further automate the translation process itself. Translation will largely become an SMT post-editing process.
The quality and data resource issues have been largely addressed in theoretical terms, and implementation of these ideas is well under way. For most commercial translation projects, the workflow will be based mainly around post-editing. This can only be a good thing for all concerned: demand for translation is growing at around 8% per annum, and further automation of the process is the only way to meet that growing need. The advances in MT and increased SMT usage contribute greatly to the expansion of global trade and thus help to lift billions of people out of poverty.