Deep neural MT in patent translation

Anyone familiar with localizing patent content will acknowledge that it requires both linguistic and domain-specific skills. Take that up a notch and attempt to machine translate more than 500 billion words of English patent data into Japanese in three months for use in a leading global patent research platform, and you have your work cut out for you. However, that is exactly what one of our customers required. The project presented a number of challenges, and our approach had to be adapted along the way.

The requirement at the outset was comparatively simple: “Machine translate all the data from English to Japanese for use in an intellectual property research platform as fast and as well as possible.” While this mission statement sounds simple, as always, the devil is in the details.

The challenge

•Volume: 500 billion words of machine translation in three to four months is a significant volume. To put this in context, one of the larger global social networks announced in 2016 that it had machine translated roughly 100 billion words per year as of 2015 across all supported languages, which means this patent project alone amounted to about five times that annual volume.

•Quality: “Quality” obviously means different things to different people, but in this context the quality measurements were automated scores such as Bilingual Evaluation Understudy (BLEU) as well as general comprehension: could a human read and adequately understand the text? (A short scoring sketch follows this list.) We also needed to account for the correct handling of complex patent mark-up such as formulas and references, which is especially difficult in language pairs with significant reordering such as English to Japanese. The specific challenge was to ensure that all these additional content elements retained the right position within the document and that any links and references were maintained.

•Complexity: Patent content is complex since it spans multiple fields of technology, ranging from agriculture to nanotechnology. Consequently, the terminology is demanding, and if the translation were done by humans, domain experts would be required. This alone makes the translation task difficult. However, the real complexity lies in the mark-up of formulas, references to figures, citations and so on. From a processing perspective, this means that the source and target position of every such item needs to be tracked so that the mark-up can be removed before translation, where it would otherwise degrade the output, and then reinserted at the correct position during post-translation processing.

•Engine training: Training a neural engine requires vast amounts of data, especially considering the complexity of the language and the breadth of technology fields. In addition, both the training data and its mark-up need to be of very high quality, because unlike statistical machine translation (SMT), neural machine translation (NMT) offers no practical way to prune potentially poor training data out of the engine after the fact. Also, in contrast to SMT, NMT relies solely on bilingual training data, which means the preprocessing has to be designed specifically around NMT.

•Infrastructure: While this seems a rather mundane challenge, the availability (and cost) of sufficient graphics processing unit (GPU) resources is considerable. The choice of GPU, the tuning of the system for maximum efficiency and throughput, and the ultimate system deployment all need to be planned carefully.
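To make the automated scoring mentioned under “Quality” concrete, the snippet below computes a corpus-level BLEU score with the open-source sacrebleu library. This is only a minimal sketch: the sentences are invented placeholders rather than project data, and scoring Japanese output would additionally require a Japanese-aware tokenizer, which sacrebleu can be configured to use.

```python
# Minimal sketch of automated quality scoring with BLEU using sacrebleu.
# The sentences are invented placeholders, not data from the project.
import sacrebleu

# System output, one translated segment per entry.
hypotheses = [
    "The rotor is attached to the housing by four bolts.",
    "Figure 2 shows a cross-section of the valve assembly.",
]

# Human reference translations, one list per reference set.
references = [[
    "The rotor is fastened to the housing with four bolts.",
    "Figure 2 shows a cross-section of the valve assembly.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"Corpus BLEU: {bleu.score:.1f}")
```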

The state of NMT

Before delving into the details of the approach, it is important to understand the state of NMT and its latest cousin, deep neural machine translation, which stacks several layers of neural networks. NMT is an exciting and promising new approach to machine translation. However, we still have some way to go before commercial implementations can rival rule-based machine translation (RBMT) and SMT in all use cases. Some recent claims by industry players suggest that near-human quality is right around the corner for all of us. These claims, however, might be a bit of an over-simplification because they fail to clearly define the assumptions and limitations that affect projects such as the one discussed in this article.

First of all, NMT is not a drastic step beyond what we have traditionally done in SMT. Its main departure is the use of vector representations (“embeddings” or “continuous space representations”) for words and internal states. The structure of the models is simpler than that of phrase-based models. There is no separate language model, translation model and reordering model, just a single sequence model that predicts one word at a time.

However, this sequence prediction is conditioned on the entire source sentence and the entire already-produced target sequence. This gives these models the capability to take long-distance context into account, which seems to have positive effects on word choice, morphology and split phrases (such as “switch … on,” which can be prevalent in German and some other languages). The other key benefit is better generalization from the data: for example, what the model learns about how the word cars behaves in translation is also informed by samples that contain car or autos.
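As a toy illustration of these ideas (a single sequence model, word embeddings, and prediction of one target word at a time conditioned on the whole source sentence and the target words produced so far), here is a deliberately tiny encoder-decoder sketch in PyTorch. It is not the production system described in this article; the vocabulary and layer sizes are arbitrary.

```python
# Toy illustration (not the production system): a single sequence model that
# predicts one target word at a time, conditioned on vector representations
# ("embeddings") of the whole source sentence and of the target words
# produced so far.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 64, 128   # tiny vocabulary and layer sizes

class ToySeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(VOCAB, EMB)   # source word embeddings
        self.tgt_emb = nn.Embedding(VOCAB, EMB)   # target word embeddings
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)          # scores over the target vocabulary

    def next_word_distribution(self, src_ids, tgt_prefix_ids):
        # Encode the entire source sentence into a summary state.
        _, src_state = self.encoder(self.src_emb(src_ids))
        # Run the decoder over the target words produced so far,
        # starting from the source summary state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_prefix_ids), src_state)
        # Probability distribution over the next target word.
        return torch.softmax(self.out(dec_out[:, -1]), dim=-1)

model = ToySeq2Seq()
src = torch.randint(0, VOCAB, (1, 7))      # a "source sentence" of 7 token ids
prefix = torch.randint(0, VOCAB, (1, 3))   # 3 target words already produced
probs = model.next_word_distribution(src, prefix)
print(probs.shape)   # torch.Size([1, 1000]): one probability per vocabulary entry
```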

Ultimately, the training of the models is similar to that of phrase-based models: it takes a parallel corpus and learns all required model parameters from it, and it requires vast amounts of data (even for “simple” languages and domains, over a million segments are needed). In addition, it is far more dependent on the quality of the training data than, for example, SMT. One key problem with neural models is the limited vocabulary imposed by computational constraints. These models are trained with, say, 50,000-word vocabularies, and any word outside that vocabulary is broken up into word pieces. This is a real problem for deployments with large numbers of brand names or a large specialized vocabulary.
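The word-piece issue can be illustrated with a small greedy segmenter. This is a simplified stand-in for real subword methods such as byte-pair encoding, and the tiny vocabulary is invented purely for the example.

```python
# Simplified illustration of why a limited vocabulary forces word pieces.
# Real systems use learned subword models (e.g. byte-pair encoding); this
# greedy longest-match segmenter and its tiny vocabulary are only examples.

VOCAB = {"nano", "tech", "nology", "micro", "struct", "ure"}

def segment(word, vocab):
    """Greedily split a word into the longest known pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):    # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                                # fall back to single characters
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("nanotechnology", VOCAB))   # ['nano', 'tech', 'nology']
print(segment("microstructure", VOCAB))   # ['micro', 'struct', 'ure']
```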

Another serious concern is that neural systems are very hard to debug. While in phrase-based SMT systems there is still some hope of tracing back why the system came up with a specific translation (and thus remediating the problem), there is little hope of that in neural systems. In addition, the errors produced by neural systems are sometimes quite capricious. The system may simply produce output words that look good in context but have little to do with the source sentence.

The approach

Given the current state of NMT, it was a very promising technology for this project, but its details and limitations needed careful attention to produce the desired outcome. Initial research prior to the project had shown that the technology was viable and that the desired quality could be achieved, but significant engineering effort was required to move from a successful proof of concept to a production-grade deployment. The key approaches to the challenges are discussed below.

•Infrastructure and architecture: As discussed under the challenges above, an infrastructure upgrade was required to enable the processing capabilities and workflow for this project. The most obvious upgrade was the addition of GPUs for the translation step, but given the volumes of data to be translated, the entire system had to be reconfigured to ensure that the pre- and post-processing as well as the overall workflow were capable of handling the required volumes while, in parallel, also handling the standard, day-to-day production volumes of updates across a wide array of languages.

To mitigate the deficiencies of NMT with potential out-of-domain content and with the very short and very long sentences common in this type of material, a hybrid system was deployed, combining the best of both worlds: SMT and NMT (a simplified routing sketch follows this list). All in all, the deployment and tuning of the system required considerable engineering effort and capital.

•Quality and training: It is nothing new that the quality of the training data and its closeness to the domain matter for adequate translation quality, and this applies equally to SMT and NMT. However, because the NMT engine is trained only on parallel data and is very challenging to debug, initial deployments quickly showed that two things were required: first, much higher-quality parallel data, and second, a new process for handling mark-up, formulas, references and so on during engine training.

The result was a complete redevelopment of the processes for validating the millions of segments of aligned data, to ensure that source and target were exact translations and not merely approximations, which can lead to very undesirable outcomes with NMT (a sketch of this kind of validation follows this list). In addition, a completely new training pipeline was developed to handle the mark-up in the data and to ensure the system could deal with it effectively during translation.

•Complexity: The complexity of the marked-up data, as well as the language pair itself, results in a lot of “long-distance reordering,” meaning words traveling to opposite ends of the sentence. This makes it very difficult to keep all the mark-up in the correct position and presented a range of challenges. These challenges were addressed by a) developing a totally new training process and b) performing extensive pre-processing to protect mark-up prior to translation and post-processing to re-insert it at the right position afterwards, while ensuring the consistency of references (illustrated in the sketch below).
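The placeholder approach for protecting mark-up can be sketched as follows. The tag pattern, the placeholder format and the example sentences are invented for illustration; the production pipeline is considerably more involved, but the principle of masking mark-up before translation and restoring it afterwards is the same.

```python
# Simplified sketch of protecting patent mark-up during translation: mark-up
# spans are replaced by numbered placeholder tokens before the text is sent
# to the MT engine and restored afterwards. The tag pattern and placeholder
# format are invented for this example.
import re

TAG_PATTERN = re.compile(r"<[^>]+>|\[FIG\. ?\d+\]")   # example mark-up shapes

def mask_markup(source):
    """Replace mark-up with stable placeholders and remember the mapping."""
    mapping = {}
    def _replace(match):
        token = f"__TAG{len(mapping)}__"
        mapping[token] = match.group(0)
        return token
    return TAG_PATTERN.sub(_replace, source), mapping

def unmask_markup(translation, mapping):
    """Re-insert the original mark-up wherever reordering has moved the
    placeholders in the target sentence."""
    for token, original in mapping.items():
        translation = translation.replace(token, original)
    return translation

masked, tags = mask_markup("The compound <chem id='7'/> shown in [FIG. 3] is heated.")
print(masked)   # The compound __TAG0__ shown in __TAG1__ is heated.
# ... the masked text goes through the MT engine; placeholders survive reordering ...
print(unmask_markup("__TAG1__に示す化合物__TAG0__を加熱する。", tags))
```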
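The validation of aligned segments mentioned under “Quality and training” involved far more extensive checks than can be shown here, but a few typical heuristics look roughly like this. The thresholds and rules are illustrative assumptions, not the project’s actual criteria.

```python
# Illustrative filters for validating aligned English-Japanese segments before
# NMT training. The thresholds and heuristics are examples only; production
# pipelines apply far more extensive checks.

def keep_segment(en, ja, min_chars=5, max_chars=2000, max_ratio=3.0):
    """Return True if the segment pair passes some basic sanity checks."""
    if not en.strip() or not ja.strip():
        return False                     # one side is empty
    if len(en) < min_chars or len(en) > max_chars:
        return False                     # too short or too long to be reliable
    if len(ja) < min_chars or len(ja) > max_chars:
        return False
    ratio = len(en) / max(len(ja), 1)
    if ratio > max_ratio or ratio < 1.0 / max_ratio:
        return False                     # lengths wildly out of proportion
    if en.strip() == ja.strip():
        return False                     # untranslated copy of the source
    return True

pairs = [
    ("A method for producing a resin composition.", "樹脂組成物を製造するための方法。"),
    ("See claim 4.", ""),                            # empty target: rejected
    ("FIG. 2", "FIG. 2"),                            # untranslated copy: rejected
]
clean = [(en, ja) for en, ja in pairs if keep_segment(en, ja)]
print(len(clean))   # 1
```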
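Finally, the hybrid SMT/NMT deployment mentioned under “Infrastructure and architecture” can be caricatured as a simple router: sentences the neural engine tends to handle poorly are sent to the phrase-based engine instead. The length thresholds and the two translate functions are placeholders; the real decision logic was considerably more nuanced.

```python
# Caricature of a hybrid SMT/NMT deployment: sentences that NMT tends to
# handle poorly (very short fragments, very long claims) are routed to the
# SMT engine instead. Thresholds and the two engine calls are placeholders.

def translate_with_nmt(sentence):      # stand-in for the neural engine
    return f"[NMT] {sentence}"

def translate_with_smt(sentence):      # stand-in for the phrase-based engine
    return f"[SMT] {sentence}"

def hybrid_translate(sentence, min_tokens=3, max_tokens=120):
    n_tokens = len(sentence.split())
    if n_tokens < min_tokens or n_tokens > max_tokens:
        return translate_with_smt(sentence)   # fall back for extreme lengths
    return translate_with_nmt(sentence)

print(hybrid_translate("FIG. 1"))                                # routed to SMT
print(hybrid_translate("The valve opens when pressure rises."))  # routed to NMT
```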

Result and conclusions

The result is a working system that meets all the requirements defined in the mission statement and provides best-in-class translation quality, better than any system we had access to. It was, however, an ambitious project in that it applied an emerging technology to a business initiative.

While the project was successful and proved that NMT offers substantial improvements over older technologies, it also clearly showed that deploying NMT requires a solid understanding of the use case; a wide range of data, processing and IT skills; and a well-structured approach. NMT is very powerful if applied correctly, but it is not the “black box” magic translation system that will provide best-in-class quality in all cases. As with all other translation technologies, quality comes from domain adaptation and “quality” thinking throughout the process of building the system. At the end of the day, the golden rule of data processing still applies: “garbage in, garbage out.” However, if deployed correctly and trained for specific use cases, NMT and specifically deep NMT can provide extremely promising results and a whole new level of translation outcomes.