We have been working for a large client in the manufacturing field, meeting their translation requirements with a mixture of human translation and post-edited machine translation. During 2017 we looked at using neural machine translation (NMT) to supplement and ultimately replace the statistical machine translation (SMT) approach we have used to date. We built a set of domain-specific NMT engines for this client and conducted comparative tests against the existing SMT engines, using actual client data to make the results as realistic as possible. These tests and their results should have been quite straightforward, but as we quickly discovered, this was not the case. The results were informative and, in some cases, unexpected.
Background
The client’s requirement is for the localization of technical material into up to 60 languages. Most of their work is in 15 of these languages. This material is in a highly structured format with very strict authoring guidelines, making the material almost ideal for use with translation technology such as computer-assisted translation tools, and ultimately machine translation (MT).
All work, whether post-edited or human translated, goes to an independent third party for quality assessment with stringent criteria. As a vendor, we are judged on these quality assessment scores, and year-on-year we have achieved very high client satisfaction ratings (90%) using this blended approach.
Although SMT and NMT serve the same purpose (translating text from one language into another), they are quite different in their approach. The SMT process comprises a number of separate subcomponents whose outputs are weighted and recombined to maximize translation quality. NMT, on the other hand, is built on deep neural networks and is trained as a single end-to-end operation.
Given current SMT performance, we were eager to compare these two MT technologies for this specific client, hoping to achieve better results with NMT.
Ongoing support
The MT engines have been trained on the client’s data and vocabulary and are therefore specific to their material. At the time of writing, we have 11 MT engines deployed and in use, with a further four in development (being tested before they are deployed). The engines have been built using the Moses SMT technology, using data provided by the client. This data includes translation memories in TMX format, glossaries in Excel format and some “generic” material.
To continually support this client, we are looking to enhance our solution so that the existing engines are updated, and new engines are created to cover more of the client’s required languages. Existing engines are periodically retrained using the latest material from the client, including the most up-to-date translation memories and glossaries. This keeps the engines in line with the client’s latest work and most recent changes in vocabulary.
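To illustrate the kind of preparation involved in a retraining cycle, the sketch below shows one way the parallel segments might be pulled out of a TMX translation memory and written into the line-aligned plain-text files that Moses (and most NMT toolkits) expect. The file names and language codes are hypothetical, not the actual client data.

```python
# A minimal sketch: extract (source, target) segment pairs from a TMX
# file. TMX is plain XML; segments live in <tu><tuv xml:lang=".."><seg>.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_path, src_lang="en", tgt_lang="de"):
    """Yield (source, target) segment pairs from a TMX file."""
    tree = ET.parse(tmx_path)
    for tu in tree.getroot().iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG, "").lower().split("-")[0]
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang] = seg.text.strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]

# Write one segment per line, line-aligned across the two files, which
# is the corpus format the engine-building tools consume.
with open("corpus.en", "w", encoding="utf-8") as f_src, \
     open("corpus.de", "w", encoding="utf-8") as f_tgt:
    for src, tgt in extract_pairs("client_memory.tmx"):  # hypothetical file
        f_src.write(src + "\n")
        f_tgt.write(tgt + "\n")
```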
Our standard engine build process is to create a group of candidate engines from differing data sets and to test each engine's output using the same test data. For each engine test, a BLEU quality score is generated. The BLEU scores (BLEU is an industry-standard method of evaluating machine translation quality) from these engines are compared before one engine is selected for further development and use.
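A simplified sketch of that selection step might look like the following, assuming each candidate engine has already translated the same held-out test set. The file names are illustrative; scoring uses the sacrebleu library.

```python
# Score each candidate engine's output against the approved reference
# translation and keep the engine with the highest corpus BLEU.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

references = read_lines("test.reference.de")       # gold standard
candidates = {                                     # illustrative names
    "engine_a": read_lines("test.engine_a.de"),
    "engine_b": read_lines("test.engine_b.de"),
    "engine_c": read_lines("test.engine_c.de"),
}

scores = {
    name: sacrebleu.corpus_bleu(hyps, [references]).score
    for name, hyps in candidates.items()
}
best = max(scores, key=scores.get)
print(f"Selected {best} (BLEU {scores[best]:.1f})")
```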
The normal process is for work to be translated by the appropriate engine before being sent to an editor for correction. Post-edited files are then sent for an independent quality assessment (QA). On one occasion files were sent directly for QA without being edited, and the QA team returned them with ratings of between 96% and 98%, the “pass” score being 85%.
This episode taught us a couple of important lessons, the most important being proper adherence to procedure. It also showed us that we were on the right course with the use of MT for this client's work.
Shift to NMT
As part of the ongoing support for this client, we are exploring the latest MT technologies, and in 2017 that means NMT. Why, if the current quality is so good, do we want to change? Well, although the existing technology does produce good results, there is no advantage in standing still. If we can improve the performance of our systems we will; we will always strive to be better.
The MT engines we have in use do produce good results; however, some engines perform better than others, depending on the language. If a change in approach will improve the performance of the lower-scoring languages, bringing them up to the level of the better ones, then this change will be pursued. It will mean a more efficient post-editing process and a better final result. The change in approach will also aim to produce quality increases in the better-performing languages.
There are a number of techniques used to create an NMT engine. We employed the state-of-the-art attention-based encoder-decoder architecture as our NMT implementation, experimenting with different configurations. We then selected the configuration that returned the best BLEU score on the test data.
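In outline, the configuration search looked something like the sketch below. The hyperparameter grid and the train_and_score() helper are hypothetical stand-ins for whatever the chosen NMT toolkit exposes; the point is simply that each configuration is trained and its BLEU score on held-out data decides.

```python
# An illustrative configuration sweep; only the selection logic matters.
from itertools import product

grid = {
    "embedding_dim": [256, 512],
    "hidden_dim": [512, 1024],
    "num_layers": [2, 4],
    "dropout": [0.1, 0.3],
}

def train_and_score(config):
    # Placeholder: in practice this launches a full training run of the
    # attention-based encoder-decoder and returns BLEU on the dev set.
    return 0.0

best_config, best_bleu = None, -1.0
for values in product(*grid.values()):
    config = dict(zip(grid, values))
    bleu = train_and_score(config)
    if bleu > best_bleu:
        best_config, best_bleu = config, bleu

print(f"Best configuration: {best_config} (BLEU {best_bleu:.1f})")
```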
The language pairs under investigation were carefully chosen: for this client they are typically considered "difficult" language pairs, and they had been performing poorly with the SMT engines. We were particularly interested in them because we hoped NMT would solve, or at least improve, the performance of these difficult pairs.
For a fair comparison between SMT and NMT, exactly the same data sets were used, ensuring that the different engines would be built, evaluated and tested using the same information. As an additional test, a newer test data set was created. This new test data set comprised data from more recent client jobs, and so should be unknown to both engine types.
The SMT engines were trained and evaluated on servers with 2-core central processing units (CPUs), while the NMT engines were trained and evaluated on servers with a single NVIDIA Tesla K80 graphics processing unit (GPU) accelerator. The NMT servers required the additional processing power to keep the build time within a reasonable timeframe. The training/evaluation data sizes and rough training times are listed below. Note that, depending on the target language, some of the SMT models are hierarchical phrase-based and the others phrase-based, chosen to maximize output quality. For each language pair, a selection of data was removed from the main text data set and put aside for development and testing. This selection was generated at random from the main data set and included approximately 20,000 words. The development and testing data were identical for both the SMT and NMT engines for each language.
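A minimal sketch of how such a held-out selection can be drawn, with a fixed random seed so the SMT and NMT engines receive identical splits; the word target and seed value are illustrative:

```python
# Randomly move sentence pairs out of the training corpus until roughly
# 20,000 words of source text have been set aside for dev/test.
import random

def split_holdout(pairs, target_words=20_000, seed=42):
    """pairs: list of (source, target) strings. Returns (train, holdout)."""
    rng = random.Random(seed)          # fixed seed => reproducible split
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    holdout_idx, words = set(), 0
    for i in indices:
        if words >= target_words:
            break
        holdout_idx.add(i)
        words += len(pairs[i][0].split())
    train = [p for i, p in enumerate(pairs) if i not in holdout_idx]
    holdout = [pairs[i] for i in sorted(holdout_idx)]
    return train, holdout
```

Because the seed is fixed, re-running the split on the same corpus reproduces exactly the same held-out set for both engine types.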
Looking at the table, the time needed to build an NMT engine is on average about six times longer than that of an SMT engine, and this increase affects the cost of creating these engines. Supposing all of the engine training were done on Amazon EC2 servers, running the CPU instances costs approximately $0.111 per hour, and running the GPU instances costs about $0.972 per hour. Taking the increased build time into consideration, the total financial cost of building an NMT engine is therefore approximately 50 times that of an SMT engine. Of course, once the engine is built it can be used again and again.
Once the engines are built and deployed, the SMT engines achieve a speed of around 1,000 words per second on a single CPU, while the NMT engines achieve around 120 words per second on the GPU. Using the same hardware configuration we used at the engine-building stage, the running cost for an NMT engine is approximately 70 times that of an SMT engine.
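As a back-of-the-envelope check, both cost ratios follow directly from the quoted EC2 rates, the roughly sixfold build time and the measured throughputs:

```python
# Arithmetic behind the ~50x build-cost and ~70x running-cost figures.
CPU_RATE = 0.111   # $/hour, CPU instance (SMT)
GPU_RATE = 0.972   # $/hour, GPU instance (NMT)

build_time_ratio = 6                      # NMT takes ~6x longer to build
build_cost_ratio = build_time_ratio * GPU_RATE / CPU_RATE
print(f"Build cost ratio: ~{build_cost_ratio:.0f}x")   # ~53x, i.e. ~50x

smt_speed, nmt_speed = 1000, 120          # words per second
run_cost_ratio = (GPU_RATE / nmt_speed) / (CPU_RATE / smt_speed)
print(f"Running cost ratio: ~{run_cost_ratio:.0f}x")   # ~73x, i.e. ~70x
```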
Good news and bad news
Once the engines have been built, they are scored against a test data set. This data set is not used to build the engine and so is unknown as far as the MT system is concerned. The test data is in a bilingual format; the translation has been approved and is treated as a gold standard. The scoring process measures how well the MT result compares to this gold standard, based on how many groups of words (n-grams) the MT result and the gold standard have in common. The process produces an overall score between 0 and 100. Low overall scores indicate poor translation performance. Overall scores below 20 are considered too poor to use, while scores in the range of 30 to 60 are considered suitable for post-editing. A very high overall score, however, is sometimes indicative of a poorly selected test data set, which may not actually be unknown to the system and will skew the results.
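As a toy illustration of the "groups of words in common" idea, the sketch below counts overlapping n-grams between an MT output and a gold standard. Real BLEU combines clipped precisions for n-gram lengths 1 to 4 with a brevity penalty; this shows only the core overlap count.

```python
# Count how many n-grams of the MT output also appear in the reference.
from collections import Counter

def ngram_precision(hypothesis, reference, n=2):
    hyp = hypothesis.split()
    ref = reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
    # Clip each hypothesis n-gram count by its count in the reference.
    matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return matches / total if total else 0.0

print(ngram_precision("the machine translates the text",
                      "the machine translated the text"))   # -> 0.5
```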
We might expect the above table to show the NMT engines outperforming the SMT engines, but it only shows this happening for some of our target languages. From the table and graph, we can see the results are mixed and, in places, surprising, depending upon the chosen language pair and test data set. There are four different scenarios:
1. NMT fit the training data better and performed better on new data (EN>RU and EN>IT). This is the "expected" outcome.
2. SMT fit the training data better, but NMT still outperformed SMT on new data (EN>ES, EN>DE and EN>NL).
3. NMT fit the training data better, but underperformed SMT on new data (EN>PL).
4. SMT fit the training data better and also beat NMT on new data (EN>SV, EN>UK).
The results for scenario 1 are as expected, with the NMT engines outperforming the SMT engines.
Scenario 2 could be explained by the age of the test data set and the maturity of the SMT engines. These engines are some of the oldest we have, and over the time that they have been running, the test data, which should be “unknown” to the engines, has become “known” through the periodic updating of the engines with new client data. It is for this reason that we created the “new” test data set, and using this test data, the engines perform as expected, with the NMT engines outperforming the SMT engines.
In scenarios 3 and 4, we have the engines created from the smallest data sets: 1 to 3 million words, against the 5 to 10 million words used to create the other engines. This may go some way toward explaining their poor performance, and may provide a benchmark against which we will measure the data sets for other languages before beginning engine creation. We will look at these languages again when we have collected more data.
A comparison
As a final assessment, we performed a series of tests using a data set created from files taken from other clients in similar fields, and compared their BLEU scores using both the SMT and NMT engines. The results can best be described as mixed, with some languages performing adequately and others very poorly. As both engine types were built using a very client-focused data set and were not expected to operate as generic engines, these results should not have been too surprising: we would only expect good results on material very closely related to the client's original material.
Generally, NMT will give us better results than SMT, but it’s not a case of “NMT is good and SMT is bad,” as there are circumstances where the older SMT technology will perform just as well if not better.
The poorest-performing languages had the smallest data sets, so there is possibly a minimum data set size that needs to be reached before NMT results show an improvement over SMT. The results we have seen suggest that this limit is about four million words. This is an important consideration when building domain- or subject-specific engines, as collecting this volume of data will present challenges.
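Expressed as a simple, deliberately rough decision heuristic, with the four-million-word figure being our empirical observation rather than a general rule:

```python
# Illustrative technology choice for a new language pair, based on the
# corpus size threshold we observed in these tests (not a general rule).
NMT_MIN_WORDS = 4_000_000

def recommend_technology(corpus_word_count):
    """Suggest an engine type for a new language pair."""
    if corpus_word_count >= NMT_MIN_WORDS:
        return "NMT"
    return "SMT (revisit NMT once more data is collected)"

print(recommend_technology(2_500_000))  # -> SMT for a 2.5M-word corpus
```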
Creating an NMT engine takes considerably longer and costs considerably more than creating an SMT engine, and the running speeds are also slower. These factors will influence the choice of technology, especially if a rapidly deployed and agile solution is required.
Will we persevere with NMT?
In short, yes, because although for this case it doesn’t give an “across the board” improvement, it does offer an improvement for some languages. With extended use and with engine rebuilding, the results will get better. It may be that we continue to use the “older” SMT technology for some languages, and switch to NMT for others. The choice of technology will be driven by the results gained, so that the best available solution is used for each language.
The lessons we’ve learned supporting this client can be applied to other customers, and will inform the choice of MT technology employed on an individual basis.