Artificial Intelligence

Neural Machine Translation Versus Large Language Models

Which technology will drive the future of automated translation?

By Jourik Ciesielski

A new age of language technology has dawned. But in the half-light, linguists and business leaders alike are still forming a vision of what, exactly, that age will look like.

Prior to the introduction of large language models (LLMs), neural machine translation (NMT) defined the computer-assisted translator’s toolset. And to some degree, it still does. That’s why the advent of LLMs hit the language industry with the force of a nuclear blast — everyone knew the world was changed forever, but no one could say exactly how.

One could make an educated guess about who might define the conversation though, and sure enough, eyes turned toward the world’s largest and most influential language companies for clues. When considering the question of best practices vis-à-vis NMT and LLMs, MultiLingual reached out to several of these major players and received responses from Bureau Works, LILT, Lionbridge, memoQ, Pangeanic, and Translated.

Their combined responses led to some fascinating insights.

Generally speaking, the whole language services industry agrees that adaptive machine translation (MT) is the quickest and easiest way to implement customized MT solutions. Since it adapts in real time to legacy data, it gives the quality of highly customized models without the need for training and maintenance.

Despite those strengths, the adaptive MT market is small. But recent changes to the market have been dynamic. LILT is an established player. RWS has revamped Language Weaver. SYSTRAN’s fuzzy match adaptation is integrated into memoQ and XTM products. And ModernMT’s evolution toward maturity includes a human-in-the-loop feature widely hailed, even by linguists, for its efficacy.

However, MT model training has proven cumbersome. Training on bilingual data is expensive, time-consuming, and difficult to control. What do you do if your model still performs poorly after an extensive training round? And MT glossaries can do more harm than good if implemented recklessly.

That’s why many companies were eager to leverage LLMs. Translation management system (TMS) providers rushed to add LLM-driven translation features — from Bureau Works and Crowdin to Smartling and Transifex — while memoQ is gradually rolling out adaptive generative translation. Google partnered with Welocalize (and a few other companies) to evaluate its adaptive LLM translation solution. SYSTRAN was acquired by ChapsVision, claiming that “in this new AI era, it is more difficult for small players to keep the pace, and so the best option is often to aggregate with other actors to get bigger and thus stronger.” And Unbabel announced the release of Tower, a multilingual LLM based on Meta’s Llama 2 for translation-specific tasks. Pangeanic followed suit with its ECO LLM.

What’s more, IBM announced the deprecation of Watson Language Translator, its NMT service, encouraging users to migrate to — guess what? — WatsonX LLMs. This move establishes IBM as one of the first tech giants to sunset its NMT efforts and focus on LLMs for automated translation purposes.

Clearly, our industry has taken a big step toward making LLMs and their flexible adaptation techniques the new default for automated translation. While one can draw one’s own conclusions from the state of language technology, it’s worth listening to the industry’s leaders for their perspectives. And happily, they have no shortage of thoughts on the matter.

The localization industry has seen many types of customized MT: models trained on bilingual corpora, glossaries, adaptive MT, and now prompting-based MT through LLMs. What do you think is the best approach to customized MT?

Bureau Works (founder and CEO Gabriel Fairman): The best approach is what we call “context sensitivity,” which uses LLMs’ analytical and predictive capabilities. We work with a retrieval-augmented generation (RAG) framework that examines the text and looks for relevant context in the translation memories (TMs), glossaries, MT repositories, work unit, and preferences. After we retrieve the context, we have a dynamic system that ranks this context according to relevance using a wide range of metadata, including author, creation date, times confirmed in the past, and semantic plausibility. We then feed this context to a cluster of LLMs that work as arbitrators to suggest the most likely outcome given all of this context. This suggestion then goes through a formatting filter and is returned to the translation editor. This is the approach most likely to create a translator digital twin, and it is therefore the most dynamic and effective. It’s also easy to scale and manage, as all knowledge is stored in TMs and glossaries and does not require fine-tuning instances.
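As a rough illustration of that retrieve-then-rank pattern, here is a minimal sketch in Python. The scoring heuristics and field names are invented for illustration; this is not Bureau Works’ implementation.

from dataclasses import dataclass
from datetime import date

@dataclass
class ContextItem:
    source: str           # source-language segment
    target: str           # stored translation (TM), term (glossary), or MT output
    origin: str           # "tm", "glossary", or "mt"
    author: str
    created: date
    times_confirmed: int  # how often linguists have confirmed this entry

def relevance(item: ContextItem, query: str) -> float:
    """Toy relevance score: lexical overlap weighted by usage metadata."""
    overlap = len(set(query.lower().split()) & set(item.source.lower().split()))
    recency = 1.0 if item.created.year >= 2023 else 0.5
    trust = min(item.times_confirmed, 10) / 10
    return overlap * (1 + trust) * recency

def build_context(query: str, candidates: list[ContextItem], k: int = 5) -> list[ContextItem]:
    """Rank retrieved TM, glossary, and MT entries; keep the top k for the prompt."""
    return sorted(candidates, key=lambda c: relevance(c, query), reverse=True)[:k]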

Translated (Tech Evangelist Kirti Vashee): The business objective of using translation technology that enables an enterprise to be multilingual at scale is to improve the global customer experience and drive international market success. The technology used needs to be scalable, responsive, reliable, and cost-effective, all while producing high-quality output across a large number of language combinations. Adaptive MT technology has shown itself to be the most capable enabling technology to date. More recently, we’ve seen evidence that LLMs — if properly implemented — can improve fluency and raise the overall quality for a subset of languages. However, we have yet to see this scale to the other production needs mentioned above.

We anticipate that, soon, LLMs will become a viable enterprise solution for translation. This will likely come when we move towards task-specific LLMs trained specifically for translation. These models will be smaller and more practical to deploy and maintain than today’s massive foundational models.

In the interim, both LLMs and classical MT approaches may be useful in parallel. Still, most enterprises would likely prefer a single integrated solution unless there are significant advantages for key languages by using two different production pipelines.

In general, the choice of technology will always be secondary to the positive measurable impact on global customers. MT quality differences must be balanced with latency, throughput, and cost realities. The preferred solution will probably be the technology that provides a reliable, consistent, high-quality, and cost-effective deployment in production scenarios.

LILT (VP of Growth Allison Yarborough): LILT combines all of these, and we believe this combined approach is best: we train on bilingual corpora for adaptive MT, both on TMs and online while translators are working; we utilize glossaries in the translation algorithm; and we integrate translation samples, akin to LLM prompting, into the MT system. Each method has its advantages and disadvantages, but we have found that the combination provides the best results.

Lionbridge (CTO Marcus Casal): While there is no one-size-fits-all approach to customized MT, new methods exist to improve its results. Traditionally, customization involved training a base model for specific brands, domains, or other use cases, but there was limited demand for this level of specificity. With the rise of LLMs, we’re identifying a new approach: using the LLM to improve the output of a base MT engine rather than customizing the engine itself.

Essentially, through a well-tuned, strategic prompt flow, we can prompt the LLM to check the quality of the translation and refine it based on specific requirements like glossaries and audience. We have found a lot of value in a two-step process that combines baseline MT engines with highly targeted LLM prompting strategies to achieve both accuracy and fluency in customized translation. And, of course, this is a prompt flow with iterative prompting that ranges across personas and source, target, and bilingual language to achieve the desired outcome.
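A minimal sketch of such a two-step flow: a baseline MT draft refined by a targeted LLM prompt. The prompt wording, the baseline_mt stub, and the model name are assumptions, not Lionbridge’s production prompt flow.

# Sketch of a two-step MT + LLM refinement flow (illustrative only).
# Assumes the OpenAI Python SDK; baseline_mt() stands in for any NMT engine.
from openai import OpenAI

client = OpenAI()

def baseline_mt(source: str, target_lang: str) -> str:
    raise NotImplementedError  # placeholder: call your MT engine of choice

def refine(source: str, draft: str, glossary: dict[str, str], audience: str) -> str:
    terms = "\n".join(f"- {src} => {tgt}" for src, tgt in glossary.items())
    prompt = (
        f"You are reviewing a machine translation for a {audience} audience.\n"
        f"Source: {source}\n"
        f"Draft translation: {draft}\n"
        f"Enforce these glossary terms:\n{terms}\n"
        "Return only the corrected translation."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any instruction-tuned model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content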

memoQ (Chief Evangelist Florian Sachse): Prompting-based MT through LLMs combines strong language encoding (structure, grammar, tone of voice) with a high level of relevant domain information, which can be provided in a prompt. LLMs will still need to improve for certain languages, which will require more data. But for many first-tier languages, LLMs can generate fluid and grammatically correct content. Increasing the correctness of the generated content will depend on prompt engineering and the context information, which, in our case, typically comes from TMs and terminology. Improving translation quality will not work through retraining the LLM but by improving the prompt, which is much more predictable and controllable. If predictability and repeatability (continuous workflows) are key, this is the most efficient approach.

Pangeanic (founder and CEO Manuel Herranz): In 2024, the best approach to customized MT continues to be NMT. We have achieved a level of parallel corpora availability that allows for the creation of MT engines at very economical costs. It scales well, and adaptation can happen in several ways. At Pangeanic, we provide the ability to inject data into a baseline model with three levels of aggressivity, which customizes models in minutes. Other companies do it “on the fly” — a very attractive concept, but also a way to accumulate and propagate “on the fly” errors. Serious and professional workflows always require human verification of the TMX file before it is injected into the adaptive NMT engine for retraining. NMT is much cheaper to run than LLM-based translation as well. It is more “controllable” for specific objectives, such as ecommerce, subtitling with a lot of conversational expressions, software, and healthcare.

Prompt-based translation is proving very popular, and it has advantages and disadvantages. The largest disadvantage is the lack of control over the output. Let’s not forget that LLMs are generative AI (GenAI). In science and engineering, we are used to getting the same results when we apply the same formula. Well, we all know that asking an LLM the same question does not necessarily guarantee the same translation result. That’s not bad if you have occasional translation needs, like translating an email. But try to incorporate LLM-based translation consistently at scale while fully respecting terminology and styles, and the LLM seems to have a mind of its own.
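To make the repeatability point concrete, here is a quick check, assuming the OpenAI Python SDK. The model name is a placeholder, and even a fixed seed makes determinism only best-effort:

from openai import OpenAI

client = OpenAI()

def translate(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": f"Translate into Spanish: {text}"}],
        temperature=0,        # greedy decoding...
        seed=42,              # ...plus a fixed seed; still only best-effort
    )
    return resp.choices[0].message.content

# Distinct outputs across five identical requests; often 1, but not guaranteed.
outputs = {translate("The quarterly report is attached.") for _ in range(5)}
print(len(outputs))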

All independent MT companies, as well as TMS companies, are working to incorporate GenAI into their workflows, but with no guarantees and no real customization. Prompting is not enough. There is a temptation to assume that, after getting 10 results right, all translations are going to be fine and that LLM-based translation will work just like NMT does. It doesn’t.

We have tested pure prompt-based LLM translation. Unless you have a specific model trained for the translation task, clever and well-tested prompting, and an established workflow, it will generate free renderings rather than “accurate” translations. In short, models trained on bilingual corpora and glossaries are very effective, and relevant, sufficient data is widely available — at least in major languages. Adaptive MT can further enhance quality if there is sufficient and regularly updated training data.

However, prompting-based MT using LLMs offers more natural and contextually relevant translations, especially when domain-specific training data is limited or non-existent. LLM translation is great for off-the-cuff Japanese <> Spanish or Polish <> Mandarin. I do see the value there.

So, how long will we hold on to NMT? Not long, I dare say. I envisage GenAI systems that, at a similar or higher cost, offer a lot more automation from a single application programming interface (API) connection, benefitting from GenAI’s fluency and post-editing (PE) in context at scale.

Even though TMs are often perceived as obsolete, they are still the primary linguistic resource in many localization programs. What will be the role of TMs in the GenAI era?

Bureau Works: TMs are great sources of context. They will continue to be relevant, but they will become easier to maintain and expand.

Translated: TMs will continue to play an important role in adapting model output to the user’s needs. TMs have been mission-critical for all data-driven approaches to MT, from statistical MT onwards. However, the quality of the data is important. TM maintenance has received much less attention than maximizing match leverage. More attention is needed on data cleansing and data optimization for prompting, RAG, and other processes that are useful for LLMs and the technologies beyond them. AI learns from data, and more relevant data will usually produce better results. Not all TMs are equal, and human-reviewed, quality-certified TMs will always be more useful.

In the future, metadata that is not widely used or available today (source, quality, domain) may become more important as optimizations will be based on utility and relevance to a downstream AI process rather than just a simple string match, as is the case most often today. Synthetic data creation is also likely to become more important, and this is highly dependent on the quality and categorization efficiency of the seed data.

LILT: TMs provide training data, and the accuracy of models is highly dependent on the quality of the data used to train them. If high-quality data is used to train a model, it generates better outputs; if poor-quality data is used, it will generate bad outputs. The same principle applies to TMs: if TM quality is high, it is a useful data source for training a model, and if TM quality is poor, it should not be used. LILT fine-tunes a bespoke model per customer per language, and a customer’s TMs are a data source in that fine-tuning and customization for the customer’s preferences, tone, and terms. There will likely be an expansion of the data source types used to fine-tune models; notably, there are already application layers and tools that capture human feedback in real time (such as LILT’s platform) to create a real-time model training cycle. TMs also provide value in consistency, particularly with exact and near-exact matches. Over time, we may see “low fuzzies” devalued compared to suggestions from GenAI (assuming that the linguist can access both simultaneously). As low fuzzies are often already overvalued in perceived usefulness, we do not perceive this as a bad shift.

Lionbridge: While TMs excel at reducing translation costs and maintaining consistency within specific domains, the future of localization lies in their synergy with MT and LLMs. Particularly exciting is the potential to leverage LLMs to enhance TM quality, which can boost the value proposition of TMs.

GenAI can help address some of the challenges with traditional TMs, like outdated content. For example, formal modes of address might still be present in a TM, even if they are no longer used. Fixing these inconsistencies manually is time-consuming and expensive. But with sophisticated language prompts and an iterative approach, GenAI can reliably and cost-effectively update that TM. However, this requires deep expertise in both the languages and the domain being translated.
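A minimal sketch of what such a TM-cleanup pass could look like. The prompt wording, model name, and the German Sie/du example are illustrative assumptions, not Lionbridge’s workflow:

# Sketch: using an LLM to modernize outdated TM entries, one at a time.
from openai import OpenAI

client = OpenAI()

def modernize_target(source: str, target: str, style_rule: str) -> str:
    prompt = (
        "Update this translation memory entry so the target follows the rule "
        f"below, changing nothing else.\nRule: {style_rule}\n"
        f"Source: {source}\nTarget: {target}\n"
        "Return only the updated target."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# Hypothetical example: migrate German targets from formal "Sie" to informal "du".
# modernize_target("Click Save.", "Klicken Sie auf Speichern.",
#                  "Address the reader informally (du, not Sie).")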

And I think this is just the beginning. GenAI has the potential to supercharge existing language assets, unlocking their potential and making localization more efficient and effective.

memoQ: TMs will be the key to providing good translations. Our research shows that TM-enriched prompts can get translations out of LLMs that are on par with customized MT systems. This approach brings control and value back to linguists who care about the TMs they maintain, and it allows any linguist to benefit from MT. They just need to do their homework properly: maintain TMs and terminology.
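To illustrate, here is a generic sketch of a TM-enriched prompt, with fuzzy matches serving as few-shot examples and terminology as constraints. The layout is an assumption, not memoQ’s actual prompt format:

def tm_enriched_prompt(source: str,
                       fuzzy_matches: list[tuple[str, str]],
                       terms: list[tuple[str, str]],
                       target_lang: str = "German") -> str:
    """Assemble an LLM translation prompt from TM fuzzy matches and terminology."""
    examples = "\n".join(f"Source: {s}\nTarget: {t}" for s, t in fuzzy_matches)
    glossary = "\n".join(f"- {s} -> {t}" for s, t in terms)
    return (f"Translate into {target_lang}, following the style of these "
            f"approved translation memory entries:\n{examples}\n\n"
            f"Use this terminology:\n{glossary}\n\n"
            f"Source: {source}\nTarget:")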

Pangeanic: Prompt-based models can be a strong choice when domain-specific training data is limited. This is where I see very good value in a technology and workflow that is otherwise becoming more obsolete by the day. This may not be a very popular statement, but unless TMSs include something truly revolutionary, there is little value in systems that are designed to receive files — which are then processed against TMs — with the added cost of project managers, all to save on translation costs through translation matches.

The TMs we have built over the years are excellent resources (parallel corpora, fit for machine learning), so we can “tame” — not just fine-tune — LLMs to produce the desired results.

Language data management (LDM) was anticipated to become a central (and perhaps lucrative) service after the big NMT push of 2016-2017, but it never really got off the ground in the language industry. Will LDM revive to fine-tune LLMs and/or enrich RAG mechanisms?

Bureau Works: In the past, a greater corpus was synonymous with greater quality. Now, we think smaller, better-built corpora produce better results. I’m confident that LDM will continue to grow, but we don’t believe it’s critical to results anymore.

Translated: While RAG and prompt engineering are currently very much in vogue, how we steer LLMs to perform more effectively for specific tasks is likely to evolve further. Recently, there has been success with linking knowledge graphs to produce more relevant and contextually accurate results from LLMs. This area of expertise could grow, as it involves logically connected data concepts, better contextual relevance, and some basic semantics — all elements that are close to skills prevalent in our industry. LDM can only grow when the platform for which these services are performed is more stable and not evolving as quickly as LLMs are today. We have already seen that early advice on prompt strategies is now outdated and less relevant.

LILT: While the concept of LDM was strong, it was limited by the weak tooling available to support it. Older systems are generally limited to receiving linguistic data solely in the form of a TM, and many MT-focused systems don’t expect “bring your own data,” so most companies ultimately defaulted to TMX files. The newer focus on LLMs makes it that much more essential to have a vertically integrated platform that can seamlessly tie the production of high-quality content to linguistic data, training, and tuning tasks. As companies begin to deploy and operationalize LLMs at scale, they will begin to understand the importance of fine-tuning for content quality, brand alignment, and use-case specificity. As a result, they will likely allocate more time, resources, and consideration to LDM.

memoQ: The game-changer is that LDM can focus on the domain-specific part. There’s no need to also capture the general language aspects, and retraining the whole MT system is not needed. This reduces training effort and increases predictability, so LDM can become much more efficient and has a better chance of getting off the ground.

Pangeanic: I see a bright future for RAG. Translation buyers do not typically care how the magic is worked out in the background. That is a discussion that is left for us developers.

A workflow that has in-domain NMT engines and runs RAG-based PE is something we are already testing in production at Pangeanic. I envisage a good future for it, and results are very promising. It’s a challenge to make a vector database behave like a TM; it’s not designed to be a typical database looking at fuzzy matching. But once you master the process, the system produces amazing quality translations at scale.
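A minimal sketch of the idea: embed TM sources into a vector index so semantic neighbors can stand in for fuzzy matches in a post-editing prompt. The libraries (sentence-transformers, FAISS) and the model name are illustrative choices, not necessarily Pangeanic’s stack:

# Sketch: a vector index over TM sources, so semantic neighbors can play
# the role of fuzzy matches in a RAG-based post-editing prompt.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

tm_sources = ["Click the Save button.", "The device must be switched off."]
tm_targets = ["Haz clic en el botón Guardar.", "El dispositivo debe estar apagado."]

vecs = encoder.encode(tm_sources, normalize_embeddings=True)
index = faiss.IndexFlatIP(vecs.shape[1])  # inner product on unit vectors = cosine
index.add(vecs)

def retrieve(segment: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k closest TM pairs to feed into a post-editing prompt."""
    q = encoder.encode([segment], normalize_embeddings=True)
    _, ids = index.search(q, k)
    return [(tm_sources[i], tm_targets[i]) for i in ids[0]]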

Now, the point for LDM systems is that MT providers can run the necessary data management, including keeping a copy of the edits for the system to continually improve (with more parallel data, not necessarily just TMX files). So unless companies offering LDM or TMSs move into the MT or automatic PE arena, there is little added value to pure management.

Quality evaluation has always been a delicate discipline in the MT space. Automated quality metrics are primarily intended to measure the impact of model training, whereas human quality evaluation is time-consuming and not entirely unbiased either. Will MT quality evaluation be redesigned by GenAI?

Bureau Works: I think MT quality evaluation has already been redesigned. Semantic verification is a lot more powerful than MT quality estimation (MTQE) percentage scores. Not only can potential errors be flagged, but many of them can also be remediated prior to translator inspection. Our entire paradigm of quality has long been due for an overhaul, and I think we’ll be able to connect quality to content performance as opposed to purely technical prowess or how it is perceived by a small number of technical people.

Translated: For certain, one thing that is not going to change is the ultimate need for human validation of model output. We’re already seeing the next wave of LLM development being powered by very high-quality human annotations for supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). There’s only so far we can go by allowing one model to assess the output of another. That said, there will be an increasing role for automation with GenAI to assist and enhance the process.

Quality evaluation metrics provide a quality assessment of multiple versions of an MT system that may be used by the system developers to better understand the impact of changing development strategies. Commonly used evaluation metrics include BLEU, COMET, TER, and ChrF. They all use a human reference test set to calculate a quality score of each MT system’s performance and are well understood by the developer.
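For reference, the string-based metrics mentioned here take only a few lines with the open-source sacrebleu library (COMET, being model-based, needs its own toolkit):

# Scoring an MT system against a human reference test set with sacrebleu.
import sacrebleu

hypotheses = ["The cat sat on the mat."]
references = [["The cat sat on the mat."]]  # one inner list per reference set

print(sacrebleu.corpus_bleu(hypotheses, references).score)  # BLEU
print(sacrebleu.corpus_chrf(hypotheses, references).score)  # ChrF
print(sacrebleu.corpus_ter(hypotheses, references).score)   # TER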

On the other hand, quality estimation (QE) or MTQE scores are quality assessments made by a model without using reference translations or actively requiring humans in the loop. It is, in a sense, an assessment of quality made by a model itself on how good or bad a machine-translated output segment is. MTQE can serve as a valuable tool for risk management in high-volume translation scenarios where human intervention is limited or impractical due to the volume of translations or speed of delivery. LLMs have the potential to play a larger role with QE, going beyond a simple rating of quality to provide richer, more actionable data.

LLMs have a massive database of reference text to determine whether a sentence or text string is linguistically correct (at least for the high-resource languages). LLMs can be trained to identify translation error types and thus could be useful to perform automated QE of machine output and increase efficiency in high-volume translation scenarios. GenAI can provide useful assistance in rapid error-detection and error-correction scenarios. Also, the worst quality in a large data set — such as 5,000 out of 1 million sentences — can be extracted and cleaned up with focused human efforts to improve the overall quality of the corpus.
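The triage described here, pulling the lowest-confidence slice of a large corpus for focused human cleanup, reduces to a sort once each segment carries a QE score. A sketch, assuming scores from any QE model:

def worst_segments(corpus: list[dict], n: int = 5000) -> list[dict]:
    """Route the n lowest-confidence segments to human review.

    Each item is assumed to look like:
    {"src": ..., "mt": ..., "qe": score between 0 and 1 from any QE model}
    """
    return sorted(corpus, key=lambda seg: seg["qe"])[:n]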

LILT: Yes. It is likely that, in the future, QE models will be trained for specific domains — as has been done with MT systems — and there will still be a human in the loop to train the system and verify and audit system output.

Lionbridge: GenAI has the potential to revolutionize MTQE. Sophisticated prompting techniques will enable us to move beyond just measuring accuracy, offering a more nuanced assessment that considers factors like fluency and target audience resonance. However, GenAI’s true potential lies in its ability to automate translation that doesn’t just create replicas, but generates fluid, targeted content that truly connects with users.

Traditionally, language quality evaluation has been dominated by metrics focused on accuracy, fidelity to the source text, adherence to grammatical rules, and consistency. These are all important, but they fail to capture the emotional impact or user experience. Now with GenAI, we can create impactful content in a way that was never possible before and improve how language connects with users at both intellectual and emotional levels. So, while the core aspects of quality remain vital, the exciting part is that the conversation can now expand to include the impact and emotional value of content, going beyond just mirroring the source text. To me, this is a game-changer.

memoQ: Quality evaluation will stay relevant, I think. Domain adaptation through prompting still needs a systematic approach. But on the other side, there will be new opportunities for AI quality estimation (AIQE), like ModelFront or TAUS are offering. Either LLMs will be able to do the job as well, or the higher predictability of LLM-based translations will make it possible to identify outliers with even simpler models, making AIQE more affordable. All in all, LLMs are developing fast because of funding our industry would never be able to provide. And we will see more surprises for sure in the future. Still, I believe that, though these innovations will continue to have a strong impact on the localization industry, they will not disrupt it.

Pangeanic: Yes, completely. The BLEU score was never that good or accurate, but at least it provided a kind of measure for system improvement. You could fool BLEU completely, and it could report only modest improvement percentages even when human evaluators appreciated better fluency. Sometimes BLEU would even penalize that fluency!

The trouble with current evaluation systems is that we all want to use GenAI in translation because of the way it handles context (that is the added-value proposition of language service providers and translators), and yet we are measuring quality by the segment or the number of edits. But in a context-aware translation world, edits can come from “humanization” of the context — from making it truly relevant or more current to an audience. And that is being penalized by current systems. MTQE offers some advantages because the system itself reports how confident it is in the output it has produced, which is fine. We need some sort of confidence score, at least to route the lowest-confidence segments to humans. But we are pushing for document-level (or at least chapter-level) translation, which makes use of the full attention window/context an LLM can offer (say 32k tokens, or double that in some systems). If we take 32,000 tokens, we are dealing with some 10 to 15 pages. We definitely need new metrics and to leave the segment-level mentality behind.
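As a sketch of that document-level mindset, the snippet below packs paragraphs into the largest chunks that fit an assumed 32k-token window so a model can translate with chapter-level context. The tiktoken tokenizer is an illustrative stand-in for whatever tokenizer the target model uses:

# Sketch: pack paragraphs into chapter-sized chunks under a token budget,
# so an LLM can translate with document-level context.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in for the model's tokenizer

def chunk_document(paragraphs: list[str], budget: int = 32_000) -> list[str]:
    chunks, current, used = [], [], 0
    for para in paragraphs:
        n = len(enc.encode(para))
        if current and used + n > budget:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks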

Jourik Ciesielski is the cofounder of C-Jay International, chief technology officer at Yamagata Europe, and a Nimdzi Insights consultant and researcher.
