The Great Gap:
Will MT ever be on par with human translators?
By Andrew Warner
Marco Trombetti has a challenge for Elon Musk.
What began as a sort of office-wide joke has become a mission to “allow everyone to understand and be understood in their own language before Elon goes to Mars.”
“We’re pretty sure that Elon is working on the wrong problem,” said Trombetti, the co-founder and CEO of the Rome-based language service provider (LSP), Translated. “Elon thinks that the biggest step forward for humanity is making life interplanetary. We believe that understanding [one another] — on this planet — is more important.”
As Trombetti puts it, language has been a critical part of human evolution. It’s how we’re able to not only understand one another but also cooperate with one another. Solving the language challenge — that is, allowing people to seamlessly communicate with one another regardless of the language or languages that they speak — opens up a whole new set of possibilities. Mutual understanding and cooperation, he says, are the tools necessary to solve other challenges like the one Musk has devoted himself to.
“Allowing everyone to understand and be understood is the problem to solve,” Trombetti said. “And it’s more important than going to Mars because understanding is the tool. If we solve understanding, then we can allow people to cooperate at a global level, and then we can be a multi-planetary species. Elon is essentially forgetting the tool.”
And it looks like Trombetti and co. might be taking the lead here, especially now that Musk’s priorities lie elsewhere, what with his frenzied, $44 billion acquisition of Twitter last year. While Musk was fumbling around with various overhauls to the social media platform’s content moderation policies, verification systems, and staffing, the Translated team was tracking data on our progress toward human parity in MT and what it could mean for human translators’ work.
While presenting this data at the September 2022 conference of the Association for Machine Translation in the Americas, Trombetti noted that we are currently closing the gap between human translators and MT. About three months later, Translated released an in-depth report expanding on the streamlined information that Trombetti presented at the conference.
Looking at data from the last decade or so, the report — entitled This Is the Speed at Which We Are Approaching Singularity in AI — suggests that within the next decade or so, the quality of MT output will be equivalent to that of the most highly skilled human translators, at least for those working in the so-called tier-one languages like English, Spanish, and Chinese.
The folks at Translated reassure us that human linguists will always be critical to the industry, even going as far as claiming that when the gap does close, human translators will be able to earn more than they do now — not less. Still, many in the industry remain skeptical of claims that human parity is on the horizon, let alone claims that it will be a boon for human translators.
For better or worse, if MT is indeed on track to reach human levels of quality, and not just human levels but skilled human levels, then the language services industry may be in store for a whole new era. But what will that era look like? Human parity in MT and the technological singularity are hotly debated topics that have preoccupied linguists and computer scientists alike for decades. MultiLingual spoke with a handful of language industry experts to gain insight into how close we actually are to reaching human parity in MT and how it could affect the industry.
Discussions on topics like the singularity and human parity tend to be a bit tricky, in part because their definitions can vary depending on who you’re talking to. The word “singularity” in particular is a fairly loaded term that often comes up within the context of science-fiction dystopias wherein tyrannical robot overlords control the planet. Indeed, one of the more prominent definitions of the term, proposed by the famous computer scientist Ray Kurzweil, identifies it as the point at which progress in AI snowballs out of control in a way that radically and unpredictably changes humanity’s relationship with technology.
Translated’s report takes a more grounded approach to defining the singularity, narrowing the scope to the point at which MT systems are capable of regularly producing a perfect translation, that is, a translation that doesn’t require any editing. To distinguish it from other commonly held definitions of the singularity, Trombetti calls this the language singularity. And the future Translated envisions when we reach that point is quite far from the dystopian scenario described by others who talk about technological singularity.
In this sense, the folks at Translated use the term to refer to something a bit closer to what others might simply call human parity. However, the definition of human parity is, like the definition of “singularity,” a bit hazy. In the most general terms, it refers to a level of quality in artificial intelligence (AI) systems that matches that of human translators.
Historically, the two main conceptual pillars of human parity in MT have been fluency and adequacy. Fluency emphasizes grammatical correctness — for a translation to achieve fluency, it has to use proper spelling and sentence structure. Adequacy, on the other hand, is more about making sure that the target text conveys the same meaning as the source text, without omitting or mistranslating anything.
In a 2018 study where they claimed to have achieved human parity for MT between Chinese and English, a team of Microsoft researchers narrowed down the conversation using the following definition:
“If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations then the machine has achieved human parity.”
Human parity doesn’t necessarily mean a given translation will be completely free of any errors — it just means that its quality will be indistinguishable from one that a human translator might produce. Of course, human translators are not infallible, so the Microsoft researchers argued that we have to allow the same level of leeway for MT systems as well.
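To make that definition concrete, here is a minimal sketch of how one might check for a "statistically significant difference" between two sets of human quality scores. It is an illustration only: the function name, the paired sign-flip permutation test, and the 0-to-100 scoring scale are assumptions made for this example, not the procedure the Microsoft team used.

```python
import random

def paired_permutation_test(machine_scores, human_scores,
                            n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired quality scores.

    Each position holds the rater score for the machine translation and
    for the human translation of the same source segment (hypothetical
    setup). Returns an approximate p-value for the null hypothesis that
    the mean difference between the two is zero.
    """
    if len(machine_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    if not machine_scores:
        raise ValueError("score lists must not be empty")

    rng = random.Random(seed)
    diffs = [m - h for m, h in zip(machine_scores, human_scores)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)

    # Under the null hypothesis the sign of each paired difference is
    # arbitrary, so flip signs at random and count how often the mean
    # difference is at least as large as the one actually observed.
    extreme = 0
    for _ in range(n_resamples):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / n) >= observed:
            extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)
```

Under the quoted definition, a p-value above the chosen significance threshold (0.05 is a common convention) would count as "no statistically significant difference." Bear in mind that failing to find a difference is weaker evidence than demonstrating equivalence, which is one reason such claims draw scrutiny.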
The term also comes up in other areas of AI, like speech recognition and generative AI. Around the same time that Translated released their report on our speed toward the singularity in MT, OpenAI released ChatGPT, a chatbot built on top of the GPT-3.5 large language model. Users of ChatGPT can prompt the bot to write all sorts of content and it will generate a text that’s, at least on the surface level, indistinguishable from that produced by a human being (albeit, a human being with a fairly stunted and static writing style, but a human nonetheless).
With the advent of high-quality tools like ChatGPT, Olga Beregovaya says we may need to rethink the way we discuss and define human parity. Although she says we’re very close to achieving it, there may be other dimensions along which we need to evaluate it. While today’s state-of-the-art MT produces highly adequate and fluent texts, it still falls flat in areas that may not fit squarely into issues of adequacy and fluency, such as more structured content or transcreated copy.
“One thing that we need to address as an industry is, what do we understand when we mean human parity?” Beregovaya, the vice president of AI and MT at Smartling, said. “The definition of human parity really needs to be revisited.”
Regardless of how we define it though, innovators in the industry have long been concerned with this question of if and when we’ll achieve human parity. Early attempts to translate human language with machines were a bit shaky, but researchers remained optimistic that machines would one day be capable of carrying out translation tasks just as well as any human could.
For instance, the researchers behind the famed Georgetown experiment — in which linguists at Georgetown University and IBM collaborated on the first publicly shown MT program in human history back in the early 1950s — claimed in 1954 that they needed, at most, five years to perfect MT from romanized Russian into English. We now know, of course, that this was one of the great underestimations of the Cold War era.
Fast forward to today, nearly 70 years after those researchers demonstrated their MT system. As the underlying mechanisms used to produce MT output have evolved from rule-based to statistical to neural, we’ve made a lot of progress. Neural MT, today’s state-of-the-art approach to MT, produces highly readable and fluent output, but Dr. Christopher Kurz, head of translation management at ENERCON in Aurich, Germany, says this high level of readability can often be mistaken for linguistic perfection.
“Just because they’re readable, does not mean that they are right,” he said.
In 2018, when the researchers at Microsoft claimed to have achieved human parity in MT for translations between English and Chinese, many quickly regarded this as an overly optimistic interpretation of MT’s capabilities.
For instance, an independent evaluation conducted by researchers at the University of Zurich found that, while human raters judged human and machine translations of isolated sentences fairly similarly, human translations fared better when it came to evaluating translations of an entire document. In their experiment, the researchers found that human evaluators’ preference for MT output fell from 50% to 37% when looking at an entire document, rather than isolated sentences within the document.
The findings of those researchers align well with Dr. Kurz’s position that MT often falls flat in supralinguistic areas where human translators typically wouldn’t, such as following a client’s style guide. He remains skeptical about claims of human parity, noting that we’re not yet at a point where MT systems can review a style guide and produce a translation that meets all of its stylistic requirements. While Dr. Kurz maintains that MT is a useful tool for human translators, he says we’re still a long way from human parity.
For their part in pinpointing our progress toward human parity and singularity, Translated analyzed records of the time it took for highly skilled professional translators to edit MT output. This metric, time-to-edit (TTE) per word, differs from automated measures of MT quality like BLEU or COMET. Trombetti says TTE is the most reliable method for quantifying the cognitive effort a human translator has to put into reading and, where needed, correcting MT output.
A perfect sentence should have an average TTE per word of one second, accounting for the amount of time it takes to read a word, process that it’s well-suited for the text, and move on to the next one. If — or when — MT programs reach an average TTE of one second per word, Translated argues that we’ll have reached the singularity.
Translated’s report uses TTE data from two billion sentences edited by 126,000 of the most highly performing translators using Matecat, the company’s computer-assisted translation (CAT) tool. The company reports that the average TTE per word in 2014 was well over three seconds, but that number has steadily declined since then, reaching around two seconds by 2022.
And when you plot all that data onto a graph, it reveals a fairly linear pattern — projecting that line a bit further out into the future, it looks like the average TTE per word should cross that one-second threshold sometime around 2028. Silvio Gulizia, the company’s head of content, says this isn’t necessarily an exact prediction — the singularity could come in 2027, or it might not come until 2029.
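The projection described above amounts to fitting a straight line to yearly TTE averages and asking when that line reaches one second per word. The sketch below shows the arithmetic under stated assumptions: the function names are hypothetical, and the sample points are made-up placeholders chosen only to resemble the trend described, not figures from Translated's report.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two (year, tte) points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all points share the same year")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def crossing_year(points, threshold=1.0):
    """Year at which the fitted line reaches `threshold` seconds per word."""
    slope, intercept = fit_line(points)
    if slope >= 0:
        raise ValueError("TTE is not decreasing, so the line never crosses")
    return (threshold - intercept) / slope

# Placeholder (year, average TTE in seconds per word) values, NOT real data.
illustrative_points = [(2014, 3.5), (2018, 2.75), (2022, 2.0)]
print(round(crossing_year(illustrative_points), 1))
```

A straight-line extrapolation like this is sensitive to the starting values and assumes the trend will not flatten out, which helps explain why the company frames its estimate as a range rather than a fixed date.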
This falls in line with many other, earlier predictions that we’ll reach the singularity toward the end of this decade, but Translated claims to be the first to make that prediction with quantifiable data. In the domain of language, Trombetti says progress comes in waves, each of which tends to be limited to certain languages. While the singularity for languages like English and Spanish might come in 2028, it’s much further off for low-resource languages like Nepali or Swahili.
And in fact, some in the industry hold to the notion that we’ve already achieved human parity. Konstantin Dranch, co-founder of the MT implementation and training company Custom.MT, for example, says human parity has already been achieved, but just for those select few tier-one languages.
“Human parity is already here, it is simply not very evenly distributed,” he said. “ChatGPT can write a better essay than many people, MT can make a better translation into Romance languages than a big part of language learners. At the same time, give it Turkish, and it’s not so impressive. Human parity is a wave with fragmentation across languages.”
On the other end of this spectrum is Dr. Kurz, who thinks we’re still quite far off from human parity in MT. Although MT achieves impressive results and is a useful tool for human translators to make their jobs more efficient, even in instances where it does produce high-quality, human-like results, the same model may not consistently produce output of the same quality. And as Beregovaya notes, MT programs struggle with homonyms and they may also hallucinate — that is, generate words or phrases in the output that have no relevance to anything in their input.
Moreover, while means of measuring MT quality like TTE and BLEU give us an approximation of whether or not the MT model is doing a good job, Dr. Kurz also questions why we don’t gauge MT quality using the same standards that we do for a human translator.
“When we talk about human parity, we should not limit ourselves to these standards or methods that are specially designed for machine translation,” he said. “We should apply the same standards as we do for human translation text.” Until MT can think critically about the source text and incorporate a client’s unique specifications, he says it will be difficult to make claims that we’ve achieved human parity.
Human translators who are leery about human parity in MT might take some comfort in Translated’s reassurance that human linguists will be able to generate more earnings as MT quality nears human parity. The logic goes like this: As translators spend less time editing MT output, the company maintains, they will be able to translate more words per day than before, producing more work for roughly the same level of effort. With reductions in TTE, Trombetti says that “one hour of their work does not convert into 500 words; it converts into billions of words.”
Still, Trombetti’s not shy about the fact that human translators’ role will change alongside these developments. In addition to their linguistic knowledge, Trombetti says cultural knowledge and awareness will become more important than ever, as translators will have to work as cultural mediators, ensuring that MT output is culturally appropriate for the target audience.
“By definition, singularity means that the machine will be able to produce language that is better than what a single human can do,” he said. “But what about emotions? What about cultural differences that need to be mediated?”
And Jane Nemcova, the former managing director of Lionbridge AI and an adjunct professor of AI at the Middlebury Institute of International Studies at Monterey, also sees new, more varied opportunities opening up for human linguists. Historically, she says students in linguistics programs have often felt corralled into one of a handful of career paths, most commonly translation and interpreting, or academia and education.
With improvements in MT, she, like Trombetti, envisions a future where translators prioritize cultural mediation over linguistic concerns. Sensitivity readers who are deeply familiar with the source and target cultures will also likely take on a more significant role, she says. AI is known to replicate unsavory human biases (Google, for example, has had to address concerns of gender bias and sexism on Google Translate in recent years), and it will be up to humans to review MT output and ensure that texts are scrubbed clean of any prejudiced or offensive phrasing.
Even if human parity is achieved in our lifetimes, Beregovaya suspects that more highly regulated, high-stakes industries — think pharmaceuticals and other areas where human lives might be at stake — will be slower to rely so heavily on MT. “Specialized translation and translations that can potentially carry a lot of liability will still, to a great extent, require human review and validation,” she said. Similarly, Dranch believes that translators “will provide accountability and a quality program around MT, either by reviewing every sentence, and/or evaluating the general level of fitness for purpose.”
The bottom line? Human linguists aren’t going anywhere — their role is simply evolving alongside MT.
“Your ability to understand more than one language, to understand the nuance in meaning, and the way that the brain uses language as a reflection of thought is one of the most valuable skills that anybody can have,” Nemcova said.