Interpreting and Technology
The Revolution at our Doorstep
By Barry Olsen
Where we’ve been. Where we are. And where we may be headed.
The last decade has been one of accelerating change and disruption for interpreting. Providing spoken language services has always been a specialized niche of the overall language services market. Even more sub-specializations are based on where interpreters work, like hospitals and courtrooms, or how they work, be it on site, over the phone, or in simultaneous or consecutive mode. Interpreting is a complex network of requirements and specializations that must be carefully navigated. Over the years, the complexity of providing this highly specialized and highly personal human service at the right place at the right time has made professional interpreting difficult to scale to meet the growing demand for multilingual communication. This complexity has also made interpreting a prime target for increasing efficiency through the use of technology.
In 2012, I published a simple framework for categorizing technologies affecting language interpretation by grouping them into three categories:
- Technologies for the delivery of interpreting services
- Technologies that augment an interpreter’s performance
- Technologies designed to replace human interpreters altogether
This categorization has proven quite handy to quickly qualify technologies and their possible impacts on the interpreting market. Let’s take a look at how the technologies in each of these categories have evolved and then consider where they may be headed in the coming months and years.
Technologies for the delivery of interpreting services
The technologies for delivering interpreting grew slowly and steadily over the last decade until their use exploded during the pandemic. All eyes were on fast-developing technologies for multipoint video conferencing and multi-channel audio, all delivered over the internet. At the onset of the pandemic, some over-the-phone (OPI) and video remote (VRI) interpreting providers were caught flat-footed as their existing technological infrastructure was optimized for connections between two endpoints and not set up to easily provide multipoint audio and video. They had to scramble to upgrade their infrastructure to accommodate the shift and growth in demand.
Then, there was remote simultaneous interpreting (RSI).
Multiple tech startups began developing RSI platforms shortly after standards-based in-browser data exchange technologies like WebRTC emerged in the early 2010s. These new standards allowed creative startups to build customized platforms for VRI in healthcare and RSI for online multilingual meetings. These platforms, together with mainstream web conferencing platforms that quickly adapted to add multi-channel audio for simultaneous interpretation, kept multilingual communication going during the pandemic, with everyone from hospitals and clinics to courtrooms and international organizations meeting online to ensure the continuity of their operations.
The abrupt and almost complete shift from on-site to online interpretation in 2020 was a wild ride for interpreters. With breakneck speed, people and organizations had to learn to communicate across languages online. Interpreters adapted quickly to this new way of working, but there were technical bumps along the road, many of which persist today. The most prominent challenge has been audio quality. You can’t interpret what you can’t hear. No one considered that people would be so reluctant to put on headphones to avoid echo and feedback, that built-in omnidirectional microphones on laptop computers would be of such poor quality, or that people would think it was acceptable to connect to important online meetings from their car or the corner café. But here we are.
The pandemic-induced spike in demand led to a brief but notable interest in remote interpreting from technology investors. The four largest remote interpreting startups garnered more than $66 million in a mixture of private equity and venture capital investment. It’s another clear example of how interpreting and technologies that expand access to it have captured mainstream attention.
Post-pandemic, remote interpreting has held on to a larger share of the overall interpreting market than it had before, even though its prevalence waned along with the pandemic. According to data from Nimdzi’s Interpreting Index, the cumulative market share of remote interpreting (OPI, VRI, and RSI) went from 20% in 2021 to 95% at the height of the pandemic to 49% in 2023. Nimdzi also reported that a new category overlapping both on-site and online meetings, known as hybrid meetings, where speakers, attendees, and interpreters may participate either on site or remotely, now accounts for 12% of the overall market.
Remote interpreting technologies are now an integral part of the overall service offering. Organizations in both the public and private sectors now have a better understanding of the value multilingualism offers their online interactions and are using remote interpreting to further their missions and business goals. Remote interpreting has also expanded and flattened what was largely a geographically limited talent pool. Interpreters based in Bogotá or Buenos Aires can now easily interpret for a client in Boston or Boise. And access to interpreting talent that was previously geographically limited, say a highly trained Turkish-English or Hungarian-English interpreter, is now available just about anywhere through remote interpreting.
Remote interpreting catapulted into the mainstream because of the pandemic and has held onto a significant share of an expanded market. It is changing staffing and pricing models as well. How will remote interpreting be affected by speech-to-text and speech-to-speech translation? Read on!
Technologies that augment an interpreter’s performance
For years, excitement has abounded among tech-friendly interpreters about technology helping us improve how we work. The idea of custom-built glossary management software, receiving and accepting customized job offers on smart devices, and later the possibility of proper names, terminology, and numbers appearing on screen in real time thanks to advances in speech recognition technologies created significant buzz. However, with a relatively small target market and limited revenue opportunity, known as the total addressable market (TAM), the development of interpreter-specific tools has been spotty at best. The situation is lamentable because interpreters could benefit greatly from tools designed specifically with their requirements in mind.
While some specialized tools exist, largely for glossary building and management, most interpreters find ways to adapt mainstream technologies to support their linguistic work. For example, many interpreters are now using consumer-grade machine translation (MT) services before and during interpreting assignments to produce target-language scripts for interpreting pre-written speeches that are often delivered at very high rates of speed. However, they must carefully consider the confidentiality of information they are entrusted with when using these services and ensure that it is not retained by the service to further improve and train the MT system.
The upshot here is that given the relatively small TAM for interpreter-specific tools purchased by interpreters, this category is likely to remain a niche with limited activity and innovation until these tools offer a measurable improvement in the quality and efficiency of the service. When that happens, language service companies will have a clear incentive to develop and support such tools, particularly in the OPI and VRI space.
Technologies designed to replace human interpreters altogether
Coupling speech recognition with machine translation and speech synthesis technologies to produce multilingual captions or speech-to-speech translation is not a novel idea. Many companies focused on this type of service over a decade ago but with very limited success. What has changed is the speed and accuracy of the component technologies.
In the 2010s, this category was still largely in the realm of science fiction. Today, it is the most dynamic, and worrisome, of the three. As speech recognition and natural language processing technologies have improved radically over the last two years, technologies to replace human interpreters have gone from being an object of derision to the proverbial elephant in the room that can no longer be ignored.
In a nutshell, here’s what’s changed: Automatic speech recognition (ASR) accuracy, although still imperfect, has improved massively in many languages and accents. Accuracy is most often measured by the ratio of errors to the total number of words spoken, known as the word error rate (WER). These rates now hover around 5%, with some providers touting WERs below 4%. For context, the average human transcriber has a WER of 4%, meaning that ASR is now roughly 95% accurate, compared to the 96% accuracy of a human transcriber.
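For readers curious how a WER figure is actually derived, here is a minimal sketch. It counts the substitutions, insertions, and deletions needed to turn an ASR transcript into the reference transcript (via word-level edit distance) and divides by the reference length. The function name and example sentences are my own; production benchmarks also normalize punctuation and casing first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word out of twenty in a transcript yields a WER of 1/20 = 5%,
# the ballpark figure the article cites for current ASR systems.
```

Note that WER can exceed 100% if the system inserts many spurious words, which is one reason a single headline percentage never tells the whole accuracy story.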
Next, statistical machine translation (SMT) gave way to neural machine translation (NMT) in 2016, when Google released its NMT system. This shift dramatically reduced the amount of post-editing required to make a translation usable. And just last year, OpenAI publicly launched ChatGPT, the first of many now publicly available large language models (LLMs), which have mostly eliminated the ungrammatical and stylistically poor texts frequently generated by SMT. But LLMs come with their own unique problems, like making information up (“hallucinations”) or omitting content. Even so, LLMs appear to be the next stage in this technology’s advancement, as big tech companies like Meta, Google, and Tencent have recently released models specially designed to provide speech-to-text and speech-to-speech translation.
Lastly, text-to-speech (TTS) technologies have improved notably. The mechanical, monotonous voice so characteristic of computers since the 1980s has given way to natural, very human-sounding voices. Current speech-to-speech translation offerings let users choose a male or female voice and even a specific accent of a given language. But wait, there’s more. Multiple AI companies now offer voice cloning, which opens the possibility of cloning a speaker’s own voice and using it to provide spoken translation in multiple languages. Put all these advances together, and you have the current state of the art of speech-to-speech translation, which can now arguably produce accurate and fairly natural-sounding renditions for certain language combinations and types of communication.
The space continues to develop and change with tremendous speed. Is it exciting? Yes, definitely. Is it concerning, maybe even a little scary? I think it is, and not just for individual interpreters worried about their livelihoods. What will the availability of speech-to-speech translation that is “good enough” mean for end users? It raises the question of where “good enough” is good enough: it may be very positive for some kinds of interactions, where the risk is low if 5% of the content isn’t fully accurate, but catastrophic in high-stakes medical, legal, diplomatic, or military situations.
Before and during the pandemic, interpreters and interpreting companies focused primarily on the technologies for delivering interpreting services. But just as many meetings were returning to on-site venues and the number of hybrid multilingual meetings was on the rise, these much-improved offerings of speech-to-text and speech-to-speech translation began to catch the attention of early adopters. Now, the focus is rapidly shifting to how and when, not if, AI will have a significant effect on the interpreting market.
Remote interpreting platforms are now at a crossroads. They are uniquely positioned to offer multilingual captioning and speech-to-speech translation as users of their software are already in front of a screen and can turn the service on with the click or tap of a button. Mainstream video conferencing services also offer multilingual captioning, with other tech companies providing apps or plugins to offer speech-to-speech translation on most popular video conferencing services. Indeed, at least two of the leading RSI technology companies have rolled out speech-to-text and speech-to-speech translation offerings to complement their existing services.
Speech-to-text and speech-to-speech translation are still in their infancy. These technologies appear to be good enough for something, but the main question is: good enough for what? And many other questions still need to be answered about the use of AI for multilingual communication. Are end users comfortable reading captions or listening to much-improved but still synthetic voices? Who is liable for errors when AI is used in high-stakes multilingual interactions? Will listeners prefer to hear the cloned voice of the original speaker rather than that of a human interpreter? The answer to these and many similar questions is: “We just don’t know yet.”
Where do we go from here?
You’ll notice that I have never used the term machine interpretation and for good reason. AI cannot interpret yet. It provides a machine translation of the words. It does not process emotion. It cannot understand context. It still struggles with homophones, homographs, and homonyms. And industry-specific acronyms and jargon continue to be problematic. Perhaps most important of all, AI cannot identify when it makes a mistake or when something that it produces is nonsensical, and then correct it in real time, which human interpreters regularly do.
These limitations have tremendous ramifications for using AI in multilingual communication, given that professional interpreting is provided in many high-stakes interactions where decisions affecting the health, rights, and well-being of millions of people are made. AI should be seen as an opportunity to expand access to cross-language communication, but it is a tool that must be used properly. This is why it should be the task of the stakeholders in these myriad interactions to determine how it can best be used beneficially, when it shouldn’t be used, and what safeguards should be in place when it is.
This is the goal of the Interpreting SAFE-AI Task Force, an effort launched recently by a group of language industry veterans to build a unified, authoritative voice on how AI can be used in cross-language communication. It will comprise a cross-section of all stakeholders, including technology developers, end users, interpreters, and language service companies. The task force is ramping up now and plans to prepare an initial set of guidelines by the end of the year. Given the speed of change in the AI space, these guidelines will likely be a living document that will be updated as technology advances.
What is happening in the language services space with AI is not taking place in a vacuum. The adoption of AI-powered technologies is affecting all of society. The private sector is scrambling to determine how AI will change the way it does business. Governments are trying to determine how AI needs to be regulated. Writers and actors are going on strike because they want assurances that they will not be replaced by AI. And on, and on, and on.
We are truly at the beginning of a new reality that multilingual communication has never seen before. Interpreting is a relatively young profession that had, up until now, seen very little change. It is now facing a revolution. AI-powered tools will play greater roles in our lives and language services, potentially expanding access to multilingual communication to billions of people. And that is a good thing, especially if its adoption can be guided to minimize risk and harm. For those of us making a living by helping the world communicate, it’s a scary but also welcome development.