Automatic interpretation for health care

Following decades of anticipation, automatic spoken language translation (SLT) has finally emerged from the lab and entered widespread use. The Google Translate application, for instance, can bridge dozens of languages in face-to-face conversations, switching languages automatically to enable hands-free use. Skype Translator, powered by Microsoft's speech translation software, enables translated video chat in a half-dozen languages, with sophisticated measures for cleaning up the stutters, errors and repetitions of spontaneous speech. And several smaller companies (SpeechTrans, ili, Lexifone) offer wearable or phone-based SLT, seeking niches in anticipation of expanding demand.

Still very rare, however, are systems directed at serious use cases such as health care, business-to-business, business-to-customer, emergency response, law enforcement, and military and intelligence use, in which reliability is essential: not only measurable accuracy per se, but also user confidence.

From its inception, our company decided to address the health care market, since the demand was most evident there. San Francisco General Hospital, for example, receives more than 3,500 requests for interpretation per month, or 42,000 per year for 35 different languages. Requests for medical interpretation services are distributed among many wards and clinics. In view of the evident difficulties — mistakes could indeed be serious — the decision was made to emphasize facilities for verification and correction. Likewise, recognizing the need to address various use cases within health care, tools for rapid customization were emphasized as well.

Following a long gestation, the resulting system was pilot tested in 2011 at the San Francisco Medical Center of Kaiser Permanente, the largest health care organization in the United States. An independent evaluation was carried out at the conclusion of the test.

System description

Interactive automatic interpretation allows users to monitor and correct the automatic speech recognition (ASR) system to ensure that the text passed to the machine translation component is correct. Speech, typing or handwriting can be used to repair speech recognition errors.

Next, during the machine translation (MT) stage, users can monitor, and if necessary correct, one especially important aspect of the translation — lexical disambiguation.

The system’s approach to lexical disambiguation is twofold: first, it supplies a back-translation. Using this paraphrase of the initial input, even a monolingual user can make an initial judgment concerning the quality of the preliminary machine translation output. Other systems such as IBM’s MASTOR have also employed back-translation.

In addition, if uncertainty remains about the correctness of a given word sense, the system supplies a proprietary set of Meaning Cues — synonyms, definitions and so on — which have been drawn from various resources, collated in a database and aligned with the respective lexica of the relevant MT systems. With these cues as guides, the user can monitor the current, proposed meaning and when necessary select a different, preferred meaning from among those available. Automatic updates of translation and back-translation then follow.
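
The mechanism described above can be sketched in miniature. The tiny lexicon and helper functions below are illustrative stand-ins, not the system's actual Meaning Cues database or the Word Magic MT interface: each ambiguous word maps to candidate senses, each sense carries its cues and an aligned target-language entry, and the back-translation lets a monolingual user spot a wrong sense and pick another.

```python
# Minimal sketch of lexical disambiguation via back-translation and
# Meaning Cues. All data and function names here are hypothetical.

CUES = {
    # word -> list of senses; each sense pairs Meaning Cues with the
    # aligned entry in the Spanish MT lexicon.
    "cool": [
        {"cues": "cold, chilly; low in temperature", "es": "frío"},
        {"cues": "great, fun, tremendous", "es": "estupendo"},
    ],
}

# Stand-in reverse lexicon used to generate the back-translation.
ES_TO_EN = {"frío": "cold", "estupendo": "awesome"}

def translate(word, sense_index=0):
    """Translate one ambiguous word using the currently selected sense."""
    return CUES[word][sense_index]["es"]

def back_translate(es_word):
    """Paraphrase the translation back into English for monolingual checking."""
    return ES_TO_EN[es_word]

# First pass: the default sense back-translates as "cold", so the user
# consults the cues and selects the "great, fun, tremendous" sense instead.
first = translate("cool")                 # "frío"
check = back_translate(first)             # "cold" -- user spots the error
fixed = translate("cool", sense_index=1)  # "estupendo"
```

In the real system the retranslation and back-translation update automatically once a new sense is chosen; here that corresponds to simply calling `translate` again with the selected index.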

The initial purpose of these techniques is to increase reliability during real-time speech translation sessions. Equally significantly, they can also enable monolingual users to supply feedback for offline machine learning to improve the system. Until now, only users with some knowledge of the output language have been able to supply such feedback in systems such as Google Translate.
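
One way such monolingual feedback could be captured for offline learning is to log each sense-selection event. The record format below is an assumption for illustration, not the system's actual logging schema.

```python
# Hypothetical sketch: accumulate monolingual users' sense corrections
# so an offline training job can consume them later.
import json

feedback_log = []

def record_sense_feedback(source_text, ambiguous_word,
                          rejected_sense, selected_sense):
    """Append one correction event to the feedback log."""
    feedback_log.append({
        "source": source_text,
        "word": ambiguous_word,
        "rejected": rejected_sense,
        "selected": selected_sense,
    })

record_sense_feedback("This is a cool program.", "cool",
                      "low in temperature", "great, fun, tremendous")

# Serialize a record, e.g. for an offline training batch.
line = json.dumps(feedback_log[0], ensure_ascii=False)
```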

The system adopts rather than creates its speech and translation components. Nuance, Inc., supplies speech recognition; rule-based English↔Spanish MT is supplied by Word Magic of Costa Rica; and text-to-speech is again provided by Nuance.

To facilitate customization for multiple use cases, the system includes prepackaged translations, providing a kind of translation memory. When these are used, reverification of a given utterance is unnecessary, since the utterances have been pre-translated by professionals (another option would be to verify them using the system’s feedback and correction tools).
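
The routing logic behind such a translation memory can be sketched as follows. The phrases and the `machine_translate` stand-in are illustrative assumptions, not the product's actual data or API: a memory hit skips verification because a professional already translated it, while a miss goes through MT and the verification tools.

```python
# Hypothetical translation-memory lookup: pre-translated utterances bypass
# the verification step; everything else is machine translated and flagged.

TRANSLATION_MEMORY = {
    "please take a seat": "por favor, tome asiento",
    "do you have any allergies?": "¿tiene alguna alergia?",
}

def machine_translate(text):
    # Stand-in for the rule-based MT component.
    return "<MT: " + text + ">"

def translate_utterance(text):
    """Return (translation, needs_verification)."""
    key = text.strip().lower()
    if key in TRANSLATION_MEMORY:
        # Professionally pre-translated: no reverification needed.
        return TRANSLATION_MEMORY[key], False
    # Fresh machine translation: route through verification and correction.
    return machine_translate(text), True
```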

Identical facilities are available for Spanish as for English speakers: when the Spanish flag is clicked, all interface elements — buttons and menus, onscreen messages, handwriting recognition — change to Spanish.


In 2011, Version 3.0 of this system was pilot tested at the Medical Center of Kaiser Permanente in San Francisco. The project ran for nine calendar months, with use in three departments — pharmacy, in-patient nursing and eye care — during three of those months. At the conclusion, 61 interviews were conducted by an interpreter from an outside agency. A formal internal report gave the results. Reception was generally positive, but responsibility for next steps has remained divided since that time, and there has been no further use to date.

We rely on Kaiser’s internal report, based on a commissioned survey by an independent third party. The report itself is proprietary, but we’ll reproduce its findings in essence.

First, though, we’ll make several preliminary points concerning stumbling blocks for the pilot project. All of these impediments have by now been removed as a result of the striking infrastructure advances over the years since the pilot concluded.

Version 3.0 was designed to cooperate with the then-current Dragon NaturallySpeaking, to be installed separately, and thus required speaker-dependent speech recognition: each speaker had to register his or her voice. This process took two or three minutes, including a 30-second speech sample; and, while this interruption was no great burden for English-speaking staff members, it made speech recognition impractical on the Spanish-speaking patients’ side.

Microsoft’s handwriting recognition was integrated into the system for both languages; but correction of errors was tricky at the time, so that this addition, too, incurred a training cost.

One more speed bump resulted from a software feature intended for customization: patients and staff could be registered in the system, so that their names could appear in transcripts, and so that various personalization features could be added later. However, registration of the login user was required rather than optional, and this process necessitated still more training time.

Taken together, these obstacles necessitated 45-minute training sessions for participating staff members.

Further, because the experiments predated the era of modern tablets, portability was inferior to that available now, and physical setup was much less convenient. On the first-generation tablets used, for instance, it was necessary to manually configure the physical buttons that turned the microphone on and off.

With these initial obstacles in mind, we can now summarize the results of the organization’s evaluation. The report cited:

• High praise for the “idea,” higher than for the actual experience of using it.

• Translation quality was rated “good enough” by members/patients.

• Limited-English speakers would still use the system to verify the conversation and ensure completeness.

• Issues of literacy and computer literacy affect applicability.

• Even though the system had issues (a low-to-fair graphical user interface, slow processing, failures of voice recognition), members thought it was “cool.”

• Most people, especially those who lacked English skills, preferred an in-person interpreter, although one person noted that waiting for a live interpreter wastes time, and a provider commented that the system saved the wait for Language Line.

• It was hard for members to use the tablet in the hospital.

• A number of patients declined to use the system in the hospital, but data is lacking as to why.

Patient responses to six significant questions are tabulated in Figure 1. The rightmost column shows the percentage of respondents who replied to each question with Completely or Mostly.

After the pilot

The pilot raised some issues to be addressed. First and foremost, there was a glaring need to facilitate speech input from the Spanish side. This goal implied implementation of speaker-independent speech recognition; and this has been carried out by exploiting advances in Dragon NaturallySpeaking. Auxiliary third-party software was also required to enable adaptation of Dragon software for use on desktop and tablet computers.

The need was also obvious for reduction in set-up and training time. Subsequent improvements reduced the total warm-up to a few minutes for both staff and patients. For example, Microsoft handwriting recognition has improved to the point that its correction facilities can be learned independently.

Another clear need has been to speed up the interactions. While numerous staff members praised the ability to verify translations, others stressed that verification consumed their limited time. To balance these competing wishes, we implemented a new set of icons allowing quick switching between Pre-Check and No-Pre-Check modes. In the latter mode, useful when speed is more important than accuracy, speech recognition and translation are not checked in advance of transmission; but post-verification is still enabled, since back-translations are still generated and now appear in the bilingual transcripts. These new controls operate separately for English and Spanish speakers, so that, for instance, a doctor can precheck when appropriate while allowing the patient to respond without distractions.
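
The per-speaker switching can be sketched as a pair of independent flags per conversation side. The names and pipeline steps below are illustrative assumptions, not the system's actual implementation; note that the back-translation step always runs, which is what keeps post-verification available even in No-Pre-Check mode.

```python
# Hypothetical sketch of per-speaker Pre-Check / No-Pre-Check settings.
from dataclasses import dataclass

@dataclass
class SpeakerSettings:
    precheck_asr: bool = True          # pause for ASR correction
    precheck_translation: bool = True  # pause for translation verification

def process_utterance(text, settings):
    """Return the pipeline steps run for this utterance, in order."""
    steps = ["asr"]
    if settings.precheck_asr:
        steps.append("verify_asr")
    steps.append("translate")
    if settings.precheck_translation:
        steps.append("verify_translation")
    # The back-translation is always generated for the bilingual transcript,
    # so post-verification remains possible even without prechecking.
    steps.append("back_translate_for_transcript")
    return steps

# Example: the doctor prechecks everything; the patient responds freely.
doctor = SpeakerSettings()
patient = SpeakerSettings(precheck_asr=False, precheck_translation=False)
```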

A number of interviewees called for various improvements in the user interface. In response, we supplied large fonts for all on-screen elements (the exact size can be selected); added prominent icons for easier switching between English and Spanish speakers; enabled adjustment of the text-to-speech volume and speed, for easier comprehension; and added a quick way for staff to introduce their system to patients.


We responded to the study by creating options favoring either speed or, alternatively, accuracy. Here we illustrate the use of interactive correction for speech recognition as well as machine translation.

In Figure 2, the user has selected the yellow Earring Icon, specifying that the speech recognition should “proceed with caution.” As a result, spoken input remains in the Input Window until the user explicitly orders translation. Thus there’s an opportunity to make any necessary or desired corrections of the ASR results. In this case, the user has said “This morning, I received an email from my colleague Igor Boguslavsky.” The name, however, has been misrecognized as “Igor bogus Lovsky.” Typed or handwritten correction can fix the mistake, and the Translate button can then be clicked to proceed.

In Figure 3, the Traffic Light Icon has also switched to yellow, indicating that translation (as opposed to speech recognition) should also proceed with caution: it should be prechecked before transmission and pronunciation. This time the user said “This is a cool program.” Since the Earring Icon is still yellow, ASR results were prechecked and approved. Then the Translation Verification Panel appeared, as shown in the figure. At the bottom, we see the preliminary Spanish translation, Éste es un programa frío. Despite the best efforts of the translation program to determine the intended meaning in context, “cool” has been mistranslated — as shown by the back-translation, “This is a cold program.”

To rectify the problem, the user double clicks on the offending word or expression. The Change Meaning Window then appears (Figure 4), with a list of all available meanings for the relevant expression. Here the third meaning for “cool” is “great, fun, tremendous.” When this meaning has been selected, the entire input is retranslated. This time the Spanish translation will be Es un programa estupendo and the translation back into English is “Is an awesome program.” The user may accept this rendering, despite the minor grammatical error, or may decide to try again.

Reliability is indispensable for serious applications like health care, but some time is required to interactively enhance it. An approach like this lets users proceed carefully when accuracy is paramount or a misunderstanding must be resolved, but more quickly when throughput is judged more important. This flexibility, we anticipate, will be useful in future applications. Currently, in the quickest mode, for inputs of typical length (ten words or fewer), the time from end of input speech to start of translation pronunciation is normally less than five seconds on a 2.30-GHz Windows 7 desktop with 4.00 GB of RAM, and faster in a pending cloud-based version.

Next steps

The pilot system has served its primary purpose, having shown that an SLT system including features for verification, correction and customization can be accepted in a serious use case like health care. However, serious limitations remain to be overcome in future versions. Above all, the system must be made scalable. At present, it remains limited to English↔Spanish; it runs only on Windows platforms; and its verification, correction, and customization facilities have been implemented only for one rule-based translation system. Plans are underway to rectify these shortcomings.