Can images be used to enhance machine translation output?

Researchers at the Massachusetts Institute of Technology (MIT), the University of California at San Diego (UCSD), and IBM have developed a machine learning model that just might shake things up in the field of machine translation (MT): VALHALLA.

Throughout the history of MT, models have mostly worked with text alone. In recent years, however, some researchers have grown interested in multimodal machine translation (MMT), a paradigm that takes more than one form of input into account when producing a translation, such as text and a corresponding image. VALHALLA is an MMT model that uses computer vision to improve the quality of a translation.

“While there are many different ways to describe a situation in the physical world, the underlying visual perception is shared among speakers of different languages,” the team of researchers behind VALHALLA wrote in a recent paper. “The addition of visual context in the form of images is thus likely to help the machine translation.”

Many MMT approaches are rather time-consuming, because sentences or phrases must be manually paired with existing images. The team of researchers from MIT, UCSD, and IBM has come up with a way to make this more efficient. Using a process called visual hallucination, VALHALLA generates its own image based on the text input.

“To the best of our knowledge, we haven’t seen any work which actually uses a hallucination transformer jointly with a multimodal translation system to improve machine translation performance,” Rameswar Panda, a researcher at the MIT-IBM Watson AI Lab who worked on developing VALHALLA, told MIT News.

When translating the English sentence “A snowboarder wearing a red coat is going down a snow-covered slope,” VALHALLA first generates an image that depicts the scene described in the sentence. It then uses this image to improve the accuracy of the translation and to help resolve semantic ambiguities that may arise. Other models, by contrast, may require existing images to be paired with words, phrases, or sentences.
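The two steps described above (hallucinate a visual representation from the text, then translate using both) can be illustrated with a deliberately tiny Python sketch. This is not the researchers' implementation: the lookup tables, the function names `hallucinate`, `translate_ambiguous`, and `run`, and the English-to-German word pair are all hypothetical stand-ins, and a real system would use learned neural networks rather than dictionaries. The sketch only shows how a "visual" signal derived from the source sentence might help choose between two senses of an ambiguous word.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for a learned hallucination model: maps context
# words to coarse "visual concepts". A real system would predict an image
# or image-like representation with a neural network.
CONTEXT_TO_CONCEPT: Dict[str, str] = {
    "river": "water",
    "boat": "water",
    "money": "building",
    "loan": "building",
}

# Hypothetical toy lexicon: English word -> {visual concept: German word}.
# The "default" entry is used when no hallucinated concept matches.
AMBIGUOUS: Dict[str, Dict[str, str]] = {
    "bank": {"water": "Ufer", "building": "Bank", "default": "Bank"},
}


def hallucinate(tokens: List[str]) -> Set[str]:
    """Step 1: derive a 'visual' representation from the source text alone."""
    return {CONTEXT_TO_CONCEPT[t] for t in tokens if t in CONTEXT_TO_CONCEPT}


def translate_ambiguous(tokens: List[str], visual: Set[str]) -> Dict[str, str]:
    """Step 2: use text plus hallucinated context to pick a word sense."""
    choices: Dict[str, str] = {}
    for tok in tokens:
        senses = AMBIGUOUS.get(tok)
        if senses is None:
            continue  # not an ambiguous word in this toy lexicon
        # Pick the first hallucinated concept that has a sense entry.
        match = next((c for c in sorted(visual) if c in senses), "default")
        choices[tok] = senses[match]
    return choices


def run(sentence: str) -> Dict[str, str]:
    """Run both steps on a whitespace-tokenized, lowercased sentence."""
    tokens = sentence.lower().split()
    return translate_ambiguous(tokens, hallucinate(tokens))
```

Under these assumptions, calling `run("They walked along the river bank")` selects the riverbank sense, while a sentence mentioning a loan selects the financial one. No paired image is needed at translation time, because the visual signal is produced from the text itself.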

VALHALLA uses the text input and the resulting visual hallucination to predict the most accurate sentence in the target language. According to the researchers’ paper, the model performs as well as or better than other prominent MMT models, particularly when translating longer sentences, which may contain more ambiguous words. In future studies, the researchers hope to explore other modalities that can be used for MMT, such as audio and video.


Andrew Warner
Andrew Warner is a writer from Sacramento. He received his B.A. in linguistics and English from UCLA and is currently working toward an M.A. in applied linguistics at Columbia University. His writing has been published in Language Magazine, Sactown Magazine, and The Takeout.
