sponsored content

— Supported by Pangeanic —

Overcoming the Limitations of LLMs

Pangeanic’s New RAG-Based Approach

It has recently been argued that one of the “faults” of large language models (LLMs) is that they don’t understand the output they produce. This is a misunderstanding of generative pre-trained transformers (GPTs), which were designed not to reason but to generate — that is, to predict. These models don’t need to understand what they produce to generate human-quality language. Predicting the next token seems good enough to fool us.

Likewise, rule-based machine translation (MT), statistical MT, and neural MT (NMT) never “understood” what they were translating. They followed rules or probabilities. So why would we expect any type of translation system to “understand”?

Thus, the question is not whether we should use LLMs for translation but how to make the best use of their contextual and cultural adaptation capabilities for localization while keeping in check their true nature (they were designed to generate).

LLMs like GPT-4 have captured the public imagination. More open-source models are freely available and can be customized for specific tasks. They can generate human-like text, answer questions, and even write code. But can they truly bridge the gap between languages and cultures? Recent research by Pangeanic and others suggests that we might be closer to this goal than we think, despite the well-known limitations of LLMs.

The Challenges of LLMs

While LLMs have shown impressive capabilities in generating fluent text across multiple languages, they’re not without their flaws. As critics point out, these models can produce “hallucinations” — confidently stated but entirely fabricated information. They also lack common sense, often failing to grasp the nuances of context and culture that are crucial in human communication.

Dr. Pilar Orero, a computational linguist at the Universitat Autònoma de Barcelona (UAB) and frequent partner of Pangeanic in European Union (EU) research grants, explains, “LLMs are essentially sophisticated pattern recognition machines. They can produce grammatically correct and even idiomatic text, but they don’t truly understand the world in the way humans do.”

The Context Conundrum

One of the biggest challenges in cross-cultural communication is the varying importance of context across cultures. Anthropologist Edward T. Hall famously categorized cultures as high- or low-context, depending on how much of the meaning in communication is implicit versus explicit. This distinction presents a significant hurdle for MT systems, including those powered by LLMs.

In high-context cultures, such as those found in many Asian, Mediterranean, and Latin American countries, communication relies heavily on shared understanding, implicit meanings, and non-verbal cues. In these cultures, what’s not said is often as important as what is said. The context — the setting, the relationship between speakers, and shared cultural knowledge — carries a significant portion of the message.

Consider the Spanish phrase “lo vamos viendo” (“we will see to it as we go on”). To a native Spanish speaker, particularly from a Mediterranean culture, this seemingly simple phrase can convey a wealth of meaning. It suggests flexibility in the face of the challenge ahead, a reluctance to fully commit, and an openness to adapting plans as circumstances change. Orero notes that “it’s a perfect example of Mediterranean improvisation that might leave an American or Scandinavian bewildered.”

Low-context cultures — prevalent in countries like the United States (US), Germany, and Scandinavian nations — tend to communicate more directly and explicitly. Information is more likely to be conveyed in words rather than context. It is worth noting that even within the same language, American Spanish speakers tend to find European Spanish speakers very direct.

This cultural divide can lead to amusing, or frustrating, misunderstandings. Take the example of asking a passerby to take a photo of you. A Finnish person might simply reply “no” without offering any explanation or apology. While this direct response might be perfectly acceptable in Finnish culture, it would leave Southern Europeans stunned. Maria Angeles Garcia (Head of MT at Pangeanic) explains, “In Spain, even if someone couldn’t or didn’t want to take the photo, they would likely offer profuse apologies and explanations. The lack of these social niceties can make low-context communicators seem rude to those from high-context cultures.”

The differences extend beyond casual interactions. As Angeles Garcia and Pangeanic’s Head of Production Ángela Franco verified a few months ago in several business settings, these cultural communication styles can significantly impact negotiations and relationships. Imagine a business meeting in Japan versus one in the US. In Japan, a high-context culture, meetings are often filled with ceremonies and protocols. Direct discussion of the business at hand might be considered crude. Reading between the lines and understanding unspoken agreements are crucial skills.

Contrast this with a typical American business meeting, where participants are likely to “get down to business” quickly, explicitly stating goals, presenting data, and seeking clear agreements or disagreements within the meeting. What’s considered efficient and professional in one culture might be seen as rushed and impolite in another.

These cultural differences in communication style present a formidable challenge for MT systems. How can AI understand that “lo vamos viendo” might require a completely different translation depending on the context? How can it capture the unspoken elements of a Japanese business negotiation?

Angeles Garcia, who has extensive experience in NMT and multilingual MT, points out, “LLMs are trained on vast amounts of text data, which allows them to capture some of these cultural nuances. However, they still struggle with the deeper understanding required to consistently solve these complex cultural contexts.”

Even within a single culture, context can vary greatly depending on the situation. Orero explains, “All societies combine both types of communication. There’s no language that’s entirely independent of context for correct understanding of what it expresses.”

This complexity extends to non-verbal communication, as well. “Mediterranean societies might favor more gesticulation,” Orero notes, “while Nordic cultures might be less expressive gesturally, but perhaps much more subtle. Gestures or tones that might go unnoticed by a non-native speaker could be very revealing to a native.”

The challenge for LLMs and MT systems, then, is not just to translate words, but to translate entire cultural contexts. They need to reflect not just what is said, but how it’s said, why it’s said, and crucially, what isn’t said.

As we push the boundaries of AI in translation, addressing this context conundrum becomes increasingly important. The goal is not just linguistic accuracy, but cultural fluency — the ability to tackle the complex, often unspoken rules that govern communication in different cultures.

A New Approach: Deep Adaptive RAG-Based Automatic Post-Editing

Deep Adaptive, developed in collaboration with the Polytechnic University of Valencia’s Pattern Recognition and Human Language lab, has long been Pangeanic’s NMT flagship. However, Pangeanic’s most recent research offers an even more promising solution. As part of a national research project, our team has developed a system that combines the power of LLMs with specific retrieval-augmented generation (RAG), vector databases, and agentic verifiers to create more accurate and culturally appropriate translations.

The system works as follows:

  1. An initial translation is generated using a state-of-the-art NMT model.
  2. This translation is then processed by a RAG system, which uses vector databases to retrieve relevant context, terminology, and style information.
  3. Finally, an LLM, fine-tuned for post-editing (PE), refines the translation based on the retrieved information.

Angeles Garcia is the lead researcher on the project, with Franco ground-testing every step. They explain, “Our system doesn’t just translate words; it translates context. By leveraging vast databases of domain-specific knowledge and cultural information (what the user deems relevant), we can produce translations that are not only accurate but also culturally appropriate.”

The Workflow: A Three-Step Process

The system works through a sophisticated three-step process:

  1. Initial Translation: The source text is first translated using a state-of-the-art NMT model. For this study, the team used a heavily fine-tuned version of Meta’s “No Language Left Behind” model, which supports over 200 languages. “We fine-tuned the model on 20 million words of carefully selected, reviewed, and cleansed data from our proprietary repository,” Angeles Garcia notes. “This gave us a strong baseline that already had some domain-specific knowledge.”
  2. Context Retrieval: The initial translation is then processed by a RAG system. This step is crucial for addressing the context conundrum. Orero — who wasn’t involved in the study but has reviewed the findings and will be using them in an EU project for the audiovisual sector — explains, “In essence, the RAG system uses vector databases to retrieve relevant contextual information, terminology, and style guidelines. It’s like giving the translation system access to a vast library of cultural and domain-specific knowledge, which is very important in multilingual automatic subtitling, for example.” The vector database used in this study contained 100,000 terminology pairs and style examples for each language pair, covering domains as diverse as healthcare, software, journalism, marketing, law, hydrology, public administration, and the automotive industry.
  3. PE Refinement: Finally, an LLM that is fine-tuned specifically for PE tasks refines the translation based on the retrieved information. “This is where the magic happens,” Franco adds. “The LLM doesn’t just substitute words based on the retrieved information. It uses its deep understanding of how a language works to seamlessly integrate the contextual nuances, terminology, and style guidelines into a fluent, culturally appropriate translation.”
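The three steps above can be sketched as a simple orchestration loop. This is a minimal illustration of the flow only: the `nmt`, `retriever`, and `post_editor` objects, their method names, and the `RetrievedContext` fields are hypothetical stand-ins, not Pangeanic’s actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedContext:
    """Context pulled from the vector database for one segment (illustrative fields)."""
    terminology: dict = field(default_factory=dict)  # source term -> preferred target term
    style_notes: list = field(default_factory=list)  # e.g. register or formality guidelines

def translate_with_rag(source_text, nmt, retriever, post_editor):
    """Three-step flow: NMT draft -> context retrieval -> LLM post-edit."""
    # Step 1: draft translation from the fine-tuned NMT model
    draft = nmt.translate(source_text)
    # Step 2: fetch terminology and style guidance semantically
    # similar to the source segment
    context = retriever.query(source_text)
    # Step 3: the post-editing LLM refines the draft, constrained
    # by the retrieved context
    return post_editor.refine(draft, context)
```

The point of the structure is separation of concerns: any of the three components can be swapped or updated without retraining the other two.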


The Power of Vector Databases

A key innovation in this approach is the use of vector databases. Traditional databases store and retrieve information based on exact matches, which isn’t ideal for the nuanced world of language. Vector databases, on the other hand, store information as high-dimensional vectors that represent the semantic meaning of the text.

Manuel Herranz, Pangeanic’s CEO, states, “When the system looks for relevant information, it’s not just matching keywords. It’s finding concepts and contexts that are semantically similar to the text being translated. This allows for much more flexible and nuanced retrieval of information.”
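The similarity search Herranz describes can be made concrete with a toy example. The sketch below uses invented 3-dimensional vectors in place of the high-dimensional embeddings a real sentence encoder would produce, and made-up terminology records; it only illustrates the principle that retrieval ranks by semantic closeness rather than keyword match.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy "vector database": each entry pairs an embedding with a terminology record.
db = [
    ([0.9, 0.1, 0.0], {"term": "contrato marco", "translation": "framework agreement"}),
    ([0.0, 0.8, 0.6], {"term": "cuenca hidrográfica", "translation": "river basin"}),
]

def retrieve(query_vec, k=1):
    """Return the k records whose embeddings are most similar to the query."""
    ranked = sorted(db, key=lambda entry: cosine(query_vec, entry[0]), reverse=True)
    return [record for _, record in ranked[:k]]

# A query embedding close to the hydrology entry retrieves it,
# even though no keyword is shared with the query text.
top = retrieve([0.1, 0.7, 0.7])
```

Production systems replace the linear scan with an approximate-nearest-neighbor index, but the ranking criterion is the same.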

Addressing Common Problems

The RAG-based approach helps mitigate hallucinations. “By grounding the LLM’s generation in retrieved information, we significantly reduce the risk of hallucination,” Herranz explains. “The system isn’t making things up; it’s drawing on a curated database of reliable information.”
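One common way to implement this grounding is to inject the retrieved terminology and style notes directly into the post-editing instruction, telling the model to edit rather than invent. The sketch below is illustrative only; the article does not disclose Pangeanic’s actual prompt wording, and all field names here are assumptions.

```python
def build_pe_prompt(source, draft, glossary, style_notes):
    """Assemble a post-editing prompt grounded in retrieved data.
    Constraining the LLM to supplied material limits hallucination."""
    glossary_lines = "\n".join(f"- {src} -> {tgt}" for src, tgt in glossary.items())
    return (
        "Post-edit the draft translation below.\n"
        "Use ONLY the terminology provided; do not add information "
        "absent from the source.\n\n"
        f"Source: {source}\n"
        f"Draft: {draft}\n"
        f"Terminology:\n{glossary_lines}\n"
        f"Style: {'; '.join(style_notes)}\n"
    )

prompt = build_pe_prompt(
    "El contrato marco entra en vigor mañana.",
    "The frame contract takes effect tomorrow.",
    {"contrato marco": "framework agreement"},
    ["formal register", "UK English"],
)
```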

The system shows particular promise in navigating the tricky waters of high- and low-context communication styles. Franco notes, “What’s impressive is how the system adapts its translation style based on the cultural context of both the source and target languages and how incredibly fast it adapts to new scenarios, with a very small number of ‘generative’ issues.”

Herranz adds, “When translating a high-context Japanese business communication into English, the system doesn’t just perform a literal translation. It retrieves relevant information about business communication styles in English and uses this to produce a translated version that conveys the same meaning in a way that’s appropriate for a low-context culture.”

Scalability and Adaptability

Another significant advantage of this approach is its scalability and adaptability. Unlike traditional NMT systems that require extensive retraining to adapt to new domains or language pairs, the Deep Adaptive RAG-based system can be updated simply by modifying the inputs (TMX or TSV files) in the vector database.

“If we need to add a new domain or update our terminology, we just update the database,” Angeles Garcia explains. “The core translation and PE models can remain the same, which makes the system incredibly flexible and cost-effective to maintain.”
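The update path Angeles Garcia describes can be as lightweight as parsing a terminology file and re-indexing it. The sketch below handles the TSV case (TMX, being XML, would need a proper parser); the function name and record fields are invented for illustration.

```python
import csv
import io

def load_tsv_pairs(tsv_text):
    """Parse a TSV of source/target terminology pairs,
    e.g. exported from a translation memory."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [{"source": row[0], "target": row[1]} for row in reader if len(row) >= 2]

# Adding a domain becomes a data update, not a retraining run: each parsed
# pair would then be embedded and upserted into the vector database, while
# the NMT and post-editing models stay untouched.
new_entries = load_tsv_pairs(
    "cuenca hidrográfica\triver basin\ncaudal ecológico\tecological flow\n"
)
```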

The results of this new approach are striking. In human evaluations, the RAG-based system was preferred over traditional MT in 91% of cases for German, 93.8% for French, and 87.9% for Japanese.

Franco notes, “What’s particularly exciting is how well the system handles domain-specific terminology and stylistic nuances. It’s not 100% perfect, but MT never was even if you took the output from giants like DeepL or Google Translate. There was always a need for human verification. This is a very significant step towards truly bridging the gap between languages and cultures.”

Future Work and the Human Touch

Despite its impressive performance, the researchers caution that the system is not perfect. “We’re still in the early stages of this technology,” Herranz admits. “It is truly impressive, and current use cases presented at GALA and LocWorld speak for themselves. While it handles many aspects of context very well, there are still some nuances of human communication that will be work in progress for a bit, like adding tone or visual cues. But, hey, this is the language industry, and final publication does require a human eye.”

Future research will focus on improving the system’s understanding of more subtle contextual cues, including non-verbal aspects of communication. The team is also exploring ways to incorporate real-time cultural and current events information to make translations even more contextually relevant.

As the language industry pushes the boundaries of AI in translation, this Deep Adaptive RAG-based approach represents a significant step forward. By combining the pattern-recognition capabilities of LLMs with the vast, structured knowledge of vector databases, we’re moving closer to MT that can truly bridge cultural and linguistic divides.

Herranz concludes, “Our research shows that by combining LLMs with other technologies like RAG and vector databases, and by carefully considering the role of context and culture in communication, we can create systems that come closer to true cross-cultural understanding. It’s an exciting time for the field of natural language processing and for anyone interested in breaking down language barriers.”
