Fewer Than You Think: The Hidden Reality of Language Pairs in AI Models

Big‑tech companies often boast about advancing language support in their artificial intelligence (AI) models, many claiming to offer dozens of language pairs. However, this marketing narrative doesn’t tell the whole story, obscuring the fact that most of these models rely on an English‑pivot architecture that precludes true multilingual interoperability. 

For consumers who depend on non-English language pairs to communicate, incomplete or misleading statements about language support can be far from helpful. Companies in Silicon Valley owe it to their global users to be transparent about the data used to train their AI and what effect that has on its multilingual capabilities.

The English-Pivot Bias

Synthetic data is the driving force behind today’s AI models. While there is no official report on how much of the synthetic data used in AI is strictly English-centric, we know that the overwhelming majority of AI models are built on English-language data. When there is little to no synthetic generation pipeline for non-English to non-English language pairs, models route translations through English as an intermediate step. The result is what’s known as an English-pivot bias, along with lower-quality translation.

This structural tendency also shapes evaluation benchmarks, prompting formats, and reinforcement learning reward models, which end up tuned to work as intended mainly in an English-language setting. And of course, this information often gets left out of marketing campaigns when a new model is released or when languages are added to a product.
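
One way pivoting can lose information is easy to see with a toy case. Spanish and French both distinguish informal from formal "you" (tú/usted, tu/vous), while English has a single word. The sketch below uses tiny hand-written lexicons invented for illustration; they are not taken from any real system, and real models fail in subtler ways than a lookup table does.

```python
# Toy lexicons, invented for illustration only (not from any real model).
ES_TO_EN = {"tú": "you", "usted": "you"}   # English collapses the formality distinction
EN_TO_FR = {"you": "vous"}                 # so the second hop must pick one default
ES_TO_FR = {"tú": "tu", "usted": "vous"}   # a direct pair can keep the distinction


def translate_direct(word: str) -> str:
    """Spanish -> French using a direct lexicon."""
    return ES_TO_FR[word]


def translate_via_english(word: str) -> str:
    """Spanish -> English -> French, i.e. an English pivot."""
    return EN_TO_FR[ES_TO_EN[word]]


for word in ("tú", "usted"):
    print(word, "direct:", translate_direct(word), "pivot:", translate_via_english(word))
```

In this toy setup the pivot route renders both Spanish forms as "vous", because the formality cue is already gone by the time the text reaches English.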

Pair‑Level Transparency

Looking at claims from five of the biggest tech companies with AI models (Amazon, Apple, Google, Meta, and OpenAI), there is a noticeable pattern of excluding specific information on supported language pairs. Two models, Meta’s No Language Left Behind (NLLB) and Google’s TranslateGemma, do have publicly available model cards, but no company provides complete pair-level transparency. Models are marketed as having been trained on 55 language pairs (TranslateGemma) or supporting over 200 languages (NLLB), for example, but without a published full language-pair matrix, these claims cannot be verified.

Currently, there are no ISO standards, European Union AI Act provisions, or United States regulations that require companies to disclose language-pair coverage or non-English translation directions. Perhaps it’s no surprise that this information is hard to find.
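
For readers wondering what a full language-pair matrix might look like in practice, here is a minimal sketch. The language list and the set of supported directions are hypothetical, chosen only to show the shape of such a disclosure; they do not describe any company's product.

```python
from itertools import permutations

# Hypothetical product languages and supported (source, target) directions.
LANGS = ["en", "es", "fr", "sw"]
SUPPORTED = {
    ("en", "es"), ("es", "en"),
    ("en", "fr"), ("fr", "en"),
    ("en", "sw"), ("sw", "en"),
    ("es", "fr"),
}


def missing_directions(langs, supported):
    """Every ordered (source, target) pair that is not listed as supported."""
    return [pair for pair in permutations(langs, 2) if pair not in supported]


def render_matrix(langs, supported):
    """Plain-text grid: rows are source languages, columns are targets."""
    rows = ["    " + " ".join(f"{tgt:>3}" for tgt in langs)]
    for src in langs:
        cells = []
        for tgt in langs:
            if src == tgt:
                mark = "-"
            elif (src, tgt) in supported:
                mark = "yes"
            else:
                mark = "no"
            cells.append(f"{mark:>3}")
        rows.append(f"{src:<4}" + " ".join(cells))
    return "\n".join(rows)


print(render_matrix(LANGS, SUPPORTED))
print(len(missing_directions(LANGS, SUPPORTED)), "directions not covered")
```

A grid like this makes gaps visible at a glance: in the hypothetical data above, French to Spanish is missing even though Spanish to French is listed.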

Why It Matters

For consumers who rely on non‑English communication, the difference between “200 languages” and “200 translation directions” is enormous. The lack of complete information on language pairs can distort the true depth of multilingual coverage these products are able to provide to non-English-speaking consumers.
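
The size of that gap is simple arithmetic. Among n languages there are n × (n − 1) possible translation directions, but only 2 × (n − 1) of them involve English. The short sketch below works this out for 200 languages, assuming English is counted as one of the 200; the function names are illustrative.

```python
def translation_directions(num_languages: int) -> int:
    """Ordered source -> target pairs among num_languages distinct languages."""
    return num_languages * (num_languages - 1)


def english_centric_directions(num_languages: int) -> int:
    """Directions with English as source or target (English is one of the languages)."""
    return 2 * (num_languages - 1)


n = 200
all_dirs = translation_directions(n)
en_dirs = english_centric_directions(n)
print(all_dirs, en_dirs, f"{en_dirs / all_dirs:.1%}")  # prints: 39800 398 1.0%
```

In other words, a product that covers every language only to and from English handles 398 of 39,800 possible directions, about 1 percent, while still truthfully advertising "200 languages."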

In the race to support the most languages, AI companies benefit from advertising with ambiguity. After all, simple language counts are easy to market — but at what cost to the consumer? Until the industry adopts pair‑level transparency, multilingual AI will continue to be marketed as global while functioning primarily as English‑centric.

Sydnee Cooper
Sydnee Cooper's expertise spans the language service industry, language access laws, and second language acquisition. She is passionate about raising awareness among global audiences about the impact of languages and cultures on our lives.

MultiLingual Media LLC