Translation technology’s big data revolution

You probably know that language data includes speech, text, lexicons and grammars. To have any chance of meaningful depth in a short article, let us focus on textual translation data stored in translation memory (TM). 

As I write, we are wrapping up a major report about the translation technology landscape. The report is a comprehensive analysis of the segment’s past, present and expected future, and the first of its kind to involve only technology suppliers. In June 2012, 20 of them gathered in Paris to discuss common goals in terms of visibility, collaboration and interoperability, and from this the report was born. Everyone in the meeting could agree that there were some grounds for cooperation, and the report is an open resource to help promote visibility. However, the case for wide-scale sharing of translation data was not an agenda item. It should have been. In our assessment, it is difficult to overstate the importance of language and translation data for the global translation technology and services industry.

Translation technology can certainly be better exploited to capitalize on massive demand to translate content across all industries and throughout the many touch points in the consumer decision journey. It may be particularly important for the burgeoning consumer classes in emerging markets such as Brazil, Russia, India and China or the CIVETS markets of Colombia, Indonesia, Vietnam, Egypt, Turkey and South Africa.

But translation technology providers will need to maintain openness in order to continue the change they have exhibited in the last few years if they and we are to enjoy rich rewards. After a generation of stagnation, where there were only rare examples of firms improving on tired old TM technology, often built with Windows-based editors, we suddenly find ourselves primed for a generation of dynamism.  

Recent changes involve must-have migrations to cloud offerings, with growing use of web services application programming interfaces, tools that enable collaborative translation, integration of machine translation (MT) and TM, and MT with website localization, among others.  

It must be said that much of the initial gusto for these changes came from invaders, like the big guys Google and Microsoft, but also from firms such as Lingotek and XTM International. They’ve been joined by other newcomers like Baobab, MemSource, Gengo, Smartling, Straker and a whole host of others. Established players, such as SDL and MemoQ, have reacted well with migrations to cloud and integration with MT. One or two others are beginning to look promising.

The MT space has been the most dynamic of all, with Microsoft commercializing and an explosion of open source Moses-based providers and users. Long-established players such as SYSTRAN and PROMT have made great strides to hybridize their technologies and distinguish themselves with rounded and rapidly evolving offerings.


Data is core 

TM software was developed as a productivity tool for the translator. It was designed originally to manage the translation of a single document or project. Globalization and translation management systems were added later as a workflow to scale up and make TM tools fit for enterprise use.  

Traditionally, TM tools use the sentence as the basis for full and fuzzy matches. As long as your translation practice concentrates on the revisions and updates of documents and products, the leveraging score of TM tools can be very high.
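To make sentence-level leveraging concrete, here is a minimal sketch of a full and fuzzy match lookup. It is an illustration only: the toy memory, the function name best_match and the 0.75 threshold are assumptions made for this example, and Python's general-purpose difflib similarity stands in for whatever matching algorithm a commercial TM tool actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical toy translation memory: source sentence -> stored translation.
TM = {
    "Press the power button to start the device.":
        "Appuyez sur le bouton d'alimentation pour démarrer l'appareil.",
    "Remove the battery before cleaning.":
        "Retirez la batterie avant le nettoyage.",
}


def best_match(sentence, memory, threshold=0.75):
    """Return (score, source, target) for the closest stored sentence.

    A score of 1.0 is a full match; lower scores above the threshold are
    fuzzy matches. Returns None when nothing reaches the threshold
    (the threshold value is an assumption, not an industry constant).
    """
    best = None
    for source, target in memory.items():
        # Character-level similarity between the new and the stored sentence.
        score = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    if best is not None and best[0] >= threshold:
        return best
    return None


full = best_match("Press the power button to start the device.", TM)
fuzzy = best_match("Press the power button to stop the device.", TM)
miss = best_match("Quarterly revenue grew.", TM)
```

An updated manual reuses most of its sentences, so lookups like full and fuzzy succeed often. Short, novel content behaves more like miss, which is why leverage drops as the market moves away from document revisions.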

However, much of the market has changed. Increasing need for rapid-turnaround translation of smaller bits of content has brought developers back to the design table. Their focus is not only on the productivity of the translator, but also the agility of the enterprise.

They must rely on advanced statistical approaches, as already applied in statistical and hybrid MT systems, and they must bring sophisticated linguistic intelligence into the mix as well. They are not looking to leverage TM from a single document or project, but to use as much domain-specific text and data as possible. Their advanced, subsegment leveraging capability can further increase translation productivity. Harnessing big data sets is a core requirement.  

Today, glossaries are built by terminologists: the best-in-class language specialists. It is laborious and frustrating work. Because language keeps changing, the terminologist is always behind and the glossary is often ignored. But, in fact, terminology can be harvested in real time by accessing high volumes of translation data. Synonyms and related terms can be identified automatically. Parts of speech can be tagged, context listed, sources quoted and meanings described. The technology to do this work in a largely automated fashion, with linguists and users involved as validators, already exists.
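As a rough illustration of harvesting terminology from translation data, the sketch below counts recurring word sequences across source segments and proposes the frequent ones as term candidates for a human validator. Everything in it is an assumption made for the example: the function name candidate_terms, the tiny stopword list and the frequency cut-off. Real term extraction would add part-of-speech tagging, bilingual alignment and far more data.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on"}


def candidate_terms(segments, max_len=3, min_count=2):
    """Return (term, count) pairs for n-grams seen at least min_count times."""
    counts = Counter()
    for segment in segments:
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", segment.lower())
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Skip n-grams that start or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]


segments = [
    "Replace the toner cartridge.",
    "The toner cartridge is empty.",
    "Order a new toner cartridge online.",
]
terms = dict(candidate_terms(segments))
```

With only three segments, "toner cartridge" already surfaces as a candidate, while one-off words such as "replace" are filtered out. The validator's job shrinks from writing the glossary to approving or rejecting suggestions.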

Access to terminology on demand would raise the capability of the translation industry across the board and help fuel innovation among translation technology providers. Tools like Linguee and TAUS Search already show some of what is possible. Well-documented community translation efforts, such as those of Adobe and Symantec, have shown how helpful users can be in defining terminology. The European Commission-funded TaaS (Terminology as a Service) project might produce a next-generation terminology platform, combining Linguee’s scale of data with the user feedback of crowd efforts.

Most people accept the inadequacies of MT on the internet, as it often serves their goals, typically for comprehension. Relatively few companies or organizations go through the process of customizing an engine for their content.

Easy access to high volumes of translation data will open up a myriad of opportunities. Fully automatic semantic clustering could be developed to find the translation data that match specific domains. This would make it much easier and cheaper to make industry and domain-specific MT engines. Automatic genre identification techniques could be employed, making it easier and cheaper to ensure that MT engines apply the right writing style. It would be easier to go deeper in advancing MT technology with syntax and concept descriptions. In short, the overall quality of MT would be raised, while also making development faster, cheaper and more accessible. These types of enhancements, improving quality and lowering cost, would make customized MT solutions far more practical for small and medium-sized firms trading across borders. Customized engines typically produce translation quality at least twice as good as that resulting from free online tools. Underserved sectors such as the legal, healthcare, finance and public sectors, among others, would benefit tremendously.
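The idea of finding the translation data that match a domain can be sketched in a few lines. The example below is deliberately simpler than the semantic clustering described above: it uses plain bag-of-words cosine similarity against a seed text, and the function names, the seed and the 0.2 cut-off are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_in_domain(segments, seed_text, min_similarity=0.2):
    """Keep the segments whose vocabulary overlaps enough with the seed."""
    seed = bag_of_words(seed_text)
    return [s for s in segments if cosine(bag_of_words(s), seed) >= min_similarity]


segments = [
    "Take two tablets daily as directed by your doctor.",
    "Click the icon to open the settings menu.",
]
medical = select_in_domain(segments, "patient dosage tablets treatment doctor")
```

Here only the first segment is kept as healthcare material. Applied to a large shared repository, the same kind of filtering, done with better models, is what would make training data for a domain-specific engine cheap to assemble.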

MT is the future. As Erik Ketzan says in “Rebuilding Babel: Copyright and the Future of Machine Translation Online,” “Technology may have put man on the moon, but machine translation has the potential to take us farther, across the gulf of comprehension that lies between people from different places.”

In terms of global market and customer analytics, easy access to high volumes of translation and also language data would make it much easier to integrate translation technologies and processes with consumer listening, analytics and social media management. This would enable multilingual sentiment analysis, search engine optimization, opinion mining, customer engagement, competitor analysis and more.

In terms of quality management, the translation industry often struggles to deliver adequately targeted quality, missing the local flavor, the right term or subject knowledge. Source texts are often in bad shape, causing all kinds of trouble for translators or MT engines. If high volumes of language data could be easily accessed, it would be possible to automatically clean and improve source texts for translation. It would be simpler to run automatic scoring and benchmark quality. Consistency and comprehensibility would be improved.

Currently, the lack of interoperability and compliance with standards costs a fortune. Buyers and providers of translation often lose 10% to 40% of their budgets or revenues because language resources are not stored in compatible formats. If it were common practice for the global translation industry to share most of its public translation data in common industry repositories, all vendors and translation tools would be driven toward full compatibility. This would enable the industry to scale up significantly. One example of this is SWIFT (Society for Worldwide Interbank Financial Telecommunication).

Any financial institution wanting to transact across borders uses SWIFT messaging protocols, which are agreed upon by all stakeholders as instruments evolve. One thing bankers certainly have gotten right is recognizing that interoperability benefits everyone and working out a way to hardwire it into the sector. The translation sector analogy would be that we realize the benefits of access to big translation data sets and work together to exploit opportunities. Interoperability would just be a by-product.

Sources of big data 

The world wide web is the most obvious source of big language data. But only the brave and deep of pocket can risk the friction of legal issues and the effort needed to crawl and, specifically, clean (process) such data. There are examples of such crawling taking place among small players, but it is difficult to see how these efforts can ever really scale up.

A far better way is to aggregate TMs. This is what data-driven translation technology firms are already doing or enabling. Clearly, data lock-in is the risk for clients that take part. Only a handful of providers could ever be more than marginal players taking this route. There are platforms such as Microsoft’s Collaborative Translation Platform and Google Translator Toolkit that can collect data as a by-product of use. However, it takes the muscle of a major enterprise to really scale with this approach.

The best option by far is to access data from open repositories, like TAUS Data, Language Grid and others that are inevitably on the horizon. These platforms provide secure legal frameworks and enable users to spread and share investments in hosting, cleaning and data selection. For any observer it seems a no-brainer that translation technology and service providers should wholeheartedly embrace such utilities by integrating, sharing and exploiting data. It is the only model providing for competition, flexibility and sustainable innovation. But as I find whenever I talk to my parents and their peers about arranged marriage, it is often emotional attachment and sometimes a lack of vision rather than rational decision-making that rules. Sometimes it is only outsiders or people with one foot in both camps who have the tenacity to try new things.  

The paradigm-shifting move to the cloud is taking place across almost all technology segments. This has made it easy for translation technology providers old and new to position and communicate their offerings.  

Whether we like it or not, the big language data revolution is happening. Translation technology firms certainly recognize this. However, there is no data equivalent of software as a service, the de facto standard for cloud offerings. And clearly, translation data is not owned by technology or service providers. There will inevitably be trial, failure and success. There are exciting times ahead. Openness to change is a prerequisite. Giving up control games like technology and data lock-ins may be another. Let’s see what happens as old habits die hard. To learn more about this and other topics, download a free copy of the TAUS Translation Technology Landscape report.