Building a roadmap for Big Data TM integration

Konstantine Boukhvalov, Alexander Jimenez
Multilingual January/February 2017

Many corporations and government entities would like to benefit from the Big Data revolution in language processing. Currently, this requires feeding large amounts of data into open or shared technology solutions. However, legal concerns about control of intellectual property, and even questions of national security, often frustrate the most modest ambitions to “ride the Big Data wave.” Yet there is a way to alleviate or eliminate these intellectual property and security concerns, opening the door to wider exploitation of high-value multilingual content....

There are many potential benefits to developing Big Data translation memory (TM) corpora. First, the language services community would have free access to a large repository of legacy translations to leverage when translating new content. Second, instead of being restricted to the legacy TMs produced by a small pool of vendors, corporate and government language services would be able to pretranslate their content by leveraging a translation database produced by thousands of translators. Third, the database would be continuously updated and would serve as a training base for machine translation (MT) solutions.

However, there is resistance to developing Big Data TMs. One of the main objections, and a perfectly valid one, is that proprietary and sensitive data could, and most likely would, enter the public domain through TM exchange. This risk of exposure is a major obstacle to forming a Big Data community that would be willing to contribute TM corpora for global leveraging....