MT security 101

By Jake Schild December 7, 2017

Since its inception, machine translation (MT) has prompted debate among language specialists and consumers about how the technology could change the way language is translated.

Neural machine translation (NMT) is the most recent advance in translation automation technology. When Google added NMT to its translation interface in 2016, the tech company and the news media lauded the upgrade as a major accomplishment.

Loosely modeled on the neural networks of the human brain, NMT uses algorithms to learn linguistic patterns on its own from training data. In theory, neural translation systems can process entire sentences at a time, instead of only single words or phrases, and can therefore create a translation that better reflects context.

The technology is poised to shake up the industry as its quality and accuracy continue to improve. But, in the context of translation automation and MT, there’s a conversation that seems to come up much less often: security.

The lack of discussion on the subject raises questions, especially given new data regulations emerging globally, including the European Union (EU) General Data Protection Regulation (GDPR) set to take effect in 2018.

Free, online translation tools can be a major risk to data security, and applications such as Google Translate will be mired in even more ambiguity once GDPR is enforced. Due to a general lack of knowledge regarding neural learning among consumers, it’s likely hard for those outside of the industry to fully understand the risks associated with data processing through cloud-based MT solutions.

In particular, the risks associated with the storage, transfer and processing of MT input data need to be examined, as well as how the use of MT relates to data protection and copyright legislation.

General security risks

MT security risk can be broken down into three general categories: file transfer, storage and processing. None of these concerns the act of machine translation itself, but each is part of the larger workflow around it.

In an increasingly data-driven world, a heavier reliance on technology and less focus on manual processes heighten the possibility of security breaches.

File transfer: During the MT process, files are usually sent to a language service provider (LSP) or submitted through an online interface for translation. This could mean sending files via a translation management system (TMS).

As a consumer, it’s important to know files are being transferred securely between local machines and the translation vendor. It’s best to find a provider that uses at least 256-bit encryption so that sensitive information can’t be intercepted or altered while it’s moving from one computer to another.

Standard methods of secure file transfer rely on Transport Layer Security (TLS), the successor to the now-deprecated Secure Sockets Layer (SSL), and include HTTPS and FTPS, which run HTTP and FTP over TLS.
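
As a rough illustration of what secure transfer can look like on the sending side, the Python sketch below builds a TLS configuration that refuses older protocol versions and checks that an upload address uses HTTPS before anything is sent. The function names and the example.com address are hypothetical and are not taken from any particular LSP or TMS. The TLS 1.2 floor is an assumption chosen for the example, not a requirement stated by any vendor.

```python
import ssl
from urllib.parse import urlparse


def make_strict_tls_context() -> ssl.SSLContext:
    """Return a client TLS context that verifies certificates
    and refuses protocol versions older than TLS 1.2."""
    # create_default_context() turns on certificate and hostname checks.
    context = ssl.create_default_context()
    # Assumption for this sketch: TLS 1.2 is the oldest acceptable version.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def is_secure_upload_url(url: str) -> bool:
    """Return True only if the address uses HTTPS and names a host."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.hostname)


if __name__ == "__main__":
    # Hypothetical upload address, used only for illustration.
    upload_url = "https://lsp.example.com/upload"
    context = make_strict_tls_context()
    print("Safe to send:", is_secure_upload_url(upload_url))
    print("Oldest protocol allowed:", context.minimum_version.name)
```

A check like this only covers the transfer itself. It says nothing about how the vendor stores or processes the file once it arrives.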

Storage: Once source files are sent to an LSP, the question becomes where they are stored. Is the hosting server secured? Are backups in place to ensure the data isn’t lost?

In a recent example of the dangers of online MT, the Norwegian oil giant Statoil found that emails and other internal documents translated with Translate.com were accessible to anyone who searched for them via Google.

Statoil said that no “sensitive” information was disclosed, but still asked Translate.com to remove the translations that were leaked. The case of Statoil makes it clear that some MT can come with a great amount of risk.

Processing: Free online translation engines like those from Google or Microsoft store and process data in order to build up their corpora. Anything submitted to Google Translate is then accessible to the tech giant, which is able to use the data as it pleases.

There are still questions that need to be asked about data processing when working with LSPs as well. Namely, it’s important to see if data is stored, and if it is, where the text is kept.

There’s not only the question of what a translation company does with the data it receives, but also what will become of stored text if a system crashes.

MT and data protection law

In May of 2018, the EU’s GDPR will officially take effect. The regulation replaces the Data Protection Directive 95/46/EC and deals with the storage and processing of personal data.

Since it was passed in 2016, there’s been considerable discussion about the upcoming regulation and how businesses should prepare for it. In short, the GDPR was approved in order to put stricter limits on how companies can handle the data of EU citizens, attempting to make the process more transparent and accessible.

Among its provisions, the GDPR includes the following requirements:

• Citizens whose personal data is processed have the right to access that information and to know the purpose of the processing.

• The law allows citizens to have their data erased by a data controller if it is no longer needed for its original purpose. However, according to the GDPR, a controller deciding whether to erase information may weigh the request against the “public interest in the availability of the data.”

• The purpose of, and request for, the processing of someone’s data must be provided in an “intelligible and easily accessible” form.

When these rules are set against Google Translate’s terms of service, things get a little confusing.

Google Translate stores the data consumers submit. From its terms of service:

“When you upload, submit, store, send or receive content to or through our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content.

“The rights you grant in this license are for the limited purpose of operating, promoting, and improving our Services, and to develop new ones. This license continues even if you stop using our Services (for example, for a business listing you have added to Google Maps).”

First, the GDPR states that customers have the right to know what data is being processed and why. When someone keys information into Google Translate, one would think Google would be using the words and phrases submitted to build its text corpora, based on the description in its terms of service.

But still, Google’s definition is vague, as is the provision set out in the GDPR. As part of its “Data Processing Amendment,” Google does state that it will allow users to delete data and that it won’t be used for advertising.

Whatever the case, this kind of ambiguity should be noted by those thinking about using free online translation services.

In late 2016, a poll of 900 businesses in the UK, France and Germany found that 96% of decision makers in IT weren’t prepared for the regulation.

That statistic is surprising, considering what companies have to lose if they’re noncompliant: not following the new rules could result in fines of up to €20 million or four percent of annual global turnover, whichever is larger.
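
To make the “whichever is larger” rule concrete, here is a minimal Python sketch of the ceiling described above. The function name and the sample turnover figures are invented for illustration. The result is only the upper limit on a fine as stated in this article, not a prediction of what a regulator would actually impose.

```python
FLAT_CAP_EUR = 20_000_000   # the EUR 20 million figure cited above
TURNOVER_PERCENT = 4        # the four percent figure cited above


def max_fine_eur(global_turnover_eur: int) -> int:
    """Return the ceiling on a fine in whole euros:
    the larger of the flat cap and four percent of turnover."""
    percent_of_turnover = global_turnover_eur * TURNOVER_PERCENT // 100
    return max(FLAT_CAP_EUR, percent_of_turnover)


if __name__ == "__main__":
    # Hypothetical companies with turnover of EUR 100 million and EUR 1 billion.
    for turnover in (100_000_000, 1_000_000_000):
        print(f"Turnover {turnover:,} EUR -> ceiling {max_fine_eur(turnover):,} EUR")
```

For the smaller company, four percent of turnover is €4 million, so the €20 million figure sets the ceiling. For the larger one, four percent is €40 million, which exceeds the flat cap and becomes the ceiling instead.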

MT and copyright

The issue of data privacy and MT isn’t the only point of ambiguity when it comes to automated translation engines. Copyright law and MT haven’t been fully reconciled yet, either. This lack of clarity is evidenced by current US copyright legislation.

One phrase that isn’t found in Title 17 of the United States Code is “machine translation.” Title 17 deals with copyright law and mentions “translations,” but not MT.

In the United States, translations are “derivative works” of an original. This means a translation needs to be authorized by the original author or the owner of the copyright in the original work before it is made.

If someone translates a paragraph from a copyrighted book, news report, television script or other printed material, they could be violating copyright law in the United States without knowing it. There is no copyright infringement if the work is released under a Creative Commons license that permits adaptations or if permission has been granted to translate the document in question.

Erik Ketzan examines the issue of copyright and MT in his 2006 paper, “Rebuilding Babel: Copyright and the Future of Machine Translation Online.” He brings up the issue of “fair use” and the Digital Millennium Copyright Act (DMCA).

Title 17 states: “Fair use… for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright.”

The statute goes on to explain that whether a given use qualifies as “fair use” depends on the purpose and character of the use, the nature of the copyrighted work and the “amount and substantiality of the portion used in relation to the copyrighted work as a whole.” The effect of the use on the potential market for, or value of, the copyrighted work is also taken into consideration.

The issue of a lack of copyright laws regarding MT has already cropped up. In May of 2017, The Globe and Mail published an article by Michael Geist outlining the troubles Canada has had with regard to a lack of MT copyright regulation.

Geist explains that it may not be feasible to supply machines with the necessary learning materials due to copyright. In particular, he says popular books or television show translations would act as strong tools to teach algorithms.

“Given the absence of a clear rule to permit machine learning in Canadian copyright law (often called a text and data mining exception), our legal framework trails behind other countries that have reduced risks associated with using data sets in AI activities,” Geist writes.

Moving forward

As MT continues to evolve, there needs to be a more nuanced conversation surrounding the issues of security and how both data protection and copyright law pertain to translation automation. Until clear guidelines are created to address these topics, consumers will continue to have only a vague set of parameters to work with when using MT.

Clear and comprehensive regulations surrounding MT should be pursued with the same vigor used to advance the technology itself. The easier it is for translation industry practitioners and consumers to understand MT’s best practices, and the risks and legal implications that come along with MT, the better suited all parties will be to use it successfully.

It’s important to note, however, that notwithstanding the lack of direction for consumers, there are still steps that can be taken to mitigate security risks. Becoming familiar with upcoming data protection laws like the GDPR and staying abreast of current trends in data security will help to lessen the chance that data is compromised during the MT process.