What Makes Big Data Fit for NMT Consumption? Even Machines Have Standards

Aidan is a manager at KantanMT. He started in the industry in 1991 and has worked with companies such as Berlitz (now Lionbridge), CPSL, and Siens Translation.


The proverb “you can’t make a silk purse out of a sow’s ear” is reputedly more than 400 years old. It long predates the more modern idiom “Garbage In: Garbage Out” (GIGO), a warning from the early days of software development. Both mean the same thing: if you put garbage in at one end, you can only get garbage out at the other. It is an inescapable law of logic. If you begin building on a flawed premise, or with a flawed set of data, the result will logically be flawed as well.

In the early days of program coding, it quickly became clear that if programmers wrote bad code, it inevitably produced malfunctioning software. Fortunately, malfunctioning software is easy to identify and can be addressed before any severe damage is done. However, in the age of big data analysis and machine learning, the GIGO rule is just as crucial, and ignoring it can result in a costly and time-consuming crisis.

It is a reality that an increasing number of important business decisions are now based on the outcome of complex data analysis. Data mining has become a global industry. The production of prompt, exact, and vetted data is now the cornerstone of many products’ development and marketing campaigns. Today, business decisions, sometimes of an existential nature, are based on this intelligence. And just as with a sniper’s aim, a small error of estimation at the front end is compounded over the mining process and can lead to a widely missed target.

Either prepare the data – or prepare to fail

“Errors using inadequate data are much less than those using no data at all.” This advice on the challenges of exploiting data hails not from some modern-day data analyst, but from Charles Babbage, the mathematician who invented the Analytical Engine in the early 19th century. Another 19th-century figure, albeit a fictional one, also spoke of the importance of data: “It is a capital mistake to theorize before one has data” (Arthur Conan Doyle’s Sherlock Holmes). Today’s tsunami of big data is estimated at 2.5 quintillion bytes each day, 300 times the 2005 volume, and that figure is growing exponentially each year, according to the Big Data London website. This enormous ocean of data is made up of text files, photographs, images, customer transactions, social media entries, financial data, and much more. In short, everything that is done online these days goes into that immense filing cabinet in the sky.

Of course, data stored in the cloud is useless to the end user if it is not sorted and filed in a helpful way. Poorly constructed data has an enormous impact on modern-day companies: IBM, for example, estimates that in the US alone bad data costs businesses $3.1 trillion annually. Meeting challenges such as improving processing performance, understanding variance and bias, finding dirty and noisy data, preventing concept drift, and exercising care around data provenance is key to ensuring a company is using the cleanest, most efficient, and most relevant data available.

Machine translation data

The good news is that bad data can be rescued. Although the data might be outdated, unstructured, unformulated, irrelevant, or corrupted, this is often due to the process it has gone through or the way people compiled it. Looking for information in a pile of bad data is akin to sticking your hand blindly into a bin full of sticky notes, hoping to find something relevant, and then basing an important business decision on your findings. Companies can, and do, rescue their data through a focused, structured, and professional engagement with this challenge.

Machine translation (MT) data preparation works on taking bad data and making it fit for purpose. The data can be generic or customized, depending on your needs. Generic MT engines, like Google Translate, Microsoft Translator, and Amazon Translate, serve general purposes and are not trained with data for a specific domain (e.g., medical, automotive, IT) or topic. Custom MT engines, on the other hand, are more finely tuned because they are trained with domain-relevant data, resulting in more exact MT output. However, they also come with a higher price tag.

Data-driven neural machine translation (NMT) models, the most widely used form of MT today, are at their most efficient when the data is cleared of noise and then used to train an engine that meets the needs of the end user. NMT is an approach built on deep neural networks. There are a variety of network architectures used in NMT, but typically the network can be divided into two components: an encoder that reads the input sentence and generates a representation suitable for translation, and a decoder that generates the actual translation.
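To make that encoder-decoder split concrete, here is a minimal sketch in PyTorch. It assumes a simple GRU-based recurrent model with illustrative layer sizes; it is not any particular vendor’s architecture, and production NMT systems (including transformer-based ones) add attention, subword tokenization, beam search, and much more.

```python
# Minimal encoder-decoder sketch (illustrative only).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src_ids):
        # src_ids: (batch, src_len) token IDs of the source sentence
        _, hidden = self.rnn(self.embed(src_ids))
        return hidden  # fixed-size representation of the input sentence

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt_ids, hidden):
        # tgt_ids: (batch, tgt_len) previously generated target tokens
        output, hidden = self.rnn(self.embed(tgt_ids), hidden)
        return self.out(output), hidden  # scores over the target vocabulary

# Toy usage: encode a source sentence, then decode one step at a time.
encoder, decoder = Encoder(vocab_size=8000), Decoder(vocab_size=8000)
src = torch.randint(0, 8000, (1, 7))                 # a 7-token "sentence"
state = encoder(src)
logits, state = decoder(torch.tensor([[1]]), state)  # start-of-sentence token
```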

The proactive approach to perfecting data saves a fortune in time and money spent on a project. Furthermore, cleaned and improved data is fed back into the engine once translated. In some NMT models, human translators work in real-time to improve the MT-produced output. These improved texts are also fed back into the engine to be reused in future projects.
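As a rough illustration of that feedback loop, the sketch below shows a post-edited segment being appended to a training corpus file so it can be reused the next time the engine is retrained. The file format, function name, and workflow are assumptions made purely for illustration, not a specific vendor’s API.

```python
# Sketch of the post-editing feedback loop: the human-approved pair is
# appended to the training corpus for reuse in the next retraining run.
import csv

def record_post_edit(source, post_edit, corpus_path="train_corpus.tsv"):
    """Append a human-approved (source, target) pair to the training corpus."""
    with open(corpus_path, "a", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter="\t").writerow([source, post_edit])

# Example: the post-edited translation, not the raw MT output, is stored.
record_post_edit("Press the power button.", "Appuyez sur le bouton d'alimentation.")
```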

How to fix it? 

As with most problems in business, a well-planned approach will avoid the pitfalls described above. The following steps are elementary yet essential when tackling the use and translation of big data and exploiting that most valuable of company assets.

  • Design a tightly controlled dataflow pipeline from the ingestion of data to the interrogation of it for business intelligence.
  • Map your data to meet your business needs.
  • Regularly test the integrity of your system and content (a minimal example of such a check is sketched after this list).
  • Employ professional data engineers and data analysts. This is not a part-time post for an intern.
  • Feed reliable and tested data into a process if you want to ensure quality data coming out.
  • Employ the best business intelligence tools that will allow you to perfect your data sources.
  • Ensure that your translation company supplies data cleansing and training services.
  • Ensure that newly translated data — an asset — is directed back into the NMT engine you use.
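As a concrete illustration of the integrity testing mentioned above, the following is a minimal Python sketch, not any particular vendor’s tool, of a pre-training pass over a parallel corpus. The length-ratio threshold and the specific filters are illustrative assumptions; real cleaning pipelines also handle encoding problems, language identification, and tag consistency.

```python
# Drop empty segments, exact duplicates, and pairs whose length ratio
# suggests the source and target are misaligned.
def clean_parallel_corpus(pairs, max_len_ratio=3.0):
    """pairs: iterable of (source, target) strings; returns a cleaned list."""
    seen, cleaned = set(), []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:                      # empty on either side
            continue
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio:                   # likely misaligned pair
            continue
        if (src, tgt) in seen:                      # exact duplicate
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned

raw = [("Hello world.", "Hallo Welt."),
       ("Hello world.", "Hallo Welt."),             # duplicate
       ("OK", ""),                                  # empty target
       ("Short.", "Ein sehr, sehr langer und offensichtlich falsch "
                  "ausgerichteter Zieltext.")]      # suspicious length ratio
print(clean_parallel_corpus(raw))                   # keeps only the first pair
```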

However, if you have numerical data, scientific formulae, or product specifications such as measurements with many part numbers, the text contains a high diversity of individual tokens rather than simple words. This type of data takes a little longer to normalize because you must train the engine to recognize it, which can be done using something like the GENTRY programming language and named entity recognition (NER) software. Your NMT provider must take the time to follow this process because, if the engine doesn’t recognize the data, it can’t manage it during the translation phase. Some MT companies will carry out this training for free. There are three areas where an NMT company can help improve data: making it suitable, relevant, and of proper quality. The better MT companies have ready-to-go tools for dealing with these issues and are happy to offer the service.
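As an illustration of what such normalization involves, here is a generic, regex-based sketch in Python. It is not GENTRY and not NER software; the patterns and placeholder names are assumptions chosen purely for the example, but the principle of masking high-diversity tokens such as part numbers and measurements is the same.

```python
# Replace high-diversity tokens (part numbers, measurements) with stable
# placeholder tokens so the engine sees a recognizable, repeatable pattern.
import re

PATTERNS = [
    (re.compile(r"\b[A-Z]{2,}-\d{3,}\b"), "<PART_NUM>"),       # e.g. XK-20431
    (re.compile(r"\b\d+(\.\d+)?\s?(mm|cm|kg|V|W)\b"), "<MEASURE>"),
]

def normalize(segment):
    """Mask part numbers and measurements in a single text segment."""
    masked = segment
    for pattern, placeholder in PATTERNS:
        masked = pattern.sub(placeholder, masked)
    return masked

print(normalize("Fit bracket XK-20431 using a 6.5 mm bolt rated at 12 V."))
# -> "Fit bracket <PART_NUM> using a <MEASURE> bolt rated at <MEASURE>."
```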

Big data is a fact, and it is an asset that can often only be exploited using NMT services. In the past, the idea of translating the volume of data traffic now flowing would have been laughed at. With the constantly improving MT services now available at the click of a button, that data can be unlocked to provide its owner with an invaluable asset. However, companies would be better off using no data at all than relying on bad or dirty data to run a business. Fortunately, MT companies have the tools and skills to ensure that data used in translation projects meets the required grade of quality and suitability. When treated with respect and managed effectively, data can open up a whole new world of markets. And if, as a company, you don’t believe you will need to embrace the science of big data and the power of MT, the following might change your mind:

“Every company has big data in its future, and every company will eventually be in the data business.” – Thomas H. Davenport, cofounder, International Institute for Analytics.