That online channels are key to travel and tourism isn't news anymore; it's a simple fact. Nor is it new that social media and user generated content such as hotel reviews and user comments play a key role in guiding fellow travelers, or that these reviews influence buying decisions. However, in a 24/7 global economy, the online community increasingly expects this content to be available in its own language, and that is a challenge the localization industry has been struggling with for a while. We know how to get good-quality human translations done, but how do you deal with high volumes (at times, billions of words a day) of user generated content that needs to be translated into many different languages in near real-time at very low cost, something that can only be done with machines?
Taking a look at the use cases
The use cases in online travel and tourism are as broad as the business itself, but there are some common ones.
Localization of traveler reviews. No serious online travel site can do without social media integration and traveler reviews. Many sites also localize reviews from high-volume markets for smaller markets; for example, an English review might be translated into Thai to bolster the Thai website and make it more attractive to local customers.
Localization of special offerings and “low value” content. Online travel sites carefully distinguish offerings that have high value attached to them, such as the descriptions of top hotels and destinations, and a lot of analytics are performed behind the scenes to ensure that traffic is “sticky” and revenue is maximized. Yet at the same time, every site has “lower value” content that drives less revenue and does not necessarily warrant high-quality human localization, but has value nevertheless. Another such content type is short-term special offers, which can drive significant revenue yet can’t always be localized by humans in time for all markets.
Customer support and chatbot localization. There are multilingual customer support chat solutions staffed by human agents who speak only a single language but are expected to support a multilingual clientele. Increasingly, chatbot solutions also require real-time, high-quality translations, often including the ability to detect the customer’s language before localizing.
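To make the detection requirement concrete, here is a minimal sketch of the detect-then-translate step such a chat integration might perform. The langdetect library is just one open-source option for language identification, and mt_translate is a hypothetical stand-in for whatever MT service the platform actually uses:

```python
# Minimal detect-then-translate sketch for a support chat message.
# langdetect is one open-source language-ID option; mt_translate()
# is a hypothetical stand-in for the platform's real MT service.
from langdetect import detect  # pip install langdetect

AGENT_LANGUAGE = "en"  # the single language the human agents speak

def mt_translate(text: str, source: str, target: str) -> str:
    """Placeholder for a real machine translation call."""
    raise NotImplementedError("wire up the actual MT service here")

def handle_customer_message(text: str) -> str:
    """Detect the customer's language, then translate for the agent."""
    customer_lang = detect(text)          # e.g. "th", "de", "en"
    if customer_lang == AGENT_LANGUAGE:
        return text                       # no translation needed
    return mt_translate(text, source=customer_lang, target=AGENT_LANGUAGE)
```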
Big data analytics. Last but not least, big data analytics ingest ever-increasing volumes of data, but only a few solutions can deal with textual data in addition to numeric data, let alone support many different languages or process entities. Increasingly, these applications require a combination of language processing, machine translation and machine learning to provide the desired insights.
With very few exceptions, these use cases cannot be supported by human translation in a timely or cost-effective manner, but they can be addressed with sophisticated machine-based solutions that augment human localization capabilities and very much depend on them for quality control.
The challenges
The challenges associated with processing this kind of data are plentiful, but the key ones can be grouped together.
The first challenge with any machine-based localization solution is managing expectations. Just because a machine performs the localization, the quality will not magically be perfect, nor can the machine compete with humans on quality. As we’ve seen, most machine-based use cases are simply not feasible with humans processing the data, but they do require humans to train and maintain the systems in order to ensure, maintain and improve quality.
Quality is an obvious challenge. Every platform wants the best-quality content at the lowest possible cost. What is often forgotten with these kinds of use cases, however, is that even a human translator needs domain expertise, language pair expertise and guidelines such as style guides and glossaries to produce a good translation. With travel, geographic information often plays an important role as well, so having access to the relevant sources on how to translate or transliterate place and street names, for example, is equally challenging, even for a human. The machine faces the same challenge: unless it is custom trained with the right domain and auxiliary data, the resulting translations will be of poor quality.
In the case of user generated content, an additional quality dimension is added: source content quality. The source quality is unknown; for example, an English-language review written by a user with limited English skills results in a poor source and thus compounds the localization challenge.
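Assuming the system can estimate source quality per segment (a capability some quality-estimation models provide), a simple gate might look like the following sketch; the scoring function and the threshold value are illustrative placeholders:

```python
# Hypothetical gate for user generated content: translate only reviews
# whose estimated source quality clears a threshold. score_source_quality()
# and MIN_SOURCE_QUALITY are illustrative placeholders, not a real product API.
MIN_SOURCE_QUALITY = 0.5  # tune against human-reviewed samples

def score_source_quality(text: str) -> float:
    """Placeholder for a real quality-estimation model returning 0.0-1.0."""
    raise NotImplementedError

def select_reviews_for_localization(reviews: list[str]) -> list[str]:
    """Discard poor-quality sources instead of publishing poor translations."""
    return [r for r in reviews if score_source_quality(r) >= MIN_SOURCE_QUALITY]
```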
Another common challenge, especially for content-rich sites, is the sheer volume of data to be processed; millions of words a day, and sometimes billions when backlogs are processed, are not uncommon. This demands systems and workflows capable of handling such high volumes while also scaling up and down on demand to control infrastructure cost.
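As an illustration of what such a workflow might look like at the code level, the following sketch processes an arbitrarily large stream of segments in batches while keeping a bounded number of requests in flight; the batch size, worker count and translate_batch call are all assumptions, sized in practice to the capacity of the real MT engines:

```python
# Sketch of high-volume processing with bounded concurrency: at most
# MAX_WORKERS batches are in flight at once, so memory stays flat even
# for very large backlogs. Constants and translate_batch() are illustrative.
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator

BATCH_SIZE = 500   # segments per MT request
MAX_WORKERS = 8    # concurrent in-flight requests

def translate_batch(segments: list[str]) -> list[str]:
    """Placeholder for a bulk call to the MT service."""
    raise NotImplementedError

def batched(segments: Iterable[str], size: int) -> Iterator[list[str]]:
    it = iter(segments)
    while batch := list(islice(it, size)):
        yield batch

def translate_stream(segments: Iterable[str]) -> Iterator[str]:
    """Yield translations in order, with at most MAX_WORKERS batches in flight."""
    batches = batched(segments, BATCH_SIZE)
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        in_flight = deque(pool.submit(translate_batch, b)
                          for b in islice(batches, MAX_WORKERS))
        while in_flight:
            done = in_flight.popleft()
            nxt = next(batches, None)
            if nxt is not None:
                in_flight.append(pool.submit(translate_batch, nxt))
            yield from done.result()
```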
The actual translation is only one step in the localization process. Localization workflows for travel and related content often require complex pre-processing and post-processing: performing required conversions, dealing with titles that are nothing more than a list of keywords, identifying and transliterating locations on the fly, recognizing and converting currencies and measurements, applying the required styles and glossaries, and finally returning the content to the calling application in good order.
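The sketch below shows the shape of such a pipeline for one small slice of the problem: prices are shielded before translation and converted to the target currency afterwards. The regular expression, the conversion rate and the mt_translate placeholder are illustrative, and a production pipeline would add transliteration, glossary and style stages:

```python
# Illustrative pre/post-processing around the MT step. The regex, the
# exchange rate and mt_translate() are assumptions; real pipelines add
# transliteration, glossary and style handling as further stages.
import re

PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")

def mt_translate(text: str, source: str, target: str) -> str:
    raise NotImplementedError  # stand-in for the actual MT engine

def pre_process(text: str) -> tuple[str, list[str]]:
    """Shield prices from the MT engine by replacing them with tokens."""
    prices = PRICE_RE.findall(text)
    shielded = PRICE_RE.sub("__PRICE__", text)
    return shielded, prices

def post_process(text: str, prices: list[str], usd_to_thb: float = 36.0) -> str:
    """Restore shielded prices, converting USD to THB (rate is illustrative)."""
    for amount in prices:
        local = f"฿{float(amount) * usd_to_thb:,.0f}"
        text = text.replace("__PRICE__", local, 1)
    return text

def localize(text: str, source: str, target: str) -> str:
    shielded, prices = pre_process(text)
    translated = mt_translate(shielded, source=source, target=target)
    return post_process(translated, prices)
```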
Machine-based solutions
As with most business process support systems and IT solutions, the analysis of the use case is critical to developing a good understanding of needs and possible approaches. This understanding must include the data processing workflow; system integration requirements; the type and format of content being processed; the required output quality and format; and all financial parameters related to the project. In this context, it is critical to realize that these use cases pose both a localization and an IT challenge and need to be treated as such. For example, the needs of a customer support chat application aiming to provide customers with high-quality multilingual support differ significantly from those of customer review localization, where only a portion of the content needs to be placed on the target site. The former requires very high-quality real-time translations to ensure the customer receives good support; in the latter, content detected as poor-quality source material can simply be discarded, assuming the system providing the translation can score the data quality in real time. The following steps form a high-level guide on how to proceed with such projects.
Understand the use case and its dimensions and build the right team. Every use case is unique, and while similar building blocks can be reused, understanding the use case end-to-end is crucial to success. System integration, workflow, localization and conversion aspects, content sources, quality requirements, volumes and cost parameters are just some of the details required, and understanding them will also help everyone see which skills the project needs in order to succeed. When bringing in external partners, define their roles and integration points clearly and ensure they have the required subject matter expertise to perform their task and interact with your team effectively.
Design the architecture: workflow, machine translation engine and system requirements. Don’t cut corners in the early stages of the project; well-defined requirements and a solid architecture that is fit for purpose are fundamental prerequisites for success. As part of the architecture and design, also include all data-related analysis: for example, what metadata, source and target data will be provided or is needed to adequately process the content? What style guides and glossaries exist and need to be used? Do entities need to be recognized and processed in a specific manner, and does the system need to score content quality?
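One practical way to pin these data questions down is to define the translation request contract up front. The fields below are an illustrative guess at what such a contract might carry for a travel site, not a standard schema:

```python
# Illustrative translation-request contract capturing the data decisions
# made during architecture. Every field is an assumption about what a
# given project might need, not a standard schema.
from dataclasses import dataclass, field

@dataclass
class TranslationRequest:
    text: str
    source_lang: str | None          # None means "detect the language first"
    target_lang: str
    domain: str = "travel"           # selects the custom-trained engine
    glossary_id: str | None = None   # project glossary to enforce
    score_quality: bool = False      # return a quality estimate with the output
    metadata: dict = field(default_factory=dict)  # hotel ID, market, etc.
```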
Develop a custom engine and workflow, and ensure the required model and capacity are available. Once the requirements are understood and the architecture and data sources have been defined, the custom workflow and engines can be implemented. And yes, to achieve quality output, professional customization is required! This means that the workflow, the non-translation capabilities and the engines themselves all need to be customized.
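In code terms, customization might amount to binding each language pair to its own custom-trained engine; the registry below is hypothetical and the engine functions are placeholders:

```python
# Hypothetical engine registry: each language pair gets its own
# custom-trained engine. All names here are illustrative.
from typing import Callable

def en_th_travel_engine(text: str) -> str:
    raise NotImplementedError  # engine custom-trained on en->th travel data

def en_de_travel_engine(text: str) -> str:
    raise NotImplementedError  # engine custom-trained on en->de travel data

CUSTOM_ENGINES: dict[tuple[str, str], Callable[[str], str]] = {
    ("en", "th"): en_th_travel_engine,
    ("en", "de"): en_de_travel_engine,
}

def get_engine(source: str, target: str) -> Callable[[str], str]:
    if (source, target) not in CUSTOM_ENGINES:
        raise LookupError(f"no custom engine trained for {source}->{target}")
    return CUSTOM_ENGINES[(source, target)]
```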
Test, improve and test again. With complex content such as travel-related text, and specifically user generated content, testing is required to ensure that all data formats are processed correctly before going into production.
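Continuing the illustrative pipeline sketched earlier, a format-handling test might substitute a fake MT engine and verify that shielded entities survive the round trip:

```python
# Example format-handling test against the hypothetical pre_process()/
# post_process() pipeline above, with a fake MT engine so the test
# needs no live service.
def fake_mt_translate(text: str, source: str, target: str) -> str:
    assert "__PRICE__" in text          # the token must reach the engine intact
    return text                         # identity "translation" for the test

def test_prices_survive_the_pipeline():
    shielded, prices = pre_process("Deluxe room from $120.00 per night")
    assert prices == ["120.00"]
    out = post_process(fake_mt_translate(shielded, "en", "th"), prices)
    assert "__PRICE__" not in out and "฿" in out
```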
Monitor and create a feedback loop. Once released, the work is not done. Source content changes over time: terms change, locations change, patterns change, and continuous quality assurance is required to monitor quality. Ensure that there is a proper feedback loop. Translation quality will often improve quickly after an initial engine and workflow release, as issues surface when production volumes are processed, are addressed, and the engines are retrained. While the effort is largely front-loaded, it will decrease over time as the system matures.
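A minimal version of such a feedback loop might simply route low-scoring output to a review and retraining queue; the threshold and the queue itself are illustrative assumptions:

```python
# Minimal feedback-loop sketch: low-scoring translations are queued for
# human review and eventual engine retraining. The threshold and queue
# are illustrative assumptions.
import logging

REVIEW_THRESHOLD = 0.6
retraining_queue: list[tuple[str, str]] = []  # (source, machine translation)

def record_translation(source: str, output: str, quality: float) -> None:
    """Route weak output into the feedback loop instead of ignoring it."""
    if quality < REVIEW_THRESHOLD:
        logging.warning("Low-quality translation flagged (score=%.2f)", quality)
        retraining_queue.append((source, output))
```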
An increasing number of use cases are enabled only by machine-based solutions, and they are very feasible. But as with all complex IT projects, solid analysis and planning are key to success.