Language technology drives quality translation

By Aljoscha Burchardt, Arle Lommel, Georg Rehm, Felix Sasaki, Josef van Genabith & Hans Uszkoreit April 9, 2014

Language technology is becoming increasingly important as organizations try to deal with the explosion of digital content and increasing demands for localized versions of this content. Fifteen years ago organizations that published content in more than ten languages were considered to be unusual, and those that dealt with more than 50 could probably have been counted on one hand. Today, however, it is not uncommon for organizations to produce content in dozens of languages, and increasing numbers are now dealing with in excess of 200 languages to one extent or another.

This large-scale change has driven interest in language technology because the human-oriented approach that worked well when FIGS (French, Italian, German and Spanish) was considered sufficient for international business has difficulty scaling to deal with 50 or 100 languages. Additional factors driving the shift include the rising importance of user-generated content and social media; the need for multilingual business intelligence; and the exponential increase in the volume of content that has been enabled through digital technologies.

The translation and localization industry has long used technology in the form of translation memory (TM) and terminology management systems, but for a variety of reasons it has not embraced other forms as readily. Most language technologies today have been deployed as monolingual applications without the multilingual support required by translators.

Machine translation (MT) is currently the best-known example to the public at large, driven largely by the success of free services pioneered by AltaVista’s Babelfish and then made truly mainstream by Google Translate. The translation and localization community’s acceptance of MT for production purposes has been considerably more reluctant and cautious, but even here it is making significant inroads. This increasing acceptance is leading to more interest in other types of language tech, such as grammar checking, personal assistants (such as Siri or Google Now) or opinion mining.

Considering just MT for the moment, it is no secret that it has not always lived up to the claims of proponents, some of whom have been predicting for at least the last 50 years that near-human translation quality is always just ten to fifteen years away. However, in the last decade, more and more organizations have embraced MT as a pragmatic way to help meet their translation requirements, often in combination with human translation and post-editing.

MT is currently at a crux: existing technologies have delivered great rewards, but their rate of progress has slowed as they have matured. This is not to say that technologies such as statistical machine translation (SMT) have played out, but rather that the “easy” gains in productivity and quality have been made and further improvements will require increasing amounts of effort. Many of the advances in recent years have come from combining various approaches and tools to take advantage of the strengths each has to offer. Examples of such combinations include hybrid MT systems (that combine the deep linguistic knowledge of rule-based systems with statistical systems’ ability to learn from existing translations) and systems that integrate TM, terminology management and MT into translation cockpits run by human translators.

Where MT has continued to have trouble, however, is in matching the quality expectations set by human translators. Although many human translators use MT output as a reference in their translations, MT is alternately treated as a source of jokes and as an existential threat by many translators. At a recent international translation conference, developers of MT in particular were publicly compared to the “makers of the atomic bomb” for trying to take away translators’ livelihoods. But setting such hyperbolic comments aside, it is clear that large amounts of content presently go untranslated because it is not cost- or time-effective to use human translators, especially when it is impossible to predict in advance what content will be needed when and by whom.

Other language technology applications have the potential to contribute significantly to the quality and success of MT, but so far many of them have been implemented only in standalone applications, often developed and maintained as research tools. This lack of interoperability has proved to be a barrier to integrating these tools, as production users seldom have the expertise needed to take often poorly documented open-source projects and combine them. However, the recent development of the Internationalization Tag Set (ITS) 2.0 specification provides one way for tools to interact in a standards-based approach that does not require users to hard code workflows by adapting software to every use case.
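As a rough illustration of what standards-based interaction can look like, the sketch below reads the local its:translate attribute from a small XML document and collects only the text that is marked as translatable. It is a minimal sketch under stated assumptions: the sample document and the function name translatable_text are invented for illustration, and only local its:translate markup is handled, while global rules and the other ITS 2.0 data categories are ignored.

```python
import xml.etree.ElementTree as ET

# Namespace of the Internationalization Tag Set.
ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE = f"{{{ITS_NS}}}translate"


def translatable_text(element, inherited=True):
    """Collect text chunks that local its:translate markup leaves translatable.

    Hypothetical helper: element content is treated as translatable by
    default, and child elements inherit the setting of their parent.
    """
    flag = element.get(TRANSLATE)
    translate = inherited if flag is None else (flag == "yes")
    chunks = []
    if translate and element.text and element.text.strip():
        chunks.append(element.text.strip())
    for child in element:
        chunks.extend(translatable_text(child, translate))
        # Tail text follows the child but belongs to the current element.
        if translate and child.tail and child.tail.strip():
            chunks.append(child.tail.strip())
    return chunks


# Invented sample document, not taken from the specification.
SAMPLE = """<doc xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0">
  <p>Press <code its:translate="no">Ctrl+S</code> to save.</p>
  <p its:translate="no">ACME-42</p>
</doc>"""

root = ET.fromstring(SAMPLE)
print(translatable_text(root))
```

Because the markup travels with the content, any tool in a workflow can apply the same rule without being adapted to the tool that produced the file.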

These issues are especially critical in the European Union (EU), with 24 official languages, 38 recognized minority languages and substantial communities speaking other non-European immigrant languages. Because EU law requires that EU citizens be able to communicate with their government in the official language(s) of their countries and that they be able to access the law in those languages, the EU invests huge sums of money in translation.

As META-NET’s recent study “Europe’s Languages in the Digital Age” pointed out, despite this investment, at least 21 European languages face the real possibility of “digital extinction.” With little or no language technology development, speakers of many of these languages find themselves unable to communicate using their own languages and are forced to use foreign languages (usually English) or to stay silent. In addition, the discussion about issues of pan-European concern remains largely separated into language communities. As a result, there is lively debate among French speakers about the role of nuclear power in the post-Fukushima world, and German speakers have similar discussions, but they are not speaking with each other unless they engage with each other in English. But what are monolingual speakers of Basque or Maltese to do? If they are to take their place as equals in global society, they will require the assistance of integrated language technology that goes beyond just MT.

High-quality MT

In the last decades, translation quality has emerged as a major business issue for international businesses. Because so much of an organization’s public image depends on the quality of the text it produces, organizations all claim to want the highest quality. Unfortunately, “quality” itself has been an elastic concept that often amounts to subjective and highly variable impressions from individuals. A joint survey conducted by the EC-funded QTLaunchPad project and the Globalization and Localization Association (GALA) revealed that over 60% of language service provider (LSP) respondents either used an internal quality model or had no (formal) quality model at all. These results show that we, as an industry, still lack any systematic approach to the quality question.

Laying aside the problem of a universal quality definition for a moment, Europe is the region where the lack of fast, affordable quality translation hurts most. Although MT systems keep improving, as can be observed in the performance increase of popular, freely available online translation services, their output is generally unusable for almost all outbound translation demands, and often not even as a source for cost-effective post-editing.

The problem is that most current translation services follow a one-size-fits-all approach. Commercial LSPs and large institutional users of MT instead need to be able to tune MT effectively and at low cost to their production requirements. They need automatic tools to recognize MT output in at least three categories:

MT output that can be used as is. Such output may not be perfect, but instead meets requirements and specifications.

MT output that can easily be fixed to meet specifications (such as post-editable content).

MT output that should be discarded. In many cases it is faster and more efficient to translate from scratch rather than to post-edit bad MT output.

When tools can identify these quality groupings for a variety of scenarios, rather than providing a generic score for a large batch of translations, delivering entire texts of unknown quality to readers or expecting humans to post-edit all content, they can work with humans to leverage human strengths. Such strengths include resolving the meaning of difficult passages, translating expressions not previously seen by statistical implementations and translating texts where artistry or careful attention to nuance is needed.
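The three categories described above can be sketched as a simple routing step. This is a minimal sketch, assuming that some quality estimation component supplies a per-segment score between 0 and 1; the Segment class, the triage function and both threshold values are hypothetical and would have to be tuned to each organization's specifications.

```python
from dataclasses import dataclass
from enum import Enum


class Triage(Enum):
    USE_AS_IS = "use as is"
    POST_EDIT = "post-edit"
    DISCARD = "discard and translate from scratch"


@dataclass
class Segment:
    source: str
    mt_output: str
    estimated_quality: float  # hypothetical score in [0, 1], higher is better


def triage(segment, accept_at=0.85, discard_below=0.40):
    """Route one MT segment into one of the three categories.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    score = segment.estimated_quality
    if not 0.0 <= score <= 1.0:
        raise ValueError("estimated_quality must lie between 0 and 1")
    if score >= accept_at:
        return Triage.USE_AS_IS
    if score < discard_below:
        return Triage.DISCARD
    return Triage.POST_EDIT


segments = [
    Segment("Guten Morgen.", "Good morning.", 0.93),
    Segment("Bitte Gerät neu starten.", "Please restart device new.", 0.61),
    Segment("Das ist nicht mein Bier.", "That is not my beer.", 0.22),
]
for seg in segments:
    print(seg.mt_output, "->", triage(seg).value)
```

Deciding per segment, rather than per batch, is what lets good output flow straight through while human attention goes only where it pays off.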

By focusing on how humans interact with MT output, the QTLaunchPad consortium is advocating for a paradigm shift. Instead of simply adjusting existing MT systems to produce marginally better results, it calls for a novel, human-centric approach to MT. This approach systematically addresses the goal of producing quality translations and takes into account the needs and priorities of LSPs, translators and requesters of translations. To pave the way toward a human-centric high-quality MT paradigm, the QTLaunchPad consortium has cooperated with stakeholder organizations such as GALA, Fédération Internationale des Traducteurs (FIT) and LSPs to develop tools and technologies that support this vision:

The QuEst system provides fully automatic assessment of translation quality. Such assessment can be conceived of as a rough equivalent to the match rates that TM systems use to indicate how close a match is to what has already been translated and can guide translators in decisions about whether to accept, post-edit or reject sentences.

Improvements to the open-source translate5 tool for editing and reviewing translations.

An extension of the META-SHARE language technology exchange repository for MT research and development.

The Multidimensional Quality Metrics (MQM) system for assessing quality with accompanying tools (such as a translation quality score card and an implementation within translate5).

The last item addresses the long-running disconnect between various one-size-fits-all systems for assessing translation quality. Rather than imposing yet another list of errors that all translations must avoid, MQM works within a framework defined by the ISO/TS-11669 standard to link quality assessment to translation requirements defined at the earliest stages of the translation process. It builds on existing specifications, such as SAE J2450 and the LISA quality assurance model, to create a flexible model for defining quality metrics, which vary to meet specific needs (see Figure 1). By using a shared framework, users can compare their results and tune them to their needs rather than being forced to use a generic model that may or may not apply. It also addresses issues in both source and target texts to allow the causes of problems to be identified and fixed.
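The idea of one shared framework with metrics that vary by requirement can be illustrated with a small scoring sketch. It is not the MQM scoring model itself: the severity weights, the issue type names, the two example metrics and the formula (one minus the weighted penalties per word) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Assumed severity weights; a real metric would set these from its specifications.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}


@dataclass(frozen=True)
class Issue:
    issue_type: str  # e.g. "accuracy" or "style" (illustrative names)
    severity: str    # one of the keys of SEVERITY_WEIGHTS


def quality_score(issues, word_count, type_weights):
    """Return a score in [0, 1] for one translation under one metric.

    type_weights says how much each issue type matters for the project;
    issue types missing from it are ignored by that metric.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    penalty = sum(
        type_weights.get(issue.issue_type, 0.0) * SEVERITY_WEIGHTS[issue.severity]
        for issue in issues
    )
    return max(0.0, 1.0 - penalty / word_count)


# Two hypothetical metrics built from the same issue types.
LEGAL_METRIC = {"accuracy": 2.0, "terminology": 2.0}
MARKETING_METRIC = {"accuracy": 1.0, "style": 1.0}

issues = [Issue("accuracy", "major"), Issue("style", "minor")]
print(quality_score(issues, 100, LEGAL_METRIC))
print(quality_score(issues, 100, MARKETING_METRIC))
```

The same annotated issues produce different scores under the two metrics, yet both results remain comparable because they draw on one shared vocabulary of issue types.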

MQM was developed to address both human translation and MT and bring them both under one set of quality metrics. While MQM will not replace MT metrics such as BLEU and METEOR, it is currently being used to drive research in the quality barriers that impact MT to identify those factors that help differentiate quality translations from those that are not usable. For example, it has helped identify certain grammatical features that are of particular concern and that correlate most strongly to human assessments of quality. The focus of this research has been on high-quality and “almost good” translations. Here it is enabling research rather different from traditional MT research, which has tended to emphasize quality improvements at the low end of the quality spectrum. The QTLaunchPad project is currently in the process of releasing MQM to the GALA CRISP program, where it will be maintained and developed by the industry as a free and open specification.

A second project, QTLeap, is focusing on improving MT by providing better integration between SMT methods and deep linguistic knowledge. SMT systems seem to have reached a point where it is difficult to achieve further quality improvements in a purely data-driven way. Despite widespread recognition of the advantages that linguistic knowledge can add to statistical methods, there has been a relative deficit in principled research in this direction. This lack of research is partially due to the fact that SMT systems that focus on the textual surface with little linguistic knowledge have done comparably well to this point. But theoretically, systems that focus on structure and meaning should be able to deliver better results and be less sensitive to the particularities of individual languages.

In order to pave the way for higher-quality MT, the goal of QTLeap is to deliver an articulated methodology that explores deeper, more semantic language engineering approaches such as using the sophisticated formal grammars that have become available in recent years. This new approach is further supported by progress in lexical processing that has been made possible by enhanced techniques for referential and conceptual ambiguity resolution, and supported by new types of datasets recently developed as linked open data. In order to ensure that the MT developments include performance improvements in a realistic scenario, the project consortium includes a company in the process of making its PC help desk services multilingual. The goal is to have a monolingual helpdesk database and to machine translate user requests and answers from the database.

Taken together, QTLaunchPad and QTLeap are pointing toward a new future in which MT takes the best of current developments and combines them into a human-centric approach.

Content analytics

Although translation is the most visible language technology application (we notice immediately when we cannot access a web page or enjoy a YouTube video in a foreign language), it is not the only application that impacts our industry. MT can be used only for content that is already known to be relevant; it cannot directly assist us when we do not know that certain content is relevant. We are used to browsing through the first few hits on Google and other search engines in order to find answers to questions, but we are seldom aware of, or care about, the content we don’t find because we do not see it. This is especially true if the content is “hidden” in multimedia or in database applications (the so-called “deep web”) that cannot be found through simple keyword searches, and even more so when multilingualism is a factor.

Thus we may miss the most relevant content simply because it is in another language or does not exist in a textual format. For example, someone in the Basque Country may be looking for particular information on nuclear energy policy in Europe but not find the information because it is in a German-language YouTube video. Because such content will not be found, it will also generally not be translated. Similarly, if we have technical problems, the answers may exist in user forums, but finding the right answers from hundreds of incorrect, outdated or simply irrelevant search results is already a significant problem, even before different languages are factored in.

The answers to some of these difficulties can be found in recent developments in content analytics, a set of technologies for making sense of data. This definition is as broad as the diverse set of application scenarios to which content analytics applies: sentiment analysis, business intelligence, opinion mining, intelligent web search and many others.

There are some basic technologies that are common to all content analytics applications. Named entity recognition helps to identify unique concepts and allows for disambiguation (“Paris” the city vs. “Paris” the mythological figure). Relation extraction identifies relations between entities (the city “Paris” is located in the country “France”).
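A toy sketch can make these two building blocks concrete. It is only an illustration, assuming a tiny hand-written gazetteer with context cue words and a single surface pattern for the “located in” relation; the entity identifiers, cue words and function names are invented, and production systems typically rely on trained models and much larger knowledge resources.

```python
import re

# Hypothetical gazetteer: each surface form lists candidate entities with cue words.
GAZETTEER = {
    "Paris": [
        {"id": "Paris_(city)", "type": "City",
         "cues": {"france", "capital", "city"}},
        {"id": "Paris_(mythology)", "type": "MythologicalFigure",
         "cues": {"troy", "helen", "trojan"}},
    ],
    "France": [{"id": "France", "type": "Country", "cues": set()}],
}


def recognize(text):
    """Find known names and pick the candidate whose cues best match the text."""
    words = set(re.findall(r"\w+", text.lower()))
    entities = []
    for surface, candidates in GAZETTEER.items():
        if re.search(rf"\b{re.escape(surface)}\b", text):
            # On a tie, max() keeps the first listed candidate.
            best = max(candidates, key=lambda c: len(c["cues"] & words))
            entities.append((surface, best["id"], best["type"]))
    return entities


# One illustrative surface pattern for the "located in" relation.
LOCATED_IN = re.compile(r"\b(\w+) is (?:located )?in (\w+)\b")


def extract_relations(text, entities):
    """Link recognized entities that appear in an 'X is (located) in Y' pattern."""
    ids = {surface: entity_id for surface, entity_id, _ in entities}
    return [
        (ids[subj], "locatedIn", ids[obj])
        for subj, obj in LOCATED_IN.findall(text)
        if subj in ids and obj in ids
    ]


sentence = "Paris is located in France."
found = recognize(sentence)
print(found)
print(extract_relations(sentence, found))
```

Even this toy version shows why such components tend to be language specific: the cue words and the relation pattern above work only for English.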

The actual implementation needed for content analytics technology is often quite language specific. It is no surprise that technology support for English is predominant. One current challenge in content analytics is to make such implementations available for a wide range of languages.

More and more multimedia content is being created on the web, which leads to another challenge: current content analysis technologies and implementations are largely limited to working with textual data. These tools will need to be enhanced to cover non-textual content such as audio. For example, consider the query, “find me all movies with kiss scenes at the end.” So far no video on demand portal has such information available, but content analytics can help create it. Since the quality of audio or video analysis is still rather limited, multimedia content analytics also relies on available textual information, such as subtitles and closed captions.

It now becomes clear that for improving multilingual and multimedia content analytics, one needs information about the content. More and more structured data sources are currently being created, without necessarily having content analytics in mind. A prominent example is Wikipedia’s infoboxes. These provide structured information that can serve as seed information to improve content analytics. Wikipedia is also useful since it provides links between languages. In this way, it can build the path to truly multilingual content analytics.

Other structured and partially also multilingual knowledge resources that are relevant to content analytics include Freebase, the Wikipedia-based Wikidata effort and BabelNet. In terms of linked data, a cloud of more and more data sets is being created, and a growing (although still rather small) portion of this linked data is multilingual.

The main challenge for realizing the opportunities for content analytics and linked data is that, so far, the relevant communities are not aware of each other. Language experts have multilingual data available in various forms, such as lexicons, term bases and TMs. Linked data specialists create structured data out of resources such as Wikipedia, resulting in DBpedia, the linked-data counterpart to the resources created by language specialists. Finally, providers or consumers of multilingual and multimedia content may have ideas about requirements for processing multimedia items, but are generally not aware of the possibilities that content analytics may give them.

In all these cases, different groups are facing the same issues in different ways. Language specialists don’t know how to convert data into multilingual linked data since established approaches to achieve this conversion do not exist yet. Linked data specialists, on the other hand, are generally unaware of the requirements imposed by multilingual data and often design systems that do not work properly with it. In any event, there are not yet many content analytics applications that make use of linked data in the way described here.

Linked data and multilingual/multimedia content analytics is the core topic of the LIDER project (http://lider-project.eu/). LIDER will build the path toward a linked-data cloud of linguistic information to support content analytics tasks in unstructured multilingual cross-media content. The LIDER consortium consists of key research groups in the realm of both language technologies and linked data. With input from various industries, LIDER is creating a roadmap around industry-focused content analytics use cases with a view toward defining needed research steps. To ensure that it gathers input from a range of communities, LIDER’s outreach and dissemination efforts are taking place via the World Wide Web Consortium (W3C) and its MultilingualWeb initiative. This initiative has proven highly successful in bringing together diverse groups of people interested in multilingual issues.

Developing language technologies

In an effort to combat Europe’s linguistic fragmentation and to support the goals of the European Commission toward a single digital market, the EU funded the development of META-NET, composed of 60 research centers in 34 European countries dedicated to the technological foundations of a multilingual, inclusive and innovative European society. META-NET created the Multilingual Europe Technology Alliance (META), with more than 750 organizations.

META-NET worked to foster monolingual, crosslingual and multilingual technology support for all European languages (Figure 3). The future paths laid out in its Strategic Research Agenda (SRA) for Multilingual Europe 2020 are connected to application scenarios that will provide European research and development with the ability to compete with other markets and subsequently achieve benefits for European society and citizens as well as opportunities for the European economy. Two themes focus upon core technologies and resources for Europe’s languages and a European service platform for language technologies.

The goal of many of these projects and currently planned actions is to turn META-NET’s joint vision into reality and enable large-scale opportunities for the whole continent.

An important aspect of META-NET’s suggestions centers around the idea of providing high-quality translingual technologies instead of focusing on tools for gist translation. Projects are already working actively on the topic by systematically identifying barriers for quality translation and pushing their boundaries. In addition, META-NET has worked to lower the barriers to access for current language technology applications and resources through META-SHARE, an online portal that provides access to these resources.

After years of development in disconnected projects, language technology is finally being adopted by users around the world to meet their requirements for accessing content and interacting with one another. While there is still a long way to go, a variety of developments, many of them centered in Europe, are starting to break through the barriers. As new projects appear, the shift is toward a user-centric perspective and toward adoption and integration.