ITS 2.0: Next generation multilingual content production

By Christian Lieske, Arle Lommel & Felix Sasaki March 3, 2014

Market analyses show that commercial translation currently represents an annual market of $21–$26 billion. However, the current lack of interoperability, automation and standards in multilingual content production is costly. To quote Kimmo Rossi of the European Commission, “Especially in Europe, commercial translation represents a huge annual market, which enables hundreds of billions of euro of cross-border business, including new opportunities on the growing digital single market. The lack of standards for exchanging information about translations is estimated to cost the industry as much as 20% more in translation costs, amounting to billions of euro.”

The W3C Internationalization Tag Set 2.0 (ITS 2.0), a recently published standard developed by the W3C MultilingualWeb-LT Working Group, enhances the foundation for more efficient multilingual content production. Through the LT-WEB project, the European Commission supported the development of ITS 2.0. ITS 2.0 will increase translation efficiency by providing mechanisms to integrate automated processing of human language into various process phases more easily. At the heart of these mechanisms are standardized metadata items, so-called “data categories,” each of which conveys one kind of information about content, such as the domain a text belongs to. See a short explanation of the data categories at www.w3.org/TR/its20/#basic-concepts-datacategories.

Although ITS 2.0 targets core web technologies such as HTML5, its reach and applicability are wider. It can also easily be used in plain database scenarios, or with technologies such as ePUB (for electronic books) that draw on HTML5.

ITS 2.0 bears many commonalities with its predecessor ITS 1.0, but provides additional concepts that are designed to foster the automated creation and processing of multilingual web content. ITS 2.0 focuses on HTML and on XML-based formats in general, and can leverage processing based on the XML Localization Interchange File Format (XLIFF), as well as the Natural Language Processing Interchange Format (NIF).

Even though ITS 2.0 is a young standard, various implementations and end-to-end scenarios have already been realized. They show that ITS 2.0 can reduce production time and enable cost savings in management and translation.

Holistic view on multilingual content production

It is often helpful to break down multilingual content production into two dimensions: a static one that is related to source content, and a dynamic one that is related to multilingual production. The static dimension, for example, relates to content authors who may have to mark part of their content, such as a trademarked term, as “do not translate.” The dynamic dimension connects to the area of machine translation (MT) that may need to leave content with specific features untranslated.

Although ITS 1.0 made no assumptions about possible phases in a multilingual production process chain, it was slanted toward a simple three-phase write, internationalize, translate model — a model that clearly is inadequate in scenarios where there is no time for an internationalization phase. Thus, ITS 2.0 explicitly targets a much more comprehensive model for multilingual content production — one that more thoroughly addresses both static and dynamic necessities of today’s multilingual content production. The ITS 2.0 model comprises support for multilingual content production phases such as internationalization; preproduction such as terminology marking; automated content enrichment for something like automatic hyperlinking; segmentation; leveraging existing translation-related assets such as translation memories (TMs); MT; quality assessment or control of source language or target language content; and so on.

A real-world example follows the extraction and translation of HTML5 content using the open-source, cross-platform Okapi Framework. HTML5 is the latest incarnation of HTML, the internet’s dominant format for browser-based textual content. Multilingual content production for or with HTML5 involves various challenges and opportunities. ITS 2.0-related implementations such as the Okapi Framework address these challenges.

The sample scenario starts with an HTML5 document. This document travels through various phases, including pre-translation with TMs and MT, and ends with an XLIFF document. The Okapi Framework uses various ITS 2.0 data categories in the overall process. The complete scenario encompasses the following phases:

Extraction of translatable text from the HTML5 document

Segmentation of extracted text into units (easy to process in the subsequent steps and for the human translator)

Leveraging using TMs

MT using an online system

Quality checking MT output

Generation of an XLIFF file for translation
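The phases listed above can be sketched in a few lines of Python. This is a toy illustration, not the Okapi Framework's API: the function names, the in-memory TM dictionary and the stand-in MT function are all assumptions, and the extraction and quality-checking phases are left out for brevity. Only the XLIFF 1.2 element and attribute names are taken from that standard.

```python
import re
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def segment(text):
    """Naive segmentation: split after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pretranslate(segments, tm, mt):
    """Leverage the TM first and fall back to MT; keep track of the origin."""
    units = []
    for source in segments:
        if source in tm:
            units.append((source, tm[source], "tm"))
        else:
            units.append((source, mt(source), "mt"))
    return units


def to_xliff(units, source_lang, target_lang, original):
    """Serialize (source, target, origin) triples as a minimal XLIFF 1.2 file."""
    ET.register_namespace("", XLIFF_NS)
    xliff = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(xliff, f"{{{XLIFF_NS}}}file", {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "html",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for number, (source, target, origin) in enumerate(units, start=1):
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": str(number)})
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = source
        # Assumption: exact TM matches count as translated, MT output needs review.
        state = "translated" if origin == "tm" else "needs-review-translation"
        ET.SubElement(unit, f"{{{XLIFF_NS}}}target", {"state": state}).text = target
    return ET.tostring(xliff, encoding="unicode")


# Hypothetical resources: a one-entry TM and an "MT system" that only tags its input.
tm = {"Open the lid.": "Öffnen Sie den Deckel."}
fake_mt = lambda text: f"[MT] {text}"

units = pretranslate(segment("Open the lid. Insert paper."), tm, fake_mt)
xliff = to_xliff(units, "en", "de", "sample.html")
```

A real tool chain would replace the sentence splitter with proper segmentation rules and add fuzzy matching, but the order of the steps stays the same.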

As indicated, the ITS 2.0 data categories add value to the Okapi-based processing by increasing efficiency and enabling significant cost savings. In particular, the following data categories are used. The Translate data category expresses information about whether a selected piece of content is intended for translation or not. This is relevant for leveraging, MT and the XLIFF generation step, since parts that are not translation-relevant can be treated in special ways. This information can be added automatically or manually during the content creation process.
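As a minimal sketch of how an extraction step might honor the Translate data category, the following Python uses the standard library's HTML parser to separate translatable text from protected text. It assumes well-formed markup in which every non-void element is explicitly closed, and it handles only the local translate attribute, not global rules. The class name and sample sentence are illustrative.

```python
from html.parser import HTMLParser

# Elements without an end tag; they cannot contain text, so they are skipped.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class TranslateFilter(HTMLParser):
    """Sort text nodes into translatable and protected lists."""

    def __init__(self):
        super().__init__()
        # One flag per open element; by default, content is translatable.
        self._flags = [True]
        self.translatable = []
        self.protected = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        value = (dict(attrs).get("translate") or "").lower()
        if value == "no":
            flag = False
        elif value == "yes":
            flag = True
        else:
            flag = self._flags[-1]  # no local value: inherit from the parent
        self._flags.append(flag)

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and len(self._flags) > 1:
            self._flags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            bucket = self.translatable if self._flags[-1] else self.protected
            bucket.append(text)


extractor = TranslateFilter()
extractor.feed(
    '<p>Install the <span translate="no">Acme Holographic Presenter</span> today.</p>'
)
```

After the call, extractor.protected holds the product name, which a leveraging or MT step can then pass through unchanged.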

The Elements within Text data category expresses how the content of an element is related to the text flow. Does it constitute its own segment, such as a paragraph? Is it part of another segment, such as an emphasis marker? This information is relevant for extraction. Elements are either treated as separate flows or as inline codes, and confusion about this point can lead to faulty segmentation.
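The effect on segmentation can be shown with a small Python sketch. It assumes a short, hand-picked list of elements that count as inline by default, lets the local its-within-text attribute override that default for "yes" and "no" (the "nested" value is not handled), and expects every element to be explicitly closed. The class and variable names are illustrative.

```python
from html.parser import HTMLParser

# Assumption: a small sample of elements treated as part of the text flow.
INLINE_BY_DEFAULT = {"a", "b", "code", "em", "i", "span", "strong"}


class Segmenter(HTMLParser):
    """Start a new segment at block elements; keep inline elements as codes."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._current = []
        self._inline_stack = []

    def _flush(self):
        text = " ".join("".join(self._current).split())
        if text:
            self.segments.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        local = dict(attrs).get("its-within-text")
        if local in ("yes", "no"):
            inline = local == "yes"
        else:
            inline = tag in INLINE_BY_DEFAULT
        self._inline_stack.append(inline)
        if inline:
            self._current.append(f"<{tag}>")  # stays inside the segment
        else:
            self._flush()                     # element boundary ends the segment

    def handle_endtag(self, tag):
        inline = self._inline_stack.pop() if self._inline_stack else False
        if inline:
            self._current.append(f"</{tag}>")
        else:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)


good = Segmenter()
good.feed("<div><p>Press <b>Start</b> now.</p><p>Done.</p></div>")

# Marking the same <b> as "not within text" splits the sentence into three.
faulty = Segmenter()
faulty.feed('<p>Press <b its-within-text="no">Start</b> now.</p>')
```

The first run keeps the sentence in one segment with the emphasis as an inline code; the second yields three fragments, which is the kind of faulty segmentation the data category is meant to prevent.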

The Domain data category identifies the topic or subject field of a text. This is relevant for translation-related processing such as machine translating: domain information can, for example, be used to choose an appropriate engine for the text.
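Engine selection based on domain information might look like the following Python sketch. It assumes that the domain is carried in a meta element named dcterms.subject with comma-separated values, which in ITS 2.0 would be designated by a global rule in the document. The engine identifiers and the mapping table are hypothetical.

```python
from html.parser import HTMLParser


class DomainReader(HTMLParser):
    """Collect domain values from <meta name="..." content="a, b, c">."""

    def __init__(self, meta_name="dcterms.subject"):
        super().__init__()
        self.meta_name = meta_name
        self.domains = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == self.meta_name:
            content = attributes.get("content") or ""
            self.domains += [d.strip().lower() for d in content.split(",") if d.strip()]


# Hypothetical mapping from domain values to MT engine identifiers.
ENGINES = {
    "automotive": "mt-engine-automotive",
    "legal": "mt-engine-legal",
}


def select_engine(domains, engines=ENGINES, default="mt-engine-general"):
    """Return the engine for the first known domain, or a general engine."""
    for domain in domains:
        if domain in engines:
            return engines[domain]
    return default


reader = DomainReader()
reader.feed(
    '<html><head><meta name="dcterms.subject" content="Automotive, Engines">'
    "</head><body><p>Check the oil level.</p></body></html>"
)
engine = select_engine(reader.domains)
```

The same lookup could just as well pick a terminology database or a training corpus instead of an engine.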

The Storage Size data category specifies the maximum storage size of a given piece of content. This is relevant for quality checking to detect whether the original content or the machine translated content exceeds the maximum size. This information is particularly important when data may be stored in a database such as a content management system (CMS).
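A storage-size check comes down to counting bytes in the relevant encoding rather than counting characters. The Python sketch below assumes UTF-8 and a list of (identifier, text, limit) triples; the unit names and limits are made up for illustration.

```python
def storage_size(text, encoding="utf-8"):
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


def find_oversized(units, encoding="utf-8"):
    """Return (unit id, actual size, limit) for every unit above its limit."""
    return [
        (unit_id, storage_size(text, encoding), limit)
        for unit_id, text, limit in units
        if storage_size(text, encoding) > limit
    ]


# Hypothetical units: the English source fits into 6 bytes, the German MT output does not.
units = [
    ("title-en", "Size", 6),
    ("title-de", "Größe", 6),
]
oversized = find_oversized(units)
```

Both strings have fewer than six characters, but "Größe" needs seven bytes in UTF-8 because ö and ß take two bytes each. Note that Python's "utf-16" codec prepends a byte order mark; "utf-16-le" avoids counting it.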

The Localization Quality Issue data category describes the nature and severity of an error detected during a language-oriented quality assurance process. This information is generated during quality checking and can be stored as part of the XLIFF file to ensure that problems in the translated text are addressed.
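As a rough sketch of how a quality checker could record its findings, the Python below runs two simple checks and writes the most severe result onto a target element using ITS 2.0's XML attributes (locQualityIssueType, locQualityIssueSeverity and locQualityIssueComment in the ITS namespace). The checks themselves, the severity values and the function names are assumptions; since only one issue is annotated directly here, a real tool would need ITS stand-off markup to record several issues for the same content.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
ET.register_namespace("its", ITS_NS)


def check_unit(source, target, max_bytes=None):
    """Return (issue type, severity 0-100, comment) tuples for one unit."""
    issues = []
    if not target.strip():
        issues.append(("omission", 100.0, "Target is empty"))
    elif target == source:
        issues.append(("untranslated", 80.0, "Target is identical to source"))
    if max_bytes is not None and len(target.encode("utf-8")) > max_bytes:
        issues.append(("length", 60.0, f"Target exceeds {max_bytes} bytes"))
    return issues


def annotate(element, issues):
    """Attach the most severe issue to the element as ITS attributes."""
    if not issues:
        return element
    issue_type, severity, comment = max(issues, key=lambda issue: issue[1])
    element.set(f"{{{ITS_NS}}}locQualityIssueType", issue_type)
    element.set(f"{{{ITS_NS}}}locQualityIssueSeverity", str(severity))
    element.set(f"{{{ITS_NS}}}locQualityIssueComment", comment)
    return element


# Hypothetical unit in which the MT system returned the source text unchanged.
target = ET.Element("target")
target.text = "Save all"
issues = check_unit("Save all", target.text, max_bytes=6)
annotate(target, issues)
markup = ET.tostring(target, encoding="unicode")
```

Because the information travels with the content, a later review step can find flagged units without rerunning the checks.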

Important ITS 2.0 features

Two design principles/intrinsic features of ITS 2.0 are noteworthy. First, the data categories are independent of each other; ITS thus has a specific notion of “modularization.” Second, they can be used in multiple processing steps. Accordingly, once a tool set like Okapi has added comprehensive ITS 2.0 support, automated, efficient and cost-effective processing of human language can be done in a great variety of scenarios.

The modularity is important for two implementation-related reasons. The barrier to entry for using ITS 2.0 is kept low since support for a single data category opens the door for reaping ITS 2.0 benefits. Also, starting to use the standard can be simple and does not require major changes to tool chains — although it may enable major changes. Since ITS 2.0 addresses the entire content life cycle, not all data categories are needed or make sense for any given stage or task.

A number of tools have already implemented ITS 2.0 data categories and thus automation is already possible, although it will improve as more tools start to use the standard. There are, of course, additional examples of how the standard data categories can add new functionality to workflows or simplify and standardize existing functionality. To begin with, the Localization Quality Rating data category can be used by an automatic translation quality assessment tool to provide an estimate of translation quality. A subsequent process might use this information to decide whether a text can be published as is or whether it needs to be sent on for additional editing and revision processes.
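Such a routing decision could be sketched as follows in Python. The example reads the HTML attributes its-loc-quality-rating-score and its-loc-quality-rating-score-threshold, in which higher scores mean better quality and the threshold is the lowest passing score. The route names, the fallback threshold of 80 and the handling of unrated content are assumptions.

```python
from html.parser import HTMLParser


class RatingReader(HTMLParser):
    """Record the first quality rating score (and threshold) in the markup."""

    def __init__(self):
        super().__init__()
        self.score = None
        self.threshold = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.score is None and "its-loc-quality-rating-score" in attributes:
            self.score = float(attributes["its-loc-quality-rating-score"])
            raw = attributes.get("its-loc-quality-rating-score-threshold")
            self.threshold = float(raw) if raw is not None else None


def next_step(html, default_threshold=80.0):
    """Decide where rated content goes next."""
    reader = RatingReader()
    reader.feed(html)
    if reader.score is None:
        return "review"  # no rating available: leave the decision to a person
    threshold = reader.threshold if reader.threshold is not None else default_threshold
    return "publish" if reader.score >= threshold else "post-edit"


passed = next_step(
    '<p its-loc-quality-rating-score="92" '
    'its-loc-quality-rating-score-threshold="85">Guter Text.</p>'
)
failed = next_step(
    '<p its-loc-quality-rating-score="60" '
    'its-loc-quality-rating-score-threshold="85">Schlechter Text.</p>'
)
```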

As another example, a named entity recognizer integrated into an authoring tool could automatically detect the use of specialized terms in text and flag them using the Terminology data category. Even links to online information about terms could be provided. Later on, a translation process could use this information to ensure terminological correctness. Additionally, in an XML format that contains both source and target texts, the Target Pointer data category can be used to provide instructions on where translated content should be added back to the original file, thus eliminating the need to develop a custom filter for the file format.
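The Target Pointer idea can be illustrated with a short Python sketch. In ITS 2.0, a global rule pairs a selector for the source content with a relative path to the place where the translation belongs, for example selector="//source" and targetPointer="../target". The file format below is invented for the example, and because the standard library's XPath subset cannot step from a node to its parent, the sketch supports only pointers of the form "../name".

```python
import xml.etree.ElementTree as ET

# Hypothetical bilingual format with empty target elements.
DOCUMENT = """<file>
  <entry id="1"><source>Hello</source><target/></entry>
  <entry id="2"><source>Goodbye</source><target/></entry>
</file>"""


def fill_targets(root, source_tag, target_pointer, translate):
    """Write translate(source text) into the element named by the pointer."""
    if not target_pointer.startswith("../"):
        raise ValueError("this sketch only supports pointers like '../target'")
    target_tag = target_pointer[3:]
    # ElementTree elements do not know their parent, so build a lookup table.
    parents = {child: parent for parent in root.iter() for child in parent}
    for source in root.iter(source_tag):
        parent = parents.get(source)
        target = parent.find(target_tag) if parent is not None else None
        if target is not None:
            target.text = translate(source.text or "")
    return root


# Stand-in for a translation step: a simple lookup table.
translations = {"Hello": "Hallo", "Goodbye": "Auf Wiedersehen"}

root = ET.fromstring(DOCUMENT)
fill_targets(root, "source", "../target", lambda text: translations.get(text, text))
filled = [entry.find("target").text for entry in root.iter("entry")]
```

The point of the data category is that the pair of paths, not custom filter code, describes where translations go, so the same generic routine can serve many formats.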

While ITS 2.0 can be used manually (for example, an HTML coder might add the attribute translate="no" to an HTML element), the real strength of ITS emerges when it’s used with automated processes. In addition, ITS 2.0 can provide the “glue” that helps to couple processes and systems. While existing standards are good for moving translatable content and linguistic resources, they do not address metadata as thoroughly as ITS 2.0.

Typical ITS 2.0-based automation involves one system that generates ITS metadata and another one that consumes it. Often, the metadata generation/content enrichment starts at the authoring phase, so considerable work went into the design of ITS 2.0 to ensure that it is applicable and usable by CMSs. Processes involving MT are ideal use cases because their results can be dramatically improved by using appropriate linguistic resources such as MT engines, training corpora and terminology databases. When ITS is used properly, it allows MT systems to select these resources automatically and hence to deliver better results.

Benefits of ITS 2.0

ITS 2.0 has tremendous potential to improve the status quo of multilingual content production by taking a holistic view of the various production steps and components involved. Via persistent, standardized metadata, ITS 2.0 reduces costs, improves quality and speeds up production, thus addressing all three sides of the classic cost-quality-speed triangle by reducing barriers and “friction” that impact all of these areas.

Until recently, the information that can now be conveyed in a standard format through ITS 2.0 was either provided manually (such as in a set of written instructions about a project) or generated and attached using proprietary methods. Neither of these methods scales well in an enterprise context, and they thus represent suboptimal solutions.

A project may go through hundreds of steps and hands before completion. Thus, there are many opportunities for manual instructions to be lost, misinterpreted or ignored, especially when the people creating these instructions are not in direct contact with technical staff or individual translators. Even when the instructions are successfully transmitted and understood, doing so takes time and effort. A note such as “Make sure that the phrase ‘Acme Holographic Presenter’ remains in English” helps ensure the proper outcome, but requires someone to verify that the instruction has been followed and to inform all parties of this instruction.

Manual instructions also cannot address the needs of automated processes where no human may be available to interpret instructions. If instructions require manual intervention, they create bottlenecks in the process or points of failure.

Using proprietary digital formats can help to automate processing, but possibly requires adaptation of all tools in the process around the relevant markup. This may be a significant implementation task, especially when multiple functionally equivalent but technically distinct markup formats need to be supported. In addition, multiple formats make it difficult to develop consistent solutions since, even when they accomplish essentially the same thing, there may be minor differences that prevent them from being truly interoperable.

ITS 2.0 addresses these requirements, and by using it, processes can be easily automated around a known and shared format. Furthermore, if such a format is designed to work with core XML and internet technologies, as ITS 2.0 is, it provides an easy path to adoption. So as ITS 2.0 is implemented and adopted, it holds the potential to simplify the multilingual content production process greatly. As content creators implement ITS 2.0, it will be easier to ensure that their intent is clear and that their requirements are met without the need for manual assurance.

Quality will improve with ITS 2.0, both directly (through quality-related data categories) and indirectly, through improved data handling (especially in automated situations) and the ability to specify intent from the earliest stages of content creation. The Localization Note data category allows information to be added that ensures that special requirements or information needed for translation remains with the text it pertains to so that translators will take it into account.
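To illustrate how a note stays attached to its text, the following Python sketch collects text nodes together with the Localization Note that applies to them. It reads the HTML attributes its-loc-note and its-loc-note-type, treats "description" as the type when none is given, and lets child elements inherit the note of their parent. The class name and the sample note are illustrative, and the sketch assumes that every element is explicitly closed.

```python
from html.parser import HTMLParser


class LocNoteCollector(HTMLParser):
    """Pair each text node with the localization note that applies to it."""

    def __init__(self):
        super().__init__()
        self._notes = [None]  # one entry per open element
        self.units = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("its-loc-note"):
            note_type = attributes.get("its-loc-note-type") or "description"
            note = (attributes["its-loc-note"], note_type)
        else:
            note = self._notes[-1]  # inherit the parent's note, if any
        self._notes.append(note)

    def handle_endtag(self, tag):
        if len(self._notes) > 1:
            self._notes.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.units.append((text, self._notes[-1]))


collector = LocNoteCollector()
collector.feed(
    '<p its-loc-note="Button label; keep it short" its-loc-note-type="alert">'
    "Save <b>all</b></p><p>Other text</p>"
)
```

A translation editor could then display the note next to the affected units, with "alert" notes shown more prominently than plain descriptions.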

ITS 2.0 also facilitates data enrichment processes that automatically add information to content. One such process is named entity recognition, which identifies items such as names, addresses and dates that can be linked to more information and that should be translated in particular ways.

The information contained in ITS markup thus helps ensure that translated text meets requirements. In this way, ITS 2.0 takes translation away from an ad hoc, unstructured process toward a highly structured and verifiable process.

All of these features make ITS 2.0 a potent technical solution to a major business problem. Compared to a previous nonautomated solution, a case study with the Spanish tax authority showed that using ITS to facilitate automation delivered a significant reduction in time to deliver content (up to 60%) through increased process efficiency, and cost savings in management and translation (between 15% and 40%). It also eliminated localization platform and format dependency, since ITS 2.0 solutions can be easily moved to whatever technology provides the best results.

While the 60% time reduction and 15%–40% cost reductions will not be typical for most cases (given the start from a relatively unautomated baseline), case studies within the MultilingualWeb – Language Technology working group at the W3C have consistently shown that ITS 2.0 implementation delivers significant business advantages to users, in addition to improving quality and lowering technical barriers.

Relationship between ITS and other formats and tools

Real-world multilingual content production clearly comes with the requirement to accommodate existing realities. ITS 2.0 addresses this with a focus on the following:

HTML, HTML5: Whereas ITS 1.0 was more or less restricted to XML, ITS 2.0 can be applied to HTML in general, and HTML5 in particular.

XLIFF: XLIFF is related to several important ITS 2.0 usage scenarios. Although ITS 2.0 has no normative dependency on XLIFF, a nonnormative definition of how to represent ITS 2.0 data categories in XLIFF 1.2 or XLIFF 2.0 is being defined within the Internationalization Tag Set Interest Group.

TBX: TBX, an ISO standard for the exchange of terminological data, can easily be combined with ITS 2.0 in order to facilitate TBX processing.

NIF: NIF is a Resource Description Framework (RDF)/Web Ontology Language-based format that aims at interoperability between natural language processing tools, language resources and annotations. ITS 2.0 provides a non-normative algorithm to convert XML or HTML documents that contain ITS metadata to the RDF format based on NIF.

Provenance: Provenance helps to record the identity of agents that have been involved in the translation of the content or the revision of the translated content. ITS 2.0 does not define the format of provenance information, but recommends that an open provenance or change-logging format be used.

ITS 2.0 is already implemented, not only in commercial systems, but also in open-source frameworks such as the Okapi Framework and in a module for the Drupal CMS. ITS 2.0 is also supported in the free BlueGriffon HTML editor, allowing HTML authors/web developers to work with ITS 2.0 in a standards-based HTML5 environment. Inclusion in such free or open-source projects lowers the barriers to using ITS 2.0 and thus creates a viral effect for the deployment of the standard. As usage increases, cost and time savings will scale as well since more tools will become “plug and play” around an ITS backbone.

ITS 2.0 has been developed with contributions from a wide range of stakeholders, including content creators, localization experts, tool developers and language technology researchers.

With ITS 2.0, a new foundation for automated processing has been created in a short timeframe. The implementation-driven approach of the W3C allowed ITS 2.0 to be finalized with production-ready implementations available on the commercial side and open source side. The first language service providers are already providing new and enhanced services based on ITS to their clients.

The ITS Interest Group serves as an open forum to discuss ITS 2.0. Its wiki at www.w3.org/International/its/wiki/ also provides additional information on implementations/support and usage scenarios such as the Okapi-based workflow.

Additional work based on the outcome of ITS 2.0 has already started. One example is the European Union-funded Multidimensional Quality Metrics (MQM), part of the QTLaunchPad Project (7th Framework, contract 296347). Designed to be compatible with ITS 2.0, MQM builds on ITS 2.0 to provide a powerful, flexible approach for the assessment of translation quality. In addition to MQM, there is active discussion in the W3C’s ITS interest group about additional data categories that will support next-generation automation and emerging technologies. Furthermore, efforts are underway to build even more bridges between the localization and semantic web communities.

As more and more implementations of ITS 2.0 surface, deployment will become easier and more companies will take advantage of ITS 2.0. Workflows will become increasingly automated using ITS 2.0, and content creators, localizers and other participants in the international content production chain will work together more efficiently, lower costs, improve quality and experience fewer process problems.