xml:tm — a new approach to translating XML

XML has become one of the defining technologies reshaping the face of both computing and publishing, helping to drive down costs and dramatically increase interoperability between diverse computer systems.

From a localization point of view XML offers many advantages: a well-defined, rigorous syntax backed up by a rich tool set that allows documents to be validated and proven; a well-defined character encoding model that includes support for Unicode; and the separation of form and content, which allows multi-target publishing (PDF, PostScript, WAP, HTML, XHTML, online help) from a single source.

Companies that have adopted XML-based publishing have seen significant cost savings compared with proprietary systems. The localization industry has also enthusiastically used XML as the basis of exchange standards such as the ETSI LIS (previously LISA OSCAR) standards: Translation Memory eXchange (TMX), TermBase Exchange (TBX) and Segmentation Rules eXchange (SRX), as well as Global Information Management Metrics eXchange – Volume (GMX-V).  OASIS has also contributed in this field with XML Localization Interchange File Format (XLIFF) and Translation Web Services (TransWS). In addition, there is the W3C Internationalization Tag Set (ITS).

Another significant development affecting XML and localization has been the OASIS Darwin Information Typing Architecture (DITA) standard. DITA provides a comprehensive architecture for the authoring, production and delivery of technical documentation. It was originally developed within IBM and then donated to OASIS. The essence of DITA is the topic-based construction of publications, which allows for the modular reuse of specific sections. Each section is authored independently, and each publication is then constructed from the section modules. This means that individual sections need to be authored and translated only once, and may be reused many times over in different publications.

A core concept of DITA is reuse at a given level of granularity. Actual publications are assembled by means of a map that pulls together all of the required constituent components. DITA represents an intelligent approach to the process of publishing technical documentation. At its core is the concept of the topic: a unit of information that describes a single task, concept or reference item. DITA uses an object-oriented approach to topics, encompassing the standard object-oriented characteristics of polymorphism, encapsulation and message passing.

The main features of DITA are:

Topic-centric level of granularity

Substantial reuse of existing assets

Specialization at the topic and domain level

Metadata property-based processing

Leveraging existing popular element names and attributes from XHTML

The basic message behind DITA is reuse: write once, translate once, reuse many times.
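
To make the reuse model concrete, here is a minimal, illustrative sketch of a DITA concept topic and a map that assembles it into a publication; the file names and identifiers are invented for this example.

<!-- overview.dita: a self-contained concept topic, authored and translated once -->
<concept id="product-overview">
  <title>Product overview</title>
  <conbody>
    <p>This topic introduces the product and its main components.</p>
  </conbody>
</concept>

<!-- guide.ditamap: a map that pulls reusable topics into one publication -->
<map>
  <title>User guide</title>
  <topicref href="overview.dita"/>
  <topicref href="installation.dita"/>
</map>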

 

xml:tm

xml:tm is a radical approach to the problem of translating XML documents. In essence, it takes the DITA concept of reuse and implements it at the sentence level. It does this by leveraging the power of XML, principally the XML namespace syntax, to embed additional information within the XML document itself, and a number of further benefits flow from this approach. Originally developed as a standard under the auspices of LISA OSCAR, xml:tm is now an ETSI LIS standard. It is a perfect companion to DITA: the two fit together hand in glove in terms of interoperability and localization.

xml:tm was designed from the outset to integrate closely with and leverage the potential of other relevant XML based localization industry standards.

As previously mentioned, xml:tm is a perfect match for DITA, taking the DITA reuse principle down to sentence level.

xml:tm mandates the use of SRX for text segmentation of paragraphs into text units.

xml:tm mandates the use of Unicode Standard Annex #29 for tokenization of text into words.

xml:tm mandates the use of XLIFF for the actual translation process. xml:tm is designed to facilitate the automated creation of XLIFF files from xml:tm-enabled documents and, after translation, the straightforward creation of the target-language versions of those documents.

xml:tm mandates the use of GMX-V for all metrics concerning authoring and translation.

xml:tm facilitates the easy creation of TMX documents, aligned at the sentence level.
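
To illustrate the segmentation step mandated above, the following is a minimal sketch of an SRX rule file of the kind xml:tm relies on, with one exception rule and one break rule. The rule patterns are indicative only and are not taken from any particular production rule set.

<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no">
    <formathandle type="start" include="no"/>
    <formathandle type="end" include="yes"/>
    <formathandle type="isolated" include="no"/>
  </header>
  <body>
    <languagerules>
      <languagerule languagerulename="Default">
        <!-- exception: do not break after common abbreviations -->
        <rule break="no">
          <beforebreak>(e\.g|i\.e|etc)\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <!-- break after sentence-final punctuation followed by white space -->
        <rule break="yes">
          <beforebreak>[\.\?!]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern=".*" languagerulename="Default"/>
    </maprules>
  </body>
</srx>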

The Open Architecture for XML Authoring and Localization (OAXAL) is an OASIS technical committee standard that defines a service-oriented architecture showing how xml:tm integrates with the other localization industry standards to form one elegant system.

At the core of xml:tm is the concept of text memory. Text memory comprises two components: author memory and translation memory.

XML namespace is used to map a text memory view onto a document. This process is called segmentation. The text memory view works at the sentence level of granularity: the text unit. Each individual xml:tm text unit is allocated a unique identifier, which is immutable for the life of the document. As a document goes through its life cycle, the existing unique identifiers are maintained and new ones are allocated as required. This aspect of text memory is called author memory. It can be used to build author memory systems that simplify authoring and improve its consistency.

 Figure 1 shows a simplified example of how xml:tm is implemented in an XML document. The xml:tm elements are highlighted in blue to show how xml:tm maps onto an existing XML document. The original text reads as follows:

 

xml:tm is a revolutionary technology for dealing with the problems of translation memory for XML documents by using XML techniques to embed memory directly into the XML documents themselves. It makes extensive use of XML namespace.

The “tm” stands for “text memory.” There are two aspects to text memory:

1. Author memory

2. Translation memory
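
As a rough sketch of what the corresponding xml:tm markup might look like (Figure 1 shows the actual example), the fragment below wraps each sentence of the first paragraph in a tm:tu (text unit) inside a tm:te (text element), as defined by the xml:tm namespace. The namespace URI, prefix and identifier values shown here are indicative.

<document xmlns:tm="urn:xmlintl-tm-tags">
  <para>
    <!-- tm:te wraps the original paragraph; each tm:tu is one sentence with an immutable id -->
    <tm:te id="e1">
      <tm:tu id="u1.1">xml:tm is a revolutionary technology for dealing with the problems of translation memory for XML documents by using XML techniques to embed memory directly into the XML documents themselves.</tm:tu>
      <tm:tu id="u1.2">It makes extensive use of XML namespace.</tm:tu>
    </tm:te>
  </para>
</document>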

 

When an xml:tm document is ready for translation, the tm namespace itself identifies the text that is to be translated and can be used to create an XLIFF document for the translation process.

XLIFF is another XML format that is optimized for translation. Using XLIFF, you can protect the original document syntax from accidental corruption during the translation process. In addition, you can supply other relevant information to the translator such as translation memory and preferred terminology.

Figure 2 is an example of part of an XLIFF document based on the previous example of Figure 1. The magenta-colored text signifies where the translated text will replace the source language text, as shown in Figure 3.
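
As a rough indication of what such an XLIFF file could contain (the details in Figure 2 may differ), each xml:tm text unit becomes a trans-unit whose empty target element is to be completed by the translator. The file attributes and identifier values here are illustrative.

<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="example.xml" source-language="en" target-language="es" datatype="xml">
    <body>
      <!-- one trans-unit per xml:tm text unit; the id preserves the link back to the source document -->
      <trans-unit id="u1.1">
        <source>xml:tm is a revolutionary technology for dealing with the problems of translation memory for XML documents by using XML techniques to embed memory directly into the XML documents themselves.</source>
        <target/>
      </trans-unit>
      <trans-unit id="u1.2">
        <source>It makes extensive use of XML namespace.</source>
        <target/>
      </trans-unit>
    </body>
  </file>
</xliff>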

When the translation has been completed, the target language text can be merged with the original document to create a new target language version of that document. The net result is a perfectly aligned source and target language document. Figure 4 is the translated xml:tm document in Spanish. The translated text will be as follows:

 

xml:tm es una técnica revolucionaria que trata los problemas de memoria de traducción en documentos XML usando técnicas XML e incluyendo la memoria en el documento mismo.

“tm” significa “memoria de texto.” Hay dos aspectos de memoria de texto:

1. Memoria de autor

2. Memoria de traducción

 

The source and target text is linked at the sentence level by the unique xml:tm identifiers. When the document is revised, new identifiers are allocated to modified or new text units. When text is extracted for translation from the updated source document, the text units that have not changed can be automatically populated with the previously translated target-language text. The resultant XLIFF file will look like Figure 5.
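
By way of illustration, an updated XLIFF file of this kind might look as follows (Figure 5 shows the actual example): the unchanged text unit arrives with its target already populated and marked as not requiring translation, while a newly added text unit has an empty target. The attribute values are illustrative uses of standard XLIFF 1.2 markup, and the new source sentence is invented for this sketch.

<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="example.xml" source-language="en" target-language="es" datatype="xml">
    <body>
      <!-- unchanged text unit: exact match carried over from the previous translation -->
      <trans-unit id="u1.1" translate="no" approved="yes">
        <source>xml:tm is a revolutionary technology for dealing with the problems of translation memory for XML documents by using XML techniques to embed memory directly into the XML documents themselves.</source>
        <target state="final">xml:tm es una técnica revolucionaria que trata los problemas de memoria de traducción en documentos XML usando técnicas XML e incluyendo la memoria en el documento mismo.</target>
      </trans-unit>
      <!-- text unit added in the revised document: still to be translated -->
      <trans-unit id="u1.3">
        <source>A newly added sentence in the revised document.</source>
        <target state="needs-translation"/>
      </trans-unit>
    </body>
  </file>
</xliff>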

Matching

The matching described in the previous section is called “exact” matching. Because xml:tm memories are embedded within the XML document itself, they carry all the contextual information required to identify precisely which text units have not changed since the previous revision of the document. Unlike leveraged matches, exact matches do not require translator intervention, thus reducing translation costs.

xml:tm provides much more focused types of matching than traditional translation memory systems. The first is exact matching, where author memory provides exact details of any changes to a document. Where text units are unchanged from a previously translated version of the document, we have an exact match. The concept of exact matching is an important one. With traditional translation memory systems a translator still has to proof each match, as there is no way to ascertain the appropriateness of the match. Proofing has to be paid for, typically at about 60% of the standard translation cost. With exact matching there is no need to proofread, thereby saving on the cost of translation.

xml:tm can also be used to find in-document leveraged matches, which will be more appropriate to a given document than normal translation memory leveraged matches. Additionally, when an xml:tm document is translated, the translation process provides perfectly aligned source and target language text units. These can be used to create traditional translation memories, but in a consistent and automatic fashion.
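
For instance, the aligned text units from the earlier example could be exported to a conventional TMX memory along the following lines; the header attribute values, in particular the creation tool, are illustrative.

<tmx version="1.4">
  <header creationtool="xmltm-export" creationtoolversion="1.0"
          datatype="plaintext" segtype="sentence"
          adminlang="en" srclang="en" o-tmf="xml:tm"/>
  <body>
    <!-- one translation unit per aligned xml:tm source/target pair -->
    <tu tuid="u1.1">
      <tuv xml:lang="en">
        <seg>xml:tm is a revolutionary technology for dealing with the problems of translation memory for XML documents by using XML techniques to embed memory directly into the XML documents themselves.</seg>
      </tuv>
      <tuv xml:lang="es">
        <seg>xml:tm es una técnica revolucionaria que trata los problemas de memoria de traducción en documentos XML usando técnicas XML e incluyendo la memoria en el documento mismo.</seg>
      </tuv>
    </tu>
  </body>
</tmx>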

For in-document fuzzy matching, a note can be made during the maintenance of author memory of text units that have changed only slightly. If a corresponding translation exists for the previous version of the source text unit, then the previous source and target versions can be offered to the translator as a type of close fuzzy match.

The text units contained in the leveraged memory database can also be used to provide fuzzy matches of similar previously translated text. In practice fuzzy matching is of little use to translators except for instances where the text units are fairly long and the differences between the original and current sentence are very small.

Additionally, in technical documents you can often find a large number of text units that are made up solely of numeric, alphanumeric, punctuation or measurement items. With xml:tm these can be identified during authoring and flagged as nontranslatable, thus reducing the word counts. For numeric- and measurement-only text units it is also possible to automatically convert the decimal and thousands separators as required by the target language.
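
Purely as an illustration, a measurement-only text unit might be flagged along the following lines. The type and translate attributes used here are hypothetical placeholders, not attributes taken from the xml:tm specification, which defines its own flagging mechanism.

<para xmlns:tm="urn:xmlintl-tm-tags">
  <tm:te id="e7">
    <!-- hypothetical attributes marking a measurement-only text unit as nontranslatable -->
    <tm:tu id="u7.1" type="measurement" translate="no">25.4 mm</tm:tu>
  </tm:te>
</para>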

 

Process automation

You can use xml:tm to create an integrated and totally automated translation environment. The presence of xml:tm allows for the automation of what would otherwise be labor-intensive processes. The previously translated target version of the document serves as the basis for the exact matching of unchanged text. In addition, xml:tm allows for the identification of text that does not require translation (such as text units comprising solely punctuation, or numeric or alphanumeric only text) as well as providing for in-document leveraged and fuzzy matching.

In essence, xml:tm has already pre-prepared a document for translation and provided all of the facilities to produce much more focused matching. After exhausting all of the in-document matching possibilities any unmatched xml:tm text units can be searched for in the traditional leveraged and fuzzy search manner.

The presence of xml:tm can be used to totally automate the extraction and matching process. This means that the customer is in control of all of the translation memory matching and word count processes, all based on open standards. This not only substantially reduces the cost of preparing documents for translation, but also makes the whole process faster and more reliable. A traditional translation scenario can be seen in Figure 6, whereas in the xml:tm translation scenario, all processing takes place within the customer’s environment, as seen in Figure 7.

Because xml:tm mandates the use of XLIFF as the exchange format for translation, the XLIFF format can be used to create dynamic web pages for translation. A translator can access these pages via a browser and undertake the whole of the translation process over the internet. This has many potential benefits. The problems of running filters, the delays inherent in sending data out for translation, and risks such as inadvertent corruption of character encoding or document syntax, or simple human workflow problems, can be avoided entirely. Using XML technology it is now possible to reduce and control the cost of translation, shorten translation turnaround times and improve reliability.