It has been two years since the Organization for the Advancement of Structured Information Standards (OASIS), after several years of development work, approved the XML Localization Interchange File Format (XLIFF) as a standard. The First International XLIFF Symposium at the end of September 2010 provided insights into the standard’s present and possible future.
The XLIFF Technical Committee, the steward of the standard, realized that the symposium provided a unique opportunity to gain insights into the present and possible future of the standard. Accordingly, the committee went through all of the contributions and distilled input on three questions related to XLIFF: What is the status quo of implementation and adoption? Which ideas or recommendations exist for enhancements or future versions? Which general observations can be made?
A major finding of the contributions that surveyed existing XLIFF implementations was that major tool vendors work with proprietary formats and extensions. Ultimately, the tools consume and produce different flavors of XLIFF, sometimes not even XLIFF version 1.2, which is the only version approved as an OASIS standard. The negative impact of this on interoperability is aggravated by the fact that vendors are often not transparent with regard to supported features, limitations and extensions. Several implementations cover only a subset of XLIFF, and some, believe it or not, do not even handle XLIFF correctly.
A sad observation related to implementation was that the number of XLIFF implementations is low. In particular, open-source tool support (Virtaal is one notable exception) is often poor or nonexistent. This is notable since a number of the commercial tools available are costly, insufficiently tested and suitable only for experienced users. Contributions that analyzed the reasons for implementation issues referred to the broad coverage and vagueness of XLIFF. Among the observations were that representations are non-orthogonal, meaning the same thing can be represented in different ways, and that processing requirements are not clearly defined. Tools don’t always know how to modify XLIFF, and extensibility features have been abused when a proprietary approach rather than a standard mechanism was used. Some members of the symposium’s audience surmised that the dominance of localization and internationalization stakeholders contributed to the shortcomings.
Despite the rather gloomy picture painted by those who surveyed or analyzed XLIFF implementations, most adoption and usage stories were on the positive side of things. Niall Murphy explained, for example, how mergers or acquisitions can be handled efficiently with the unification that can be built on top of XLIFF. In a similar vein, JoAnn Hackos, Bryan Schnabel and Rodolfo Raya showed how XLIFF can help to address certain issues related to the translation of DITA files. Steve Dept, Andrea Ferrari, Britta Upsing and Heiko Rölke reported on the successful use of XLIFF in a large-scale OECD content translation project.
The XLIFF Technical Committee has been gathering requirements and ideas for XLIFF enhancements (under the umbrella of XLIFF 2.0) for some time now. The symposium provided significant additional input for this work. The suggestions fall into three categories: simplification, clarification and extension. Explanations and examples are provided in the remainder of this article.
For each of the three categories of suggestions (simplify, clarify, extend), a few general recommendations were made. First, do not reinvent the wheel. If a suitable approach for general annotations already exists, for example as an industry standard, then strongly consider using it. Second, acknowledge that new approaches are available for describing resources and providing metadata (for example, the Resource Description Framework from the W3C). Also, try to stay backwards compatible, establish stricter rules for the use of extensions and define clear conformance rules. Take a look at all of the extensions that the toolmakers have enacted and at the existing XLIFF files; use this as a clue for what to put into the core of a future XLIFF version.
Although the XLIFF specification is not a heavyweight, it is extensive and sometimes not easy reading. Accordingly, many contributors to the XLIFF Symposium recommended that future versions should make it dead easy to understand how to create or manipulate a minimalistic XLIFF file. This would facilitate the creation of tools that annotate or process XLIFF files. It would also assist in understanding how XLIFF can be used to support a specific process, such as one that combines machine translation with human post-editing. Specific ideas for approaching simplification included modularizing the XLIFF specification and schema in such a way that implementers and users can pick and choose what they really need to implement an XLIFF profile. As an example, annotations for human consumption aren’t needed everywhere. Accordingly, some implementers may choose to exclude the “human annotation” module from the profile they implement. Other ideas included providing a schema for a minimalistic XLIFF, a modular specification, and customized, role-specific user guides with explanations on how to minimize markup and tags.
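To illustrate what such a minimalistic file can look like, here is a small XLIFF 1.2 document; the file name, languages and content are invented for the example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <!-- one file element per extracted source document -->
  <file original="hello.properties" source-language="en"
        target-language="de" datatype="plaintext">
    <body>
      <!-- one trans-unit per extracted string -->
      <trans-unit id="hello">
        <source>Hello World</source>
        <target>Hallo Welt</target>
      </trans-unit>
    </body>
  </file>
</xliff>
```

Even this small example shows where a guide could help: which attributes are mandatory, which elements a tool may change and which it must preserve.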
Quite a number of presenters at the symposium demanded clarifications. They explained, for example, that processing requirements, such as attribute values that tools have to change, currently are not always obvious. In a similar vein, the presenters mentioned that the provenance of XLIFF’s data categories and values has to be traceable: Where are the elements, attributes and values from, and why have they been included in XLIFF? If that is clear, implementation and use become simpler since it is easier to figure out what one’s own XLIFF profile really needs to include. Where different representations are possible, it should be clearly indicated which are deprecated; ideally, the possibility of using multiple mechanisms for the same representation would be eliminated to arrive at so-called orthogonality and the corresponding canonical representations.
One specific suggestion to assist not just in clarification but also in the aforementioned simplification was to organize XLIFF in terms of processing phases and data categories.
Possible phases were extraction/filtering, constraint setting, internationalization, automated linguistic processing, human translation, localization, reviewing, inclusion of reviewing results, workflow events, tool-specific events, technical quality assurance checks (such as XML validation) and packaging. Possible data categories and data category clusters included payload (unsegmented or segmented content, translations), string length constraints (minimum or maximum lengths), resource type (different user interface controls such as labels and buttons), inlines (such as ph or x), identifiers (for processing), names (the identifier used in the native format, such as the key in a Java property file), notes (explanations and other admonitions), internationalization, domain/subject area, relationships (for instance, between strings belonging to the same user interface menu) and creation (generator and creation date).
A side effect of looking at processing phases and data categories would be the simplicity of creating XLIFF modules, which could correspond to data category clusters, and XLIFF profiles, which could correspond to modules needed to support processing chains.
The content processing world, XLIFF’s domain, has changed since XLIFF’s birth. Thus, no one was surprised to learn that XLIFF should cover new requirements and should acknowledge, learn from or reuse other standards. To the surprise of the XLIFF Technical Committee, however, the list of suggested extensions was quite long and comprised the following:
Create an object model — similar to, for example, the XML DOM — to allow programmatic access that is transparent to the underlying data format. In addition, create a container-based representation format (perhaps similar to the ODF or OOXML zip formats). Possibly, even provide an open-source library that allows for the creation and modification of XLIFF.
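The article does not prescribe what such an object model would look like, but a rough sketch may help. The snippet below, a speculative illustration using Python’s standard xml.etree.ElementTree, wraps trans-units so that callers work with plain strings rather than raw XML; the class names, sample content and identifiers are all invented for the example and are not part of any proposed XLIFF API.

```python
# Sketch of DOM-style access to XLIFF 1.2 trans-units. Illustrative only;
# not a proposed XLIFF object model.
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
NS = {"x": XLIFF_NS}

# Invented sample content for the demonstration.
XLIFF_SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="app.properties" source-language="en"
        target-language="de" datatype="plaintext">
    <body>
      <trans-unit id="greeting">
        <source>Hello</source>
        <target>Hallo</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""


class TransUnit:
    """Wraps one <trans-unit>, exposing id/source/target as strings."""

    def __init__(self, element):
        self._el = element

    @property
    def id(self):
        return self._el.get("id")

    @property
    def source(self):
        return self._el.findtext("x:source", namespaces=NS)

    @property
    def target(self):
        return self._el.findtext("x:target", namespaces=NS)


class XliffDocument:
    """Minimal object model over an XLIFF 1.2 document."""

    def __init__(self, text):
        self._root = ET.fromstring(text)

    def trans_units(self):
        return [TransUnit(el)
                for el in self._root.iter(f"{{{XLIFF_NS}}}trans-unit")]


doc = XliffDocument(XLIFF_SAMPLE)
units = doc.trans_units()
for unit in units:
    print(unit.id, unit.source, unit.target)
```

A shared library along these lines, as suggested at the symposium, would let tools agree on behavior at the API level instead of each re-implementing XLIFF parsing.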
Include mechanisms to represent or refer to original data (such as binary data for accurate, secure representation of user interfaces). Furthermore, allow for the representation of or reference to validation rules (such as rules expressed in Schematron). In addition, cover flexible annotations for human consumption on any XLIFF element or attribute, which might take the form of an enhanced “note” element. In a similar vein, allow for annotations for machine processing of any XLIFF element or attribute. For any annotation, allow for a version history. The annotations should be applicable not only to the core XLIFF payload (unsegmented content, segments, sub-segments, source and target) but also to referenced resources such as linked glossaries. Include a mechanism for metadata after acceptance of “alt-trans.”
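For context, XLIFF 1.2 already provides the “note” and “alt-trans” elements that these suggestions would generalize. The snippet below, with invented content, shows the current mechanisms; the proposals above would extend such annotations to arbitrary elements and attributes and add a version history:

```xml
<trans-unit id="menu.save">
  <source>Save</source>
  <target>Speichern</target>
  <!-- human-readable annotation via the XLIFF 1.2 note element -->
  <note from="reviewer">Verify this fits the 20-character button limit.</note>
  <!-- alternative translation, for example a translation memory match -->
  <alt-trans origin="tm" match-quality="85">
    <source>Save</source>
    <target>Sichern</target>
  </alt-trans>
</trans-unit>
```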
Make it easy to manage changes in content in successive source versions, to track changes and to handle linguistic variants due to plurals, gender and the like. Enhance provisions for project information and quality management information (instructions and so on). Add a capability for declarations related to character encoding.
Mention best practices and conventions for including or referencing terminology data; one candidate is the W3C Internationalization Tag Set mechanism for tagging terms or linking to term entries. Allow for a “concept-based” approach to translation, such as concept-based terminology that allows the source to contain a concept identifier, related to the xml:tm approach. Approach authoring from a semantic modeling point of view, rather than from a text writing point of view.
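As a point of reference, XLIFF 1.2 can already mark terms inline with its “mrk” element and the mtype value “term”; the sentence content below is invented for the example. Best practices would clarify how such markers relate to external terminology data:

```xml
<trans-unit id="doc.title">
  <source>Configure the <mrk mtype="term">proxy server</mrk> settings.</source>
</trans-unit>
```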
Consider XLIFF as a format for translation memory exchange or even general content exchange.
The XLIFF Technical Committee intends to use these insights as guidance for its ongoing work on XLIFF, with first results possibly appearing as soon as the Second XLIFF Symposium, planned for autumn 2011.