Is XLIFF positioned correctly?

By Jaime Mateos March 7, 2011

XML Localization Interchange File Format (XLIFF) was first standardized by OASIS in 2002. Since its inception there have been debates in the localization community on how to position XLIFF to maximize the benefits of having a standard file format. This is not a trivial question insofar as the answer influences the goals laid out for the standard.

One option is to position XLIFF as a bridge between development groups and the localization function, whether internal or external. In a software localization workflow (Figure 1) the input files are normally supplied by development teams through some type of content management system (CMS), usually a version control system. These files can be in any kind of file format, but are generally localization-friendly formats such as .properties, .po or .dll. The initial step in the workflow involves extracting the localization-relevant information, such as translatable content and sizing information. This requires specialized parsers for each of the input file formats. All software localization tools come with a number of parsers for common file formats, and most support a plug-in architecture to incorporate custom-built parsers catering for proprietary formats. Once the localizable content has been parsed, the next steps in the localization workflow can happen. At a very high level this normally includes a pre-translation and a translation phase, plus a test/fix cycle where localization defects are detected and corrected. Finally, the translated versions of the input files are generated and added to the development group’s CMS. Target file generation, like parsing, requires software components to handle each supported file format.

In this context, XLIFF can be used as a funnel between the different file formats used by development groups and the software localization tools (Figure 2). This option introduces additional overhead, as two extra transformations are now needed: between the input files and XLIFF and, at the end of the workflow, between XLIFF and the localized versions of the input files. But this also introduces a number of advantages. One of them is to ensure a basic level of internationalization support. Producing an XLIFF file ensures that the localizable content has been identified and separated from the developer’s source code. It also provides localization with a consistent and standard character encoding such as Unicode. XLIFF uses UTF-8 or UTF-16. Any encoding conversions are handled by the transformation to or from XLIFF.
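The first of those two transformations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the .properties reader handles plain key=value lines and nothing else (no escapes, no line continuations), the function names parse_properties and to_xliff are made up for this example, and the output carries just the minimum XLIFF 1.2 structure rather than what a production filter would emit.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def q(name):
    """Qualify an element name with the XLIFF 1.2 namespace."""
    return f"{{{XLIFF_NS}}}{name}"


def parse_properties(text):
    """Hypothetical, very simplified .properties reader: key=value lines only."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":  # skip blank lines and comments
            continue
        key, sep, value = line.partition("=")
        if sep:
            pairs.append((key.strip(), value.strip()))
    return pairs


def to_xliff(pairs, original, source_lang="en"):
    """Wrap extracted (id, string) pairs in a minimal XLIFF 1.2 document."""
    ET.register_namespace("", XLIFF_NS)
    root = ET.Element(q("xliff"), {"version": "1.2"})
    file_el = ET.SubElement(root, q("file"), {
        "original": original,
        "source-language": source_lang,
        "datatype": "javapropertyresourcebundle",
    })
    body = ET.SubElement(file_el, q("body"))
    for key, value in pairs:
        unit = ET.SubElement(body, q("trans-unit"), {"id": key})
        ET.SubElement(unit, q("source")).text = value
    # Returned as a Python string; the file encoding (UTF-8 or UTF-16)
    # is chosen when the document is written to disk.
    return ET.tostring(root, encoding="unicode")


print(to_xliff(parse_properties("ok.label=OK\ncancel.label = Cancel"), "ui.properties"))
```

The reverse transformation, from a translated XLIFF file back to a localized .properties file, would need the same knowledge of the native format, which is why the overhead is paid twice.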

This basic level of internationalization support is already achieved by localization-friendly formats such as .properties or .po files, but it is not a given for proprietary or application-specific file formats, such as the ones used by some installer technologies.

Another advantage is that having a single consistent input to the localization workflow can open opportunities for increased automation. This advantage is more relevant for companies with a diverse set of development groups using different programming frameworks. This is normally the case for large independent software vendors (ISVs) where there may be over a hundred file formats that require localization.

Using XLIFF as a bridge between development and localization also brings some opportunities. Specifically, it allows localization departments to push the burden of the extra transformations onto the development groups. This is how some large ISVs are using XLIFF. One benefit of this approach, beyond the obvious one to the localization departments of offloading some tasks to development, is to make clear that development is responsible for internationalization and as such it needs to ensure that the product is localizable in an efficient manner. In exchange, it gives development groups greater flexibility when choosing the development framework or tools. They are no longer limited to the list of “supported by localization” file formats. So long as development groups can supply and accept XLIFF files, they are free to pick any emerging technology or nonstandard file formats.

There are some drawbacks to using XLIFF positioned in this way. No matter who owns the transformations to or from XLIFF, these transformations add extra overhead to the workflow. This translates into extra complexity as well as added processing, and the organization as a whole, regardless of which specific function or team, will have to contend with it.

Parsers for most of the file formats used in localization are readily available and have been for a long time. Many of these file formats have been designed for localization and, even though they may lack the sophistication of XLIFF, have widespread and mature parser support. In fact, XLIFF tries to solve many of the same problems that were tackled by .po, .resx or .properties files. That’s why, ideally, if XLIFF is positioned between development and localization, it should be a substitute for those file formats and not an intermediary. But the localization file formats used for Java (.properties) and C# (.resx) are tightly tied to their programming language specifications, and it is very unlikely they’ll change. The only attempt to do so that I’m aware of is a project that aimed to replace .po files with XLIFF, but so far it hasn’t gained significant traction.

Between localization tools

A different way to position XLIFF is as a bridge between software localization tools (Figure 3). In this way an XLIFF file would not act as an interchange file format between the files from development and the software localization tools, but among the tools themselves.

Each software localization tool uses its own kind of file repository to store the contents of a localization project. A project repository, such as a .ttk file created with CATALYST, includes the localizable information extracted from the input files, the translations and metadata such as translation status for each string, annotations and skeleton files. These project repositories conform to proprietary file formats owned by the localization tools vendors. In this context, XLIFF can be used as the interchange file format that allows content transfer of one proprietary project repository to another. To support this positioning, a software localization tool needs to provide export/import functionality between its native project repositories and XLIFF.
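The export/import pair described above might look roughly like the following Python sketch. Everything about the "native repository" here is hypothetical: it is modeled as a list of dictionaries, and the functions export_units and import_units are invented for illustration. Only the body of the XLIFF document is produced, and the XLIFF namespace and the enclosing xliff and file elements are left out to keep the example short.

```python
import xml.etree.ElementTree as ET


def export_units(records):
    """Serialize hypothetical repository records as XLIFF-style trans-units."""
    body = ET.Element("body")
    for rec in records:
        unit = ET.SubElement(body, "trans-unit", {"id": rec["id"]})
        ET.SubElement(unit, "source").text = rec["source"]
        target = ET.SubElement(unit, "target", {"state": rec["state"]})
        target.text = rec["target"]
        if rec.get("note"):
            ET.SubElement(unit, "note").text = rec["note"]
    return ET.tostring(body, encoding="unicode")


def import_units(xml_text):
    """Rebuild the hypothetical repository records from the exported XML."""
    records = []
    for unit in ET.fromstring(xml_text).iter("trans-unit"):
        target = unit.find("target")
        rec = {
            "id": unit.get("id"),
            "source": unit.findtext("source"),
            "target": target.text,
            "state": target.get("state"),
        }
        note = unit.findtext("note")
        if note:
            rec["note"] = note
        records.append(rec)
    return records


project = [
    {"id": "ok.label", "source": "OK", "target": "Aceptar", "state": "translated"},
    {"id": "cancel.label", "source": "Cancel", "target": "Cancelar",
     "state": "final", "note": "Button label"},
]
# A round trip should give back exactly what was exported.
print(import_units(export_units(project)) == project)
```

A round trip only works this cleanly because both functions agree on where each piece of information lives in the XML. Between two different tools, that agreement is exactly what has to come from the standard.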

Used in this way, XLIFF becomes the basis for tool interoperability. The advantage of this positioning is lessened tool lock-in with the corresponding flexibility to choose the right tool depending on factors such as the project characteristics, training needs and so on.

A related advantage is that the different players involved in a software localization project can select software localization tools independently. Generally, localization service providers (LSPs) have to use the localization tool chosen by the client. Tool interoperability facilitates scenarios where the LSP and the client companies can use different tool sets and still be able to effectively transfer project information.

Using XLIFF as a bridge among software localization tools does not preclude using it also as a bridge between development and localization, but it doesn’t require it. In fact, this use of XLIFF poses no restrictions on the localization workflow.

Between localization functions

Most software localization tools available in the market today are desktop applications designed as one-stop solutions, but widespread increases in network bandwidth as well as the rising use of cloud computing are creating the basis for new ways to deliver localization services. In this new model, localization functions such as parsing, leveraging, segmentation, machine translation, quality assurance and so on can be provided over the network (whether an intranet or the internet).

This scenario (Figure 4) is closely related to the one presented in Figure 3. In fact, it can be seen as an evolution, where software localization tools are replaced by software localization functions delivered as web services. This emerging framework promises a more flexible architecture, increased automation and reductions in process management overhead through the use of automated workflows. But it will require reimplementing existing functionality, moving it to the cloud.

Web services, either SOAP or REST, exchange information encoded as XML documents. In this scenario XLIFF becomes the transport payload used by software localization web services.

It is also possible to use XLIFF as a localization project repository, though the mention of this positioning is somewhat anecdotal. Though some localization tools use .xlf files as their native repository, trying to use XLIFF in this way showcases the limitations of XML as a database file format: lack of indexes, lack of efficient storage, lack of transactions and so forth. When using XLIFF in this way, performance constraints impose limits to the size of projects a tool can efficiently handle.

The main software localization tools, commercial or in-house, all use binary formats for their project serializations. These may be proprietary formats or database repositories, but they provide better scalability and are all optimized for the types of operation expected of a modern integrated localization environment.

A look at the standard

The latest version of the XLIFF specification is 1.2, released in February 2008. The XLIFF Technical Committee has also released three related documents as committee drafts: the representation guides for HTML, Java Resource Bundles and .po files.

The different ways in which XLIFF can be positioned make different demands on the standard. If XLIFF is used as a bridge between development and localization as previously described, the data that needs to be represented is the localizable content included in the input files. There are many such input file formats with a great deal of variability on what localizable data they include, how the data is represented and the associated metadata. Some, such as .properties files, can only describe string-tables made up of simple pairs of identifier-string; others, such as .rc or .resx, can describe string-tables and other resource types such as dialogs/forms and menus. Thus, they not only include strings, but also other localizable information such as sizing and position information (Figure 5).

These file formats differ not only in what they include but also in how they represent localizable data. For example, it is common for software strings to include variables that are replaced at run-time with live data. These variables become non-editable codes inside the strings and are represented differently depending on the file format ({0} or %s). The same applies to other nontextual elements such as format specifiers or formatting codes.
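As a rough illustration of how a filter might protect such variables, the Python sketch below wraps each one in an XLIFF ph inline element. It is a simplification under stated assumptions: the regular expression recognizes only {0}-style and %s/%d-style placeholders, it does not treat an escaped %% specially, and the function name source_with_codes is invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Java MessageFormat style ({0}) and a tiny subset of printf style (%s, %d).
# Assumption: real filters recognize far more than this.
PLACEHOLDER = re.compile(r"\{\d+\}|%[sd]")


def source_with_codes(text):
    """Build a <source> element whose placeholders are wrapped in <ph> codes."""
    source = ET.Element("source")
    last = None  # most recently added <ph>, or None before the first one
    pos = 0
    for n, match in enumerate(PLACEHOLDER.finditer(text), start=1):
        literal = text[pos:match.start()]
        if last is None:
            source.text = literal   # text before the first code
        else:
            last.tail = literal     # text between two codes
        last = ET.SubElement(source, "ph", {"id": str(n)})
        last.text = match.group()   # keep the native code inside the element
        pos = match.end()
    if last is None:
        source.text = text          # no placeholders at all
    else:
        last.tail = text[pos:]      # text after the last code
    return source


print(ET.tostring(source_with_codes("File {0} not found"), encoding="unicode"))
# <source>File <ph id="1">{0}</ph> not found</source>
```

The same function handles "%s copied to %s" by emitting two ph elements, so a translation tool can show both native syntaxes as the same kind of locked code.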

Finally, the associated metadata incorporated in these file formats is also different. While some lack a standard way to supply it, others, such as .po files, have specified ways of conveying information such as a translator’s comments and flags, all of which have localization value as in this .po fragment:

# This is a translator comment
#, c-format
#: install.h:26
msgid "%% .fff file not found."
msgstr "%% .fff file not found."
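To make the role of each line concrete, here is a minimal Python sketch that sorts the lines of one .po entry into translator comments, flags, source references and strings. It assumes the usual gettext markers ("# " for translator comments, "#," for flags, "#:" for references) and single-line strings without escaped quotes. The function parse_po_entry is written for this example and is not part of any gettext library.

```python
def parse_po_entry(lines):
    """Sort the lines of one simple .po entry by their role (simplified)."""
    entry = {"translator_comments": [], "flags": [], "references": [],
             "msgid": "", "msgstr": ""}
    for line in lines:
        line = line.strip()
        if line.startswith("#,"):
            # Flags are comma separated, e.g. "#, c-format, fuzzy".
            entry["flags"] += [f.strip() for f in line[2:].split(",") if f.strip()]
        elif line.startswith("#:"):
            # References are whitespace separated file:line pairs.
            entry["references"] += line[2:].split()
        elif line.startswith("# "):
            entry["translator_comments"].append(line[2:])
        elif line.startswith("msgid "):
            entry["msgid"] = line[len("msgid "):].strip().strip('"')
        elif line.startswith("msgstr "):
            entry["msgstr"] = line[len("msgstr "):].strip().strip('"')
    return entry


sample = [
    "# This is a translator comment",
    "#, c-format",
    "#: install.h:26",
    'msgid "%% .fff file not found."',
    'msgstr "%% .fff file not found."',
]
print(parse_po_entry(sample))
```

Each of these categories needs a home in the XLIFF document if the information is to survive the conversion, and the standard offers more than one place to put it.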

To support this variability, XLIFF needs to provide both a comprehensive array of options and also the flexibility to pick and choose among them. This is reflected throughout the standard but especially in the <trans-unit> element. This element has 29 attributes of which 28 are optional. It is also reflected in the list of inline elements that can be included in the <source> and <target> elements. There are eight types of inline elements catering for codes with/without begin and end tags, codes that cross <trans-unit> boundaries or not, and so on.

This flexibility allows the specification to support a large number of file formats. But it also creates the potential for interoperability problems. Filters extracting content from the input files into an XLIFF document can represent the same extracted data in different ways, all of them compliant. This is the problem that the representation guides for HTML, Java Resource Bundles and .po files are trying to solve. They specify a unique way to translate the localizable data inside those file formats into an .xlf file. However, the need for representation guides points to a bigger problem. Many file formats lack a representation guide, and there will be new formats coming as new programming languages appear. One option can be to expand the number of representation guides and to keep updating the existing ones. Unfortunately, this poses a considerable maintenance burden. Another option is to use “vanilla” XLIFF, that is, for filters to dismiss any localizable data beyond the source strings and just use the minimum mandatory elements and attributes required to produce an XLIFF document. This is certainly an option for file formats that only describe string-tables. However, doing it with file formats capable of describing other resource types would limit the functionality that localization tools can offer. For example, dismissing the form information contained in a .resx file (position, size and so on) would prevent a tool from providing a form visual editor.
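A small Python sketch shows how two compliant representations can still trip up a consumer. The two fragments are assumptions made for illustration: a hypothetical filter A keeps the native code inside a ph element, while a hypothetical filter B uses an empty x element and is presumed to keep the native code elsewhere, for instance in its skeleton file. The consumer function rebuild_native is deliberately naive.

```python
import xml.etree.ElementTree as ET

# Two hypothetical filters extracting the same string, "File {0} not found".
FILTER_A = '<source>File <ph id="1">{0}</ph> not found</source>'
FILTER_B = '<source>File <x id="1"/> not found</source>'


def rebuild_native(source_xml):
    """Naive consumer: assumes native codes travel inside the inline elements."""
    return "".join(ET.fromstring(source_xml).itertext())


print(repr(rebuild_native(FILTER_A)))  # 'File {0} not found'
print(repr(rebuild_native(FILTER_B)))  # 'File  not found' (the code is lost)
```

Neither fragment is wrong. The consumer simply made an assumption that one filter honors and the other does not, which is the gap a representation guide is meant to close.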

So XLIFF’s flexibility is double-edged: it expands the range of formats supported by the standard, but it increases the scope for interoperability issues.

If XLIFF is used as a bridge between software localization tools, the data that needs to be represented in XLIFF is the content of a localization project. This includes the information extracted from the original files and also, potentially, the results of the pre-translation, translation and the test-fix cycle. XLIFF has a rich and flexible feature set, but because of its flexibility, it is difficult to see how interoperability can be achieved without a representation guide. A localization project repository will always contain information that is tool specific. Tools differentiate themselves through their feature set, and this is a basis for their competitive advantage. XLIFF does not need to relay all of this tool-specific information, but should reflect the consensus on the core information that all tools need when importing a localization project. A representation guide for a localization project would provide a clear target for the development of the import/export functionality needed to use XLIFF as an intermediary between software localization tools.

It is interesting to note that there are two different scenarios where XLIFF could be used to transfer a software localization project between different tools: intra-project (during the life of a project) or inter-project (in between the end of the localization cycle for a version of a product and the start of the localization of the next version). The demands on XLIFF are different in each case. There is metadata, such as history logs, available fuzzy matches and so on, that, while relevant during a localization project, is often discarded when the project data is archived at the project’s end. Generally, the information archived at the end of a project is a subset of the overall information used during a project.
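The difference between the two scenarios can be sketched as a filtering step applied when a project is archived. The Python example below assumes that available fuzzy matches are stored as alt-trans elements inside each trans-unit, which is one way XLIFF 1.2 can carry them. The function archive_copy and the choice of what to drop are illustrative, not a recommendation from the specification.

```python
import xml.etree.ElementTree as ET

NS = "urn:oasis:names:tc:xliff:document:1.2"

IN_PROJECT = f"""<xliff version="1.2" xmlns="{NS}">
 <file original="ui.properties" source-language="en" target-language="es"
       datatype="javapropertyresourcebundle">
  <body>
   <trans-unit id="ok.label">
    <source>OK</source>
    <target state="translated">Aceptar</target>
    <alt-trans match-quality="85%"><target>Vale</target></alt-trans>
   </trans-unit>
  </body>
 </file>
</xliff>"""


def archive_copy(xliff_text, drop=("alt-trans",)):
    """Return a parsed copy without the in-project metadata named in `drop`."""
    root = ET.fromstring(xliff_text)
    drop_tags = {f"{{{NS}}}{name}" for name in drop}
    # Collect the units first so the tree is not modified while it is walked.
    for unit in list(root.iter(f"{{{NS}}}trans-unit")):
        for child in list(unit):
            if child.tag in drop_tags:
                unit.remove(child)
    return root


archived = archive_copy(IN_PROJECT)
print(len(archived.findall(f".//{{{NS}}}alt-trans")))  # 0
```

The source and target survive the archive step unchanged. An intra-project transfer would skip this step, because the receiving tool may still need the matches.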

Support for standards is generally regarded as a valuable feature. Accordingly, many software localization tools include support for XLIFF (Table 1). In some cases the support extends to accepting XLIFF files as translation memories (for pre-translation), but generally the support consists of including XLIFF as one of the supported input files. This is the support required to use XLIFF as a bridge between development and localization and, in practical terms, involves providing a parser as well as generating target XLIFF files.

However, none of the tools listed in Table 1 offer the type of export/import functionality to/from XLIFF that can be used for round-trip conversions between their project repository and an XLIFF document. That is, no software localization tool attempts to use XLIFF as the basis for tool interoperability. There are a number of possible reasons for this. From a business perspective it may be safer to support XLIFF only as an input file; also, there are questions about the maturity of the standard. In any case, the end result is that XLIFF usage is limited to that of intermediary between localization and development, but the benefits of using XLIFF this way are limited, which may explain the corresponding limited uptake of XLIFF across the industry. Issues such as increased overhead or lack of representation guides make XLIFF a less than obvious choice except, maybe, for niche file formats without existing parser support.

A move towards positioning XLIFF as the interoperability standard for software localization tools would open some new and interesting possibilities. In the short term it would lessen tool lock-in by reducing the cost of transitioning between tool sets. Eventually it would allow the different participants in the localization workflow to choose the tools that better suit their different needs, independently of each other. It could also become the basis for delivering software localization web services.

However, for this to happen the standard needs to provide an unambiguous target that minimizes the scope for incompatibilities. As it is, XLIFF is too flexible to provide such a target. This flexibility is reflected in the structure, the extensibility mechanisms and the user-defined values.

Defining a clear target for localization tools vendors can be achieved in several ways. Maybe the standard can be changed or maybe the standard can remain flexible but offer a representation guide. Either way, this would allow XLIFF to increase its value to the localization industry.