XLIFF mixed with XML functionalities

XML and XML Localization Interchange File Format (XLIFF) are often used for transporting text between systems, especially content systems and translation systems. But even though they’re great formats, there can be issues. The file format specification might not have been fully understood, or perhaps a user is trying to do things with these formats they were not intended for.

The technical project manager or localization engineer might then have to spend some time explaining to the client what a translatable file should look like or what a translation tool would expect from a specific file format. They will need to be able to spot the issues, understand why they are issues and potentially communicate this to non-technical employees on the customer side.

After many years of tracking down such mistakes and making files behave in translation tools, I would like to share some of the most common issues I have seen. They might help you advise your customers on how to improve the translatability of their files.

Extension and content need to match

It is true that XLIFF is XML-based, as the name itself states. But simply changing the extension from XML to XLIFF does not really do the trick, and will lead to an error message or no content of the file being imported into a translation tool (Figures 1 and 2).

A translation tool selects the import filter based on the file extension as well as the content of the file. If the two do not match, the tool will either suggest the wrong filter or fail to import the file.

The same is true when an XLIFF file uses the *.XML extension. Again, the extension and content don’t match, and the wrong filter will be suggested for import (Figure 3). If nobody checks the filter, there might be a lot more content (which is not translatable at all) showing up in the translation tool than expected.
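A quick way to catch such a mismatch before import is to look at the root element instead of the extension. The following Python sketch only illustrates the idea; the function name sniff_format and the labels it returns are my own assumptions and do not come from any translation tool.

```python
import xml.etree.ElementTree as ET
from io import BytesIO


def sniff_format(data: bytes) -> str:
    """Guess the format from the root element, ignoring the file extension."""
    try:
        # Only the first start tag is needed; the rest of the file is not inspected.
        for _event, elem in ET.iterparse(BytesIO(data), events=("start",)):
            # ElementTree writes namespaced names as "{namespace}local".
            local_name = elem.tag.rsplit("}", 1)[-1]
            return "xliff" if local_name == "xliff" else "xml"
    except ET.ParseError:
        pass
    return "not well-formed XML"


print(sniff_format(b'<xliff version="1.2"><file/></xliff>'))
print(sniff_format(b"<root><a/></root>"))
```

If the result disagrees with the extension, the file name should be corrected before anyone chooses a filter.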

You might think that these mistakes aren’t very common, but the latest example I received (from October 2019) shows that it still happens.

The file was called Translate.xlf.txt, with a description from the client that the translator should use a text editor for translation. The project manager asked me whether this file could be processed in their translation tool. They had tried to import it with a TXT filter (obviously) and did not like what they saw in the translation editor.

XML: Don’t touch the codes

Even though an author or developer knows how an XML file needs to be structured and might have defined the structures themselves, it is never a good idea to change anything inside an XML file manually. The most common issues I have seen in recent examples are:

∙ Missing elements, i.e. the end tag of a tag pair is missing.

∙ Incomplete deletions, for example when an attribute was not fully deleted and the closing quote mark of the deleted attribute was left in the file.

∙ Element names that don’t match (How to set up a filter for XML files)

∙ Elements (tags) where they are not allowed. This might also happen in an authoring tool if the structures have not been defined correctly.

∙ Incomplete content. The content of a file had been copied but was missing the last few closing tags for the file to be valid.
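Most of the issues in this list break well-formedness, so any XML parser will find them before the file reaches the translation tool. Here is a minimal sketch using Python's standard library; the helper name check_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text: str) -> str:
    """Return "ok", or the parser's message for the first problem it finds."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column of the problem.
        return str(err)
    return "ok"


# A start tag whose name does not match its end tag (XML names are case-sensitive).
print(check_well_formed("<doc><title>Filters</Title></doc>"))
# An attribute that was only partly deleted, leaving a stray quote mark behind.
print(check_well_formed('<doc><p ">text</p></doc>'))
# A copy that lost its last closing tags.
print(check_well_formed("<doc><p>text</p>"))
```

Elements in places where the structure definition does not allow them are a different matter: such a file can still be well-formed, and only validation against the DTD or schema will flag it.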

XML is a container, but not for just anything

XML is often used as a container for text in another file format, mostly HTML. There are tried and tested ways of adding HTML content into an XML file, like using the CDATA section or entities for HTML codes.

CDATA section

Embedding HTML-coded text within a CDATA section will let any tool processing the XML file see this content as pure text. Angle brackets will not be interpreted as the beginning marker of an XML element.

Entities

Any character in the text that could be interpreted as part of an XML element is written as an entity.
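To see that both mechanisms lead to the same result, the short Python sketch below parses the same HTML fragment once from a CDATA section and once written with entities. The element name item and the sample text are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# The same HTML fragment, embedded once in a CDATA section and once with entities.
with_cdata = "<item><![CDATA[Click <b>Save</b> & continue]]></item>"
with_entities = "<item>Click &lt;b&gt;Save&lt;/b&gt; &amp; continue</item>"

text_from_cdata = ET.fromstring(with_cdata).text
text_from_entities = ET.fromstring(with_entities).text

# Both forms hand the parser plain text; no <b> child element is created.
print(text_from_cdata)
print(text_from_cdata == text_from_entities)
```

In both cases the XML structure stays intact, and an HTML filter can later be applied to the extracted text.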

If these mechanisms are not known or understood, the developer might come up with an idea on how to mask HTML codes in the XML file so that they don’t interfere with the XML structures.

HTML brackets were substituted with square brackets so that the XML-processor would not try to interpret these things as tags.

Unfortunately, this is nothing a translation tool will understand directly. The one who prepares the tools for this file will need to know about regular expressions and how to use them to convert text into tags for the translation process.
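As a rough idea of what such a regular expression might look like, here is a Python sketch that turns bracketed markers back into tags. The list of tag names is an assumption on my part; in a real project it would have to come from the client, because ordinary text in square brackets must not be converted.

```python
import re

# Hypothetical list of HTML tag names that were masked with square brackets.
KNOWN_TAGS = ("b", "i", "br", "p")

# Matches [b], [/b] or [br/], but only for the names listed above.
PSEUDO_TAG = re.compile(r"\[(/?(?:%s)/?)\]" % "|".join(KNOWN_TAGS))


def brackets_to_tags(text: str) -> str:
    """Turn [b]...[/b] style markers back into <b>...</b> tags."""
    return PSEUDO_TAG.sub(r"<\1>", text)


print(brackets_to_tags("[b]Save[/b] your work [optional][br/]"))
```

Restricting the pattern to known names is a deliberate choice: the text [optional] in the example is left alone, while the masked tags are restored.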

And although it is absolutely legitimate to put HTML-encoded text into an XML file, this only makes sense for text formatting, not for structural information. Figure 4 contains not only text formatting, but also information about the setup of the table. The developer at least knew that this should be outside of the translatable content and surrounded the layout information with square brackets. But again, a translation tool does not understand square brackets as a marker for “this does not belong to the text.”

XML: Creative entities

When looking at the way an entity is written, it seems that it only needs an ampersand, the name of the character and a semicolon. This prompted one creative developer to “create” his own entity from a carriage return line feed character (CRLF).

As such entity descriptions do not exist, a translation tool does not really know how to deal with this and will import them as pure text.
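How a tool reacts to an invented entity differs. A strict parser such as the one in Python's standard library rejects it outright, while a numeric character reference for the same characters is accepted. In the sketch below, the entity name crlf is made up for illustration and is not necessarily the one the developer used.

```python
import xml.etree.ElementTree as ET

# "&crlf;" is a made-up entity name used only for this illustration.
invented = "<text>Line one&crlf;Line two</text>"
# "&#13;&#10;" are numeric character references for carriage return and line feed.
numeric = "<text>Line one&#13;&#10;Line two</text>"

try:
    ET.fromstring(invented)
    invented_result = "parsed"
except ET.ParseError as err:
    invented_result = f"rejected: {err}"

print(invented_result)
print(repr(ET.fromstring(numeric).text))
```

Numeric character references need no declaration, which makes them the safer choice when a character has to be written in encoded form.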

XML specification could be misunderstood by nonnative English speakers

Some time ago, I had a somewhat heated discussion with a developer whose XML files would show an error message upon import into a translation tool: “Error at…. The file does not seem to be a valid XML file.”

Apparently, the XML instruction that certain characters (namely < > ' " &) need to be encoded as entities when they are used in the text itself is written like this: “For interoperability, valid documents should declare the entities amp, lt, gt, apos, quot, in the form specified in 4.6 Predefined Entities…”

For a German, “should” translates into “can be, but does not have to be.” Because of this, the developer only encoded four of the five characters as entities and maintained that the fifth one did not have to be an entity, because the specification only says “should.” When I mentioned that the translation tools would expect all of these characters to be encoded as entities, his response was “then your tools are doing it wrong.” His own point of view, of course, but not very helpful for the translation process.

Actually, not all tools will express an error here — some will just internally convert the characters to entities and also export them as such. On the other hand, this could then lead to trouble back in the content system, if no entity was expected there.
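For developers who write their XML by string concatenation, encoding all five characters consistently is a small step. This is a minimal sketch with Python's standard library; the function name encode_for_xml is my own.

```python
from xml.sax.saxutils import escape

# escape() always handles &, < and >; the two quote marks are added here.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def encode_for_xml(text: str) -> str:
    """Encode all five characters with the predefined entities."""
    return escape(text, QUOTE_ENTITIES)


print(encode_for_xml('Tom & Jerry say "5 < 7" isn\'t news'))
```

Whether the content system on the receiving end expects the quote marks as entities is something to agree on with the client beforehand.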

XML is text-only

Often, the text that ends up in an XML file comes from another system. That system might use control characters for line breaks and such (Figures 5 and 6). These control characters are not allowed by XML and therefore lead to error messages during import. Unfortunately, most translation tools stop at the first mistake that they find. Once the mistake has been corrected, the import will stop at the next control character and so on. It would be nice for our tools to create a list of such control characters to let the user correct all of them in one go instead of having to play around with multiple imports.
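Until the tools offer such a list, a small script can produce it. The Python sketch below reports every control character that XML 1.0 does not allow, with its line and column; the function name list_control_characters and the sample text are assumptions for illustration.

```python
import re

# Characters below U+0020 are not allowed in XML 1.0, except tab, LF and CR.
ILLEGAL_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def list_control_characters(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, code point) for every illegal control character."""
    findings = []
    # split("\n") is used on purpose: splitlines() would also break lines
    # at some of the control characters this check is looking for.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in ILLEGAL_CONTROL.finditer(line):
            code_point = f"U+{ord(match.group()):04X}"
            findings.append((line_no, match.start() + 1, code_point))
    return findings


sample = "<p>First\x0bline</p>\n<p>Second\x1fline\x00</p>"
for finding in list_control_characters(sample):
    print(finding)
```

With the complete list in hand, all characters can be corrected in one pass instead of one import attempt at a time.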

XML: Incorrect structures

Although the structure of Figure 7 looks nice and clean at first glance, a translation tool will not import this file.

Why? Because the XML specification states that the name of an element must not start with a digit.
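To list all offending names at once, a simple pattern search over the raw file is enough. The Python sketch below is a plain text scan and not a parser, so it does not know about comments or CDATA sections; the function name and the sample element names are invented.

```python
import re

# Matches start and end tags whose name begins with a digit, e.g. <1st> or </2nd>.
DIGIT_FIRST_TAG = re.compile(r"</?(\d[\w.\-]*)")


def names_starting_with_digit(xml_text: str) -> list[str]:
    """Return the distinct offending element names in order of appearance."""
    seen = []
    for name in DIGIT_FIRST_TAG.findall(xml_text):
        if name not in seen:
            seen.append(name)
    return seen


sample = "<steps><1st>Open</1st><2nd>Save</2nd><step3>Close</step3></steps>"
print(names_starting_with_digit(sample))
```

A name such as step3 is fine, because the restriction only applies to the first character.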

Unnecessary elements in XML

In this case, the original format was text-based. For layout purposes, it was then converted to something containing RTF codes for special characters. This, or so it appears, was then converted again into some kind of XML. The RTF codes for the special characters were converted to HTML codes (Figures 8 and 9).

Although the XML file does import into the translation tool, the text does not look very nice. The translator can of course leave out the tags if the special character (here the a-umlaut in German) is not needed in the translation, but would be left with hundreds of tag error messages for missing tags during the quality check. In addition, the translation memory (TM) would be filled with many segments containing tags that are unnecessary (Figure 10).

I am assuming (unfortunately we could never find out explicitly) that either the customer will need to convert the XML back to the RTF layout, or that the developer did not really understand what XML and Unicode could do for them.

XML: Inline tag with different requirements

Another example is the use of an element as an inline tag, but with different requirements as to what the tags should do. The tag in question was named. When it appears within the…structure, it should be treated as any inline tag, as it represents formatting. But when it appears within the…structure, suddenly only the text between thetags should be open for translation; the rest of the segment should not appear for translation.

In this case, an XML filter will not be able to deal with the requirements. There is no way to tell a filter to treat a certain tag in different ways within the same file. The best solution would be to use different tag names for different requirements. The next best solution was to use a text filter instead of an XML filter and regular expressions to define the translatable content (Figure 11). As the sequence in which regular expressions are listed in the filter of the translation tool has an impact on the text that is processed, the definitions were:

• Extract text between thetags first.

• Extract then the text betweentags.

• Extract then the text between the remainingtags.

• Convert thetags between thetags so that they show up as inline tags.

Let’s move on to XLIFF, as this is a format that is increasingly sent to translation directly created out of the source content system, such as a content management system.

XLIFF: Bilingual setup required

XLIFF was created to contain content in the source and the target language so that this bilingual content could be transported between systems, mainly for the translation process. It follows that the XLIFF file contains information concerning which language is the source language, and which is the target language. The following example does state the source and target language — unfortunately, not a specific one.

It is understandable that a translation tool will not know what to do with such a file.
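A check for usable language attributes can be automated before a file is handed over. The Python sketch below assumes XLIFF 1.2, where the languages are given as source-language and target-language on the file element; the pattern for a language code is a rough approximation of my own and not a full validator.

```python
import re
import xml.etree.ElementTree as ET

# Rough shape of a language code such as "de" or "en-US" (an assumption).
LANG_CODE = re.compile(r"[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*")


def language_problems(xliff_text: str) -> list[str]:
    """Check source-language and target-language on every <file> element."""
    problems = []
    root = ET.fromstring(xliff_text)
    for elem in root.iter():
        # Compare local names so the check works with or without a namespace.
        if elem.tag.rsplit("}", 1)[-1] != "file":
            continue
        for attr in ("source-language", "target-language"):
            value = elem.get(attr)
            if value is None:
                problems.append(f"{attr} is missing")
            elif not LANG_CODE.fullmatch(value):
                problems.append(f"{attr} has no usable value: {value!r}")
    return problems


sample = '<xliff version="1.2"><file source-language=""><body/></file></xliff>'
print(language_problems(sample))
```

An empty list means both attributes are present and at least look like language codes; whether they name the right languages still needs a human look at the content.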

HTML

Unlike XML, which can be used quite easily for HTML content, XLIFF is not well-suited for this. The translation tools don’t expect XLIFF to contain HTML, and therefore they don’t usually offer a way to mark up the HTML-type content correctly. Another issue that I often see when XLIFF contains HTML is that one source segment in the XLIFF file contains several pages of HTML-type text. As HTML breaks are not recognized inside XLIFF, this means all that text between thetags makes up one big long segment.

The solution could be to use a search and replace action, copy the content from between thetags to the tags and use a regular XML filter to access only the text between the tags. In this case an HTML filter could be added to mark up and segment the content correctly.

XLIFF mixed with XML functionalities

XML allows a lot of information inside attributes, like the length restriction for the translation or a comment. Some of these attributes might not be supported by an XLIFF filter. Figure 12 shows an XLIFF file that contains an attribute indicating whether the text should be translatable or not, namely translate="no" (which an XLIFF filter can deal with), and also an attribute for the length restriction (which an XLIFF filter has no way of using).

The XLIFF filters in our tools might evolve to allow for such settings, but until then, a workaround is necessary. One option is to translate the file using a regular XLIFF filter, which would not allow you to use the length restriction. The other is to copy the source text into the target tags, use an XML filter and use the settings available there to indicate what text is non-translatable and to retrieve the information on length restriction.
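The copy step of the second workaround can be scripted. The following Python sketch assumes an XLIFF 1.2 file that uses the standard namespace and has trans-unit elements without target elements; the function name copy_source_to_target is my own, and real files may need more care, for example with alt-trans elements or existing targets.

```python
import copy
import xml.etree.ElementTree as ET

NS = "urn:oasis:names:tc:xliff:document:1.2"


def copy_source_to_target(xliff_text: str) -> str:
    """Add a <target> holding a copy of <source> to units that lack one."""
    # Keep the XLIFF namespace as the default namespace in the output.
    ET.register_namespace("", NS)
    root = ET.fromstring(xliff_text)
    for unit in root.iter(f"{{{NS}}}trans-unit"):
        source = unit.find(f"{{{NS}}}source")
        if source is None or unit.find(f"{{{NS}}}target") is not None:
            continue  # nothing to copy, or a target already exists
        # A deep copy keeps inline elements inside the source text.
        target = copy.deepcopy(source)
        target.tag = f"{{{NS}}}target"
        target.attrib.clear()
        # Place the new target directly after its source.
        unit.insert(list(unit).index(source) + 1, target)
    return ET.tostring(root, encoding="unicode")
```

The resulting file can then be imported with an XML filter that exposes only the target content, while attributes on the trans-unit element remain available for the filter settings.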

When users are new to XLIFF

The latest example (October 2019) of an XLIFF file that was sent to me to evaluate its translatability combined several issues and showed that the intention and setup of XLIFF were not fully understood. Can you spot them all in Figure 13?

1. The XLIFF file contains the attribute for the source language, but none for the target language. XLIFF files have a bilingual setup and need to state both.

2. The source language attribute shows the ID for the target language (source-language="en"), but the source language in this file is actually German.

3. The…tags contain text to tell the translator where to type the translation. The client (as became obvious from their instructions on how to translate the file in a text editor) wanted to make it clear where the translation should appear. When imported into a translation tool, the marker would appear in the target language column, thus usually preventing the translation tool from inserting matches from the TM automatically, because the segment already contained text.

4. The closing tags,andat the end of the file were missing, making the file invalid.