Combat DITA gotchas with an LBOM

The documentation department of an enterprise-scale software company sent a large document to a vendor for translation. The full document included a collection of individual files in DITA format. The company received the translated set of files back, and then built the composite output document.

To their surprise and disappointment, the “translated” document contained numerous untranslated passages. What went wrong? Did the translation company make a mistake? No, the vendor fully translated every submitted file. So, why were passages left untranslated in the published output document, scattered throughout otherwise correctly translated text?

Software localizers are familiar with this kind of issue, which often stems from hard-coded strings. But there aren’t hard-coded strings in flowing-text documents. Or are there? Did the problem have something to do with the DITA format? And, more importantly, how could the company avoid this problem in the future?

To understand the root cause of this problem, we need to take a quick look at DITA. DITA stands for Darwin Information Typing Architecture. It is a way of organizing information according to how that information is used. DITA files are XML files. Authors categorize information chunks by enclosing them in XML tags that provide semantic context for the information contained within them. Typical DITA tags include <reference>, which identifies reference material that could support execution of a task; <conbody>, which contains body text describing a concept; and <prolog>, which designates document metadata that frequently does not require translation. Three primary advantages of DITA are consistency, repeatability of information structure, and facilitation of content reuse. The granularity of DITA files allows authors to recombine information chunks in many different document types in the same way that recombining Lego pieces produces different structures.

DITA methodology breaks documentation projects into many component files. A master document called a ditamap controls assembly of the files for publication. Breaking up publications in this way provides great advantages for reuse and recycling of content in a variety of output formats. The downside in the context of localization is that there are many files to manage. The file management burden introduces risk of forgetting or overlooking requisite files. The way in which ditamap files work exacerbates this risk. Therefore, understanding the limitations of ditamaps vis-à-vis localization is important.

Ditamaps can contain several types of information. The most basic is a list of the files included in the published document. For that reason, it is not illogical for a translation project manager to base a document translation project on the document’s ditamap. Unfortunately for the project that inspired this article, that assumption proved erroneous.

A ditamap does not necessarily list every file that makes up the full document. Therefore, a ditamap in and of itself is highly likely to be insufficient as a definitive basis for a translation project. It is a good start, but not necessarily a good end. Alongside a list of topic files, a ditamap can contain references to other ditamaps or to any other type of file, usually via a link. But that is still not the whole picture.
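To make the limitation concrete, the following minimal Python sketch lists the files that a ditamap names directly. The map content and file names are hypothetical, and the sketch assumes that every file reference in the map appears as an href attribute. A nested map such as the one referenced here would need to be opened and read in turn, and content pulled in from inside the topics would not appear in this list at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical ditamap; the file names are made up for illustration.
SAMPLE_MAP = """\
<map>
  <title>Installation Guide</title>
  <topicref href="intro.dita">
    <topicref href="tasks/install.dita"/>
  </topicref>
  <mapref href="appendix.ditamap"/>
</map>
"""


def map_hrefs(map_xml):
    """Return every href value found in a ditamap, in document order."""
    root = ET.fromstring(map_xml)
    return [el.get("href") for el in root.iter() if el.get("href")]


print(map_hrefs(SAMPLE_MAP))
# ['intro.dita', 'tasks/install.dita', 'appendix.ditamap']
```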

There are additional content types not captured in ditamaps as file references that may need inclusion in the translation project. These include:

Conrefs. Conrefs are pointers to topics or chunks of text from other topics. For example, a conref might point to a specific paragraph in a different file that the publishing process automatically includes in the calling file when generating the output document. The conref attribute on a DITA element defines a conref. Ditamaps do not list conrefs. Authors embed conrefs in topics.

Xrefs. Xrefs are cross-reference pointers that contain links to other files. Authors embed xrefs in topics just like conrefs. They are not present in ditamaps. The DITA tag <xref> defines a cross-reference.

Variables. Topics can contain variables. The output process populates variables with values during publication. The variable list may be independent of the DITA map.

Conditional text. Conditional texts are chunks of text embedded in topics. Inclusion or exclusion depends on conditions defined for publication. The translation project specification must account for conditional text.

Document-specific information defined within the ditamap itself. The document title is an example. The <title> tag in a ditamap defines the title of the document.

Graphics files. Graphics files sometimes contain localizable content. Sometimes, they do not.
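For the conditional text item in the list above, a project manager may want to know which conditions a topic actually uses before writing the project specification. The sketch below is one possible approach, not a definitive tool. It assumes that authors mark conditions with the standard DITA profiling attributes (audience, platform, product, props, otherprops); a given project may use others. The sample topic is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard DITA profiling attributes; a project may define additional ones.
PROFILING_ATTRS = ("audience", "platform", "product", "props", "otherprops")

# Hypothetical topic used only for illustration.
SAMPLE_TOPIC = """\
<concept id="backup">
  <title>Backups</title>
  <conbody>
    <p audience="admin">Schedule nightly backups.</p>
    <p platform="linux windows">Use the command-line tool.</p>
    <p>Backups are encrypted.</p>
  </conbody>
</concept>
"""


def conditions_used(topic_xml):
    """Map each profiling attribute to the set of values found in a topic."""
    found = {}
    for el in ET.fromstring(topic_xml).iter():
        for attr in PROFILING_ATTRS:
            value = el.get(attr)
            if value:
                # Profiling values are space-separated lists.
                found.setdefault(attr, set()).update(value.split())
    return found


print(conditions_used(SAMPLE_TOPIC))
```

Running the function over every topic in a project gives a list of conditions that the translation specification can then address one by one.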

Conrefs were the culprits in the ill-fated translation project. In an abstracted way, conrefs are analogous to hard-coded strings. The presence of hard-coded strings in software code can be obscure because programmers bury them in the code content. Detection is possible only after building a localized or pseudolocalized product, or in specialized localization tools.

In the case of conrefs, translation project managers must carry out specialized detection. Because flowing-text documents can be large, visual inspection of a pseudolocalized version may be too time-consuming to be feasible. Furthermore, if a publishing system assembles the component files, it is necessary to verify that the publishing system itself does not introduce untranslated content into the final output. A targeted visual inspection may be useful to guard against this eventuality, and pseudolocalization can help with it. Here, pseudolocalization into a language that differs substantially from the source language makes the most sense. For example, pseudolocalizing English into Japanese for a visual review will cause any remaining rogue English in the output to pop out visually.
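When the output is too long to review by eye, the same idea can be approximated in code. The sketch below is a rough heuristic, not an established tool: it assumes English source text pseudolocalized into a non-Latin script, and it flags any run of two or more consecutive Latin-alphabet words. Single Latin words are ignored because they are often product names. The function name and sample sentence are invented for illustration.

```python
import re

# Two or more consecutive Latin-alphabet words separated by whitespace.
# Single words are skipped to reduce noise from product names.
LATIN_RUN = re.compile(r"[A-Za-z]+(?:\s+[A-Za-z]+)+")


def find_latin_runs(text):
    """Return runs of Latin-script words that may be untranslated source text."""
    return LATIN_RUN.findall(text)


sample = "設定を保存します。Click the Save button to continue. DITA は便利です。"
print(find_latin_runs(sample))
# ['Click the Save button to continue']
```

A heuristic like this will miss short untranslated fragments and will flag legitimate multiword English names, so it narrows down where to look rather than replacing the review.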

Despite the empirical value of visual inspection, it is better to use a more systematic search method to detect file references that are external to the ditamap. The tagged nature of the XML file format provides the needed vehicle. Text search routines that parse a file set to find and list embedded conrefs and other tags are not difficult to write. This process is ideal for automation. But even in the absence of automation, many common and inexpensive utilities such as Notepad++ or grepWin support searching through directories to find character strings such as tags. Regular expressions define the search strings generically.
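As an illustration of how little code such a search requires, here is a minimal Python sketch that walks a project directory and lists every conref and href target it finds. It rests on several assumptions: the files use the extensions shown, attribute values are enclosed in double quotes, and files are UTF-8 encoded. The function name and the regular expression are illustrative only.

```python
import re
from pathlib import Path

# Matches conref="..." or href="..." and captures the attribute name and target.
# Assumes double-quoted attribute values.
REF_PATTERN = re.compile(r'\b(conref|href)\s*=\s*"([^"]+)"')


def find_references(root_dir, suffixes=(".dita", ".xml", ".ditamap")):
    """Return {source file: [(attribute, target), ...]} for files under root_dir."""
    results = {}
    for path in sorted(Path(root_dir).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8")
            refs = REF_PATTERN.findall(text)
            if refs:
                # Report paths relative to the project root, with forward slashes.
                results[path.relative_to(root_dir).as_posix()] = refs
    return results
```

Calling find_references on the project directory returns, for each file, the targets it points to. A plain-text search like this also matches references inside comments, so a parser-based approach may be preferable where precision matters.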

Then, once the project manager creates a definitive list of all files in a project, there is the question of how to document the results. A localization bill of materials (LBOM) provides the necessary repository. An LBOM is a document that lists every component of the project required to produce a localized version of the source product. The LBOM may also contain metadata about the individual components. This type of metadata provides guidance to the translation resources about how to treat specific components.
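An LBOM can be as simple as a spreadsheet. The following sketch writes one as CSV. The column names, file names, and notes are hypothetical and chosen only to show how per-component metadata might sit alongside the file list; a real LBOM would use whatever fields the client and vendor agree on.

```python
import csv
import io

# Hypothetical column set; adapt it to what the vendor needs.
LBOM_FIELDS = ["file", "included_via", "localize", "notes"]


def write_lbom(entries, stream):
    """Write LBOM entries (dicts keyed by LBOM_FIELDS) to a stream as CSV."""
    writer = csv.DictWriter(stream, fieldnames=LBOM_FIELDS, lineterminator="\n")
    writer.writeheader()
    for entry in entries:
        writer.writerow(entry)


# Made-up entries for illustration.
entries = [
    {"file": "guide.ditamap", "included_via": "root map",
     "localize": "yes", "notes": "contains the document title"},
    {"file": "shared/warnings.dita", "included_via": "conref from install.dita",
     "localize": "yes", "notes": "not listed in the ditamap"},
    {"file": "images/logo.png", "included_via": "image in intro.dita",
     "localize": "no", "notes": "needed for the build only"},
]

buffer = io.StringIO()
write_lbom(entries, buffer)
print(buffer.getvalue())
```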

Experience has shown that client-side employees who prepare localization packages for vendors can forget or overlook components. Failure to provide all project components can negatively affect project outcomes in a variety of ways:

• Vendors estimate cost and delivery timeframes based on incomplete data because components are missing. Incomplete data results in cost overruns and delivery delay.

• Often, customers discover missing components only after receiving project deliverables and building the product. The product therefore contains deficiencies such as user interface strings or published content displayed in the source language. A product may even fail to operate correctly after extraction and translation of hard-coded strings. In the case of documentation, “functionality” refers more to the user experience. Mixed-language documentation creates a poor experience.

• Late discovery of missing components can mean that vendors must localize them quickly and with insufficient time for quality assessment. This introduces the risk of poor translation. At the very least, the necessity to enlarge the project scope late in the process introduces stress for all participants.

To be a fully comprehensive and dependable document, the LBOM for a DITA project must list every file that needs attention during the localization process. The most dependable method for ensuring LBOM completeness is to create a list of all files within the project directory to which conrefs and xrefs point. Then create a list of all files in the project directory required to build the source product. Compare the two lists. If the comparison exposes missing conref or xref files, add them to the project and list them in the LBOM.
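The comparison itself is a set difference. The sketch below assumes that reference targets have already been collected and expressed relative to the project root; in real projects they are relative to the referencing file and would need to be resolved first. It strips the fragment that identifies an element within a file, and it skips same-file references and web links. All names and sample paths are hypothetical.

```python
def strip_fragment(target):
    """Drop the '#topic/element' part so only the file path remains."""
    return target.split("#", 1)[0]


def missing_files(referenced_targets, packaged_files):
    """Return referenced files that are absent from the translation package."""
    referenced = set()
    for target in referenced_targets:
        if "://" in target:
            continue  # external web link, not a project file
        path = strip_fragment(target)
        if path:  # an empty path means a reference within the same file
            referenced.add(path)
    return sorted(referenced - set(packaged_files))


# Made-up data for illustration.
referenced = [
    "shared/warnings.dita#warnings/hot",
    "install.dita",
    "#intro/p1",
    "https://example.com/help",
]
packaged = ["intro.dita", "install.dita"]

print(missing_files(referenced, packaged))
# ['shared/warnings.dita']
```

Any file the function returns is a candidate for addition to the project and to the LBOM.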

In most cases, authors seamlessly integrate conrefed content into the calling document. This usually mandates localization of these components. However, documentation specialists might choose to exclude external references from localization. In this situation, the LBOM should clarify that the project includes nonlocalizable files required in the project for publication. Proactive clarification in the LBOM reduces the risk of confusion and query burden. For example, localization managers may decide to exclude screenshots and graphics from translation. Nevertheless, these files do need to be present in the archive to enable complete generation of the output document.

Translation of DITA-based documents will become more prevalent in the future. Many forward-looking documentation departments have already adopted DITA. Their explorations into structuring content to facilitate reuse have yielded benefits in consistency, turnaround time and flexibility. Reduction of translation costs through reaping of better matches from translation memory is another payoff. The really big payoff, though, lies in the ability to reuse translated files or DITA components as-is, without even sending them out for translation. In the past, localizers needed to learn to be wary of hard-coded strings. Today’s localizers must learn to compile complete DITA-based projects, including files accessed through conrefs and other DITA conventions, and to document them using a comprehensive LBOM.