Focus
Content localization at Teradata
Tak Takahashi has been working on internationalization and localization at Teradata San Diego as a globalization technical lead for more than 20 years. He holds a degree in architectural engineering. He is also an architect and a computer engineer certified by the Japanese government.
Tak Takahashi
The Teradata globalization team is responsible for localizing user interfaces, online help and user documentation for all our products, including client tools, database administrator tools and analytic applications, into ten languages. Some of the lessons we have learned along the way may be useful for startups.
For over ten years, the globalization team has used localization tools we developed in-house. We researched new localization technologies, including translation management system (TMS) and neural machine translation (NMT) options. We also worked with several TMS vendors and localization service providers (LSPs) to see if we could substitute our in-house solution with one of the new technologies. We have yet to find an ideal solution, so we continue to use our in-house localization tools and processes.
While commercial TMS solutions provide substantive translation and translation project management functionality, they come up short in terms of Teradata’s pre-translation requirements. We have found this to be true especially when translations are migrated from an earlier version to a newer version. Translation tools use text-matching techniques, such as exact-text match and fuzzy-text match, which differ from the techniques used by migration tools.
The migration process is different from the translation process. Migration looks at the product name and detailed software information, including the resource file path, the unique resource key and the English source text in the resource file. If these fields match in the old and new versions of the localization kit, the content is migrated.
Product_name -> File_Path -> Resource_Key -> Source_Text
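As a minimal sketch of this idea, assuming a translation memory held as an in-memory dictionary, migration can be modeled as a lookup on all four fields at once. The names ResourceKey and migrate, the file path and the sample strings are invented for illustration and are not taken from Teradata's tool.

```python
from typing import NamedTuple


class ResourceKey(NamedTuple):
    """Composite key; field names are illustrative, not Teradata's schema."""
    product_name: str
    file_path: str
    resource_key: str
    source_text: str


def migrate(old_memory: dict[ResourceKey, str],
            new_kit: list[ResourceKey]) -> tuple[dict[ResourceKey, str], list[ResourceKey]]:
    """Carry over translations whose four fields all match; return the rest."""
    migrated: dict[ResourceKey, str] = {}
    needs_translation: list[ResourceKey] = []
    for unit in new_kit:
        if unit in old_memory:
            migrated[unit] = old_memory[unit]
        else:
            needs_translation.append(unit)
    return migrated, needs_translation


# Sample data is made up for this sketch.
old_memory = {
    ResourceKey("ProductA", "ui/en.properties", "CANCEL", "Cancel"): "キャンセル",
    ResourceKey("ProductA", "ui/en.properties", "SAVE", "Save"): "保存",
}
new_kit = [
    ResourceKey("ProductA", "ui/en.properties", "CANCEL", "Cancel"),
    ResourceKey("ProductA", "ui/en.properties", "SAVE", "Save changes"),  # source text changed
]
migrated, needs_translation = migrate(old_memory, new_kit)
print(len(migrated), "migrated;", len(needs_translation), "left for translation")
```

Because the key includes the product name and file path, a short string such as "to" translated for one screen is never reused for an unrelated screen merely because the English text is the same.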
Limitations of pre-translation with text matching
To understand the limitations of pre-translation with text matching, let us look at three real-life examples from product resource files. In all examples, the text to the left of the separator (a colon or an equals sign) is the unique resource key and the text to the right is the source text for translation.
"SAVE": "Save"
"CANCEL": "Cancel"
DELETE_CONFIRM: 'Are you sure you want to delete this variable?'
These examples have a high probability of being translated by a TMS successfully and without an issue.
Now consider the following examples illustrating the complexity of translating English words and strings that do not necessarily exist in other languages:
tjsDataGridFooter2=of
calendar.to=to
calendar.of=of
OF_THE_FOLLOWING_TRUE: 'of the following are true:'
"FROM": "From"
fromDateHeader:"From"
toDateHeader:"To"
stringSearchOptionHasLabel:"Has"
gridViewHeaderStartDatePostFix:" Str Dt"
gridViewHeaderEndDatePostFix:" End Dt"
"QUARTERFORMATSTRING": "[Q]Q"
From a software internationalization design point of view, the above resource strings are poorly designed, cannot be easily localized and require reviewing the user interface (UI) to translate.
Of note are the following issues in the examples, all of which require knowledge of the UI to translate, and have a high probability of being translated incorrectly:
- Words such as “of” and “to” do not exist in many languages, or have different meanings depending on the surrounding words.
- The “of the following are true:” string is an incomplete sentence, implies the string concatenation is based on English and is difficult to understand without context.
- Abbreviations such as “Str Dt” and “End Dt” are based on English, need verification of meaning and are interface-specific.
- “From” can reference a location, an email address and so on. If this referred to an email address, a preferable string would be “Email Sender.”
- Many of these strings are specific to a single product or function and cannot be reused.
For strings translated for specific products, we need to retain these translations for future releases.
The solution for us was to design and develop a translation migration tool that we used initially for UI localization and later found useful for DITA translations.
Complex resource keys
To perform translation migration programmatically, the system must identify the unique resource key within a single resource file. In the case of Java property files, it is not difficult to identify the unique resource key and associated source text for translation, as in the example below:
validator.method.required=This field is required
validator.method.remote=Fix this field
validator.method.email=Enter a valid email address
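A minimal sketch of such a parser, in Python, might look like the following. It is deliberately simplified: the function name parse_properties is hypothetical, and the code ignores line continuations, escape sequences and the colon separator that the Java properties format also allows.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines; skip blank lines and # or ! comments.

    Simplified: no line continuations, escapes or ':' separators.
    """
    entries: dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#!":
            continue
        key, sep, value = stripped.partition("=")
        if sep:  # only keep lines that actually contain '='
            entries[key.strip()] = value.strip()
    return entries


sample = """\
# validation messages
validator.method.remote=Fix this field
validator.method.email=Enter a valid email address
"""
entries = parse_properties(sample)
for key, source_text in entries.items():
    print(key, "->", source_text)
```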
In the case of JavaScript (JS), TypeScript (TS) and JavaScript Object Notation (JSON) resource files, we may see a complex resource key. To identify a single resource string, the parser must navigate into the nested resource key, as in the following example:
"help":{
"common":{
..............
},
"channels":{
..............
},
"interaction_points":{
"id":"interaction_points",
"name":"Interaction Points",
"pages":{
..............
},
"message_groups":{
"id":"message_groups",
"title":"Message Groups",
................
To locate the "title" element, the parser must search the nested resource key from the top level to the bottom, locating each element in the following progression:
"help"->"interaction_points"->"pages"->"message_groups"->"title"
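One way to picture this traversal is a recursive walk that flattens the nested object into full key paths. The sketch below assumes plain JSON input and uses a cut-down sample based on the progression above; the function name flatten is hypothetical, and JS or TS resource files would need their own parser before this step.

```python
import json


def flatten(node: dict, path: tuple[str, ...] = ()) -> dict[tuple[str, ...], str]:
    """Walk nested objects and map each full key path to its string value."""
    flat: dict[tuple[str, ...], str] = {}
    for key, value in node.items():
        current = path + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, current))
        elif isinstance(value, str):
            flat[current] = value
    return flat


# Cut-down sample; the elided parts of the real file are not reproduced.
sample = json.loads("""
{"help": {"interaction_points": {
    "id": "interaction_points",
    "name": "Interaction Points",
    "pages": {"message_groups": {"id": "message_groups", "title": "Message Groups"}}}}}
""")
flat = flatten(sample)
title_path = ("help", "interaction_points", "pages", "message_groups", "title")
print("->".join(title_path), "=", flat[title_path])
```

Storing the whole path, rather than only the last key, keeps two different "title" entries in the same file from colliding.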
In addition to the resource key, the file path is needed to migrate content. In the localization kit, the development team creates and organizes folders and sub-folders to group features or functionalities. Under those folders, you may see the same file name, as in the following example:
..\\canonical\\ts\\en.ts
..\\admin-ui-language-pack\\ts\\en.ts
Translation migration involves managing and using the complex resource key along with the product name and resource file path in the translation memory. We developed our migration tool to handle this combination.
Translating structured elements in DITA
Our information engineering team uses IXIASOFT DITA CMS to develop and author content for user documentation and online help. In support of localization efforts, IXIASOFT DITA CMS supports the localization id (ixia_locid) as a CMS-specific user attribute for many DITA elements. The CMS generates localization ids automatically as numbers that are unique within a single DITA source file. From a technical perspective, the localization id in DITA is equivalent to the unique resource key in the UI resource file and works for translation migration.
The following example illustrates the DITA source text and extraction of a < title > element with no inline elements:
DBAs can adopt one of two strategies for managing permanent, spool, and temporary space:
- AMP level only. The per-AMP space quota is the maximum permissible AMP-level space. This is the default.
</li ixia_locid="6">
....
Below is the extraction of the < title > element:
Here is a more complicated example where the parent element has two inline elements:
<Target_Trans_New>【<wintitle "104" text=Canary Queries】の横の【<image "103"】をクリックします。</Target_Trans_New>
<ixia_locid>"102"</ixia_locid>
<DitaTag><cmd </DitaTag>
…..
<Source_Eng_New>Canary Queries</Source_Eng_New>
<Target_Trans_New>カナリー クエリー</Target_Trans_New>
<ixia_locid>"104"</ixia_locid>
<DitaTag><wintitle </DitaTag>
Validation Comments | Description |
---|---|
{✖: alert action=アラート アクション} | NMT failed to use Teradata’s preferred translation |
{✔: action=アクション} | NMT successfully used Teradata’s preferred translation |
{✖: canary=カナリー} | NMT failed to use Teradata’s preferred translation |
{✔: period=期間} | NMT successfully used Teradata’s preferred translation |
Table 1: NMT+ defined validation comments.
In this case, the parent element < cmd >(id=102) and the child element < wintitle >(id=104) are translated independently; < image >, being an image, is not translated. As the element type implies, < wintitle > is one of the UI resource strings. As such, the existing UI translation is reused during DITA pre-translation. If these inline elements are changed in the new version, for example from < wintitle > to < uicontrol >, but the inline structure remains the same in the parent element, the translation of the parent element should be migrated to the new version. This concept of structured element-based translation is critical for translation migration, as well as for pre-translation, including NMT, in DITA.
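The separation of parent and child elements can be sketched with Python's standard XML parser. Everything specific here is an assumption made for illustration: the function name extract_units, the placeholder syntax, the href value and the English source sentence (a back-translation of the Japanese above) are not taken from Teradata's DITA parser.

```python
import xml.etree.ElementTree as ET

NON_TRANSLATABLE = {"image"}  # assumption: elements with no text to translate


def extract_units(parent: ET.Element) -> dict[str, str]:
    """Return {locid: text}; inline children become placeholders in the parent."""
    units: dict[str, str] = {}
    parts = [parent.text or ""]
    for child in parent:
        locid = child.get("ixia_locid")
        parts.append(f"<{child.tag} {locid}>")  # placeholder keeps tag and id only
        if child.tag not in NON_TRANSLATABLE:
            units[locid] = "".join(child.itertext())  # child is its own unit
        parts.append(child.tail or "")
    units[parent.get("ixia_locid")] = "".join(parts)
    return units


# Hypothetical source markup for the example discussed above.
cmd = ET.fromstring(
    '<cmd ixia_locid="102">Click <image ixia_locid="103" href="icon.png"/> '
    'next to <wintitle ixia_locid="104">Canary Queries</wintitle>.</cmd>'
)
units = extract_units(cmd)
for locid, text in units.items():
    print(locid, "=", text)
```

Because the parent unit carries placeholders rather than the child text, a translator or MT engine can reorder the placeholders to suit the target language while the child translation is looked up separately.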
Figure 1: The localization process involves a number of steps.
To translate DITA structured content, we designed and developed a customized version of NMT, which we call NMT+ for DITA (NMT+). This automated tool is specifically designed to address the ordering issue encountered in structured element-based translations. It is interesting to note that the parent translation in the above example has been machine-translated while translation of the child element < wintitle > was inherited from the UI translation memory.
Using NMT+, we can specify the preferred terminology and resulting translation. Let’s look at the following example:
<Source_Eng_New>This example task describes how to create an alert action that is based on an expired logon-timeout period using a canary query.</Source_Eng_New>
<Target_Trans_New>このサンプル タスクでは、カナリー クエリーを使用して、ログオン タイムアウトの時間切れに基づくアラート アクションを作成する方法について説明します。</Target_Trans_New>
<ixia_locid>"204"</ixia_locid>
<DitaTag><shortdesc </DitaTag>
<L10N_Comment>MT{✖: alert action=アラート アクション}{✔: action=アクション}{✖: canary=カナリー}{✔: period=期間}[LSP]_translation is updated</L10N_Comment>
NMT+ translates the original source text, validates the returned translation and adds validation comments to the localization comment (L10N_Comment) for post-editing. Table 1 shows the NMT+ defined validation comments.
LSPs can review the validation comments, post-edit the translation and finalize the translation. After post-editing, LSPs add their "[LSP]_translation is updated" comments. We can then confirm that the NMT+ translation has been reviewed and finalized in post-edit.
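The validation step can be approximated as a terminology check that compares the machine translation against a glossary of preferred terms. This is only a sketch under stated assumptions: the function name validation_comments, the substring-based matching and the sample MT output are invented here, and only the comment format follows Table 1.

```python
def validation_comments(source: str, translation: str,
                        glossary: dict[str, str]) -> str:
    """Build a comment string flagging whether each preferred term was used."""
    comments = []
    lowered = source.lower()
    for term, preferred in glossary.items():
        if term.lower() not in lowered:
            continue  # term does not occur in this source segment
        mark = "✔" if preferred in translation else "✖"
        comments.append(f"{{{mark}: {term}={preferred}}}")
    return "MT" + "".join(comments)


glossary = {
    "alert action": "アラート アクション",
    "canary": "カナリー",
    "period": "期間",
}
source = "Create an alert action for a canary query."
raw_mt = "カナリア クエリのアラート アクションを作成します。"  # invented MT output
comment = validation_comments(source, raw_mt, glossary)
print(comment)
```

A plain substring check like this cannot tell inflected or partially matching terms apart, so a production check would likely need tokenization or language-specific rules.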
In developing NMT+, we learned ways in which to improve DITA translation results and how to better customize NMT. We figured out how to:
- Define and manage words that cannot be translated and establish rules for these words
- Define and manage Teradata-specific terminology
- Integrate structured element-based translations and refine pre-translation
Manual and automated translation processes
Recently, our product release cycles have changed significantly, with the advent of expedited content releases available with agile software development and on the cloud. Many products are now released with greater frequency, with monthly or weekly releases being typical. Understandably, the frequent product release cycle affected the globalization team.
Until recently, our globalization team used a manually intensive process to translate content. With the advent of frequent releases, we found that the manual process was not sustainable. To accelerate translation, the globalization team needed to develop an automated solution, which we released in late 2018.
Figure 2: The automated process that translates UI resource files and delivers the weekly language pack.
Despite the availability of the automated solution, we continue to use the manual process for many products that are released less frequently or in maintenance mode.
The localization process, whether manual or automated, involves the steps found in Figure 1.
When the content is approved and ready for localization, the development team or information engineering team creates a zipped file that contains any of the following file types: YML, TypeScript, JavaScript, JSON, Java properties, .NET, TMX, CSV, IXIA DITA or HTML.
When alerted to availability of the file, the globalization team manually unzips the file and prepares it for the cleanup process.
The source file is cleaned up before any processing is performed. For UI resource files, the end-of-record (EOR) marker is normalized to two bytes: carriage return and line feed. With consistent EORs, the parser can easily parse the source file and identify and extract the resource elements, and post-processing is easier.
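The EOR cleanup can be sketched in a few lines of Python. The function name normalize_eor is hypothetical, and the sketch assumes an ASCII-compatible encoding such as UTF-8; a UTF-16 file would need to be decoded first.

```python
import re


def normalize_eor(data: bytes) -> bytes:
    """Rewrite CRLF, lone CR and lone LF line endings as CRLF.

    Assumes an ASCII-compatible encoding such as UTF-8.
    """
    # CRLF is listed first so an existing pair is not doubled.
    return re.sub(rb"\r\n|\r|\n", b"\r\n", data)


mixed = b"a=1\nb=2\r\nc=3\rd=4"
cleaned = normalize_eor(mixed)
print(cleaned)
```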
With DITA content, formatting characters, extra spaces, tabs, carriage returns, line feeds and so on are removed from the source text, with the exception of five reserved tags. This way, only the actual content is translated.
Next comes the pre-process. The parser processes the source file, and identifies and extracts each translation element (translation unit) individually. Inline elements are identified and represented in the parent element. The translation element is then stored in the relational database, which is used as a translation memory. The DITA parser performs the same job for the DITA XML source file.
Existing translations saved in the translation memory are migrated automatically to the new version based on the product name, resource file path, resource key and source text. After migration, the globalization team identifies new and modified strings needing translation.
During the pre-translation stage, the localization tool uses SQL to search the translation memory of the relational database for English text with similar or identical text. If found, the existing translation is reused. We support exact-text match, translatable-text match and fuzzy-text match.
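The lookup can be illustrated with an in-memory SQLite table standing in for the translation memory. The table layout, the function name pre_translate, the 0.85 threshold and the sample rows are all assumptions made for this sketch. Fuzzy scoring is done in Python with difflib rather than in SQL, and translatable-text match is left out.

```python
import sqlite3
from difflib import SequenceMatcher

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tm (source_text TEXT, target_text TEXT)")
conn.executemany("INSERT INTO tm VALUES (?, ?)", [
    ("Save", "保存"),
    ("Enter a valid email address", "有効なメール アドレスを入力してください"),
])


def pre_translate(source: str, threshold: float = 0.85):
    """Return (match_type, translation), or None if nothing is close enough."""
    row = conn.execute(
        "SELECT target_text FROM tm WHERE source_text = ?", (source,)
    ).fetchone()
    if row:
        return "exact", row[0]
    # No exact hit: score every stored source string and keep the best one.
    best_score, best_target = 0.0, None
    for candidate, target in conn.execute("SELECT source_text, target_text FROM tm"):
        score = SequenceMatcher(None, source.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_score, best_target = score, target
    if best_score >= threshold:
        return "fuzzy", best_target
    return None


print(pre_translate("Save"))
print(pre_translate("Enter a valid e-mail address"))
print(pre_translate("Delete this variable"))
```

Scanning every row for the fuzzy case is fine for a sketch but would not scale to a large translation memory, where an index or a pre-filter on length or shared words would be needed.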
The MT tool translates the remaining strings that are not migrated or pre-translated.
After all translation processes have completed — including post-editing by LSPs — the globalization team runs post-processing and generates the localized resource file or DITA source file.
The localized files are then zipped and delivered to the development or information engineering teams for release.
All new translations from each translation cycle are written back to the translation memory in preparation for the next translation cycle.
We use our own database (DB) to help in the translation process. The UI resource DB on Teradata is a tool for UI developers. This database is generated using the translation memory used for translation migration. UI developers can freely access the database to search existing UI resources, and use them on any new UI screens to ensure consistency and reduce translation costs.
Until recently, each step in the translation process and each tool required a globalization team engineer to start the procedure. With the new command-line interface, this process has been automated with three scripts running at scheduled times.
Figure 2 illustrates the automated process that translates UI resource files and delivers the weekly language pack.
Looking forward
Depending on the product release and development cycle, in 2019 we plan to implement automation tools and processes for other products, including DITA content. We will further customize NMT to improve the terminology list and dictionary that contain the preferred translations for our products. Lastly, we plan to develop an automated procedure to target translation elements that require light post-editing.