Building a roadmap for Big Data TM integration

Many corporations and government entities would like to benefit from the Big Data revolution in language processing. Currently, this requires feeding large amounts of data into open or shared technology solutions. However, legal concerns about control of intellectual property — and even questions of national security — often frustrate even the most modest ambitions to “ride the Big Data wave.” Fortunately, there is a way to alleviate or eliminate these intellectual property and security concerns, opening the door to wider exploitation of high-value multilingual content.

One proposed solution is built around a new technological concept: a multilingual redaction database (MRD). The MRD would allow content creators to apply FlexData (defined later in this article) to any sensitive subsegments to allow for easy substitution with custom or generic metadata, as defined by the content creator. The content creator can also define user levels that limit how much of the content the end users can view. The context of the surrounding content would remain clear even where sensitive information has been redacted, allowing computer-aided translation (CAT) tools to be used on any source document that contains sensitive data. This approach would ultimately allow for Big Data contribution.

Sensitive content subsegments would be defined in the content authoring system and entered into an MRD. The MRD would function like other CAT features such as translation memories (TMs) and termbases. Once the source-language content has been “marked up” with the MRD entries and all access roles have been defined, the source content would be exported into an exchange format (a flavor of XML) for ingestion into a CAT environment (Figure 1).
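To make the export step concrete, here is a minimal sketch of what marking up a segment for exchange might look like. The element and attribute names (`segment`, `mrd`, `id`, `label`) are purely illustrative — the article does not define the XML flavor — and the entry ids are hypothetical.

```python
import xml.etree.ElementTree as ET

def export_marked_up_segment(text, redactions):
    """Wrap sensitive sub-segments in <mrd> elements.

    redactions: list of (substring, entry_id, label) tuples, in the
    order the substrings appear in the text.
    """
    seg = ET.Element("segment")
    cursor = 0
    last = None
    for sub, entry_id, label in redactions:
        start = text.index(sub, cursor)
        plain = text[cursor:start]
        # Plain text before the first element goes in .text;
        # text between/after elements goes in the previous .tail.
        if last is None:
            seg.text = plain
        else:
            last.tail = plain
        last = ET.SubElement(seg, "mrd", {"id": entry_id, "label": label})
        last.text = sub
        cursor = start + len(sub)
    if last is None:
        seg.text = text[cursor:]
    else:
        last.tail = text[cursor:]
    return ET.tostring(seg, encoding="unicode")

xml = export_marked_up_segment(
    "We all know and love John Smith, an accountant at ACME.",
    [("John Smith", "e1", "Name"),
     ("accountant", "e2", "Business Role"),
     ("ACME", "e3", "Employer")],
)
print(xml)
```

A CAT tool ingesting this format could then show or hide the `<mrd>` content based on the user's access rights.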

The translation team would have two options for processing this content in a CAT environment. In the first option, the translation team would have full access rights to all content, including the accompanying MRD. In the second option, the sensitive source-language MRD data would be translated separately by the team with full access rights, while the redacted content would be translated by another team with lower access rights (Figure 2). The first option allows for tight security controls, but it does not offer content authors any flexibility. The second option maintains the same tight security controls while offering content authors the flexibility to process the redacted content with a separate team for potential cost and time savings.

Sensitive areas of the source and translated content would be presented differently based on the access level of the target audience. After translation, the target document would display only redacted content with source-language MRD markup to users with lower access rights. If the MRD entries have been translated as well, users with the appropriate permissions could view the entire target document, including the sensitive information. Moreover, applying this preparatory redaction method to TMs would also allow for the creation of Big Data-ready TM corpora.

The current state of affairs

TMs are among the most valuable digital assets for the language technology industry. One of the latest trends in managing and leveraging translations is maintaining TMs in a cloud environment. Freelance linguists, language service providers (LSPs), and internal corporate or government linguistic resources upload their TMs to a freely accessible Big Data TM.

There are many potential benefits to developing Big Data TM corpora. First, the language services community would have free access to a large repository of legacy translations to leverage when translating new content. Second, instead of being limited to the existing legacy TMs produced by a limited pool of vendors, corporate and government language services would be able to pretranslate their content by leveraging a translation database produced by thousands of translators. Third, the database would be continuously updated and serve as a base for MT solution training.

However, there is resistance to developing Big Data TMs. One of the main resistance points — and a perfectly valid one — is that proprietary and sensitive data could, and most likely would, enter the public domain through TM exchange. The risk of sensitive data exposure is one of the major obstacles to forming a Big Data community that would be willing to contribute TM corpora for global leveraging.

Another key deterrent is the perceived competitive advantage gained by TM ownership. Prior to Big Data, ownership of well-maintained, domain-specific TMs was widely and rightfully accepted as giving clients and localization providers a competitive edge. Several years after the introduction of the Big Data approach, the concept of sharing linguistic data assets with the global community to achieve greater success remains a hard sell. While we are not addressing this issue in this article, it is worth noting that TAUS has played a key role in advocating the competitive advantage of shared data and services through The Human Language Project and the implementation of the Data Cloud for collecting, sharing and leveraging multilingual data within the global translation community.

To address the sensitive data exposure that impedes the progress of Big Data TM, we have developed a comprehensive concept for handling sensitive data in Big Data TM integration. As mentioned previously, our proposed solution is called a multilingual redaction database (MRD) and includes the following:

- Data protection via externalization and restricted access levels based on content sensitivity.

- Application of a linguistic action script for content preservation and usability.

- Generation of an MRD for storing sensitive data with a multi-layered taxonomy and customized user access rules based on data sensitivity.

- Introduction of the concept of revolving metadata, or FlexData.

As a by-product of this approach, it would also be possible to perform linguistic-action analysis on Big Data-ready content.


Before proceeding to the proposed method, let’s identify the main obstacles standing in the way of Big Data content integration.

First, we need to address source content generation and management. Let’s use the extreme case of redacted content.

In the case of redaction, after the original digital or hard-copy content is analyzed for sensitivity, two versions of the content are generated: the original unredacted content with the sensitive data exposed, and the redacted version of the content with the sensitive data removed or encrypted. While in certain cases encryption may provide some useful context, fully redacted data precludes the content from being useful for Big Data TM leveraging. To achieve leveraging, a clear link is required between the original and protected content. Moreover, a multilayered security solution may be required to accommodate multiple data access levels.

The current state of language technology, including CAT and machine translation, does not provide adequate support for translation and management of sensitive data, short of managing it in a classified or similarly protected environment. As a result, multilingual corpora resulting from translation of sensitive data cannot be openly shared.

Similarly, translations of redacted data — where the sensitive data is either removed or encrypted using a proprietary technology — have no context, and would not be usable for Big Data TM integration and leveraging.

Building blocks: Existing technologies

When a human translator translates in a CAT environment, the source content is segmented into smaller chunks (typically sentences) based on predefined segmentation rules. These chunks are called source segments. Source segments are translated inside the CAT environment into target-language target segments. Source and target segments are linked together into translation units and stored in a TM database for future leveraging. Once the TM is populated, the CAT environment compares the source segments of subsequent translations to the translation units stored in the TM and populates the corresponding target segments if a match is available. A key technology recently introduced in CAT solutions is sub-segmentation, which the CAT environment uses to identify smaller chunks within source segments and match them against corresponding chunks within the source segments of translation units in a TM.
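A toy sketch of sub-segment leveraging may help: when no whole-segment match exists, the environment looks for smaller source chunks that do. Splitting on commas here is a stand-in for real segmentation rules, and the TM is modeled as a plain dictionary — both are simplifications, not how any particular CAT tool works.

```python
def subsegment_matches(source_segment, tm):
    """Return the sub-segment chunks of source_segment found in the TM.

    tm: dict mapping source-language chunks to target-language chunks.
    """
    matches = {}
    # Naive chunking: split on commas and strip whitespace.
    for chunk in [c.strip() for c in source_segment.split(",")]:
        if chunk in tm:
            matches[chunk] = tm[chunk]
    return matches

tm = {"see the user manual": "consultez le manuel",
      "restart the device": "redémarrez l'appareil"}
hits = subsegment_matches("restart the device, then call support", tm)
print(hits)  # {'restart the device': "redémarrez l'appareil"}
```

Even though the full segment has no TM match, the first chunk is leveraged.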

To maximize consistency in key terminology in content authoring and CAT environments, content creators and translators rely heavily on auxiliary databases they populate and maintain. Terminology sets are often generated by content creators and translated prior to translation of the main content. When the translator works in a CAT environment, he or she automatically receives suggested translations for key terminology that the system pulls from the database. Term entries may contain additional metadata in the form of linguistic, organizational and business information.

Building blocks: New concepts

Taxonomy and metadata concepts are already widely utilized in content authoring and CAT solutions, including in TMs and termbases that utilize a strict data-metadata hierarchy. We believe the benefit of these technologies could be expanded by introducing the concept of an interchangeable metadata hierarchy, or FlexData. Unlike the usual approach, where a main entry is supported by one or more layers of metadata, FlexData would give both the main content and the metadata the ability to play either a main content or a metadata role, based on defined business rules.

For example:

“John Smith is a department manager who manages accounting.”

A conventional data structure would look like this:

   Term (main entry/name): John Smith

        Business Role (metadata layer 1): department manager

         Department (metadata layer 2): accounting

The FlexData approach would treat all data within the hierarchy equally. The data structure would be defined based on business rules such as user access rights, or could reflect a combination of such rules.

   Term (main entry/Business Role): department manager

       Name (metadata layer 1): John Smith

         Department (metadata layer 2): accounting

   Term (main entry/Department): accounting

       Name (metadata layer 1): John Smith

        Business Role (metadata layer 1): department manager     

   Term (main entry/Business Role, Department): department manager, accounting

       Name (metadata layer 1): John Smith

Unlike a strict data hierarchy, the FlexData approach allows data to be manipulated to meet specific data query needs.
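The pivoting described above can be sketched in a few lines: a flat record in which any field, or combination of fields, can be promoted to the main entry, with the remaining fields demoted to metadata. The record layout and function name are illustrative, not part of any existing termbase API.

```python
# A FlexData record: no field is privileged as "the" main entry.
record = {"Name": "John Smith",
          "Business Role": "department manager",
          "Department": "accounting"}

def pivot(record, *main_fields):
    """Promote the chosen fields to the main entry.

    Returns (main entry value, remaining metadata dict).
    """
    main = ", ".join(record[f] for f in main_fields)
    metadata = {f: v for f, v in record.items() if f not in main_fields}
    return main, metadata

print(pivot(record, "Business Role"))
# ('department manager', {'Name': 'John Smith', 'Department': 'accounting'})
print(pivot(record, "Business Role", "Department"))
# ('department manager, accounting', {'Name': 'John Smith'})
```

The same record answers queries keyed by name, role, department, or any combination, which is exactly what a strict data-metadata hierarchy cannot do.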

In addition to playing a key role in the MRD approach, FlexData can bring a new level of efficiency to conventional termbase solutions by addressing morphological variations of termbase entries. While morphological variations of terminology can be defined in the current termbase technologies, there are only two viable ways to use the information. First, add the variations to the entry’s metadata, or second, define multiple termbase entries (one for each variation). This limitation restricts termbase usability (matching/recognition) and terminology management. Introducing FlexData will enable the creation of complex termbase entries in which all required morphological variations can be defined, and each variation can be searched, transformed into a main entry and proposed as a term match.
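A minimal sketch of such a complex termbase entry follows, assuming a single entry that stores its morphological variants alongside the canonical form. The entry layout, example terms, and translation are invented for illustration.

```python
# One termbase entry holding all morphological variants of a term.
entry = {"canonical": "run",
         "variants": ["run", "runs", "running", "ran"],
         "translation": "laufen"}

def match_term(token, entries):
    """Match a token against any variant and promote it to the main form."""
    for e in entries:
        if token in e["variants"]:
            # The matched variant is surfaced as the main entry for this hit,
            # rather than forcing the user through the canonical form.
            return {"matched": token,
                    "canonical": e["canonical"],
                    "translation": e["translation"]}
    return None

hit = match_term("running", [entry])
print(hit["matched"], "->", hit["translation"])  # running -> laufen
```

One entry thus serves every inflected form, instead of one entry per variant or variants buried in unmatchable metadata.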

For multilingual content generation, we first need to address management of the original source-language data. Luckily, most of the necessary technology is already available. Currently, any well-developed content management system (CMS) provides support and functionality for a glossary or a termbase. Contemporary termbase technology allows users to store key terminology and create a multilayered cloud of taxonomy and metadata around the term entries.

By combining the FlexData approach with either existing termbase technology or a secondary, termbase-like linguistic segment database with multilingual support, a content creator should be able to set up a data entry structure that allows for all required redaction substitutions. The key difference between the proposed MRD approach and the existing termbase technology is that the MRD would not suggest a term, but rather substitute it with the appropriate data value from the MRD entry (see Figure 1).

Example: “We all know and love John Smith, an accountant at ACME from Las Vegas who joined our community two months ago.” The subsegments requiring redaction are entered as main entries.

Main Entries: John Smith, accountant, ACME, Las Vegas.

Main Entries with FlexData

   John Smith

      Name

   accountant

      Business role

   ACME

      Business affiliation

   Las Vegas

      Business location
The key to the proposed solution is to ensure that any combination of the metadata elements can be treated as the main entry. That way, the content creator can readily access the context and apply substitution markup as needed.

In order to successfully support multilayered redaction, the content authoring environment would need to provide support for sub-segmentation, MRD entry query, and automatic in-content MRD entry substitution based on set business rules.

Here is an example based on the John Smith information: After identifying all the sensitive data and creating MRD entries, a master content authoring user (a source language speaker with full data access rights) defines user types based on data access levels.

USER1 access: Original segment.

USER2 access: Name, business role, employer.

USER3 access: None.

Based on the data access rights configured for each user, the system would substitute the data elements with the appropriate FlexData values and give each user access to the content as follows:

USER1: “We all know and love John Smith, an accountant at ACME from Las Vegas who joined our community two months ago.”

USER2: “We all know and love John Smith, an accountant at ACME from {Business Location} who joined our community two months ago.”

USER3: “We all know and love {Name}, an {Business Role} at {Employer} from {Business Location} who joined our community two months ago.”

As can be seen above, regardless of the level of access, all three user types would be able to work with the content without missing the benefit of context.
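The substitution logic behind the three user views can be sketched directly, reproducing the USER1-USER3 example. The entry and access-rule representation here is a simplification invented for illustration; a real MRD would carry multilingual values and richer metadata.

```python
SEGMENT = ("We all know and love John Smith, an accountant at ACME "
           "from Las Vegas who joined our community two months ago.")

# MRD entries: (sensitive sub-segment, FlexData label shown when redacted).
ENTRIES = [("John Smith", "Name"),
           ("accountant", "Business Role"),
           ("ACME", "Employer"),
           ("Las Vegas", "Business Location")]

# Fields each user type is allowed to see in the clear.
ACCESS = {"USER1": {"Name", "Business Role", "Employer", "Business Location"},
          "USER2": {"Name", "Business Role", "Employer"},
          "USER3": set()}

def render(segment, user):
    """Substitute every field the user may not see with its FlexData label."""
    for value, field in ENTRIES:
        if field not in ACCESS[user]:
            segment = segment.replace(value, "{%s}" % field)
    return segment

print(render(SEGMENT, "USER3"))
```

USER1 gets the original segment back unchanged; USER3 sees only labels, yet the sentence still reads as a sentence, which is the point of the approach.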

When the content is ready for translation, it could be exported as an original document with redaction markup (tags) or as a tagged/marked-up interchange format (XML) with or without the MRD export as discussed above. The MRD export could be translated separately by linguists with a higher level of access (whose time is also usually more expensive) to provide further data security. Since translators would have access to contextual reference (substitutions of the redacted content segments), the overall quality and speed of translation of redacted files should improve considerably. The resulting translations of redacted content and standalone MRD entries could be readily combined and used in the content authoring environment to create target-language content that maintains multiple levels of data access (see Figure 2).

In principle, redacted/marked-up content and MRD entries could be translated using existing CAT technologies with slight customization of parsing filters. However, in order to achieve the full benefits of the proposed MRD approach, CAT technology must be further developed to support MRD. The requirements would be a hybrid of TM and termbase functionality support (see Figure 3). In addition to existing CAT technology, the solution would require:

- CAT connector/Application Programming Interface (API) to the MRD, with the ability to access and edit source- and target-language entries.

- Substitution of MRD tags with the original content after pretranslation for better contextual reference (based on each user’s access rights).

- Markup of translated redacted elements in the target-language segments and population of target-language MRD entries.

- Generation of TMs with redacted original content replaced by MRD markup.
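The last requirement above can be sketched as a post-processing step: before a translation unit enters a shareable TM, the sensitive spans on both the source and target sides are replaced with MRD markup, so the TU keeps its context without exposing data. The entry tuple format and the French example are invented for illustration.

```python
def redact_tu(source, target, mrd_entries):
    """Replace sensitive spans in both halves of a translation unit.

    mrd_entries: list of (source value, target value, FlexData field).
    """
    for src_val, tgt_val, field in mrd_entries:
        source = source.replace(src_val, "{%s}" % field)
        target = target.replace(tgt_val, "{%s}" % field)
    return source, target

src, tgt = redact_tu(
    "John Smith works at ACME.",
    "John Smith travaille chez ACME.",
    [("John Smith", "John Smith", "Name"),
     ("ACME", "ACME", "Employer")],
)
print(src)  # {Name} works at {Employer}.
print(tgt)  # {Name} travaille chez {Employer}.
```

A TU in this form can be shared for Big Data leveraging, and any user with the appropriate MRD access can later resolve the placeholders back to the real values.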

TMs produced from CAT environments utilizing this approach would address the community’s data security concerns and provide leverage to justify Big Data integration.

Further discussion of this concept may identify additional benefits that could be achieved by standardizing MRD entry structure, including substitution mark-up. The localization community will need to band together to create this technology and ride the Big Data wave.