Multilingual taxonomies

By Heather Hedden April 10, 2018

Translation and localization are needed for content, but may also be needed for access to the content, such as website or app menus and navigation. In some content-rich sites and services, navigation is extended to become a detailed set of categories. A structured set of terms or labels for categories is also called a taxonomy. Taxonomies are becoming more common, whether on websites, intranets, portals, SharePoint, content management systems or digital asset management systems. The internationalization of content and shared information makes multilingual taxonomies necessary.

While structured navigation has aspects of a taxonomy, a taxonomy is more than that. Terms from a taxonomy are used to tag or index content items in various combinations, to support retrieval from a content set much larger than what can be served by navigation alone. While navigation menu labels each link to a single landing page, taxonomy terms are tagged to multiple relevant pages or content items, so that each term retrieves a result set, rather than a single page. Each content item (HTML page, database record, digital asset) may be tagged with multiple taxonomy terms, as relevant. Thus, if a user selects a second taxonomy term, the result set can be narrowed to just those items tagged with both taxonomy terms. A taxonomy can also be structured into different sets of terms for different aspects of the content (such as content type, location, audience, source or department) which can serve as refinement filters for users. This is also known as faceted search and is commonly seen on large ecommerce websites.
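
The narrowing of a result set by a second term can be pictured as a set intersection. The following is a minimal sketch in Python, assuming a hypothetical in-memory tag index; the item IDs, the facet and term names and the function narrow are invented for illustration and do not describe any particular product.

```python
# Hypothetical tag index: each (facet, term) pair maps to the set of
# content items tagged with it. All names and IDs are illustrative.
TAG_INDEX = {
    ("content type", "Report"): {"doc1", "doc2", "doc4"},
    ("location", "Brazil"): {"doc2", "doc3", "doc4"},
    ("audience", "Researchers"): {"doc4", "doc5"},
}

ALL_ITEMS = {"doc1", "doc2", "doc3", "doc4", "doc5"}


def narrow(selected_terms):
    """Return the items tagged with every selected (facet, term) pair."""
    results = set(ALL_ITEMS)
    for key in selected_terms:
        # Each additional term can only shrink the result set.
        results &= TAG_INDEX.get(key, set())
    return results


one_term = narrow([("content type", "Report")])
two_terms = narrow([("content type", "Report"), ("location", "Brazil")])
print(sorted(one_term))
print(sorted(two_terms))
```

Selecting one term returns every item tagged with it; selecting a second term from another facet keeps only the items that carry both tags.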

Terms from a taxonomy also enable users to retrieve relevant content from a large content set with greater comprehensiveness and precision than mere keyword searching. A search feature can match text strings, not concepts. Desired content may be missed in search, if the user enters a search word that is not mentioned in the text, but a synonym for it is. Undesired content may be incorrectly retrieved, if the user enters a search word that has multiple meanings, and the content is about a different meaning of the word. A taxonomy is not merely a structured set of terms, but is also a controlled set of terms, where each term represents an unambiguous concept. Concepts often have synonyms associated with them in the taxonomy, so that a match on any synonym of a term will retrieve the same result. A multilingual taxonomy allows users to select a search term in their language and then access tagged content in a different language.
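
The difference between matching text strings and matching concepts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entry terms, concept ID and item names are hypothetical, and a real system would hold them in taxonomy and content management software rather than in dictionaries.

```python
# Hypothetical controlled vocabulary: every entry term (preferred term or
# synonym, in any language) points to one unambiguous concept ID.
ENTRY_TERMS = {
    "cars": "C1",
    "automobiles": "C1",
    "automóviles": "C1",
    "coches": "C1",
}

# Hypothetical tagging: content items in any language are tagged with
# concept IDs, not with text strings.
TAGGED_ITEMS = {
    "C1": ["en-article-17", "es-articulo-03"],
}


def retrieve(user_term):
    """Resolve a user's term to its concept and return the tagged items."""
    concept_id = ENTRY_TERMS.get(user_term.lower())
    if concept_id is None:
        return []
    return TAGGED_ITEMS.get(concept_id, [])


print(retrieve("Automobiles"))
print(retrieve("coches"))
```

Because both the English synonym and the Spanish term resolve to the same concept, they retrieve the same items, including items written in the other language.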

A taxonomy thus brings together the benefits of both browse navigation, with its organized display of terms, and a controlled vocabulary, with its synonyms. The taxonomy is structured, typically with a hierarchy of broader terms and narrower terms, which could extend several levels deep. In some cases, there may also be associative relationships to indicate related terms (see also…). This way, the user can discover topics of interest. A structure with related terms, synonyms and less of a hierarchy is often called a thesaurus instead of a taxonomy.

Multilingual taxonomy uses

Multilingual taxonomies have been developed and implemented for various purposes, including the following:

• By international organizations and agencies with a large volume of documents and content in various languages that need to be accessed by users of different languages.

• For ecommerce websites with audiences who speak different languages.

• On intranets of large multinational organizations or companies, which may have internal documents or employee profile information in different languages.

• For the indexing and retrieval of documents in library and research databases, which include periodical articles in different languages.

An example of a multilingual taxonomy at an international organization can be found in the publications section of the website of the Inter-American Development Bank (https://publications.iadb.org), which funds development projects across Latin America. Users can look for publications by browsing taxonomy topics in English, Spanish or Portuguese. The Spanish or Portuguese taxonomy is displayed when the language of the web page is switched (from the collapsed menu).

Ecommerce is perhaps the fastest growing area of multilingual taxonomies. An example is Office Depot Europe, which has a taxonomy of product categories in nine language versions across its 13 Office Depot and Viking brand regional ecommerce websites. These sites formerly had completely separate, unrelated taxonomies. The product offerings, and thus their categories, are mostly the same in all countries, as are the attributes, such as features, color, size and brand, so these have been standardized and made equivalent across languages. Languages vary between country sites, and several country sites are also bilingual, including those of Germany, www.viking.de (German and English); Belgium, www.vikingdirect.be (French and Dutch); and Switzerland, www.vikingdirekt.ch (German and French).

Libraries have been providing subject indexes to periodical articles since before the days of computers, although these indexes are now online. Users can look up subjects and names in the taxonomy (also called a thesaurus), which have been indexed to articles. One example of a library periodical article database in a different language is an electronic collection of journal articles, with user interface and search topics (thesaurus) all in Spanish, developed for use by Spanish speakers at libraries in the United States. The user interface provides the option of searching across all subscribed databases, including the Spanish-language database. The Spanish-language thesaurus, originally based on translations of terms from the English taxonomy, has been developed and maintained to reflect the content of its Spanish-language products. It also keeps a close relationship to the English-language taxonomy and, through equivalency links to the English-language terms, to a smaller Portuguese-language taxonomy. This close relationship helps to ensure that hierarchies and new term additions for all three languages are updated quickly and efficiently.

Multilingual taxonomy design

A distinction needs to be made between taxonomies with translations and truly bilingual/multilingual taxonomies. A taxonomy that merely has translations can only be browsed in one language, because the relationships and hierarchy only exist between terms of the primary language. The corresponding translated terms make it possible to retrieve content in other languages that has been indexed to those translations. Thus, the user interface side is in one language, and the indexing side may be in multiple languages. The term translations function in the direction from the indexing language into the user interface language.

A truly multilingual taxonomy is one in which each term has an equivalent term in one or more additional languages, and there are also structural relationships between the terms within each language. To the users of each language, it appears as a fully navigable taxonomy in their own language. Despite the different language versions of the taxonomy, the same shared set of content can be retrieved from any language version of the taxonomy. Term translations must be exact equivalents that go in both directions. The ability to support relationships between terms within additional language versions is a common feature of commercial taxonomy management software.

A multilingual taxonomy can be modeled in two different ways:

1. Separate language taxonomies linked together through their equivalent term translations

2. A single taxonomy based on concepts that each have displayed labels in different languages

The structure of separate taxonomy versions in different languages that are linked to each other by links between equivalent terms is a common form for a multilingual taxonomy that starts out as a taxonomy in one language and then has another language version added to it. There are two approaches to creating such a bilingual or multilingual taxonomy: (1) start with one taxonomy and add translations and links to the translations for each term in another language, or (2) start with two taxonomies in different languages on the same subject area, scope and level of detail, and then map the equivalent translation terms to each other. For terms that lack a translation, a new translation equivalent term may be added. The choice of approach depends on what taxonomies are available to start with. With either approach, it is possible that some terms will exist in one language and not in the other, when equivalent translations do not exist. This could be acceptable if there is content in one language only for the term, and there is no need for a user of the other language to access it.
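
The first model, separate language taxonomies joined by equivalence links, can be sketched as follows. This is a minimal illustration in Python, not the data model of any taxonomy tool: the Term class, the link and unlinked functions and the sample terms are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Term:
    label: str
    broader: Optional[str] = None  # label of the broader term, same language
    equivalents: Dict[str, str] = field(default_factory=dict)  # language -> label


# Two hypothetical taxonomies, one per language, each with its own hierarchy.
english = {
    "Furniture": Term("Furniture"),
    "Chairs": Term("Chairs", broader="Furniture"),
    "Standing desks": Term("Standing desks", broader="Furniture"),
}
spanish = {
    "Muebles": Term("Muebles"),
    "Sillas": Term("Sillas", broader="Muebles"),
}


def link(term_a, lang_a, term_b, lang_b):
    """Record an equivalence in both directions."""
    term_a.equivalents[lang_b] = term_b.label
    term_b.equivalents[lang_a] = term_a.label


link(english["Furniture"], "en", spanish["Muebles"], "es")
link(english["Chairs"], "en", spanish["Sillas"], "es")


def unlinked(taxonomy, target_lang):
    """List terms that have no equivalent in the target language."""
    return sorted(
        t.label for t in taxonomy.values() if target_lang not in t.equivalents
    )


print(unlinked(english, "es"))
print(unlinked(spanish, "en"))
```

A report such as the one produced by unlinked shows which terms exist in one language only, so that an editor can decide whether to add a translation or to accept the gap.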

Depending on the taxonomy management software used, the relationships between terms may be identical (mirrored) across all languages, or they may be allowed to vary. If relationships between terms are allowed to vary, hierarchical relationships ought to be the same across languages, except in cases where the term exists in one language and not in the others. Associative (related-term) relationships, if any, are usually the same but could occasionally differ for nuanced linguistic or cultural reasons of what is considered “related.” Figure 1 is an example of terms in the multilingual ILO Thesaurus, where relationships are identical in both language versions. BT here means broader term, NT means narrower term and RT means related term. However, the English has a variant (UF-used for), whereas the Spanish does not. The language indicators of FRE, SPA and EN also function as relationship links. Terms of the same relationship type are listed alphabetically.

The other model for taxonomies is based on a unified set of concepts, rather than terms. Each concept has its own unique, unambiguous meaning, and relationships are between concepts. A concept also has any number of labels, which are terms the user may see. In this model, the labels may be in different languages. A concept has a single preferred label in each language and any number of alternative labels for each language. If translation is incomplete, it is still possible to have a concept in one language and not in another language.

In this unified concept model of a taxonomy, hierarchical and associative relationships are always the same in all languages, because they are defined between concepts rather than between language-specific terms. This model has become more common with the growing adoption of SKOS (Simple Knowledge Organization System), a World Wide Web Consortium (W3C) common data model for sharing and linking knowledge organization systems (such as taxonomies or thesauri) via the Semantic Web. SKOS is based on the concept model and supports multilingual taxonomies.
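
The unified concept model can be sketched in Python as follows. The field names loosely mirror the SKOS properties prefLabel, altLabel and broader, but the Concept class, the concept IDs and the sample labels are hypothetical and serve only to illustrate the structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    concept_id: str
    pref_label: Dict[str, str] = field(default_factory=dict)  # language -> one label
    alt_labels: Dict[str, List[str]] = field(default_factory=dict)  # language -> labels
    broader: Optional[str] = None  # ID of the broader concept

    def label(self, lang):
        """Preferred label in the requested language, or None if untranslated."""
        return self.pref_label.get(lang)


concepts = {
    "c1": Concept("c1", {"en": "Buildings", "fr": "Bâtiments"}),
    "c2": Concept(
        "c2",
        {"en": "School buildings"},      # no French label yet
        {"en": ["Academic buildings"]},
        broader="c1",
    ),
}


def broader_label(concept_id, lang):
    """The hierarchy is stored once, on concepts; only labels vary by language."""
    parent_id = concepts[concept_id].broader
    return concepts[parent_id].label(lang) if parent_id else None


print(broader_label("c2", "en"))
print(broader_label("c2", "fr"))
print(concepts["c2"].label("fr"))
```

The broader relationship is recorded once and is displayed with whichever label matches the user's language, while a concept whose translation is incomplete simply has no label to show in that language.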

Figure 2 is an example of a concept in each of the English and French views from the UNESCO Thesaurus, built on the SKOS model. Alternative labels (synonyms) are listed, in italics, not just for the selected display language but for all of the languages, presented as the various labels for this one concept.

In both kinds of multilingual taxonomy models (separate but linked vs. unified concepts), the variously called synonyms, alternative labels, nonpreferred terms, entry terms and so on will vary for each term in each language. A word in one language has a different set of synonyms than its translation in another language. In a taxonomy, these are not really synonyms, but rather terms with sufficiently equivalent meaning within the scope of the taxonomy and context of the subject area to be treated and managed as the same concept for content indexing and retrieval. In Figure 2, there are more alternative labels for the Spanish concept of academic buildings.

Creating a multilingual taxonomy

Creating a multilingual taxonomy takes considerable time and effort, so, to be practical, the approach depends on what taxonomies or partial taxonomies already exist. Sometimes taxonomies already exist on the same subject matter in more than one language, and the taxonomies are mapped term-by-term to establish equivalencies. Where equivalencies don’t exist, new terms can be added, or perhaps, upon evaluation, a term turns out not to be needed and can be removed. Relationships between terms also need to be aligned.

Often, a taxonomy exists in only one language and is then translated into other languages. As previously mentioned, for a bidirectional multilingual taxonomy, the translations must be fully accurate in both directions, not just in one direction. It’s also more challenging to translate terms (single words or short phrases) without the context of a sentence, so translation is more time-consuming and costs more per word than regular document translation. Translations must be manual, not automated, and the translators should have an understanding of the role and function of taxonomies. Providing the translator with the full details for each concept in the source language (alternative labels, relationships and any scope notes) is also necessary for the translator to understand the intended meaning of each term. Terms that lack exact equivalents can present time-consuming challenges.

While it is common for organizations to build a simple taxonomy in columns in Excel and then import it into their content management system, the complexities of a multilingual thesaurus require a dedicated software tool. There are various options for taxonomy management software which support multilingual taxonomies.

Software products that model multilingual taxonomies as separate mirrored taxonomies, with equivalency links to terms across languages, include the following:

• MultiTes (www.multites.com), whose product name reflects its capability for multilingual thesauri/taxonomies, is the most affordable, at $295 for a single Windows user. MultiTes is focused on thesauri, rather than taxonomies, so there is no hierarchical view in the user interface; hierarchies appear only in generated reports. Web development kits and enterprise development kits of MultiTes are also available.

• Synaptica (www.synaptica.com) is web-based (software as a service or inside the firewall) and supports mirrored, equivalent-link multilingual taxonomies. Synaptica also has a feature to “map” two different taxonomies, which can be used in the process of linking two pre-existing language taxonomies.

• Data Harmony Thesaurus Manager, from Access Innovations (www.accessinn.com) is available in both web and client-server versions. An add-on product supports machine-aided indexing (MAI).

Software products that model taxonomies on SKOS, with a single concept and multiple language labels, include the following, which are all large-scale and web-based, either software as a service or inside the firewall:

• PoolParty (www.poolparty.biz) from the Semantic Web Company, a quickly growing company in Austria.

• Coreon (www.coreon.com) is a software product from Germany that may be of particular interest to the foreign-language community, because it offers both taxonomy management and terminology management features.

• Smartlogic’s Ontology Editor (www.smartlogic.com) is also for building taxonomies. An ontology is a more complex kind of taxonomy with customized relationships and classes of terms, which also conforms to the W3C’s Web Ontology Language (OWL) specifications for interoperability.

• TopBraid Enterprise Vocabulary Net (www.topquadrant.com) from TopQuadrant.

Most of these software products offer application programming interfaces or add-on connectors to implement the taxonomy in the content management system where terms are tagged to content items. In all cases, these software tools can export taxonomies as XML for importing into various kinds of content management systems.
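
The exact XML produced by an export depends on the product. As one plausible shape, a concept with labels in several languages could be written as SKOS RDF/XML. The sketch below uses Python's standard xml.etree.ElementTree module; the concept URI, the labels and the function to_rdf_xml are placeholders invented for this illustration, not the output of any of the tools listed above.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Use the conventional prefixes instead of auto-generated ones.
ET.register_namespace("rdf", RDF)
ET.register_namespace("skos", SKOS)

# Hypothetical concept data; the URI and labels are placeholders.
concept = {
    "uri": "http://example.org/taxonomy/c1",
    "labels": {"en": "Buildings", "fr": "Bâtiments", "es": "Edificios"},
}


def to_rdf_xml(item):
    """Serialize one concept with a preferred label per language."""
    root = ET.Element(f"{{{RDF}}}RDF")
    node = ET.SubElement(
        root, f"{{{SKOS}}}Concept", {f"{{{RDF}}}about": item["uri"]}
    )
    for lang, text in sorted(item["labels"].items()):
        label = ET.SubElement(node, f"{{{SKOS}}}prefLabel", {XML_LANG: lang})
        label.text = text
    return ET.tostring(root, encoding="unicode")


xml_text = to_rdf_xml(concept)
print(xml_text)
```

Each preferred label carries an xml:lang attribute, which is how a receiving system can tell the language versions of the same concept apart.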

Maintenance of multilingual taxonomies

A common problem with any taxonomy and particularly with multilingual taxonomies is long-term maintenance. Taxonomies and their various language versions may be created, as projects, by contractors, consultants or freelancers, who are not needed and thus not available for the low-level ongoing maintenance of a taxonomy. A taxonomy, unlike a document, is not finished once it has been created and translated. A taxonomy needs to represent the body of content it is indexed to and serve the needs of its users. Over time, the body of content will change, as new content items get added and others removed, and even the users may change, if additional markets are added. The unified concept (SKOS) model for a multilingual taxonomy is easier to maintain across languages than separate linked taxonomies. Changes or deletions in concepts or their relationships extend to the other language versions. If a new concept is added, however, labels in the additional languages must also be created.
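
A routine maintenance check for the unified concept model is to list concepts that still lack a label in one of the supported languages. The following Python sketch assumes a hypothetical store of preferred labels keyed by concept ID; the data, the language list and the function missing_labels are illustrative only.

```python
# Hypothetical concept store: concept ID -> preferred label per language.
PREF_LABELS = {
    "c1": {"en": "Office chairs", "de": "Bürostühle", "fr": "Chaises de bureau"},
    "c2": {"en": "Standing desks", "de": "Stehpulte"},
    "c3": {"en": "Monitor arms"},  # newly added concept, not yet translated
}

REQUIRED_LANGUAGES = ["en", "de", "fr"]


def missing_labels(pref_labels, required):
    """Map each concept ID to the required languages it has no label for."""
    gaps = {}
    for concept_id, labels in pref_labels.items():
        absent = [lang for lang in required if lang not in labels]
        if absent:
            gaps[concept_id] = absent
    return gaps


gaps = missing_labels(PREF_LABELS, REQUIRED_LANGUAGES)
for concept_id, langs in gaps.items():
    print(concept_id, "needs labels in:", ", ".join(langs))
```

Run after each batch of additions, a report like this turns the reminder to create additional language labels into a checklist for the translator.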

Once a taxonomy is created, it can be used to index/tag content items either manually, in an automated manner, or with a combination of automated indexing with human review (machine-aided indexing). Indexers can be speakers of any single language of a multilingual taxonomy, but they should know the language of the documents they are indexing. The indexing task can contribute to taxonomy maintenance by revealing the need for adding new terms, merging similar terms, better linking of missing terms and better defining of misused terms.

In brief

Creating a multilingual taxonomy is not easy, but the benefits of its implementation can be significant. Challenges include the need for fully bidirectional translations and translations of terms out of context, finding translators who understand the role and usage of taxonomies, and having a linguistic resource available for the low-level maintenance of the taxonomy. These challenges are more intellectual than technical, and there are translators and taxonomists who welcome such challenges. Translating a document extends its access; translating a taxonomy goes much further, by extending the access to many documents and other digital content.