Terminology as a knowledge asset

At the tekom conference in Wiesbaden last November, Jochen Hummel delivered a presentation about a new system being developed by Coreon GmbH for managing terminologies and other linguistic assets. Branded with the tagline “knowledge meets language,” Coreon will “bridge the gap between the language and the knowledge world.” The way to achieve this is to manage terminology databases (termbases) and taxonomies in one system, to infuse existing “flat” termbases with hierarchical semantic relations. Coreon calls the result a “knowledge database” or knowledge base.

I applaud any initiative that recognizes terminology data as a knowledge asset. Provided that it is properly structured, terminology data is indeed a knowledge asset in a most discrete and repurposable form. Bridging so-called flat terminology and structured concept models is a welcome approach in the multilingual communication field. Yet for far too long, terminology management systems designed for computer-assisted translation, which have dominated the landscape of terminology tools, have not bought into this concept. The functionality necessary to produce a hierarchically structured knowledge base is missing from these systems, even though for years some users, myself included, have been advocating for extending translation-oriented termbases into knowledge-rich repositories.

While the tekom presentation generated some enthusiasm, the concept of structuring terminology data in the form of a knowledge base is not new. Over two decades ago, Ingrid Meyer, Juan Sager, Eugenia Knops and Gregor Thurmair predicted that applications beyond translation would benefit from richly structured terminological resources. With this prospect on the horizon, models, methods and formalisms for structuring terminology in the form of a knowledge base have been the focus of much research and development. In 1994, Ingrid Meyer and Douglas Skuce coined the term terminological knowledge base, and developed the management tool CODE. Indeed, relations between terminology and knowledge engineering have been recognized as far back as 1991, if not earlier.

For a database that structures linguistic representations of knowledge, the term terminological knowledge base seems more suitable to me than simply knowledge base, which can have a broader interpretation. A knowledge base is a technology used to store complex structured and unstructured information used by a computer system. As such, a knowledge base can store various kinds of data including nonlinguistic representations such as formulae and numbers. In a corporate environment, a knowledge base could, for example, include various company assets such as manuals, procedures, policies, best practices, reusable designs and code.

What is the difference between a run-of-the-mill termbase and a terminological knowledge base? In discussing the notion of a knowledge base, Coreon refers to “knowledge that cannot be put into rows and columns.” This type of knowledge contrasts with “flat” knowledge, which is usually represented in tabular format. It is well known that spreadsheets, a tabular format, are still a popular medium for storing, accessing and exchanging terminologies for translation purposes. Even if you use a proper terminology management system, chances are that Excel or CSV is one of the main import and export formats. This fact alone reveals that the data in the system, no matter how detailed and structured it might appear, is still essentially “flat.” The types of terminological data that are not easily accommodated in tabular formats include links, or relations, between data structures in the system, such as a link that relates two or more concept entries or terms. Furthermore, these relations can be broken down into various types: hierarchical or not, generic (is-a), meronymic (or partitive), associative, cause/effect, agent/patient and so forth. Other hierarchical structures such as multi-level subject field taxonomies and subsetting categories, as well as conditional dependency relationships, are equally difficult if not impossible to represent as flat, linear information.
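The contrast between a flat row and a concept entry with typed relations can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `Concept` class, the sample entries and both helper functions are hypothetical and do not reflect the data model of any product mentioned here.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str
    terms: dict[str, list[str]]  # language code -> terms naming the concept
    # Typed links to other entries: (relation type, target concept id)
    relations: list[tuple[str, str]] = field(default_factory=list)


# Hypothetical sample entries, for illustration only.
concepts = {
    "C1": Concept("C1", {"en": ["vehicle"], "de": ["Fahrzeug"]}),
    "C2": Concept("C2", {"en": ["car", "automobile"], "de": ["Auto"]},
                  [("is-a", "C1")]),
    "C3": Concept("C3", {"en": ["engine"], "de": ["Motor"]},
                  [("part-of", "C2")]),
}


def to_flat_rows(concepts):
    """One row per entry, as in a spreadsheet export; relations have no column."""
    return [
        {"id": c.concept_id,
         "en": "|".join(c.terms.get("en", [])),
         "de": "|".join(c.terms.get("de", []))}
        for c in concepts.values()
    ]


def broader_chain(concepts, concept_id):
    """Follow generic (is-a) links upward and return the ancestor ids."""
    chain, current, seen = [], concept_id, {concept_id}
    while True:
        parents = [target for rel, target in concepts[current].relations
                   if rel == "is-a" and target not in seen]
        if not parents:
            return chain
        current = parents[0]  # this sketch follows only the first parent
        seen.add(current)
        chain.append(current)
```

The flat rows keep the terms in each language but carry no trace of the is-a or part-of links, whereas `broader_chain` can only work because those links are stored with the entries.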

We should clarify the meaning of “tabular” format intended here. What we are referring to are spreadsheets and their various representations (tab-delimited, comma-separated and so on), where all the information relating to a specific terminological record or entry is contained in one row. This should not be confused with the tables in a relational database. Provided that it is appropriately modelled with a sufficient number of interrelated tables, a relational database is a powerful architecture for developing a terminological knowledge base. Indeed, IBM’s termbase is built on DB2, a database management system, and it contains many features of a full-fledged knowledge base. On the other hand, a terminology management system may claim to be robust because it uses a relational database, and yet there is only one table in the model. Such systems — and they do exist — are no better than spreadsheets.
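To illustrate what “a sufficient number of interrelated tables” might look like, here is a minimal sketch using Python's built-in sqlite3 module. The three-table schema, the column names and the sample data are assumptions made for this example; they are not the design of IBM's termbase or of any other system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (id INTEGER PRIMARY KEY, definition TEXT);
CREATE TABLE term (
    id INTEGER PRIMARY KEY,
    concept_id INTEGER NOT NULL REFERENCES concept(id),
    lang TEXT NOT NULL,
    term_text TEXT NOT NULL
);
-- Typed links between concepts: the part a single table cannot hold.
CREATE TABLE relation (
    source_id INTEGER NOT NULL REFERENCES concept(id),
    target_id INTEGER NOT NULL REFERENCES concept(id),
    rel_type TEXT NOT NULL
);
""")

# Hypothetical sample data, for illustration only.
conn.executemany("INSERT INTO concept VALUES (?, ?)", [
    (1, "means of transport"),
    (2, "road vehicle with four wheels"),
    (3, "car powered by a battery"),
])
conn.executemany(
    "INSERT INTO term (concept_id, lang, term_text) VALUES (?, ?, ?)", [
        (1, "en", "vehicle"),
        (2, "en", "car"),
        (2, "en", "automobile"),
        (2, "de", "Auto"),
        (3, "en", "electric car"),
    ])
conn.executemany("INSERT INTO relation VALUES (?, ?, ?)", [
    (2, 1, "is-a"),
    (3, 2, "is-a"),
])


def broader_terms(conn, term_text, lang):
    """Return terms (same language) for every concept above the given term."""
    rows = conn.execute("""
        WITH RECURSIVE ancestors(id) AS (
            SELECT r.target_id
            FROM relation r JOIN term t ON t.concept_id = r.source_id
            WHERE t.term_text = ? AND t.lang = ? AND r.rel_type = 'is-a'
            UNION
            SELECT r.target_id
            FROM relation r JOIN ancestors a ON r.source_id = a.id
            WHERE r.rel_type = 'is-a'
        )
        SELECT t.term_text
        FROM term t JOIN ancestors a ON t.concept_id = a.id
        WHERE t.lang = ?
        ORDER BY t.term_text
    """, (term_text, lang, lang)).fetchall()
    return [row[0] for row in rows]
```

With a single table, the `relation` rows would have nowhere to live, and a question such as “what are the broader concepts of electric car?” could not be answered from the data.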

Understandably, terminology management systems that are developed for use in computer-assisted translation focus on delivering functions needed by translators and tend to neglect more sophisticated features. Translators typically do not need information about how one concept relates to another or how concepts are organized into a semantic network. They simply want to know the translation of a specific term. Likewise, terminology management systems that are developed for controlled authoring applications focus on the needs of content producers, such as ranking synonyms according to preference and providing usage notes. Such systems are adequate for their intended purpose. But they lack features for developing a terminological knowledge base or even, for that matter, any terminology resource that is multipurpose. In spite of this fact, I have seen cases where application-specific terminology tools are marketed as a solution for enterprise-level terminology management, and this is where the trouble starts. Enterprise-level terminology management is knowledge management, or at least it should be. It needs to take a broader approach to produce multipurpose linguistic resources. That is why I have always advocated against using application-specific terminology tools for developing and managing terminology as a knowledge resource at an enterprise or organizational scale; they limit the return on investment of termbases by constraining their repurposing potential.

Some 15 years ago, one language technology company, Interverbum, noticed a void in the terminology tools market: not a single tool was available that was designed specifically for developing an enterprise-scale terminological knowledge base that could equally serve multiple purposes. The company proceeded to develop TermWeb, the first platform-independent terminology management system with an open architecture allowing integration with any existing enterprise application. But what was also novel was its support of bidirectional hierarchical concept relations of various kinds (generic, meronymic, associative); multilevel subject-field taxonomies (Figure 1); granular subsetting mechanisms; a comprehensive array of unique identifiers that allows efficient management of synonyms and homonyms; and visual rendering of concept systems, sometimes called a concept map (Figure 2).

These functions form the core of a knowledge base of the kind Coreon is now talking about. Interestingly, TermWeb supports Excel as an import/export format, but here, the hierarchical knowledge relations are filtered out; they are not “flat” and therefore cannot be represented in tabular formats. To preserve the hierarchical knowledge relations, TermWeb supports another, more robust format for importing and exporting, TermBase eXchange. According to its marketing material, Coreon is following suit.

Coreon’s value proposition is based on the assumption that existing multilingual termbases are simplistic and flat. While that is certainly true for most of them, there are some notable exceptions. One must be clear that multilingual terminological knowledge bases do already exist. The early ones were developed in research settings, such as Ecolexicon (University of Granada) and DiCoInfo (University of Montreal). HowNet (www.keenage.com) is a Chinese/English terminological knowledge base developed over the past ten years that comprises in excess of 70,000 concepts. AGROVOC (aims.fao.org/standards/agrovoc) is the multilingual thesaurus of the United Nations Food and Agriculture Organization, containing over 32,000 hierarchically related concepts. EuroVoc (http://eurovoc.europa.eu/drupal/) is a multilingual ontology-based thesaurus covering the activities of EU institutions, with nearly 7,000 interrelated concept entries. SNOMED-CT (www.ihtsdo.org/) is a comprehensive, systematically organized collection of medical terms. It comprises over 300,000 concepts, interconnected with over 1.3 million links of various kinds, and is available in five languages with several other language versions in progress. In the private sector, both Microsoft and IBM, to name just two, developed termbases that include hierarchical concept relations. IBM has long been repurposing its termbase in various applications beyond translation, thanks to a solid design that enabled it to do so. Other global companies are no doubt contemplating the same.

Why would an organization develop a terminological knowledge base? What purposes does it serve? What are the benefits? In a nutshell, terminological knowledge bases facilitate global communication, content management and access to information by increasing semantic interoperability. For example, EuroVoc (Figure 3) enables more accurate documentary searches by standardizing the vocabularies used to index documents. IBM experimented with using synsets from its termbase to extend search queries beyond word forms to actual concepts. A hierarchical series of concepts can act as traversable nodes in a faceted search. SNOMED is a vital component for safe and effective communication and reuse of meaningful health information in over 50 countries. A terminological knowledge base is also a knowledge acquisition tool, a concise electronic encyclopedia, which new employees, vendors and business partners can use to quickly get up to speed, and seasoned employees can use to increase their productivity. This gives a company more flexibility in making strategic business decisions that can deliver significant economic and competitive benefits.
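The idea of extending a search query from a word form to a concept, and then down a concept hierarchy, can be sketched in a few lines. This is a hypothetical illustration, not IBM's implementation: the `synsets` and `narrower` tables and the `expand_query` function are invented for the example.

```python
# Hypothetical data: each concept id maps to its set of synonyms (synset),
# and to the ids of its narrower (more specific) concepts.
synsets = {
    "C1": ["vehicle"],
    "C2": ["car", "automobile", "motor car"],
    "C3": ["electric car", "electric vehicle"],
}
narrower = {"C1": ["C2"], "C2": ["C3"]}


def expand_query(word, include_narrower=False):
    """Replace a search word with all terms for the concept(s) it names.

    With include_narrower=True, terms of more specific concepts are added
    too, by walking down the hierarchy.
    """
    word = word.lower()
    pending = [cid for cid, terms in synsets.items() if word in terms]
    if not pending:
        return [word]  # unknown word: search for it unchanged
    visited, expanded = set(), set()
    while pending:
        cid = pending.pop()
        if cid in visited:
            continue
        visited.add(cid)
        expanded.update(synsets.get(cid, []))
        if include_narrower:
            pending.extend(narrower.get(cid, []))
    return sorted(expanded)
```

Under these assumptions, a search for “car” would also retrieve documents that use “automobile” or “motor car,” and optionally those about electric cars, which is only possible because synonyms and hierarchical relations are stored together in the termbase.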

I would also like to stress that a terminological knowledge base need not necessarily contain only “terms.” On the contrary, it can and should include linguistic units of various kinds in addition to industry-specific terms: proper nouns such as names of products and their properties, marketing and branding concepts, and even some expressions from so-called general language. A terminological knowledge base is also the ideal repository for a product classification scheme and other enterprise taxonomies. At a recent webinar hosted by SDL, I suggested that the word terminology is too narrow for what is actually managed in commercial termbases. Driven by business needs, these types of termbases can and often do include any piece of text that is shorter than a sentence and is likely to be repeated in company materials. Some termbases even include sentences, but I maintain that, aside from a few specific exceptions, this is not a good idea. Upon further reflection, perhaps linguistic knowledge base or conceptual knowledge base are more suitable names for a technology that manages language-based units of knowledge. As the boundaries of what we have conveniently been calling terminology and terminology management shift to address new opportunities brought about by technical innovation, we may also need to rebrand those notions themselves.