Knowledge-aware terminology databases can help translators and improve AI
Francois Massion is managing director of Dokumentation ohne Grenzen (D.O.G) GmbH. He is a visiting professor at Beijing Foreign Studies University and has a teaching assignment in the field of AI and NLU at Shanghai Foreign Studies University. He is a member of the advisory board of the Deutsches Institut für Terminologie.
Although the term artificial intelligence (AI) was coined more than 60 years ago at a Dartmouth conference in 1956, it has only recently reached the radar screen of translators, interpreters or language workers. Since then, most language professionals have been struggling to assess which effects AI will have on their professional lives.
It is too early to measure the full extent of this transformation, but we know already that the impact of AI will be twofold. On the one hand, AI will revolutionize the work of language professionals by taking over some of their tasks and assisting them in many others. On the other hand, it will create new demands and service opportunities. Translators and other language specialists will be able to augment AI systems with their unique knowledge and skills. This will be particularly the case in the field of intelligent terminologies, aka ontoterminologies and knowledge-rich terminologies, which have become part of the tool landscape. Intelligent terminologies belong to the family of augmented translation technologies that are inspired or driven by artificial intelligence. They model knowledge by creating conceptual networks using relationships. These intelligent terminologies are of great benefit when it comes to discovering knowledge hidden in documents and translations.
Terminology is an integral part of the work of translators or interpreters. To a great extent the challenge of translation amounts to understanding the meaning of special terms and finding their equivalents in the target language. Terms can be ambiguous, erroneous or have no match in the target language. The reasons can be multiple: a concept does not exist in the target language (take for example the Japanese “capsule hotels,” unknown in many other cultures) or one language structures the reality differently, giving the translator several translation alternatives depending on the context. This is the case with the French verb télécharger, which can be translated either as upload or download. As a result, translators or interpreters spend much of their time researching specific terms and their equivalents. Many organizations or companies are therefore building multilingual terminologies to improve communication and support the translation process.
Today, terminology repositories managed by language service providers or language departments are concept-based, meaning they start from an abstract concept and collect, for each language, all terms (words, abbreviations or phrases) that describe this concept. The theoretical foundation of this approach is the semiotic triangle of reference, as first published by Ogden and Richards in their book The Meaning of Meaning (1923), a book that still deserves to be read today. This general approach is shared by most terminologists and is well documented in multiple standards such as ISO 704:2009.
Over the years, various schools of thought have criticized the shortcomings of the semiotic triangle from different perspectives, pointing to aspects of cognition, intention and communication. However, they failed to deliver a pragmatic terminology model which could be used by practitioners in their daily work. Indeed, the semiotic triangle does not offer the flexibility to describe “soft” or variable features of a concept, such as the communication situation, the context, the objectives, the experience, the culture and so on. The semiotic triangle itself is static and represents a frozen definition of a concept. This is why terminology entries do not always reflect the specific situation in which a term is used. It is not uncommon for translators to reject a translation suggested by their terminology tool because it does not fit the context, or for quality assurance technologies to report terminology errors in a translation even if the term used by the translator is actually correct (the so-called “false positives”).
Most terminology databases used today do not provide a mechanism to respond to different usage situations. The generally accepted understanding of concept-building among terminologists is that concepts have one definition, and this definition summarizes the key features of the concept. However, realistically, there is more than one way to define a concept. First of all, reality is not perceived the same way by everyone: culture, language and individual experience play an important role here. The purpose of terminology work is also important. While in many cases, terminology entries decontextualize concepts and formulate definitions that are valid for as many users as possible, others pursue specific legitimate goals when they select and define terms. As an example, you can look at a bike as a means of transportation, as a tool to improve your fitness or as a product that you sell to customers. Depending on your intention, the elements of your definition and the equivalents in other languages will vary.
Another factor that influences the use of a concept and its definition is its degree of granularity, or in other words, its degree of precision. In some languages, a concept is more finely structured than in others. An example of this is the classification of cars in the US and in the EU (compare the US “subcompact car” and the European “B-segment small cars”).
Thus, we have to consider a concept as a generalization for a wide range of possible uses for an object or an idea. This is a good starting point, but in the end what matters is how terminology can help us to understand terms in real situations. To do so, we need additional information beyond the definition.
What are the possibilities? It is of course possible to explain the representative situations in which the term can be used in a comment field, but this kind of information is only available to humans and cannot be effectively processed by software applications. The first option is to work with attribute fields in which standard values represent typical usage contexts of a term. These can be the type of documentation (such as marketing material, user interface, legal documents), the department of a company (sales, production, development) or the intended target audience (such as government, end user, medical specialist), to mention a few. These attributes work well, but also have their limitations.
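As a minimal sketch of this attribute-based option, the Python snippet below stores usage attributes with each term and filters terms by context. The class `TermEntry`, the function `terms_for_context`, the attribute name `doc_type` and the sample values are assumptions made for illustration; they are not taken from any particular terminology tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TermEntry:
    """One term of a concept, with optional usage attributes (hypothetical model)."""
    term: str
    language: str
    # attribute name -> standard values for which the term is appropriate
    usage: Dict[str, Set[str]] = field(default_factory=dict)


def terms_for_context(entries: List[TermEntry], **context: str) -> List[TermEntry]:
    """Return the entries compatible with every requested attribute value.

    An entry that records no values for an attribute is treated as unrestricted.
    """
    matches = []
    for entry in entries:
        if all(value in entry.usage.get(name, {value}) for name, value in context.items()):
            matches.append(entry)
    return matches


# Invented sample data: two English terms for the same concept.
entries = [
    TermEntry("bike", "en", {"doc_type": {"marketing material", "user interface"}}),
    TermEntry("bicycle", "en", {"doc_type": {"legal documents", "marketing material"}}),
]

legal_terms = [e.term for e in terms_for_context(entries, doc_type="legal documents")]
print(legal_terms)
```

Because the attribute values are a closed list, software can evaluate them directly, which a free-text comment field does not allow.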
The good news is that there are more options. The central element of intelligent terminologies is the relations they use to connect concepts. The idea is that people do not understand terms in an isolated way, but always together with other terms. The meaning of a concept depends to a large extent on the context in which it occurs. This is a phenomenon with which machine translation algorithms are struggling because they use statistics and favor the rule of the largest number. They select the most frequent meaning or translation that yields the highest value in an algorithm, while a less frequent word or meaning may be the best option in a specific context.
[Figures: Relation graphs for entry 1 (“income”, German “Einkommen”) and entry 2 (“income”, German “Ertrag”). The graphs differentiate the two entries, which share the same term in English.]
For example, try to translate the word container using this definition: “an object for holding or transporting something.” Unless you see it or have a detailed description of the situation, you have no chance of knowing exactly what a container is. It could be a box used to transport a few books, a larger box used to ship goods overseas, or a receptacle for a liquid. Depending on this, the translation will be very different. But if you see it associated with other terms (as in: “Place the container of dough on the table and pour a cup of water into the glass.”) you will probably find the right translation.
This phenomenon has been studied for almost a century by scientists from the cognitive sciences, linguistics, computer science and the neurosciences. “You shall know a word by the company it keeps,” as English linguist J. R. Firth noted. More recently, neuroscience publications have described how semantic networks are built in the brain, shaped by the cognitive experience of individuals. There is even a “semantic atlas of the brain,” which is the result of research work published in 2016 by a team of UC Berkeley researchers and is available online. In the field of cognitive linguistics, Charles J. Fillmore developed his theory of frame semantics in the 1970s by modeling frames as a recurring use of related terms.
Intelligent terminologies are organized as collections of concepts that are interconnected through relations. Different types of relations are used, such as hierarchical relations, part-whole relations and associative relations. The type of relations depends on the subject matter. A medical scientist will need other relations than an automotive engineer. In general, knowledge-based terminology systems will display a concept map with one concept in the core and related concepts around it (see Relation graphs). Translators or interpreters can use this information to visualize the context of the concept they are trying to understand and thus get valuable additional input.
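A concept map of this kind can be sketched as a small labeled graph. The Python example below is an illustration only: the class `ConceptGraph`, the relation names in `INVERSE` and the automotive sample concepts are assumptions, not the data model of an existing product.

```python
from collections import defaultdict

# Hypothetical relation types and the label used when reading them backwards.
INVERSE = {
    "is_a": "has_subtype",                 # hierarchical relation
    "part_of": "has_part",                 # part-whole relation
    "associated_with": "associated_with",  # associative relation
}


class ConceptGraph:
    def __init__(self):
        # concept -> list of (relation, other concept)
        self._edges = defaultdict(list)

    def relate(self, source, relation, target):
        """Store a relation and its inverse so both concepts can be the core."""
        if relation not in INVERSE:
            raise ValueError(f"unknown relation type: {relation}")
        self._edges[source].append((relation, target))
        self._edges[target].append((INVERSE[relation], source))

    def concept_map(self, core):
        """Return the concepts around `core`, grouped by relation type."""
        grouped = defaultdict(list)
        for relation, other in self._edges.get(core, []):
            grouped[relation].append(other)
        return dict(grouped)


graph = ConceptGraph()
graph.relate("car", "is_a", "vehicle")
graph.relate("engine", "part_of", "car")
graph.relate("car", "associated_with", "driver")

car_map = graph.concept_map("car")
print(car_map)
```

Storing the inverse of every relation is a design choice that lets any concept serve as the core of the map without a second lookup structure.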
As far as multilingual terminologies are concerned, relations between terms and concepts can be very helpful to identify the right translation for terms that have more than one equivalent in the target language. For example, the English term fish has two translations in Spanish: pescado and pez, depending on whether the fish is a dish on your plate, or is alive. Each usage situation of the Spanish translation can be linked to a different concept frame like restaurant, plate and waiter in the one case and sea, water and swarm in the other case. This is not only very useful for translators, but also for quality assurance technologies that can automatically interpret these relations and report a wrong translation in a specific context.
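The fish example can be sketched in a few lines of Python. The frames below reuse the words named in the paragraph above; the function names `choose_translation` and `is_suspicious`, the simple overlap score and the tokenization are assumptions chosen for brevity, and a real quality assurance tool would use a richer model.

```python
# Concept frames for the two Spanish equivalents of "fish".
FRAMES = {
    "pescado": {"restaurant", "plate", "waiter"},
    "pez": {"sea", "water", "swarm"},
}


def choose_translation(context_words, frames=FRAMES):
    """Pick the equivalent whose frame shares the most words with the context.

    Returns None when no frame matches or when the result is a tie.
    """
    context = {word.lower().strip(".,;:!?") for word in context_words}
    scores = {target: len(frame & context) for target, frame in frames.items()}
    best = max(scores.values())
    winners = [target for target, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


def is_suspicious(context_words, used_translation, frames=FRAMES):
    """Flag a translation only when the context clearly points to another one."""
    expected = choose_translation(context_words, frames)
    return expected is not None and expected != used_translation


sentence = "The waiter put the fish on my plate.".split()
print(choose_translation(sentence))
print(is_suspicious(sentence, "pez"))
```

Returning `None` for ties and empty matches keeps the check conservative, which helps to avoid the false positives that plain term lists tend to produce.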
It is a challenging task to build semantic relations between concepts as this requires time and in-depth domain expertise. One way is, of course, to have subject specialists use their personal knowledge to connect concepts one by one. This approach is the best in terms of quality, because the knowledge modeled in the terminology database has been hand-picked and validated on-the-fly by specialists. However, this can be very time-consuming and requires a high investment. Companies or organizations with a large number of concepts can combine different methods to achieve the same result in a more affordable time frame and budget.
Alternatively, they can use different tools and methods of natural language processing (NLP) and AI to identify terms that are used together in the same context. Machine learning algorithms analyze word embeddings, which are vector representations of words in context. With this type of information, they discover semantic relations between concepts, such as words that influence each other, have a similar meaning or behave in the same way. Co-occurrence matrices are also used in NLP to identify related words.
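To illustrate the co-occurrence idea only (word embeddings are not shown), the Python sketch below counts how often two words appear in the same sentence. The four-sentence corpus is invented, the sentence is assumed to be the context window, and the names `cooccurrence_counts` and `related_terms` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count unordered word pairs that occur together in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sorted(set(sentence.lower().split()))
        for first, second in combinations(tokens, 2):
            counts[(first, second)] += 1
    return counts


def related_terms(term, counts, min_count=2):
    """Return words that co-occur with `term` at least `min_count` times."""
    related = {}
    for (first, second), n in counts.items():
        if n >= min_count and term in (first, second):
            related[second if first == term else first] = n
    return related


# Invented mini-corpus, already reduced to content words.
corpus = [
    "waiter served fish",
    "waiter brought fish plate",
    "fish swim sea",
    "fish swarm sea",
]

counts = cooccurrence_counts(corpus)
print(related_terms("fish", counts))
```

Candidate relations found this way are statistical suggestions; a subject specialist would still need to validate them and assign a relation type.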
Four successive steps are required to build intelligent terminologies:
1. In the beginning there is a collection of terms extracted from reference documents.
2. These terms are then merged into concepts. For example, software and application are merged into the common concept of an object used to “(instruct) a computer to do specific tasks” (www.techopedia.com).
3. The concepts are enriched with additional information and metadata — a definition, an illustration, a status or usage attributes.
4. The concepts are linked together according to predefined relation categories. Hierarchical categories usually reflect some sort of classification or taxonomy.
Typical usage contexts can be modeled either with the help of attributes or of relations between the term and other concepts as is the case in situations where more than one translation is possible.
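The four steps listed above can be sketched as a small Python pipeline. Everything in it is an assumption made for illustration: the `Concept` class, the identifiers `C1` and `C2`, the hand-written grouping in `term_to_concept` and the relation categories. In practice, term extraction and grouping would rely on dedicated tools and on terminologists.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str
    terms: dict = field(default_factory=dict)      # language -> list of terms
    metadata: dict = field(default_factory=dict)   # definition, status, ...
    relations: list = field(default_factory=list)  # (relation, target concept id)


# Step 1: terms extracted from reference documents (invented sample).
extracted = [("software", "en"), ("application", "en"), ("computer", "en")]

# Step 2: merge terms into concepts (grouping decided by hand here).
term_to_concept = {"software": "C1", "application": "C1", "computer": "C2"}
concepts = {}
for term, language in extracted:
    concept_id = term_to_concept[term]
    concept = concepts.setdefault(concept_id, Concept(concept_id))
    concept.terms.setdefault(language, []).append(term)

# Step 3: enrich the concepts with a definition and metadata.
concepts["C1"].metadata["definition"] = "(instruct) a computer to do specific tasks"
concepts["C1"].metadata["status"] = "approved"

# Step 4: link concepts using predefined relation categories.
RELATION_CATEGORIES = {"is_a", "part_of", "associated_with"}


def link(source, relation, target):
    if relation not in RELATION_CATEGORIES:
        raise ValueError(f"relation category not predefined: {relation}")
    if source not in concepts or target not in concepts:
        raise KeyError("both concepts must exist before they are linked")
    concepts[source].relations.append((relation, target))


link("C1", "associated_with", "C2")
print(concepts["C1"])
```

Restricting `link` to a fixed set of categories mirrors the idea of predefined relations: new relation types are a modeling decision, not something that appears by accident during data entry.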
Intelligent terminologies can be used in multiple ways. Here are some examples:
- To discover and visualize knowledge hidden in documents or translations.
- To store knowledge.
- To check the correct use of terms and translations.
Intelligent terminologies are particularly useful in situations where information needs to be extracted, as is the case with a document to be translated. A document as such is only a collection of words. Before they start translating, translators must analyze the text: identify the subject, spot the ambiguities, recognize the relations between words and understand the concepts conveyed by the text. This process can be time-consuming, especially when dealing with large documents, but it can be accelerated with intelligent terminologies. Terminology entries can be regarded as building blocks of knowledge and can therefore automatically make knowledge visible in unstructured text using techniques such as annotation and highlighting. An annotation tool uses terms and metadata from the terminology database to highlight or mark up terms in the document. Relations between the concepts in the database and the term attributes make it possible to do the following:
- Visualize a context for connected relevant terms, such as income > equity > taxes (as opposed to income > sales > goods, which may require a different translation in some languages).
- Highlight terms with a special usage attribute (such as prohibited translation).
- Highlight categories of terms based on their properties (such as: type of task, text classification, UI object or dialog, part of speech).
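A very small annotation step along these lines could look like the following Python sketch. The `<term>` markup, the attribute names `category` and `usage`, the sample entries in `TERMBASE` and the function `annotate` are assumptions for illustration; commercial annotation tools use their own formats.

```python
import re

# Invented terminology entries: lowercase term -> metadata used for markup.
TERMBASE = {
    "income": {"category": "finance", "usage": "preferred"},
    "dialog box": {"category": "UI object", "usage": "preferred"},
    "pop-up": {"category": "UI object", "usage": "prohibited"},
}


def annotate(text, termbase=TERMBASE):
    """Wrap every known term in a <term> element carrying its metadata."""
    # Longest terms first so that multiword terms are matched before their parts.
    alternatives = "|".join(
        re.escape(term) for term in sorted(termbase, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)

    def mark(match):
        meta = termbase[match.group(0).lower()]
        return (f'<term category="{meta["category"]}" usage="{meta["usage"]}">'
                f"{match.group(0)}</term>")

    return pattern.sub(mark, text)


annotated = annotate("Enter your income in the dialog box.")
print(annotated)
```

A viewer can then style the markup by attribute, for example by coloring every term whose `usage` is `prohibited`, and downstream applications can read the same attributes.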
There are several tools available on the market for text annotation. Some of them can tap intelligent terminology databases directly. An annotated text is very valuable for translators, interpreters or researchers who try to quickly identify and retrieve the most important information and information categories in a document.
Annotation can also add markup to content for further processing by diverse applications. In this way, intelligent applications such as chatbots or smart assistants can “understand” annotated content, recognize the elements with relevant information and output the required results. This can already be seen in areas such as technical support for products or in marketing and sales, when connected products such as flight, hotel and rental car are offered to the user as a package.
Similarly, semantic markup from intelligent terminologies supports authors who wish to check the content they produce. For example, they can see which concepts are linked through hierarchical relations, whole-part relations or cause-effect relations and check whether they have forgotten to mention important information in a technical manual.
Translators or quality assurance technologies connected to intelligent terminology repositories can check context-dependent translation variants and identify mistranslations in a particular context.
Intelligent terminologies are still relatively new. Existing solutions differ in the variety of relations they model and the methods they use to implement them. In addition to the lack of a common denomination for this type of terminology repository, there is currently no standard format for the exchange of data that would ensure the interoperability of intelligent terminologies. The TermBase eXchange (TBX) standard cannot represent relations, and the Resource Description Framework (RDF)-based vocabulary organized with the Simple Knowledge Organization System (SKOS) can only be used for a limited range of relations.
There is still some work to be done, but the exciting thing is that intelligent terminologies have emerged and that they are changing the paradigms of terminology work. Ontologies and terminology databases have long lived separate lives. Intelligent systems that try to model and understand natural language — natural language understanding — usually use statistical and probabilistic algorithms and ontologies that are designed to be processed primarily by software applications. On the other side, terminology databases are directed at humans. With the rise of intelligent terminologies, a new category of terminology products has arrived on the market that combines both approaches and offers challenging opportunities for translators and knowledge workers.
Big data and especially language-related big data suffers from the fact that existing algorithms and methods are not very good at processing the many minute facets of natural language. Ontologies can indeed be very efficient in this respect, but they require expert resources like knowledge engineers to build them and this can become complex and prohibitively expensive. Intelligent terminologies can fill a gap here, and create entirely new service opportunities for language specialists while helping them to perform their work as translators more efficiently.