Preserving linguistic diversity in the digital world

The revolution devours its children, as the saying goes. In the digital world, one of these threatened children is the diversity of the languages spoken and written around the world. With the ever-advancing process of globalization, in recent decades a general culture has been promoted around the world in which the English language predominates.

For many less widespread languages it has been difficult to exist and survive in light of this. But when is a language really in danger? According to UNESCO, a language is endangered if its speakers no longer use it, use it in ever fewer areas of communication, or can no longer pass it on from one generation to the next.

On the one hand, a standardized common language promotes some important principles of the digital economy, such as economies of scale, reductions in transaction costs and increases in efficiency. On the other hand, digital technologies can contribute to the preservation of linguistic diversity and therefore enable economic benefits such as the ability to address each customer in an individualized, more intensive manner. The direction in which the pendulum will swing depends not only on the commitment of political players or the affected language community, but also on how quickly progress is made in translation automation technology. Endangered languages need robust translation solutions that can help ensure their survival.

Global linguistic diversity

Between 6,000 and 7,000 languages are spoken around the world. 97% of the global population speak around 4% of the languages and, conversely, around 96% of the world’s languages are spoken by 3% of the global population. Just 3% of all languages are endemic to Europe. According to the Atlas of the World’s Languages (UNESCO), there are 128 languages in the European Union (EU) that are classified as endangered. By contrast, there are a total of 24 officially recognized languages that are working languages in the EU. In addition, the EU has over 60 indigenous regional and minority languages, five of which are recognized as semi-official. These include Catalan, Galician, Basque, Scottish Gaelic and Welsh.

The other endangered languages do not have any official status in the EU. Take Vilamovian, a severely endangered language in southern Poland with 70 speakers, or Transylvanian Saxon, an endangered language of Romania with an estimated 50,000 speakers, according to UNESCO.

Despite its rather limited influence on the educational and language policy of each of the member states, the EU is actively involved in preserving multilingualism and promoting language skills, although endangered languages face many challenges in the EU. META-NET (2012) states that even the relatively highly developed minority languages, such as Basque and Catalan, are part of the high-risk group in terms of their future viability. With the help of the language communities themselves, there are opportunities to preserve these minority languages, for example if they are used in social networks.

There is broad consensus in research that the preservation of linguistic diversity is necessary in order to maintain the flexibility and adaptability of a species. The spectrum of cross-fertilization is decreasing to the extent that languages and cultures are dying out and, as a result, evidence of human intellectual accomplishments is being lost. Languages are an expression of identity, which is characterized by the shared traits of the members of a group. Community and cultural identity contribute to the security and status of a collective existence. What’s more, languages have a role that the internet has come to rely on today: being a repository for history and knowledge. From a sociocultural perspective, a language contains a way of thinking and being.

However, preserving linguistic diversity is also worthwhile from an economic perspective. The ability to communicate in several languages is of great benefit to all employees in organizations and companies. It promotes creativity, breaks down cultural clichés, encourages unconventional thinking and can contribute to the development of innovative products and services. People who can master two languages equally well have proven advantages in terms of mental performance, which are particularly expressed in divergent thinking, creativity and communicative sensitivity.

Overcoming language barriers

It is difficult to predict how the information society will develop in the future. But one thing is certain: the global economy is confronting us with an increasing variety of languages and their speakers. We can react to this by standardizing languages, as has been done in recent years, or we can address this challenge with technological solutions.

Every day gigabytes of text are sent around the world that are not composed in our native language. The European Commission has established that 57% of internet users in Europe purchase products and services digitally that were not communicated in their native language. English was the most commonly used foreign language, followed by French, German and Spanish. A few years ago it was still possible to talk about English as the lingua franca of the internet, but the situation has changed dramatically. The volume of content in various other European, Asian or Arabic languages is growing steadily. Which language stays and which must go is being renegotiated — on a political, media, cultural and technological level.

While the traditional media in Europe have contributed to the fact that certain languages are no longer published, the internet and new technologies could contribute to their preservation. With an estimated 80 languages, rich linguistic diversity is traditionally one of Europe’s most important cultural treasures. And in order to ensure this treasure is preserved, new technological stimuli are needed in the area of language technology. The demand for solutions of this kind is huge. Ideally, they will work as silently and independently as software solutions in areas such as energy, logistics and trade. Whether written or spoken, digital language technology could be the key for a global/European society in which people work together, do business, share knowledge and have discussions without barriers. Language technology developments are already working virtually unnoticed in complex software systems to remove barriers, from Google Translate to Word’s spellcheck to online translation services. And the market for language services is growing steadily. The sales volume of the global language services market is estimated to be $37 billion. This year, the volume is set to increase to $47 billion. However, in all the enthusiasm for automated translation software, it is often forgotten that translators and interpreters do not translate from one language into another in a linear way. Cultural and emotional aspects such as humor and irony are as yet difficult to convey and automate by machine.

Content in more languages, on more channels, in more formats — that is the present situation for language service providers (LSPs). The future promises continually rapid change in the sector. As well as the ability to create good translations, LSPs must now not only have excellent technical competence, adaptability and high integration speeds, but also be able to take into account and manage the complete value-added chain. Those that do not invest greatly in technical innovations, whether developed in-house or purchased, will have no chance in the LSP market in the medium term. However, those that stand out in terms of content and technology will not only stay in the market in the long term but also open up new markets for themselves. Localization is the magic word here, and the decisive factors for this are digital technologies for the automation of translations.

Translation tools as language memory

If you were browsing for the first time, looking for solutions for the automation of translations, you’d quickly come across translation memory (TM) systems, which have been in use for many years. The principle behind them is that translators save their translations in TM systems, a kind of digital language memory, which is constantly fed with new translations. If a similar text is translated in future, the system compares the segments in the new source text with translations already carried out and shows potential matches. Building up a TM has the advantage that the translation process becomes quicker and more efficient with each new translation. The longer you work with such a system, the larger the database of previously translated text segments, which many translators calculate at a reduced price per word. Not only do these systems make translations in all languages even more efficient, but they can also make a significant contribution to the preservation of endangered languages. For example, once a TM system for a specific geopolitically closely linked language pair, such as English and Welsh, or Russian and Belarusian, has grown to a certain size, part of the vocabulary is conserved and its use is economically interesting. The more specific the cultural context and literary nuances are, the more important the human factor still remains. However, traditional TM systems are just the precursor for what is generally known as neural machine translation (NMT) with the aid of artificial intelligence (AI).
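The matching step described above can be sketched in a few lines of Python. This is only an illustration, not how any particular TM product works: the function names, the 0.75 threshold and the English-German sample segments are assumptions made for the example, and the similarity measure is the generic one from Python's standard difflib module rather than a dedicated fuzzy-match score.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two segments, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def lookup(memory: dict[str, str], segment: str, threshold: float = 0.75):
    """Return (score, source, target) tuples for stored segments that
    resemble `segment`, best match first."""
    scored = [(similarity(segment, src), src, tgt) for src, tgt in memory.items()]
    return sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)

# Hypothetical memory built from earlier translations (source -> target).
memory = {
    "Open the file.": "Öffnen Sie die Datei.",
    "Close the window.": "Schließen Sie das Fenster.",
}

# A new, slightly different source segment still finds the earlier translation.
for score, source, target in lookup(memory, "Open the files."):
    print(f"{score:.2f}  {source!r} -> {target!r}")
```

Every translation added to the dictionary raises the chance that a future segment finds a usable match, which is the efficiency gain the paragraph describes.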

Digital language technologies can now optimize all translation processes, content production and knowledge management for European languages. These new technologies also include machine translation (MT) based on AI. So far, these systems have mainly operated using statistical methods. One example of this is the Google Translate feature. This statistical machine translation system analyzes as many bilingual items of text as possible. It “remembers” the correlation of frequently used and closely associated words and grammatical forms in source and target texts. On the basis of this statistical knowledge, the translation is then created.
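The idea of "remembering" which words tend to appear together in source and target texts can be illustrated with a toy word-association model. This is a deliberately simplified sketch and not the method of any real system: production statistical MT relies on alignment and phrase models trained on millions of sentence pairs. The four-sentence corpus, the function names and the choice of the Dice coefficient as association score are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Hypothetical miniature parallel corpus (English, German).
corpus = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a house", "ein haus"),
    ("a book", "ein buch"),
]

def train(pairs):
    """Count how often each word occurs and how often word pairs co-occur."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return src_count, tgt_count, joint

def best_translation(word, model):
    """Pick the target word most strongly associated with `word` (Dice score)."""
    src_count, tgt_count, joint = model
    scores = {
        tgt: 2 * joint[(word, tgt)] / (src_count[word] + tgt_count[tgt])
        for tgt in tgt_count
        if joint[(word, tgt)]
    }
    return max(scores, key=scores.get) if scores else None

model = train(corpus)
print(best_translation("house", model))
print(best_translation("the", model))
```

Normalizing by word frequency keeps very common words from winning every comparison simply because they co-occur with everything.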

In MT, a distinction is generally made between rule-based and statistical systems. The former process languages according to linguistic rules. The words, syntax and grammar of a source text are analyzed, classified and shown in a tree diagram. Following the analysis, these elements are transferred into a target language structure and into a target sentence. The process of precisely breaking down the natural language of a human in a digital way is full of variables. Statistical methods can calculate probabilities for translations, whereas with the rule-based method you also need to know the foreign language. In both cases, the system must be trained and its “knowledge” constantly developed.
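The rule-based sequence of analysis, transfer and generation can be sketched as follows. This is a minimal illustration under strong assumptions: the three-word English-Spanish lexicon and the single reordering rule are hypothetical, the sentence is handled as a flat list of tagged words rather than a full syntax tree, and grammatical agreement is simply hard-coded in the lexicon.

```python
# Hypothetical toy lexicon: English word -> (part of speech, Spanish word).
LEXICON = {
    "the": ("DET", "la"),
    "red": ("ADJ", "roja"),
    "house": ("NOUN", "casa"),
}

def analyze(sentence):
    """Analysis: tag each source word with its part of speech.
    Raises KeyError for words missing from the lexicon."""
    return [(word, LEXICON[word][0]) for word in sentence.lower().split()]

def transfer(tagged):
    """Transfer: Spanish usually places the adjective after the noun,
    so swap every ADJ that directly precedes a NOUN."""
    result = list(tagged)
    i = 0
    while i < len(result) - 1:
        if result[i][1] == "ADJ" and result[i + 1][1] == "NOUN":
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2
        else:
            i += 1
    return result

def generate(tagged):
    """Generation: replace each source word with its target word."""
    return " ".join(LEXICON[word][1] for word, _ in tagged)

def translate(sentence):
    return generate(transfer(analyze(sentence)))

print(translate("the red house"))
```

Every rule and lexicon entry here has to be written by someone who knows both languages, which is the practical difference from the statistical approach that the paragraph points out.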

NMT: opportunities and pitfalls

NMT is a more advanced method of translation. The combination of big data, deep learning and neural technologies allows the systems to be trained to understand diverse relationships between source and target languages and not just direct proximity relationships. The structure is loosely based on the workings of the human brain, which is capable of learning new connections based on diverse associations. During the translation, the NMT system determines the optimum route to a solution from various possible options. But no system is perfect, any more than our brains are. If you have masses of data to feed it with, it is hard to beat this type of system. However, if the texts to be translated are very specialized, such as in the financial sector or law, or the sentences are too long or syntactically complicated, even these methods quickly reach their limits.

So if the linguistic richness of Europe is to be preserved with the aid of artificial intelligence, technological progress must gain momentum, otherwise it might be too late for some languages. According to a study by META-NET, there is a lot of catching up to do in the field of machine translation from a European perspective. And not only in terms of “minor” languages. The reasons for this lie in the complexity of such applications, as already described.

Language technology, endangered languages and CEE

The fact that it can be worthwhile to use new language technologies for supposedly minor languages is demonstrated by the Bulgarian language. It is spoken by an estimated nine million native speakers, mainly in Bulgaria but also in Greece, Macedonia, Romania, Serbia, Turkey, Ukraine, Australia, Canada, the USA, Germany and Spain, as well as in Croatia, the Czech Republic, Hungary, Israel, Moldova, Russia and Slovenia. The same could be said for Slovenian, Serbian and Croatian. If a company is facing the question of how it can conquer the Eastern European market, there is no getting past these key languages. With the growing precision of language technologies, it is getting ever easier to tap into these countries linguistically.

At the moment, existing technologies still demonstrate their strengths for standardized text formats, but the more intelligent they become, the more context-based the automated translations are. The existing solutions already result in a long tail effect, which is well known in ecommerce. Every niche can be served by digitization, even niche languages that people previously did not want to translate or could not for cost reasons. This presents numerous opportunities for companies to open up new markets by addressing them correctly using language technology. Whether in the gaming industry, education, mobile info services or learning software, the possible applications are many and varied.

Investment in linguistic diversity is indeed economically sensible, but this goal must also be supported politically and socially. To maintain and promote European linguistic diversity in the digital age, the European Commission has established the European Language Resource Coordination (ELRC). This is a network of representatives of all European languages, with the aim of managing, maintaining and coordinating relevant resources for all officially recognized languages in the EU. The main focus is on the research, further development and qualitative improvement of modern language technologies. In its mission paper to EU institutions, the Language Technology Industry Association (LTI) goes one step further and calls for the use and availability of intelligent language technologies for all EU citizens to help Europe’s multilingualism have a stronger impact with the aid of artificial intelligence. Ultimately, all of those involved must realize that although they speak different languages, on this topic they speak with one voice. Language technology may be a decisive factor in this.

Language is a cultural asset that most of all creates identity. It is worthwhile for all stakeholders to also invest in minor and exotic languages. The European Commission has already recognized this, but languages continue to die out in CEE and around the world. Wherever linguistic worlds are endangered, there needs to be social awareness and the will to preserve these languages. Civil society and politics have a joint obligation to invest in new technologies now in order to preserve spoken diversity and carry it into the future.