The Sámi languages are spoken by a few thousand indigenous Northern Europeans in the region of Sápmi, which spans four countries: large areas of Norway and Sweden, the northern areas of Finland, and Russia’s Kola Peninsula. In Norway, the country with the largest Sámi population, the people are represented by the Norwegian Sámi Parliament. In 2004, the parliament received funding from the Norwegian government for a project to ensure a digital future, and thus a future, for the Sámi languages. This kickstarted a 20-year effort to develop language technology (LT) tools for these extremely low-resource languages, with the express goal of making Sámi languages as easy to use in the digital world as Norwegian or English.
The Norwegian Sámi Parliament created the Divvun research and development group to begin tackling the project, and I was appointed project manager and head of the group. In 2011, the group was transferred to the University of Tromsø (UiT) – The Arctic University of Norway. Throughout the project, Divvun collaborated closely with another research group, called Giellatekno, also at UiT.
Initially, we targeted three of the eight Sámi languages: North Sámi (around 25,000 speakers), Lule Sámi (1,000-2,000 speakers), and South Sámi (about 600 speakers). Later on, we added more Sámi languages, as well as other indigenous languages spoken in the Nordic countries and in Canada.
After two decades, the team has developed most LT tools considered necessary in today’s digital environment: keyboards for all eight Sámi languages and for six platforms, spell and grammar checkers, hyphenators, electronic dictionaries, machine translation (MT), and speech synthesis. There is also an experimental version of speech recognition in the works.
In many ways, the project has been a resounding success and serves as a model for developing LT tools for low-resource languages around the world. However, the limitations of commonly used hardware and software negatively affect the tools’ accessibility and ease of use, bringing into question the project’s potential impact. This article discusses what it took to achieve the project’s successes, why it could be considered a last-mile failure, and what lessons can be applied to future LT development projects.
Image caption: Maja Kristine Jåma and Silje Karine Muotka, members of the Norwegian Sámi Parliament
When working with Sámi languages, the scarcity of both digital and human resources is a constant challenge. For instance, due to previous assimilation policies in the Nordic countries, almost no one born before 1975 learned to read and write their own Sámi language. However, two guiding design principles helped us overcome the lack of existing resources: use grammar-based technologies and make the source code reusable.
Using grammar-based technologies and focusing on reusability of the source code turned the development of the supporting infrastructure into something of a game: getting the most out of the code with as little effort as possible. This way of working is labor-intensive, but the effort is almost negligible compared to the amount of work needed to establish a corpus large enough for machine learning (ML) methods.
With few or no pre-existing digital resources, such as dictionaries and text collections, building LT tools using ML methods is practically impossible. But as long as there are native speakers, it is possible to build almost anything based on grammar. Grammar-based technologies have a long history in Finland; Finnish, just like the Sámi languages, has a complex word structure and internal changes in the words as they are inflected.
We relied on two grammar-based technologies: (1) finite-state machines using formalisms developed by researchers at the University of Helsinki and Xerox in the 1980s and 1990s and (2) Constraint Grammar (CG), a robust formalism for syntactic parsing that was also developed at the University of Helsinki in the 1990s. Finite-state transducers (FSTs) are used for roughly word-level processing (including inflection, derivation, and sound changes), while CG is used for everything else. Both technologies have proven fast, efficient, and robust for a large range of purposes. FSTs have proven capable of handling every human language thrown at them so far. CG processing takes the output from the FSTs and processes it further, depending on the purpose of the tool.
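To make the division of labor concrete, here is a minimal Python sketch of the two steps: word-level analysis that returns every possible reading, followed by a context rule in the spirit of Constraint Grammar. It is not our actual toolchain. The word-level step is a plain stem-plus-suffix lookup standing in for a compiled transducer, the data is a tiny English toy lexicon, and all names (`STEMS`, `SUFFIXES`, `analyze`, `disambiguate`, `parse`) are hypothetical.

```python
# Toy stand-in for the word-level (FST) step: a stem + suffix lookup.
# English is used purely for illustration; all data here is hypothetical.
STEMS = {"walk": ["N", "V"], "the": ["Det"], "dog": ["N"], "she": ["Pron"]}
SUFFIXES = {
    "N": {"": "+N+Sg", "s": "+N+Pl"},
    "V": {"": "+V+Inf", "s": "+V+Prs+Sg3"},
    "Det": {"": "+Det"},
    "Pron": {"": "+Pron"},
}


def analyze(word):
    """Return every reading (lemma plus tags) the toy lexicon allows."""
    readings = []
    for stem, classes in STEMS.items():
        if not word.startswith(stem):
            continue
        ending = word[len(stem):]
        for pos in classes:
            if ending in SUFFIXES[pos]:
                readings.append(stem + SUFFIXES[pos][ending])
    return readings


def has_tag(reading, tag):
    """True if the reading carries the given tag (tags follow the lemma)."""
    return tag in reading.split("+")[1:]


def disambiguate(cohorts):
    """CG-style rule: remove verb readings directly after a determiner.

    As in Constraint Grammar, the last remaining reading is never removed.
    """
    result = []
    for i, (word, readings) in enumerate(cohorts):
        prev = result[i - 1][1] if i > 0 else []
        after_det = bool(prev) and all(has_tag(r, "Det") for r in prev)
        if after_det:
            kept = [r for r in readings if not has_tag(r, "V")]
            if kept:
                readings = kept
        result.append((word, readings))
    return result


def parse(sentence):
    """Analyze each word, then let the context rule prune the readings."""
    cohorts = [(w, analyze(w)) for w in sentence.lower().split()]
    return disambiguate(cohorts)
```

In this sketch, `analyze("walks")` returns both a noun and a verb reading. `parse("the walks")` keeps only the noun reading because of the preceding determiner, while `parse("she walks")` stays ambiguous because no rule in this toy grammar applies.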
Grounding the LT tools in the grammar of the language means we needed and wanted native speakers on our team. Roughly half the team consists of native speakers, and the rest either speak or are learning one of the Sámi languages. Being able to take your own language knowledge and scholarly training and turn that into tools that are valuable for the language community is a strong motivator and source of joy and pride in our work.
With very limited pre-existing digital resources, and a population of native speakers that is trying to fill all the needs of the community, we can’t afford to redo the same work twice. Besides being boring, it would be a huge waste of resources. Therefore, we tried to build versatile resources that can be used for as many purposes as possible and that can be updated with more information relatively easily.
The prototypical example is the difference between a descriptive analyzer, which should be able to tackle whatever text is thrown at it, and a normative tool like a spelling checker, which should flag everything not within the written standard. We solved this by tagging every word form outside the written standard with +Err/xxx (where xxx can be a descriptive string), so that we can subsequently remove all such word forms from the finite-state model of the language, leaving us with just the accepted standard.
We follow a similar design principle for other use cases, as well. For example, we tag the exceptions and remove them from all tools in which they are not wanted. This way, we can — with a limited amount of work — support many uses of the same core description of a language.
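The tag-and-remove principle can be sketched in a few lines of Python. This is an illustration only: in the real infrastructure the filtering is applied to the finite-state model itself, whereas here a plain dictionary stands in for it. The entries are English stand-ins, and the tag suffixes (`Orth`, `Typo`) and function names are hypothetical.

```python
import re

# Hypothetical descriptive lexicon: surface form -> analysis string.
# Forms outside the written standard carry an "+Err/xxx" tag.
DESCRIPTIVE = {
    "receive": "receive+V+Inf",
    "recieve": "receive+V+Inf+Err/Orth",
    "the": "the+Det",
    "teh": "the+Det+Err/Typo",
}

ERR_TAG = re.compile(r"\+Err/[^+]*")


def build_normative(descriptive):
    """Derive the normative lexicon by dropping every +Err/ form."""
    return {form: a for form, a in descriptive.items() if not ERR_TAG.search(a)}


def spell_check(text, normative):
    """Flag every word that is not part of the written standard."""
    return [w for w in text.lower().split() if w not in normative]


def suggest(word, descriptive, normative):
    """Suggest standard forms that share the analysis minus its +Err/ tag."""
    analysis = descriptive.get(word)
    if analysis is None:
        return []
    target = ERR_TAG.sub("", analysis)
    return [form for form, a in normative.items() if a == target]


NORMATIVE = build_normative(DESCRIPTIVE)
```

Here the descriptive lexicon still recognizes `recieve` and `teh`, so an analyzer built on it can process real-world text, while `spell_check("teh dog", NORMATIVE)` flags both `teh` (tagged as an error) and `dog` (unknown to this toy lexicon). Because the error forms keep their analysis, `suggest("recieve", DESCRIPTIVE, NORMATIVE)` can recover `receive` from the same source.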
As the infrastructure developed and achieved success for the initial three Sámi languages, other language communities wanted on board — both Sámi and others. Around the same time, expectations grew beyond the initial set of tools: spelling checkers and hyphenators. It became clear that we needed to improve the supporting infrastructure to be scalable in two dimensions: languages and tools.
The goal for language scalability was that adding a new language to the infrastructure should take almost no effort; in other words, the only thing needed should be the linguistics. In this effort, we succeeded quite well: it now takes just a minute or two to add a new language in its most basic form, and 10 to 15 minutes to add additional metadata, configuration, and so on.
The idea with tool scalability is that, once a new tool has been developed for one language, it should be effortless to propagate support for that tool to all languages. Without going into details, our experience so far is that this is working as expected. Example tools we have added after first developing a prototype for a single language are grammar checkers and text processing for speech synthesis.
The team developed high-quality spelling checkers for many languages, advanced grammar checkers for some, MT for several language pairs, and speech synthesis for three Sámi languages. A lot of people in Sápmi, Greenland, and the Faroe Islands depend on our tools for their everyday writing support, and even more will do so in the future.
As anyone who has developed LT tools knows, the work never ends. Languages change constantly along with changes in society, and there is a steady need for updates to the source code. Still, when we measure the quality of our tools, and compare them with similar tools for other languages, they are not worse; in fact, in some cases, they are quite a bit better. Often, they are simply different, with different strengths and weaknesses compared to machine-learned tools. The best quality assurance comes from our users, though, who have told us numerous times about the tools’ importance in their daily work. We have even been given a poem from a very grateful user.
Finally, an important aspect of our work is that it almost ensures community engagement. Since a native speaker is needed as a crucial member of the development team, the language community is by definition involved. This, in turn, makes it much more likely that the tools will be used — as they are, in a way, owned by the language community.
Despite all the positive outcomes of the project, key challenges remain related to the technology platforms that users typically employ. Because our researchers don’t own the platforms the users are on, the tools are not as compatible with those platforms as we would like. The platform owners very often don’t see the consequences of their actions for minority language communities. Let’s take a look at a couple of examples.
Tablets are popular in schools. They are relatively cheap and easy to grasp for youngsters growing up on mobile phones. But to be functional for writing, you need a physical keyboard attached. So, tablets plus external keyboards is a common combination in Nordic schools, including in Sámi classrooms.
The problem was that you couldn’t write Sámi using just the external keyboards, because third-party keyboard apps are not given access to them on either Android or iPadOS. So Sámi pupils would have to write either by using the on-screen keyboard only or by jumping back and forth between the physical keyboard and the on-screen one.
The situation for Sámi languages changed late last year, when Apple published an update for all their platforms containing keyboards for eight Sámi languages, including support for hardware keyboards. This was of course very welcome for the Sámi languages, but it does not change the situation for all the other minority and indigenous languages in the world. Additionally, Apple keyboards do not contain a speller, and Apple does not allow the speller in our third-party keyboard app access to the text typed using its keyboards.
When we released the first spellers for North and Lule Sámi in 2007, word processors and other office software only existed on your local computer or terminal server. Software as a service (SaaS) was hardly a thing for regular people — Google Docs went out of beta in 2009. And so, our first three releases ran on local machines.
When office software moved to the cloud, so did the servers and computers running all additional functionality. The languages being served were only the ones that the software producer supported. Suddenly, the Sámi community went back to square one. That’s why, even today, most of our spellers are downloaded and installed on a local computer.
At some point, the makers of cloud-based office suites realized that they needed to allow for extensions: plugins installed in the cloud-based office suite for each user that provide functionality not found in the original software. The plugin author can request access to the text and provide changes to it. Crucially, though, the plugin cannot ask for the language of the text, set the language of the text, make red squiggles under misspelled words, or populate a right-click menu with correction suggestions. Every aspect of the traditional speller is off-limits. What is offered instead is a separate panel outside the actual document window, which the developer can use as best they can, and we have done so for the grammar checker and speller. But it can only batch-process the text, and all traces of interactive, incremental processing are gone.
The examples above are just two in a long list of various issues we have encountered in the project’s last mile: getting our tools into the hands of the users based on where they are, not on where we as developers are allowed to go. To the extent that technology companies have been mentioned, it is only for illustration purposes. The various issues are industry-wide and concern every aspect of language localization and technology, from letter rendering to virtual assistants.
The question soon presents itself: Why have we ended up here? The way we see it, the following three reasons seem like plausible explanations:
The endless list of seemingly careless limitations and restrictions causes uncertainty about whether we’ll be able to get the tools into the hands of the language community. For each item on the list, each language community has to ask every technology provider the same question: “Please, can we get our language in?” Clearly, this does not scale, and it is one of the many forces driving language shift.
So what can be done about it? We propose an “open language” model similar to the idea of open source. In most cases, the software stack is built on many components that together define the business logic or the processes of the system — the functionality that the software is supposed to give to its users. On top of that, there is a thin layer of localization: strings and application programming interfaces (APIs) for language-related services. We believe this thin layer should be opened up to every language community. This layer is what the user sees, and this is where every language matters. The idea is that by opening up this layer to all language communities, each one can decide what is crucial to its members, what they want, and who they will cooperate with. Most importantly, in the open-language model, there is no need to ask for permission to see, speak, read, or write one’s own language.
For open language to actually be beneficial to the language community, three components are required: open access, integrated development environment (IDE) support, and easy distribution.
Platform owners should support open access by automatically making localization data available for developers — for not only their own apps and systems, but also all third-party apps using their ecosystem. Human language APIs and support systems should likewise be openly accessible. To access both data and APIs, it should be enough to be a registered and verified developer, just as for regular software development on a given system.
The default data type for localizable strings should be designed so that platform owners can extract the base locale and make it available to localizers with no effort from the developer. The base locale should also be made available to localizers automatically, at the latest when an app is released, and in a format that supports easy updating of an existing localization.
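One way to picture such a default string type is the following Python sketch. It assumes a design that no platform is known to offer in this form: every name in it (`LocalizableString`, `export_base_locale`, `update_localization`, the string keys, and the placeholder translation `SAVE-XX`) is hypothetical. The point is only that declaring a string can be enough to make it extractable, and that recording the source text next to each translation makes updates cheap.

```python
from dataclasses import dataclass

# Hypothetical registry filled as a side effect of declaring strings,
# so the developer never has to export anything by hand.
_CATALOG = {}


@dataclass(frozen=True)
class LocalizableString:
    key: str
    base: str  # text in the base locale

    def __post_init__(self):
        _CATALOG[self.key] = self.base

    def resolve(self, translations):
        """Use the translation if it matches the current base text."""
        entry = translations.get(self.key)
        if entry is not None and entry["source"] == self.base:
            return entry["text"]
        return self.base  # fall back to the base locale


def export_base_locale():
    """What a platform owner could hand to localizers automatically."""
    return dict(_CATALOG)


def update_localization(existing, base):
    """Split an existing localization into reusable and pending parts.

    Returns (kept, todo): entries whose source text is unchanged, and
    the keys that are new or changed and still need a translation.
    """
    kept = {
        key: entry
        for key, entry in existing.items()
        if key in base and entry["source"] == base[key]
    }
    todo = [key for key in base if key not in kept]
    return kept, todo


# Declaring strings is all an app developer does in this sketch.
SAVE = LocalizableString("menu.save", "Save")
OPEN = LocalizableString("menu.open", "Open")
```

A localizer holding an older localization with entries for `menu.save` and a since-removed `menu.quit` would get `menu.save` back unchanged from `update_localization`, with `menu.open` listed as still to be translated. Until then, `OPEN.resolve(...)` falls back to the base text.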
When a package of localized apps and human language processing tools is ready for release, it should be easy to get support for the language into the hands of users. The existing app stores for various systems and platforms would be an excellent avenue for this; in addition to getting apps, you also get language support from your preferred language provider.
Open language means that using your language where and when you want to should be effortless. The burden on developers should diminish, as there would be no need to maintain localizations or to remember to use the correct string type — this should all be handled automatically and invisibly by the platform owners and the IDE. As a further bonus, platform owners can open up huge new markets both for themselves and third parties with a modest investment in their platforms.
The core idea of open language was presented for the first time at the United Nations’ “Language Technologies for All” (LT4All) conference in Paris in December 2019, at the end of the International Year of Indigenous Languages. The period of 2022-2032 has been declared the International Decade of Indigenous Languages; by the end of that decade, I hope all major players in the computing technology industry will adhere to open-language principles and remove all barriers to entry for all of the world’s more than 7,000 languages.
Sjur Nørstebø Moshagen is a chief engineer at UiT The Arctic University of Norway and has been leading Sámi LT development for the past 20 years. Previously, he developed LT at Lingsoft of Helsinki, Finland. Sjur holds a degree in general linguistics.