New resources for endangered languages

Languages such as French, Italian, German and Spanish are undeniably the language industry powerhouses. However, languages with both large and small community bases, from Igbo to Ojibwa, are stepping up and finding a way to make their presence known. In the twenty-first century, this equates to being available on a digital platform, and tremendous work is currently being done by individuals, companies and educational institutions to bring greater linguistic diversity to the world of digital communication.

The reality is that people across the globe are relying increasingly on technology and are gaining greater access to mobile phone networks and the internet. If people cannot navigate this technology in their own language, there are two likely results: they will remain at a disadvantage because they cannot access information as readily, and they will most likely move away from their native language to one that allows them to access this global network. Even though companies such as Microsoft have increased their efforts to provide linguistic support for their products to a wider market, they still manage to offer products and services, with limited capabilities, for just 108 of the almost 7,000 languages spoken in the world today.

Interestingly, the languages chosen for technology resource development have historically not been selected primarily on the basis of number of speakers. Representation is disproportionately skewed toward European languages, some of which, such as Estonian, have fewer than one million speakers. Meanwhile, languages in Africa or India that have well over 70 million speakers, such as Twi, a language spoken in West Africa, have almost no computational linguistic support or online presence, according to the linguistic database Ethnologue: Languages of the World.

It is easy to criticize how this situation developed, but some of it is understandable, as it stems from the historical, cultural and economic ties between the United States and Europe. It is also due to the business drive for the highest return on investment, which until recently has meant a greater focus on Europe, and then Asia, and is only now expanding beyond those horizons on a larger scale. However, the world is changing as more people access technology, whether through computers, cell phones or tablets. Additional markets with millions of people are opening up, and it will be increasingly important for businesses to respond in order to take best advantage of them. Proof of this trend can be seen simply by observing Microsoft and the growing list of languages offered for its products, or by looking at Facebook and Twitter to see the ever-expanding language options there.


Movement to support under-resourced languages

This brings us to another movement that has emerged: the development of linguistic resources for under-resourced languages with large numbers of speakers, as well as technology support for struggling language communities. One company supports this by empowering its user base to take responsibility for the localization of its products. Mozilla has a long tradition of user engagement and empowerment, which has grown into support for endangered languages driven by its user base. Whether in India, Latin America or the United States, the company is working to engage the communities that use its products. This has included motivating the Basque, Tswana and Welsh linguistic communities, among many others, to begin localizing Mozilla products into their languages.

Mozilla Nativo is one community of users and activists working to strengthen the technological support for the many indigenous languages in Central and South America. Mozilla is also working to update Fennec, its Firefox mobile browser for Android, so that it can offer more languages than are currently supported by the underlying Android system. Mozilla is constantly working to improve its support for its language communities. It is actively finding ways to increase accessibility for more languages and to reach out to new linguistic communities to get them involved in the localization of its products.

Another effort, taking a different approach, comes from Google. Google has worked in conjunction with Eastern Michigan University and the University of Hawai’i at Manoa to start The Endangered Languages Project, whose mission is to record and share research on endangered languages as well as to provide a space where advice on best practices can be readily shared among those working to document languages under threat. One specific project is the building and maintaining of the Catalogue of Endangered Languages (ELCat). To house the Endangered Languages Project, Google later created the Alliance for Linguistic Diversity, whose mission is to “accelerate, strengthen and catalyze efforts around endangered language documentation, to support communities engaged in protecting and revitalizing their languages, and to raise awareness about ways to address threats to endangered languages.” This alliance collaborates with organizations across the globe that are working to maintain and preserve under-resourced or endangered languages.

Moving on to some perhaps less well-known companies and initiatives, Idibon is a language technology and consulting company that has called specifically for natural language processing (NLP) support for all languages. Idibon works with companies to help them understand their linguistic data and use NLP programs to accomplish their tasks. This often includes encouraging and coaching companies to diversify the languages their business model can support. Idibon is representative of other emerging companies with a similar focus that are putting energy into educating businesses about the need to diversify linguistically beyond the most commonly considered languages, and in this way support under-resourced language communities.

Linguists, community activists and software developers are working together to create a plethora of technological tools, including video games, to encourage the use of these languages and to demonstrate their pertinence and applicability in the current world. Thornton Media, based in Las Vegas, Nevada, released a video game in the Cherokee language in 2013, and other companies are now starting to follow suit for other indigenous languages. For Star Wars fans and language aficionados, in 2013 the Navajo Nation Museum produced a version of Star Wars dubbed in Navajo with permission from Lucasfilm (Figure 1). For those interested in learning Inuktitut, Tusaalanga is a free iPhone application that helps users learn this now-official language of the territory of Nunavut. The app was created by the Pirurvik Centre for Inuit Language, which, in addition to developing the app, provides language courses across Nunavut. The Pirurvik Centre has also been working with Microsoft through its Local Language Program (LLP) to create Inuktitut language packs for Windows operating systems and Office products.

Microsoft’s Local Language Program works to increase access and tools for many different language communities. One recent project was the development of a new font called Ebrima that contains specialized characters for the N’Ko, Tifinagh, Vai and Osmanya scripts, in addition to the Latin characters that are used in many languages throughout Africa. In an article about this project on the Microsoft LLP website, the program manager in the Microsoft Typography Group said that “In designing Ebrima, [they] wanted to align it with [their] Segoe family of user interface fonts to provide the same modern experience offered in other, more well-established fonts.” Microsoft recognized the need for a font that would enable more linguistic communities to engage in digital communication in their native languages.

Commenting on the progress made with the creation of the Ebrima font, Charles Riley, a catalog librarian for African Languages at Yale University, said that “The fact that Microsoft has shown a genuine interest in revitalizing local languages and cultures gives a tremendous amount of momentum to efforts to revive at-risk languages.” More can be learned about Microsoft’s efforts in working with under-resourced languages by visiting the LLP website.

A slightly different company, Ogoki Learning Systems, Inc., based on the Sandy Bay Ojibway First Nation Reservation in Manitoba, Canada, has been active in supporting the Ojibwa language with a language learning smartphone app and the development of other tools. This company is 100% First Nations owned and was started by Darrick Glen Baxter. One of the goals for his company is “to preserve and strengthen the ancestral heritage of Canada’s First Nations, Inuit and Métis people.” Another individual who created her own company is Monica Peters, an independent software developer and a member of the Mohawk Nation of the St. Regis Mohawk community, also known as Akwesasne. Peters similarly designed a language app for the Mohawk language and created her own company to support the development of apps for other Native American languages on various mobile devices and platforms.

Universities and educational institutions are not only teaming up with large companies but are also supporting student projects that are being developed further outside of the classroom. One example is a series of e-books in Cree that was started by Caylie Gnyra, a former University of Alberta student. Another example comes from Robert Jimerson, a member of the Seneca Nation and a graduate student at Rochester Institute of Technology. With the assistance of a grant, he is working with linguists to develop an online dictionary for the Seneca language, a challenging task due to the language’s highly inflected morphology. Once completed, however, this tool will greatly aid Seneca language learners.

Another project is underway for the Cayuga language. Cayuga: Our Oral Legacy is a five-year plan to increase the number of fluent speakers. Through this program, and with cooperation between the Linguistics Department at Memorial University in Newfoundland and Labrador, Canada, and the Woodland Cultural Centre at the Six Nations Reserve in Ontario, efforts are being made to create resources such as a Cayuga e-dictionary and grammar, along with a downloadable Cayuga keyboard. Making these languages accessible on a digital platform is key to demonstrating their relevance in this digital era, and it is also a way to disseminate this material globally.


Challenges facing these linguistic communities

One challenge still facing under-resourced and endangered languages is that most linguistic analysis methods and approaches have not only been tested solely on European and Asian languages, but were also designed primarily for them. This is a problem because, although European languages may at first glance appear to be very diverse, they come almost entirely from the same language family, Indo-European, and thus share many foundational structural elements. Hence, a linguistic and computational approach that works well enough for many European languages may not work as well for a language from a different family with a completely different structure. Additionally, creating supporting tools such as fonts, and then making keyboards commercially available that handle the numerous scripts in a convenient and feasible way, can also be a hurdle for some languages. Just because a font is available doesn’t mean that it is easy to use with a traditional keyboard with Latin characters.
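To make the keyboard problem concrete, the short Python sketch below looks up one sample letter from each of the scripts mentioned earlier (N’Ko, Tifinagh, Vai and Osmanya) using the standard unicodedata module. The sample letters are an arbitrary choice made for illustration and are not taken from the Ebrima project. The point is only that none of these characters falls in the ASCII range that a basic US-style Latin keyboard produces directly, so some additional keyboard layout or input method is needed to type them.

```python
import unicodedata

# One sample letter per script named in the article. The code points are
# illustrative picks from each script's Unicode block (an assumption of this
# sketch, not a list taken from any font or vendor documentation).
SAMPLES = {
    "N'Ko": "\u07CA",
    "Tifinagh": "\u2D30",
    "Vai": "\uA500",
    "Osmanya": "\U00010480",
}


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def is_ascii(ch: str) -> bool:
    """True if the character is in the 7-bit ASCII range."""
    return ord(ch) < 0x80


if __name__ == "__main__":
    for script, ch in SAMPLES.items():
        print(f"{script:9} {describe(ch)}  ASCII: {is_ascii(ch)}")
```

Running the script prints the code point and Unicode name of each sample letter, along with whether it lies in the ASCII range.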

While there is still more work ahead for these communities, every day another linguistic community begins developing a new tool that will help maintain or revitalize a language. This could be a smartphone app that allows community members to send text messages in their native language, or an add-on that allows them to view websites and use web browsers in their language, since these products have not yet been localized for them by the larger companies. We in the language industry should pay attention to these developments and support these efforts.