More than 99% of content on the internet is written in one of 35 different languages.
The remaining 6,000 languages or so are consigned to relatively small corners of the web, making digitization a key to language preservation as the internet becomes increasingly ingrained in global society. With a significant round of grants from the National Endowment for the Humanities going toward Native American language digitization projects earlier this year, along with several other recent projects to digitize indigenous languages across the world, it’s worth exploring how these projects can aid in language revitalization.
English is the most common language you’ll find on the internet, accounting for a little more than 62% of written content online. It’s followed by Russian and Spanish, at 5.9% and 3.7% respectively. These numbers are by no means an accurate depiction of how widely spoken each language is: English is spoken by roughly 18% of the world’s population, while Mandarin Chinese, spoken by about 14% of the global population, makes up only 1.4% of the internet’s content.
The overrepresentation of English (and a few other languages, such as Persian) leads to an underrepresentation of other languages, with less widely spoken and low-resource languages bearing the brunt of this effect. With the internet playing a large role in people’s media consumption habits, a language’s presence on the internet, or lack thereof, can affect its longevity, particularly in cases where the language is already endangered.
Prior to the widespread adoption of the internet (and computers as a whole, really), language revival was a lot more difficult than it is today, which should say a lot, considering that it’s not particularly easy today either. In pre-digital language revival efforts, linguists often had to spend a significant amount of time developing printed dictionaries and grammars based on interactions with the remaining native speakers.
Of course, linguists still have to undertake similarly trying and time-consuming efforts, but storing the linguistic data online saves physical space and creates resources that are more easily accessible. This makes it a lot easier for researchers and language communities alike to share information about the language.
Moreover, publishing written and spoken content online in a given language contributes to the corpus of publicly available information in the language — groups like WikiAfrica have capitalized upon this to help disseminate information in various African languages that are not widely used online.