Localizing user experience: Hindi transliterations and spelling variants

Hindi, written in the Devanagari script, is one of the two official languages of India, along with English. It is the lingua franca of the Hindi Belt and is also spoken in other regions of India. This widespread use of Hindi by 340 million first-language speakers (according to the 2011 census) gives rise to spelling variation, as some communities have simplified or pidginized Standard Modern Hindi to suit their own usage and pronunciation.

Currently, the available spelling and writing data in the Hindi language centres on the Standard Modern Hindi spelling standards set out by the Government of India, but Hindi speakers around the world use different accepted and understood spelling varieties.

With a huge increase in technology usage in India, Hindi users are choosing to communicate in ways which make informal written conversation and understanding faster. We understand that Hindi mobile users want to type and formulate communications in both the Devanagari script and a Romanized form of Hindi words in Latin script, with the interface also presenting in both scripts as required.

Although there is an official system for writing Hindi in the Latin script, Hindi speakers are far more flexible in their informal romanization, especially as that romanization is phonetic and pronunciation differs from region to region across India.

This type of language variation (spelling variation) needs to be considered by Natural Language Processing (NLP) tools working with text prediction models in solutions such as keyboards, for instance. To provide a better user experience, these tools need to better reflect their users’ way of communicating. This raises the need for accurate data that better represents the possible spellings that a Hindi speaker might use to transliterate Devanagari wordforms.

As a solution to this, Oxford Languages has created a colloquial transliteration data feature which presents all of the possible Latin spellings for a Hindi word. For example, क्योंकि can be transliterated as ‘kyonki’, ‘kyunki’, and ‘kyuunki’. We wanted to be able to present spelling varieties, without hierarchy, in a data solution to improve the Hindi keyboard and writing experience in tech. The types of variation we wanted to be able to present include: anusvara vs half-letter spellings, vocative plural forms, nuqta usage, full /r/ vs half /r/, and finally, old vs prescriptive vs modern spellings.
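To make the idea concrete, here is a minimal sketch of how a keyboard or prediction tool might consume this kind of variant data. It is an illustration only, not the Oxford Languages data format: the names VARIANTS, build_reverse_index, and lookup are hypothetical, and the single entry is the example given above. Each Devanagari word maps to an unranked list of Latin spellings, and a reverse index lets any of those spellings lead back to the same word.

```python
from collections import defaultdict

# Hypothetical sample data: one Devanagari word and its colloquial Latin
# spellings, taken from the example in the text. The list is unranked.
VARIANTS = {
    "क्योंकि": ["kyonki", "kyunki", "kyuunki"],
}


def build_reverse_index(variants):
    """Map each lowercased Latin spelling to the Devanagari words it may represent."""
    index = defaultdict(set)
    for devanagari, spellings in variants.items():
        for spelling in spellings:
            index[spelling.lower()].add(devanagari)
    return index


def lookup(index, typed):
    """Return the Devanagari candidates for a typed Latin string (sorted for stable output)."""
    return sorted(index.get(typed.strip().lower(), ()))


REVERSE_INDEX = build_reverse_index(VARIANTS)

print(lookup(REVERSE_INDEX, "Kyunki"))   # ['क्योंकि']
print(lookup(REVERSE_INDEX, "kyonkee"))  # [] (spelling not in the sample data)
```

Storing the candidates as a set reflects the "without hierarchy" goal: no spelling is treated as more correct than another, and any ranking would be left to the application that uses the data.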

Our goal in developing these data features in lexical datasets for languages which have variations, such as Hindi, is that they can be used in written NLP and generative AI applications to improve the native-speaker experience for Hindi users.

There are many nuances in a language that need to be considered by technologies to present a more localized solution to their audiences, raising the need for language experts in the development of data.

Learn more about Oxford Languages at https://languages.oup.com/ 

Emily Hoyland
Emily Hoyland is a product manager for datasets and innovation at Oxford University Press.


MultiLingual Media LLC