The trials and tribulations of digitizing Urdu, the 10th most spoken language in the world

February 14, 2022

A page of Urdu poetry, written using Nastalīq.

Urdu speakers have long lamented the poor digitization of one of their traditional scripts, Nastalīq.

A recent story in Rest of World highlights the trials and tribulations that some Urdu speakers have faced in adapting their script to digital formats. Nastalīq is an Arabic-based script that was originally developed for Persian calligraphy. Although its characters are mostly similar in shape to those of the standard Arabic style (known as Naskh), Nastalīq is traditionally written along a fluid, diagonal line, so that letters at the beginning of a word sit slightly higher than letters at the end.

Persian speakers reserve the script for poetry and, as a result, have not faced the same struggle to use their language effectively in digital formats. In Urdu, on the other hand, Nastalīq has been far more widespread in day-to-day life. As digital communication has grown in importance over the last three decades or so, Nastalīq users have struggled to make the script work on screens.

Historically, system interfaces have been built primarily with Latin-based scripts in mind, in which characters are arranged neatly along horizontal lines and written from left to right. Tech leaders have generally been slow to digitize scripts that deviate from this grid-like arrangement of straight horizontal lines, whether written left to right or right to left. Even the traditional Mongolian script, which is written along straight vertical lines from top to bottom, has yet to be fully utilized online.

Because Nastalīq has been so challenging to adapt to a digital format, many Urdu speakers have taken to using Naskh, which is written along straight horizontal lines, or even a non-standardized form of the language written in Latin script. In 2014, Pakistani American programmer Mudassir Azeemi wrote a letter to Apple explaining the issue. Three years later, the company released its first Nastalīq typeface for iOS users, but some pitfalls remain.

Over time, this Latin-centric approach to rendering text digitally has become somewhat fossilized. One Pakistani software engineer, Zeerak Ahmed, told Rest of World that some words are rendered too small to read in Apple’s typeface. Ahmed is currently developing an Urdu language dataset to support better machine learning projects for the language.

Because the language has taken so long to digitize properly, artificial intelligence models for Urdu also lag behind their counterparts for other languages. Although Urdu has more than 200 million speakers worldwide (Ethnologue ranks it as the tenth most widely spoken language in the world), it has not seen the same advances in fields like machine translation as other widely spoken languages such as Hindi, Arabic, or English. Ahmed told Rest of World that, unfortunately, “all Urdu software is broken because the underlying data is broken.”
