The Underrepresentation of African Languages in Tech

Like any other continent on Earth, Africa is home to hundreds, and potentially thousands depending on who you ask, of indigenous languages. Some of them you’re likely to have heard of: languages like Kiswahili or Igbo, which are spoken by millions of people. Then, of course, there’s a wealth of languages spoken by smaller populations, often limited to small geographical areas (and a wide range of languages in between the two ends of the spectrum).

While the languages of Africa represent a host of language families, features, and cultures, there’s one thing they have in common, no matter how big or small: They’ve been all but neglected by the tech industry. Even widely spoken African languages like Kiswahili are underrepresented in fields like machine translation (MT) and speech recognition, in spite of the fact that they have a wide range of speakers across numerous nations.

“We are not the main target market for these big companies, the companies like Google, Amazon, Apple, have really concentrated on maintaining their key business sectors, their biggest clients, who are mainly Europe and the U.S.,” said Joshua Businge, a Uganda-based tech entrepreneur, in an interview with DW News.

For example, Apple’s virtual assistant Siri is capable of speech recognition in numerous languages and dialects, though most are Indo-European or native to Asia. Two Afro-Asiatic languages, Arabic and Hebrew, are available on Siri; however, both of these are native to Asia. In spite of having a native speaker-base roughly ten times the size of Hebrew’s, Kiswahili is nowhere to be seen on the list of languages available on Siri, indicating a significant oversight on Apple’s part.

Still, in the field of speech recognition, there have been some major strides toward improving accessibility for African languages. Facebook’s recently launched wav2vec-Unsupervised program can develop speech recognition systems for under-resourced languages like Kiswahili, as it works on untranscribed speech data (unlike traditional speech recognition systems, which require transcribed speech, and transcription takes significant time and energy). And as for MT, things may be looking up in this field as well: A team of researchers across the continent recently won the Wikimedia Foundation Research Award for their development of MT benchmarks for African languages classified as “left behind.”

Andrew Warner
Andrew Warner is a writer from Sacramento. He received his B.A. in linguistics and English from UCLA and is currently working toward an M.A. in applied linguistics at Columbia University. His writing has been published in Language Magazine, Sactown Magazine, and The Takeout.


MultiLingual Media LLC