Developing More Accessible Speech Recognition Systems

Researchers at Facebook have developed an artificial intelligence (AI) system that doesn’t need transcribed audio data to recognize speech — the system can be trained successfully on unpaired text and audio alone. In a recent blog post announcing the development of wav2vec-Unsupervised, Facebook says it developed the software to make speech recognition more accessible for languages like Tagalog, Swahili, and Tatar, which lack the extensive libraries of recorded and transcribed data that more heavily researched languages like English and Arabic have.

“Whether it’s giving directions, answering questions, or carrying out requests, speech recognition makes life easier in countless ways,” the team of AI researchers writes. “But today the technology is available for only a small fraction of the thousands of languages spoken around the globe. This is because high-quality systems need to be trained with large amounts of transcribed speech audio. This data simply isn’t available for every language, dialect, and speaking style.”

Traditional speech recognition systems are typically trained on hundreds of hours of spoken data that has been carefully transcribed to accurately represent the sounds produced. Generally, the more data available, the more accurate and robust the speech recognition software — this is less of a problem for speakers of English or Mandarin Chinese, as phoneticians have already conducted substantial research on the sound systems of these languages, and speech recognition developers can tap into a large amount of transcribed data fairly easily.

The same cannot be said for the vast majority of the world’s languages: they may be spoken by fairly small populations or in remote regions of the world, or there simply hasn’t been enough data collection. That’s where wav2vec-Unsupervised comes in — according to Facebook, the system requires only a library of unlabeled recorded speech and unpaired written text in the language. The system can then train itself on this data alone, ultimately reducing the error rate to levels comparable to speech recognition systems trained on transcribed data.
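The blog post doesn’t detail the training mechanics, but the published wav2vec-Unsupervised approach learns without transcripts by playing speech-derived phoneme sequences against phonemized written text in an adversarial setup. The following is a deliberately toy NumPy sketch of that idea only — all dimensions, weights, and names here are illustrative stand-ins, not Facebook’s actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; the real system uses learned wav2vec 2.0
# speech representations and a phoneme inventory derived from text).
N_FRAMES, FEAT_DIM, N_PHONEMES = 20, 16, 8

# "Speech" side: unlabeled audio features with no transcript attached.
audio_feats = rng.normal(size=(N_FRAMES, FEAT_DIM))

# "Text" side: unpaired written text, converted to one-hot phoneme rows.
real_phonemes = np.eye(N_PHONEMES)[rng.integers(0, N_PHONEMES, size=N_FRAMES)]

# Generator: maps audio features to a distribution over phonemes.
W_gen = rng.normal(scale=0.1, size=(FEAT_DIM, N_PHONEMES))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

fake_phonemes = softmax(audio_feats @ W_gen)  # shape (N_FRAMES, N_PHONEMES)

# Discriminator: a simple scorer that should rate real text-derived
# phoneme sequences higher than the generator's output.
w_disc = rng.normal(scale=0.1, size=N_PHONEMES)

def disc_score(seq):
    return float(np.mean(seq @ w_disc))

# Adversarial objective, conceptually: train the generator until the
# discriminator can no longer tell its output from real phonemized text.
gap = disc_score(real_phonemes) - disc_score(fake_phonemes)
print(f"discriminator gap before any training: {gap:.4f}")
```

In the actual system, both networks are trained iteratively until the speech-to-phoneme mapping becomes indistinguishable from real text, which is what lets the model learn a language’s sound-to-symbol correspondence from unpaired data alone.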

“AI technologies like speech recognition should not benefit only people who are fluent in one of the world’s most widely spoken languages,” the blog post reads. “Reducing our dependence on annotated data is an important part of expanding access to these tools. We hope this will lead to highly effective speech recognition technology for many more languages and dialects around the world.”

Andrew Warner
Andrew Warner is a writer from Sacramento. He received his B.A. in linguistics and English from UCLA and is currently working toward an M.A. in applied linguistics at Columbia University. His writing has been published in Language Magazine, Sactown Magazine, and The Takeout.


MultiLingual Media LLC