More people than ever are using some form of automatic speech recognition (ASR) software.
Typically, this comes in the form of voice assistants like Apple’s Siri or automatic transcription programs like Otter.ai. Despite their prevalence, these tools clearly have room for improvement: it can feel like a gamble whether Siri will correctly register certain words, especially when the speaker has a non-standard accent or is in a slightly noisy environment.
“In real-world scenarios, ASR is challenging due to various factors, such as background noise, speaker accent, gender, and so on,” reads a paper recently published in the journal CAAI Artificial Intelligence Research.
According to the researchers behind the paper, accurate recognition is particularly difficult in Mandarin Chinese. The team, from the Hong Kong University of Science and Technology and WeBank, a Chinese banking company, set out to tackle the problem with its own approach to ASR. In the paper, published at the end of August, the researchers describe that approach, which was found to reduce character error rates by more than 25%.
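For readers unfamiliar with the metric, character error rate is conventionally computed as the edit distance between the recognized text and a reference transcript, divided by the length of the reference. The sketch below shows that standard calculation in Python; the function names and the example sentences are illustrative and are not taken from the paper.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn hypothesis into reference."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete a reference character
                curr[j - 1] + 1,                  # insert a hypothesis character
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference transcript."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative example: one wrong character in a six-character sentence.
reference = "今天天气很好"
hypothesis = "今天天汽很好"
print(character_error_rate(reference, hypothesis))
```

Because Mandarin text has no spaces between words, error rates are usually counted per character rather than per word, which is why the article cites a character error rate.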
Typically, ASR works by first decoding sound waves into phones, or representations of speech sounds. Then, a language model generates a set of possible sentences or words, and another predicts which is most likely to occur. If a sound isn’t clearly registered and decoded into accurate phones, however, it becomes difficult for the ASR system to produce an accurate transcription.
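The pipeline described above can be illustrated with a toy sketch in Python. Everything in it is hypothetical: the tiny lexicon, the hand-written bigram scores standing in for a trained language model, and the function names. Real systems decode audio with an acoustic model, whereas this sketch simply takes the decoded phones (written as toneless pinyin) as its input.

```python
from itertools import product

# Hypothetical toy lexicon: each decoded phone sequence (toneless pinyin)
# maps to the characters that share that pronunciation.
LEXICON = {
    "ta": ["他", "她", "它"],
    "zai": ["在", "再"],
    "jia": ["家", "加"],
    "xia": ["下", "夏"],
}

# Hypothetical bigram scores standing in for a trained language model.
BIGRAM_SCORE = {
    ("他", "在"): 3.0,
    ("她", "在"): 2.5,
    ("在", "家"): 4.0,
    ("再", "加"): 2.0,
}


def generate_candidates(phones):
    """Expand each decoded phone into its possible characters."""
    return [LEXICON[phone] for phone in phones]


def score(sentence):
    """Sum the bigram scores of adjacent characters; unseen pairs score 0."""
    return sum(BIGRAM_SCORE.get(pair, 0.0) for pair in zip(sentence, sentence[1:]))


def pick_most_likely(candidates):
    """Choose the highest-scoring combination of candidate characters."""
    return "".join(max(product(*candidates), key=score))


def transcribe(phones):
    return pick_most_likely(generate_candidates(phones))


print(transcribe(["ta", "zai", "jia"]))  # phones decoded correctly
print(transcribe(["ta", "zai", "xia"]))  # last phone misheard as "xia"
```

In the second call the intended character 家 is not among the candidates at all, so no amount of rescoring can bring it back. That is the failure mode the researchers describe below.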
“Traditional learning models are not robust against noisy acoustic model outputs, especially for Chinese polyphonic words with identical pronunciation,” said Xueyang Wu, a researcher at the Hong Kong University of Science and Technology, who worked on developing the model. “If the first pass of the learning model decoding is incorrect, it is extremely hard for the second pass to make it up.”
To improve the accuracy of ASR and make it more resistant to noisy environments, the researchers developed what they call a phonetic-semantic pre-training (PSP) framework. If a word is corrupted by noise, the framework analyzes the semantic context in which the word occurs to recover it. The researchers also pre-trained the model through a “noise-aware curriculum,” which they say mimics the way humans perceive sentences and words in a noisy environment.
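The article does not spell out how the noise-aware curriculum is built, so the following Python sketch is only one plausible reading of the idea, not the researchers' method. It assumes that training starts on clean sentences and that the chance of swapping a character for a same-sounding one rises linearly as training proceeds. The homophone table, the linear schedule, the 0.3 ceiling, and all function names are assumptions made for illustration.

```python
import random

# Hypothetical homophone table used to simulate acoustic confusions.
HOMOPHONES = {"他": ["她", "它"], "在": ["再"], "家": ["加"]}


def noise_rate(step, total_steps, max_rate=0.3):
    """Assumed schedule: corruption probability ramps linearly from 0 to max_rate."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max_rate * min(step, total_steps) / total_steps


def corrupt(sentence, rate, rng):
    """Replace each character that has a homophone with probability `rate`."""
    noisy = []
    for char in sentence:
        if char in HOMOPHONES and rng.random() < rate:
            noisy.append(rng.choice(HOMOPHONES[char]))
        else:
            noisy.append(char)
    return "".join(noisy)


def curriculum(sentences, total_steps, seed=0):
    """Yield (step, noisy_input, clean_target) pairs with growing noise."""
    rng = random.Random(seed)
    for step in range(total_steps + 1):
        rate = noise_rate(step, total_steps)
        for clean in sentences:
            yield step, corrupt(clean, rate, rng), clean


for step, noisy, clean in curriculum(["他在家"], total_steps=4):
    print(step, noisy, clean)
```

Under this reading, a model trained on such pairs would learn to map a noisy input back to the clean target, seeing easy examples first and harder ones later.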