Our ability to perceive and process speech isn’t just about hearing all the right sounds. It’s also about seeing them.
Lip reading typically takes a back seat to auditory input, but it has been shown to play a role in our ability to understand the speech sounds we hear. While widely used speech recognition tools like Siri or Otter generally analyze audio alone, researchers have also made progress in developing visual speech recognition (VSR) models, which rely on visual input to identify what a speaker is saying.
Researchers at Imperial College London recently published a paper outlining their efforts to develop a VSR model and address some of the challenges typically associated with this technology. In the process, the researchers developed a model that outperforms many existing state-of-the-art models and can also recognize speech in multiple languages.
“Our model takes raw images as input, without extracting any features, and then automatically learns what useful features to extract from these images to complete VSR tasks,” Pingchuan Ma, who developed the model and contributed to the corresponding paper, said in an interview with TechXplore. “The main novelty of this work is that we train a model to perform VSR and also add some additional data augmentation methods and loss functions.”
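To make that design concrete, here is a minimal PyTorch-style sketch of an end-to-end VSR model in the spirit Ma describes: a learned spatiotemporal front end consumes raw mouth-region frames and a sequence encoder produces per-frame output distributions. The layer sizes, module choices, and tensor shapes below are illustrative assumptions, not the authors' actual architecture.

```python
# Illustrative sketch (not the paper's exact architecture): an end-to-end
# VSR model that consumes raw video frames and learns its own visual
# features, rather than relying on hand-engineered feature extraction.
import torch
import torch.nn as nn

class TinyVSR(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        # Spatiotemporal front end: learns visual features directly
        # from raw pixels while preserving the time axis.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                      stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # pool space, keep time
        )
        # Sequence encoder over the per-frame features.
        self.encoder = nn.GRU(64 * 4 * 4, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Per-frame output distribution (e.g., trainable with a CTC loss).
        self.head = nn.Linear(2 * hidden, vocab_size)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 1, time, height, width) grayscale mouth crops
        x = self.frontend(frames)                # (B, 64, T, 4, 4)
        x = x.permute(0, 2, 1, 3, 4).flatten(2)  # (B, T, 64*4*4)
        x, _ = self.encoder(x)                   # (B, T, 2*hidden)
        return self.head(x)                      # (B, T, vocab_size)

model = TinyVSR(vocab_size=40)
logits = model(torch.randn(2, 1, 75, 88, 88))  # 75 frames of 88x88 crops
print(logits.shape)  # torch.Size([2, 75, 40])
```

The point of the sketch is the pipeline shape: nothing upstream of the network extracts features, so the convolutional front end must learn whatever visual cues are useful for the recognition task.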
Tech enthusiasts have long suggested incorporating lip reading into speech recognition technology, but as the researchers note in their paper, early attempts to develop VSR models only functioned in lab settings. While the technology has improved, those improvements are largely limited to English, as most VSR models are trained on English data.
Ma set out to develop a model that could also process speech in French, Italian, Mandarin, Portuguese, and Spanish, focusing on adjustments to the model design rather than merely increasing the amount of training data. To do this, he used publicly available audiovisual training data (for example, TEDx Talks) in each of these languages while adjusting the design to incorporate time masking, improved language models, and optimized hyperparameters. Upon testing it, the researchers found that the model outperformed many state-of-the-art VSR models by significant margins.
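Time masking, as a general augmentation technique, zeroes out a random contiguous run of video frames during training so the model cannot lean on any single stretch of the clip and must use the surrounding context. A minimal sketch follows, assuming grayscale mouth crops shaped (batch, channels, time, height, width); the tensor layout and mask length are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of time masking for video: a random contiguous
# span of frames is blanked out during training, forcing the model to
# infer the masked content from context.
import torch

def time_mask(frames: torch.Tensor, max_mask_len: int = 15) -> torch.Tensor:
    masked = frames.clone()
    num_frames = frames.size(2)
    # Pick a random mask length and a random start position.
    mask_len = int(torch.randint(1, max_mask_len + 1, (1,)))
    start = int(torch.randint(0, num_frames - mask_len + 1, (1,)))
    masked[:, :, start:start + mask_len] = 0.0  # zero out the chosen span
    return masked

augmented = time_mask(torch.randn(2, 1, 75, 88, 88))
```

This mirrors the time-masking idea popularized for audio by SpecAugment, applied along the frame axis of the video instead of a spectrogram.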
“We have presented our approach for VSR and demonstrated that state-of-the-art performance can be achieved not only by using larger datasets, which is the current trend in the literature, but also by carefully designing a model,” the researchers conclude.