When Google first announced the development of the Translatotron in 2019, the speech-to-speech translation program was commended for its ability to produce target-language speech that actually sounded like the original speaker’s voice, rather than a preset, computer-generated voice (like that of Siri or Alexa). The program’s ability to retain the original speaker’s voice was deemed an incredible feat, though there was one issue with the technology:
Some observers feared the technology had potential for misuse, warning that it could be used to create “deepfakes” — highly believable content that uses an individual’s likeness to portray them saying or doing something they never actually said or did. Because the program could not only retain the original voice but also generate target-language speech matching another person’s voice, malicious users had a ready tool for creating such deepfakes.
Google recently claimed to have solved this issue in its updated version of the program, Translatotron 2, which only allows users to produce translations in the original speaker’s voice, thereby barring ill-natured users from putting words in another person’s mouth, so to speak. Google also claims that Translatotron 2 outperforms the original version of the program, both in the naturalness of its speech and in the quality of its translations.
“The trained model is restricted to retain the source speaker’s voice, and unlike the original Translatotron, it is not able to generate speech in a different speaker’s voice, making the model more robust for production deployment, by mitigating potential misuse for creating spoofing audio artifacts,” write the researchers who developed the new version of the Translatotron.
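The contrast the researchers describe can be thought of as an interface change: the original design accepted a voice embedding as a free input, so any reference recording could be supplied, while the new design derives that embedding from the source audio internally. The following is a purely illustrative Python sketch of that idea, not Google’s code; every function and variable name in it is hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def speaker_embedding(audio: np.ndarray) -> np.ndarray:
    # Stand-in for a learned speaker encoder: reduces audio to a fixed-size voice vector.
    return audio.reshape(-1, 64).mean(axis=0)

def synthesize(translated_text: str, voice: np.ndarray) -> np.ndarray:
    # Stand-in for the speech decoder: emits audio conditioned on a voice vector.
    return np.tile(voice, 10) + 0.01 * rng.standard_normal(voice.size * 10)

source_audio = rng.standard_normal(64 * 50)     # the speaker being translated
reference_audio = rng.standard_normal(64 * 50)  # some other person’s recording

# Original-Translatotron-style interface: the voice vector is a free input,
# so a caller could pass any reference audio — the spoofing risk.
spoofable = synthesize("hola", voice=speaker_embedding(reference_audio))

# Translatotron-2-style interface: the voice vector is computed internally
# from the source audio itself, so there is no slot for a foreign voice.
def synthesize_v2(translated_text: str, source_audio: np.ndarray) -> np.ndarray:
    return synthesize(translated_text, voice=speaker_embedding(source_audio))

safe = synthesize_v2("hola", source_audio)

Under this framing, the safeguard is structural rather than policy-based: there is simply no parameter through which a different person’s voice could be supplied.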
In addition to reducing the program’s potential for misuse, Google described a number of other updates in the paper introducing the new model. The company notes that the updated version is significantly more accurate in its translations — MarktechPost’s Amreen Bawa writes that, thanks to the improved performance and reduced deepfake potential, the program could be a major breakthrough in the field of speech-to-speech translation.