Voicing the unvoiced
Interview by Cameron Rasmusson
Birth Place: Milan in 1962
University: Università Statale di Milano, Computer Science with a final graduation project in computer music
Number of employees at Voiseed: 14 as of August 2022
Favorite Hobby: Riding vintage enduro motorbikes from the late ’70s and early ’80s
Fun Fact: I love singing and playing in pubs with my band and my friends
Audio is one of the most essential ingredients to a truly immersive digital experience. From music to sound design to voice-acting performances, audio brings us into the center of the story and promotes empathy with characters and their plights. But the road toward the robust digital-audio tools we enjoy today had its share of twists and turns.
Andrea Ballista has been a part of that evolution from the early days of the technology. And today, he’s still at the forefront, this time in the exciting field of AI-generated audio. He spoke to us recently about his career and some of the exciting technologies in development.
Tell us a little about who you are and what you do. What is your main focus at this stage in your career?
I’m an audio expert who has been working in the game and localization industry for more than 25 years: After founding Binari Sonori in 1994, I joined Keywords Studios as audio director in 2014. Two years ago, I started a new challenge in the field of AI and virtual voices with my new company, Voiseed: a startup focused on creating a new AI-based technology to synthesize expressive speech in multiple languages with multiple voices.
This technology will be the core engine for the development of specific software solutions and applications for the creation and localization of expressive voice content for multiple markets and verticals, with the primary aim to voice the unvoiced.
Your educational background is in music and specifically computer music. How does your interest in music inform the work you do today?
Music, songwriting, and singing have always been the main passions of my life: They led me into computer music in the ’80s, when it was at the forefront of innovation, even though it wasn’t yet considered “computer science.” My interest in music is also an interest in voice, in singing, and in how the voice is able to convey emotions. So it is like coming back to my roots, with a new focus on deep-learning technology.
You’ve taken that focus on audio and made it a major theme of your work. Early on, for example, you developed a MIDI sequencer that was quite innovative. Can you tell us about that?
In the ’80s, MIDI was a revolutionary system capable of representing symbolic music information to control a network of MIDI-compatible external sound-synthesis devices in real time. What was not yet possible was using MIDI to control internal DSP modules as virtual instruments within the computer itself: The MIDI/DSP sequencer was an extremely innovative solution to that problem, and the approach is now common in almost every MIDI/audio sequencing environment, such as Virtual Studio Technology (VST). Today, I’m going in a similar virtualization direction, but with voice and deep learning. It’s a kind of virtual voice technology — that’s exactly what Voiseed is doing.
Since then, you veered toward game localization. What drew you to that particular field of language work?
I started from MIDI, music, sound, and voice. Then the technology evolved from floppy disk to CD-ROM, and so it was pretty natural for me to evolve into the multimedia solutions space and voice recording in the early ’90s. Voice recording was definitely the primary link toward game voice dubbing and then game localization.
You’ve worked with some of the biggest players in the video-game industry, like Sony and Microsoft. What were some of the projects you worked on with them? Any particular standouts over the years?
That was a very special moment: working with both major platform owners and helping them to deal with voice recordings in multiple territories. We started with French and German, then added Italian and Spanish, and then managed to localize the content into more than 12 languages, dealing with hundreds of actors for one single project, with all the casting, compressed schedule, and asset management issues. This was one of the moments in which it became clear to me that a new and more flexible dubbing solution was really needed.
How has game localization evolved from the time you started to now?
The changes were impressive, from receiving a “localization kit” made up of paper and floppy disks — very stable and locked down, with just one language release — to an email storm with assets changing every minute and simultaneous releases in more than 12 languages. And now the industry is shifting toward the games-as-a-service model, with a continuous flow of assets, updates, patches, and multiple releases.
Given your background in audio, you have a strong emphasis on dubbing. That’s obviously an important part of localizing any video game with voice acting. What are some of the most exciting developments in the field right now?
The changes started in 2016 with the improvements in WaveNet and voice cloning: From that moment, it was clear that speech synthesis was about to change. Since then, we have seen quite a lot of improvements, but all of them are essentially based on cloning. But voice cloning is not the solution for virtual dubbing — we need new technologies developed to solve specific use cases, rather than adapting and stretching the existing tech to partially solve multiple use cases.
One of the most cutting-edge technologies enables computer generation of voiced lines. In fact, your work in that field won you the 2022 Innovator of the Year award at LocWorld Berlin. Can you tell us where that technology currently stands?
Our solution is not based on voice cloning: It is a new and more flexible tech, able to virtually synthesize and emulate any voice and any style in any language. We are currently finalizing the core dataset and the AI modules and are preparing the October launch of our first AI voice-dubbing application with core features and languages.
It seems to me this could have a tremendous impact on the entertainment industry and particularly in video games. How do you see this changing the status quo in the years to come?
While we believe that our solution is innovative and unique in the dubbing industry, there are multiple technology-based solutions that will change the way we work in the localization industry using data and AI. The past few years have been driven by innovation in the machine-translation space, but we will see more and more solutions in the voice space in the future.
I’m sure voice actors have their share of nervousness over potentially being replaced by AI, which speaks to a common anxiety throughout the language industry and, for that matter, multiple other industries. What’s your perspective on that?
Thank you for your question, which allows me to clarify our vision: Our primary aim is voicing unvoiced content, the stories that are still untold today simply because there are not enough resources to handle all of them in the required time frame, in all languages.
We see a very strong similarity with machine translation here: MT has been able to redefine the translation processes and bring more translated content to the global audience, but humans/experts in the loop remain a key element to produce quality output.
Machine dubbing looks very similar, with humans/experts always in the loop: There will simply be more voiced content in multiple languages available to final users. And while the process and overall productivity will change, the human-produced voice will always sit at the top of the pyramid, delivering exceptional quality as usual.
Is there anything else you want to mention?
Just a final note to say that AI and data science will have a huge impact on all the processes we currently have in place: Innovation will increasingly become part of our day-to-day experience. And the key will not only be the new technologies themselves, but especially the way we are able to adopt them smoothly and efficiently into our usual workflows.
Cameron Rasmusson is the editor-in-chief of MultiLingual Media.