Stretch, Bend, and Adapt | MultiLingual July 2023

VOICEOVER

Stretch, Bend, and Adapt
On the resilience of the voiceover industry

BY JOAN DANS

I recently succumbed to an ad on Facebook inspiring me to become more flexible, and before I knew it I’d purchased a yoga app. Screencasting my phone to the TV, I opened the app to find a lovely woman about my age smiling calmly at me. “This is going to be just what I need,” I sighed, and pressed start.

As I sat cross-legged with my eyes shut, deep-breathing in anticipation as a pan flute lilted in the background, a jarring sound suddenly flung my eyes open. A very non-human, uncaring, and infuriatingly robotic voice ordered me to take a deep breath and find my center. Just like that, my sense of calm vanished and I grew furious in a matter of seconds, all because of this sterile, AI-generated voice. The woman on screen smiled at me and demonstrated a bend, but I needed to pause the app to collect myself.

“OK, I can do this,” I told myself. “Who cares if it’s AI? The important thing is the yoga.” So, I proceeded to perform the first few poses along with my on-screen teacher, wincing unintentionally every time the AI barked a command at me.

A few poses in, the robot voice told me, “Don’t round your back too much and fold in the hip joints.”

“Wait,” I stopped. “Am I supposed to fold at the hip joint or not fold at the hip joint?” The teacher bent at the hips, but the AI’s command sounded like the opposite. By then, my yoga teacher had moved on to the next posture, and I continued to contemplate the ambiguous sentence structure and how I would have rewritten it to be clearer when read by an inflectionless voice.

I gave my new yoga app a few more tries. But even though I wanted to take my teacher to lunch and be her friend, I could not stand her monotone AI colleague. I deleted the app soon after my first session.

AI is currently considered the big disrupter, but this is nothing new. Leaps in technology have been disrupting the voiceover business for more than 30 years. For us, it’s been a continuous cycle of “adapt or die.”

When I began my voiceover career in the early 1990s in Japan, voiceovers were recorded to tape — half-inch thick, obscenely expensive, reel-to-reel tape that weighed a ton. Studios were large and glamorous, with black leather couches for ad execs to lounge upon. Neve mixing consoles — which cost as much as my grandmother’s house — stretched from one end of the room to the other. The engineer and director worked from designer swivel chairs, and enormous, top-of-the-line speakers hung on wood-paneled walls.

Studios were a necessity back then, as home recording equipment simply didn’t exist. The studio, even more than the talent, was often the biggest expense in any voiceover project, and for that reason, the ability to speak without making many mistakes was a critical factor in what made a voice talent “in-demand.” Cutting tape took time, and the tape itself was costly and could not be reused much without risking degraded quality.

My career as an English-language voice talent in Tokyo was a daily parade through the fanciest studios in the city, where directors became impatient if I couldn’t smoothly read a sub-par translation (which most of them were, by the way). My ability to re-translate on the fly, rather than aborting the session and rebooking even more studio time to record the new script, saved many a project and elevated my career.

Then, a revolution began. Although Pro Tools, audio recording software that’s now widely viewed as the industry standard, was first released in 1991, it didn’t quite take hold until a few years later. By the mid-1990s, all the fancy studios found themselves in a pickle. Brand-new voiceover studios with hundreds of thousands of dollars’ worth of state-of-the-art gear were rendered obsolete practically overnight. Cutting tape became an anachronism, people had to learn new skills, and digital recording drastically changed the requirements of the job for everyone involved in the voiceover industry.

My career zoomed along through the rest of the 1990s in Tokyo where, by the turn of the millennium, video game voiceover had become the big thing. Editing with Pro Tools was quick, and game recording sessions saw dozens of voice talents in and out of the studio in a single hour. Technology was making us more efficient even as it forgave our occasional mistakes.

When Carasmatic Productions began as a boutique multilingual voiceover production company in Los Angeles in 2006, we were already fully working in a digital environment. But for large projects like the car navigation systems — our specialty in the early days (at least until those were replaced with Google Maps) — we still hired outside studios and lots of personnel.

As our company was clearing a path through the multilingual voiceover industry, two other breakthroughs shifted the landscape and created an unprecedented need for voiceover recordings. The launch of YouTube in 2005 gave the world the ability to easily stream video on the internet, changing the face of advertising forever. And soon after, the iPhone appeared — suddenly, we could watch those YouTube videos from the palm of our hands.

Early 1990s

Most voiceover work is recorded to tape, using bulky, expensive equipment like Neve mixing consoles.

1991

Pro Tools, a digital recording software, launches and allows voiceover talents to forgo expensive hardware.

Mid-1990s

Pro Tools becomes ubiquitous in the industry, making much of the hardware in the studios obsolete.

2005

YouTube launches, increasing demand for voiceover content for videos and revolutionizing advertising.

2007

Apple releases the first iPhone, making video content significantly more accessible and ubiquitous and further driving demand for voiceover.

Early 2010s

Siri, Alexa, and other voice assistants powered by text-to-speech — rather than human voice talent — gain popularity.

Demand for voiceovers skyrocketed

Having started our company right on the cusp of this web video revolution, we weren’t taken by surprise. We were young, internet-savvy, and located in Los Angeles, where actors from all countries abound, and we lost no time. We bought top-of-the-line gear — as any good Hollywood facility does — and opened a small studio in a remodeled pool house. It looked great: brick walls and wood floors, hand-made bass traps to acoustically condition the space. The smell of chlorine was long gone, and multinational companies began frequenting our studio to record their projects. We were one of only a handful of companies who knew how to handle the linguistic and sonic requirements of multilingual voiceover, and our work got rave reviews. We’d pivoted successfully.

Within just a few years, though, recording hardware and software became affordable to non-professionals, and any human with a voice could put together a home “studio” for a few hundred bucks. Expensive studios became less attractive, and our business pivoted from recording in Los Angeles to recording remotely. The challenge was to become an efficient, centralized hub to organize projects; translate the needs of the job to each talent, many of whom did not speak English; and guide them in recording, editing, and sometimes even setting up their studio.

We trained dozens and dozens of talents in every country, vetting their studio sound until it met our sonic standards. And then we caught the audio as it flew in from around the world and edited it in the pool house, ensuring it matched the client’s various requirements.

Long before the COVID-19 pandemic, we found that working remotely opened up a world of talent to us, and we spent years auditioning top voices in more than 100 languages until we had our crew. We made certain that their audio was indistinguishable from expensive studio audio, and we trained them in our proprietary dubbing techniques to ensure consistency across languages. We were killing it.

Then text-to-speech (TTS) came along.

It didn’t take long for us to realize we were again on the precipice of an enormous change. This technology had not only ended the car navigation side of our business but also presaged the end of voiceover itself. The writing was on the wall, and about to be read aloud by a robotic-sounding voice.

As the tech giants began rolling out their TTS systems, we reluctantly participated by supplying voice talent, knowing all the while that TTS might eventually supplant live, human voices, and we’d all be looking for new careers. We just thought it would take longer than it actually did. But change is inevitable, so one has to focus on the good.

We can find some encouragement by looking at trends in popular music where people have grown tired of the once ubiquitous autotune, preferring their voices more natural, and artists like Adele have famously refused to have anything to do with this robotic-sounding technology. I’d like to think that something similar will happen in voiceover. In the short term, voiceover jobs for which humanity is relatively unimportant — telephony prompts like “Press 1 for English,” for example — will go the way of AI sooner rather than later. We’re all used to robot voices like Siri and Alexa telling us where to go and what to do. We don’t take it personally; we just follow the orders. But for voice work in which personality is essential — such as movie dubbing, commercial advertising, and yoga apps — the human voice still has it over the robot.

The moral of this story is that humanity is not replaceable, and hopefully never will be. As AI voiceover becomes ubiquitous over the next few years, I expect that people will develop a sort of intolerance to the coldness of it, just as they did with autotune. Depending on the application, we’ll begin to long for real voices. For a perceived connection to what we are hearing. For warmth and humanity.

What I like to imagine for the not-too-distant future is an environment in which AI and human voices coexist, but real human voices will regain the edge, even if they initially lose it. The choice of whether to use AI or humans for voiceover work may ultimately come down to how important we perceive emotional connection to be. And products that prioritize budget over humanity, like my yoga app, will be considered inferior. Whatever happens, the voiceover industry will stretch, bend, and ultimately adapt, just as it has throughout its technologically turbulent history.

Joan Dans is a founder of Carasmatic Productions, Multilingual Voiceover Specialists. She’s currently working on her first book.
