Spanish voice recognition
Ingrid Cruz is a freelance writer, independent film director and sometimes a digital nomad. She worked as an interpreter and translator for seven years. She has a studio arts degree from the University of California, Irvine.

When I first began my career as a freelance writer, I had to take several odd jobs to make ends meet. One of these included saying words for a small tech company that was apparently looking for people who speak Spanish in a variety of accents.

A world traveler, I was living in Buenos Aires, Argentina at the time, and my only concern was to hustle so I could make ends meet in one of Latin America’s most expensive cities. I did what I was told. I had several lists of vocabulary words that seemed random, with everything from names of fruits and vegetables to not-safe-for-work terms. All I had to do was read each of these words three times while speaking into the software I was given. Being a dutiful worker, I followed all the instructions and laughed about the fact that jobs like this even exist.

The ad I’d responded to was looking for Spanish speakers from various countries. They wanted native speakers, immigrants and even expats who were spending a limited amount of time in these countries. It seemed they were encompassing as many Spanish-language accents as possible.

I didn’t think about the implications of the job itself until I realized that whenever I used Siri in Spanish, my iPhone didn’t always understand the Argentinian words I’d picked up over the years. It turns out that “Latin American Spanish” didn’t represent everyone.

Several of my friends were from Spain and were living in Argentina as well, and it was funny to have misunderstandings in conversations because we used similar words in different contexts, had phrases that sounded strange to each other, and were all losing ourselves to the Argentine Spanish we heard every day. They, too, fumbled with speech-to-text functions, experiencing similar difficulties when their devices couldn’t understand the mix of old and new accents.

And then I understood that my “speaking gig” was actually a smart business strategy for tech startups.

What is neutral Spanish?

We all know Spanish comes from Spain, but the majority of Spanish speakers now live in Latin America. Regional differences, cultural tradition, isolation, influence of other non-Spanish speaking immigrants, accents and even adopted indigenous words all influence who gets representation.

Of course, marketing to a large swath of people with distinct, sometimes competing, interests can be difficult. Entire companies, industries and small business owners have attempted to create “neutral” or “global” Spanish while leaving Spain’s version of the language intact.

If you subscribe to Netflix, Hulu or other streaming services and want to watch a TV show that is dubbed or subtitled into Spanish, you’ll notice two options: Spanish (Spain) or Spanish (Latin America). The latter is sometimes called Spanish (International). Maybe you stop for a second and wonder why you have two choices for the same language, but this decision later becomes automatic depending on where you live now, or where you learned Spanish.

The thing is, there are many Spanish-language accents that form a part of Latin American Spanish. Some comedians even take delight in imitating these and having people “guess” between the many iterations of Spanish spoken all over the world. Indeed, accomplished Spanish-language actors who can speak in more than one accent with mastery obtain more opportunities and are considered quite skilled. The point being, a neutral or international version of Spanish doesn’t exist.

Remember this the next time you, a resident of Mexico/Spain/Uruguay, hand your phone over to your best friend from Chile/Cuba/Puerto Rico so they can look something up on Siri, only to have Siri explain that she can’t understand your friend. It’s not your friend’s fault; it’s your phone’s.

Just how many Spanish-language accents are there?

Pinpointing the number of Spanish accents is complicated. There are 20 countries and one territory (Puerto Rico) where Spanish is the official language.

Some countries, such as Mexico, Chile and Bolivia, recognize Spanish and indigenous languages as official. You also have to account for countries where Spanish is widely spoken, though it’s not an official or national language, such as Belize, the United States, Brazil and Andorra.

Then, you have to account for regions that were influenced by non-Spanish speaking immigrants, such as Argentina and Uruguay. Spanish spoken in this region is known as Rioplatense or Southern Cone Spanish. Argentinians actually refer to their version of Spanish as Castellano.

For the sake of numbers, let’s just suppose that each country where Spanish is widely, officially or nationally spoken has just one accent. That gives you roughly 30 distinct ways of speaking, influenced by culture, regional differences, political influence and even class differences that could cause people to enunciate differently.

Every country also has groups of immigrants for whom the national language is a second language, and in Latin America and Spain, differences in accent also cause confusion among people. It’s no wonder most industries have chosen to recognize only two forms of Spanish instead of 30 or more.

Let’s also remember that some people have speech impediments and other physical issues that may make it difficult for a device to understand their speech, enunciation or accent as well.

What tech can learn from the entertainment industry

The film and TV industries have some experience with the successes and failures of making media available to the Spanish-speaking world.

In Spain, dubbing became widespread during Francisco Franco’s dictatorship. Not only did this allow him to censor things he found objectionable, he was also able to prevent non-Spanish languages, such as Basque, Catalan and Galician, from receiving attention. Dubbing, then, served a political as well as a practical purpose.

So what can we learn from this? People who translate things from one language to another can choose to focus on one group while excluding others. Franco later loosened some of the regulations requiring media to be translated into Spanish only, and he was known for using dubbing as a way to censor films his regime found immoral.

Spain’s dubbing industry continues today because the habit stuck, and if you’re a voiceover actor in Spain, you can make a lucrative living.

Why is this important? Because tech companies that create voice-recognition software now deal with the same problems as the film industry: they must often choose which type of Spanish they’ll invest in first, depending on their target market (Europe or Latin America). The location, nationality or accent of the staff they hire to train their devices will favor a particular set of people. Inclusion of one group, then, means the exclusion of others.

Spanish-language voice recognition

Tech, however, isn’t like TV or film. We sit and watch a movie or show, but we now expect our phones to understand and even talk back to us. What if Spanish is your second language and your phone only does what you say when you speak English? What if you do speak Spanish as a native speaker, but you come from a country that isn’t well-represented by “neutral” Spanish, such as Chile, Venezuela or Cuba?

One criticism of “neutral” Spanish is that it sounds Mexican to most people, and even then like only a certain region of Mexico. Central Americans who use vos instead of tú or usted as they try to get directions from their house to the nearest pizzeria might be out of luck.

Let’s suppose a person from Spain moves to Mexico City and buys a new phone that is formatted in neutral Spanish. Their phone may not understand their use of the word vosotros because it’s trained for tú or usted. Googling things using phrases that aren’t heard outside of Spain may also be a problem in this instance. At least a person from Mexico or Spain can reasonably expect their devices to understand them when using the right settings. An Argentinian or Chilean has to try harder to be understood and may even forgo voice-recognition software if they’re able to avoid it.

I eventually learned that solving these issues was at the core of what this tech company wanted my help with. I was born in El Salvador, but raised in the United States. I also spent some time living in Mexico City and picked up some Argentinian phrases in Buenos Aires. My employer hired me because my accent, jargon and intonation are all over the place. Someone like me will have to adjust or be extra patient with Siri, Alexa and whatever the tech industry invents to Google the weather for me upon my voice command.

People have fewer problems with this than tech does. True, I had some difficulties adjusting to life in Argentina because my ear had to get used to a different accent. I had to start using different names for certain things so I could get what I needed at the grocery store or a restaurant, for example.

Still, everyone in Buenos Aires understood me for several reasons. Many Argentinians said I spoke “neutral” Spanish, and they were used to hearing my accent whenever they saw foreign-language media dubbed into Latin American Spanish. Others simply understood me because the human brain can make sense of sentences, and a person can just ask you to repeat yourself.

Unlike in film and TV, various Spanish-language markets in tech are still waiting to be filled. The tech industry hasn’t come up with a consensus, standard or guidelines for how to train devices to hear variants of the same language, including the accents of new immigrants or Spanish learners.

Why is it taking so long for technology to catch up?

The seeds of voice-recognition software were planted in the late 1700s by Christian Kratzenstein, a German-born scientist who built a rudimentary machine voice gadget. Charles Sumner Tainter, Alexander Graham Bell and Chichester Bell built yet another device in 1881 that paved the way for 1907’s Dictaphone.

But a true breakthrough came when K. H. Davis developed Audrey (Automatic Digit Recognition) for Bell Labs in 1952, nearly 200 years after Kratzenstein began fiddling with voice replication. The device was only able to recognize the voices of Davis and a few other people. It was also a giant machine, standing six feet tall. There were some breakthroughs after that, but the biggest one came in 2008 with Google Voice.

Google Voice was successfully introduced to the public in 2008, and it took Apple three years to introduce voice recognition. Siri was born in 2011.

It took years to come up with voice recognition as we know it today because the first ever voice-recognition devices only recognized words when spoken one at a time. Computers had to be trained to think more like human beings in order to understand full sentences in English. Most large tech companies that funded research in this technology are from the United States, though China, Japan and South Korea also have large and respected tech industries.

Voice-recognition software still has a long way to go in recognizing voices in certain contexts. Using speech-to-text, or trying to use Siri, Alexa or Microsoft’s Cortana, in a quiet room is one thing. However, these AIs still don’t work well when there’s background noise or in a room full of people who are talking at the same time. Even we humans have a limited capacity to understand many people speaking at once, and AIs have a harder time recognizing voices in loud rooms than we do.

Calibrating computers to understand humans is no work for those who expect instant gratification. Speech recognition software is taught to analyze what you say, then readjust. Sound engineers and tech experts still need to teach devices to handle the many nuances of any language, let alone one that’s been spread throughout large swaths of the world, has different words for names of fruit, contains many double entendres and has ever-changing slang.

Today’s voice-recognition software offers us instant reactions, but ironically, it took centuries of patience to get as far as we are now.

Which brings me back to my one-time gig with this tech company. I was hired in 2014, six years after Google Voice was introduced, and three years after Siri was born. That seems like a long time, but only waiting three years to invest in a project is nothing compared to waiting 200 years. Christian Kratzenstein didn’t live to see the impact he has on us today, and neither did Alexander Graham Bell.

What will it take for computers to truly understand Spanish?

Given how long it took to create voice-recognition programs that can understand full sentences, it may take some time for entrepreneurs to ensure everyone who speaks Spanish is completely understood. Speakers of any language need to be patient with these processes, as languages are ever-evolving.

Engineers and those in the many sectors of technology are working quickly to decode the mysteries of “neutral” Spanish. It’s worth taking some time to do this right and ensure representation of as many people as possible.

One way to do this might be to hire remote voice actors from various parts of Latin America, Europe and other regions where Spanish speakers of all varieties are concentrated.

It also behooves writers, translators and other multilingual professionals to stress the importance of cultural work in technology — not just as a marketing tactic but as a way to represent people fairly. To decode neutral Spanish in our computer-run world, we’ll need this human element.