With all the daily media attention given to artificial intelligence (AI), it’s easy to think of it as a “recent discovery,” and to a certain degree that’s true. While some form of this technology has been with us for decades, we’re still in the early phases of figuring out how to truly harness its power to redefine how we live and work.
One market where AI has already proven effective is content translation and subtitling: it is streamlining workflows and prompting companies to reevaluate many aspects of their operations.
For the media and entertainment (M&E) industry at large, AI will permanently affect (and is already affecting) how we create, edit, distribute, deliver, share, and enjoy content. Even the very definition of “content” is under debate, as is how we define ownership of that content.
None of this is alarmism. The issue is big enough to force the actors’ and writers’ guilds to the picket lines, and big enough to prompt lawsuits by celebrities like Sarah Silverman and Tom Brady over the AI-driven infringement of their work and likenesses.
Before delving into a few specific issues to consider regarding AI in M&E localization, it’s important to make a key distinction. AI has become an almost universal term for any type of robotic assistant or automation technology, so let’s understand the different subsets of the technology.
Machine learning (ML) refers to the ability of “machines” to learn over time. Based on a series of models, habits, and patterns, devices and software continually retry functions to find a better path from A to Z. Think of Netflix recommendations, or, going back to the ’80s film WarGames, the supercomputer that frantically runs through a series of tic-tac-toe games to “learn” the zero-sum nature of nuclear war.
However, there are also analytics to be derived from this technology, which leads to the discussion of generative AI. Generative AI has characteristics similar to machine learning, but on a much bigger scale. It can learn, but it can also audit itself and be more creative, generating new ideas and content: images, videos, and even music. Think of the elephant in the room: ChatGPT and the other forms of generative AI coming onto the market.
It’s one thing to use traditional AI and machine learning to make sure audio is in sync with video. You don’t need it to generate something new out of thin air. It will be successful as long as the proper models for learning are in place.
It’s an entirely different issue to have an AI engine create an English-language version of a Japanese script that matches the video. Now it becomes generative since it’s forced to make interpretations as well as analyze tone, inflection, nuance, and context. It will still be wrong at times, but then it will learn from those mistakes as it looks at an enormous amount of information and begins to recognize idioms and other unique language patterns.
Large language models (LLMs) based on machine learning are ideal for translation and dubbing. But perhaps you need to generate closed captions and subtitles for tens of thousands of assets, add a quality-control component to the process, and adopt new tool sets to mimic voices. That’s where localization houses are exploring the full use of generative AI. From an operational, engineering, or even a non-creative executive perspective, organizations are embracing the potential benefits of cost-efficiency, scalability, and uniformity. For those in the creative space, it’s an understandably more emotional argument: “This is no longer art.” There have already been instances where a system has painted a picture and people have had a hard time telling whether it was human- or machine-generated.
It is now easier to understand the reasons behind the strikes and why people are reacting so strongly. One side contends, “You can’t possibly be creative without using my work,” while the other counters, “This is an incredible tool with the power to reengineer the business world and make us more efficient and faster.”
The reality is that none of these tools will fully replace everything that’s been done by humans for centuries. This is not a three- or five-year problem; instead, we’re on a long slope. The more people who understand this technology and the more these tools get used, the better they become.
Where these tools fit best within the localization market largely depends on the target audience and their tastes and preferences. For example, with content aimed at younger viewers, who are often more interested in a good story with rich graphics, machine learning models for translation and dubbing may be a perfect fit.
Teenagers and adults are often far more discerning, which could make the case for generative AI. It’s a matter of risk versus reward: Are you willing to spend money hiring voice actors for kids’ shows, or are you going to let generative AI manipulate inflections? And at what point does it become overkill?
Our industry has been through this before in the area of visual effects. Does the term “uncanny valley” ring a bell? It refers to the relationship between the human-like appearance of robotic or computer-generated images or objects and our emotional responses. As images become more hyper-realistic, there’s a drop-off point where they simply begin to look unnatural.
A perfect example of this was the theatrical release of The Hobbit, shot digitally at 48fps. What was an artistic attempt to elevate the moviegoing experience failed miserably with audiences and critics, who claimed it looked “too” crisp or clear.
As ML and generative AI tools both continue to improve, we’re seeing another interesting industry dynamic form. The large cloud providers have the capability to build the technology at scale, but they’re not niche enough to apply it to such specific tasks as localization.
On the other hand, smaller companies that are building specific data models are going to get funded quickly because AI is a hot topic. However, they will need to apply their models across different technology stacks, telling their audiences that while they run on Google Cloud or AWS, the model itself is still unique to them. And that model, rather than capacity and scale, is what will become the ultimate value proposition.
For now, it’s anybody’s game — we’re just waiting for the right solution to emerge.
Looking ahead
The M&E industry has had its share of landmark evolutions, from black and white to color, standard definition to high definition, and physical to digital media, with each causing seismic changes in our business. But those shifts were mostly limited to M&E, affecting broadcast, sports, and live entertainment before filtering out to other walks of life.
Now that AI is the “next big shift” in how we interact with the world around us, it’s not only the M&E industry that’s affected; the whole world is trying to figure it out at the same time.
As an industry, we often see ourselves as the universe, but the reality is that we are more like a rounding error relative to the larger issues in the world. In terms of AI, if you focus only on the M&E technology stack, there’s not enough business to go around. Ultimate success will depend on applications that extend to enterprise, government, healthcare, and other markets.
Interestingly, much of the early foundational technology for transcription came out of the automated telemarketing and customer service worlds, where systems needed to recognize speech patterns, dialects, and tone, and then decide how to route a call to the appropriate option.
For now, we are leading in this space because subtitling and dubbing are growing at a massive rate as everybody’s trying to globalize content while still maintaining control of that content.
What we don’t have is the luxury of waiting for another industry to catch up and then adopting its technology use case. It’s a bit of a chicken-and-egg scenario. But one thing is certain: the ostrich effect will not work. You can only stick your head in the sand for so long. Let’s face it: AI is here. Embrace it, or ignore it at your own risk.