Fully Automated Multimedia Localization Possible?

Radek Buchlovsky

Radek Buchlovsky is the publishing and media director at RWS Moravia. He began as a translator and moved into media development after completing studies in linguistics and film directing. Radek has spent the last eight years in the multimedia world using modern technology to drive the localization process forward. He is also a proud leader of the diversity and inclusion committee at RWS Moravia.

Multimedia takes on many forms, some of them easy to localize — like a simple website or an animation — and others more difficult. Video is probably as complicated as it gets. A dream state would be to localize video, accurately, on-brand, on-message and adapted to cultural sensitivities, with no human intervention. If this were to happen, content creation for international markets would increase at an exponential rate that would surprise even the most global of brands. The benefit? Consistently reaching current and new markets with relevant, timely, on-brand content that will meaningfully engage your customers with your product and ultimately increase sales and brand awareness. Plus, your operation is cost-effectively running 24/7, saving you time and money while increasing efficiency.

Sounds ideal, and experts think fully automated multimedia production for localization is not only possible, but will happen in the future. But first: it is important to stress here that we are not talking about automation technology replacing people. What we are talking about is technology that will free creative people up to do what they do best — create. By automating the localization of a piece of source content, which goes far beyond merely translating it, teams of creative talent can save hundreds or even thousands of hours a year in repetitive tasks and instead build imaginative source content that resonates with new and existing customers, and drives brands forward.

But how far away are we from this tipping point?
Take video as an example. This is a complex form of multimedia to produce because it typically has layers of video, audio, text, and graphics that all have to be married into a single, cohesive form. To fully automate this process for localization, you would start with your source content, put it through the automation process and end up with your localized version. Voilà — images, graphics, audio, and everything else would change through a single process without human intervention.

 

How close are we?

There is a lot that we can already do. Sticking with the video example, we can use speech recognition to transcribe the audio and machine translation to translate it. Smart software takes the source voiceover, translates it, and gives us a new recording using a computer-generated voice. These voices are becoming increasingly realistic as the industry focuses on voice tone and inflection to make sentences sound like fluid human speech and not just a string of words. We can also generate subtitles in other languages automatically by using suppliers such as CaptionHub, whose software can translate the subtitles and provide target-language versions that are married to the video in the right places. These areas bring different automation technologies together to localize the content and get us close to our desired end state of full automation.
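To make the subtitle step concrete, here is a minimal sketch in Python of translating subtitle text while leaving the timing untouched, so each translated line still lands in the right place in the video. It assumes the subtitles are in the common SRT format. The `demo_translate` function and its small glossary are hypothetical stand-ins for a real machine translation service, and nothing here reflects how CaptionHub or any other supplier actually works.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Cue:
    index: int
    start: str  # timecode such as "00:00:01,000"
    end: str
    text: str


TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(srt: str) -> List[Cue]:
    """Split SRT text into cues; blank lines separate one cue from the next."""
    cues = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            raise ValueError(f"incomplete cue: {block!r}")
        match = TIMING.match(lines[1].strip())
        if match is None:
            raise ValueError(f"bad timing line: {lines[1]!r}")
        text = "\n".join(lines[2:])
        cues.append(Cue(int(lines[0].strip()), match.group(1), match.group(2), text))
    return cues


def localize(cues: List[Cue], translate: Callable[[str], str]) -> List[Cue]:
    """Translate only the text; index and timecodes are copied unchanged."""
    return [Cue(c.index, c.start, c.end, translate(c.text)) for c in cues]


def to_srt(cues: List[Cue]) -> str:
    """Serialize cues back into SRT text."""
    return "\n\n".join(f"{c.index}\n{c.start} --> {c.end}\n{c.text}" for c in cues) + "\n"


# Hypothetical stand-in for a machine translation call (English to French).
DEMO_GLOSSARY = {
    "Welcome to our product tour.": "Bienvenue dans la visite guidée de notre produit.",
    "Let's get started.": "Commençons.",
}


def demo_translate(text: str) -> str:
    # Unknown lines are returned untranslated rather than guessed at.
    return DEMO_GLOSSARY.get(text, text)


source_srt = (
    "1\n00:00:01,000 --> 00:00:03,500\nWelcome to our product tour.\n\n"
    "2\n00:00:04,000 --> 00:00:06,000\nLet's get started.\n"
)
french_cues = localize(parse_srt(source_srt), demo_translate)
print(to_srt(french_cues))
```

In a real workflow, translated lines are often longer or shorter than the source, so the timing may still need adjusting for readability. That is one of the places where a human check remains useful today.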

 

Three areas where automation is missing

We can translate, but we cannot adapt some source material because it is too complicated for current software to break down, understand and recreate. For example, if the video has a CEO running through a PowerPoint deck, we can translate the CEO’s voice, but we cannot change the PowerPoint slides. There is no technology that can look at the slides and create new versions without human intervention. Because the slides have to be translated elsewhere through a different process, we are left with different visual assets for each language that now need to be integrated into the project by a professional.

Traditional optical character recognition (OCR) software is nothing new. There are a lot of good options out there. Techradar.com lists the likes of Adobe Acrobat Pro DC and OmniPage Ultimate as a couple of the leaders in the field, but the flaw for multimedia automation is in the name — optical character recognition means characters, not images with embedded text. Image-based optical recognition would allow us to translate an image the same way voice recognition translates our speech. So, if we are looking at a PowerPoint slide and we can pass it through the magic optical recognition scanner, it would read both font-based text in a text box and any words embedded in graphics, images or pictures. It would recognize and translate them all.

Mainstream technology is already moving in this direction. If you take a screenshot on most smartphones, the phone will suggest you crop it based on the image shapes within the screenshot. So, our hand-held technology can identify basic shapes, which was not possible a decade ago.

And how about quality assurance (QA)? When the output file is created, whether it is a video, a brochure, or a website, can the quality be trusted without a set of human eyes giving it a once-over? Is there a creative director in the world who would have that kind of faith? Being able to trust a process like this to handle QA would be a huge step forward for the marketing, advertising, and communication industries, whose reputations and livelihoods depend on the accuracy of the message within their content.

 

Automation and true cultural adaptation

Before we dive into our projections for the future, let’s back up and talk for a minute about what localization really means. Of course, any time we talk about fully automating multimedia localization, we are talking about translating the source content. But it is important to remember that the goal is not just a translated video, but a localized video that fully represents the meaning, tone and message of the source video. This is not translation; it is transcreation, which is the process of fully adapting content for a specific target locale. But can we trust machines alone to get the meaning and intent of the message right?

Maybe one day, but not yet. Perhaps image-based optical recognition technology will also have the intelligence to understand that different markets have different visual cultural rules. This would be a technology that takes responsibility for what it sees in an image, compares that to the market it is going to enter and either suggests changes, or even better, makes the changes.

For example, if you are entering Saudi Arabia and your advertising has an image with a woman in a short-sleeved blouse drinking a glass of wine, is this automatically changed into a long-sleeved blouse and a soft drink? In the future, will a machine be able to recognize the problem and take into account the country’s alcohol ban and cultural norms around women’s clothing?

To create this truly local content, humans would need to develop the rules and input the data into the technology to fully automate the work beyond translation — the critical cultural adaptation.
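As a rough illustration of what such human-authored rules might look like, the Python sketch below maps elements detected in an image to suggested replacements for a given market. Everything in it is hypothetical: the `ADAPTATION_RULES` table, the tag names and the market code are invented for this example, and it assumes some other system has already identified what is in the image.

```python
from typing import Dict, List, Tuple

# Hypothetical rule table written by people who know the market:
# market code -> {detected image tag -> suggested replacement}
ADAPTATION_RULES: Dict[str, Dict[str, str]] = {
    "SA": {
        "wine_glass": "soft_drink",
        "short_sleeved_blouse": "long_sleeved_blouse",
    },
}


def suggest_changes(tags: List[str], market: str) -> List[Tuple[str, str]]:
    """Return (detected tag, suggested replacement) pairs for one market."""
    rules = ADAPTATION_RULES.get(market, {})
    return [(tag, rules[tag]) for tag in tags if tag in rules]


detected = ["woman", "wine_glass", "short_sleeved_blouse"]
for found, replacement in suggest_changes(detected, "SA"):
    print(f"suggest replacing {found} with {replacement}")
```

The sketch only suggests changes. Actually making them in the image is the much harder step, and it is the part that does not yet exist.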

The more you feed AI with data, the more it comes to understand when usage is right or wrong. But can it get to the point where it understands that a video, brochure or website is going into a marketplace with vastly different cultural norms from those of the market where it originated, and that it needs to not only transcreate scanned text and imagery, but also adjust them to those cultural norms? Yes, one day, and that day is not too far away.

Tools & Services | SHOWCASE

The future

Imagine a world where all the elements of a multimedia file (all the layers and the creative assets that make up each layer) are accessible to anyone who has the file because they are all within the file. A file that never needs to go back to the source software to be changed. A self-contained file that can be automatically localized into any language without the need for Photoshop, InDesign, Premiere, After Effects or whatever software created it, because the elements to create it are all within that file.

The multimedia creators among us will understand that this is multimedia alchemy, and the answer to translation changes and localization issues. Does this exist yet? No, but neither did digital technology (not really; not for the mass market) 30 years ago. And software evolution is speeding up exponentially, so this could happen in the foreseeable future thanks to AI.

 

The idea is that AI would deconstruct the final file back into layers, make the changes to those layers necessary for localization and then render and recreate them back into the final viewable file again.
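The localize-and-re-render half of that idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `Layer` type, the handler table and the `render` function are all hypothetical, the "translation" simply tags the text with the target locale, and the sketch starts from layers that are already separated. Getting from a flattened final file back to those layers is the part that still needs the AI.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Layer:
    name: str
    kind: str     # "video", "audio", "text" or "graphic"
    content: str  # stand-in for the real asset (pixels, samples, strings)


Handler = Callable[[Layer, str], Layer]


def translate_text_layer(layer: Layer, locale: str) -> Layer:
    # Placeholder: marks the text with the locale instead of translating it.
    return replace(layer, content=f"[{locale}] {layer.content}")


def localize_layers(layers: List[Layer], locale: str,
                    handlers: Dict[str, Handler]) -> List[Layer]:
    """Apply the handler registered for each layer kind; leave other layers as-is."""
    result = []
    for layer in layers:
        handler = handlers.get(layer.kind)
        result.append(handler(layer, locale) if handler else layer)
    return result


def render(layers: List[Layer]) -> str:
    # Stand-in for compositing: stack the layers in order into one output.
    return " | ".join(f"{layer.name}:{layer.content}" for layer in layers)


source_layers = [
    Layer("footage", "video", "beach.mp4"),
    Layer("title", "text", "Summer sale"),
]
german_layers = localize_layers(source_layers, "de-DE", {"text": translate_text_layer})
print(render(german_layers))  # footage:beach.mp4 | title:[de-DE] Summer sale
```

Registering one handler per layer kind means audio, graphics and video could each get their own localization step later without changing the overall flow.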

In layman’s terms, you baked a cake but you want to change the sugar from white to brown. With AI, you could take the white sugar out of the cake and add brown sugar without having to go back to bake a new cake. Magic.

Backgrounds are already creating themselves in software like Photoshop. When you cut out a foreground object or move it, Photoshop will create the new background for you using AI to estimate what should be there. So, if you have a photo of a horse running on a beach and you cut the horse out, Photoshop will add water and sand to fill the hole where the horse was, because it is anticipating that this is what would naturally go there. AI can read the elements of the photo around the cut-out object and figure out it was sand and water. With the right knowledge, there is no reason why this technology cannot evolve to also recognize elements that need to be localized and change images, videos, websites, digital platforms or any form of media, so that it fits a specific country or region without human intervention.

 

Getting there

What we are talking about is the Holy Grail of localizing multimedia content. There is a lot we can currently do, but we are still missing a few key ingredients. When we can get those ingredients in place (and it is a “when” not “if”), there will be a game-changing shift in the localization of multimedia, enabling very local, high-quality content to reach more customers more quickly and at less cost.

The biggest question at the moment is: can we really trust AI processes to preserve the meaning of the message? Can we trust that message to be transcreated? Do we really want to send media out into new markets, in new languages, without a human checking it? Remember: machines, when programmed correctly, eliminate the mistakes humans sometimes make. AI, when it can handle the cultural component capably, may prevent “bad” content from getting out there. More content is not always better, and humans in a hurry can make mistakes and cut corners.

We believe we can trust AI to be intelligent, to really “learn” and to achieve the goal of full automation. At the moment, we are still on the journey. But when we do reach that destination, and there are a lot of people in the industry who think we will get there in the not-too-distant future, then we will have unlimited multimedia localization potential. Content in any language, accurately on-message, precisely on-brand, taking business forward at scale, at speed… at last.