Employing video recording techniques in localization

By Bill Black & Simone Crosignani May 20, 2014

The past 12 months have been particularly significant for the video game industry. The market is now extremely diversified, thanks to a broad range of gaming platforms that differ in cost, technical features and user bases. This unprecedented heterogeneity has produced a very wide range of games: at one end of the spectrum, we have witnessed the definitive rise of indie games, titles created by talented developers who often work solo or in small teams.

At the opposite end, we have been blown away by epic productions such as The Last of Us (Sony Computer Entertainment) and Grand Theft Auto V (Rockstar Games), created by teams of over 200 people with budgets exceeding $100 million. In a sense, the game industry is getting much closer to the film industry. It finally seems able to make room for creative developers who lack technological means but have brilliant ideas, while also churning out Hollywood-like productions that are at the forefront in terms of both spectacle and narrative.

Where film and video games remain far apart is in the audio localization process. Because of the affinity just mentioned between blockbuster movies and games, local markets ask (and rightly so) for “movie quality dubbing” in AAA games. This has proven impossible so far because the two media have different production models. Traditionally, movie assets allow international voice actors to see the complete scenes while recording, with clear references of the faces of their US counterparts. In AAA games such an approach is unimaginable because localization takes place while the game is still in development, with international assets due to ship on the same physical media as the source ones. In other words, when voice actors enter the recording booth in Paris, Madrid or Milan, they usually have only one type of asset to support them: the original audio files, a precious help in understanding how the US actors delivered their lines. Sometimes the development team is able to provide preliminary mock-up materials or raw videos to give the scene more context, but we are still light years away from the scenario of a movie voice actor who can watch the finished (or almost finished) product while working.

Technological advances now allow us to close this gap. More specifically, video-based capture suites make it possible to implement a new, more integrated approach between development and localization. This approach, which originates in the US recording session, lets 3D artists and audio localization teams use the same consistent audio/video foundation to perform their respective tasks.

Motion capture

A few words on motion capture first. Motion capture is the process of recording the movement of objects or people. It’s a technique widely used in the entertainment field, especially in cartoons and games, to animate digital character models. There is a broad range of motion capture systems available, but optical systems based on recording cameras are the most commonly used. Optical systems can rely on markers and 3D position tracking, or be completely markerless, taking advantage of video streams and software-based video analysis.

Motion capture is now quite common in AAA games, especially facial motion capture: the process of recording the facial movements of an actor and replicating them on a character in the desired entertainment product, be it a movie, a cartoon or a video game. The market offers a wide range of suites, or packages of components such as a grabber, an analyzer, a retargeter and so on. MOVA Contour, Image Metrics Faceware Technologies, Organic Motion and Dynamixyz Performer Suites are just a few of the better known capture suites.

But how can markerless facial motion capture help create international audio content that is close to the original material? Everything starts in the US recording studio, when the source dialog is recorded. Actors wear helmets equipped with cameras and lighting (Figure 1). These helmets are lightweight, and the markerless design does not distract actors from their performance. The recording session generates audio and video tracks simultaneously in the same software application, allowing studio producers and directors to evaluate both the audio and the visual performance.

Once the session is over, audio and video files are delivered to the game development team. 3D artists can then begin processing these assets with proprietary software to generate a data stream for their 3D modeling software (Figure 2). All of this is already common practice. The novelty lies in sending the US video recordings, together with the audio files, to international recording studios.

In live action film and television, the dubbing process is based on the source video. After translation, a specific text adaptation phase is performed before the dialog ADR begins. ADR stands for automated dialog replacement, a technique in which an actor reads lines in a recording studio to match the mouth movements of the face onscreen. It is used first and foremost to repair dialog shot for live action TV or film, and it can also be used when dubbing live action from one language to another. In that case the goal is to match the actor’s mouth movements so closely that the audience cannot tell that the dialog they are hearing is not in the language the actor performed in. Sync writing, or writing a translated script that follows the actor’s original mouth patterns, is key to locking the target words to the lip movements of the original actor. Without an accurate and final video reference, there is nothing to write synchronized dialog against.

The same technique used in live action film dubbing can now be used in games. The localization team can take the selected takes from the source English and sync write dialog for the target language using the video reference originally intended only for the 3D art team. The video may have been filmed in the recording studio solely so that proprietary software could derive animation data from it, but it can now be used for dubbing as well.

Thus, using this filmed footage, the localization process can begin much earlier in the production cycle, with a level of precision equivalent to live action film dubbing. Implementing markerless facial motion capture during the US recording phase and giving international audio teams access to the captured footage opens the door to multiple advantages: having better reference material earlier in the production cycle, writing accurately crafted scripts, producing movie-quality dialog sync and ultimately achieving superior quality in localized games.