AI Trends in OCR for the Localization Industry

November 30, 2023

Optical character recognition (OCR) is an important step in preparing documents for localization. Not every document can be edited directly, so the text often has to be processed thoroughly before the file goes to the translator, and OCR tools may be required to convert it into an editable format.

What Is OCR?

In the context of this article, OCR is a step in preparing documents for translation. It involves converting non-editable versions of documents (such as PDF, JPG, or TIFF files) into editable versions compatible with CAT tools for subsequent translation. Various automated image and text recognition tools such as ABBYY FineReader, Adobe Acrobat, and Expert PDF are used for this purpose. To ensure compatibility with CAT tools:

All text to be translated must be editable
Linguistic units should not be broken into parts (no incorrect segmentation)
There should be no unnecessary formatting tags
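
As an illustration of the segmentation requirement, the following Python sketch rejoins lines that were split in the middle of a sentence. OCR output often contains such breaks, and they would otherwise become separate segments in a CAT tool. The function name, the set of sentence-ending characters, and the treatment of blank lines as paragraph boundaries are all assumptions made for this example, not features of any particular tool.

```python
# Characters assumed to end a sentence; adjust per language (assumption).
SENTENCE_END = (".", "!", "?", ":", ";")

def join_broken_lines(lines):
    """Merge lines that were split in the middle of a sentence."""
    merged = []
    start_new = True  # True at the start and after every blank line
    for line in lines:
        text = line.strip()
        if not text:
            start_new = True  # a blank line is treated as a paragraph boundary
            continue
        if start_new or merged[-1].endswith(SENTENCE_END):
            merged.append(text)
        elif merged[-1].endswith("-"):
            # Rejoin a word hyphenated across the break. A genuine hyphen
            # (e.g. "well-" + "known") would be lost, so review these cases.
            merged[-1] = merged[-1][:-1] + text
        else:
            merged[-1] += " " + text
        start_new = False
    return merged

lines = ["The contract is valid", "for two years.", "", "Pay-", "ment is due."]
print(join_broken_lines(lines))
```

A heuristic like this cannot tell a heading from an unfinished sentence unless a blank line separates them, so the result still needs a visual check.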

The document structure is also crucial. Pay attention to:

Unnecessary gaps between sections
Customized headers and footers
Automatically generated tables of contents
Automatically created lists
Customized styles

Handling all of these factors correctly ensures higher quality and faster processing of the document both before and after its translation.

OCR Stages and Their Automation

Detecting areas in FineReader

In this stage, recognition areas are detected and assigned a type according to the element they contain: table, main text, image, or background image with captions. This can be done either automatically or manually. Note that fully automatic detection may leave some text unrecognized or assign the wrong area type. You can speed up the work by using area templates for documents with identical layouts.

Checking text in FineReader

FineReader contains a built-in module for checking characters that were recognized with low confidence. This step is performed by an operator, who visually compares each fragment that FineReader has flagged as possibly misrecognized against the original image.

Clearing all text formatting in Word

After transferring the recognized document to Word, it is essential to clear all unnecessary formatting. This helps to prevent excessive tags in CAT tools that can hinder the translator’s work and pollute translation memory. You can use macros to automate this process.
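
The idea behind such a cleanup macro can be sketched in Python. The example below does not use the Word object model. It models a paragraph as a list of (text, formatting) pairs, drops attributes that are assumed to be OCR noise, and merges neighboring runs whose remaining formatting is identical. Fewer runs generally mean fewer tags in the CAT tool. The `KEEP` list and the function name are hypothetical choices for this sketch.

```python
# Attributes worth preserving; everything else is treated as OCR noise.
# This list is an assumption for the sketch, not a rule from the article.
KEEP = ("bold", "italic", "underline")

def clean_runs(runs):
    """Drop noisy attributes, then merge neighbors with matching formatting."""
    cleaned = []
    for text, fmt in runs:
        # Keep only the essential attributes that are actually switched on.
        essential = {k: v for k, v in fmt.items() if k in KEEP and v}
        if cleaned and cleaned[-1][1] == essential:
            # Same formatting as the previous run: extend it instead.
            cleaned[-1] = (cleaned[-1][0] + text, essential)
        else:
            cleaned.append((text, essential))
    return cleaned

runs = [
    ("Total ", {"spacing": 0.2}),
    ("amount", {"spacing": -0.1, "font": "Arial"}),
    (" due", {"bold": True, "spacing": 0.1}),
]
print(clean_runs(runs))
```

Here three runs collapse into two, because the character-spacing and font differences are discarded and only the bold run stays separate.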

Formatting in Word

Adjust page settings, headers, footers, styles, and lists, and make the document as a whole look as similar as possible to the original. This is a mostly manual task, with hardly any room for automation.

Checking the text in Word

Check spelling and numbers. You can optimize this step by using macros to highlight individual digits, periods, commas, and other special characters, considerably speeding up the checking process.
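
A highlighting macro of this kind can be approximated in a few lines of Python. The sketch below wraps every run of digits, periods, and commas in visible markers so that a reviewer's eye lands on them first. The marker characters and the function name are assumptions for this example, and a real Word macro would apply a highlight color instead of inserting brackets.

```python
import re

# Runs of digits plus the separators that OCR often confuses in numbers.
CHECK_PATTERN = re.compile(r"[0-9.,]+")

def mark_for_review(text, open_mark="[", close_mark="]"):
    """Wrap each run of digits, periods, and commas in visible markers."""
    return CHECK_PATTERN.sub(
        lambda match: f"{open_mark}{match.group()}{close_mark}", text
    )

print(mark_for_review("Price: 1,250.00 EUR."))
```

Extending the character class in `CHECK_PATTERN` is enough to cover other special characters that matter for a given document type.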

Where Can AI Be Applied?

AI should target the most time-consuming steps or those with a high probability of error. Ranked roughly by duration, starting with the most labor-intensive, the steps are:

Formatting in Word
Detecting areas in FineReader
Checking text in FineReader
Checking text in Word
Clearing all text formatting in Word

Let’s have a look at how to use AI in each tool.

FineReader

FineReader already uses AI-based technologies during the recognition process. However, existing software currently gives users no way to improve automatic area detection and thereby speed up this step. At the same time, FineReader continues to evolve, and we expect that future functionality will allow us to apply AI ourselves.

Word

Formatting, the most time-consuming step, is completely manual and requires a visual comparison with the original. One way to use AI with an already recognized document would be to incorporate it into the check for spelling, number, and single-character errors. Computer vision technologies are already available. However, it remains uncertain whether an AI tool capable of comparing the original image with the recognized text will emerge in the near future.

Using AI to verify recognized text without an original image is not very effective. Since there is no access to the original text, the AI has no information about what the recognized text or numbers should be. A true AI game-changer in this area will be able to compare the original text to the recognized version.
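
To show why access to the original matters, the sketch below assumes that a trusted reference transcription is available, which is exactly what the workflow described above lacks. With such a reference, even a plain character-level diff from Python's standard `difflib` module pinpoints misrecognized characters. The function name and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher

def find_mismatches(reference, recognized):
    """Return (reference_fragment, recognized_fragment) pairs that differ."""
    # autojunk=False disables the popularity heuristic, which could skip
    # frequent characters in longer texts.
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    return [
        (reference[i1:i2], recognized[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# The recognized text contains the letter "O" where the reference has zeros.
for expected, found in find_mismatches("Invoice 1080", "Invoice 1O8O"):
    print(f"expected {expected!r}, found {found!r}")
```

Without the reference string there is nothing to diff against, and a checker can only guess from context whether "1O8O" was meant to be a number.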

Evolving Opportunities to Apply AI to OCR

While AI can potentially improve some aspects of document preparation for translation, such as spelling and character recognition, current software functionality limits its ability to fully replace the verification stage done by humans. That said, the opportunity to apply AI to OCR exists and continues to evolve.

The underlying question is whether there will be sufficient long-term demand. Assuming that the volume of content requiring OCR decreases as digitalization progresses, the need for OCR will diminish as well. However, demand for OCR is still strong owing to widespread use of the PDF format. And currently, all indications are that this file format will continue to be regularly used by people all over the world.
