Optical character recognition, or OCR, is an important step in preparing documents for localization. The recognized text must be processed thoroughly before the document is submitted to the translator. Document preparation matters because not all documents can be edited directly, and OCR tools may be required to convert such files into an editable format.
What Is OCR?
In the context of this article, OCR is a step in preparing documents for translation. It involves converting non-editable versions of documents (such as PDF, JPG, or TIFF files) into editable versions compatible with CAT (computer-assisted translation) tools for subsequent translation. Various automated image and text recognition tools, such as ABBYY FineReader, Adobe Acrobat, and Expert PDF, are used for this purpose. To ensure compatibility with CAT tools:
- All text to be translated must be editable
- Linguistic units should not be broken into parts (no incorrect segmentation)
- There should be no unnecessary formatting tags
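As an illustration of the segmentation requirement, the sketch below rejoins sentences that hard line breaks have split in two, a typical side effect of recognition. The function name `join_broken_lines` and the joining rule are assumptions made for this example, not a feature of FineReader or of any CAT tool.

```python
# Characters that normally end a sentence or a heading-like unit.
TERMINALS = (".", "!", "?", ":", ";")


def join_broken_lines(text: str) -> str:
    """Merge hard line breaks that split a sentence in two.

    Heuristic (an assumption for this sketch): a line is joined to the
    previous one when the previous line does not end in terminal
    punctuation and the current line starts with a lowercase letter.
    """
    merged: list[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            merged.append("")  # keep empty lines as paragraph breaks
            continue
        previous = merged[-1] if merged else ""
        if previous and not previous.endswith(TERMINALS) and stripped[0].islower():
            if previous.endswith("-"):
                # Rejoin a word hyphenated at the line end. Limitation:
                # genuine compounds such as "well-known" lose their hyphen.
                merged[-1] = previous[:-1] + stripped
            else:
                merged[-1] = previous + " " + stripped
        else:
            merged.append(stripped)
    return "\n".join(merged)
```

Applied to "The pump must be / switched off before main- / tenance.", the function returns a single line that a CAT tool can treat as one segment. A heading followed by a line that starts in lowercase would be merged incorrectly, so the output still needs a visual check.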
The document structure is also crucial. Pay attention to:
- Unnecessary gaps between sections
- Customized headers and footers
- Automatically generated tables of contents
- Automatically created lists
- Customized styles
Handling all of these factors properly ensures higher quality and faster processing of the document both before and after its translation.
OCR Stages and Their Automation
Detecting areas in FineReader
In this stage, recognition areas are detected and assigned a type according to the purpose of each element: table, main text, image, or background image with captions. This process can be performed either automatically or manually. Note that fully automatic detection may leave some text unrecognized or assign the wrong area type. You can speed up the work by using area templates for documents with identical layouts.
Checking text in FineReader
FineReader contains a built-in module for checking characters recognized with low confidence. This step is performed by an operator, who visually compares each fragment that FineReader has flagged as uncertain with the original.
Clearing all text formatting in Word
After transferring the recognized document to Word, it is essential to clear all unnecessary formatting. This helps to prevent excessive tags in CAT tools that can hinder the translator’s work and pollute translation memory. You can use macros to automate this process.
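The effect of clearing formatting can be sketched in a few lines. The example below models a paragraph as a list of formatted runs and is only an illustration: the `Run` class, its attributes, and the decision to keep bold and italic while resetting everything else are assumptions for this sketch, not the Word object model or an actual macro.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """A stretch of text with uniform formatting (hypothetical model)."""
    text: str
    bold: bool = False
    italic: bool = False
    font_size: float = 11.0   # OCR output often varies this slightly
    spacing: float = 0.0      # character spacing, another common OCR artifact


def clear_formatting(runs: list[Run]) -> list[Run]:
    """Reset incidental formatting and merge adjacent runs that then match.

    Only bold and italic are treated as meaningful here (an assumption);
    font size and spacing are reset to their defaults.
    """
    merged: list[Run] = []
    for run in runs:
        if not run.text:
            continue  # drop empty runs
        key = (run.bold, run.italic)
        if merged and (merged[-1].bold, merged[-1].italic) == key:
            merged[-1] = Run(merged[-1].text + run.text, *key)
        else:
            merged.append(Run(run.text, *key))
    return merged
```

Each boundary between runs is a candidate for an inline tag in the CAT tool, so a sentence that arrives as five runs and leaves as three gives the translator fewer tags to handle.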
Formatting in Word
Adjust page settings, headers, footers, styles, and lists, and make the document as a whole look as similar as possible to the original. This is a mostly manual task, with hardly any room for automation.
Checking text in Word
Check spelling and numbers. You can optimize this step by using macros to highlight individual digits, periods, commas, and other special characters, considerably speeding up the checking process.
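The idea behind such a highlighting macro can be sketched as follows. In Word the matches would receive a highlight color; in this plain-text illustration they are wrapped in markers instead. The function name `mark_for_review`, the marker strings, and the exact character set are assumptions chosen for the example.

```python
import re

# Digits plus the punctuation that most often hides recognition errors.
# The character set is illustrative; extend it for your own documents.
REVIEW_CHARS = re.compile(r"[\d.,;:%]+")


def mark_for_review(text: str, open_mark: str = "[[", close_mark: str = "]]") -> str:
    """Wrap every run of digits and selected punctuation in visible markers."""
    return REVIEW_CHARS.sub(
        lambda match: f"{open_mark}{match.group(0)}{close_mark}", text
    )


print(mark_for_review("Torque: 12,5 Nm at 1.500 rpm."))
# prints: Torque[[:]] [[12,5]] Nm at [[1.500]] rpm[[.]]
```

With the numbers and separators marked, the checker can compare only those spots against the original instead of rereading every line.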
Where Can AI Be Applied?
AI should target the most time-consuming steps or the steps with a high error probability. A rough ranking of the steps by duration, starting with the most labor-intensive, is as follows:
- Formatting in Word
- Detecting areas in FineReader
- Checking text in FineReader
- Checking text in Word
- Clearing all text formatting in Word
Let’s have a look at how to use AI in each tool.
FineReader already uses AI-based recognition technologies during the recognition process. However, existing software currently gives users no way to improve automatic area detection and thereby speed up this step. At the same time, FineReader continues to evolve, and we expect that future functionality will allow us to apply AI ourselves.
Formatting, the most time-consuming step, is almost entirely manual and requires a visual comparison with the original. One way to use AI with an already recognized document would be to incorporate it into the check for spelling, number, and single-character errors. Computer vision technologies are already available. However, it remains uncertain whether an AI tool capable of comparing the original image with the recognized text will emerge in the near future.
Using AI to verify recognized text without the original image is not very effective. Without access to the original, the AI has no information about what the recognized text or numbers should be. A true AI game-changer in this area would be a tool that can compare the original image with the recognized version.
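A small sketch makes this limitation concrete. The check below has no access to the original and can only flag tokens that look implausible on their own, namely tokens that mix digits with letters often confused with digits. The function name `suspicious_tokens` and the list of confusable letters are assumptions for illustration.

```python
import re

# Letters that recognition commonly confuses with digits, for example
# O/0, l/1, S/5, B/8, Z/2. The list is illustrative, not exhaustive.
CONFUSABLE = set("OoIlSBZ")


def suspicious_tokens(text: str) -> list[str]:
    """Return tokens that contain both a digit and a digit-like letter."""
    flagged: list[str] = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_digit = any(ch.isdigit() for ch in token)
        has_confusable = any(ch in CONFUSABLE for ch in token)
        if has_digit and has_confusable:
            flagged.append(token)
    return flagged


# Suppose the original reads "Pressure 100 bar at 1500 rpm".
print(suspicious_tokens("Pressure 1O0 bar at 1600 rpm"))
# prints: ['1O0']
```

The malformed "1O0" is caught, but "1600" passes because it is a perfectly plausible number; only a comparison with the original would reveal that it should be 1500. The check also produces false positives for legitimate codes such as part numbers that mix letters and digits.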
Evolving Opportunities to Apply AI to OCR
While AI can potentially improve some aspects of document preparation for translation, such as spelling and character recognition, current software functionality limits its ability to fully replace the verification stage done by humans. That said, the opportunity to apply AI to OCR exists and continues to evolve.
The underlying question is whether there will be sufficient long-term demand. If the volume of content requiring OCR decreases as digitalization progresses, the need for OCR will diminish as well. However, demand for OCR remains strong owing to the widespread use of the PDF format, and all indications are that this file format will continue to be used regularly by people all over the world.