Focus

AI for the Language Industry

Why data labeling may be the new industry gig


Carol Jin

Carol Jin is a localization veteran and machine learning engineer who aspires to bring machine intelligence to natural language. She is currently a software engineer at LinkedIn.


Artificial intelligence (AI) has been a hot topic in recent years. It's no stranger to the language industry either — most of us are very familiar with the term machine translation (MT). We take MT as the primary link between localization and AI. However, is MT equal to AI? Of course not! MT is a very niche field within AI. Figure 1 shows how MT connects to AI, as well as a few other terms you might have heard of.

AI is not only a productivity booster for localization; it also brings the industry a new business opportunity: data labeling. As a former localization program manager and a current machine learning engineer, I can say that data labeling is a promising business for the language industry in the era of AI.

Figure 1: How machine translation, natural language processing, and machine learning connect with AI.

What is data labeling?

Let’s start with a simple machine learning example: an image detector that can tell whether an image contains a dog or a cat. How does the machine learn to do that?

First, the machine needs to see a large variety of dog and cat examples. Humans need to supply it with many images, each paired with the correct label. See Figure 2 for examples.

Next, all the images and their labels are used to train the machine. The computer uses features on these images to learn a pattern — the machine learning model. Then if you upload a new image of your pet, the machine uses its learned pattern to decide whether this is a dog or a cat. Typically, the more labeled data the machine has seen, the more accurately it can predict new labels.
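To make the idea concrete, here is a minimal sketch of learning from labeled examples. The hand-picked features (body weight and ear length) and the nearest-neighbor rule are illustrative assumptions for this sketch only; real image detectors learn their features from raw pixels.

```python
# Toy "dog vs. cat" classifier trained on labeled examples.
# Features (weight in kg, ear length in cm) are invented for
# illustration; real image models learn features from pixels.

labeled_data = [
    ((30.0, 10.0), "dog"),
    ((25.0, 12.0), "dog"),
    ((4.0, 4.0), "cat"),
    ((5.0, 5.0), "cat"),
]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(features):
    """1-nearest-neighbor: copy the label of the closest labeled example."""
    _, label = min(labeled_data, key=lambda pair: distance(pair[0], features))
    return label

print(predict((28.0, 11.0)))  # close to the dog examples -> dog
print(predict((4.5, 4.5)))    # close to the cat examples -> cat
```

The point of the sketch is the dependency, not the algorithm: without the human-supplied labels in `labeled_data`, the machine has nothing to learn a pattern from.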

In fact, almost all AI algorithms today have to be built on top of labeled data to be effective. How are the labels generated? Sometimes companies already have the labels on hand, but other times the labels must be produced through a separate task: data labeling.

One example is self-driving technology. Cars need to detect traffic lights, pedestrians, road lines, and other obstacles on the road in any weather condition. Another example is classifying news topics: for a website like Google News, site engineers may want to automatically add topic labels to the stories the site gathers. Both scenarios require a special effort to label data — creating separate tasks in which humans label the data for the machine to learn from. Large quantities of high-quality data are the prerequisite of AI. Thus, for a very long time to come, AI will require humans in the loop to be functional.
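For the news-topic scenario, a single labeling task handed to an annotator might look something like the record below. This is a hedged sketch: every field name here is invented for illustration and does not come from any real platform's schema.

```python
import json

# Illustrative shape of one news-topic labeling task and its completed
# annotation. All field names are invented for this sketch; they are
# not any real platform's schema.
task = {
    "task_id": "news-00042",
    "text": "Central bank raises interest rates for the third time this year.",
    "label_options": ["politics", "business", "sports", "technology"],
}

annotation = {
    "task_id": task["task_id"],
    "label": "business",       # chosen by a human annotator
    "annotator_id": "ann-07",
}

print(json.dumps(annotation, indent=2))
```

Thousands of such completed records, one per story, become the training set for the automatic topic labeler.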

Figure 2: Training data for image recognition.

Comparison between localization and data labeling services

Why did I say data labeling is a new opportunity for a localization or language service company? There are similarities and differences between the two services.

Similarities:

1) Both are labor-intensive. Work of this kind gave rise to subcontracting — think about how language service providers act as middlemen between individual translators and language service buyers. That said, compared to localization, the data labeling industry is still in its early stages, with far more room to grow.

2) Similar quality control flows. Both types of projects require training and guidelines for translators or annotators before the project starts. Moreover, the classic translation-editing-proofreading model applies perfectly to data labeling projects, which also require review steps to keep quality high.

3) Similar business models. A buyer company's common strategies include hiring in-house translators/annotators, working directly with individual translators/annotators, or using a service provider to manage its supply chain. Service providers, in turn, can use in-house translators/annotators to complete the work, or subcontract it to freelancers. Another common practice is for buyer companies to hire on-site contractors through service providers, who charge recruiting and management fees, an arrangement known as "managed services." Which model to choose depends entirely on the company's business needs. The same principles apply to both localization and data labeling.

4) Similar economic motives. Translators are usually paid by the number of words translated, and annotators by the number of data labels completed. You may argue there are cases where people are paid by the hour, but hourly rates also depend largely on productivity output. In both cases, people are rewarded directly for productivity.
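The quality-control parallel in point 2 can be sketched in code. A basic review signal in labeling projects is the agreement rate between two annotators who labeled the same items; low-agreement batches get routed to a reviewer, much as an editor reviews a translator's output. The labels and the item set below are illustrative assumptions:

```python
# Percent agreement between two annotators who labeled the same items.
# A low score flags the batch for a reviewer, playing a role similar
# to the editing step in translation-editing-proofreading.

def agreement_rate(labels_a, labels_b):
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

annotator_1 = ["dog", "cat", "dog", "dog", "cat"]
annotator_2 = ["dog", "cat", "cat", "dog", "cat"]

rate = agreement_rate(annotator_1, annotator_2)
print(f"agreement: {rate:.0%}")  # 4 of 5 items match
```

Real projects often use chance-corrected metrics (such as Cohen's kappa) rather than raw agreement, but the workflow idea is the same.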

Differences:

1) Different budget sources and ROI requirements. In a buyer company, localization budgets often come from the marketing department and carry clear ROI targets. Data labeling projects, by contrast, are done for research and development. Data labels do not directly produce revenue, so there is no ROI figure tied directly to a labeling project.

2) Different tolerance for mistakes. For companies like Apple or Google, localized content is customer-facing; any mistake can have serious consequences, so tolerance for errors is very low. Machine learning models, on the other hand, are not sensitive to small amounts of noise in the data when there is sufficient data, as long as the noise level stays within statistically permissible error.

3) Different qualifications for translators/annotators. A qualified translator has typically received higher education and specialized language training, and qualifies for a localization project through a translation test — like a GRE writing test, there is no single standard answer. For AI project annotators, however, there is no universal hiring standard. Some projects only need annotators to tell whether the people in images are smiling, while others need them to detect diabetic retinopathy in retinal images. Annotators can be screened with questions that do have standard answers, such as multiple-choice questions.
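Returning to the tolerance point above: one common practice (an assumption in this sketch, not something the article prescribes) is to collect several labels per item and take a majority vote, which cancels out much of the individual annotators' noise. If each annotator independently mislabels an item 10% of the time, a three-way vote is wrong only when at least two annotators err at once:

```python
# Error rate of a 3-annotator majority vote, assuming each annotator
# independently mislabels an item with probability p. The vote is wrong
# only when at least 2 of the 3 annotators are wrong on the same item.
p = 0.10

vote_error = 3 * p**2 * (1 - p) + p**3
print(f"single annotator error: {p:.1%}")
print(f"3-way majority vote error: {vote_error:.1%}")
```

Under these assumptions the effective error rate falls from 10% to 2.8%, which is one reason models can live with imperfect individual labels.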

Beyond the points above, there are other obvious differences, such as tool adoption. We won't dive into the details of tools here, as there are already many tool providers on the market.

In summary, artificial intelligence cannot learn by itself. It depends heavily on human labeling and annotation, which remain the cornerstones of a smart world. For language service companies that have been seeking new breakthroughs, their experience with business models and resource management is invaluable, and it transfers directly to data labeling.

Advice for launching data labeling services

The two services are similar yet different. There is great potential for language service companies to succeed in the data labeling business, but there are also challenges. Here I offer some suggestions for companies interested in developing a data labeling service.

We can see that different AI projects require a variety of skills from annotators. The next question you may ask is whether there is a category of data labeling projects especially suited to language service companies. The answer is yes: natural language processing (NLP) projects. NLP is still considered a difficult problem for machines, and a global IT company's NLP projects likely require annotations in multiple languages, needs that language service companies are well positioned to meet. NLP projects can therefore offer language service companies a way into data labeling services.

With target project types identified, the next question is where to find clients. My suggestion is to work with your localization counterparts at buyer companies. Although the localization team is no longer your client, they can still be your partner: when the research and development department needs multilingual NLP project support, the first place it turns is likely the internal localization team, which can act as a bridge between language service providers and the NLP department.

Another way to find potential clients is to connect with NLP development teams directly. Find them on a professional networking site and ask for an internal referral. If they happen to have projects requiring multilingual support, you are in luck! A good way to improve your success rate is to watch for companies that continuously post multilingual NLP jobs on data labeling crowdsourcing platforms; your target clients are among them.

You may ask: if high-tech companies are using crowdsourcing platforms and working directly with the crowd, is there still value in professional service providers? I believe the answer is yes. The adoption of crowdsourcing is tied to the history of AI development. In the early days, AI could only solve easier problems, and these problems were easy not only for machines but also for humans, so crowdsourcing satisfied project needs well. As AI becomes more and more sophisticated, however, data labeling projects grow increasingly difficult, which means fewer tasks can be completed by the crowd and more require people with special training. The data labeling industry will therefore eventually be dominated by professional data service providers, the same way professional language services thrive today.
I also want to point out that localization companies should adjust their supply chain management strategies to succeed in data labeling. In a mature localization program, once the buyer and seller establish a partnership, it can become a long-term, sustainable program unless the buyer makes strategic changes. Data labeling projects, however, have much shorter lifecycles: once a project is completed, the same buyer is much less likely to initiate a similar project within a short time. At the same time, data labeling projects have lower testing and qualification costs, which offsets the disadvantage of the short lifecycle. These features demand supply chains that remain highly liquid and flexible. The old way of relying on resource management staff to run the supply chain will soon become a burden; to survive in data labeling, a localization company needs to be empowered by enterprise tools more sophisticated than Excel spreadsheets.

Lastly, I want to reiterate that our society needs AI, and AI needs data. For localization companies that have hit a plateau, data labeling is worth considering, and it is bound to be a thriving field for years to come. Perhaps data labeling will become a regular line of business in the language industry in the near future. Whether you want it or not, the AI technology wave is here. Get prepared and ride it; don't let it crush you.