TAUS is now offering its comprehensive data collection of close to 7.4 billion words for sale at discounts of more than 97% off the original value. The sale ends on April 30, 2024. The 7.4 billion words on offer are all non-public, unique, human-translation-quality data covering 483 language pairs.
TAUS has been collecting translation data since 2008 and has been selling it to Big Tech companies to train their MT engines for the last 15 years. Attention is now shifting from MT to LLMs. LLMs are expected to be good at translation as well, but they could perform much better if trained on more high-quality multilingual data.
In the early days of Statistical and then Neural MT, TAUS data served a relatively small audience of a few dozen MT developers. The landscape has changed drastically since 2023. With GenAI and LLMs, there are thousands of new players interested in customizing and improving generic models. The TAUS multilingual data is particularly relevant and valuable to them because most LLMs have been trained almost solely (more than 90%) on English-language data. However, the rates TAUS has historically charged, 1,500 to 2,500 euros per million words, are now too high for the new generation of smaller-scale users, who focus less on generic models and more on customized ones. That's why the TAUS data is now available at steep discounts of up to 97%.
“There are shifts in the needs for data,” says Amir Kamran, solution architect at TAUS. “The LLM developers are now looking for data with a lot more context to improve the overall performance and accuracy of the language generation features. For the translation performance, they tend to rely on transfer learning, which results in underperformance of the multilingual and translation features of LLMs. The TAUS data helps to improve the translation quality scores by double-digit percentage points.”
Please contact TAUS or complete the online form to request the data catalog, samples, and pricing. You can purchase the entire collection or choose specific language pairs.