MIT CSAIL and Reviving Lost Languages

AI, Technology

Can the evolution of language inform machine translation models for extinct languages? Researchers at CSAIL think so. Jean-François Champollion did too.

If not for ancient Greek and Coptic – a descendant of ancient Egyptian – the decades-long effort to crack the Rosetta Stone could have stretched into centuries. For dead languages with few or no surviving descendants, the task would appear impossible. Machine translation could help.

A project at MIT has been evolving throughout the past decade, as researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have sought to develop a system that can automatically decipher lost languages, even with scarce resources and an absence of related languages.

The team made headway in 2010 when Regina Barzilay, a professor at MIT, alongside Benjamin Snyder and Kevin Knight, developed an effective method for automatically translating the dead language Ugaritic into Hebrew. However, the more recent study considered this breakthrough relatively limited, since both languages derive from the same Proto-Semitic origin. Furthermore, the researchers found the approach too customized to work at scale.

To build on their initial findings, Barzilay and Jiaming Luo, a PhD student at MIT, have proposed a model that accounts for several linguistic constraints, particularly “patterns in language change documented in historical linguistics.”

One grounding principle here is that most human languages evolve in predictable ways. This accounts for linguistic patterns in which descendant languages rarely make drastic changes to sounds. A plosive “t” sound in a parent language could feasibly change to a “d” sound, but would very seldom evolve into a fricative like “h” or “s.”
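The sound-change constraint can be illustrated with a toy sketch (illustrative only, not the CSAIL model): if each sound is described by phonetic features, a change that flips a single feature such as voicing is cheap, while a jump across manner and place of articulation is expensive. The feature table and weights below are invented for the example.

```python
# Toy sketch of sound-change plausibility (not the CSAIL model):
# score a change by how many phonetic features it flips, weighting
# manner of articulation most heavily because manner rarely changes.

# Hypothetical feature table: (manner, place, voiced)
FEATURES = {
    "t": ("plosive", "alveolar", False),
    "d": ("plosive", "alveolar", True),
    "s": ("fricative", "alveolar", False),
    "h": ("fricative", "glottal", False),
}

WEIGHTS = (2, 1, 1)  # manner changes are least likely, so they cost most

def change_cost(a: str, b: str) -> int:
    """Weighted count of phonetic features that differ between two sounds."""
    return sum(w for w, fa, fb in zip(WEIGHTS, FEATURES[a], FEATURES[b]) if fa != fb)

# "t" -> "d" flips only voicing (cost 1); "t" -> "h" flips manner and
# place as well (cost 3), so it is a far less plausible sound change.
print(change_cost("t", "d"))  # 1
print(change_cost("t", "h"))  # 3
```

Under this toy metric, a voicing shift like “t” to “d” stays cheap while a jump to “h” is penalized, mirroring the kind of constraint the researchers describe.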

Along with these constraints, the model also draws on history. As the algorithm deciphers patterns in sounds and syntax, it pulls from encyclopedic data to fill in some of the blanks.

“For instance, we may identify all the references to people or locations in the document which can then be further investigated in light of the known historical evidence,” Barzilay told MIT News. “These methods of ‘entity recognition’ are commonly used in various text processing applications today and are highly accurate, but the key research question is whether the task is feasible without any training data in the ancient language.”
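The open research question, as Barzilay notes, is doing this without training data in the ancient language. For comparison, the general idea behind simple entity recognition can be sketched as a gazetteer lookup (a minimal illustration, not the CSAIL system; all names and labels below are invented):

```python
# Toy gazetteer-based entity recognition: match tokens against a list
# of known historical names so that references to people and places
# can be cross-checked against historical evidence.
# The gazetteer entries here are purely illustrative.

GAZETTEER = {
    "Ptolemy": "PERSON",
    "Memphis": "LOCATION",
    "Egypt": "LOCATION",
}

def find_entities(text: str):
    """Return (name, label) pairs for gazetteer hits, in order of appearance."""
    hits = []
    for token in text.replace(",", " ").replace(".", " ").split():
        if token in GAZETTEER:
            hits.append((token, GAZETTEER[token]))
    return hits

print(find_entities("A decree of Ptolemy was issued at Memphis."))
# [('Ptolemy', 'PERSON'), ('Memphis', 'LOCATION')]
```

Modern NER systems learn such labels statistically rather than from a fixed list, which is precisely what becomes hard when no training data exists for the language in question.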

While imperfect, these methods have made progress. The team found the algorithm could identify language families, and in one instance it corroborated earlier findings that Basque – a language spoken in northern Spain and southwestern France – is too distinct to assume a relation to any other language.

The team hopes eventually to develop a method of automatically identifying the semantic meaning of words with or without a linguistic relation. Like the linguists who cracked the Rosetta Stone, CSAIL researchers could be on the verge of a paradigm shift.


MultiLingual creates go-to news and resources for language industry professionals.


Straker Wins Major Contract with IBM

AI, Business News, Localization

Shares of New Zealand-based Straker Translations (ASX.STG) jumped almost 45% today — November 11 in New Zealand — on the announcement of a strategic two-year agreement with IBM starting in January 2021.

Straker’s AI-based RAY platform runs on IBM Cloud and integrates seamlessly with IBM’s technology platforms. It outperformed the other technologies considered in the selection process. Of particular note is its ability to take on IBM’s global media localization, providing multimedia content in 30 languages.

The localization company already provided localization services into Spanish, and will now expand its portfolio to 55 languages in support of IBM Cloud services, IBM adaptive translations, and IBM global media localization. Volumes have not been disclosed, but Straker expects significant revenue growth and a 30% increase in headcount to handle the new languages.

“This agreement is a recognition of the outstanding capabilities of our technology to handle a large volume of translation that is currently managed internally at IBM. Our talented team will be able to achieve major productivity gains with AI-powered RAY platform,” Straker CEO and co-founder Grant Straker told MultiLingual.

After IBM announced last month that it was restructuring by spinning out its infrastructure services business, IBM CEO Arvind Krishna made it clear that his focus will be on transforming the organization into a hybrid cloud management vendor. This is certainly a good sign for Straker and its shareholders.



Katie Botkin, Editor-in-Chief at MultiLingual, has a background in linguistics and journalism. She began publishing "multilingual" newsletters at the age of 15, and went on to invest her college and post-graduate career in language learning, teaching and writing. She has extensive experience with niche American microcultures across the political spectrum.


Resemble.ai Launches AI Tech That Mimics User Voice

AI

The new voice localization AI from Resemble.ai will translate the user’s voice across languages, initially supporting English, French, German, Dutch, Italian, and Spanish.

Resemble.ai, which works on generative deep learning voice technology, recently announced that it has created Localize, a voice AI technology that localizes speech. Generally speaking, entertainment companies, ad agencies, call centers, and companies that need to translate voices use a different dubbed voice in each language. According to Resemble.ai, however, user voices will carry into any language with Localize, meaning the speaker’s voice will remain consistent even when translated.

Resemble.ai claims to clone voices at scale in seconds rather than weeks. Turning a previously laborious, expensive process into an automated one, it has cloned 42,000 voices for 65,000 users, including two of the largest global telecoms, two of the largest consulting companies, a top global broadcasting company, two of the largest entertainment conglomerates, one of the largest toy makers, and the leader in airport communications systems.

Localize will be compatible with video games, movies, call centers, company videos, and more as they are translated to and from languages including English, French, German, Dutch, Italian, and Spanish, with plans to introduce Localize for Korean, Japanese, and Mandarin.

Normally, voice translation takes an average of two months and can cost companies hundreds of thousands of dollars. For entertainment companies, dubbing a script is logistically challenging, and the fidelity of the production is often lost in translation. This new voice technology aims to accomplish the equivalent volume in a week with maximum creative flexibility and efficiency.

“It’s hard to overstate how important audio has become in recent years — or just how much bigger it’s going to get in an AirPods-first world,” said Peter Rojas, partner at Betaworks Ventures. “Synthetic voice is going to be key to all this by transforming how audio is created. Demand for localized and translated spoken word content, whether it’s in the form of podcasts or audiobooks, is exploding, and AI-based tools like Localize are the way to satisfy that demand.”


Kynamics Secures DHS Translation Device Funding

Technology

Back in February, DHS released a solicitation for a robust multilingual translation device. It has now awarded the opportunity to ASR and NLP company Kynamics.

Months after an industry-wide solicitation by the Department of Homeland Security (DHS), Kynamics has secured the Science and Technology Directorate’s (S&T) Silicon Valley Innovation Program (SVIP) language translation capabilities funding award. Based in Mountain View, Kynamics specializes in automatic speech recognition (ASR) and natural language processing (NLP) for mobile devices.

Regarding earlier devices, DHS stated, “the challenge of the USCG system is that it lacks the dynamic/agile robustness to effectively communicate across the spectrum of languages that are emerging across areas that were once more static.”

The award grants Kynamics $192,520 in Phase 1 funding to produce a portable, standalone translation system. The Language Translator solicitation aims to create a device capable of facilitating real-time communication with non-English speakers and those who are unable to communicate verbally. Ideally, the device will support at least 16 languages, including Arabic, Mandarin Chinese, Persian-Iranian, French, German, Haitian Creole, Indonesian, Japanese, Korean, Portuguese, Russian, Spanish, Tagalog, Thai, Ukrainian, and Vietnamese. The effort is in support of United States Coast Guard (USCG) missions.

“DHS S&T and SVIP have given the Coast Guard an opportunity to connect with innovative small businesses, such as Kynamics, to develop language translation technology that may enhance operational mission execution,” said Wendy Chaves, Chief of the Coast Guard Research, Development, Test and Evaluation, and Innovation Program.

While performing rescue and investigation missions, Coast Guard operators must be able to communicate accurately in real time with vessel occupants, many of whom are non-English speakers. Furthermore, since USCG personnel are often stationed at sea in extreme weather with no Internet connection, the device must also function offline and withstand temperatures ranging from 140ºF down to -50ºF (60ºC to -46ºC).

“We’re excited to see how the Kynamics project unfolds as we move through this first proof-of-concept phase with USCG,” Melissa Oh, SVIP managing director, said. “We’re also delighted to include a minority female-founded company in our portfolio.”


TED, SYSTRAN Partner, Create Multilingual NMT Models

Translation Technology

Beginning with ten languages, SYSTRAN will use TED content to develop neural machine translation models for technical content in a variety of fields.

AI-based translation technology company SYSTRAN recently announced its new partnership with TED to build specialized neural translation models based on high-quality translations of TED Talks. These unique models are designed to meet the sophisticated translation needs of multinational companies, educational institutions, government agencies, and other organizations by enabling accurate and fluent translations of learning, scientific, business, and technical content in ten languages.

A nonprofit organization whose slogan is “Ideas Worth Spreading,” TED has made global language access one of its core commitments. Organizations in 150 countries participate in the TEDx initiative, which allows groups to apply for licenses to organize conferences made up of local participants, ranging from professors to scientists to writers.

Along with TEDx, the organization runs a major translation initiative for its online resources, with a team of over 35,000 human translators who have produced almost 175,000 translations and captions in 115 languages. This major cache of language data will likely enable SYSTRAN to expand its neural translation models to even more languages.

“SYSTRAN is TED’s first-ever authorized partner in bringing together TED content and machine learning to develop a commercial product,” said Alex Hofmann, Director, Global Distribution & Licensing at TED. “The fact that our inaugural collaboration in the AI space is focused on neural machine translation models built from translations of TED Talks in multiple languages feels natural. The models are now available on a licensed basis to help enterprises and organizations meet their most sophisticated translation needs.”

The proprietary models are developed by SYSTRAN, pairing TED’s unique multilingual data and SYSTRAN’s AI expertise, and are an early step in advancing data usage in wider applications. TED requires a license for authorized use of its data for commercial AI and machine learning purposes, and SYSTRAN is the first to obtain such a license. In accordance with SYSTRAN’s core principles of security and data privacy, TED fully preserves its intellectual property and ownership of its data as well as the specialized models. The TED-owned models are available on the SYSTRAN Marketplace, a catalog of specialized models for specific domains such as legal, finance, health, education, science/technology and many more.

“This strategic partnership is about taking our shared goals of connecting people and cultures and facilitating multilingual engagement globally,” said John Paul Barraza, CIO of SYSTRAN. “The human-created translations generated by the TED Translator community are of the highest quality, enabling SYSTRAN to build accurate and fluent translation models for use across a plethora of business and professional applications.”

SYSTRAN conducted double-blind human evaluations on the TED models it built, and the results show improvements in accuracy and fluency over baseline state-of-the-art generic models. The human evaluations also revealed unexpected results, with 41% of the models scoring higher than the human reference translations.

“The current global situation is showing us how inter-connected the different countries and populations worldwide are. Companies are imagining a world with far less boundaries — starting with the way we communicate,” said Jean Senellart, SYSTRAN CEO. “Introducing models to the SYSTRAN Marketplace is an incredible opportunity and will respond to real needs in the translation of educational, business, scientific, and technical materials.”


ASR Tech in Hebrew and Arabic Languages Big Focus for IIA

Technology

The Israel Innovation Authority aims to improve ASR capabilities for Hebrew and Arabic languages, which have largely been left behind in voice recognition technologies.

As automatic speech recognition (ASR) technology improves through cutting-edge innovations in machine learning (ML) and artificial intelligence (AI), many developments remain inaccessible for a large portion of languages other than English. In particular, languages with unique morphology or complex syntax that differ from the languages supported by current ASR systems are often excluded from new innovation, and thus see delayed development.

Responding to particular challenges for Hebrew and Arabic languages, the Israel Innovation Authority (IIA) and the Israel National Digital Ministry have announced the establishment of the Association of Natural Language Processing (NLP) Technology Companies, an initiative that aims to improve the ASR capability of computerized systems for understanding the Hebrew and Arabic languages.

“The public sector deals with unstructured data in Hebrew and Arabic on a daily basis. One of the major challenges in the digitization of public services is to enable operational efficiency and high productivity while ensuring that such services are free to the public,” said Asher Bitton, the ministry’s director-general.

To cultivate a detailed and complete understanding of the languages in the computerized systems, the association will invest about $2 million in R&D to analyze the syntactic, semantic, and morphological characteristics of the Hebrew and Arabic languages. It will achieve this by drawing on collections of Hebrew and Arabic texts from diverse fields, including news, archives, films, books, articles, customer service, transcribed radio and television broadcasts, and professional literature.

“The Association that we established this week will allow Israeli industry to clearly define its needs and help close technological gaps by enabling the use of unstructured databases in Hebrew and Arabic and providing insights which can be harnessed when developing and promoting products and services provided by Israeli companies,” Israel Innovation Authority vice president Aviv Zeevi said in a statement.

The founding members of the association include Intel, Israel-based YNet News, Bank Hapoalim, and AudioCodes Ltd, as well as content creators like Ha’aretz Newspaper, The Israeli Public Broadcasting Corporation (Kan) Television, and The Knesset (Israeli Parliament) Archives.


GreenKey Creates NLP Tool for Hedge Funds

AI

Focus Studio, the new application from GreenKey, will provide users with natural language processing workflows specific to hedge fund management.

Bank sales teams often turn to natural language processing (NLP) to find client insights — such as OTC quotes and trades — within emails, direct messages, and phone calls, and to manage increasing amounts of conversational data. This type of computational power is theoretically possible cross-linguistically as well, which has interesting implications for the language services industry; the types of text being processed may call for specialized handling to generate trade ideas. Addressing a specific need for hedge funds, GreenKey, creator of NLP workflows for sales and trading, has released its latest version of the “Focus Studio” application.

Users of Focus Studio can customize NLP to go through various files and deliver highlighted insights as daily reports or power real-time automation, such as chatbots. This latest version of Focus Studio now includes NLP models designed specifically for hedge funds to help them cope with the amount of unstructured text they process.

Based in Chicago with offices in New York and London, GreenKey is the creator of a patented automatic speech recognition (ASR) and NLP platform that recognizes complex jargon across real-time audio and text sources and transforms it into actionable insights. GreenKey converts disparate communications streams into structured data tools that help banks, trading firms, and emergency services operators automate complex workflows.

GreenKey trains the new NLP models on insights from real sell-side human analysts, and the models can be rapidly customized through a quick annotation process. Traders can select from base models called “trusted curators,” and can even ask their favorite sell-side research analyst to create and contribute one. The custom model collection can be fed thousands of documents and will identify trending topics, intents, and entities, and can even provide novel raw sentiment scores such as “word disfluency.” The pre-trained models also include in-depth product knowledge across global fixed income, credit, equities, FX, and commodity markets.
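GreenKey’s actual scoring is proprietary, but the flavor of a raw signal like “word disfluency” can be sketched as a simple filler-word ratio. This is a hypothetical proxy, not GreenKey’s method, and the filler list is invented for the example:

```python
# Hedged sketch: a toy "disfluency" score measured as the share of
# filler words in a transcript, one simple proxy for speaker hesitancy.
# The filler-word set is illustrative, not GreenKey's.

FILLERS = {"um", "uh", "er", "like"}

def disfluency_score(transcript: str) -> float:
    """Fraction of words in the transcript that are filler words."""
    words = transcript.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,") in FILLERS for w in words) / len(words)

print(round(disfluency_score("Um, I think, uh, rates will er stay flat"), 2))
# 0.33
```

A production system would combine many such raw signals — per-entity sentiment, intent, topic trends — rather than rely on a single ratio.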

“NLP is already changing the way sales and trading occurs on the sell-side, enabling a wave of automation and insight generation across various workflows,” said GreenKey Founder and CEO Anthony Tassone. “Now the buy-side can begin to leverage NLP to automate and scale their analysis, while retaining the ‘trusted curator’ role of the sell-side research provider and analyst.”
