Can the evolution of language inform machine translation models for extinct languages? Researchers at CSAIL think so. Jean-Francois Champollion did too.
If not for ancient Greek and Coptic – a descendant of ancient Egyptian – the decades-long effort to crack the Rosetta Stone could have turned to centuries. For dead languages with few or no existing descendants, the task would appear impossible. Machine translation could help.
A project at MIT has been evolving throughout the past decade, as researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have sought to develop a system that can automatically decipher lost languages, even with scarce resources and an absence of related languages.
The team made headway in 2010 when Regina Barzilay, a professor at MIT, alongside Benjamin Snyder and Kevin Knight, developed an effective automatic translation method from the dead language Ugaritic into Hebrew. However, the more recent study considered this breakthrough relatively limited, since both languages are derived from the same proto-Semitic origin. Furthermore, they found the approach too customized and unable to work at scale.
To build off their initial findings, Barzilay and Jiaming Luo, a PHD student at MIT, have proposed a model that accounts for several linguistic constraints, particularly “patterns in language change documented in historical linguistics.”
One grounding principle here is that most human languages evolve in predictable ways. This would account for linguistic patterns where descendant languages rarely make drastic changes to sounds. A plosive “t” sound in a parent language, could feasibly change to a “d” sound, but would very seldom evolve into fricatives like “h” and “s” sounds.
Along with these constraints, another notable detail here is history. As the algorithm deciphers patterns in sounds and syntax, it will also pull from encyclopedic data to fill in some of the blanks.
“For instance, we may identify all the references to people or locations in the document which can then be further investigated in light of the known historical evidence,” Barzilay told MIT News. “These methods of ‘entity recognition’ are commonly used in various text processing applications today and are highly accurate, but the key research question is whether the task is feasible without any training data in the ancient language.”
While imperfect, these methods have so far made progress. The team found the algorithm could identify language families, and one instance corroborated earlier findings that Basque – a language spoken in a region of northern Spain and southwestern France – appeared too distinct to assume any linguistic relation.
The team hopes eventually to develop a method of automatically identifying the semantic meaning of words with or without a linguistic relation. Like the linguists who cracked the Rosetta Stone, CSAIL researchers could be on the verge of a paradigm shift.