By the time the average child learns to speak a first language, they have been exposed to far fewer words than it takes to train a machine translation (MT) system or a large language model (LLM).
And still, children manage to produce language effortlessly, and with more accuracy than these models.
“Children easily acquire language from quantities of data that are modest by the standards of modern artificial intelligence (AI),” reads a paper published in Nature Communications earlier this week.
Partly inspired by this phenomenon, the researchers behind the paper set out to build a model that can develop and test its own theories about human language from small data sets. Simply put, the model analyzes instances of human language and then derives possible grammatical, morphological, and phonological rules that explain why different forms appear in certain contexts.
Human beings are capable of looking at fairly small data sets and coming up with theories as to why particular linguistic forms occur. Outside of first language acquisition, this ability is even more evident in a typical introductory linguistics class: on homework problem sets, students are commonly given an annotated list of a dozen or so words in a specific language and must analyze them closely to propose a rule that accounts for the forms.
For instance, a particularly simple list might include the forms “horse” and “horses” (among a few other singular-plural pairings), and students would propose that, in English, the suffix “-s” is added to create a plural form.
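To make the homework exercise concrete, here is a toy sketch in Python of how such a suffix rule could be read off a handful of singular-plural pairs. It is an illustration only and not the researchers' model: the function names `induce_suffix_rule` and `pluralize` are hypothetical, and every word pair other than “horse” and “horses” is an assumed example rather than data from the study.

```python
from collections import Counter


def induce_suffix_rule(pairs):
    """Guess the suffix that most often turns a singular into its plural.

    Toy illustration only: it handles pure suffixation and nothing else.
    """
    suffixes = Counter()
    for singular, plural in pairs:
        # Count only pairs where the plural is the singular plus extra letters.
        if plural.startswith(singular) and len(plural) > len(singular):
            suffixes[plural[len(singular):]] += 1
    if not suffixes:
        return None  # no suffix rule fits any pair
    suffix, _count = suffixes.most_common(1)[0]
    return suffix


def pluralize(word, suffix):
    """Apply the induced suffix rule to a new word."""
    return word + suffix


# "horse"/"horses" comes from the example above; the other pairs are assumed.
pairs = [("horse", "horses"), ("cat", "cats"), ("dog", "dogs"), ("book", "books")]
rule = induce_suffix_rule(pairs)
print(rule)
print(pluralize("tree", rule))
```

A sketch like this fails as soon as the data includes irregular pairs such as “foot” and “feet,” or sound changes that depend on context, which is the kind of pattern the phonology problems described here are designed to probe.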
“Linguists have thought that in order to really understand the rules of a human language, to empathize with what it is that makes the system tick, you have to be human,” said Adam Albright, a professor of linguistics at the Massachusetts Institute of Technology and one of the researchers who worked on the model. “We wanted to see if we can emulate the kinds of knowledge and reasoning that humans (linguists) bring to the task.”
Machine-learning models generally need much more data than humans do. When testing their model, however, the researchers found that it could derive rules from very small data sets, taken from the same phonology problems a linguistics student might encounter in their first or second year at university. It came up with these rules at an accuracy rate of 60%. Admittedly, this isn’t perfect, but the researchers believe the work could shed further light on the nature of human language.
“Theory induction is a grand challenge for AI, and our work here captures only small slices of the theory-building process,” the study reads. “Like our model, human theorists craft models by examining experimental data, but they also propose new theories by unifying existing theoretical frameworks, performing thought experiments, and inventing new formalisms.”