Social media giant Facebook is under the microscope yet again for purportedly prioritizing growth and profit over the safety of users and marginalized groups.
One of the more recent revelations to come from the so-called Facebook Papers, a collection of documents delivered to Congress by former Facebook employee turned whistleblower Frances Haugen, is that Facebook’s content moderation efforts fall woefully short for certain non-English languages, including those spoken in particularly volatile regions and by vulnerable groups.
Documents first reported by CNN and a consortium of other outlets and analyzed by MultiLingual show that moderation efforts in dialectal Arabic, Pashto, and Hindi, among others, were hampered by a lack of moderators fluent in those languages.
A lack of artificial and natural intelligence
Since 2018, reports have been sounding the alarm that artificial intelligence (AI)-based content moderation for non-English languages at Facebook fails to detect instances of threatening or misleading content, but these new revelations indicate that human moderation has also been lacking.
Facebook claims to have AI content moderation algorithms in 40 languages and human content moderation teams in 70 languages, but both of these numbers fall short of the 100 languages that Facebook officially supports.
In addition, even supported languages aren’t equally maintained, leading to inaccurate or incomplete translations, with large chunks of information being left in English.
Two languages for which this problem was highlighted were Dari and Pashto, the two major languages of Afghanistan. These problems, indicated in the company’s own research, run counter to claims made by Facebook as recently as August 2021 to the BBC that the company has “a dedicated team of Afghanistan experts, who are native Dari and Pashto speakers and have knowledge of local context, helping to identify and alert us to emerging issues on the platform.”
A shortage of insight, not of resources
One explanation offered for the lack of both artificial and human intelligence is that these languages are spoken by only small numbers of people, implying that finding data on which to train AI models, and finding human linguists fluent in those languages, poses a challenge for Facebook and leads to these gaps.
However, analysis by MultiLingual showed that the languages identified as lacking moderation didn’t always correspond to how many speakers those languages have, nor to Facebook’s market share or size in a given region (Facebook has approximately five million monthly users in Afghanistan, or around 38% of market share, while all Facebook-affiliated platforms combined count about 400 million users in India).
Similarly, documents indicate that Facebook and Instagram prioritized Amharic and Oromo, two of the most widely spoken languages in Ethiopia, in the first half of 2021, based partly on a risk of offline violence. By then, however, the northeast African country had already been embroiled in a bloody civil war since November 3, 2020, and Amharic and Oromo speakers number almost 95 million combined (it’s unknown whether Facebook has also prioritized Tigrinya, the language of the Ethiopian region at the heart of the ongoing civil war).
Facebook also said it added hate speech classifiers (machine-learning models trained to automatically flag hateful content) for Hindi and Bengali in 2018 and 2020, respectively.
These two major Indian subcontinent languages, the subject of concern due to the presence of anti-Muslim hate speech on the platform, are certainly not wanting for either data or resources, as Hindi is spoken by some 615 million people and Bengali by 265 million people (put another way, Bengali is spoken by more people than German and Japanese combined).
Dari and Pashto, mentioned above, have a combined 160 million speakers, and a wealth of written material on which to train AI models.
These gaps in coverage not only lead to potentially violent or misleading content going unreported (Politico indicated less than 1 percent of hate speech was removed in Afghanistan), but also a significant number of false positives — the same Politico report indicated nonviolent content in Arabic, especially regarding the Palestinian-Israeli conflict, was deleted by moderators 77 percent of the time. Meanwhile, according to information gathered from the Facebook Papers by the Times of Israel, Facebook was unable to respond to abuse allegations from Filipino domestic workers, who make up a sizable portion of domestic workers in the Middle East, because of its inability to flag words in the major Philippine language of Tagalog, which has approximately 68 million speakers.
As new information continues to come out about Facebook’s operations, it becomes clear that market size, market share, number of speakers, and geopolitical realities have not consistently driven internal decisions about which content to assess, study, monitor, and moderate, or about how to do so. In the meantime, speakers of non-European languages (with the exception of Korean and Japanese) often find themselves either unprotected or overzealously monitored by a company whose actions and knowledge base, ironically, have yet to adapt to the multilingual, multicultural, interconnected world of the 21st century.