Google Releases Dataset to Address Gender Bias

In an effort to address gender bias in its neural machine translation (NMT) technologies, Google has recently released a new dataset that appears to improve how accurately Google Translate handles gendered language.

“One research area has been using context from surrounding sentences or passages to improve gender accuracy,” reads a recent blog post from the company’s AI team. “This is a challenge because traditional NMT methods translate sentences individually, but gendered information is not always explicitly stated in each individual sentence.”


In late June, four researchers at Google published the Translated Wikipedia Biographies dataset, a collection of Wikipedia entries, each about a person (identified as male or female), a rock band, or a sports team (the latter two are considered genderless). According to Google, the new dataset appears to significantly improve gender accuracy, though there’s still work to be done.

“It’s worth mentioning that by releasing this dataset, we don’t aim to be prescriptive in determining what’s the optimal approach to address gender bias,” the team writes. “This contribution aims to foster progress on this challenge across the global research community.” 

In the blog post, Google gives an example of a Spanish paragraph whose subject is female, though the subject is not explicitly mentioned in every sentence. Because Spanish is a pro-drop language that does not always include a subject in each sentence, a translation engine that works sentence by sentence can mistranslate such sentences into English using masculine pronouns rather than the correct feminine ones (or vice versa). When evaluated with the Translated Wikipedia Biographies dataset, Google Translate more frequently produced translations using the correct gender pronouns.
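The mechanism described above can be illustrated with a toy sketch. The snippet below is purely hypothetical — the "translator" is a hard-coded lookup, not Google's NMT system or any real API — but it shows why per-sentence translation of a pro-drop language loses gender information that passage-level context preserves.

```python
# Hypothetical sketch: translating a pro-drop Spanish passage into English.
# The second Spanish sentence drops its subject, so an English pronoun must
# be chosen; without context, this toy system defaults to masculine.

SENTENCES = [
    "Marie Curie nació en Varsovia.",  # subject explicit: female
    "Estudió física en París.",        # subject dropped (pro-drop)
]

def translate_sentence(sentence, subject_gender=None):
    """Toy 'translation': when the Spanish subject is dropped, pick an
    English pronoun from context if available, else default to 'he'."""
    if sentence.startswith("Marie Curie"):
        return "Marie Curie was born in Warsaw."
    pronoun = {"female": "She", "male": "He"}.get(subject_gender, "He")
    return f"{pronoun} studied physics in Paris."

# Sentence-by-sentence translation: the second sentence carries no gender
# cue on its own, so the toy system falls back to the masculine default.
isolated = [translate_sentence(s) for s in SENTENCES]

# Context-aware translation: the gender established in the first sentence
# is carried forward to later sentences in the same passage.
contextual = []
gender = None
for s in SENTENCES:
    if "Marie Curie" in s:
        gender = "female"
    contextual.append(translate_sentence(s, subject_gender=gender))

print(isolated[1])    # "He studied physics in Paris."  (wrong)
print(contextual[1])  # "She studied physics in Paris." (correct)
```

Datasets like Translated Wikipedia Biographies make this failure mode measurable: because each entry's subject gender is known, researchers can count how often a system's pronouns match it.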

Back in April, MultiLingual reported on gender bias in Google’s translation engine, after numerous social media users noticed problems with how Google Translate rendered non-gendered language into gendered languages. Oftentimes, such translations reflected stereotypical depictions of gender roles — e.g., translating a non-gendered pronoun from Finnish into English as “he” when associated with the word “doctor” but translating the same pronoun as “she” when associated with the word “teacher.”

Andrew Warner
Andrew Warner is a writer from Sacramento. He received his B.A. in linguistics and English from UCLA and is currently working toward an M.A. in applied linguistics at Columbia University. His writing has been published in Language Magazine, Sactown Magazine, and The Takeout.
