In recent years, we’ve seen a slow but steady evolution in language driven by significant societal and cultural shifts. Language is becoming more and more inclusive, not only with regard to biases related to gender, ethnicity, age, religion, and so on, but also in evolving from a binary view of gender into a non-binary one. This may be a sign of our times, but it’s also becoming essential for brands to include people who identify as non-binary when designing user experiences.
Inclusivity is not just a temporary trend; it is becoming a vital aspect of fostering respect, understanding, and equality. Our ways of communicating — facilitated and multiplied by the digital world and instantaneous access to products and services — have been changing, and these changes are reflected in how we speak and write and, consequently, in how we localize content. We must carefully choose words that acknowledge and respect any individual regardless of their gender, ethnicity, age, or any other characteristics. Moving away from a binary and deterministic approach, we can create an environment where everyone feels safe, heard, and valued.
This shift naturally impacts the localization field, where our stated mission has always been ensuring the content resonates with diverse audiences. So, how can you afford not to include inclusivity in the localization process (pun not intended)?
How to embed inclusivity in your localization process
Since inclusivity is relatively new to localization, it is not yet reflected in how conventional language technology is structured. Many studies have shown that machine translation perpetuates biases present in language, defaulting to the masculine form for certain professions, for example, or being unable to reproduce non-binary forms. Because there is not yet enough non-binary data to counterbalance the huge amount of legacy data used to build language models, we need to be very careful when implementing the latest language technology in localization processes if our objective is to create inclusive content. We need to train the models to ensure the output follows specific rules and post-edit until the machine-introduced biases are completely removed.
The first step is to create inclusive language guidelines with specific rules to follow while post-editing and adapting previously translated content so it can be used to retrain language models. The guidelines need to be comprehensive and, of course, language-specific.
Since inclusivity (and language in general) is constantly evolving, it’s important to stay up to date with the most recent trends and changes, especially for non-binary aspects. Thorough research is essential to ensure the guidelines comply with the different country regulations and language institutions. The Accademia della Crusca in Italy and the Real Academia Española in Spain are two such institutions dedicating a lot of space to this topic, helping ensure that language choices are approved and adopted at a national level. This can pose a particular challenge for the training of language models, as retraining is not something that can be done very often, given both the amount of data needed and the care required to prepare it. It can entail a fair bit of manual work and be financially demanding, as training a model can be quite an investment.
Some of the aspects to consider when creating guidelines are the following:
- Make sure the communication is gender inclusive and that no group is consciously excluded, especially when referring to a person’s sexual orientation. For example, you may want to use greetings like “hi everyone” instead of “hi guys” when addressing a mixed group (a minimal automated check for such phrasing is sketched after this list).
- Each language has a specific way of addressing non-binary people. Use either a direct non-binary approach or an indirect, gender-neutral one, in line with each language’s preferred convention. In English, you would use the pronouns “they/them,” whereas in other languages, you might want to use passive forms, for instance.
- Do not use any racist or sexist messaging, and make sure to exclude anything of this kind from the text.
- Make sure to use the correct inclusive language when talking to or about people with disabilities or mental health conditions.
- Religious connotations should be handled carefully to make sure the message is not perceived as offensive.
- Social and economic background and age should not be presented as a reason for discrimination or in a derogatory way.
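To make guidelines like these actionable during post-editing, some teams encode a subset of the rules as data that can be checked automatically. The sketch below is a minimal illustration of that idea in Python; the flagged phrases, suggested alternatives, and per-language strategies are hypothetical placeholders rather than an actual rule set, which would come from your own language-specific guidelines.

```python
# A minimal sketch of encoding a few guideline rules as data so they can be
# checked automatically during post-editing. The flagged terms, suggested
# alternatives, and per-language strategies are hypothetical examples only.

# Non-inclusive phrases mapped to preferred alternatives, per language.
FLAGGED_TERMS = {
    "en": {"hi guys": "hi everyone"},
    "es": {"bienvenidos": "te damos la bienvenida"},
}

# Each language's preferred way of handling non-binary references
# (would drive which rewrite a post-editor is pointed to).
NON_BINARY_STRATEGY = {
    "en": "direct: singular they/them",
    "es": "direct (-e forms) or indirect gender-neutral rewording",
    "it": "indirect gender-neutral rewording",
}

def check_segment(text, lang):
    """Return guideline warnings for a translated segment."""
    warnings = []
    lowered = text.lower()
    for term, suggestion in FLAGGED_TERMS.get(lang, {}).items():
        if term in lowered:
            warnings.append(f"Consider replacing '{term}' with '{suggestion}'.")
    return warnings

if __name__ == "__main__":
    print(check_segment("Hi guys, welcome back!", "en"))
    # ["Consider replacing 'hi guys' with 'hi everyone'."]
```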
As stated before, each language has its own rules for inclusive and gender-neutral language. That’s why preparing the data for model training is quite language-specific, although we’re also seeing processes become more standardized. Since we usually don’t have the luxury of discarding content (we need to avoid losing too much training data), deep cleaning is not the preferred solution, and a lighter process is recommended. That said, the process needs to be more detailed, which is why more manual work might be needed to ensure the cleaning is tailored to what we want to achieve.
Here are some best practices:
- Choose the data you want to use carefully — create a good mix of training data that contains enough content representative of the result you want to achieve. The mix will look very different for an engine focused on more generic content than for one tailored to a specific area, provided you have enough domain-specific data.
- Depending on the level of variety and fluency you want to achieve, you may want to deduplicate the training set, eliminating exact duplicates, i.e., sentences containing the same source and target translations.
- Remove short segments that don’t carry any meaning (e.g., one-word segments), but keep the ones containing gendered words (e.g., names of professions) and use them to generate full sentences with a gender-inclusive approach; a cleaning sketch covering this and the previous point follows the list.
- Use a variety of sentences containing binary, non-binary, and non-gendered examples. You can use gender tags, which are very common in the context of video games, for example (a small expansion sketch follows the example table below). While non-natural language should usually be eliminated, in this case it has proven to help produce the correct translation in specific cases.
- Use a balanced dataset to tune your model. This dataset should contain enough examples of the output you want to achieve, further ensuring you eliminate biases from the model.
- To ensure a balanced dataset, generate different versions of the same segments, or slightly different versions containing non-gendered or non-binary examples. This process can be manual, or you can use LLMs with the help of RAG (retrieval-augmented generation), i.e., supplying approved examples to help the technology generate the desired content; a minimal prompt-building sketch along these lines also follows the list.
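To make the deduplication and filtering points above concrete, here is a minimal Python sketch. The word-count threshold, the gendered-term list, and the tuple-based corpus format are assumptions for the example; in practice they would come from your own guidelines and corpus.

```python
# A minimal sketch of the cleaning steps described above: remove exact duplicates,
# and drop very short segments unless they contain gendered terms worth expanding
# into full inclusive sentences. Thresholds and term lists are illustrative only.

# Hypothetical list of gendered terms (e.g., profession names and greetings)
# worth keeping even when the segment is short.
GENDERED_TERMS = {"adepto", "adepta", "bienvenido", "bienvenida", "doctor", "doctora"}

MIN_WORDS = 2  # segments shorter than this are dropped unless they contain gendered terms


def clean_corpus(pairs):
    """pairs: iterable of (source, target) tuples; returns (kept, to_expand)."""
    seen = set()
    kept, to_expand = [], []
    for source, target in pairs:
        key = (source.strip(), target.strip())
        if key in seen:  # exact duplicate: same source and same target
            continue
        seen.add(key)
        words = [w.strip(".,!?¡¿").lower() for w in target.split()]
        if len(words) < MIN_WORDS:
            if any(w in GENDERED_TERMS for w in words):
                to_expand.append(key)  # set aside to generate full inclusive sentences
            continue  # otherwise drop the meaningless short segment
        kept.append(key)
    return kept, to_expand


if __name__ == "__main__":
    corpus = [
        ("Welcome.", "Bienvenido."),
        ("Welcome.", "Bienvenido."),   # exact duplicate: removed
        ("Weakling.", "Débil."),       # short and carries no gendered term: dropped
        ("It's me.", "Ese soy yo."),
    ]
    kept, to_expand = clean_corpus(corpus)
    print(kept)       # [("It's me.", 'Ese soy yo.')]
    print(to_expand)  # [('Welcome.', 'Bienvenido.')]
```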
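For the LLM-assisted route in the last point, one common pattern is to retrieve a few approved inclusive examples and place them in the prompt as references. The sketch below only builds such a prompt; the example store, the naive keyword retrieval, and the prompt wording are assumptions, and the actual model call is left to whichever API you use.

```python
# A minimal sketch of the RAG-style approach mentioned above: retrieve a few
# approved inclusive examples and include them in a few-shot prompt asking the
# model for a gender-inclusive rewrite. Example data and retrieval are illustrative.

APPROVED_EXAMPLES = [
    ("Welcome to my home.", "Bienvenide a mi casa."),
    ("It's me.", "Ese soy yo. / Esa soy yo. / Eso soy yo."),
]

def retrieve_examples(segment, k=2):
    """Naive retrieval: rank stored examples by word overlap with the input."""
    words = set(segment.lower().split())
    scored = sorted(
        APPROVED_EXAMPLES,
        key=lambda ex: len(words & set(ex[0].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(source, gendered_target):
    """Assemble a few-shot prompt asking for a gender-inclusive rewrite."""
    shots = "\n".join(
        f"EN: {s}\nES (inclusive): {t}" for s, t in retrieve_examples(source)
    )
    return (
        "Rewrite the Spanish translation so it is gender-inclusive, "
        "following the style of these approved examples:\n"
        f"{shots}\n\n"
        f"EN: {source}\nES (current): {gendered_target}\nES (inclusive):"
    )

if __name__ == "__main__":
    prompt = build_prompt("Welcome to my home.", "Bienvenido a mi casa.")
    print(prompt)
    # inclusive_variant = your_llm_of_choice(prompt)  # plug in whichever model API you use
```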
Here are a few examples of what your choices might look like when preparing the data. Due to the relative shortage of data you must contend with, you may have to keep material you would usually discard, as stated above. This choice of training data might seem very unusual to a language technology expert, but it will do until the available data grows.
| Source | Target | Action |
| --- | --- | --- |
| Weakling. | Débil. | Reject, short word |
| Welcome to my home. | Bienvenide a mi casa. | Original sentence was “Welcome”. Keep and generate full sentences with alternates |
| Welcome. | Bienvenido a mi casa. | |
| Welcome. | Bienvenida a mi casa. | |
| What, are you crazy!? | ¡¿Te has vuelto loque?! | Non-binary example |
| Reach Spellcaster Rank 3 -Adept | Alcanzar el nivel 3 de hechicería: {M0.adepto}{F0.adepta}{N0.adepte}. | Use of tags |
| It’s me. | Ese soy yo. | Keep, near duplicates |
| It’s me. | Esa soy yo. | |
| It’s me. | Eso soy yo. | |
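The tag row above also hints at how this non-natural markup can be turned into additional training sentences. Assuming the tag convention shown in the table (my reading of the format, not an official specification), a small script can expand one tagged segment into masculine, feminine, and non-binary variants:

```python
# A minimal sketch of expanding gender tags like {M0.adepto}{F0.adepta}{N0.adepte}
# into one full sentence per gender, so a single tagged segment yields several
# training examples. The tag syntax is inferred from the table above.

import re

TAG = re.compile(r"\{([MFN])(\d+)\.([^}]*)\}")

def expand_gender_tags(text):
    """Return one variant of `text` per gender letter found in its tags."""
    genders = list(dict.fromkeys(m.group(1) for m in TAG.finditer(text)))
    variants = {}
    for g in genders:
        def pick(match, g=g):
            # Keep the form matching this gender; drop the alternatives.
            return match.group(3) if match.group(1) == g else ""
        variants[g] = TAG.sub(pick, text)
    return variants

if __name__ == "__main__":
    segment = "Alcanzar el nivel 3 de hechicería: {M0.adepto}{F0.adepta}{N0.adepte}."
    for gender, sentence in expand_gender_tags(segment).items():
        print(gender, sentence)
    # M Alcanzar el nivel 3 de hechicería: adepto.
    # F Alcanzar el nivel 3 de hechicería: adepta.
    # N Alcanzar el nivel 3 de hechicería: adepte.
```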
Naturally, these are just some tips and best practices for leveraging your tech to create inclusive copy. The challenge is how we, as a society, continue to engage with and expand the topic of inclusivity. We’re trending in the direction of language technology becoming more commonplace and easy to use by the masses — as opposed to being reserved for savvy language industry professionals — so training and fine-tuning the models to make sure the output is inclusive and non-binary-friendly is crucial if we want to reflect all the beautiful varieties of human nature and expression. The collaborative process between companies, tech developers, inclusive language experts, and non-binary communities is fundamental and has just started.