Machine Translation and Gender Biases in Video Game Localisation: A Corpus-Based Analysis

The video game industry has historically been a gender-biased terrain due to a higher number of male protagonists and hypersexualised representations [Dietz, 1998; Downs and Smith, 2010; Lynch et al., 2016]. Nowadays, echoing the debate on inclusive language, companies attempt to erase gender disparity by introducing more female main characters as well as non-binary characters. From a technological point of view, even though recent studies show that Machine Translation (MT) remains largely unadopted by individual video game localisers [Rivas Ginel, 2021, 171], multilanguage vendors are willing to invest in these tools to reduce costs [LIND, 2020, 50]. However, the predominance of the masculine in Natural Language Processing (NLP) and Machine Learning has created allocation and representation biases in Neural Machine Translation (NMT) [Crawford, 2017]. This paper aims to analyse the percentage of gender bias resulting from the use of Google Translate, DeepL, and SmartCat when translating raw in-game content from English into French. The games DeltaRune, The Devil’s Womb and The Faces of the Forest were chosen due to the presence of non-binary characters, non-sexualised characters, and female protagonists. We compared the results in order to count and analyse the differences between these tools' output in terms of errors related to gender. To this end, we created a parallel corpus to compare source documents and each translation to visualise the semantic and grammatical directions of the word embeddings [Zhou et al., 2019] and extracted the collocations and concordance lines that represented gender identity by analysing the patterns in the source language.


The video game industry
As a consequence of the increasing popularity of video games, the demands of the industry in terms of video game localisation to reach a larger public and increase the revenues generated by each game have more than doubled. Conversely, localisers' working conditions have not substantially improved over the last decade and, despite constant technological advances, they continue "to work in a double-blind process (no audiovisual context, no text linearity)" [Merino, 2013, 117]. In other words, localisers are not given access to the game itself and, even in the best conditions, struggle to identify the gender of the character or the addressee, among many other issues.

Motivation
The idea of focusing on the efficiency of Neural Machine Translation (NMT) by analysing gender-related issues in the field of video game localisation stems from three reasons. First, previous research on the degree of adoption of Machine Translation (MT) tools by video game localisers [Rivas Ginel, 2021, 171] has shown that even though MT remains largely unadopted (11.6% are regular users), "66% of agencies and 44% of in-house translation teams expect to invest in [MT] in 2020" [LIND, 2020, 50]. Furthermore, more than 70.37% of video game developers use Unity as their main game engine, which provides localisation management and instant MT options [Rivas Ginel, 2022, 320]. Therefore, it is likely that either developers or translation agencies might use MT and then send the text to video game localisers for post-editing to cut costs in a saturated market.
The second motivation behind this study is that the video game industry is moving with the times and increasing the number of female, gender-neutral and non-sexualised characters, leaving behind the image of a predominantly male domain. Although this evolution reflects the present trend in society towards diversity and inclusion, these changes entail new challenges for video game localisers, forcing them to redouble their efforts to avoid gender bias and search for neutral yet natural translations. Moreover, the mechanisms used for inclusivity vary from one language to another, notably in grammatically gendered languages such as many of the Indo-European languages. Languages such as French, which tend to default to masculine grammatical forms, can easily become ambiguous and cause the under-representation of women because a pronoun in its masculine form can refer to a group of men, to a mixed-gender group of people, to a neutral group of people, or to a person whose gender is neither specified nor defined.
The third reason is the increasing number of papers that focus on the performance of NMT systems and explore solutions to reduce gender bias in MT. NMT's performance can be affected by its ability to deal with Language Detection, Natural Language Processing and Natural Language Understanding, well-known notions in artificial intelligence used to detect, understand, and treat natural languages. Furthermore, the efficiency of these tools, especially if the source language is relatively neutral and the target language is grammatically gendered, is highly impacted by the methods used to train the artificial intelligence behind those systems. For example, an NMT system that translates a sentence without gender marks such as "the surgeon is very skilled" will automatically use the masculine form and produce the outcome "le chirurgien est très compétent".

Systems
We decided to analyse the output of Google Translate and DeepL (both NMT-based) as well as SmartCat (to represent Statistical Machine Translation (SMT) tools). As the company explains in its blog [Google AI Blog], Google Neural Machine Translation (GNMT), which is integrated into Google Translate, was developed in 2016 and modified in 2020. It provides language detection and includes various technologies based on Artificial Intelligence that allow it to handle 100+ languages. GNMT uses a hybrid model architecture that replaces the previous RNN-based GNMT model (used to identify data correlations and patterns) with a transformer encoder (that reads the source text and searches for correlations) and an RNN decoder for predictive typing [Google AI Blog]. These hybrid models improve quality and reduce latency. Additionally, the 2020 update allows Google to crawl the web using an embedding-based model rather than the previous phrase-based machine translation model of 2016 (dictionary-based crawler), thus improving the quantity and quality of the data and increasing the number of collected sentences [Google AI Blog]. Nevertheless, because parallel (or translated) texts are the linguistic data used in its training phase, the availability and quantity of paired texts and languages depend on their presence on the web, which affects translation quality. Additionally, GNMT will be more accurate at translating terms or words from a source language to a target language than long sentences, entire texts, or complex idioms [Yang et al., 2020].
Once again using the company's own description [DeepL Blog], DeepL uses Deep Learning and Neural Networks and draws the linguistic data used for its training from Linguee, a massive corpus built around manually translated sentences, idioms, and texts. The quantity and diversity of the data utilised account for DeepL's degree of accuracy and performance in specialised domains and its capacity to offer translation alternatives such as synonyms and rephrased options [DeepL Blog]. Additionally, DeepL includes predictive typing which, combined with its translation alternatives function, works similarly to translation memories and means that DeepL might be able to memorise the user's translation alternative when the latter changes a term or rephrases the entire sentence. Consequently, the user's alternative would increase the amount of data exploited by DeepL, which already uses Deep Learning to constantly train paired languages, allowing it to train on this input and increasing the number of translation alternatives. In terms of MT, DeepL uses an API to provide restricted access to a glossary of terms, which increases accuracy. In addition, this API also limits the length of documents uploaded into DeepL, meaning that the text is split into sentences, then translated and finally put back together [DeepL Blog].
Finally, as a representative of MT integrated into Computer Assisted Translation (CAT) tools and an example of SMT, we chose SmartCat, which uses Yandex Free by default. This system is not based on translated texts or language rules but upon statistics, which means that the tool uses comparable corpora. SMT first analyses source documents in one language and statistically compares them with related documents in another language; then it gathers, relates, and aligns the data with parallel texts [Yandex Blog]. Another difference is that the system can easily and quickly be adapted when a language undergoes modifications (spelling, usage, new terms or idioms), which also means that it must be regularly updated to improve its quality. Yandex's language training is built around three models: a translation model, a language model, and a decoder. Those models are used together to sort all terms, idioms, and sentences extracted by language from the comparable corpora to find and match correlations of words, terms or sentences extracted from the parallel corpora [Yandex Blog]. The models draw statistical patterns from a first set of paired sentences and then use another set to constantly recalculate, compare, and match terms to find the most accurate equivalent in the target language. The translation is done by the decoder, which gathers all the translation alternatives for a given sentence or sequence, combines phrases from the translation model and statistically ranks them in order of probability [Yandex Blog]. In this sense, the decoder's function can be likened to a phrase-based model or a Seq2Seq model. The decoder also uses the translation model to determine the highest probability of equivalence and frequency of use to improve accuracy and idiomaticity.
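The ranking step performed by the decoder can be illustrated with a minimal sketch. It assumes the textbook noisy-channel formulation of phrase-based SMT, in which each candidate is scored by adding a translation-model log-probability and a language-model log-probability; it is not Yandex's actual implementation, and the candidate list, the probabilities and the function names are all invented for illustration.

```python
import math

# Hypothetical candidates for the source "the surgeon is very skilled".
# Each entry: (translation, translation-model prob, language-model prob).
# All probabilities are invented for illustration only.
CANDIDATES = [
    ("le chirurgien est très compétent", 0.40, 0.30),
    ("la chirurgienne est très compétente", 0.40, 0.05),
    ("le chirurgien est très qualifié", 0.20, 0.20),
]

def score(tm_prob: float, lm_prob: float) -> float:
    """Noisy-channel score: log P(source | target) + log P(target)."""
    return math.log(tm_prob) + math.log(lm_prob)

def rank(candidates):
    """Return the candidates sorted from most to least probable."""
    return sorted(candidates, key=lambda c: score(c[1], c[2]), reverse=True)

ranked = rank(CANDIDATES)
best = ranked[0][0]
```

In this toy setting the masculine and feminine candidates are tied in the translation model, so the language model alone decides the ranking. If masculine forms are more frequent in the training data, as assumed here, the masculine rendering comes first.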

Games
The games The Faces of the Forest, The Devil's Womb and DeltaRune were chosen due to the presence of female protagonists, non-binary characters, and non-sexualised characters. All of them are free to play and the .json documents containing the strings were either online or we were granted permission to use them for educational purposes. Given the difference in the number of words between the three documents, we decided to use the entirety of the first two games and a portion of DeltaRune roughly equivalent to their combined length. The Faces of the Forest has a little over 2 350 words in total and 11 characters: 4 females, 6 males and 1 non-binary; the main character is Ana. The Devil's Womb, an RPG with almost 14 900 words and 4 possible endings, has over 30 characters. Its protagonist is a female demon named Luna, the characters with the bigger roles are predominantly female (13 females), and there are many demons and minions whose gender is not specified. Finally, DeltaRune, also an RPG, has more than 36 000 words (although only the first 17 900 were selected); the main character is non-binary and has two companions, one female and one male. The selected part contains 40 characters: 18 of them (mostly the enemies) are non-binary or simply referred to as "it", 15 are male and 7 are female.
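The selection of a DeltaRune excerpt of comparable size can be sketched as follows. This is only an illustration under stated assumptions: words are counted by splitting on whitespace and strings are kept whole, in file order. The text does not say how words were counted, and the function name and sample strings are hypothetical.

```python
def take_word_budget(strings, budget):
    """Keep whole strings, in order, until adding one would exceed the budget."""
    selected, total = [], 0
    for text in strings:
        n_words = len(text.split())  # assumption: whitespace-delimited words
        if total + n_words > budget:
            break
        selected.append(text)
        total += n_words
    return selected, total

# Hypothetical sample strings standing in for in-game lines.
sample = ["Hey, kiddo!", "Are you awake?", "It looks at you without blinking."]
excerpt, used = take_word_budget(sample, budget=5)
```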

Process
After establishing the MT systems and the games, the translatable strings were automatically extracted from the .json files using an ad-hoc custom computer programme. We decided to work with this file format since previous research showed that .json files were used by 65.17% of the 379 developers who isolated gameplay text using external files [Rivas Ginel, 2022, 307]. Then, the text was copied and pasted into Google Translate and DeepL and, in the case of SmartCat, the file was uploaded and automatically translated without any other interference in order to automate the process as much as possible. Once again, to reproduce video game localisation practices [Rivas Ginel, 2022, 307], we used Excel files to display the source text and the translations. Afterwards, we created so-called "character bibles" for every game to catalogue characters, display their gender (or lack thereof), and include a screenshot for better reference. The bibles were used when the text was ambiguous and we needed to use walkthroughs to determine the identity of the speaker or the addressee. Then, a colour code was created to manually annotate gender-related issues by highlighting the affected cells to provide a visual representation of the errors:
• Blue: use of the masculine for a reference to a female character.
• Red: use of the feminine for a reference to a male character.
• Violet: references to non-binary or non-gendered characters.
• Green: wrong use of the masculine for a reference to an object.
• Orange: wrong use of the feminine for a reference to an object.
Finally, each of us proofread, annotated, and reviewed every document individually at least once, and we then went through them together once more before removing all the lines that did not contain gender-related issues. Unfortunately, this meant excluding segments where the translation was so poor that it was impossible to determine the type of error, a fairly common occurrence. Once the non-problematic lines were removed and a final check had been performed, we created a single document per colour for better visualisation and then counted all the errors per tool and type three times (to make sure that the resulting number was consistent).
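The string extraction that opens this process can be sketched as follows. The ad-hoc programme itself is not described in the text, so this is only a plausible minimal version: it walks a parsed .json structure recursively and collects every non-empty string value together with its location. The function name, the path format and the sample data are hypothetical.

```python
import json

def extract_strings(node, path=""):
    """Recursively yield (path, text) pairs for every non-empty string value."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from extract_strings(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from extract_strings(value, f"{path}/{index}")
    elif isinstance(node, str) and node.strip():
        yield path, node

# Hypothetical sample; real game files have their own layout.
raw = '{"scene1": {"id": 3, "lines": ["Hey, kiddo!", "Luna tried talking to you."]}}'
strings = list(extract_strings(json.loads(raw)))
```

A real extractor would probably also need to filter out string values that are not meant to be translated (identifiers, file names, variable placeholders), which depends on each game's file structure.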
Image 2. Excel files per type of error.

Colour code
The previous image (Image 2) already shows patterns regarding the type of error and tool: in the case of blue and green errors, we can easily observe a higher number of cells without issues in the column containing DeepL's translation. Conversely, the same tool appears to err more than the rest in the other three categories. If we look at the total number of errors per tool across all categories (Table 1), Google made 333 gender-related errors, DeepL made 261, and SmartCat 340. These figures represent 3.03%, 2.38% and 3.1% of the 10 980 segments, respectively. However, it is necessary to point out that SmartCat's translation quality was remarkably low and, thus, on numerous occasions we were not able to determine the type of mistake made. Therefore, the actual percentage of errors for SmartCat (Yandex) is likely to be considerably higher than the one listed above.
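The percentages above follow directly from the error counts and the total of 10 980 segments. A short check, using only the figures reported in this section (the variable and function names are illustrative):

```python
TOTAL_SEGMENTS = 10_980
ERRORS = {"Google Translate": 333, "DeepL": 261, "SmartCat": 340}

def error_rate(errors: int, total: int = TOTAL_SEGMENTS) -> float:
    """Percentage of segments with a gender-related error, to two decimals."""
    return round(100 * errors / total, 2)

rates = {tool: error_rate(count) for tool, count in ERRORS.items()}
```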
Figures 1 to 3 show the results arranged in pie charts and allow us to observe that all three tools had their highest number of errors when dealing with references to female characters (blue), even though DeepL performs better. DeepL shows a relatively higher tendency to favour the feminine over the masculine and has the highest percentages in the red and orange categories, and the lowest for blue and green. The differences in terms of percentages between the performance of Google and SmartCat are almost negligible, although it would seem that SmartCat has a slightly higher tendency to overcompensate in the case of red and orange.

Corpus-based analysis
Subsequently, we performed a corpus-based analysis to provide concrete examples of the issues encountered in each category. First, all the Excel documents were exported into a text file, one for each tool. Then, they were aligned using Aligner and converted into a single multilingual tmx file with all the segments containing errors. Finally, the file was uploaded into SketchEngine to create a parallel corpus that allowed us to analyse the translations and gender-related errors both qualitatively and quantitatively. In order to find relevant tokens for the analysis, a Wordlist search was carried out first to ascertain the frequency of all the words present in the corpus. After sorting the results, we decided to focus on "Luna", "Ana", "kid" and "kiddo", and finally on the pronouns "it", "they" and "you". These terms became the starting point of a more in-depth analysis with advanced parameters such as the parallel concordance function using the "simple" or the "lemma" options. Furthermore, the images have been modified post hoc to include the colour code to frame the sentences as a visual aid to highlight the previously discussed categories.
http://jdmdh.episciences.org eISSN 2416-5999, an open-access journal https://doi.org/10.46298/jdmdh.9065
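The Wordlist step, which counts how often each word form appears in the corpus, can be approximated in a few lines. This sketch is not SketchEngine's implementation: the regular-expression tokeniser and lowercasing are simplifying assumptions, and the sample segments are only placeholders.

```python
import re
from collections import Counter

def wordlist(segments):
    """Count lowercase word forms across all segments."""
    counts = Counter()
    for segment in segments:
        counts.update(re.findall(r"[a-z']+", segment.lower()))
    return counts

# Placeholder segments standing in for the source side of the corpus.
segments = [
    "Luna tried talking to you, but you were so nervous",
    "Hey, kiddo!",
]
frequencies = wordlist(segments)
top_word, top_count = frequencies.most_common(1)[0]
```

Sorting the resulting counts is what surfaces frequent names and pronouns as candidate tokens for the concordance searches.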

Nouns
The token "Luna" (the protagonist of The Devil's Womb) refers to a female character and thus, in French, the pronouns and adjectives must agree with it if it is the subject of the sentence. However, even though Google Translate and SmartCat both detected that Luna was the subject, they did not apply the rule as there was no other gender reference in the original and, by default, rendered the feminine as masculine. On the other hand, DeepL did not make mistakes in this particular case (Image 3).
The token "Ana" (the protagonist of The Faces of the Forest) predictably shows the same results. In this case, Google Translate made three mistakes, SmartCat mistranslated every occurrence, and DeepL proved to be more efficient with only one mistake. In these sentences, as both the name of the protagonist and its coreference (the pronoun "I") were present, the probability of gender-related errors should have been reduced. However, the results show the contrary and indicate that DeepL seems to perform better than its competitors in terms of analysing the context and translating accordingly (Image 4).
By using an advanced lemma parallel concordance for "kid", SketchEngine sorted all the results for the tokens "kid" and "kiddo", which serve as coreferences to various characters in DeltaRune and to "Ana" in The Faces of the Forest (said by Ana's grandmother and a neighbour). These terms were problematic for every tool: misled by the gender-neutral form of the English words, all three systems automatically translated them using the masculine.
Incidentally, SmartCat's results for "kiddo" were too corrupted and cannot be compared to the other tools since the system did not even translate the term into French (Image 5). Additionally, the lines framed in violet, from DeltaRune, illustrate the intentional use of non-binary English. This decision, clear for a human translator, would have entailed using a neutralisation strategy or inclusive writing in French since "gamin", "petit" and "le gosse" are gendered.
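The kind of parallel concordance query used for "kid" and "kiddo" can be sketched as a filter over aligned segment pairs. This is a simplified stand-in for SketchEngine's function: the lemma search is approximated by an explicit list of word forms, and the segment pairs are invented examples rather than lines from the corpus.

```python
import re

def parallel_concordance(pairs, word_forms):
    """Return the (source, target) pairs whose source contains one of the forms."""
    alternatives = "|".join(re.escape(form) for form in word_forms)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]

# Invented aligned segments (English source, French machine output).
pairs = [
    ("Hey, kiddo!", "Hé, gamin !"),
    ("The door is locked.", "La porte est verrouillée."),
    ("That kid is brave.", "Ce gosse est courageux."),
]
hits = parallel_concordance(pairs, ["kid", "kids", "kiddo"])
```

Each hit shows the source line next to its translation, which makes it easy to see where a gender-neutral English form was rendered with a gendered French noun.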

Pronouns
In the following screenshot (Image 6), we can see that DeepL was the only tool that wrongly translated the pronoun "you". In the first example, even though it handled "Luna" properly, it made a mistake while translating "Luna tried talking to you, but you were so nervous". In this [...]
Image 8. Token "They".

CONCLUSION
Creating this corpus allowed us to compare the use of gender in the source texts with the raw output of various MT systems to visualise the semantic, syntactic, and linguistic differences between a largely gender-neutral language such as English and a gendered language. By cross-referencing the information extracted from this corpus, we were able to determine the efficiency of each tool, how they analyse the context, their accuracy, and the quality of their translation. As such, DeepL has proved to avoid many gender biases while being more efficient than Google and SmartCat.
In addition, even though Google uses NMT technology, our analysis shows that its output is closer to SmartCat's. On the other hand, DeepL seems more accurate and fluent, surpassing Google thanks to its use of Deep Learning, as we have seen throughout the paper. Recent approaches in Neural Machine Translation show that integrating explicit gender inflexion tags into NMT systems reduces the number of gender biases, especially when translating sentences where the gender referent (pronoun) is identified [Saunders et al., 2020]. A similar method, incorporating a 'speaker gender' tag into the training phase of MT systems, has also been developed to convey gender at a sentence level [Vanmassenhove et al., 2018]. Finally, another approach consists of adjusting the word embedding of NMT systems to reduce gender bias.
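The tag-based approaches mentioned above can be illustrated by the preprocessing step they rely on: marking the source sentence with the speaker's gender before it reaches the MT system. The sketch below is an assumption-laden illustration and not the method of the cited papers: the tag tokens and function name are hypothetical, and the speaker genders are taken from the character descriptions in this paper.

```python
from typing import Optional

# Hypothetical tag tokens; real systems define their own vocabulary.
GENDER_TAGS = {"female": "<F>", "male": "<M>", "non-binary": "<N>"}

# Speaker genders as they would be recorded in a character bible.
CHARACTER_BIBLE = {"Luna": "female", "Ana": "female", "Kris": "non-binary"}

def tag_source(sentence: str, speaker: Optional[str]) -> str:
    """Prefix the sentence with a gender tag when the speaker's gender is known."""
    gender = CHARACTER_BIBLE.get(speaker) if speaker else None
    tag = GENDER_TAGS.get(gender)
    return f"{tag} {sentence}" if tag else sentence

tagged = tag_source("I am so tired.", "Luna")
untagged = tag_source("I am so tired.", None)
```

A system trained on data tagged in this way could, in principle, use the prefix to choose "fatiguée" over "fatigué"; sentences with an unknown speaker are left unchanged.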
Future research will focus on assessing the tools' bias in the context of gender-neutral choices by analysing in depth the results for the token "Kris" (a non-binary character) as well as all the other errors in the violet category. Even though nowadays we can observe a higher demand for inclusivity and non-binary solutions, players are less familiar with inclusive writing, neutral phrasing or even gender-neutral pronouns, which could negatively affect the flow of the text and the player's immersion. The key factors in the use of inclusive language in video games are the client's intent and their perception of diversity and inclusivity, along with the localisers' decisions and skills. The use of inclusive language techniques and constructions requires high awareness from developers and localisers; however, the audience's reception must be taken into account and the topic should be researched further. Furthermore, training a specific MT system that applies gender-neutral tagging to indicate neutral referents in the field of video game localisation remains an interesting avenue for future research. Finally, reproducing the experiment with the latest upgrades of the same tools to observe their evolution compared to the April 2021 versions would allow us to examine the results of Deep Learning.

Image 1. Excel files and character bibles.

Figures 1 to 3. Distribution of errors per tool.