NLP4DH


1. Processing the structure of documents: Logical Layout Analysis of historical newspapers in French

Nicolas Gutehrlé ; Iana Atanassova.
Background. In recent years, libraries and archives have led important digitisation campaigns that opened access to vast collections of historical documents. While such documents are often available as XML ALTO documents, they lack information about their logical structure. In this paper, we address the problem of Logical Layout Analysis applied to historical documents in French. We propose a rule-based method that we evaluate and compare with two Machine Learning models, namely RIPPER and Gradient Boosting. Our data set contains French newspapers, periodicals and magazines, published in the first half of the twentieth century in the Franche-Comté Region. Results. Our rule-based system outperforms the two other models in nearly all evaluations. It has especially better Recall results, indicating that our system covers more types of every logical label than the other two models. When comparing RIPPER with Gradient Boosting, we can observe that Gradient Boosting has better Precision scores but RIPPER has better Recall scores. Conclusions. The evaluation shows that our system outperforms the two Machine Learning models, and provides significantly higher Recall. It also confirms that our system can be used to produce annotated data sets that are large enough to envisage Machine Learning or Deep Learning approaches for the task of Logical Layout Analysis. Combining rules and Machine Learning models into hybrid systems could potentially provide even better performance. […]
Section: Digital humanities in languages

2. Enhancing Legal Argument Mining with Domain Pre-training and Neural Networks

Gechuan Zhang ; Paul Nulty ; David Lillis.
The contextual word embedding model, BERT, has proved its ability on downstream tasks with limited quantities of annotated data. BERT and its variants help to reduce the burden of complex annotation work in many interdisciplinary research areas, for example, legal argument mining in digital humanities. Argument mining aims to develop text analysis tools that can automatically retrieve arguments and identify relationships between argumentation clauses. Since argumentation is one of the key aspects of case law, argument mining tools for legal texts are applicable to both academic and non-academic legal research. Domain-specific BERT variants (pre-trained with corpora from a particular background) have also achieved strong performance in many tasks. To our knowledge, previous machine learning studies of argument mining on judicial case law still heavily rely on statistical models. In this paper, we provide a broad study of both classic and contextual embedding models and their performance on practical case law from the European Court of Human Rights (ECHR). During our study, we also explore a number of neural networks combined with different embeddings. Our experiments provide a comprehensive overview of a variety of approaches to the legal argument mining task. We conclude that domain pre-trained transformer models have great potential in this area, although traditional embeddings can also achieve strong performance when combined with additional neural network […]

3. Adapting vs. Pre-training Language Models for Historical Languages

Enrique Manjavacas ; Lauren Fonteyn.
As large language models such as BERT are becoming increasingly popular in Digital Humanities (DH), the question has arisen as to how such models can be made suitable for application to specific textual domains, including that of 'historical text'. Large language models like BERT can be pretrained from scratch on a specific textual domain and achieve strong performance on a series of downstream tasks. However, this is a costly endeavour, both in terms of the computational resources as well as the substantial amounts of training data it requires. An appealing alternative, then, is to employ existing 'general purpose' models (pre-trained on present-day language) and subsequently adapt them to a specific domain by further pre-training. Focusing on the domain of historical text in English, this paper demonstrates that pre-training on domain-specific (i.e. historical) data from scratch yields a generally stronger background model than adapting a present-day language model. We show this on the basis of a variety of downstream tasks, ranging from established tasks such as Part-of-Speech tagging, Named Entity Recognition and Word Sense Disambiguation, to ad-hoc tasks like Sentence Periodization, which are specifically designed to test historically relevant processing.
Section: Digital humanities in languages

4. Fractal Sentiments and Fairy Tales - Fractal scaling of narrative arcs as predictor of the perceived quality of Andersen's fairy tales

Yuri Bizzoni ; Telma Peura ; Mads Thomsen ; Kristoffer Nielbo.
This article explores the sentiment dynamics present in narratives and their contribution to literary appreciation. Specifically, we investigate whether a certain type of sentiment development in a literary narrative correlates with its quality as perceived by a large number of readers. While we do not expect a story's sentiment arc to relate directly to readers' appreciation, we focus on its internal coherence as measured by its sentiment arc's level of fractality as a potential predictor of literary quality. To measure the arcs' fractality we use the Hurst exponent, a popular measure of fractal patterns that reflects the predictability or self-similarity of a time series. We apply this measure to the fairy tales of H.C. Andersen, using GoodReads' scores to approximate their level of appreciation. Based on our results we suggest that there might be an optimal balance between predictability and surprise in a sentiment arc's structure that contributes to the perceived quality of a narrative text.
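
As an illustration of what a Hurst exponent measures, the sketch below estimates it with classical rescaled-range (R/S) analysis. This is an assumption made for illustration only: the abstract does not say which estimator the authors use, and the function names and the synthetic series are hypothetical stand-ins for a real sentiment arc.

```python
import math
import random


def rescaled_range(chunk):
    """Return R/S for one window, or None if the window is constant."""
    n = len(chunk)
    mean = sum(chunk) / n
    cumulative, lowest, highest = 0.0, 0.0, 0.0
    for value in chunk:
        cumulative += value - mean          # running sum of deviations
        lowest = min(lowest, cumulative)
        highest = max(highest, cumulative)
    std = math.sqrt(sum((v - mean) ** 2 for v in chunk) / n)
    return (highest - lowest) / std if std > 0 else None


def hurst_rs(series, min_size=8):
    """Estimate H as the slope of log(mean R/S) against log(window size)."""
    xs, ys = [], []
    size = min_size
    while size <= len(series) // 2:
        values = []
        for start in range(0, len(series) - size + 1, size):
            rs = rescaled_range(series[start:start + size])
            if rs is not None:
                values.append(rs)
        if values:
            xs.append(math.log(size))
            ys.append(math.log(sum(values) / len(values)))
        size *= 2                            # double the window each round
    if len(xs) < 2:
        raise ValueError("series too short for a slope estimate")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator           # least-squares slope


# Synthetic data (hypothetical): white noise versus its running sum.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1024)]
walk, total = [], 0.0
for step in noise:
    total += step
    walk.append(total)

h_noise = hurst_rs(noise)   # unstructured series
h_walk = hurst_rs(walk)     # strongly persistent series
```

Under this estimator, an uncorrelated series should give a value around 0.5 and a strongly persistent one a value closer to 1, so `h_walk` is expected to exceed `h_noise`. R/S estimates are known to be biased on short series, so values from short texts should be read with caution.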

5. Word Sense Induction with Attentive Context Clustering

Moshe Stekel ; Amos Azaria ; Shai Gordin.
This paper presents ACCWSI (Attentive Context Clustering WSI), a method for Word Sense Induction, suitable for languages with limited resources. Pretrained on a small corpus and given an ambiguous word (a query word) and a set of excerpts that contain it, ACCWSI uses an attention mechanism for generating context-aware embeddings, distinguishing between the different senses assigned to the query word. These embeddings are then clustered to provide groups of main common uses of the query word. We show that ACCWSI performs well on the SemEval-2 2010 WSI task. ACCWSI also demonstrates practical applicability for shedding light on the meanings of ambiguous words in ancient languages, such as Classical Hebrew and Akkadian. In the near future, we intend to turn ACCWSI into a practical tool for linguists and historians.

6. Hate speech, Censorship, and Freedom of Speech: The Changing Policies of Reddit

Elissa Nakajima Wickham ; Emily Öhman.
This paper examines the shift in focus on content policies and user attitudes on the social media platform Reddit. We do this by focusing on comments from general Reddit users from five posts made by admins (moderators) on updates to Reddit Content Policy. All five concern the nature of what kind of content is allowed to be posted on Reddit, and which measures will be taken against content that violates these policies. We use topic modeling to probe how the general discourse for Redditors has changed around limitations on content, and later, limitations on hate speech, or speech that incites violence against a particular group. We show that there is a clear shift in both the contents and the user attitudes that can be linked to contemporary societal upheaval as well as newly passed laws and regulations, and contribute to the wider discussion on hate speech moderation.

7. Affect as a proxy for literary mood

Emily Öhman ; Riikka Rossi.
We propose to use affect as a proxy for mood in literary texts. In this study, we explore the differences in computationally detecting tone versus detecting mood. Methodologically, we utilize affective word embeddings to look at the affective distribution in different text segments. We also present a simple yet efficient and effective method of enhancing emotion lexicons to take both semantic shift and the domain of the text into account, producing real-world congruent results closely matching both contemporary and modern qualitative analyses.

8. The Impact of Incumbent/Opposition Status and Ideological Similitude on Emotions in Political Manifestos

Takumi Nishi.
The study involved the analysis of emotion-associated language in the UK Conservative and Labour party general election manifestos between 2000 and 2019. While previous research has shown a general correlation between ideological positioning and overlap of public policies, there are still conflicting results in matters of sentiments in such manifestos. Using new data, we present how valence level can be swayed by party status within government, with incumbent parties presenting a higher frequency of positive emotion-associated words while negative emotion-associated words are more prevalent in opposition parties. We also demonstrate that parties with ideological similitude use positive language prominently, further adding to the literature on the relationship between sentiments and party status.

9. The Effects of Political Martyrdom on Election Results: The Assassination of Abe

Miu Nicole Takagi.
In developed nations assassinations are rare and thus the impact of such acts on the electoral and political landscape is understudied. In this paper, we focus on Twitter data to examine the effects of Japan's former Prime Minister Abe's assassination on the Japanese House of Councillors elections in 2022. We utilize sentiment analysis and emotion detection together with topic modeling on over 2 million tweets and compare them against tweets during previous election cycles. Our findings indicate that Twitter sentiments were negatively impacted by the event in the short term and that social media attention span has shortened. We also discuss how "necropolitics" affected the outcome of the elections in favor of the deceased's party, meaning that there seems to have been an effect of Abe's death on the election outcome, though the findings warrant further investigation for conclusive results.

10. Large-scale weighted sequence alignment for the study of intertextuality in Finnic oral folk poetry

Maciej Janicki.
The digitization of large archival collections of oral folk poetry in Finland and Estonia has opened possibilities for large-scale quantitative studies of intertextuality. As an initial methodological step in this direction, I present a method for pairwise line-by-line comparison of poems using the weighted sequence alignment algorithm (a.k.a. ‘weighted edit distance’). The main contribution of the paper is a novel description of the algorithm in terms of matrix operations, which allows for much faster alignment of a poem against the entire corpus by utilizing modern numeric libraries and GPU capabilities. This way we are able to compute pairwise alignment scores between all pairs from among a corpus of over 280,000 poems. The resulting table of over 40 million pairwise poem similarities can be used in various ways to study the oral tradition. Some starting points for such research are sketched in the latter part of the article.
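
For readers unfamiliar with weighted edit distance, here is a minimal sketch of the textbook dynamic-programming form, applied line by line to two short poems. It is not the matrix formulation the paper contributes. The line cost function (one minus the Jaccard overlap of word sets) and the insertion/deletion cost of 1.0 are assumptions chosen only for illustration.

```python
def weighted_edit_distance(a, b, sub_cost, indel_cost=1.0):
    """Minimum total cost of turning sequence a into sequence b."""
    # prev[j] holds the cost of aligning the processed prefix of a with b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,          # drop line x
                curr[j - 1] + indel_cost,      # insert line y
                prev[j - 1] + sub_cost(x, y),  # align x with y at a weighted cost
            ))
        prev = curr
    return prev[-1]


def line_cost(x, y):
    """Hypothetical line dissimilarity: 1 minus Jaccard overlap of word sets."""
    words_x, words_y = set(x.split()), set(y.split())
    if not words_x and not words_y:
        return 0.0
    return 1.0 - len(words_x & words_y) / len(words_x | words_y)


poem_a = ["a b c", "d e f"]
poem_b = ["a b c", "x y z", "d e f"]
distance = weighted_edit_distance(poem_a, poem_b, line_cost)
```

In this toy case the cheapest alignment matches the two shared lines at zero cost and pays one insertion for the extra line, so `distance` is 1.0. Running this loop for every pair in a large corpus would be slow, which is the bottleneck the matrix formulation in the abstract addresses.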

11. Interactive Analysis and Visualisation of Annotated Collocations in Spanish (AVAnCES)

Simon Gonzalez.
Phraseology studies have been enhanced by Corpus Linguistics, which has become an interdisciplinary field where current technologies play an important role in its development. Computational tools have been implemented in the last decades with positive results on the identification of phrases in different languages. One specific technology that has impacted these studies is social media. As researchers, we have turned our attention to collecting data from these platforms, which comes with great advantages and its own challenges. One of the challenges is the way we design and build corpora relevant to the questions emerging in this type of language expression. This has been approached from different angles, but one that has given invaluable outputs is the building of linguistic corpora with the use of online web applications. In this paper, we take a multidimensional approach to the collection, design, and deployment of a phraseology corpus for Latin American Spanish from Twitter data, extracting features using NLP techniques, and presenting it in an interactive online web application. We expect to contribute to the methodologies used for Corpus Linguistics in the current technological age. Finally, we make this tool publicly available to be used by any researcher interested in the data itself and also in the technological tools developed here.
Section: Digital humanities in languages

12. Style Classification of Rabbinic Literature for Detection of Lost Midrash Tanhuma Material

Shlomo Tannor ; Nachum Dershowitz ; Moshe Lavee.
Midrash collections are complex rabbinic works that consist of text in multiple languages, which evolved through long processes of unstable oral and written transmission. Determining the origin of a given passage in such a compilation is not always straightforward and is often a matter of dispute among scholars, yet it is essential for scholars' understanding of the passage and its relationship to other texts in the rabbinic corpus. To help solve this problem, we propose a system for classification of rabbinic literature based on its style, leveraging recent advances in natural language processing for Hebrew texts. Additionally, we demonstrate how this method can be applied to uncover lost material from a specific midrash genre, Tanḥuma-Yelammedenu, that has been preserved in later anthologies.

13. The Fractality of Sentiment Arcs for Literary Quality Assessment: the Case of Nobel Laureates

Yuri Bizzoni ; Pascale Moreira ; Mads Rosendahl Thomsen ; Kristoffer L. Nielbo.
In the few works that have used NLP to study literary quality, sentiment and emotion analysis have often been considered valuable sources of information. At the same time, the idea that the nature and polarity of the sentiments expressed by a novel might have something to do with its perceived quality seems limited at best. In this paper, we argue that the fractality of narratives, specifically the long-term memory of their sentiment arcs, rather than their simple shape or average valence, might play an important role in the perception of literary quality by a human audience. In particular, we argue that such a measure can help distinguish Nobel-winning writers from control groups in a recent corpus of English language novels. To test this hypothesis, we present the results from two studies: (i) a probability distribution test, where we compute the probability of seeing a title from a Nobel laureate at different levels of arc fractality; (ii) a classification test, where we use several machine learning algorithms to measure the predictive power of both sentiment arcs and their fractality measure. Lastly, we perform another experiment to examine whether arc fractality may be used to distinguish more or less popular works within the Nobel canon itself, looking at the probability of higher GoodReads’ ratings at different levels of arc fractality. Our findings seem to indicate that despite the competitive and complex nature of the task, the populations of Nobel and non-Nobel […]

14. Values That Are Explicitly Present in Fairy Tales: Comparing Samples from German, Italian and Portuguese Traditions

Alba Morollon Diaz-Faes ; Carla Sofia Ribeiro Murteira ; Martin Ruskov.
Looking at how social values are represented in fairy tales can give insights about the variations in communication of values across cultures. We study how values are communicated in fairy tales from Portugal, Italy and Germany using a technique called word embedding with a compass to quantify vocabulary differences and commonalities. We study how these three national traditions differ in their explicit references to values. To do this, we specify a list of value-charged tokens, consider their word stems and analyse the distance between these in a bespoke pre-trained Word2Vec model. We triangulate and critically discuss the validity of the resulting hypotheses emerging from this quantitative model. Our claim is that this is a reusable and reproducible method for the study of the values explicitly referenced in historical corpora. Finally, our preliminary findings hint at a shared cultural understanding and the expression of values such as Benevolence, Conformity, and Universalism across the studied cultures, suggesting the potential existence of a pan-European cultural memory.

15. Towards efficient and reliable utilization of automated data collection: Media scrapers applied to news on climate change

Erkki Mervaala ; Jari Lyytimäki.
Automated data collection provides tempting opportunities for social sciences and humanities studies. Abundant data accumulating in various digital archives allows more comprehensive, timely and cost-efficient ways of harvesting and processing information. While easing or even removing some of the key problems, such as laborious and time-consuming data collection and potential errors and biases related to subjective coding of materials and distortions caused by focus on small samples, automated methods also bring in new risks such as poor understanding of contexts of the data or non-recognition of underlying systematic errors or missing information. Results from testing different methods to collect data describing newspaper coverage of climate change in Finland emphasize that fully relying on automatable tools such as media scrapers has its limitations and can provide comprehensive but incomplete document acquisition for research. Many of these limitations can, however, be addressed and not all of them rely on manual control.

16. Predicting Sustainable Development Goals Using Course Descriptions -- from LLMs to Conventional Foundation Models

Lev Kharlashkin ; Melany Macias ; Leo Huovinen ; Mika Hämäläinen.
We present our work on predicting United Nations sustainable development goals (SDG) for university courses. We use an LLM named PaLM 2 to generate training data given a noisy human-authored course description as input. We use this data to train several different smaller language models to predict SDGs for university courses. This work contributes to better university level adaptation of SDGs. The best performing model in our experiments was BART with an F1-score of 0.786.

17. Perplexity Games: Maoism vs. Literature through the Lens of Cognitive Stylometry

Maciej Kurzynski.
The arrival of large language models (LLMs) has provoked an urgent search for stylistic markers that could differentiate machine text from human text, but while the human-like appearance of machine text has captivated public attention, the reverse phenomenon, human text becoming machine-like, has raised much less concern. This conceptual lag is surprising given the ample historical evidence of state-backed attempts to regulate human thought. The present article proposes a new comparative framework, Perplexity Games, to leverage the predictive power of LLMs and compare the statistical properties of Maospeak, a language style that emerged during Mao Zedong’s era in China (1949-1976), with the style of canonical modern Chinese writers, such as Eileen Chang (1920-1995) and Mo Yan (1955-). The low perplexity of Maospeak, as computed across different GPT models, suggests that the impact of ideologies on language can be compared to likelihood-maximization text-generation techniques which reduce the scope of valid sequence continuations. These findings have cognitive implications: whereas engineered languages such as Maospeak hijack the predictive mechanisms of human cognition by narrowing the space of linguistic possibilities, literature resists such cognitive constraints by dispersing the probability mass over multiple, equally valid paths. Exposure to diverse language data counters the influences of ideologies on our linguistically mediated perceptions of the world and […]
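
Perplexity, the quantity this abstract relies on, is conventionally the exponential of the average negative log-probability a model assigns to each token. The sketch below computes it from a list of per-token probabilities. The probability values are invented for illustration and do not come from the article; in practice they would be read off a language model.

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities, not measured values.
formulaic = [0.9, 0.8, 0.95, 0.85]   # each next token is easy to predict
varied = [0.1, 0.2, 0.05, 0.15]      # probability mass spread over many paths

ppl_formulaic = perplexity(formulaic)
ppl_varied = perplexity(varied)
```

A sequence whose every token has probability 0.25 has perplexity 4, as if the model were choosing uniformly among four continuations. Highly predictable text therefore scores low (`ppl_formulaic`) and less predictable text scores high (`ppl_varied`).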

18. Study on the Domain Adaption of Korean Speech Act using Daily Conversation Dataset and Petition Corpus

Youngsook Song ; Won Ik Cho.
In Korean, quantitative speech act studies have usually been conducted on single utterances with unspecified sources. In this study, we annotate sentences from the National Institute of Korean Language's Messenger Corpus and the National Petition Corpus, as well as example sentences from an academic paper on contemporary Korean vlogging, and check the discrepancy between human annotation and model prediction. In particular, for sentences with differences in locutionary and illocutionary forces, we analyze the causes of errors to see if stylistic features used in a particular domain affect the correct inference of speech act. Through this, we see the necessity to build and analyze a balanced corpus in various text domains, taking into account cases with different usage roles, e.g., messenger conversations belonging to private conversations and petition corpus/vlogging script that have an unspecified audience.
Section: Dataset

19. Normalization of Arabic Dialects into Modern Standard Arabic using BERT and GPT-2

Khalid Alnajjar ; Mika Hämäläinen.
We present an encoder-decoder-based model for normalization of Arabic dialects using both BERT and GPT-2 based models. Arabic is a language of many dialects that differ from Modern Standard Arabic (MSA) not only in terms of pronunciation but also in terms of morphology, grammar and lexical choice. This diversity can be troublesome even to a native Arabic speaker, let alone a computer. Several NLP tools work well for MSA and for some of the main dialects but fail to cover the Arabic language as a whole. Based on our manual evaluation, our model normalizes sentences entirely correctly 46% of the time and almost correctly 26% of the time.

20. Applying computational approaches to energy discourse: a comparative methodological study of rule-based and large language model qualitative content analysis

Ilona Kousa.

21. OCR quality and the resilience of algorithmic identification of linguistic register features in Eighteenth Century Collections Online

Aatu Liimatta.

22. Ainu–Japanese Bi-directional Neural Machine Translation: A Step Towards Linguistic Preservation of Ainu, An Under-Resourced Indigenous Language in Japan

So Miyagawa.
This study presents a groundbreaking approach to preserving the Ainu language, recognized as critically endangered by UNESCO, by developing a bi-directional neural machine translation (MT) system between Ainu and Japanese. Utilizing the Marian MT framework, known for its effectiveness with resource-scarce languages, the research aims to overcome the linguistic complexities inherent in Ainu's polysynthetic structure. The paper delineates a comprehensive methodology encompassing data collection from diverse Ainu text sources, meticulous preprocessing, and the deployment of neural MT models, culminating in the achievement of significant SacreBLEU scores that underscore the models' translation accuracy. The findings illustrate the potential of advanced MT technology to facilitate linguistic preservation and educational endeavors, advocating for integrating such technologies in safeguarding endangered languages. This research not only underscores the critical role of MT in bridging language divides but also sets a precedent for employing computational linguistics to preserve cultural and linguistic heritage.
Section: Digital humanities in languages

23. Sentiment Analysis for Literary Texts: Hemingway as a Case-study

Yuri Bizzoni ; Pascale Feldkamp.

24. On searchable Mordvin corpora at the Language Bank of Finland, EMERALD

Jack Rueter.
Description of Mordvin language corpora development at the Language Bank of Finland.
Section: V. The contribution of corpora

25. Perceptions of 21st-century digital skills and agency among design sprint participants in Laurea UAS, Finland

Asko Mononen.
This explorative study investigated students’ (N=16) perceptions before and after the study unit Digital Analytics and Consumer Insights. The studies were conducted as an intensive hybrid five-day design sprint, a variant of project- and problem-based learning. An online questionnaire with a 5-point Likert scale was used for data collection. The findings indicate that the intervention improved perceptions of most studied digital “hard skills” (8/11 claims). Out of twelve 21st-century “soft skills” claims, perceptions were high initially and improved significantly for critical thinking and systematic problem-solving claims during the design sprint. The agency scores showed a slight improvement but no significant difference. Face-to-face groups would be more willing than online groups to recommend the sprint method to peers. In the era of global turbulence and artificial intelligence, in addition to hard skills, soft skills like communication, teamwork, problem-solving and project management are in demand by employers. According to LinkedIn data in 2/2024, adaptability is the most demanded skill. In addition to traditional subjects, the pedagogical methods in higher education should better support the development of 21st-century skills.

26. Old Permic Universal Dependencies Treebank

Niko Partanen ; Jack Rueter ; Rogier Blokland.
Old Permic, also known as Old Komi, is an extinct variety of Komi that was spoken in the late Middle Ages in the lower Vychegda river basin in Northeastern European Russia, in an area that currently is not Komi-speaking. This language variety is attested in fragmentary records from the 14th to 17th century written both in the Old Permic alphabet and in Cyrillic. These records are of significant importance for research on the history of the Komi language. Here we introduce our attempt towards a new Universal Dependencies treebank that will eventually contain the existing corpus of Old Permic in a structured and CoNLL-U annotated format. This will be the first time this material is being made openly available in digital format, and our contribution describes the current state of the art and remaining challenges.

27. Components of Character: Exploring the Computational Similarity of Austen's Characters

Carolyn Jane Anderson.

28. Machine transliteration of long text with error detection and correction

Mohamed Abdellatif.
Section: I. Historical and linguistic approaches

29. Explaining Humour Style Classifications: An XAI Approach to Understanding Computational Humour Analysis

Mary Ogbuka Kenneth ; Foaad Khosmood ; Abbas Edalat.
Humour styles can have either a negative or a positive impact on well-being. Given the importance of these styles to mental health, significant research has been conducted on their automatic identification. However, the automated machine learning models used for this purpose are black boxes, making their prediction decisions opaque. Clarity and transparency are vital in the field of mental health. This paper presents an explainable AI (XAI) framework for understanding humour style classification, building upon previous work in computational humour analysis. Using the best-performing single model (ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to analyse how linguistic, emotional, and semantic features contribute to humour style classification decisions. Our analysis reveals distinct patterns in how different humour styles are characterised and misclassified, with particular emphasis on the challenges in distinguishing affiliative humour from other styles. Through detailed examination of feature importance, error patterns, and misclassification cases, we identify key factors influencing model decisions, including emotional ambiguity, context misinterpretation, and target identification. The framework demonstrates significant utility in understanding model behaviour, achieving interpretable insights into the complex interplay of features that define different humour styles. Our findings contribute to both the theoretical understanding of computational […]
Section: Digital humanities in languages

30. Exploring Historical Labor Markets: Computational Approaches to Job Title Extraction

Raven Adam ; Klara Venglarova ; Georg Vogeler.
Historical job advertisements provide invaluable insights into the evolution of labor markets and societal dynamics. However, extracting structured information, such as job titles, from these OCRed and unstructured texts presents significant challenges. This study evaluates four distinct computational approaches for job title extraction: a dictionary-based method, a rule-based approach leveraging linguistic patterns, a Named Entity Recognition (NER) model fine-tuned on historical data, and a text generation model designed to rewrite advertisements into structured lists. Our analysis spans multiple versions of the ANNO dataset, including raw OCR, automatically post-corrected, and human-corrected text, as well as an external dataset of German historical job advertisements. Results demonstrate that the NER approach consistently outperforms other methods, showcasing robustness to OCR errors and variability in text quality. The text generation approach performs well on high-quality data but exhibits greater sensitivity to OCR-induced noise. While the rule-based method is less effective overall, it performs relatively well for ambiguous entities. The dictionary-based approach, though limited in precision, remains stable across datasets. This study highlights the impact of text quality on extraction performance and underscores the need for adaptable, generalizable methods. Future work should focus on integrating hybrid approaches, expanding annotated datasets, and improving OCR correction […]

31. Comparing Human-Perceived Cluster Characteristics through the Lens of CIPHE: Measuring Coherence beyond Keywords

Anton Eklund ; Mona Forsman ; Frank Drewes.
A frequent problem in document clustering and topic modeling is the lack of ground truth. Models are typically intended to reflect some aspect of how human readers view texts (the general theme, sentiment, emotional response, etc.), but it can be difficult to assess whether they actually do. The only real ground truth is human judgement. To enable researchers and practitioners to collect such judgement in a cost-efficient standardized way, we have developed the crowdsourcing solution CIPHE (Cluster Interpretation and Precision from Human Exploration). CIPHE is an adaptable framework which systematically gathers and evaluates data on the human perception of a set of document clusters where participants read sample texts from the cluster. In this article, we use CIPHE to study the limitations that keyword-based methods pose in topic modeling coherence evaluation. Keyword methods, including word intrusion, are compared with the outcome of the more thorough CIPHE on scoring and characterizing clusters. The results show how the abstraction of keywords skews the cluster interpretation for almost half of the compared instances, meaning that many important cluster characteristics are missed. Further, we present a case study where CIPHE is used to (a) provide insights into the UK news domain and (b) find out how the evaluated clustering model should be tuned to better suit the intended application. The experiments provide evidence that CIPHE characterizes clusters in a predictable […]

32. Computational Pathways to Intertextuality of the Ancient Indian Literature: A Multi-Method Analysis of the Maitrāyaṇī and Kāṭhaka Saṃhitās

So Miyagawa ; Yuki Kyogoku ; Yuzuki Tsukagoshi ; Kyoko Amano.
This paper examines semantic similarity and intertextuality in selected texts from the Vedic Sanskrit corpus, specifically the Maitrāyaṇī Saṃhitā (MS; Amano 2009) and Kāṭhaka Saṃhitā (KS). Three computational methods are employed: Word2Vec for word embeddings, the stylo package for stylometric analysis, and TRACER for text reuse detection. By comparing various sections of the texts at different granularities, patterns of similarity and structural alignment are uncovered, providing insights into textual relationships and chronology. Word embeddings capture semantic similarities, while stylometric analysis reveals clusters that differentiate the texts. TRACER identifies parallel passages, indicating probable instances of text reuse. Our multi-method analysis corroborates previous philological studies, suggesting that MS.1.9 aligns with later editorial layers, akin to MS.1.7 and KS.9.1. The findings highlight the potential of computational methods in studying ancient Sanskrit literature, complementing traditional approaches, and emphasize that smaller chunk sizes are more effective for detecting intertextual parallels. These approaches expand methodological frontiers in Indology and illuminate new research pathways for analyzing ancient texts.
Rubrique : Visualisation de l'intertextualité et de la réutilisation des textes

33. Out of Context! Managing the Limitations of Context Windows in ChatGPT-4o Text Analyses

Erkki Mervaala ; Ilona Kousa.
In recent years, large language model (LLM) applications have surged in popularity, and academia has followed suit. Researchers frequently seek to automate text annotation (often a tedious task) and, to some extent, text analysis. Notably, popular LLMs such as ChatGPT have been studied as both research assistants and analysis tools, revealing several concerns regarding transparency and the nature of AI-generated content. This study assesses ChatGPT’s usability and reliability for text analysis – specifically keyword extraction and topic classification – within an “out-of-the-box” zero-shot or few-shot context, emphasizing how the size of the context window and varied text types influence the resulting analyses. Our findings indicate that text type and the order in which texts are presented both significantly affect ChatGPT’s analysis. At the same time, context-building tends to be less problematic when analyzing similar texts. However, lengthy texts and documents pose serious challenges: once the context window is exceeded, “hallucinated” results often emerge. While some of these issues stem from the core functioning of LLMs, some can be mitigated through transparent research planning.
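As an illustration of the context-window limit mentioned in this abstract, the following minimal sketch splits a long text into consecutive pieces that each stay under a fixed size budget, so that no single request exceeds the window. It is not the authors' procedure: the function name `split_into_chunks` is hypothetical, and whitespace-separated words are used only as a rough stand-in for model tokens, which a real tokenizer would count differently.

```python
def split_into_chunks(text: str, max_tokens: int) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens words.

    Assumption: whitespace-separated words approximate model tokens.
    A real tokenizer will usually produce more tokens than words.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    # Step through the word list in strides of max_tokens.
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


if __name__ == "__main__":
    for chunk in split_into_chunks("one two three four five", max_tokens=2):
        print(chunk)
```

Chunking of this kind keeps each request within the window, but it also removes cross-chunk context, so any analysis that depends on the order or combination of texts would still need to be planned and documented explicitly.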

34. Evaluating ChatGPT-4 and Machine Learning Models for Sentiment Analysis on a Multi-Script Moroccan Arabic Corpus: Insights, Challenges, and Future Directions

Mohamed HANNANI ; Abdelhadi SOUDI ; Kristof Van Laerhoven.
The application of Large Language Models (LLMs) to low-resource languages and dialects, such as Moroccan Arabic (MA), remains a relatively unexplored area. This study evaluates the performance of ChatGPT-4, fine-tuned BERT models, FastText embeddings, and traditional machine learning approaches for sentiment analysis on MA. Using two publicly available MA datasets—the Moroccan Arabic Corpus (MAC) from X (formerly Twitter) and the Moroccan Arabic YouTube Corpus (MYC)—we assess the ability of these models to detect sentiment across different contexts. Although fine-tuned models performed well, ChatGPT-4 exhibited substantial potential for sentiment analysis, even in zero-shot scenarios. However, performance on MA was generally lower than on Modern Standard Arabic (MSA), attributed to factors such as regional variability, lack of standardization, and limited data availability. Future work should focus on expanding and standardizing MA datasets, as well as developing new methods like combining FastText and BERT embeddings with attention mechanisms to improve performance on this challenging dialect. 
Rubrique : Humanités numériques en langues

35. Stability and change in Icelandic corpora: The case of Stylistic Fronting

Anton Karl Ingason ; Johanna Mechler.
We use three Icelandic corpora (the Icelandic Gigaword Corpus, the Icelandic Parsed Historical Corpus, and the Newspaper Corpus) to investigate the history of a syntactic construction, Stylistic Fronting (SF). SF has long been noted to be associated with formal style and it has received considerable attention in the theoretical syntax literature, but less has been said in the literature about its history throughout the centuries and modern times. We find that use of SF remained stable from the 12th century to the 20th century, but its rate declines in the (late) 20th and 21st centuries. Our analysis furthermore shows that use of SF is only significantly connected to genre in the most recent data, suggesting that the link between SF and style may be a relatively modern innovation. Finally, we test our data for the Constant Rate Effect, revealing that it is present for some grammatical contexts of SF. Our paper is a digital humanities study of historical linguistics which would not be possible without parsed corpora that together span all centuries involved in the change.

36. Interpretable Socioeconomic Profiling: A Deep Dive Beyond Narrative Classification

Nahed Abdelgaber ; Labiba Jahan ; Joshua Oltmanns ; Mehak Gupta ; Jia Zhang.
This work presents an interpretable framework for socioeconomic status (SES) profiling based on narrative data. Building on our previous publication, “AI Assistant for Socioeconomic Empowerment Using Federated Learning” (NLP4DH 2025), this extended study explores a complementary system that focuses on thematic topic modeling, transformer-based embedding comparisons, and visualization tools. The framework analyzes student and public narratives to detect SES-related themes (e.g., financial hardship, resilience, access to resources) and assigns SES profiles through similarity-based scoring. By emphasizing interpretability and topic-based filtering, the system facilitates analysis of language patterns linked to different SES groups while supporting qualitative inspection. Results demonstrate the model’s ability to generalize across diverse domains and align with known social science frameworks, contributing toward responsible and transparent AI in education and public policy contexts.

37. Ambiguity in Crisis: A Multimodal and Synthetic Data Approach to Classification

Sumiko Teng.
Social media platforms, such as Twitter (now X), play a crucial role during crises by enabling real-time information sharing. However, the multimodal data can be ambiguous, with labels misaligned across modalities. Being able to classify tweets as informative or not informative can help in crisis response, yet such tweets can be ambiguous and the classes unbalanced in datasets, impairing model performance. This study explores the effectiveness of multimodal learning approaches for classifying crisis-related tweets regardless of ambiguity, and addresses class imbalance through synthetic data augmentation using generative artificial intelligence (AI). Experimental results demonstrate that multimodal models consistently outperform unimodal ones, particularly on ambiguous tweets where label misalignment between modalities is prevalent. Furthermore, the addition of synthetic data significantly boosts macro F1 scores, indicating improved performance on the minority class.
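To make the link between macro F1 and minority-class performance concrete, here is a minimal sketch of the metric in plain Python. The labels and predictions are invented for illustration and are not taken from the paper; the function name `macro_f1` is likewise an assumption. Macro F1 averages the per-class F1 scores with equal weight, so a classifier that ignores the minority class is penalized even when its accuracy is high.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: 8 informative tweets, 2 not informative (minority class).
y_true = ["informative"] * 8 + ["not_informative"] * 2

# A classifier that always predicts the majority class: 80% accurate,
# but its F1 on the minority class is 0.
always_majority = ["informative"] * 10

# A classifier that recovers one of the two minority examples.
finds_one = ["informative"] * 8 + ["not_informative", "informative"]

if __name__ == "__main__":
    print(round(macro_f1(y_true, always_majority), 3))
    print(round(macro_f1(y_true, finds_one), 3))
```

In this toy setting the second classifier scores higher on macro F1 purely because it recovers part of the minority class, which is the kind of effect the abstract attributes to synthetic data augmentation.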

38. Efficient Toxicity Detection in Gaming Chats: A Comparative Study of Embeddings, Fine-Tuned Transformers and LLMs

Yehor Tereshchenko ; Mika K Hämäläinen.
This paper presents a comprehensive comparative analysis of Natural Language Processing (NLP) methods for automated toxicity detection in online gaming chats. Traditional machine learning models with embeddings, large language models (LLMs) with zero-shot and few-shot prompting, fine-tuned transformer models, and retrieval-augmented generation (RAG) approaches are evaluated. The evaluation framework assesses three critical dimensions: classification accuracy, processing speed, and computational costs. A hybrid moderation system architecture is proposed that optimizes human moderator workload through automated detection and incorporates continuous learning mechanisms. The experimental results demonstrate significant performance variations across methods, with fine-tuned DistilBERT achieving optimal accuracy-cost trade-offs. The findings provide empirical evidence for deploying cost-effective, efficient content moderation systems in dynamic online gaming environments.

39. Beyond Style: Rethinking Computational Fanfiction Research

Mia Jacobsen ; Ross Deans Kristensen-McLachlan.
Computational studies of fanfiction have gained traction in recent years, due to both the abundance of data and the development of contemporary NLP methods and tools. In this position paper, we outline the predominant themes and findings of previous studies of fanfiction and propose fruitful suggestions for future research. Specifically, we identify two primary ways that fanfiction has been approached from a computational perspective: one concerning the style of successful or popular fanfiction; the other concerning gender and power dynamics in fanfiction texts. This existing research, however, has only begun to grapple with the complexities and challenges of working with fan-produced content. We argue that online fanfiction is a complex and, in many ways, unique cultural phenomenon which requires new ways of thinking about the motivations and purposes of textual production. Fanfiction is a dynamic, community-produced, transformative genre, and as long as research neglects this whole picture, studies will remain underdeveloped and insufficient to answer meaningful questions. Furthermore, these new ways of approaching fanfiction need to be based on ethical archiving and research practices that are rooted in the theory from qualitative research, and which show ethical care for the people who create and engage with fanfiction.