Hanna Martikainen.
While the productivity gains brought about by machine translation (MT) can help translators meet ever-tighter deadlines and respond to pressing demands for publishing content simultaneously in different languages, these tools also impose a workflow that tends to reduce the human translator's role to simply correcting mistakes made by the machine in a one-way process with no real interaction. Thus, although more cost-effective, post-editing of MT output also appears to be a less creative and enjoyable task than translation. Adaptive MT, on the other hand, has been advertised as a way to recenter the translation process on the human and foster more genuine interaction with the machine. Said to have been developed for professional translation workflows, the technology enables a dynamic work process that is supposedly very different from the repetitive task that post-editing static MT output can be. This paper presents an experiment with adaptive MT conducted during the 2020-2021 academic year. As part of a course on MT and post-editing, second-year master's students conducted group projects on the Lilt platform. In this paper, students' views on the MT engine are analyzed, with a focus on their interaction with the technology. While students recognize the potential of adaptive MT for empowering the human in the loop, MT quality and CAT ergonomics in general appear to have a greater influence on usability than interaction with the machine.
Section: III. Human translation vs. machine translation
Anton Eklund ; Mona Forsman ; Frank Drewes.
A frequent problem in document clustering and topic modeling is the lack of ground truth. Models are typically intended to reflect some aspect of how human readers view texts (the general theme, sentiment, emotional response, etc.), but it can be difficult to assess whether they actually do. The only real ground truth is human judgement. To enable researchers and practitioners to collect such judgement in a cost-efficient standardized way, we have developed the crowdsourcing solution CIPHE -- Cluster Interpretation and Precision from Human Exploration. CIPHE is an adaptable framework which systematically gathers and evaluates data on the human perception of a set of document clusters where participants read sample texts from the cluster. In this article, we use CIPHE to study the limitations that keyword-based methods pose in topic modeling coherence evaluation. Keyword methods, including word intrusion, are compared with the outcome of the more thorough CIPHE on scoring and characterizing clusters. The results show how the abstraction of keywords skews the cluster interpretation for almost half of the compared instances, meaning that many important cluster characteristics are missed. Further, we present a case study where CIPHE is used to (a) provide insights into the UK news domain and (b) find out how the evaluated clustering model should be tuned to better suit the intended application. The experiments provide evidence that CIPHE characterizes clusters in a predictable […]
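As a point of reference for the keyword-based baselines CIPHE is compared against, the classic word-intrusion task can be sketched in a few lines. The topic words, intruder, and annotator responses below are invented for illustration:

```python
import random

def word_intrusion_item(topic_words, intruder, seed=0):
    """One word-intrusion trial: top topic keywords plus an out-of-topic intruder, shuffled."""
    options = list(topic_words) + [intruder]
    random.Random(seed).shuffle(options)
    return options

def model_precision(responses, intruder):
    """Share of annotators who spotted the intruder; higher suggests a more coherent topic."""
    return sum(1 for r in responses if r == intruder) / len(responses)

item = word_intrusion_item(["game", "team", "season", "coach", "league"], "molecule")
assert "molecule" in item and len(item) == 6

# Four of five hypothetical annotators identify the intruder.
score = model_precision(["molecule", "molecule", "game", "molecule", "molecule"], "molecule")
assert abs(score - 0.8) < 1e-9
```

Note that this evaluation sees only the keywords, never a sample document: exactly the abstraction that, per the abstract, skews cluster interpretation in nearly half the compared instances.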
Erkki Mervaala ; Ilona Kousa.
In recent years, large language model (LLM) applications have surged in popularity, and academia has followed suit. Researchers frequently seek to automate text annotation – often a tedious task – and, to some extent, text analysis. Notably, popular LLMs such as ChatGPT have been studied as both research assistants and analysis tools, revealing several concerns regarding transparency and the nature of AI-generated content. This study assesses ChatGPT’s usability and reliability for text analysis – specifically keyword extraction and topic classification – within an “out-of-the-box” zero-shot or few-shot context, emphasizing how the size of the context window and varied text types influence the resulting analyses. Our findings indicate that text type and the order in which texts are presented both significantly affect ChatGPT’s analysis. At the same time, context-building tends to be less problematic when analyzing similar texts. However, lengthy texts and documents pose serious challenges: once the context window is exceeded, “hallucinated” results often emerge. While some of these issues stem from the core functioning of LLMs, some can be mitigated through transparent research planning.
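The context-window problem the abstract describes is commonly mitigated by chunking long documents before sending them to the model. A minimal sketch, where whitespace tokenization stands in for the model's real tokenizer (an assumption of this sketch, not the study's method):

```python
def chunk_text(text, max_tokens, overlap=0):
    """Greedy chunking so each piece stays within a fixed token budget.

    `overlap` repeats the tail of one chunk at the head of the next,
    preserving some local context across chunk boundaries.
    """
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks

# A 10-token document split into windows of 4 tokens with 1 token of overlap.
chunks = chunk_text(" ".join(str(i) for i in range(10)), max_tokens=4, overlap=1)
assert len(chunks) == 3
assert all(len(c.split()) <= 4 for c in chunks)
```

Each chunk is then analyzed in its own request, so no single prompt exceeds the window; the trade-off is that the model loses whatever context falls outside the current chunk.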
Carolyn Jane Anderson.
Mary Ogbuka Kenneth ; Foaad Khosmood ; Abbas Edalat.
Humour styles can have either a negative or a positive impact on well-being. Given the importance of these styles to mental health, significant research has been conducted on their automatic identification. However, the automated machine learning models used for this purpose are black boxes, making their prediction decisions opaque. Clarity and transparency are vital in the field of mental health. This paper presents an explainable AI (XAI) framework for understanding humour style classification, building upon previous work in computational humour analysis. Using the best-performing single model (ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to analyse how linguistic, emotional, and semantic features contribute to humour style classification decisions. Our analysis reveals distinct patterns in how different humour styles are characterised and misclassified, with particular emphasis on the challenges in distinguishing affiliative humour from other styles. Through detailed examination of feature importance, error patterns, and misclassification cases, we identify key factors influencing model decisions, including emotional ambiguity, context misinterpretation, and target identification. The framework demonstrates significant utility in understanding model behaviour, achieving interpretable insights into the complex interplay of features that define different humour styles. Our findings contribute to both the theoretical understanding of computational […]
Section: Digital humanities in languages
Mohamed Hannani ; Abdelhadi Soudi ; Kristof Van Laerhoven.
The application of Large Language Models (LLMs) to low-resource languages and dialects, such as Moroccan Arabic (MA), remains a relatively unexplored area. This study evaluates the performance of ChatGPT-4, fine-tuned BERT models, FastText embeddings, and traditional machine learning approaches for sentiment analysis on MA. Using two publicly available MA datasets—the Moroccan Arabic Corpus (MAC) from X (formerly Twitter) and the Moroccan Arabic YouTube Corpus (MYC)—we assess the ability of these models to detect sentiment across different contexts. Although fine-tuned models performed well, ChatGPT-4 exhibited substantial potential for sentiment analysis, even in zero-shot scenarios. However, performance on MA was generally lower than on Modern Standard Arabic (MSA), attributed to factors such as regional variability, lack of standardization, and limited data availability. Future work should focus on expanding and standardizing MA datasets, as well as developing new methods like combining FastText and BERT embeddings with attention mechanisms to improve performance on this challenging dialect.
Section: Digital humanities in languages
Mohamed Abdellatif.
Section: I. Historical and linguistic approaches
So Miyagawa ; Yuki Kyogoku ; Yuzuki Tsukagoshi ; Kyoko Amano.
This paper examines semantic similarity and intertextuality in selected texts from the Vedic Sanskrit corpus, specifically the Maitrāyaṇī Saṃhitā (MS; Amano 2009) and Kāṭhaka Saṃhitā (KS). Three computational methods are employed: Word2Vec for word embeddings, the stylo package for stylometric analysis, and TRACER for text reuse detection. By comparing various sections of the texts at different granularities, patterns of similarity and structural alignment are uncovered, providing insights into textual relationships and chronology. Word embeddings capture semantic similarities, while stylometric analysis reveals clusters that differentiate the texts. TRACER identifies parallel passages, indicating probable instances of text reuse. Our multi-method analysis corroborates previous philological studies, suggesting that MS.1.9 aligns with later editorial layers, akin to MS.1.7 and KS.9.1. The findings highlight the potential of computational methods in studying ancient Sanskrit literature, complementing traditional approaches, and emphasize that smaller chunk sizes are more effective for detecting intertextual parallels. These approaches expand methodological frontiers in Indology and illuminate new research pathways for analyzing ancient texts.
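The word-embedding comparison underlying such semantic-similarity claims reduces to cosine similarity between vectors. A self-contained sketch with toy three-dimensional vectors (the values are illustrative, not trained Vedic embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity: the angle-based measure used to compare word embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: two ritual-domain words should sit closer to each other
# than either does to an unrelated word.
ritual_a = [0.9, 0.1, 0.2]
ritual_b = [0.8, 0.2, 0.1]
unrelated = [0.1, 0.9, 0.3]

assert cosine(ritual_a, ritual_b) > cosine(ritual_a, unrelated)
```

In practice the vectors come from a Word2Vec model trained on the corpus, and passages are compared by aggregating such word-level similarities; the metric itself is this simple.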
Section: Visualization of intertextuality and text reuse
Raven Adam ; Klara Venglarova ; Georg Vogeler.
Historical job advertisements provide invaluable insights into the evolution of labor markets and societal dynamics. However, extracting structured information, such as job titles, from these OCRed and unstructured texts presents significant challenges. This study evaluates four distinct computational approaches for job title extraction: a dictionary-based method, a rule-based approach leveraging linguistic patterns, a Named Entity Recognition (NER) model fine-tuned on historical data, and a text generation model designed to rewrite advertisements into structured lists. Our analysis spans multiple versions of the ANNO dataset, including raw OCR, automatically post-corrected, and human-corrected text, as well as an external dataset of German historical job advertisements. Results demonstrate that the NER approach consistently outperforms other methods, showcasing robustness to OCR errors and variability in text quality. The text generation approach performs well on high-quality data but exhibits greater sensitivity to OCR-induced noise. While the rule-based method is less effective overall, it performs relatively well for ambiguous entities. The dictionary-based approach, though limited in precision, remains stable across datasets. This study highlights the impact of text quality on extraction performance and underscores the need for adaptable, generalizable methods. Future work should focus on integrating hybrid approaches, expanding annotated datasets, and improving OCR correction […]
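The dictionary-based baseline is the easiest of the four approaches to make concrete. A minimal sketch with an invented four-entry lexicon and simple normalization of the umlauts that OCR often mangles (a real lexicon and normalization table would be far larger):

```python
import re

# Tiny illustrative dictionary of normalized historical German job titles.
JOB_TITLES = {"lehrling", "schlosser", "koechin", "dienstmaedchen"}

def normalize(token):
    """Lowercase and fold umlauts/ß into ASCII digraphs for robust lookup."""
    token = token.lower()
    for src, dst in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        token = token.replace(src, dst)
    return token

def extract_titles(ad_text):
    """Return every surface token whose normalized form is in the lexicon."""
    tokens = re.findall(r"[A-Za-zÄÖÜäöüß]+", ad_text)
    return [t for t in tokens if normalize(t) in JOB_TITLES]

assert extract_titles("Tüchtiger Schlosser und eine Köchin gesucht.") == ["Schlosser", "Köchin"]
```

The sketch also makes the method's reported weakness visible: any OCR corruption or spelling variant that normalization does not anticipate simply fails to match, which is consistent with the dictionary approach's limited precision but stable behavior across datasets.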
Ziyang Lin ; Xiaoru Wu ; Minxuan Feng ; Yi He.
The fine traditional culture of the Chinese nation is often embedded in cultural nouns. Building a knowledge base of them helps to better transmit these outstanding cultural heritages and provides support for compulsory education. This paper clarifies the definition of Chinese traditional cultural nouns and selects 3,512 representative nouns of them. Hownet is utilized to categorize these nouns and construct classified thesaurus. Furthermore, using CIDOC-CRM, the interpretations of these nouns are processed, and the associations between different categories of nouns are presented in a knowledge map. On this basis, a knowledge base of Chinese traditional cultural nouns has been constructed. Quantitative analysis of Chinese traditional cultural nouns reveals that the number and difficulty of Chinese traditional cultural nouns to be mastered by students show an increasing trend with the growth of the school year, which is basically in line with the cognitive characteristics of adolescents. This means that the knowledge base constructed based on the digital humanities approach contributes to cultural education at the compulsory level.
Section: Digital humanities in languages
Sumiko Teng.
Social media platforms, such as Twitter (now X), play a crucial role during crises by enabling real-time information sharing. However, multimodal data can be ambiguous when labels are misaligned across modalities. Classifying tweets as informative or non-informative can aid crisis response, yet such tweets are often ambiguous and unbalanced in datasets, impairing model performance. This study explores the effectiveness of multimodal learning approaches for classifying crisis-related tweets despite this ambiguity, and for addressing class imbalance through synthetic data augmentation using generative artificial intelligence (AI). Experimental results demonstrate that multimodal models consistently outperform unimodal ones, particularly on ambiguous tweets where label misalignment between modalities is prevalent. Furthermore, the addition of synthetic data significantly boosts macro F1 scores, indicating improved performance on the minority class.
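Macro F1, the metric the gains are reported on, weights every class equally, which is why lifting the minority class moves the score so much. A stdlib sketch of the computation (the toy labels are invented for illustration):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: the minority class counts as much as the majority."""
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Predicting only the majority class reaches 80% accuracy here,
# yet macro F1 exposes the ignored minority class: (8/9 + 0) / 2.
assert abs(macro_f1([1, 1, 1, 1, 0], [1, 1, 1, 1, 1]) - 4 / 9) < 1e-9
```

A model that starts recovering minority-class instances, e.g. after synthetic augmentation rebalances the training data, raises the minority-class F1 from zero and with it the macro average, even if overall accuracy barely changes.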
Yehor Tereshchenko ; Mika K Hämäläinen.
This paper presents a comprehensive comparative analysis of Natural Language Processing (NLP) methods for automated toxicity detection in online gaming chats. Traditional machine learning models with embeddings, large language models (LLMs) with zero-shot and few-shot prompting, fine-tuned transformer models, and retrieval-augmented generation (RAG) approaches are evaluated. The evaluation framework assesses three critical dimensions: classification accuracy, processing speed, and computational costs. A hybrid moderation system architecture is proposed that optimizes human moderator workload through automated detection and incorporates continuous learning mechanisms. The experimental results demonstrate significant performance variations across methods, with fine-tuned DistilBERT achieving optimal accuracy-cost trade-offs. The findings provide empirical evidence for deploying cost-effective, efficient content moderation systems in dynamic online gaming environments.
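The proposed hybrid architecture, automated action at the confident extremes with humans handling the uncertain middle band, can be sketched as a threshold router. The thresholds below are invented for illustration, not taken from the paper:

```python
def route(toxicity_score, auto_remove=0.95, auto_allow=0.20):
    """Confidence-based routing for a hybrid moderation pipeline.

    High-confidence scores are actioned automatically; only the
    uncertain band is queued for a human moderator, reducing workload.
    """
    if toxicity_score >= auto_remove:
        return "remove"
    if toxicity_score <= auto_allow:
        return "allow"
    return "human_review"

assert route(0.05) == "allow"          # clearly benign chat message
assert route(0.99) == "remove"         # clearly toxic, auto-actioned
assert route(0.60) == "human_review"   # ambiguous, escalated
```

Widening the human-review band trades moderator workload for safety; the paper's continuous-learning mechanism would additionally feed the human decisions back as training data to shrink that band over time.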
Mia Jacobsen ; Ross Deans Kristensen-McLachlan.
Computational studies of fanfiction have gained traction in recent years, due to both the abundance of data and the development of contemporary NLP methods and tools. In this position paper, we outline the predominant themes and findings of previous studies of fanfiction and propose fruitful suggestions for future research. Specifically, we identify two primary ways that fanfiction has been approached from a computational perspective: one concerning the style of successful or popular fanfiction; the other concerning gender and power dynamics in fanfiction texts. This existing research, however, has only begun to grapple with the complexities and challenges of working with fan-produced content. We argue that online fanfiction is a complex and, in many ways, unique cultural phenomenon which requires new ways of thinking about the motivations and purposes of textual production. Fanfiction is a dynamic, community-produced, transformative genre, and as long as research neglects this whole picture, studies will remain underdeveloped and insufficient to answer meaningful questions. Furthermore, these new ways of approaching fanfiction need to be based on ethical archiving and research practices that are rooted in the theory from qualitative research, and which show ethical care for the people who create and engage with fanfiction.
Anton Karl Ingason ; Johanna Mechler.
We use Icelandic corpora, the Icelandic Gigaword Corpus, the Icelandic Parsed Historical Corpus, and the Newspaper Corpus, in order to investigate the history of a syntactic construction, Stylistic Fronting (SF). SF has long been noted to be associated with formal style and it has received considerable attention in the theoretical syntax literature, but less has been said about its history through the centuries and into modern times. We find that use of SF remained stable from the 12th century to the 20th century, but its rate declines in the (late) 20th and 21st century. Our analysis furthermore shows that use of SF is only significantly connected to genre in the most recent data, suggesting that the link between SF and style may be a relatively modern innovation. Finally, we test our data for the Constant Rate Effect, revealing that it is present for some grammatical contexts of SF. Our paper is a digital humanities study of historical linguistics which would not be possible without parsed corpora that together span all centuries involved in the change.
Nahed Abdelgaber ; Labiba Jahan ; Joshua Oltmanns ; Mehak Gupta ; Jia Zhang.
This work presents an interpretable framework for socioeconomic status (SES) profiling based on narrative data. Building on our previous publication, “AI Assistant for Socioeconomic Empowerment Using Federated Learning” (NLP4DH 2025), this extended study explores a complementary system that focuses on thematic topic modeling, transformer-based embedding comparisons, and visualization tools. The framework analyzes student and public narratives to detect SES-related themes (e.g., financial hardship, resilience, access to resources) and assigns SES profiles through similarity-based scoring. By emphasizing interpretability and topic-based filtering, the system facilitates analysis of language patterns linked to different SES groups while supporting qualitative inspection. Results demonstrate the model’s ability to generalize across diverse domains and align with known social science frameworks, contributing toward responsible and transparent AI in education and public policy contexts.
Niko Partanen ; Jack Rueter.
This study investigates how well a recent map of the Uralic languages, which also covers the Erzya and Moksha languages in detail, corresponds to historical speaker data. We discuss our point of view on linguistic cartography more generally, but especially within the context of the Uralic languages, and address various difficulties that arise in defining speaker-area boundaries and in choosing which settlements should be counted among the traditional or contemporary speech communities. As points of comparison, we use the historical data of Heikki Paasonen, which we believe is a highly reliable indicator of at least some areas that should be included in the traditional distributions of these languages. This data is contrasted with the contemporary language maps.