So Miyagawa ; Yuki Kyogoku ; Yuzuki Tsukagoshi ; Kyoko Amano.
This paper examines semantic similarity and intertextuality in selected texts from the Vedic Sanskrit corpus, specifically the Maitrāyaṇī Saṃhitā (MS; Amano 2009) and Kāṭhaka Saṃhitā (KS). Three computational methods are employed: Word2Vec for word embeddings, the stylo package for stylometric analysis, and TRACER for text reuse detection. By comparing various sections of the texts at different granularities, patterns of similarity and structural alignment are uncovered, providing insights into textual relationships and chronology. Word embeddings capture semantic similarities, while stylometric analysis reveals clusters that differentiate the texts. TRACER identifies parallel passages, indicating probable instances of text reuse. Our multi-method analysis corroborates previous philological studies, suggesting that MS.1.9 aligns with later editorial layers, akin to MS.1.7 and KS.9.1. The findings highlight the potential of computational methods in studying ancient Sanskrit literature, complementing traditional approaches, and emphasize that smaller chunk sizes are more effective for detecting intertextual parallels. These approaches expand methodological frontiers in Indology and illuminate new research pathways for analyzing ancient texts.
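As a hedged illustration of the embedding step only (not the authors' full pipeline, which also uses stylo and TRACER), the sketch below trains a gensim Word2Vec model on a whitespace-tokenized corpus file; the file name, token forms, and hyperparameters are assumptions for demonstration.

```python
# Minimal sketch of the Word2Vec step with gensim; file name,
# tokenization, and hyperparameters are illustrative, not the
# authors' actual configuration.
from gensim.models import Word2Vec

# Each line of the (hypothetical) file holds one whitespace-tokenized
# sentence, e.g. from a transliterated MS or KS section.
with open("vedic_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window size
    min_count=2,      # drop hapax legomena
    sg=1,             # skip-gram, often preferred for small corpora
    epochs=50,
)

# Nearest neighbours in embedding space approximate semantic similarity.
print(model.wv.most_similar("agni", topn=10))
```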
Section: Visualization of intertextuality and text reuse
Mohamed Abdellatif.
Section: I. Historical and linguistic approaches
Anton Eklund ; Mona Forsman ; Frank Drewes.
A frequent problem in document clustering and topic modeling is the lack of ground truth. Models are typically intended to reflect some aspect of how human readers view texts (the general theme, sentiment, emotional response, etc.), but it can be difficult to assess whether they actually do. The only real ground truth is human judgement. To enable researchers and practitioners to collect such judgement in a cost-efficient, standardized way, we have developed the crowdsourcing solution CIPHE -- Cluster Interpretation and Precision from Human Exploration. CIPHE is an adaptable framework that systematically gathers and evaluates data on the human perception of a set of document clusters, where participants read sample texts from the cluster. In this article, we use CIPHE to study the limitations of keyword-based methods in topic modeling coherence evaluation. Keyword methods, including word intrusion, are compared with the more thorough CIPHE on scoring and characterizing clusters. The results show how the abstraction of keywords skews the cluster interpretation for almost half of the compared instances, meaning that many important cluster characteristics are missed. Further, we present a case study where CIPHE is used to (a) provide insights into the UK news domain and (b) find out how the evaluated clustering model should be tuned to better suit the intended application. The experiments provide evidence that CIPHE characterizes clusters in a predictable […]
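For context, a word-intrusion item, the keyword-based baseline that CIPHE is compared against, can be built in a few lines; the topic keyword lists below are hypothetical placeholders, not the paper's data.

```python
# Sketch of constructing a word-intrusion item: k top words from one
# topic plus one high-probability "intruder" from another topic.
# Keyword lists are invented examples, not the paper's clusters.
import random

def intrusion_item(topic_keywords, other_topic_keywords, k=5, seed=0):
    """Return k topic words plus one intruder, shuffled together."""
    rng = random.Random(seed)
    words = topic_keywords[:k]
    intruder = next(w for w in other_topic_keywords if w not in words)
    item = words + [intruder]
    rng.shuffle(item)
    return item, intruder

topic_a = ["election", "vote", "party", "poll", "candidate", "ballot"]
topic_b = ["goal", "match", "league", "striker", "referee"]
item, intruder = intrusion_item(topic_a, topic_b)
print(item, "-> intruder:", intruder)
# If annotators cannot spot the intruder, the topic is judged incoherent.
```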
Erkki Mervaala ; Ilona Kousa.
In recent years, large language model (LLM) applications have surged in popularity, and academia has followed suit. Researchers frequently seek to automate text annotation – often a tedious task – and, to some extent, text analysis. Notably, popular LLMs such as ChatGPT have been studied as both research assistants and analysis tools, revealing several concerns regarding transparency and the nature of AI-generated content. This study assesses ChatGPT’s usability and reliability for text analysis – specifically keyword extraction and topic classification – within an “out-of-the-box” zero-shot or few-shot context, emphasizing how the size of the context window and varied text types influence the resulting analyses. Our findings indicate that text type and the order in which texts are presented both significantly affect ChatGPT’s analysis. At the same time, context-building tends to be less problematic when analyzing similar texts. However, lengthy texts and documents pose serious challenges: once the context window is exceeded, “hallucinated” results often emerge. While some of these issues stem from the core functioning of LLMs, some can be mitigated through transparent research planning.
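A minimal sketch of the kind of zero-shot keyword-extraction call evaluated here, using the OpenAI Python client; the model name, prompt wording, and temperature are assumptions, not the study's exact configuration.

```python
# Hedged sketch of a zero-shot keyword-extraction request; requires
# the OPENAI_API_KEY environment variable. Model choice and prompt
# are placeholders, not the study's setup.
from openai import OpenAI

client = OpenAI()

def extract_keywords(text: str, n: int = 5) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model; placeholder choice
        messages=[
            {"role": "system",
             "content": "You extract keywords. Reply with a comma-separated list only."},
            {"role": "user",
             "content": f"Extract the {n} most important keywords:\n\n{text}"},
        ],
        temperature=0,  # reduce run-to-run variance in annotations
    )
    return response.choices[0].message.content

# Texts longer than the model's context window should be chunked first,
# since overflow is exactly where the study observes hallucinations.
print(extract_keywords("Climate policy debates intensified after the summit..."))
```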
Mary Ogbuka Kenneth ; Foaad Khosmood ; Abbas Edalat.
Humour styles can have either a negative or a positive impact on well-being. Given the importance of these styles to mental health, significant research has been conducted on their automatic identification. However, the automated machine learning models used for this purpose are black boxes, making their prediction decisions opaque. Clarity and transparency are vital in the field of mental health. This paper presents an explainable AI (XAI) framework for understanding humour style classification, building upon previous work in computational humour analysis. Using the best-performing single model (ALI+XGBoost) from prior research, we apply comprehensive XAI techniques to analyse how linguistic, emotional, and semantic features contribute to humour style classification decisions. Our analysis reveals distinct patterns in how different humour styles are characterised and misclassified, with particular emphasis on the challenges in distinguishing affiliative humour from other styles. Through detailed examination of feature importance, error patterns, and misclassification cases, we identify key factors influencing model decisions, including emotional ambiguity, context misinterpretation, and target identification. The framework demonstrates significant utility in understanding model behaviour, achieving interpretable insights into the complex interplay of features that define different humour styles. Our findings contribute to both the theoretical understanding of computational […]
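As a hedged illustration of tree-model explainability in this spirit, the sketch below fits an XGBoost classifier on synthetic stand-in features and ranks them with SHAP; it is not the authors' ALI+XGBoost pipeline, and the features are invented.

```python
# Sketch of applying SHAP to an XGBoost classifier. Synthetic features
# stand in for linguistic/emotional/semantic features; the label is a
# toy stand-in for a humour-style class.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                  # 4 stand-in features
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)  # toy binary label

model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Mean |SHAP| per feature gives a global importance ranking, the kind
# of evidence used to explain classification decisions.
print(np.abs(shap_values).mean(axis=0))
```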
Section: Digital humanities in languages
Carolyn Jane Anderson.
Mohamed Hannani ; Abdelhadi Soudi ; Kristof Van Laerhoven.
The application of Large Language Models (LLMs) to low-resource languages and dialects, such as Moroccan Arabic (MA), remains a relatively unexplored area. This study evaluates the performance of ChatGPT-4, fine-tuned BERT models, FastText embeddings, and traditional machine learning approaches for sentiment analysis on MA. Using two publicly available MA datasets—the Moroccan Arabic Corpus (MAC) from X (formerly Twitter) and the Moroccan Arabic YouTube Corpus (MYC)—we assess the ability of these models to detect sentiment across different contexts. Although fine-tuned models performed well, ChatGPT-4 exhibited substantial potential for sentiment analysis, even in zero-shot scenarios. However, performance on MA was generally lower than on Modern Standard Arabic (MSA), attributed to factors such as regional variability, lack of standardization, and limited data availability. Future work should focus on expanding and standardizing MA datasets, as well as developing new methods like combining FastText and BERT embeddings with attention mechanisms to improve performance on this challenging dialect.
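A minimal sketch of transformer-based sentiment inference of the kind compared here, via the Hugging Face pipeline API; the public checkpoint is a stand-in, since the models fine-tuned on MAC/MYC are not named in the abstract.

```python
# Hedged sketch of BERT-style sentiment inference with transformers.
# The checkpoint is a generic public multilingual sentiment model used
# as a stand-in; for MA one would fine-tune on MAC/MYC first.
from transformers import pipeline

clf = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

# Moroccan Arabic (Darija) example in Arabic script:
# roughly "This movie is very nice."
print(clf("هاد الفيلم زوين بزاف"))
```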
Section: Digital humanities in languages
Raven Adam ; Klara Venglarova ; Georg Vogeler.
Historical job advertisements provide invaluable insights into the evolution of labor markets and societal dynamics. However, extracting structured information, such as job titles, from these OCRed and unstructured texts presents significant challenges. This study evaluates four distinct computational approaches for job title extraction: a dictionary-based method, a rule-based approach leveraging linguistic patterns, a Named Entity Recognition (NER) model fine-tuned on historical data, and a text generation model designed to rewrite advertisements into structured lists. Our analysis spans multiple versions of the ANNO dataset, including raw OCR, automatically post-corrected, and human-corrected text, as well as an external dataset of German historical job advertisements. Results demonstrate that the NER approach consistently outperforms other methods, showcasing robustness to OCR errors and variability in text quality. The text generation approach performs well on high-quality data but exhibits greater sensitivity to OCR-induced noise. While the rule-based method is less effective overall, it performs relatively well for ambiguous entities. The dictionary-based approach, though limited in precision, remains stable across datasets. This study highlights the impact of text quality on extraction performance and underscores the need for adaptable, generalizable methods. Future work should focus on integrating hybrid approaches, expanding annotated datasets, and improving OCR correction […]
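As a hedged sketch of the NER-style extraction step, the snippet below runs a Hugging Face token-classification pipeline over a sample advertisement; the checkpoint and label scheme are placeholders, since the paper fine-tunes its own model on historical German data.

```python
# Sketch of NER-based job title extraction. The base checkpoint here
# has no task head; in practice one would fine-tune it with a
# JOB_TITLE label on annotated historical ads before inference.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dbmdz/bert-base-german-cased",  # stand-in; fine-tune first
    aggregation_strategy="simple",         # merge word pieces into spans
)

ad = "Gesucht wird ein tüchtiger Buchhalter für ein Wiener Handelshaus."
for ent in ner(ad):
    print(ent["entity_group"], ent["word"], round(ent["score"], 2))
```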