2024


1. ArchEthno - a new tool for sharing research materials and a new method for archiving your own research

Florence Weber ; Carlo Zwölf ; Arnaud Trouche ; Agnès Tricoche ; José Sastre.
The archiving of ethnographic material is generally considered a blind spot in ethnographic working methods which place more importance on actual investigations and analysis than on how archives are constructed. A team of computer scientists and ethnographers has built an initial tool for sharing ethnographic materials, based on an SQL relational data model that suited the first survey processed but proved difficult to transpose to other surveys. The team developed a new tool based on dynamic vocabularies of concepts which breaks down archiving into three stages. Firstly ethnographers can select and contextualise their survey materials; secondly they structure them in a database according to the research question discovered during their survey; finally, they share this data with other researchers subject to the opinion of an ethics committee whose members are competent in ethnography.
Section: Data deluge: which skills for wich data?

2. Incorporating Crowdsourced Annotator Distributions into Ensemble Modeling to Improve Classification Trustworthiness for Ancient Greek Papyri

Graham West ; Matthew I. Swindall ; Ben Keener ; Timothy Player ; Alex C. Williams ; James H. Brusuelas ; John F. Wallin.
Performing classification on noisy, crowdsourced image datasets can prove challenging even for the best neural networks. Two issues which complicate the problem on such datasets are class imbalance and ground-truth uncertainty in labeling. The AL-ALL and AL-PUB datasets - consisting of tightly cropped, individual characters from images of ancient Greek papyri - are strongly affected by both issues. The application of ensemble modeling to such datasets can help identify images where the ground-truth is questionable and quantify the trustworthiness of those samples. As such, we apply stacked generalization consisting of nearly identical ResNets with different loss functions: one utilizing sparse cross-entropy (CXE) and the other Kullback-Liebler Divergence (KLD). Both networks use labels drawn from a crowd-sourced consensus. This consensus is derived from a Normalized Distribution of Annotations (NDA) based on all annotations for a given character in the dataset. For the second network, the KLD is calculated with respect to the NDA. For our ensemble model, we apply a k-nearest neighbors model to the outputs of the CXE and KLD networks. Individually, the ResNet models have approximately 93% accuracy, while the ensemble model achieves an accuracy of > 95%, increasing the classification trustworthiness. We also perform an analysis of the Shannon entropy of the various models' output distributions to measure classification uncertainty. Our results suggest that entropy is […]
Section: Digital humanities in languages

3. Toward Automatic Typography Analysis: Serif Classification and Font Similarities

Syed Talal Wasim ; Romain Collaud ; Lara Défayes ; Nicolas Henchoz ; Mathieu Salzmann ; Delphine Ribes Lemay.
Whether a document is of historical or contemporary significance, typography plays a crucial role in its composition. From the early days of modern printing, typographic techniques have evolved and transformed, resulting in changes to the features of typography. By analyzing these features, we can gain insights into specific time periods, geographical locations, and messages conveyed through typography. Therefore, in this paper, we aim to investigate the feasibility of training a model to classify serif typeswithout knowledge of the font and character. We also investigate how to train a vectorial-based image model able to group together fonts with similar features. Specifically, we compare the use of state-of-theart image classification methods, such as the EfficientNet-B2 and the Vision Transformer Base model with different patch sizes, and the state-of-the-art fine-grained image classification method, TransFG, on the serif classification task. We also evaluate the use of the DeepSVG model to learn to group fonts with similar features. Our investigation reveals that fine-grained image classification methods are better suited for the serif classification tasks and that leveraging the character labels helps to learn more meaningful font similarities.This repository contains: - Paper published in the Journal of data mining and digital humanities:WasimEtAl_Toward_Automatic_Typography_Analysis__Serif_Classification_and_Font_Similarities.pdf - Two datasets: The first […]
Section: Project presentations

4. Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Reusing Ground Truth Data, Referencing Models, and Acknowledging Contributions. Starting the Conversation on How We Could Get It Done

C. Annemieke Romein ; Tobias Hodel ; Femke Gordijn ; Joris J. van Zundert ; Alix Chagué ; Milan van Lange ; Helle Strandgaard Jensen ; Andy Stauder ; Jake Purcell ; Melissa M. Terras et al.
This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition infrastructures, as well as ways to reference and acknowledge contributions to the creation and enrichment of data within these systems. We discuss how one can place Ground Truth data in a repository and, subsequently, inform others through HTR-United. Furthermore, we want to suggest appropriate citation methods for ATR data, models, and contributions made by volunteers. Moreover, when using digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. These topics all relate to the proper acknowledgement of labour put into digitising, transcribing, and sharing Ground Truth HTR data. This also points to broader issues surrounding the use of machine learning in archival and library contexts, and how the community should begin to acknowledge and record both contributions and data provenance.

5. Historical Documents and Automatic Text Recognition: Introduction

Ariane Pinche ; Peter Stokes.
With this special issue of the Journal of Data Mining and Digital Humanities (JDMDH), we bringtogether in one single volume several experiments, projects and reflections related to automatic textrecognition applied to historical documents. More and more research projects now include automatic text acquisition in their data processing chain, and this is true not only for projects focussed on Digital or Computational Humanities but increasingly also for those that are simply using existing digital tools as the means to an end. The increasing use of this technology has led to an automation of tasks that affects the role of the researcher in the textual production process. This new data-intensive practice makes it urgent to collect and harmonise the corpora necessary for the constitution of training sets, but also to make them available for exploitation. This special issue is therefore an opportunity to present articles combining philological and technical questions to make a scientific assessment of the use of automatic text recognition for ancient documents, its results, its contributions and the new practices induced by its use in the process of editing and exploring texts. We hope that practical aspects will be questioned on this occasion, while raising methodological challenges and its impact on research data.The special issue on Automatic Text Recognition (ATR) is therefore dedicated to providing a comprehensive overview of the use of ATR in the humanities field, […]

6. Ainu–Japanese Bi-directional Neural Machine Translation: A Step Towards Linguistic Preservation of Ainu, An Under-Resourced Indigenous Language in Japan

So Miyagawa.
This study presents a groundbreaking approach to preserving the Ainu language, recognized as critically endangered by UNESCO, by developing a bi-directional neural machine translation (MT) system between Ainu and Japanese. Utilizing the Marian MT framework, known for its effectiveness with resource-scarce languages, the research aims to overcome the linguistic complexities inherent in Ainu's polysynthetic structure. The paper delineates a comprehensive methodology encompassing data collection from diverse Ainu text sources, meticulous preprocessing, and the deployment of neural MT models, culminating in the achievement of significant SacreBLEU scores that underscore the models' translation accuracy. The findings illustrate the potential of advanced MT technology to facilitate linguistic preservation and educational endeavors, advocating for integrating such technologies in safeguarding endangered languages. This research not only underscores the critical role of MT in bridging language divides but also sets a precedent for employing computational linguistics to preserve cultural and linguistic heritage.
Section: Digital humanities in languages

7. Perceptions of 21st-century digital skills and agency among design sprint participants in Laurea UAS, Finland

Asko Mononen.
This explorative study investigated students’ (N=16) perceptions before and after the study unit Digital Analytics and Consumer Insights. The studies were conducted as an intensive hybrid five-day design sprint, a variant of project- and problem-based learning. An online questionnaire with a 5-point Likert scale was used for data collection. The findings indicate that the intervention improved perceptions of most studied digital “hard skills” (8/11 claims). Out of twelve 21st-century “soft skills” claims, perceptions were high initially and improved significantly for critical thinking and systematic problem-solving claims during the design sprint. The agency scores showed a slight improvement but no significant difference. Face-to-face groups would be willing to recommend the sprint method more for peers than online groups.  In the era of global turbulence and artificial intelligence, in addition to hard skills, soft skills like communication, teamwork, problem-solving and project management are in demand by employers. According to LinkedIn data in 2/2024, adaptability is the most demanded skill. In addition to traditional subjects, the pedagogical methods in higher education should better support the development of 21st-century skills.

8. Sentiment Analysis for Literary Texts: Hemingway as a Case-study

Bizzoni Yuri ; Feldkamp Pascale.

9. On searchable Mordvin corpora at the Language Bank of Finland, EMERALD

Jack Rueter.
Description of Mordvin language corpora development at the Language Bank of Finland.Description of development.
Section: V. The contribution of corpora

10. Towards efficient and reliable utilization of automated data collection: Media scrapers applied to news on climate change

Erkki Mervaala ; Jari Lyytimäki.
Abstract: Automated data collection provides tempting opportunities for social sciences and humanities studies. Abundant data accumulating in various digital archives allows more comprehensive, timely and cost-efficient ways of harvesting and processing information. While easing or even removing some of the key problems, such as laborious and time-consuming data collection and potential errors and biases related to subjective coding of materials and distortions caused by focus on small samples, automated methods also bring in new risks such as poor understanding of contexts of the data or non-recognition of underlying systematic errors or missing information. Results from testing different methods to collect data describing newspaper coverage of climate change in Finland emphasize that fully relying on automatable tools such as media scrapers has its limitations and can provide comprehensive but incomplete document acquisition for research. Many of these limitations can, however, be addressed and not all of them rely on manual control.

11. Perplexity Games: Maoism vs. Literature through the Lens of Cognitive Stylometry

Maciej Kurzynski.
The arrival of large language models (LLMs) has provoked an urgent search for stylistic markers that could differentiate machine text from human text, but while the human-like appearance of machine text has captivated public attention, the reverse phenomenon—human text becoming machine-like—has raised much less concern. This conceptual lag is surprising given the ample historical evidence of state-backed attempts to regulate human thought. The present article proposes a new comparative framework, Perplexity Games, to leverage the predictive power of LLMs and compare the statistical properties of Maospeak, a language style that emerged during the Mao Zedong’s era in China (1949-1976), with the style of canonical modern Chinese writers, such as Eileen Chang (1920-1995) and Mo Yan (1955-). The low perplexity of Maospeak, as computed across different GPT models, suggests that the impact of ideologies on language can be compared to likelihood-maximization text-generation techniques which reduce the scope of valid sequence continuations. These findings have cognitive implications: whereas engineered languages such as Maospeak hijack the predictive mechanisms of human cognition by narrowing the space of linguistic possibilities, literature resists such cognitive constraints by dispersing the probability mass over multiple, equally valid paths. Exposure to diverse language data counters the influences of ideologies on our linguistically mediated perceptions of the world and […]

12. Normalization of Arabic Dialects into Modern Standard Arabic using BERT and GPT-2

Khalid Alnajjar ; Mika Hämäläinen.
We present an encoder-decored based model for normalization of Arabic dialects using both BERT and GPT-2 based models. Arabic is a language of many dialects that not only differ from the Modern Standard Arabic (MSA) in terms of pronunciation but also in terms of morphology, grammar and lexical choice. This diversity can be troublesome even to a native Arabic speaker let alone a computer. Several NLP tools work well for MSA and in some of the main dialects but fail to cover Arabic language as a whole. Based on our manual evaluation, our model normalizes sentences entirely correctly 46\% of the time and almost correctly 26\% of the time.

13. Predicting Sustainable Development Goals Using Course Descriptions -- from LLMs to Conventional Foundation Models

Lev Kharlashkin ; Melany Macias ; Leo Huovinen ; Mika Hämäläinen.
We present our work on predicting United Nations sustainable development goals (SDG) for university courses. We use an LLM named PaLM 2 to generate training data given a noisy human-authored course description input as input. We use this data to train several different smaller language models to predict SDGs for university courses. This work contributes to better university level adaptation of SDGs. The best performing model in our experiments was BART with an F1-score of 0.786.

14. Old Permic Universal Dependencies Treebank

Niko Partanen ; Jack Rueter ; Rogier Blokland.
Old Permic, also known as Old Komi, is an extinct variety of Komi that was spoken in the late Middle Ages in the lower Vychegda river basin in Northeastern European Russia, in an area that currently is not Komi-speaking. This language variety is attested in fragmentary records from the 14th to 17th century written both in the Old Permic alphabet and in Cyrillic. These records are of significant importance for research on the history of the Komi language. Here we introduce our attempt towards a new Universal Dependencies treebank that will eventually contain the existing corpus of Old Permic in a structured and CoNLL-U annotated format. This will be the first time this material is being made openly available in digital format, and our contribution describes the current state of the art and remaining challenges.

15. Applying computational approaches to energy discourse: a comparative methodological study of rule-based and large language model qualitative content analysis

Ilona Kousa.

16. Study on the Domain Adaption of Korean Speech Act using Daily Conversation Dataset and Petition Corpus

Youngsook Song ; Won Ik Cho.
In Korean, quantitative speech act studies have usually been conducted on single utterances with unspecified sources. In this study, we annotate sentences from the National Institute of Korean Language's Messenger Corpus and the National Petition Corpus, as well as example sentences from an academic paper on contemporary Korean vlogging, and check the discrepancy between human annotation and model prediction. In particular, for sentences with differences in locutionary and illocutionary forces, we analyze the causes of errors to see if stylistic features used in a particular domain affect the correct inference of speech act. Through this, we see the necessity to build and analyze a balanced corpus in various text domains, taking into account cases with different usage roles, e.g., messenger conversations belonging to private conversations and petition corpus/vlogging script that have an unspecified audience.
Section: Dataset

17. Values That Are Explicitly Present in Fairy Tales: Comparing Samples from German, Italian and Portuguese Traditions

Alba Morollon Diaz-Faes ; Carla Sofia Ribeiro Murteira ; Martin Ruskov.
Looking at how social values are represented in fairy tales can give insights about the variations in communication of values across cultures. We study how values are communicated in fairy tales from Portugal, Italy and Germany using a technique called word embedding with a compass to quantify vocabulary differences and commonalities. We study how these three national traditions differ in their explicit references to values. To do this, we specify a list of value-charged tokens, consider their word stems and analyse the distance between these in a bespoke pre-trained Word2Vec model. We triangulate and critically discuss the validity of the resulting hypotheses emerging from this quantitative model. Our claim is that this is a reusable and reproducible method for the study of the values explicitly referenced in historical corpora. Finally, our preliminary findings hint at a shared cultural understanding and the expression of values such as Benevolence, Conformity, and Universalism across the studied cultures, suggesting the potential existence of a pan-European cultural memory.

18. OCR quality and the resilience of algorithmic identification of linguistic register features in Eighteenth Century Collections Online

Aatu Liimatta.