2024


1. ArchEthno - a new tool for sharing research materials and a new method for archiving your own research

Florence Weber ; Carlo Zwölf ; Arnaud Trouche ; Agnès Tricoche ; José Sastre.
The archiving of ethnographic material is generally considered a blind spot in ethnographic working methods, which place more importance on the investigations and analysis themselves than on how archives are constructed. A team of computer scientists and ethnographers built an initial tool for sharing ethnographic materials, based on an SQL relational data model that suited the first survey processed but proved difficult to transpose to other surveys. The team then developed a new tool, based on dynamic vocabularies of concepts, which breaks archiving down into three stages. First, ethnographers select and contextualise their survey materials; second, they structure them in a database according to the research question discovered during their survey; finally, they share these data with other researchers, subject to the opinion of an ethics committee whose members are competent in ethnography.
Section: Data deluge: what skills for what data?

2. Incorporating Crowdsourced Annotator Distributions into Ensemble Modeling to Improve Classification Trustworthiness for Ancient Greek Papyri

Graham West ; Matthew I. Swindall ; Ben Keener ; Timothy Player ; Alex C. Williams ; James H. Brusuelas ; John F. Wallin.
Performing classification on noisy, crowdsourced image datasets can prove challenging even for the best neural networks. Two issues which complicate the problem on such datasets are class imbalance and ground-truth uncertainty in labeling. The AL-ALL and AL-PUB datasets - consisting of tightly cropped, individual characters from images of ancient Greek papyri - are strongly affected by both issues. The application of ensemble modeling to such datasets can help identify images where the ground truth is questionable and quantify the trustworthiness of those samples. As such, we apply stacked generalization consisting of nearly identical ResNets with different loss functions: one utilizing sparse cross-entropy (CXE) and the other Kullback–Leibler divergence (KLD). Both networks use labels drawn from a crowdsourced consensus. This consensus is derived from a Normalized Distribution of Annotations (NDA) based on all annotations for a given character in the dataset. For the second network, the KLD is calculated with respect to the NDA. For our ensemble model, we apply a k-nearest neighbors model to the outputs of the CXE and KLD networks. Individually, the ResNet models have approximately 93% accuracy, while the ensemble model achieves an accuracy of > 95%, increasing the classification trustworthiness. We also perform an analysis of the Shannon entropy of the various models' output distributions to measure classification uncertainty. Our results suggest that entropy is […]
Section: Digital humanities in languages
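
As a rough illustration of the stacking step described in the abstract, the sketch below (not the authors' code; the class count, data, and neighbor count are placeholder assumptions) concatenates the class-probability outputs of two base networks, fits a k-nearest-neighbors meta-classifier on them, and computes the Shannon entropy used to gauge classification uncertainty.

    # Hypothetical sketch of stacked generalization: the softmax outputs of two
    # base networks (here standing in for the CXE- and KLD-trained ResNets)
    # become meta-features for a k-nearest-neighbors meta-classifier.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def shannon_entropy(probs, eps=1e-12):
        """Entropy of each row of an (n_samples, n_classes) probability matrix."""
        return -np.sum(probs * np.log2(probs + eps), axis=1)

    # Placeholder outputs; in practice these come from the trained ResNets.
    rng = np.random.default_rng(0)
    p_cxe = rng.dirichlet(np.ones(24), size=200)  # e.g. 24 Greek letter classes
    p_kld = rng.dirichlet(np.ones(24), size=200)
    y = rng.integers(0, 24, size=200)             # consensus labels

    stacked = np.hstack([p_cxe, p_kld])           # concatenated meta-features
    meta = KNeighborsClassifier(n_neighbors=5).fit(stacked, y)
    pred = meta.predict(stacked)

    # Low entropy indicates a confident, more trustworthy classification.
    print(shannon_entropy(p_cxe)[:5])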

3. Toward Automatic Typography Analysis: Serif Classification and Font Similarities

Syed Talal Wasim ; Romain Collaud ; Lara Défayes ; Nicolas Henchoz ; Mathieu Salzmann ; Delphine Ribes Lemay.
Whether a document is of historical or contemporary significance, typography plays a crucial role in its composition. From the early days of modern printing, typographic techniques have evolved and transformed, resulting in changes to the features of typography. By analyzing these features, we can gain insights into specific time periods, geographical locations, and messages conveyed through typography. Therefore, in this paper, we aim to investigate the feasibility of training a model to classify serif types without knowledge of the font and character. We also investigate how to train a vectorial-based image model able to group together fonts with similar features. Specifically, we compare the use of state-of-the-art image classification methods, such as the EfficientNet-B2 and the Vision Transformer Base model with different patch sizes, and the state-of-the-art fine-grained image classification method, TransFG, on the serif classification task. We also evaluate the use of the DeepSVG model to learn to group fonts with similar features. Our investigation reveals that fine-grained image classification methods are better suited for the serif classification task and that leveraging the character labels helps to learn more meaningful font similarities. This repository contains: a paper published in the Journal of Data Mining and Digital Humanities (WasimEtAl_Toward_Automatic_Typography_Analysis__Serif_Classification_and_Font_Similarities.pdf) and two datasets: the first […]
Section: Project presentations
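
For readers unfamiliar with the setup, a minimal sketch of framing serif classification as ordinary image classification with a pretrained Vision Transformer follows; it is not the authors' pipeline, and the model choice, class inventory, and hyperparameters are illustrative assumptions.

    # Hedged sketch: fine-tuning a pretrained ViT for serif classification
    # via the timm library. The four serif classes are an assumption.
    import timm
    import torch

    NUM_SERIF_CLASSES = 4  # e.g. old-style, transitional, modern, slab (assumed)

    model = timm.create_model("vit_base_patch16_224", pretrained=True,
                              num_classes=NUM_SERIF_CLASSES)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()

    # One illustrative training step on a dummy batch of glyph images.
    images = torch.randn(8, 3, 224, 224)
    labels = torch.randint(0, NUM_SERIF_CLASSES, (8,))
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()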

4. Historical Documents and Automatic Text Recognition: Introduction

Ariane Pinche ; Peter Stokes.
With this special issue of the Journal of Data Mining and Digital Humanities (JDMDH), we bring together in one single volume several experiments, projects and reflections related to automatic text recognition applied to historical documents. More and more research projects now include automatic text acquisition in their data processing chain, and this is true not only for projects focussed on Digital or Computational Humanities but increasingly also for those that are simply using existing digital tools as a means to an end. The increasing use of this technology has led to an automation of tasks that affects the role of the researcher in the textual production process. This new data-intensive practice makes it urgent to collect and harmonise the corpora necessary for the constitution of training sets, but also to make them available for exploitation. This special issue is therefore an opportunity to present articles combining philological and technical questions to make a scientific assessment of the use of automatic text recognition for ancient documents, its results, its contributions and the new practices induced by its use in the process of editing and exploring texts. We hope that practical aspects will be questioned on this occasion, while raising methodological challenges and the technology's impact on research data. The special issue on Automatic Text Recognition (ATR) is therefore dedicated to providing a comprehensive overview of the use of ATR in the humanities field, particularly […]

5. Ainu–Japanese Bi-directional Neural Machine Translation: A Step Towards Linguistic Preservation of Ainu, An Under-Resourced Indigenous Language in Japan

So Miyagawa.
This study presents a groundbreaking approach to preserving the Ainu language, recognized as critically endangered by UNESCO, by developing a bi-directional neural machine translation (MT) system between Ainu and Japanese. Utilizing the Marian MT framework, known for its effectiveness with resource-scarce languages, the research aims to overcome the linguistic complexities inherent in Ainu's polysynthetic structure. The paper delineates a comprehensive methodology encompassing data collection from diverse Ainu text sources, meticulous preprocessing, and the deployment of neural MT models, culminating in the achievement of significant SacreBLEU scores that underscore the models' translation accuracy. The findings illustrate the potential of advanced MT technology to facilitate linguistic preservation and educational endeavors, advocating for integrating such technologies in safeguarding endangered languages. This research not only underscores the critical role of MT in bridging language divides but also sets a precedent for employing computational linguistics to preserve cultural and linguistic heritage.
Section: Digital humanities in languages
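
The abstract names the Marian MT framework and SacreBLEU evaluation; a hedged sketch of both, via the Hugging Face transformers and sacrebleu libraries, follows. The checkpoint path is a hypothetical placeholder, since the paper's trained models are not described here, and the example sentences are illustrative.

    # Hedged sketch: inference with a Marian translation model and SacreBLEU
    # scoring. "path/to/ainu-ja-marian" is a hypothetical local checkpoint.
    from transformers import MarianMTModel, MarianTokenizer
    import sacrebleu

    model_name = "path/to/ainu-ja-marian"  # hypothetical fine-tuned checkpoint
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # Translate a batch of Ainu sentences into Japanese.
    batch = tokenizer(["irankarapte"], return_tensors="pt", padding=True)
    generated = model.generate(**batch, max_new_tokens=64)
    hypotheses = tokenizer.batch_decode(generated, skip_special_tokens=True)

    # SacreBLEU against reference translations (placeholder references).
    references = [["こんにちは"]]
    print(sacrebleu.corpus_bleu(hypotheses, references).score)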

6. Perceptions of 21st-century digital skills and agency among design sprint participants in Laurea UAS, Finland

Asko Mononen.
This explorative study investigated students’ (N=16) perceptions before and after the study unit Digital Analytics and Consumer Insights. The study unit was conducted as an intensive hybrid five-day design sprint, a variant of project- and problem-based learning. An online questionnaire with a 5-point Likert scale was used for data collection. The findings indicate that the intervention improved perceptions of most of the digital “hard skills” studied (8/11 claims). Out of twelve 21st-century “soft skills” claims, perceptions were high initially and improved significantly for the critical-thinking and systematic problem-solving claims during the design sprint. The agency scores showed a slight improvement but no significant difference. Face-to-face groups were more willing than online groups to recommend the sprint method to peers. In the era of global turbulence and artificial intelligence, soft skills like communication, teamwork, problem-solving and project management are in demand by employers in addition to hard skills. According to LinkedIn data from 2/2024, adaptability is the most in-demand skill. In addition to traditional subjects, the pedagogical methods in higher education should better support the development of 21st-century skills.
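
The abstract does not name the significance test used; as a hedged illustration, paired pre/post Likert responses of this kind are often compared with a Wilcoxon signed-rank test, sketched below on invented data.

    # Illustrative sketch only: Wilcoxon signed-rank test on paired pre/post
    # Likert scores (the study's actual statistical test is not stated above).
    from scipy.stats import wilcoxon

    pre  = [3, 2, 4, 3, 3, 2, 4, 3, 2, 3, 4, 3, 2, 3, 3, 2]  # invented, N=16
    post = [4, 3, 4, 4, 3, 3, 5, 4, 3, 4, 4, 4, 3, 4, 4, 3]
    stat, p = wilcoxon(pre, post)
    print(f"W={stat}, p={p:.4f}")  # p < 0.05 would suggest a significant shift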

7. Sentiment Analysis for Literary Texts: Hemingway as a Case-study

Yuri Bizzoni ; Pascale Feldkamp.

8. On searchable Mordvin corpora at the Language Bank of Finland, EMERALD

Jack Rueter.
A description of the development of Mordvin-language corpora at the Language Bank of Finland.
Section: V. The contribution of corpora

9. Towards efficient and reliable utilization of automated data collection: Media scrapers applied to news on climate change

Erkki Mervaala ; Jari Lyytimäki.
Automated data collection provides tempting opportunities for social sciences and humanities studies. Abundant data accumulating in various digital archives allows more comprehensive, timely and cost-efficient ways of harvesting and processing information. While easing or even removing some key problems, such as laborious and time-consuming data collection, potential errors and biases related to subjective coding of materials, and distortions caused by a focus on small samples, automated methods also bring new risks, such as a poor understanding of the contexts of the data or the non-recognition of underlying systematic errors or missing information. Results from testing different methods of collecting data on newspaper coverage of climate change in Finland emphasize that fully relying on automatable tools such as media scrapers has its limitations and can provide comprehensive but still incomplete document acquisition for research. Many of these limitations can, however, be addressed, and not all of the remedies require manual control.
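
As a hedged illustration of what a media scraper of this kind does, the sketch below fetches a search page and extracts headlines; the URL and CSS selector are hypothetical placeholders, not the authors' pipeline.

    # Minimal scraper sketch: fetch a news search page and pull out headlines.
    # URL and selector are hypothetical placeholders.
    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/search?q=ilmastonmuutos"  # Finnish: "climate change"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    headlines = [h.get_text(strip=True) for h in soup.select("h2.headline")]

    # Automated collection is fast but incomplete: paywalls, layout changes and
    # gaps in archive coverage still call for manual checks, as the paper notes.
    print(len(headlines), "headlines scraped")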

10. Perplexity Games: Maoism vs. Literature through the Lens of Cognitive Stylometry

Maciej Kurzynski.
The arrival of large language models (LLMs) has provoked an urgent search for stylistic markers that could differentiate machine text from human text, but while the human-like appearance of machine text has captivated public attention, the reverse phenomenon—human text becoming machine-like—has raised much less concern. This conceptual lag is surprising given the ample historical evidence of state-backed attempts to regulate human thought. The present article proposes a new comparative framework, Perplexity Games, to leverage the predictive power of LLMs and compare the statistical properties of Maospeak, a language style that emerged during the Mao Zedong era in China (1949-1976), with the style of canonical modern Chinese writers, such as Eileen Chang (1920-1995) and Mo Yan (1955-). The low perplexity of Maospeak, as computed across different GPT models, suggests that the impact of ideologies on language can be compared to likelihood-maximization text-generation techniques which reduce the scope of valid sequence continuations. These findings have cognitive implications: whereas engineered languages such as Maospeak hijack the predictive mechanisms of human cognition by narrowing the space of linguistic possibilities, literature resists such cognitive constraints by dispersing the probability mass over multiple, equally valid paths. Exposure to diverse language data counters the influences of ideologies on our linguistically mediated perceptions of the world and […]
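
The framework's core measurement is a text's perplexity under a causal language model; a minimal sketch follows, using the generic gpt2 checkpoint as a stand-in for the GPT models the paper actually compares.

    # Hedged sketch: perplexity of a text under a causal LM via transformers.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean negative log-likelihood
        return torch.exp(loss).item()

    # Lower perplexity = more predictable text; the paper argues Maospeak
    # scores lower than canonical literary prose.
    print(perplexity("The quick brown fox jumps over the lazy dog."))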

11. Normalization of Arabic Dialects into Modern Standard Arabic using BERT and GPT-2

Khalid Alnajjar ; Mika Hämäläinen.
We present an encoder-decoder model for the normalization of Arabic dialects using both BERT- and GPT-2-based models. Arabic is a language of many dialects that differ from Modern Standard Arabic (MSA) not only in pronunciation but also in morphology, grammar and lexical choice. This diversity can be troublesome even for a native Arabic speaker, let alone a computer. Several NLP tools work well for MSA and some of the main dialects, but fail to cover the Arabic language as a whole. Based on our manual evaluation, our model normalizes sentences entirely correctly 46% of the time and almost correctly 26% of the time.
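
A rough sketch of one way to wire a BERT encoder to a GPT-2 decoder with Hugging Face's EncoderDecoderModel follows; the checkpoint names are generic stand-ins, not the models the authors trained, and the resulting model needs fine-tuning on dialect-to-MSA pairs before it produces useful output.

    # Hedged sketch: BERT encoder + GPT-2 decoder via EncoderDecoderModel.
    # Checkpoints are generic stand-ins; cross-attention starts untrained.
    from transformers import EncoderDecoderModel, AutoTokenizer

    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        "bert-base-multilingual-cased", "gpt2"
    )
    enc_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    dec_tok = AutoTokenizer.from_pretrained("gpt2")

    model.config.decoder_start_token_id = dec_tok.bos_token_id
    model.config.pad_token_id = enc_tok.pad_token_id

    # Training would feed dialectal sentences as inputs and their MSA
    # normalizations as labels; generation then maps dialect -> MSA.
    inputs = enc_tok("dialectal sentence here", return_tensors="pt")
    out = model.generate(input_ids=inputs.input_ids,
                         attention_mask=inputs.attention_mask,
                         max_new_tokens=32)
    print(dec_tok.decode(out[0], skip_special_tokens=True))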

12. Predicting Sustainable Development Goals Using Course Descriptions – from LLMs to Conventional Foundation Models

Lev Kharlashkin ; Melany Macias ; Leo Huovinen ; Mika Hämäläinen.
We present our work on predicting United Nations Sustainable Development Goals (SDGs) for university courses. We use an LLM, PaLM 2, to generate training data from noisy human-authored course descriptions given as input. We use this data to train several smaller language models to predict SDGs for university courses. This work contributes to better university-level adaptation of the SDGs. The best-performing model in our experiments was BART, with an F1-score of 0.786.
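
As a hedged illustration of the downstream setup, the sketch below fine-tunes a BART classifier on a course description; whether the task is single- or multi-label is not stated above, so a 17-way multi-label configuration over the SDGs is assumed for illustration.

    # Hedged sketch: BART as a multi-label classifier over the 17 SDGs.
    # The multi-label framing and example label are illustrative assumptions.
    import torch
    from transformers import BartForSequenceClassification, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForSequenceClassification.from_pretrained(
        "facebook/bart-base", num_labels=17,
        problem_type="multi_label_classification",
    )

    text = "Introduction to renewable energy systems and policy."
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    labels = torch.zeros(1, 17)
    labels[0, 6] = 1.0  # e.g. SDG 7: Affordable and Clean Energy (illustrative)

    loss = model(**enc, labels=labels).loss  # BCE-with-logits under the hood
    loss.backward()  # one illustrative training step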