Natallia Kokash ; Matteo Romanello ; Ernest Suyver ; Giovanni Colavizza.
The digital transformation of the scientific publishing industry has led to dramatic improvements in content discoverability and information analytics. Unfortunately, these improvements have not been uniform across research areas. The scientific literature in the arts, humanities and social sciences (AHSS) still lags behind, in part due to the scale of analog backlogs, the persisting importance of national languages, and a publisher ecosystem made up of many small and medium-sized enterprises. We propose a bottom-up approach to support publishers in creating and maintaining their own publication knowledge graphs in the open domain. We do so by releasing a pipeline able to extract structured information from the bibliographies and indexes of AHSS publications, then disambiguate, normalize and export it as linked data. We test the proposed pipeline on Brill's Classics collection, and release an open-source implementation for further use and improvement.
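To illustrate the export step of such a pipeline, here is a minimal sketch that serializes one disambiguated bibliography entry as linked data with rdflib. The vocabulary, URI scheme and entry values are illustrative assumptions, not the pipeline's actual schema.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

BIBO = Namespace("http://purl.org/ontology/bibo/")  # a common bibliographic vocabulary

g = Graph()
ref = URIRef("https://example.org/reference/123")  # hypothetical identifier scheme
g.add((ref, RDF.type, BIBO.Document))
g.add((ref, DCTERMS.title, Literal("De bello Gallico")))
g.add((ref, DCTERMS.creator, Literal("Julius Caesar")))

# Serialize the entry as Turtle, ready for publication as linked open data.
print(g.serialize(format="turtle"))
```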
Simon Gabay ; Thibault Clérice ; Christian Reul.
Machine learning begins with machine teaching: in the following paper, we present the data that we have prepared to kick-start the training of reliable OCR models for 17th-century prints written in French. The construction of a representative corpus is a major challenge: we need to gather documents from different decades and of different genres to cover as many sizes, weights and styles as possible. Because historical prints contain glyphs and typefaces that have since disappeared, transcription is a complex act, for which we present guidelines. Finally, we provide preliminary results based on these training data, along with experiments to improve them.
Section: Datasets
Charles Swisher ; Lior Shamir.
The availability of quantitative text analysis methods has provided new ways of analyzing literature that were not available in the pre-information era. Here we apply a comprehensive machine learning analysis to the work of William Shakespeare. The analysis shows clear changes in the style of writing over time, with the most significant changes in sentence length, the frequency of adjectives and adverbs, and the sentiments expressed in the text. Applying machine learning to make a stylometric prediction of the year of each play yields a Pearson correlation of 0.71 between the actual and predicted year, indicating that Shakespeare's writing style, as reflected by the quantitative measurements, changed over time. Additionally, it shows that the stylometric profile of some plays is more similar to plays written either before or after the year they were written. For instance, Romeo and Juliet is dated 1596, but is stylometrically more similar to plays written by Shakespeare after 1600. The source code for the analysis is available for free download.
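A minimal sketch of this kind of regression setup: predict a play's year from stylometric features and report the Pearson correlation between actual and predicted years. The two features and the `plays` variable are stand-in assumptions for illustration; the paper's analysis uses a much richer feature set.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def style_features(text: str) -> list[float]:
    # Stand-in stylometric features (the paper also uses adjective/adverb
    # rates, sentiment, and more).
    sentences = [s for s in text.replace('!', '.').replace('?', '.').split('.') if s.strip()]
    words = text.split()
    mean_sentence_length = np.mean([len(s.split()) for s in sentences])
    type_token_ratio = len(set(w.lower() for w in words)) / len(words)
    return [mean_sentence_length, type_token_ratio]

# plays: list of (full_text, year) pairs -- assumed available.
X = np.array([style_features(text) for text, _ in plays])
y = np.array([year for _, year in plays])

# Leave-one-out predictions, so each play's year is predicted from the others.
predicted = cross_val_predict(Ridge(alpha=1.0), X, y, cv=LeaveOneOut())
r, _ = pearsonr(y, predicted)
print(f"Pearson r between actual and predicted year: {r:.2f}")
```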
Daniel Thilo Schroeder ; Mirjam de Bruijn ; Luca Bruls ; Mulatu Alemayehu Moges ; Samba Dialimpa Badji ; Noëmie Fritz ; Modibo Galy Cisse ; Johannes Langguth ; Bruce Mutsvairo ; Kristin Skare Orgeret.
With the expansion of mobile communications infrastructure, social media usage in the Global South is surging. Compared to the Global North, populations of the Global South have had less prior experience with social media from stationary computers and wired Internet. Many countries are experiencing violent conflicts that have a profound effect on their societies. As a result, social networks develop under different conditions than elsewhere, and our goal is to provide data for studying this phenomenon. In this dataset paper, we present a data collection of a national Twittersphere in a West African country in conflict. While not the largest social network in terms of users, Twitter is an important platform where people engage in public discussion. The focus is on Mali, a country beset by conflict since 2012 whose media ecology has been relatively precarious in recent years. The dataset consists of tweets and Twitter users in Mali and was collected in June 2022, when the Malian conflict became more violent both internally and towards external and international actors. In a preliminary analysis, we assume that the conflictual context influences how people access social media and, therefore, the shape of the Twittersphere and its characteristics. The primary aim of this paper is to invite researchers from various disciplines, including complex networks and social science scholars, to explore the data at hand further. We collected the dataset using a scraping strategy of the follower […]
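The abstract is truncated, but a follower-based scraping strategy is typically a breadth-first snowball crawl. The sketch below shows that general pattern; `fetch_followers` and `is_in_scope` are hypothetical stand-ins for the actual API client and the geographic/contextual filter, not the authors' code.

```python
from collections import deque

def snowball_follower_crawl(seeds, fetch_followers, is_in_scope, max_users=100_000):
    """Breadth-first snowball crawl: start from seed accounts and expand
    through follower links, keeping only accounts judged in scope
    (e.g., located in Mali)."""
    seen = set(seeds)
    queue = deque(seeds)
    collected = []
    while queue and len(collected) < max_users:
        user = queue.popleft()
        if not is_in_scope(user):
            continue
        collected.append(user)
        for follower in fetch_followers(user):  # hypothetical API wrapper
            if follower not in seen:
                seen.add(follower)
                queue.append(follower)
    return collected
```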
Baptiste Blouin ; Cécile Armand ; Christian Henriot.
This paper introduces HistText, a pioneering tool devised to facilitate large-scale data mining in historical documents, specifically targeting Chinese sources. Developed in response to the challenges posed by the massive Modern China Textual Database, HistText emerges as a solution to efficiently extract and visualize valuable insights from billions of words spread across millions of documents. With a user-friendly interface, advanced text analysis techniques, and powerful data visualization capabilities, HistText offers a robust platform for digital humanities research. This paper explores the rationale behind HistText, underscores its key features, and provides a comprehensive guide for its effective utilization, thus highlighting its potential to substantially enhance the realm of computational humanities.
Section: Project presentations
Wouter Haverals ; Mike Kestemont.
This study is devoted to two of the oldest known manuscripts in which the oeuvre of the medieval mystical author Hadewijch has been preserved: Brussels, KBR, 2879-2880 (ms. A) and Brussels, KBR, 2877-2878 (ms. B). On the basis of codicological and contextual arguments, it is assumed that the scribe who produced B used A as an exemplar. While the similarities in both layout and content between the two manuscripts are striking, the present article seeks to identify the differences. After all, regardless of the intention to produce a copy that closely follows the exemplar, subtle linguistic variation is apparent. Divergences relate to spelling conventions, but also to the way in which words are abbreviated (and the extent to which abbreviations occur). The present study investigates the spelling profiles of the scribes who produced mss. A and B in a computational way. In the first part of this study, we will present both manuscripts in more detail, after which we will consider prior research carried out on scribal profiling. The current study both builds and expands on Kestemont (2015). Next, we outline the methodology used to analyse and measure the degree of scribal appropriation that took place when ms. B was copied from the exemplar ms. A. After this, we will discuss the results obtained, focusing on the scribal variation that can be found both at the level of individual words and n-grams. To this end, we use machine learning to identify the most distinctive features that […]
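A minimal sketch of this kind of scribal profiling, assuming transcribed line samples `lines_a` and `lines_b` from each manuscript are available: a character n-gram classifier whose largest weights surface the most distinctive spelling and abbreviation habits. The exact features and learner used in the paper may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# lines_a, lines_b: transcribed line samples from mss. A and B (assumed available).
clf = make_pipeline(
    # Character n-grams capture spelling and abbreviation habits.
    TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(lines_a + lines_b, ['A'] * len(lines_a) + ['B'] * len(lines_b))

# The largest absolute weights point at the most scribe-distinctive n-grams.
vec = clf.named_steps['tfidfvectorizer']
lr = clf.named_steps['logisticregression']
features = np.array(vec.get_feature_names_out())
top = np.argsort(np.abs(lr.coef_[0]))[-15:]
print(list(zip(features[top], np.round(lr.coef_[0][top], 3))))
```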
Emily Öhman ; Riikka Rossi.
We propose to use affect as a proxy for mood in literary texts. In this study, we explore the differences between computationally detecting tone and detecting mood. Methodologically, we utilize affective word embeddings to look at the affective distribution in different text segments. We also present a simple yet efficient and effective method of enhancing emotion lexicons to take both semantic shift and the domain of the text into account, producing results that closely match both contemporary and modern qualitative analyses.
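One common way to adapt a lexicon to a specific domain and period is to expand it with nearest neighbours from embeddings trained on the target corpus. The sketch below illustrates that general idea with gensim KeyedVectors; it is a stand-in for the approach, not the authors' exact procedure, and the vector file name is hypothetical.

```python
from gensim.models import KeyedVectors

def expand_lexicon(lexicon, kv, topn=10, min_sim=0.6):
    """Expand an emotion lexicon with corpus-specific nearest neighbours,
    so domain usage and semantic shift are reflected in the word list."""
    expanded = dict(lexicon)
    for word, emotion in lexicon.items():
        if word not in kv:
            continue
        for neighbour, similarity in kv.most_similar(word, topn=topn):
            if similarity >= min_sim and neighbour not in expanded:
                expanded[neighbour] = emotion  # inherit the seed word's label
    return expanded

# Usage sketch: embeddings trained on (or fitted to) the target literary corpus.
kv = KeyedVectors.load("domain_vectors.kv")  # hypothetical path
seed = {"gloomy": "sadness", "radiant": "joy"}
print(expand_lexicon(seed, kv))
```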
Maciej Janicki.
The digitization of large archival collections of oral folk poetry in Finland and Estonia has opened possibilities for large-scale quantitative studies of intertextuality. As an initial methodological step in this direction, I present a method for pairwise line-by-line comparison of poems using the weighted sequence alignment algorithm (a.k.a. 'weighted edit distance'). The main contribution of the paper is a novel description of the algorithm in terms of matrix operations, which allows for much faster alignment of a poem against the entire corpus by utilizing modern numeric libraries and GPU capabilities. This way we are able to compute alignment scores between all pairs within a corpus of over 280,000 poems. The resulting table of over 40 million pairwise poem similarities can be used in various ways to study the oral tradition. Some starting points for such research are sketched in the latter part of the article.
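A simplified sketch of the idea: keep the classic edit-distance dynamic program, but vectorize the batch axis so one query line is aligned against many corpus lines at once. The paper's matrix reformulation goes further (and runs on GPUs); this NumPy version, with assumed unit costs, only illustrates the principle.

```python
import numpy as np

def batch_edit_distance(query, lines, sub_cost=1.0, indel_cost=1.0):
    """Align one query line against a batch of corpus lines at once.
    Each DP cell update below touches all B lines simultaneously."""
    B, m = len(lines), len(query)
    n = max(len(s) for s in lines)
    # Encode lines into a padded (B, n) code-point matrix; -1 marks padding.
    enc = np.full((B, n), -1, dtype=np.int64)
    for b, s in enumerate(lines):
        enc[b, :len(s)] = [ord(c) for c in s]
    # Row i = 0 of the DP table, replicated across the batch.
    prev = np.tile(np.arange(n + 1, dtype=float) * indel_cost, (B, 1))
    for i in range(1, m + 1):
        cur = np.empty_like(prev)
        cur[:, 0] = i * indel_cost
        qc = ord(query[i - 1])
        for j in range(1, n + 1):
            substitution = prev[:, j - 1] + np.where(enc[:, j - 1] == qc, 0.0, sub_cost)
            cur[:, j] = np.minimum(
                np.minimum(prev[:, j], cur[:, j - 1]) + indel_cost, substitution)
        prev = cur
    # Read each line's distance at the column matching its true length.
    lengths = np.array([len(s) for s in lines])
    return prev[np.arange(B), lengths]

print(batch_edit_distance("kalevala", ["kalevola", "kantele", "kalevala"]))
```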
Shlomo Tannor ; Nachum Dershowitz ; Moshe Lavee.
Midrash collections are complex rabbinic works that consist of text in multiple languages, which evolved through long processes of unstable oral and written transmission. Determining the origin of a given passage in such a compilation is not always straightforward and is often a matter of dispute among scholars, yet it is essential for scholars' understanding of the passage and its relationship to other texts in the rabbinic corpus. To help solve this problem, we propose a system for classification of rabbinic literature based on its style, leveraging recent advances in natural language processing for Hebrew texts. Additionally, we demonstrate how this method can be applied to uncover lost material from a specific midrash genre, Tanḥuma-Yelammedenu, that has been preserved in later anthologies.
Yuri Bizzoni ; Pascale Moreira ; Mads Rosendahl Thomsen ; Kristoffer L. Nielbo.
In the few works that have used NLP to study literary quality, sentiment and emotion analysis have often been considered valuable sources of information. At the same time, the idea that the nature and polarity of the sentiments expressed by a novel might have something to do with its perceived quality seems limited at best. In this paper, we argue that the fractality of narratives, specifically the long-term memory of their sentiment arcs, rather than their simple shape or average valence, might play an important role in the perception of literary quality by a human audience. In particular, we argue that such a measure can help distinguish Nobel-winning writers from control groups in a recent corpus of English-language novels. To test this hypothesis, we present the results from two studies: (i) a probability distribution test, where we compute the probability of seeing a title from a Nobel laureate at different levels of arc fractality; (ii) a classification test, where we use several machine learning algorithms to measure the predictive power of both sentiment arcs and their fractality measure. Lastly, we perform another experiment to examine whether arc fractality may be used to distinguish more or less popular works within the Nobel canon itself, looking at the probability of higher Goodreads ratings at different levels of arc fractality. Our findings seem to indicate that despite the competitive and complex nature of the task, the populations of Nobel and non-Nobel […]
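The long-term memory of a sentiment arc is commonly summarised by a Hurst exponent estimated via detrended fluctuation analysis (DFA); the sketch below shows one standard DFA estimate under that assumption (the paper's exact estimator may differ). It assumes `arc` is a per-segment sentiment series of at least a few hundred points.

```python
import numpy as np

def hurst_dfa(arc, scales=None):
    """Hurst exponent of a sentiment arc via detrended fluctuation analysis.
    H ~ 0.5 means no memory; H > 0.5 indicates long-term persistence."""
    x = np.cumsum(arc - np.mean(arc))  # integrated profile of the series
    n = len(x)
    if scales is None:  # window sizes, log-spaced between 8 and n/4
        scales = np.unique(np.logspace(np.log10(8), np.log10(n // 4), 12).astype(int))
    fluctuations = []
    for s in scales:
        n_seg = n // s
        segments = x[:n_seg * s].reshape(n_seg, s)
        t = np.arange(s)
        # Detrend every window with its own least-squares line.
        coeffs = np.polyfit(t, segments.T, 1)          # (2, n_seg)
        trend = coeffs[0][:, None] * t + coeffs[1][:, None]
        fluctuations.append(np.sqrt(np.mean((segments - trend) ** 2)))
    # H is the slope of log-fluctuation against log-scale.
    h, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return h
```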
Christophe Tuffery.
This article presents the results of research on an archaeological archive. Field recording documents from the Rivaux site in France, which was excavated from the 1970s to the 1990s, were exploited. After digitising a set of field notebook pages, the author developed an application, called Archeotext, which allows users to transcribe and georeference these documents. Some of the results obtained show new ways of exploiting this type of archive using methods and techniques from the digital humanities.
Section: Ancient studies and digital humanities
Takumi Nishi.
This study analyses emotion-associated language in the UK Conservative and Labour party general election manifestos between 2000 and 2019. While previous research has shown a general correlation between ideological positioning and overlap of public policies, there are still conflicting results regarding sentiment in such manifestos. Using new data, we show how valence levels can be swayed by a party's status within government: incumbent parties present a higher frequency of positive emotion-associated words, while negative emotion-associated words are more prevalent in opposition parties. We also demonstrate that parties with ideological similitude use positive language prominently, further adding to the literature on the relationship between sentiment and party status.
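A minimal sketch of the frequency comparison behind such a claim, assuming tokenised manifestos and seed lists of positive and negative emotion-associated words (for instance, drawn from an emotion lexicon):

```python
def valence_rates(tokens, positive_words, negative_words):
    """Per-1000-token rates of positive and negative emotion-associated words,
    comparable across manifestos of different lengths."""
    n = len(tokens)
    pos = sum(t.lower() in positive_words for t in tokens)
    neg = sum(t.lower() in negative_words for t in tokens)
    return 1000 * pos / n, 1000 * neg / n

# Usage sketch: compare incumbent vs. opposition manifestos (assumed token lists).
print(valence_rates(incumbent_tokens, positive_words, negative_words))
print(valence_rates(opposition_tokens, positive_words, negative_words))
```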
Simon Gonzalez.
Phraseology studies have been enhanced by Corpus Linguistics, which has become an interdisciplinary field in which current technologies play an important role. Computational tools implemented over the last decades have given positive results in the identification of phrases in different languages. One specific technology that has impacted these studies is social media. As researchers, we have turned our attention to collecting data from these platforms, which comes with great advantages and its own challenges. One of the challenges is how we design and build corpora relevant to the questions emerging in this type of language expression. This has been approached from different angles, but one approach that has given invaluable outputs is the building of linguistic corpora with online web applications. In this paper, we take a multidimensional approach to the collection, design, and deployment of a phraseology corpus for Latin American Spanish from Twitter data, extracting features using NLP techniques and presenting it in an interactive online web application. We expect to contribute to the methodologies used in Corpus Linguistics in the current technological age. Finally, we make this tool publicly available for any researcher interested both in the data itself and in the technological tools developed here.
Section: Digital humanities in languages
Miu Nicole Takagi.
In developed nations, assassinations are rare, and thus the impact of such acts on the electoral and political landscape is understudied. In this paper, we focus on Twitter data to examine the effects of the assassination of Japan's former Prime Minister Abe on the Japanese House of Councillors elections in 2022. We utilize sentiment analysis and emotion detection, together with topic modeling, on over 2 million tweets and compare them against tweets from previous election cycles. Our findings indicate that Twitter sentiment was negatively impacted by the event in the short term and that the social media attention span has shortened. We also discuss how "necropolitics" affected the outcome of the elections in favor of the deceased's party: there seems to have been an effect of Abe's death on the election outcome, though the findings warrant further investigation for conclusive results.
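The abstract does not name a specific topic-modeling implementation; as an assumed stand-in, the sketch below fits a 20-topic LDA model with scikit-learn to a list of preprocessed tweet texts (`tweets`, assumed available) and prints each topic's top terms.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# tweets: list of preprocessed tweet texts (assumed available).
vectorizer = CountVectorizer(max_df=0.5, min_df=10)  # drop near-ubiquitous and rare terms
X = vectorizer.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=20, random_state=0)
lda.fit(X)

# Print the eight highest-weight terms per topic for inspection.
terms = vectorizer.get_feature_names_out()
for k, component in enumerate(lda.components_):
    top_terms = [terms[i] for i in component.argsort()[-8:][::-1]]
    print(k, top_terms)
```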
Patrick Andrist ; Tobias Englmeier ; Saskia Dirkse.
In an era when more and more manuscript scholarship is taking place on the internet, through digital manuscripts, databases and electronic publishing, there is a corresponding and growing need among scholars for flexible and creative ways to create clear and coherent online manuscript descriptions. This article presents a prototype of a tool, currently under development at the Ludwig-Maximilians-Universität, Munich, which aims to streamline the description process with a view to facilitating comparisons between manuscripts from different cultural areas. Presently, the tool is being developed for use with biblical manuscripts, and the article outlines the challenges that come with creating and comparing this kind of material, which is often similar in terms of content but very diverse as to its structure. It then offers solutions that allow users to describe and compare the manuscripts at differing levels of granularity. Lastly, the article sets forth a pattern of use that could be transposed onto areas of manuscript studies beyond biblical manuscripts and offers perspectives for the tool's wider application.
Matthias Gille Levenson.
Towards a general open dataset and models for late medieval Castilian text recognition (HTR/OCR). Accepted at the Journal of Data Mining and Digital Humanities (2023), Special Issue: Historical documents and automatic text recognition, eds. Ariane Pinche and Peter Stokes; pending final revisions. DOI: 10.5281/zenodo.7387376. Link to the data: https://doi.org/10.5281/zenodo.7386489
Ariane Pinche.
In the Humanities, the emergence of digital methods has opened up research questions to quantitative analysis. This is why HTR technology is increasingly involved in humanities research projects, following precursors such as the Himanis project. However, many research teams have limited resources, either financially or in terms of their expertise in artificial intelligence. It may therefore be difficult to integrate handwritten text recognition into their project pipeline if they need to train a model or to create data from scratch. The goal here is not to explain how to build or improve a new HTR engine, nor to find a way to automatically align a pre-existing corpus with an image to quickly create ground truths for training. This paper aims to help humanists easily develop an HTR model for medieval manuscripts and create and gather training data while understanding the issues underlying their choices. The objective is also to show the importance of constituting consistent data as a prerequisite for pooling datasets and for training efficient HTR models. We present an overview of our work and experiments in the CREMMALab project (2021-2022), showing first how we ensure the consistency of the data and then how we have developed a generic model for medieval French manuscripts from the 13th to the 15th century, ready to be shared (more than 94% accuracy) and/or fine-tuned by other projects.
Denisa Butnaru.
The aim of the present contribution is to analyze how relations of loyalty emerge between researcher and researched during ethnographic fieldwork, and to defend a perspective against the principle of open science. I discuss methodological issues with respect to my several years of multi-sited fieldwork experience in various labs, research centers and medical institutions, during which I inquired into the design and use of exoskeletal devices. Exoskeletal devices are technologies with three fields of application: rehabilitation, industry and the armed forces. Their invention is the subject of intense economic and scientific competition. Given these constraints, I was compelled to develop "loyalty strategies", one of which I call the "contract of silence". I associate this category with an ethnographic exercise in how to address one's interlocutors during fieldwork. I conceive of this process as a result of consciously withholding information obtained from interviewees that might endanger the position of the researcher in the field. Although a tacit contract with one's interlocutors during ethnographic fieldwork implies anonymity, certain sensitive fields and research situations require forms of self-censorship and control over published results. I associate these strategies with the fabrication of fieldwork secrecy.
Section: Data deluge: which skills for which data?
Beatrice Couture ; Farah Verret ; Maxime Gohier ; Dominique Deslandres.
The arrival of handwriting recognition technologies offers new possibilities for research in heritage studies. However, it is now necessary to reflect on the experiences and practices developed by research teams. Our use of the Transkribus platform since 2018 has led us to search for the most significant ways to improve the performance of our handwritten text recognition (HTR) models, which are trained to transcribe 17th-century French handwriting. This article therefore reports on the impact of creating transcription protocols, using the language model at full scale, and determining the best way to use base models in order to increase the performance of HTR models. Combining all of these elements can indeed increase the performance of a single model by more than 20% (reaching a Character Error Rate below 5%). This article also discusses some challenges regarding the collaborative nature of HTR platforms such as Transkribus and the ways researchers can share the data they generate while creating or training handwritten text recognition models.
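For reference, the Character Error Rate quoted here is the Levenshtein (edit) distance between the reference transcription and the automatic one, normalised by the reference length; a CER below 5% means fewer than five character errors per hundred reference characters. A minimal sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance between reference and hypothesis,
    divided by the reference length."""
    m, n = len(reference), len(hypothesis)
    previous = list(range(n + 1))
    for i in range(1, m + 1):
        current = [i] + [0] * n
        for j in range(1, n + 1):
            current[j] = min(
                previous[j] + 1,        # deletion
                current[j - 1] + 1,     # insertion
                previous[j - 1] + (reference[i - 1] != hypothesis[j - 1]),  # substitution
            )
        previous = current
    return previous[n] / max(m, 1)

print(cer("lettre du roy", "letre du ray"))  # 2 errors / 13 chars ~ 0.154
```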
Elpida Perdiki.
HTR (Handwritten Text Recognition) technologies have progressed enough to offer highly accurate results in recognising handwritten documents, even on a synchronous level. Despite state-of-the-art algorithms and software, historical documents (especially those written in Greek) remain a real-world challenge for researchers. A large number of unedited or under-edited works of Greek literature (ancient or Byzantine, especially the latter) exist to this day due to the complexity of producing critical editions. To critically edit a literary text, scholars need to pinpoint textual variation across several manuscripts, which requires fully (or at least partially) transcribed manuscripts. For a large manuscript tradition (i.e., a large number of manuscripts transmitting the same work), such a process can be a painstaking and time-consuming project. To that end, HTR algorithms that train AI models can significantly assist, even when not resulting in entirely accurate transcriptions. Deep learning models, though, require a critical mass of data to be effective. This, in turn, intensifies the same problem: big (transcribed) data require heavy loads of manual transcription as training sets. In the absence of such transcriptions, this study experiments with training sets of various sizes to determine the minimum amount of manual transcription needed to produce usable results. HTR models are trained through the Transkribus platform on manuscripts from multiple works of a single Byzantine author, […]
Section: Ancient studies and digital humanities
Sergio Torres Aguilar ; Vincent Jolivet.
Handwritten Text Recognition (HTR) techniques aim to accurately recognize sequences of characters in input manuscript images by training artificial intelligence models to capture historical writing features. Efficient HTR models can transform digitized manuscript collections into indexed and quotable corpora, providing valuable research insight for various historical inquiries. However, several challenges must be addressed, including the scarcity of relevant training corpora, the variability introduced by different scribal hands and writing scripts, and the complexity of page layouts. This paper presents two models and one cross-model approach for the automatic transcription of Latin and French medieval documentary manuscripts, particularly charters and registers, written between the 12th and 15th centuries and classified into two major writing scripts: Textualis (from the late 11th to the 13th century) and Cursiva (from the 13th to the 15th century). The architecture of the models is based on a Convolutional Recurrent Neural Network (CRNN) coupled with a Connectionist Temporal Classification (CTC) loss. The training and evaluation of the models, involving 120k lines of text and almost 1M tokens, were conducted using three available ground-truth corpora: the e-NDP corpus, the Alcar-HOME database and the Himanis project. This paper describes the training architecture and corpora used, while discussing the main training challenges, results, and potential applications of […]
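As an illustration of the named architecture, the sketch below is a minimal CRNN with a CTC loss in PyTorch: convolutional features over a line image, a bidirectional LSTM over the width axis, and per-timestep character logits. Layer sizes and the alphabet size are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal CRNN line recogniser: conv feature extractor, bidirectional
    LSTM over the width axis, per-timestep character logits for CTC."""
    def __init__(self, n_chars: int, img_height: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(64 * (img_height // 4), 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.head = nn.Linear(512, n_chars + 1)  # +1 for the CTC blank symbol

    def forward(self, x):                # x: (B, 1, H, W) line images
        f = self.cnn(x)                  # (B, 64, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # width -> time steps
        out, _ = self.rnn(f)
        return self.head(out).log_softmax(-1)           # (B, T, n_chars + 1)

model = CRNN(n_chars=120)                 # alphabet size is an assumption
ctc = nn.CTCLoss(blank=120, zero_infinity=True)
# Training step sketch: nn.CTCLoss expects (T, B, C) log-probabilities.
# log_probs = model(images).permute(1, 0, 2)
# loss = ctc(log_probs, targets, input_lengths, target_lengths)
```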
Thibault Clérice.
Layout analysis (the identification of zones and their classification) is, together with line segmentation, the first step in Optical Character Recognition and similar tasks. The ability to distinguish the main body of text from marginal text or running titles makes the difference between extracting the full text of a digitized book and producing noisy output. We show that most segmenters focus on pixel classification and that polygonization of this output has not been used as a target in the latest competitions on historical documents (ICDAR 2017 and onwards), despite being the focus in the early 2010s. We propose to shift, for efficiency, from pixel-classification-based polygonization to object detection using isothetic rectangles. We compare the output of Kraken and YOLOv5 in terms of segmentation and show that the latter severely outperforms the former on small datasets (1110 samples and below). We release two datasets for training and evaluation on historical documents, as well as a new package, YALTAi, which injects YOLOv5 into the segmentation pipeline of Kraken 4.1.
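A sketch of the underlying detection step, in the spirit of YALTAi's approach: a YOLOv5 model returns axis-aligned (isothetic) rectangles with a zone class for each region of the page. The weights path, input image and class names below are illustrative assumptions, not the released models.

```python
import torch

# Load a custom-trained YOLOv5 model via the documented torch.hub entry point.
model = torch.hub.load('ultralytics/yolov5', 'custom', path='layout_weights.pt')

results = model('page_scan.jpg')  # hypothetical digitized page image
# Each detection row is [x1, y1, x2, y2, confidence, class_index]:
# an isothetic rectangle plus a zone label such as main body or marginalia.
for x1, y1, x2, y2, confidence, cls in results.xyxy[0].tolist():
    print(model.names[int(cls)], round(confidence, 2), (x1, y1, x2, y2))
```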