Natallia Kokash ; Matteo Romanello ; Ernest Suyver ; Giovanni Colavizza.
The digital transformation of the scientific publishing industry has led to
dramatic improvements in content discoverability and information analytics.
Unfortunately, these improvements have not been uniform across research areas.
The scientific literature in the arts, humanities and social sciences (AHSS)
still lags behind, in part due to the scale of analog backlogs, the persisting
importance of national languages, and a publisher ecosystem made up of many small
or medium-sized enterprises. We propose a bottom-up approach to support publishers in
creating and maintaining their own publication knowledge graphs in the open
domain. We do so by releasing a pipeline able to extract structured information
from the bibliographies and indexes of AHSS publications, disambiguate,
normalize and export it as linked data. We test the proposed pipeline on
Brill's Classics collection, and release an implementation in open source for
further use and improvement.
Simon Gabay ; Thibault Clérice ; Christian Reul.
Machine learning begins with machine teaching: in the following paper, we present the data that we have prepared to kick-start the training of reliable OCR models for 17th-century prints written in French. The construction of a representative corpus is a major challenge: we need to gather documents from different decades and of different genres to cover as many sizes, weights and styles as possible. Because historical prints contain glyphs and typefaces that have now disappeared, transcription is a complex act, for which we present guidelines. Finally, we provide preliminary results based on these training data and experiments to improve them.
Section:
Dataset
Charles Swisher ; Lior Shamir.
The availability of quantitative text analysis methods has provided new ways
of analyzing literature in a manner that was not available in the
pre-information era. Here we apply comprehensive machine learning analysis to
the work of William Shakespeare. The analysis shows clear changes in the style
of writing over time, with the most significant changes in the sentence length,
frequency of adjectives and adverbs, and the sentiments expressed in the text.
Applying machine learning to make a stylometric prediction of the year of the
play shows a Pearson correlation of 0.71 between the actual and predicted year,
indicating that Shakespeare's writing style as reflected by the quantitative
measurements changed over time. Additionally, it shows that the stylometrics of
some of the plays is more similar to plays written either before or after the
year they were written. For instance, Romeo and Juliet is dated 1596, but is
more similar in stylometrics to plays written by Shakespeare after 1600. The
source code for the analysis is available for free download.
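The agreement between actual and predicted years reported above is a Pearson correlation. As a minimal sketch of how such a score is computed (this is not the authors' released code, and the year values below are invented purely for illustration), the coefficient can be written in plain Python:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the common 1/n factor cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values, NOT taken from the paper: actual years of five plays
# and the years a stylometric regressor might predict for them.
actual_years = [1592, 1596, 1600, 1605, 1611]
predicted_years = [1595, 1601, 1599, 1604, 1608]

r = pearson(actual_years, predicted_years)
print(f"Pearson r = {r:.2f}")
```

A value close to 1 would mean that the predicted years rise and fall with the actual years; a play whose prediction falls far from its actual year is one whose style resembles plays from another period.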
Wouter Haverals ; Mike Kestemont.
This study is devoted to two of the oldest known manuscripts in which the
oeuvre of the medieval mystical author Hadewijch has been preserved: Brussels,
KBR, 2879-2880 (ms. A) and Brussels, KBR, 2877-2878 (ms. B). On the basis of
codicological and contextual arguments, it is assumed that the scribe who
produced B used A as an exemplar. While the similarities in both layout and
content between the two manuscripts are striking, the present article seeks to
identify the differences. After all, regardless of the intention to produce a
copy that closely follows the exemplar, subtle linguistic variation is
apparent. Divergences relate to spelling conventions, but also to the way in
which words are abbreviated (and the extent to which abbreviations occur). The
present study investigates the spelling profiles of the scribes who produced
mss. A and B in a computational way. In the first part of this study, we will
present both manuscripts in more detail, after which we will consider prior
research carried out on scribal profiling. The current study both builds and
expands on Kestemont (2015). Next, we outline the methodology used to analyse
and measure the degree of scribal appropriation that took place when ms. B was
copied off the exemplar ms. A. After this, we will discuss the results
obtained, focusing on the scribal variation that can be found both at the level
of individual words and n-grams. To this end, we use machine learning to
identify the most distinctive features that […]
Emily Öhman ; Riikka Rossi.
We propose to use affect as a proxy for mood in literary texts. In this
study, we explore the differences in computationally detecting tone versus
detecting mood. Methodologically, we utilize affective word embeddings to look
at the affective distribution in different text segments. We also present a
simple yet efficient and effective method of enhancing emotion lexicons to take
both semantic shift and the domain of the text into account, producing
real-world congruent results closely matching both contemporary and modern
qualitative analyses.
Maciej Janicki.
The digitization of large archival collections of oral folk poetry in Finland and Estonia has opened possibilities for large-scale quantitative studies of intertextuality. As an initial methodological step in this direction, I present a method for pairwise line-by-line comparison of poems using the weighted sequence alignment algorithm (a.k.a. ‘weighted edit distance’). The main contribution of the paper is a novel description of the algorithm in terms of matrix operations, which allows for much faster alignment of a poem against the entire corpus by utilizing modern numeric libraries and GPU capabilities. This way we are able to compute pairwise alignment scores between all pairs from among a corpus of over 280,000 poems. The resulting table of over 40 million pairwise poem similarities can be used in various
ways to study the oral tradition. Some starting points for such research are sketched in the latter part of the article.
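For orientation, a weighted edit distance between two poems, treated as sequences of lines, can be sketched with the classic dynamic-programming recurrence. This is only the textbook formulation, not the matrix-operation description that the paper contributes. The line similarity function (difflib's `SequenceMatcher.ratio`), the unit cost for inserting or deleting a line, and the two sample "poems" are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def line_similarity(a, b):
    """Similarity of two lines in [0, 1]; an assumed stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def weighted_edit_distance(poem_a, poem_b, sim=line_similarity):
    """Edit distance over lines where substitution cost is 1 - similarity.

    Inserting or deleting a line costs 1 (an assumption of this sketch).
    """
    n, m = len(poem_a), len(poem_b)
    # dist[i][j] = cost of aligning the first i lines of poem_a
    # with the first j lines of poem_b.
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)
    for j in range(1, m + 1):
        dist[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = dist[i - 1][j - 1] + 1.0 - sim(poem_a[i - 1], poem_b[j - 1])
            delete = dist[i - 1][j] + 1.0
            insert = dist[i][j - 1] + 1.0
            dist[i][j] = min(substitute, delete, insert)
    return dist[n][m]


# Invented placeholder lines, not drawn from any corpus.
poem_1 = ["the sun rose over the lake", "a bird sang in the birch"]
poem_2 = ["the sun rose over the sea", "a bird sang in the birch", "and the day began"]

print(weighted_edit_distance(poem_1, poem_2))
```

Running this loop for every pair in a corpus of hundreds of thousands of poems would be slow in pure Python, which is the motivation for recasting the computation as matrix operations that numeric libraries and GPUs can batch.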
Shlomo Tannor ; Nachum Dershowitz ; Moshe Lavee.
Midrash collections are complex rabbinic works that consist of text in
multiple languages, which evolved through long processes of unstable oral and
written transmission. Determining the origin of a given passage in such a
compilation is not always straightforward and is often a matter of dispute
among scholars, yet it is essential for scholars' understanding of the passage
and its relationship to other texts in the rabbinic corpus. To help solve this
problem, we propose a system for classification of rabbinic literature based on
its style, leveraging recent advances in natural language processing for Hebrew
texts. Additionally, we demonstrate how this method can be applied to uncover
lost material from a specific midrash genre, Tanḥuma-Yelammedenu, that has
been preserved in later anthologies.
Yuri Bizzoni ; Pascale Moreira ; Mads Rosendahl Thomsen ; Kristoffer L. Nielbo.
In the few works that have used NLP to study literary quality, sentiment and emotion analysis have often been considered valuable sources of information. At the same time, the idea that the nature and polarity of the sentiments expressed by a novel might have something to do with its perceived quality seems limited at best. In this paper, we argue that the fractality of narratives, specifically the long-term memory of their sentiment arcs, rather than their simple shape or average valence, might play an important role in the perception of literary quality by a human audience. In particular, we argue that such a measure can help distinguish Nobel-winning writers from control groups in a recent corpus of English-language novels. To test this hypothesis, we present the results from two studies: (i) a probability distribution test, where we compute the probability of seeing a title from a Nobel laureate at different levels of arc fractality; (ii) a classification test, where we use several machine learning algorithms to measure the predictive power of both sentiment arcs and their fractality measure. Lastly, we perform another experiment to examine whether arc fractality may be used to distinguish more or less popular works within the Nobel canon itself, looking at the probability of higher Goodreads ratings at different levels of arc fractality. Our findings seem to indicate that despite the competitive and complex nature of the task, the populations of Nobel and non-Nobel […]
Christophe Tuffery.
This article presents the results of archaeological archive research. Field recording documents from the Rivaux site in France, which was excavated from the 1970s to the 1990s, were exploited. After digitising a set of field notebook pages, the author developed an application, called Archeotext, which allows these documents to be transcribed and georeferenced. Some of the results obtained show new ways of exploiting this type of archive by using certain methods and techniques of the digital humanities.
Section:
Sciences of Antiquity and digital humanities
Takumi Nishi.
The study involved the analysis of emotion-associated language in the UK
Conservative and Labour party general election manifestos between 2000 and 2019.
While previous research has shown a general correlation between ideological
positioning and overlap of public policies, there are still conflicting results
in matters of sentiments in such manifestos. Using new data, we present how
valence level can be swayed by party status within government with incumbent
parties presenting a higher frequency in positive emotion-associated words
while negative emotion-associated words are more prevalent in opposition
parties. We also demonstrate that parties with ideological similitude use
positive language prominently, further adding to the literature on the
relationship between sentiments and party status.
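The frequency of emotion-associated words discussed above can be illustrated with a simple lexicon count. This is a minimal sketch only: the two tiny word lists are hypothetical placeholders, not the lexicon used in the study, and the sample sentence is invented.

```python
import re

# Hypothetical toy lexicons, for illustration only.
POSITIVE_WORDS = {"hope", "secure", "prosperity", "opportunity", "strong"}
NEGATIVE_WORDS = {"fear", "crisis", "failure", "threat", "decline"}


def emotion_rates(text):
    """Share of tokens found in the positive and negative word lists."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"positive": 0.0, "negative": 0.0}
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    return {
        "positive": positive / len(tokens),
        "negative": negative / len(tokens),
    }


# Invented sample text, not a quotation from any manifesto.
sample = "We offer hope and opportunity after years of crisis."
print(emotion_rates(sample))
```

Comparing these rates between incumbent and opposition manifestos, election by election, would give the kind of contrast the abstract describes; a real analysis would need a validated emotion lexicon and proper tokenisation.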
Simon Gonzalez.
Phraseology studies have been enhanced by Corpus Linguistics, which has become an interdisciplinary field where current technologies play an important role in its development. Computational tools have been implemented in recent decades with positive results in the identification of phrases in different languages. One specific technology that has impacted these studies is social media. As researchers, we have turned our attention to collecting data from these platforms, which comes with great advantages and its own challenges. One of the challenges is the way we design and build corpora relevant to the questions emerging in this type of language expression. This has been approached from different angles, but one that has given invaluable outputs is the building of linguistic corpora with the use of online web applications. In this paper, we take a multidimensional approach to the collection, design, and deployment of a phraseology corpus for Latin American Spanish from Twitter data, extracting features using NLP techniques, and presenting it in an interactive online web application. We expect to contribute to the methodologies used for Corpus Linguistics in the current technological age. Finally, we make this tool publicly available to be used by any researcher interested in the data itself and also in the technological tools developed here.
Section:
Digital humanities in languages
Miu Nicole Takagi.
In developed nations assassinations are rare and thus the impact of such acts
on the electoral and political landscape is understudied. In this paper, we
focus on Twitter data to examine the effects of Japan's former Prime Minister
Abe's assassination on the Japanese House of Councillors elections in 2022. We
utilize sentiment analysis and emotion detection together with topic modeling
on over 2 million tweets and compare them against tweets during previous
election cycles. Our findings indicate that Twitter sentiments were negatively
impacted by the event in the short term and that social media attention span
has shortened. We also discuss how "necropolitics" affected the outcome of the
elections in favor of the deceased's party, meaning that there seems to have
been an effect of Abe's death on the election outcome, though the findings
warrant further investigation for conclusive results.