Nicolas Gutehrlé ; Iana Atanassova.
Background. In recent years, libraries and archives led important
digitisation campaigns that opened the access to vast collections of historical
documents. While such documents are often available as XML ALTO documents, they
lack information about their logical structure. In this paper, we address the
problem of Logical Layout Analysis applied to historical documents in French.
We propose a rule-based method, that we evaluate and compare with two
Machine-Learning models, namely RIPPER and Gradient Boosting. Our data set
contains French newspapers, periodicals and magazines, published in the first
half of the twentieth century in the Franche-Comté Region. Results. Our
rule-based system outperforms the two other models in nearly all evaluations.
It has especially better Recall results, indicating that our system covers more
types of every logical label than the other two models. When comparing RIPPER
with Gradient Boosting, we can observe that Gradient Boosting has better
Precision scores but RIPPER has better Recall scores. Conclusions. The
evaluation shows that our system outperforms the two Machine Learning models,
and provides significantly higher Recall. It also confirms that our system can
be used to produce annotated data sets that are large enough to envisage
Machine Learning or Deep Learning approaches for the task of Logical Layout
Analysis. Combining rules and Machine Learning models into hybrid systems could
potentially provide even better performances. […]
Section:
Digital humanities in languages
Gechuan Zhang ; Paul Nulty ; David Lillis.
The contextual word embedding model, BERT, has proved its ability on downstream tasks with limited quantities of annotated data. BERT and its variants help to reduce the burden of complex annotation work in many interdisciplinary research areas, for example, legal argument mining in digital humanities. Argument mining aims to develop text analysis tools that can automatically retrieve arguments and identify relationships between argumentation clauses. Since argumentation is one of the key aspects of case law, argument mining tools for legal texts are applicable to both academic and non-academic legal research. Domain-specific BERT variants (pre-trained with corpora from a particular background) have also achieved strong performance in many tasks. To our knowledge, previous machine learning studies of argument mining on judicial case law still heavily rely on statistical models. In this paper, we provide a broad study of both classic and contextual embedding models and their performance on practical case law from the European Court of Human Rights (ECHR). During our study, we also explore a number of neural networks when being combined with different embeddings. Our experiments provide a comprehensive overview of a variety of approaches to the legal argument mining task. We conclude that domain pre-trained transformer models have great potential in this area, although traditional embeddings can also achieve strong performance when combined with additional neural network […]
Enrique Manjavacas ; Lauren Fonteyn.
As large language models such as BERT are becoming increasingly popular in Digital Humanities (DH), the question has arisen as to how such models can be made suitable for application to specific textual domains, including that of 'historical text'. Large language models like BERT can be pretrained from scratch on a specific textual domain and achieve strong performance on a series of downstream tasks. However, this is a costly endeavour, both in terms of the computational resources as well as the substantial amounts of training data it requires. An appealing alternative, then, is to employ existing 'general purpose' models (pre-trained on present-day language) and subsequently adapt them to a specific domain by further pre-training. Focusing on the domain of historical text in English, this paper demonstrates that pre-training on domain-specific (i.e. historical) data from scratch yields a generally stronger background model than adapting a present-day language model. We show this on the basis of a variety of downstream tasks, ranging from established tasks such as Part-of-Speech tagging, Named Entity Recognition and Word Sense Disambiguation, to ad-hoc tasks like Sentence Periodization, which are specifically designed to test historically relevant processing.
Section:
Digital humanities in languages
Yuri Bizzoni ; Telma Peura ; Mads Thomsen ; Kristoffer Nielbo.
This article explores the sentiment dynamics present in narratives and their contribution to literary appreciation. Specifically, we investigate whether a certain type of sentiment development in a literary narrative correlates with its quality as perceived by a large number of readers. While we do not expect a story's sentiment arc to relate directly to readers' appreciation, we focus on its internal coherence as measured by its sentiment arc's level of fractality as a potential predictor of literary quality. To measure the arcs' fractality we use the Hurst exponent, a popular measure of fractal patterns that reflects the predictability or self-similarity of a time series. We apply this measure to the fairy tales of H.C. Andersen, using GoodReads' scores to approximate their level of appreciation. Based on our results we suggest that there might be an optimal balance between predictability and surprise in a sentiment arcs' structure that contributes to the perceived quality of a narrative text.
Moshe Stekel ; Amos Azaria ; Shai Gordin.
This paper presents ACCWSI (Attentive Context Clustering WSI), a method for Word Sense Induction, suitable for languages with limited resources. Pretrained on a small corpus and given an ambiguous word (a query word) and a set of excerpts that contain it, ACCWSI uses an attention mechanism for generating context-aware embeddings, distinguishing between the different senses assigned to the query word. These embeddings are then clustered to provide groups of main common uses of the query word. We show that ACCWSI performs well on the SemEval-2 2010 WSI task. ACCWSI also demonstrates practical applicability for shedding light on the meanings of ambiguous words in ancient languages, such as Classical Hebrew and Akkadian. In the near future, we intend to turn ACCWSI into a practical tool for linguists and historians.
Elissa Nakajima Wickham ; Emily Öhman.
This paper examines the shift in focus on content policies and user attitudes on the social media platform Reddit. We do this by focusing on comments from general Reddit users from five posts made by admins (moderators) on updates to Reddit Content Policy. All five concern the nature of what kind of content is allowed to be posted on Reddit, and which measures will be taken against content that violates these policies. We use topic modeling to probe how the general discourse for Redditors has changed around limitations on content, and later, limitations on hate speech, or speech that incites violence against a particular group. We show that there is a clear shift in both the contents and the user attitudes that can be linked to contemporary societal upheaval as well as newly passed laws and regulations, and contribute to the wider discussion on hate speech moderation.
Emily Öhman ; Riikka Rossi.
We propose to use affect as a proxy for mood in literary texts. In this
study, we explore the differences in computationally detecting tone versus
detecting mood. Methodologically we utilize affective word embeddings to look
at the affective distribution in different text segments. We also present a
simple yet efficient and effective method of enhancing emotion lexicons to take
both semantic shift and the domain of the text into account producing
real-world congruent results closely matching both contemporary and modern
qualitative analyses.
Takumi Nishi.
The study involved the analysis of emotion-associated language in the UK
Conservative and Labour party general election manifestos between 2000 to 2019.
While previous research have shown a general correlation between ideological
positioning and overlap of public policies, there are still conflicting results
in matters of sentiments in such manifestos. Using new data, we present how
valence level can be swayed by party status within government with incumbent
parties presenting a higher frequency in positive emotion-associated words
while negative emotion-associated words are more prevalent in opposition
parties. We also demonstrate that parties with ideological similitude use
positive language prominently further adding to the literature on the
relationship between sentiments and party status.
Miu Nicole Takagi.
In developed nations assassinations are rare and thus the impact of such acts
on the electoral and political landscape is understudied. In this paper, we
focus on Twitter data to examine the effects of Japan's former Primer Minister
Abe's assassination on the Japanese House of Councillors elections in 2022. We
utilize sentiment analysis and emotion detection together with topic modeling
on over 2 million tweets and compare them against tweets during previous
election cycles. Our findings indicate that Twitter sentiments were negatively
impacted by the event in the short term and that social media attention span
has shortened. We also discuss how "necropolitics" affected the outcome of the
elections in favor of the deceased's party meaning that there seems to have
been an effect of Abe's death on the election outcome though the findings
warrant further investigation for conclusive results.
Maciej Janicki.
The digitization of large archival collections of oral folk poetry in Finland and Estonia has opened possibilities for large-scale quantitative studies of intertextuality. As an initial methodological step in this direction, I present a method for pairwise line-by-line comparison of poems using the weighted sequence alignment algorithm (a.k.a. ‘weighted edit distance’). The main contribution of the paper is a novel description of the algorithm in terms of matrix operations, which allows for much faster alignment of a poem against the entire corpus by utilizing modern numeric libraries and GPU capabilities. This way we are able to compute pairwise alignment scores between all pairs from among a corpus of over 280,000 poems. The resulting table of over 40 million pairwise poem similarities can be used in various
ways to study the oral tradition. Some starting points for such research are sketched in the latter part of the article.
Simon Gonzalez.
Phraseology studies have been enhanced by Corpus Linguistics, which has become an interdisciplinary field where current technologies play an important role in its development. Computational tools have been implemented in the last decades with positive results on the identification of phrases in different languages. One specific technology that has impacted these studies is social media. As researchers, we have turned our attention to collecting data from these platforms, which comes with great advantages and its own challenges. One of the challenges is the way we design and build corpora relevant to the questions emerging in this type of language expression. This has been approached from different angles, but one that has given invaluable outputs is the building of linguistic corpora with the use of online web applications. In this paper, we take a multidimensional approach to the collection, design, and deployment of a phraseology corpus for Latin American Spanish from Twitter data, extracting features using NLP techniques, and presenting it in an interactive online web application. We expect to contribute to the methodologies used for Corpus Linguistics in the current technological age. Finally, we make this tool publicly available to be used by any researcher interested in the data itself and also on the technological tools developed here.
Section:
Digital humanities in languages
Shlomo Tannor ; Nachum Dershowitz ; Moshe Lavee.
Midrash collections are complex rabbinic works that consist of text in
multiple languages, which evolved through long processes of unstable oral and
written transmission. Determining the origin of a given passage in such a
compilation is not always straightforward and is often a matter of dispute
among scholars, yet it is essential for scholars' understanding of the passage
and its relationship to other texts in the rabbinic corpus. To help solve this
problem, we propose a system for classification of rabbinic literature based on
its style, leveraging recent advances in natural language processing for Hebrew
texts. Additionally, we demonstrate how this method can be applied to uncover
lost material from a specific midrash genre, Tan\d{h}uma-Yelammedenu, that has
been preserved in later anthologies.
Yuri Bizzoni ; Pascale Moreira ; Mads Rosendahl Thomsen ; Kristoffer L. Nielbo.
In the few works that have used NLP to study literary quality, sentiment and emotion analysis have often been considered valuable sources of information. At the same time, the idea that the nature and polarity of the sentiments expressed by a novel might have something to do with its perceived quality seems limited at best. In this paper, we argue that the fractality of narratives, specifically the longterm memory of their sentiment arcs, rather than their simple shape or average valence, might play an important role in the perception of literary quality by a human audience. In particular, we argue that such measure can help distinguish Nobel-winning writers from control groups in a recent corpus of English language novels. To test this hypothesis, we present the results from two studies: (i) a probability distribution test, where we compute the probability of seeing a title from a Nobel laureate at different levels of arc fractality; (ii) a classification test, where we use several machine learning algorithms to measure the predictive power of both sentiment arcs and their fractality measure. Lastly, we perform another experiment to examine whether arc fractality may be used to distinguish more or less popular works within the Nobel canon itself, looking at the probability of higher GoodReads’ ratings at different levels of arc fractality. Our findings seem to indicate that despite the competitive and complex nature of the task, the populations of Nobel and non-Nobel […]