2022

1. The effect of Facebook behaviors on the prediction of review helpfulness

Emna Ben-Abdallah ; Khouloud Boukadi.

Facebook reviews contain review and reviewer information along with a set of likes, comments, shares, and reactions, collectively called Facebook Behaviors (FBs). We extend existing research on review helpfulness to fit Facebook reviews by demonstrating that Facebook behaviors can impact review helpfulness. This study proposes a theoretical model that explains review helpfulness based on FBs and baseline features. The model is empirically validated using a real Facebook data set and different feature selection (FS) methods to determine how important each feature is for helpfulness prediction. Consequently, a combination of the most impactful features is identified based on a robust and effective model. In this context, the like and love behaviors deliver the best predictive performance. Furthermore, we employ different classification techniques and a set of influencer features. The proposed model achieves an accuracy of 0.925. The outcomes of the current study can be applied to develop a smart review ranking system for Facebook product pages.

2. Automatic medieval charters structure detection : A Bi-LSTM linear segmentation approach

Sergio Torres Aguilar ; Pierre Chastang ; Xavier Tannier.

This paper presents a model that automatically detects sections in medieval Latin charters. These legal sources are some of the most important sources for medieval studies, as they reflect economic and social dynamics as well as legal and institutional writing practices. An automatic linear segmentation can greatly facilitate charter indexation and speed up the recovery of evidence to support historical hypotheses by means of granular inquiries on these raw, rarely structured sources. Our model is based on a Bi-LSTM approach with a final CRF layer and was trained on a large, annotated collection of medieval charters (4,700 documents) from Lombard monasteries: the CDLM corpus (11th-12th centuries). The evaluation shows high performance in most sections on the test set and on an external evaluation corpus consisting of the Montecassino abbey charters (10th-12th centuries). We describe the architecture of the model and the main problems related to the treatment of medieval Latin and formulaic discourse, and we discuss some implications of the results in terms of record-keeping practices in the High Middle Ages.

3. Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction

Pit Schneider ; Yves Maurer.

Iterating with new and improved OCR solutions forces decisions about which candidates to target for reprocessing. This especially applies when the underlying data collection is of considerable size and rather diverse in terms of fonts, languages, periods of publication and, consequently, OCR quality. This article captures the efforts of the National Library of Luxembourg to support those targeting decisions. These decisions are crucial in order to guarantee low computational overhead and reduced quality degradation risks, combined with a more quantifiable OCR improvement. In particular, this work explains the methodology of the library with respect to text block level quality assessment. An extension of this technique, a regression model that takes into account the enhancement potential of a new OCR engine, is also presented. Both are promising approaches, especially for cultural institutions dealing with historical data of lower quality.

Rubrique : Humanités numériques en langues

4. Artificial colorization of digitized microfilms: a preliminary study

Thibault Clérice ; Ariane Pinche.

Many of the digitized manuscripts available online are actually digitized microfilms, a technology dating back to the 1930s. With the progress of artificial colorization, we hypothesize that microfilms could be colored with these recent technologies, and we test InstColorization. We train a model on an ad hoc dataset of 18,788 color images that are artificially grayscaled for this purpose. With promising results in terms of colorization but clear limitations due to the difference between artificially grayscaled images and "naturally" grayscaled microfilms, we evaluate the impact of this artificial colorization on two downstream tasks using Kraken: layout analysis and text recognition. Unfortunately, the results show little to no improvement, which limits the interest of artificial colorization of manuscripts in the computer vision domain.
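As an illustration of the artificial grayscaling step mentioned in this abstract, here is a minimal sketch in Python. The abstract does not say how the color images were converted, so the ITU-R BT.601 luma weights and the function names below are assumptions chosen for illustration, not the authors' pipeline.

```python
# Hypothetical sketch: convert RGB pixels to grayscale with BT.601 luma weights.
# The actual conversion used in the study is not stated in the abstract.

def to_gray(pixel):
    """Map one (R, G, B) pixel with 0-255 channels to a single 0-255 gray value."""
    red, green, blue = pixel
    luma = 0.299 * red + 0.587 * green + 0.114 * blue
    # Round to the nearest integer and clamp to the valid 8-bit range.
    return max(0, min(255, round(luma)))


def grayscale_image(image):
    """Convert an image given as rows of (R, G, B) tuples to rows of gray values."""
    return [[to_gray(pixel) for pixel in row] for row in image]
```

A training pair for a colorization model would then consist of the output of `grayscale_image` as input and the original color image as target.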

Rubrique : Vers un écosystème numérique : NLP. Infrastructure de corpus. Méthodes de récupération des textes et de calcul des similarités de textes

5. Linguistic Fingerprints on Translation's Lens

J.D. Porter ; Yulia Ilchuk ; Quinn Dombrowski.

What happens to the language fingerprints of a work when it is translated into another language? While translation studies has often prioritized concepts of equivalence (of form and function), and of textual function, digital humanities methodologies can provide a new analytical lens onto ways that stylistic traces of a text's source language can persist in a translated text. This paper presents initial findings of a project undertaken by the Stanford Literary Lab, which has identified distinctive grammatical features in short stories that have been translated into English. While the phenomenon of "translationese" has been well established particularly in corpus translation studies, we argue that digital humanities methods can be valuable for identifying specific traits for a vision of a world atlas of literary style.

Rubrique : Projet

6. Processing the structure of documents: Logical Layout Analysis of historical newspapers in French

Nicolas Gutehrlé ; Iana Atanassova.

Background. In recent years, libraries and archives have led important digitisation campaigns that opened access to vast collections of historical documents. While such documents are often available as XML ALTO documents, they lack information about their logical structure. In this paper, we address the problem of Logical Layout Analysis applied to historical documents in French. We propose a rule-based method that we evaluate and compare with two Machine Learning models, namely RIPPER and Gradient Boosting. Our data set contains French newspapers, periodicals and magazines published in the first half of the twentieth century in the Franche-Comté region. Results. Our rule-based system outperforms the two other models in nearly all evaluations. In particular, it has better Recall results, indicating that our system covers more types of every logical label than the other two models. When comparing RIPPER with Gradient Boosting, we observe that Gradient Boosting has better Precision scores but RIPPER has better Recall scores. Conclusions. The evaluation shows that our system outperforms the two Machine Learning models and provides significantly higher Recall. It also confirms that our system can be used to produce annotated data sets that are large enough to envisage Machine Learning or Deep Learning approaches for the task of Logical Layout Analysis. Combining rules and Machine Learning models into hybrid systems could potentially provide even better performance. […]
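The comparison in this abstract rests on per-label Precision and Recall. The following minimal sketch shows how the two scores are computed for one logical label; the label names and the function name are hypothetical, and the paper's own evaluation code is not reproduced here.

```python
# Hypothetical sketch: Precision and Recall for a single logical label.

def precision_recall(gold, predicted, label):
    """Return (precision, recall) for `label` over aligned gold/predicted lists."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == label and p == label)
    false_pos = sum(1 for g, p in pairs if g != label and p == label)
    false_neg = sum(1 for g, p in pairs if g == label and p != label)
    # Precision: share of predicted instances of the label that are correct.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: share of gold instances of the label that were found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```

A system with high Recall but lower Precision finds most blocks of a given label at the cost of more false alarms, which is the trade-off reported between RIPPER and Gradient Boosting.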

Rubrique : Humanités numériques en langues

7. Hate speech, Censorship, and Freedom of Speech: The Changing Policies of Reddit

Elissa Nakajima Wickham ; Emily Öhman.

This paper examines the shift in focus on content policies and user attitudes on the social media platform Reddit. We do this by focusing on comments from general Reddit users from five posts made by admins (moderators) on updates to Reddit Content Policy. All five concern the nature of what kind of content is allowed to be posted on Reddit, and which measures will be taken against content that violates these policies. We use topic modeling to probe how the general discourse for Redditors has changed around limitations on content, and later, limitations on hate speech, or speech that incites violence against a particular group. We show that there is a clear shift in both the contents and the user attitudes that can be linked to contemporary societal upheaval as well as newly passed laws and regulations, and contribute to the wider discussion on hate speech moderation.

8. Fractal Sentiments and Fairy Tales - Fractal scaling of narrative arcs as predictor of the perceived quality of Andersen's fairy tales

Yuri Bizzoni ; Telma Peura ; Mads Thomsen ; Kristoffer Nielbo.

This article explores the sentiment dynamics present in narratives and their contribution to literary appreciation. Specifically, we investigate whether a certain type of sentiment development in a literary narrative correlates with its quality as perceived by a large number of readers. While we do not expect a story's sentiment arc to relate directly to readers' appreciation, we focus on its internal coherence, as measured by the sentiment arc's level of fractality, as a potential predictor of literary quality. To measure the arcs' fractality we use the Hurst exponent, a popular measure of fractal patterns that reflects the predictability or self-similarity of a time series. We apply this measure to the fairy tales of H.C. Andersen, using GoodReads' scores to approximate their level of appreciation. Based on our results we suggest that there might be an optimal balance between predictability and surprise in a sentiment arc's structure that contributes to the perceived quality of a narrative text.
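To make the Hurst exponent more concrete, here is a minimal sketch that estimates it with classical rescaled-range (R/S) analysis. This is one standard estimator and not necessarily the one used by the authors; the function names and the `min_window` parameter are assumptions for illustration. The input would be a sentiment arc, that is, a list of per-sentence sentiment scores.

```python
import math

# Hypothetical sketch: Hurst exponent via rescaled-range (R/S) analysis.

def rescaled_range(chunk):
    """Return R/S for one window, or None if the window is constant."""
    n = len(chunk)
    mean = sum(chunk) / n
    running, cumulative = 0.0, []
    for value in chunk:
        running += value - mean          # cumulative deviation from the mean
        cumulative.append(running)
    spread = max(cumulative) - min(cumulative)
    std = math.sqrt(sum((v - mean) ** 2 for v in chunk) / n)
    return spread / std if std > 0 else None


def hurst_exponent(series, min_window=8):
    """Estimate H as the slope of log(mean R/S) against log(window size)."""
    points = []
    window = min_window
    while window <= len(series) // 2:
        ratios = []
        for start in range(0, len(series) - window + 1, window):
            rs = rescaled_range(series[start:start + window])
            if rs is not None:
                ratios.append(rs)
        if ratios:
            points.append((math.log(window), math.log(sum(ratios) / len(ratios))))
        window *= 2                      # double the window at each scale
    if len(points) < 2:
        raise ValueError("series too short for the chosen min_window")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    # Ordinary least-squares slope of the log-log relation.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return numerator / denominator
```

Values near 0.5 indicate an uncorrelated series, while values closer to 1 indicate a persistent, more predictable one. R/S estimates are known to be biased on short series, so results on short texts should be read with caution.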

9. Word Sense Induction with Attentive Context Clustering

Moshe Stekel ; Amos Azaria ; Shai Gordin.

This paper presents ACCWSI (Attentive Context Clustering WSI), a method for Word Sense Induction, suitable for languages with limited resources. Pretrained on a small corpus and given an ambiguous word (a query word) and a set of excerpts that contain it, ACCWSI uses an attention mechanism for generating context-aware embeddings, distinguishing between the different senses assigned to the query word. These embeddings are then clustered to provide groups of main common uses of the query word. We show that ACCWSI performs well on the SemEval-2 2010 WSI task. ACCWSI also demonstrates practical applicability for shedding light on the meanings of ambiguous words in ancient languages, such as Classical Hebrew and Akkadian. In the near future, we intend to turn ACCWSI into a practical tool for linguists and historians.

10. Enhancing Legal Argument Mining with Domain Pre-training and Neural Networks

Gechuan Zhang ; Paul Nulty ; David Lillis.

The contextual word embedding model BERT has proved its ability on downstream tasks with limited quantities of annotated data. BERT and its variants help to reduce the burden of complex annotation work in many interdisciplinary research areas, for example, legal argument mining in digital humanities. Argument mining aims to develop text analysis tools that can automatically retrieve arguments and identify relationships between argumentation clauses. Since argumentation is one of the key aspects of case law, argument mining tools for legal texts are applicable to both academic and non-academic legal research. Domain-specific BERT variants (pre-trained with corpora from a particular background) have also achieved strong performance in many tasks. To our knowledge, previous machine learning studies of argument mining on judicial case law still rely heavily on statistical models. In this paper, we provide a broad study of both classic and contextual embedding models and their performance on practical case law from the European Court of Human Rights (ECHR). During our study, we also explore a number of neural networks combined with different embeddings. Our experiments provide a comprehensive overview of a variety of approaches to the legal argument mining task. We conclude that domain pre-trained transformer models have great potential in this area, although traditional embeddings can also achieve strong performance when combined with additional neural network […]

11. Adapting vs. Pre-training Language Models for Historical Languages

Enrique Manjavacas ; Lauren Fonteyn.

As large language models such as BERT are becoming increasingly popular in Digital Humanities (DH), the question has arisen as to how such models can be made suitable for application to specific textual domains, including that of 'historical text'. Large language models like BERT can be pretrained from scratch on a specific textual domain and achieve strong performance on a series of downstream tasks. However, this is a costly endeavour, both in terms of the computational resources as well as the substantial amounts of training data it requires. An appealing alternative, then, is to employ existing 'general purpose' models (pre-trained on present-day language) and subsequently adapt them to a specific domain by further pre-training. Focusing on the domain of historical text in English, this paper demonstrates that pre-training on domain-specific (i.e. historical) data from scratch yields a generally stronger background model than adapting a present-day language model. We show this on the basis of a variety of downstream tasks, ranging from established tasks such as Part-of-Speech tagging, Named Entity Recognition and Word Sense Disambiguation, to ad-hoc tasks like Sentence Periodization, which are specifically designed to test historically relevant processing.

Rubrique : Humanités numériques en langues

12. La traduction littéraire automatique : Adapter la machine à la traduction humaine individualisée

Damien Hansen ; Emmanuelle Esperança-Rodier ; Hervé Blanchon ; Valérie Bada.

Neural machine translation and its adaptation to specific domains by means of specialized corpora have allowed this technology to become far more widely integrated than before into the profession and training of translators. While the neural paradigm (and deep learning in general) has thus made its way into sometimes unexpected domains, including some where creativity is essential, it is marked less by a phenomenal gain in performance than by massive use among the public and the debates it generates, many of which routinely invoke the literary case to (in)validate one observation or another. To assess the relevance of this technology, and in doing so move beyond the often heated rhetoric of opponents and proponents of machine translation, it is nevertheless necessary to put the tool to the test, in order to provide a concrete example of what a system trained specifically for the translation of literary works could produce. As part of a larger research project aiming to evaluate the help that computer tools can offer literary translators, this article therefore proposes an experiment in machine translation of prose, which has not been attempted for French since probabilistic systems and which joins a growing number of studies on the subject for other pairs of […]

Rubrique : V. L'apport des corpus