1. Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

Jean-Baptiste Camps ; Simon Gabay ; Paul Fièvre ; Thibault Clérice ; Florian Cafiero.
This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent lemmatiser based on neural networks and a CRF tagger allows to achieve accuracies beyond the current state-of-the art on the in-domain test, and proves to be robust during out-of-domain tests, i.e.up to 20th c.novels.
Section: Digital humanities in languages

2. Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

Pit Schneider.
Text line segmentation is one of the pre-stages of modern optical character recognition systems. The algorithmic approach proposed by this paper has been designed for this exact purpose. Its main characteristic is the combination of two different techniques, morphological image operations and horizontal histogram projections. The method was developed to be applied on a historic data collection that commonly features quality issues, such as degraded paper, blurred text, or presence of noise. For that reason, the segmenter in question could be of particular interest for cultural institutions, that want access to robust line bounding boxes for a given historic document. Because of the promising segmentation results that are joined by low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg, in the context of the initiative of reprocessing their historic newspaper collection. The general contribution of this paper is to outline the approach and to evaluate the gains in terms of accuracy and speed, comparing it to the segmentation algorithm bundled with the used open source OCR software.
Section: HistoInformatics

3. French vital records data gathering and analysis through image processing and machine learning algorithms

Cyprien Plateau-Holleville ; Enzo Bonnot ; Franck Gechter ; Laurent Heyberger.
Vital records are rich of meaningful historical data concerning city as well as countryside inhabitants that can be used, among others, to study former populations and then reveal the social, economic and demographic characteristics of those populations. However, these studies encounter a main difficulty for collecting the data needed since most of these records are scanned documents that need a manual transcription step in order to gather all the data and start exploiting it from a historical point of view. This step consequently slows down the historical research and is an obstacle to a better knowledge of the population habits depending on their social conditions. Therefore in this paper, we present a modular and self-sufficient analysis pipeline using state-of-the-art algorithms mostly regardless of the document layout that aims to automate this data extraction process.

4. Topic models do not model topics: epistemological remarks and steps towards best practices

Anna Shadrova.
The social sciences and digital humanities have recently adopted the machine learning technique of topic modeling to address research questions in their fields. This is problematic in a number of ways, some of which have not received much attention in the debate yet. This paper adds epistemological concerns centering around the interface between topic modeling and linguistic concepts and the argumentative embedding of evidence obtained through topic modeling. It concludes that topic modeling in its present state of methodological integration does not meet the requirements of an independent research method. It operates from relevantly unrealistic assumptions, is non-deterministic, cannot effectively be validated against a reasonable number of competing models, does not lock into a well-defined linguistic interface, and does not scholarly model topics in the sense of themes or content. These features are intrinsic and make the interpretation of its results prone to apophenia (the human tendency to perceive random sets of elements as meaningful patterns) and confirmation bias (the human tendency to perceptually prefer patterns that are in alignment with pre-existing biases). While partial validation of the statistical model is possible, a conceptual validation would require an extended triangulation with other methods and human ratings, and clarification of whether statistical distinctivity of lexical co-occurrence correlates with conceputal topics in any reliable way.

5. TraduXio Project: Latest Upgrades and Feedback

Philippe Lacour ; Aurélien Bénel.
TraduXio is a digital environment for computer assisted multilingual translation which is web-based, free to use and with an open source code. Its originality is threefold-whereas traditional technologies are limited to two languages (source/target), TraduXio enables the comparison of different versions of the same text in various languages; its concordancer provides relevant and multilingual suggestions through a classification of the source according to the history, genre and author; it uses collaborative devices (privilege management, forums, networks, history of modification, etc.) to promote collective (and distributed) translation. TraduXio is designed to encourage the diversification of language learning and to promote a reappraisal of translation as a professional skill. It can be used in many different ways, by very diverse kind of people. In this presentation, I will present the recent developments of the software (its version 2.1) and illustrate how specific groups (language teaching, social sciences, literature) use it on a regular basis. In this paper, I present the technology but concentrate more on the possible uses of TraduXio, thus focusing on translators' feedback about their experience when working in this digital environment in a truly collaborative way.
Section: Digital humanities in languages

6. Indigenous frameworks for data-intensive humanities: recalibrating the past through knowledge engineering and generative modelling.

Sydney Shep ; Marcus Frean ; Rhys Owen ; Rere-No-A-Rangi Pope ; Pikihuia Reihana ; Valerie Chan.
Identifying, contacting and engaging missing shareholders constitutes an enormous challenge for Māori incorporations, iwi and hapū across Aotearoa New Zealand. Without accurate data or tools to har-monise existing fragmented or conflicting data sources, issues around land succession, opportunities for economic development, and maintenance of whānau relationships are all negatively impacted. This unique three-way research collaboration between Victoria University of Wellington (VUW), Parininihi ki Waitotara Incorporation (PKW), and University of Auckland funded by the National Science Challenge | Science for Technological Innovation catalyses innovation through new digital humanities-inflected data science modelling and analytics with the kaupapa of reconnecting missing Māori shareholders for a prosperous economic, cultural, and socially revitalised future. This paper provides an overview of VUW's culturally-embedded social network approach to the project, discusses the challenges of working within an indigenous worldview, and emphasises the importance of decolonising digital humanities.
Section: HistoInformatics

7. Digital interfaces of historical newspapers: opportunities, restrictions and recommendations

Eva Pfanzelter ; Sarah Oberbichler ; Jani Marjanen ; Pierre-Carl Langlais ; Stefan Hechl.
Many libraries offer free access to digitised historical newspapers via user interfaces. After an initial period of search and filter options as the only features, the availability of more advanced tools and the desire for more options among users has ushered in a period of interface development. However, this raises a number of open questions and challenges. For example, how can we provide interfaces for different user groups? What tools should be available on interfaces and how can we avoid too much complexity? What tools are helpful and how can we improve usability? This paper will not provide definite answers to these questions, but it gives an insight into the difficulties, challenges and risks of using interfaces to investigate historical newspapers. More importantly, it provides ideas and recommendations for the improvement of user interfaces and digital tools.
Section: HistoInformatics

8. Character Segmentation in Asian Collector's Seal Imprints: An Attempt to Retrieval Based on Ancient Character Typeface

Kangying Li ; Biligsaikhan Batjargal ; Akira Maeda.
Collector's seals provide important clues about the ownership of a book. They contain much information pertaining to the essential elements of ancient materials and also show the details of possession, its relation to the book, the identity of the collectors and their social status and wealth, amongst others. Asian collectors have typically used artistic ancient characters rather than modern ones to make their seals. In addition to the owner's name, several other words are used to express more profound meanings. A system that automatically recognizes these characters can help enthusiasts and professionals better understand the background information of these seals. However, there is a lack of training data and labelled images, as samples of some seals are scarce and most of them are degraded images. It is necessary to find new ways to make full use of such scarce data. While these data are available online, they do not contain information on the characters' position. The goal of this research is to assist in obtaining more labelled data through user interaction and provide retrieval tools that use only standard character typefaces extracted from font files. In this paper, a character segmentation method is proposed to predict the candidate characters' area without any labelled training data that contain character coordinate information. A retrieval-based recognition system that focuses on a single character is also proposed to support seal retrieval and […]
Section: HistoInformatics

9. Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

Raphaël Barman ; Maud Ehrmann ; Simon Clematide ; Sofia Ares Oliveira ; Frédéric Kaplan.
The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research work seeking to automatically process facsimiles and extract information thereby are multiplying with, as a first essential step, document layout analysis. If the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain with, among others, the use of finer-grained segmentation typologies and the consideration of complex, heterogeneous documents such as historical newspapers. Besides, most approaches consider visual features only, ignoring textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among others, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models in comparison to a strong visual baseline, as well as better robustness to high material variance.
Section: HistoInformatics

10. Plague Dot Text: Text mining and annotation of outbreak reports of the Third Plague Pandemic (1894-1952)

Arlene Casey ; Mike Bennett ; Richard Tobin ; Claire Grover ; Iona Walker ; Lukas Engelmann ; Beatrice Alex.
The design of models that govern diseases in population is commonly built on information and data gathered from past outbreaks. However, epidemic outbreaks are never captured in statistical data alone but are communicated by narratives, supported by empirical observations. Outbreak reports discuss correlations between populations, locations and the disease to infer insights into causes, vectors and potential interventions. The problem with these narratives is usually the lack of consistent structure or strong conventions, which prohibit their formal analysis in larger corpora. Our interdisciplinary research investigates more than 100 reports from the third plague pandemic (1894-1952) evaluating ways of building a corpus to extract and structure this narrative information through text mining and manual annotation. In this paper we discuss the progress of our ongoing exploratory project, how we enhance optical character recognition (OCR) methods to improve text capture, our approach to structure the narratives and identify relevant entities in the reports. The structured corpus is made available via Solr enabling search and analysis across the whole collection for future research dedicated, for example, to the identification of concepts. We show preliminary visualisations of the characteristics of causation and differences with respect to gender as a result of syntactic-category-dependent corpus statistics. Our goal is to develop structured accounts of some of the most […]
Section: HistoInformatics

11. The renewal of the digital humanities. An overview of the transformation of professions in the humanities and social sciences

Marie-Laure Massot ; Agnès Tricoche.
This article presents a study of the French-speaking digital humanities. It is based on the experience of two research engineers from the French National Center for Scientific Research (CNRS) who have been studying these issues for the last ten years. They conducted a survey at the École Normale Supérieure (ENS-Paris) which enabled them to draw up an overview of the transformation of the profession of humanities and social sciences research engineers in the context of the digital humanities. The Digit_Hum initiative, which they run in parallel with their respective activities at the ENS, also provided information for this overview thanks to its role as a space for discussion about the digital humanities along with training and structuring of this field at the ENS and the Université Paris Sciences & Lettres (PSL).
Section: Data deluge: which skills for wich data?

12. Publishing open-access bibliographical data on Ancient Greek and Latin texts: challenges, constraints, progression

Julie Giovacchini ; Laurent Capron.
We present here both some of our thoughts on methodology in relation to the specific constraints that complexify the ways of structuring and accessing bibliographical data in the Sciences of Antiquity, and the solutions adopted by the IPhiS-CIRIS project for dealing with these constraints. The project began in 2014 in a general scientific environment that was still being standardised and structured, with digital bibliographical resources in this disciplinary field becoming increasingly numerous, although of uneven quality and hard to access and/or private.
Section: Sciences of Antiquity and digital humanities