1. Indigenous frameworks for data-intensive humanities: recalibrating the past through knowledge engineering and generative modelling.

Sydney Shep ; Marcus Frean ; Rhys Owen ; Rere-No-A-Rangi Pope ; Pikihuia Reihana ; Valerie Chan.
Identifying, contacting and engaging missing shareholders constitutes an enormous challenge for Māori incorporations, iwi and hapū across Aotearoa New Zealand. Without accurate data or tools to harmonise existing fragmented or conflicting data sources, issues around land succession, opportunities for economic development, and maintenance of whānau relationships are all negatively impacted. This unique three-way research collaboration between Victoria University of Wellington (VUW), Parininihi ki Waitotara Incorporation (PKW), and University of Auckland funded by the National Science Challenge | Science for Technological Innovation catalyses innovation through new digital humanities-inflected data science modelling and analytics with the kaupapa of reconnecting missing Māori shareholders for a prosperous economic, cultural, and socially revitalised future. This paper provides an overview of VUW's culturally-embedded social network approach to the project, discusses the challenges of working within an indigenous worldview, and emphasises the importance of decolonising digital humanities.

2. How to read the 52.000 pages of the British Journal of Psychiatry? A collaborative approach to source exploration

Eva Andersen ; Maria Biryukov ; Roman Kalyakin ; Lars Wieneke.
Historians are confronted with an overabundance of sources that require new perspectives and tools to make use of large-scale corpora. Based on a use case from the history of psychiatry this paper describes the work of an interdisciplinary team to tackle these challenges by combining different NLP tools with new visual interfaces that foster the exploration of the corpus. The paper highlights several research challenges in the preparation and processing of the corpus and sketches new insights for historical research that were gathered due to the use of the tools.

3. Plague Dot Text: Text mining and annotation of outbreak reports of the Third Plague Pandemic (1894-1952)

Arlene Casey ; Mike Bennett ; Richard Tobin ; Claire Grover ; Iona Walker ; Lukas Engelmann ; Beatrice Alex.
The design of models that govern diseases in population is commonly built on information and data gathered from past outbreaks. However, epidemic outbreaks are never captured in statistical data alone but are communicated by narratives, supported by empirical observations. Outbreak reports discuss correlations between populations, locations and the disease to infer insights into causes, vectors and potential interventions. The problem with these narratives is usually the lack of consistent structure or strong conventions, which prohibit their formal analysis in larger corpora. Our interdisciplinary research investigates more than 100 reports from the third plague pandemic (1894-1952) evaluating ways of building a corpus to extract and structure this narrative information through text mining and manual annotation. In this paper we discuss the progress of our ongoing exploratory project, how we enhance optical character recognition (OCR) methods to improve text capture, our approach to structure the narratives and identify relevant entities in the reports. The structured corpus is made available via Solr enabling search and analysis across the whole collection for future research dedicated, for example, to the identification of concepts. We show preliminary visualisations of the characteristics of causation and differences with respect to gender as a result of syntactic-category-dependent corpus statistics. Our goal is to develop structured accounts of some of the most significant concepts that […]

4. Character Segmentation in Asian Collector's Seal Imprints: An Attempt to Retrieval Based on Ancient Character Typeface

Kangying Li ; Biligsaikhan Batjargal ; Akira Maeda.
Collector's seals provide important clues about the ownership of a book. They contain much information pertaining to the essential elements of ancient materials and also show the details of possession, its relation to the book, the identity of the collectors and their social status and wealth, amongst others. Asian collectors have typically used artistic ancient characters rather than modern ones to make their seals. In addition to the owner's name, several other words are used to express more profound meanings. A system that automatically recognizes these characters can help enthusiasts and professionals better understand the background information of these seals. However, there is a lack of training data and labelled images, as samples of some seals are scarce and most of them are degraded images. It is necessary to find new ways to make full use of such scarce data. While these data are available online, they do not contain information on the characters' position. The goal of this research is to assist in obtaining more labelled data through user interaction and provide retrieval tools that use only standard character typefaces extracted from font files. In this paper, a character segmentation method is proposed to predict the candidate characters' area without any labelled training data that contain character coordinate information. A retrieval-based recognition system that focuses on a single character is also proposed to support seal retrieval and matching. The […]

5. Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

Raphaël Barman ; Maud Ehrmann ; Simon Clematide ; Sofia Ares Oliveira ; Frédéric Kaplan.
The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research work seeking to automatically process facsimiles and extract information thereby are multiplying with, as a first essential step, document layout analysis. If the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain with, among others, the use of finer-grained segmentation typologies and the consideration of complex, heterogeneous documents such as historical newspapers. Besides, most approaches consider visual features only, ignoring textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among others, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models in comparison to a strong visual baseline, as well as better robustness to high material variance.

6. Digital interfaces of historical newspapers: opportunities, restrictions and recommendations

Eva Pfanzelter ; Sarah Oberbichler ; Jani Marjanen ; Pierre-Carl Langlais ; Stefan Hechl.
Many libraries offer free access to digitised historical newspapers via user interfaces. After an initial period of search and filter options as the only features, the availability of more advanced tools and the desire for more options among users have ushered in a period of interface development. However, this raises a number of open questions and challenges. For example, how can we provide interfaces for different user groups? What tools should be available on interfaces and how can we avoid too much complexity? What tools are helpful and how can we improve usability? This paper will not provide definite answers to these questions, but it gives an insight into the difficulties, challenges and risks of using interfaces to investigate historical newspapers. More importantly, it provides ideas and recommendations for the improvement of user interfaces and digital tools.

7. The expansion of isms, 1820-1917: Data-driven analysis of political language in digitized newspaper collections

Jani Marjanen ; Jussi Kurunmäki ; Lidia Pivovarova ; Elaine Zosa.
Words with the suffix -ism are reductionist terms that help us navigate complex social issues by using a simple one-word label for them. On the one hand they are often associated with political ideologies, but on the other they are present in many other domains of language, especially culture, science, and religion. This has not always been the case. This paper studies isms in a historical record of digitized newspapers from 1820 to 1917 published in Finland to find out how the language of isms developed historically. We use diachronic word embeddings and affinity propagation clustering to trace how new isms entered the lexicon and how they relate to one another over time. We are able to show how they became more common and entered more and more domains. Still, the uses of isms as traditions for political action and thinking stand out in our analysis.