Project presentations

Toward Automatic Typography Analysis: Serif Classification and Font Similarities

Syed Talal Wasim ; Romain Collaud ; Lara Défayes ; Nicolas Henchoz ; Mathieu Salzmann ; Delphine Ribes Lemay.
Whether a document is of historical or contemporary significance, typography plays a crucial role in its composition. From the early days of modern printing, typographic techniques have evolved and transformed, resulting in changes to the features of typography. By analyzing these features, we can gain insights into specific time periods, geographical locations, and messages conveyed through typography. Therefore, in this paper, we aim to investigate the feasibility of training a model to classify serif types without knowledge of the font and character. We also investigate how to train a vectorial-based image model able to group together fonts with similar features. Specifically, we compare the use of state-of-the-art image classification methods, such as the EfficientNet-B2 and the Vision Transformer Base model with different patch sizes, and the state-of-the-art fine-grained image classification method, TransFG, on the serif classification task. We also evaluate the use of the DeepSVG model to learn to group fonts with similar features. Our investigation reveals that fine-grained image classification methods are better suited for the serif classification task and that leveraging the character labels helps to learn more meaningful font similarities. This repository contains: - Paper published in the Journal of Data Mining and Digital Humanities: WasimEtAl_Toward_Automatic_Typography_Analysis__Serif_Classification_and_Font_Similarities.pdf - Two datasets: The first […]

HistText: An Application for leveraging large-scale historical textbases

Blouin Baptiste ; Cécile Armand ; Christian Henriot.
This paper introduces HistText, a pioneering tool devised to facilitate large-scale data mining in historical documents, specifically targeting Chinese sources. Developed in response to the challenges posed by the massive Modern China Textual Database, HistText emerges as a solution to efficiently extract and visualize valuable insights from billions of words spread across millions of documents. With a user-friendly interface, advanced text analysis techniques, and powerful data visualization capabilities, HistText offers a robust platform for digital humanities research. This paper explores the rationale behind HistText, underscores its key features, and provides a comprehensive guide for its effective utilization, thus highlighting its potential to substantially enhance the realm of computational humanities.

The Database of Byzantine Book Epigrams Project: Principles, Challenges, Opportunities

Rachele Ricceri ; Klaas Bentein ; Floris Bernard ; Antoon Bronselaer ; Els De Paermentier ; Pieterjan De Potter ; Guy De Tré ; Ilse De Vos ; Maxime Deforche ; Kristoffel Demoen et al.
This paper presents an overview of the history, conceptualization, and development of the Database of Byzantine Book Epigrams, an ongoing research project hosted at Ghent University. It also offers a glimpse into current and future research threads carried out within the project, with an eye on long-term sustainability. The first part of the paper pinpoints the position of DBBE within the broad field of Digital Humanities and addresses the question of how and why Byzantine metrical paratexts have been collected in an open-access online database. In the second part of the article, we describe the main features of the relational database currently available, both from the perspective of its users and from a technical point of view. The third section of the paper includes the description of four subprojects connected to DBBE, which at present involve the development of a graph database complementary to the relational one, the implementation of natural language pre-processing applied to the DBBE corpus, the linguistic analysis of formulaicity in book epigrams, and the exploration of the broad implications of the study of book epigrams for a better understanding of Byzantine book culture.

Processing Tools for Greek and Other Languages of the Christian Middle East

Bastien Kindt.
This paper presents some computer tools and linguistic resources of the GREgORI project. These developments allow automated processing of texts written in the main languages of the Christian Middle East, such as Greek, Arabic, Syriac, Armenian and Georgian. The main goal is to provide scholars with tools (lemmatized indexes and concordances) making corpus-based linguistic information available. It focuses on the questions of text processing, lemmatization, information retrieval, and bitext alignment.
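As a rough illustration of what a lemmatized index and a concordance are, the following minimal sketch builds both from a token list. It is not the GREgORI implementation: the function names, the hand-made `lemma_of` dictionary, and the English toy sentence are all assumptions made for the example.

```python
from collections import defaultdict


def lemmatized_index(tokens, lemma_of):
    """Map each lemma to the positions of its word forms in `tokens`.

    `lemma_of` is a hypothetical form-to-lemma dictionary; forms that are
    missing from it are indexed under themselves.
    """
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[lemma_of.get(token, token)].append(position)
    return dict(index)


def concordance(tokens, index, lemma, width=2):
    """Return each occurrence of `lemma` with `width` tokens of context per side."""
    return [
        " ".join(tokens[max(0, p - width): p + width + 1])
        for p in index.get(lemma, [])
    ]


# Toy data, invented for the example.
tokens = "the scribes wrote and the scribe writes".split()
lemma_of = {"scribes": "scribe", "wrote": "write", "writes": "write"}
index = lemmatized_index(tokens, lemma_of)
lines = concordance(tokens, index, "write", width=1)
```

Here `index["write"]` collects the positions of both "wrote" and "writes", which is the point of indexing by lemma rather than by surface form.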

Intertextual Pointers in the Text Alignment Network

Joel Kalvesmaki.
The Text Alignment Network (TAN) is a suite of XML encoding formats intended to serve anyone who wishes to encode, exchange, and study multiple versions of texts (e.g., translations, paraphrases), and annotations on those texts (e.g., quotations, word-for-word correspondences). This article focuses on TAN’s innovative intertextual pointers, which, I argue, provide an unprecedented level of readability, interoperability, and semantic context. Because TAN is a new, experimental format, this article provides a brief introduction to the format and concludes with comments on progress and future prospects.

Digital Greek Patristic Catena (DGPC). A brief presentation

Athanasios Paparnakis ; Constantinos Domouchtsis.
The project is to develop a database, which is planned to include all available information on the use of the Bible in the patristic works of Migne's Patrologia Graeca. Utilization of the data will be available through a web page equipped with necessary tools for developing data mining techniques and other methods of analysis. The main aim of the project is to revive the catenae, the ancient exegetical tool for biblical interpretation.

Computer-Assisted Processing of Intertextuality in Ancient Languages

Mark Hedges ; Anna Jordanous ; K. Faith Lawrence ; Charlotte Roueché ; Charlotte Tupman.
The production of digital critical editions of texts using TEI is now a widely-adopted procedure within digital humanities. The work described in this paper extends this approach to the publication of gnomologia (anthologies of wise sayings), which formed a widespread literary genre in many cultures of the medieval Mediterranean. These texts are challenging because they were rarely copied straightforwardly; rather, sayings were selected, reorganised, modified or re-attributed between manuscripts, resulting in a highly interconnected corpus for which a standard approach to digital publication is insufficient. Focusing on Greek and Arabic collections, we address this challenge using semantic web techniques to create an ecosystem of texts, relationships and annotations, and consider a new model – organic, collaborative, interconnected, and open-ended – of what constitutes an edition. This semantic web-based approach allows scholars to add their own materials and annotations to the network of information and to explore the conceptual networks that arise from these interconnected sayings.

Bioinformatics and Classical Literary Study

Pramit Chaudhuri ; Joseph P. Dexter.
This paper describes the Quantitative Criticism Lab, a collaborative initiative between classicists, quantitative biologists, and computer scientists to apply ideas and methods drawn from the sciences to the study of literature. A core goal of the project is the use of computational biology, natural language processing, and machine learning techniques to investigate authorial style, intertextuality, and related phenomena of literary significance. As a case study in our approach, here we review the use of sequence alignment, a common technique in genomics and computational linguistics, to detect intertextuality in Latin literature. Sequence alignment is distinguished by its ability to find inexact verbal similarities, which makes it ideal for identifying phonetic echoes in large corpora of Latin texts. Although especially suited to Latin, sequence alignment in principle can be extended to many other languages.
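To make the idea of inexact matching concrete, here is a minimal sketch of local sequence alignment (the Smith-Waterman recurrence) applied to characters. It is not the Quantitative Criticism Lab's code: the function name and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score of `a` and `b`.

    Scores are clamped at zero, so the best-scoring region may begin and
    end anywhere in either string; this is what lets the method tolerate
    substitutions, insertions, and deletions inside a shared phrase.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                # start a new local region
                previous[j - 1] + substitution,   # align a[i-1] with b[j-1]
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# An altered phrase should still score well above an unrelated one.
close = local_alignment_score("arma virumque cano", "arma virosque cano")
far = local_alignment_score("arma virumque cano", "gallia est omnis")
```

Only the score is computed here; recovering the aligned passages themselves would require keeping the full matrix and tracing back from the best cell.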

QuotationFinder - Searching for Quotations and Allusions in Greek and Latin Texts and Establishing the Degree to Which a Quotation or Allusion Matches Its Source

Luc Herren.
The software programs generally used with the TLG (Thesaurus Linguae Graecae) and the CLCLT (CETEDOC Library of Christian Latin Texts) CD-ROMs are not well suited for finding quotations and allusions. QuotationFinder uses more sophisticated criteria as it ranks search results based on how closely they match the source text, listing search results with literal quotations first and loose verbal parallels last.
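The abstract does not spell out QuotationFinder's ranking criteria, so the sketch below only illustrates the general idea of ordering candidates from literal quotation to loose parallel. It swaps in the character-level similarity ratio from Python's standard `difflib` module as the measure; the function name and the sample passages are invented for the example.

```python
from difflib import SequenceMatcher


def rank_by_closeness(source: str, candidates: list[str]) -> list[tuple[float, str]]:
    """Sort candidate passages by similarity to `source`, closest first.

    Similarity is difflib's ratio in [0, 1], computed after lowercasing and
    collapsing whitespace. This is a stand-in measure, not QuotationFinder's.
    """
    def normalise(text: str) -> str:
        return " ".join(text.lower().split())

    reference = normalise(source)
    scored = [
        (SequenceMatcher(None, reference, normalise(candidate)).ratio(), candidate)
        for candidate in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Invented sample passages: one literal quotation and two looser parallels.
ranked = rank_by_closeness(
    "in principio erat verbum",
    ["verbum erat in mundo", "In principio erat verbum", "in principio fuit verbum"],
)
```

The literal quotation comes first with a ratio of 1.0, and the passage that merely shares vocabulary comes last.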

Dealing with all types of quotations (and their parallels) in a closed corpus: The methodology of the Project The literary tradition in the third and fourth centuries CE: Grammarians, rhetoricians and sophists as sources of Graeco-Roman literature

Lucía Rodríguez-Noriega.
The Project The literary tradition in the third and fourth centuries CE: Grammarians, rhetoricians and sophists as sources of Graeco-Roman literature (FFI2014-52808-C2-1-P) aims to trace and classify all types of quotations, both explicit (with or without mention of the author and/or title) and hidden, in a corpus comprising the Greek grammarians, rhetoricians and "sophists" of the third and fourth centuries CE. At the same time, we try to detect whether or not these are first-hand quotations, and if our quoting authors (28 in all) are, in turn, secondary sources for the same citations in later authors. We also study the philological (textual) aspects of the quotations in their context, and the problems of limits they sometimes pose. Finally, we are interested in the function of the quotation in the citing work. This is the first time that such a comprehensive study of this corpus has been attempted. This paper explains our methodology, and how we store all these data in our electronic card-file.

Editing New Testament Arabic Manuscripts in a TEI-base: fostering close reading in Digital Humanities

Claire Clivaz ; Sara Schulthess ; Martial Sankar.
If one is convinced that "quantitative research provides data not interpretation" [Moretti, 2005, 9], close reading should thus be considered as not only the necessary bridge between big data and interpretation but also the core duty of the Humanities. To test its potential in a neglected field – the Arabic manuscripts of the Letters of Paul of Tarsus – an enhanced, digital edition has been in development as a progression of a Swiss National Fund project. This short paper presents the development of this edition and perspectives regarding a second project. Based on the Edition Visualization Technology tool, the digital edition provides a transcription of the Arabic text, a standardized and vocalized version, as well as French translation with all texts encoded in TEI XML. Thanks to another Swiss National Foundation subsidy, a new research project on the unique New Testament, trilingual (Greek-Latin-Arabic) manuscript, the Marciana Library Gr. Z. 11 (379), 12th century, is currently underway. This project includes new features such as "Textlink", "Hotspot" and notes: HumaReC.