Elpida Perdiki - Preparing Big Manuscript Data for Hierarchical Clustering with Minimal HTR Training

jdmdh:10419 - Journal of Data Mining & Digital Humanities, December 20, 2023, Historical Documents and automatic text recognition - https://doi.org/10.46298/jdmdh.10419
Preparing Big Manuscript Data for Hierarchical Clustering with Minimal HTR Training

Authors: Elpida Perdiki 1

  • 1 Democritus University of Thrace

HTR (Handwritten Text Recognition) technologies have progressed enough to offer high-accuracy results in recognising handwritten documents, even on a synchronous level. Despite state-of-the-art algorithms and software, historical documents (especially those written in Greek) remain a real-world challenge for researchers. A large number of unedited or under-edited works of Greek literature (ancient or Byzantine, especially the latter) exist to this day because producing critical editions is so complex. To critically edit a literary text, scholars need to pinpoint textual variations across several manuscripts, which requires fully (or at least partially) transcribed manuscripts. For a large manuscript tradition (i.e., a large number of manuscripts transmitting the same work), such a process can be painstaking and time-consuming. To that end, HTR algorithms that train AI models can significantly assist, even when they do not produce entirely accurate transcriptions. Deep learning models, though, require large amounts of data to be effective. This, in turn, intensifies the same problem: big (transcribed) data require large volumes of manual transcription as training sets. In the absence of such transcriptions, this study experiments with training sets of various sizes to determine the minimum amount of manual transcription needed to produce usable results. HTR models are trained through the Transkribus platform on manuscripts from multiple works of a single Byzantine author, John Chrysostom. By gradually reducing the number of manually transcribed texts and by training mixed models from multiple manuscripts, economical transcription of large bodies of manuscripts (in the hundreds) can be achieved. Results of these experiments show that if the right combination of manuscripts is selected, and with the transfer-learning tools provided by Transkribus, the required training sets can be reduced by up to 80%.
Certain peculiarities of Greek manuscripts, which make automated cleaning of the resulting transcriptions easy, could further improve these results. The ultimate goal of these experiments is to produce a transcription with the minimum required accuracy (and therefore the minimum manual input) for text clustering. If we can accurately assess HTR learning and outcomes, we may find that less data could be enough. This case study proposes a solution for researching and editing authors and works that were popular enough to survive in hundreds (if not thousands) of manuscripts and are, therefore, infeasible for humans to evaluate.
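
To illustrate why imperfect transcriptions may still suffice for clustering, the following is a minimal Python sketch, not the study's own pipeline. It assumes a normalised character edit distance as the dissimilarity between transcriptions and average linkage as the merging rule; the abstract specifies neither. The function names and the toy strings (`ms_A` to `ms_D`, each pair differing by one simulated recognition error) are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def dissimilarity(a: str, b: str) -> float:
    """Edit distance divided by the longer string's length (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def average_linkage(texts: dict[str, str], n_clusters: int) -> list[set[str]]:
    """Agglomerative clustering: merge the two closest clusters until
    only n_clusters remain (assumed linkage rule: average)."""
    clusters = [{name} for name in texts]
    dist = {frozenset(pair): dissimilarity(texts[pair[0]], texts[pair[1]])
            for pair in combinations(texts, 2)}

    def linkage(c1: set[str], c2: set[str]) -> float:
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


# Hypothetical toy data: two "works", each with one noisy witness.
samples = {
    "ms_A": "in the beginning was the word",
    "ms_B": "in the begining was the word",
    "ms_C": "and the word was with god and the word was god",
    "ms_D": "and the word was wlth god and the word was god",
}
clusters = average_linkage(samples, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

In this toy case the single-character errors do not change which witnesses group together, which is the intuition behind aiming only for the minimum accuracy that clustering requires. A real experiment would need much longer texts and a dissimilarity measure chosen for the manuscript tradition at hand.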

Volume: Historical Documents and automatic text recognition
Section: Sciences of Antiquity and digital humanities
Published on: December 20, 2023
Accepted on: November 7, 2023
Submitted on: December 2, 2022
Keywords: Byzantine manuscripts, deep learning, HTR models, Transkribus, Big data, [SHS] Humanities and Social Sciences, [INFO.INFO-TT] Computer Science [cs]/Document and Text Processing, [SHS.STAT] Humanities and Social Sciences/Methods and statistics


Perdiki, E. (2023). List of manuscripts containing John Chrysostom’s Homilies and the relevant manual transcriptions (Versions 1.2, 1–) [dataset]. Zenodo. 10.5281/ZENODO.7681132
