Preparing Big Manuscript Data for Hierarchical Clustering with Minimal HTR Training

Elpida Perdiki

doi:10.46298/jdmdh.10419

jdmdh:10419 - Journal of Data Mining & Digital Humanities, December 20, 2023, Historical Documents and Automatic Text Recognition - https://doi.org/10.46298/jdmdh.10419

Authors: Elpida Perdiki ¹

1 Democritus University of Thrace

HTR (Handwritten Text Recognition) technologies have progressed enough to offer high-accuracy results in recognising handwritten documents, even on a synchronous level. Despite state-of-the-art algorithms and software, historical documents (especially those written in Greek) remain a real-world challenge for researchers. A large number of works of Greek literature (ancient or Byzantine, especially the latter) remain unedited or under-edited to this day because producing critical editions is so complex. To critically edit a literary text, scholars need to pinpoint textual variations across several manuscripts, which requires fully (or at least partially) transcribed manuscripts. For a large manuscript tradition (i.e., a large number of manuscripts transmitting the same work), such a process can be painstaking and time-consuming. Here, HTR algorithms that train AI models can help significantly, even when they do not produce entirely accurate transcriptions. Deep learning models, though, require large amounts of data to be effective. This, in turn, intensifies the same problem: big (transcribed) data require large volumes of manual transcription as training sets. In the absence of such transcriptions, this study experiments with training sets of various sizes to determine the minimum amount of manual transcription needed to produce usable results. HTR models are trained through the Transkribus platform on manuscripts from multiple works of a single Byzantine author, John Chrysostom. By gradually reducing the number of manually transcribed texts and by training mixed models from multiple manuscripts, economical transcriptions of large bodies of manuscripts (in the hundreds) can be achieved. Results of these experiments show that, if the right combination of manuscripts is selected and the transfer-learning tools provided by Transkribus are used, the required training sets can be reduced by up to 80%.
Certain peculiarities of Greek manuscripts, which allow easy automated cleaning of the resulting transcriptions, could further improve these results. The ultimate goal of these experiments is to produce a transcription with the minimum accuracy required (and therefore the minimum manual input) for text clustering. If we can accurately assess HTR learning and outcomes, we may find that less data is enough. This case study proposes a solution for researching and editing authors and works that were popular enough to survive in hundreds (if not thousands) of manuscripts and are, therefore, infeasible for humans to evaluate manually.
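
To illustrate why imperfect transcriptions can still be good enough for clustering, the following is a minimal sketch, not the pipeline used in the study. It groups toy "transcriptions" by agglomerative (hierarchical) clustering. The distance measure (Jaccard distance over character trigrams), the average-linkage rule, the function names, and the sample strings are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
from itertools import combinations


def trigrams(text: str) -> set[str]:
    """Return the set of character trigrams of a whitespace-normalised text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + 3] for i in range(len(norm) - 2)}


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |a & b| / |a | b|: 0.0 for identical sets, 1.0 for disjoint ones."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(texts: dict[str, str], n_clusters: int) -> list[set[str]]:
    """Merge the two closest clusters until only n_clusters remain."""
    grams = {name: trigrams(text) for name, text in texts.items()}
    clusters = [{name} for name in texts]

    def dist(c1: set[str], c2: set[str]) -> float:
        # Average pairwise distance between members of the two clusters.
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(jaccard_distance(grams[x], grams[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters


# Hypothetical toy witnesses: ms_B and ms_D imitate HTR output with a few
# misread characters; they are not taken from the study's dataset.
witnesses = {
    "ms_A": "εν αρχη ην ο λογος και ο λογος ην προς τον θεον",
    "ms_B": "εν αρχη ην ο λογοc και ο λογος ην προc τον θεον",
    "ms_C": "μακαριοι οι πτωχοι τω πνευματι οτι αυτων εστιν η βασιλεια",
    "ms_D": "μακαριοι οι πτωχοι τω πνευματι οτι αυτων εστιv η βασιλεια",
}

result = average_linkage(witnesses, n_clusters=2)
print(sorted(sorted(cluster) for cluster in result))
```

In this toy case the noisy witnesses still land in the same group as their clean counterparts, because a handful of misread characters changes only a few trigrams. How much noise a real manuscript tradition tolerates is exactly the empirical question the study addresses.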

Source: HAL:hal-03880102v4

Volume: Historical Documents and Automatic Text Recognition

Section: Sciences of Antiquity and Digital Humanities

Published on: December 20, 2023

Accepted on: November 7, 2023

Submitted on: December 2, 2022

Keywords: Byzantine manuscripts, deep learning, HTR models, Transkribus, Big data, [SHS]Humanities and Social Sciences, [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, [SHS.STAT]Humanities and Social Sciences/Methods and statistics

License: Attribution 4.0 International (CC BY 4.0)

Datasets

Reference

Perdiki, E. (2023). List of manuscripts containing John Chrysostom’s Homilies and the relevant manual transcriptions (Versions 1.2, 1–) [Dataset]. Zenodo. 10.5281/ZENODO.7681132
