Elpida Perdiki - Preparing Big Manuscript Data for Hierarchical Clustering with Minimal HTR Training

jdmdh:10419 - Journal of Data Mining & Digital Humanities, 20 December 2023, Historical Documents and Automatic Text Recognition - https://doi.org/10.46298/jdmdh.10419

Author: Elpida Perdiki 1

  • 1 Democritus University of Thrace

HTR (Handwritten Text Recognition) technologies have progressed enough to offer high-accuracy results in recognising handwritten documents, even contemporary ones. Despite state-of-the-art algorithms and software, historical documents (especially those written in Greek) remain a real-world challenge for researchers. A large number of works of Greek literature (ancient or Byzantine, especially the latter) survive unedited or under-edited to this day because of the complexity of producing critical editions. To critically edit a literary text, scholars need to pinpoint textual variations across several manuscripts, which requires fully (or at least partially) transcribed manuscripts. For a large manuscript tradition (i.e., a large number of manuscripts transmitting the same work), such a process can be a painstaking and time-consuming project. To that end, trained HTR models can assist significantly, even when they do not produce entirely accurate transcriptions. Deep learning models, though, require a substantial amount of data to be effective. This, in turn, compounds the original problem: large transcribed corpora demand heavy loads of manual transcription as training sets. In the absence of such transcriptions, this study experiments with training sets of various sizes to determine the minimum amount of manual transcription needed to produce usable results. HTR models are trained through the Transkribus platform on manuscripts from multiple works of a single Byzantine author, John Chrysostom. By gradually reducing the number of manually transcribed texts and by training mixed models from multiple manuscripts, economical transcription of large bodies of manuscripts (numbering in the hundreds) can be achieved. Results of these experiments show that if the right combination of manuscripts is selected, and with the transfer-learning tools provided by Transkribus, the required training sets can be reduced by up to 80%.
Certain peculiarities of Greek manuscripts, which allow the resulting transcriptions to be cleaned automatically with relative ease, could further improve these results. The ultimate goal of these experiments is to produce a transcription with the minimum required accuracy (and therefore the minimum manual input) for text clustering. If we can accurately assess HTR learning and outcomes, we may find that less data is enough. This case study proposes a solution for researching and editing authors and works that were popular enough to survive in hundreds (if not thousands) of manuscripts and are, therefore, infeasible to evaluate manually.
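The clustering goal described above can be sketched as agglomerative (single-linkage) clustering over pairwise dissimilarities between imperfect transcriptions. The sketch below is a minimal illustration of the idea, not the study's actual pipeline: the function names, the 0.3 threshold, the `difflib`-based distance, and the sample strings are all assumptions introduced here for demonstration.

```python
# Illustrative sketch: group noisy HTR transcriptions of the same passage.
# All names, thresholds, and sample strings are hypothetical, not from the study.
from difflib import SequenceMatcher

def distance(a: str, b: str) -> float:
    """Normalised dissimilarity between two transcriptions (0 = identical)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def single_linkage(texts, threshold=0.3):
    """Greedy agglomerative clustering: repeatedly merge the two closest
    clusters until no inter-cluster distance falls below the threshold."""
    clusters = [[i] for i in range(len(texts))]

    def cluster_dist(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(distance(texts[i], texts[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        d, a, b = min(
            (cluster_dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is still valid
    return clusters

# Witnesses 0 and 1 transmit the same reading (1 with minor HTR noise);
# witness 2 transmits a different text entirely.
witnesses = [
    "εν αρχη ην ο λογος και ο λογος ην προς τον θεον",
    "εν αρχη ην ο λογος και ο λογoς ην προς τον θεον",
    "εν αρχη εποιησεν ο θεος τον ουρανον και την γην",
]
print(single_linkage(witnesses))  # witnesses 0 and 1 end up in the same cluster
```

The key point for this study is that the distance function need not be computed on perfect transcriptions: small, uniform HTR noise barely moves the pairwise distances, so witnesses transmitting the same reading still cluster together.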


Volume: Historical Documents and Automatic Text Recognition
Section: Ancient Studies and Digital Humanities
Published: 20 December 2023
Accepted: 7 November 2023
Submitted: 2 December 2022
Keywords: Byzantine manuscripts, deep learning, HTR models, Transkribus, big data, [SHS] Humanities and Social Sciences, [INFO.INFO-TT] Computer Science/Document and Text Processing, [SHS.STAT] Humanities and Social Sciences/Methods and Statistics

Datasets

References
Perdiki, E. (2023). List of manuscripts containing John Chrysostom’s Homilies and the relevant manual transcriptions (Versions 1.2, 1–) [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.7681132
