Historical Documents and Automatic Text Recognition

Edited by Ariane Pinche (CIHAM, CNRS) and Peter Stokes (AOROC, École Pratique des Hautes Études – Université PSL)

With this special issue of the Journal of Data Mining and Digital Humanities (JDMDH), we bring together in a single volume several experiments, projects and reflections related to automatic text recognition on historical documents.

Many projects now include automatic text acquisition in their data processing chain. The integration of this technology into increasingly powerful processing chains has automated tasks in ways that affect the role of the researcher in the textual production process. This new data-intensive practice makes it urgent to collect and harmonise the corpora needed to build training sets, and also to make them available for use. This issue is an opportunity to propose articles that combine philological and technical questions in order to assess scientifically the use of automatic text recognition for ancient documents: its results, its contributions and the new practices it induces in the process of editing and exploring texts. We hope that practical aspects will be examined on this occasion, along with the methodological challenges raised by this technology and its impact on research data.

The special issue on Automatic Text Recognition (ATR) aims to provide a comprehensive overview of the use of ATR in the humanities in the early 2020s, particularly for historical documents. It combines engineering and philological perspectives and is addressed to both beginners and experienced users interested in launching projects with ATR. The collection covers a diverse range of approaches and topics, including data creation or collection for training generic models, meeting specific objectives, the technical architecture of handwritten text recognition (HTR) engines, segmentation methods, and image processing.

Table of Contents


PINCHE, Ariane, STOKES, Peter A., « Historical Documents and Automatic Text Recognition: Introduction », https://doi.org/10.46298/jdmdh.13247

1. ATR and research projects, corpus and model building

a. Research projects

COUTURE, Béatrice, VERRET, Farah, GOHIER, Maxime [et al.], « The challenges of HTR model training: Feedbacks from the project Donner le goût de l’archive à l'ère numérique », https://jdmdh.episciences.org/12556

CALVELLI, Lorenzo, BOSCHETTI, Federico and TOMMASI, Tatiana, « EpiSearch. Identifying Ancient Inscriptions in Epigraphic Manuscripts », https://doi.org/10.46298/jdmdh.10417

ROMEIN, C. Annemieke, HODEL, Tobias, GORDIJN, Femke [et al.], « Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Reusing Ground Truth Data, Referencing Models, and Acknowledging Contributions. Starting the Conversation on How We Could Get It Done », https://doi.org/10.46298/jdmdh.10403

PERDIKI, Elpida, « Preparing Big Manuscript Data for Hierarchical Clustering with Minimal HTR Training », https://doi.org/10.46298/jdmdh.10419

b. Corpus and model building

LEVENSON GILLE, Matthias, « Towards a general open dataset and model for late medieval Castilian text recognition (HTR/OCR). Paper », https://doi.org/10.46298/jdmdh.10416

PINCHE, Ariane, « Generic HTR Models for Medieval Manuscripts. The CREMMALab Project », https://jdmdh.episciences.org/11592

2. ATR, technical improvement and tools: image enhancement, segmentation, ATR engine architecture, etc.

a. Improvement of segmentation and ATR engine

TORRES AGUILAR, Sergio and JOLIVET, Vincent, « Handwritten Text Recognition for Documentary Medieval Manuscripts », https://doi.org/10.46298/jdmdh.10484

CLÉRICE, Thibault, « You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine », https://doi.org/10.46298/jdmdh.9806

b. Pre-treatment of sources and improving ATR

JACSONT, Pauline and LEBLANC, Elina, « Impact of Image Enhancement Methods on HTR Trainings with eScriptorium », https://doi.org/10.46298/jdmdh.10262

WEST, Graham, SWINDALL, Matthew I., KEENER, Ben [et al.], « An Approach for Noisy, Crowdsourced Datasets Utilizing Ensemble Modeling, “Human Softmax” Distributions, and Entropic Measures of Uncertainty », https://doi.org/10.46298/jdmdh.10297

3. List of the datasets and models cited in the issue

Boschetti F. episearch-htr. Published online November 23, 2022. Accessed July 27, 2023. https://github.com/vedph/episearch-htr

Clérice T. YALTAi: Segmonto Manuscript and Early Printed Book Dataset. Published online July 10, 2022. doi:10.5281/zenodo.6814770

Hodel T, Schoch D, Dängeli P. Handwritten Text Recognition Ground Truth Set: StABS Ratsbücher O10, Urfehdenbuch X. Published online August 2, 2021. doi:10.5281/zenodo.5153263

Jacsont P. Toponomasia : edition of cod. 174 of Bern Burgerbibliothek. Published online July 26, 2022. doi:10.5281/zenodo.7026585

Levenson MG. Towards a general open dataset and model for late medieval Castilian text recognition (HTR/OCR). Datasets and scripts. Published online December 1, 2022. doi:10.5281/zenodo.7389195

Perdiki E. List of manuscripts containing John Chrysostom’s Homilies and the relevant manual transcriptions. Published online February 27, 2023. doi:10.5281/zenodo.7681133

Pinche A, Gabay S, Leroy N, Christensen K. Données HTR incunables du 15e siècle. Published online March 22, 2023. Accessed July 27, 2023. https://github.com/Gallicorpora/HTR-incunable-15e-siecle

Pinche A, Gabay S, Leroy N, Christensen K. Données HTR manuscrits du 15e siècle. Published online March 22, 2023. Accessed July 27, 2023. https://github.com/Gallicorpora/HTR-MSS-15e-Siecle

Pinche A. Cremma Medieval. Published online June 2022. Accessed July 27, 2023. https://github.com/HTR-United/cremma-medieval

Torres Aguilar S, Jolivet V. Dataset and evaluation for HTR models for Latin and French Medieval Documentary Manuscripts. Published online January 10, 2023. doi:10.5281/zenodo.7401833

Torres Aguilar S, Jolivet V. HTR model for Latin and French Medieval Documentary Manuscripts (12th-15th). Published online January 18, 2023. doi:10.5281/zenodo.7547438

The Journal of Data Mining and Digital Humanities is an open-access, peer-reviewed journal: a first draft is posted as a preprint on arXiv or HAL, and peer review takes place post-publication.

Contact: ariane.pinche@cnrs.fr