Historical Documents and Automatic Text Recognition

With this special issue of the Journal of Data Mining and Digital Humanities (JDMDH), we bring together in a single volume several experiments, projects and reflections on automatic text recognition applied to historical documents. The issue is the outcome of an event held at the École nationale des chartes in Paris on June 23 and 24, 2022, which gathered scholars from various backgrounds to discuss the use of handwritten text recognition (HTR) and optical character recognition (OCR) in their research. Over the two days, participants raised problems of engineering, machine learning and infrastructure, and discussed technical subjects such as segmentation and the development of models tied to philological questions. The talks covered a wide range of documents from the 11th to the 20th century: manuscripts, archives, epigraphic materials and other documents, some in languages with their own specificities, such as Hebrew, Cham or ancient Greek.

1. Impact of Image Enhancement Methods on Automatic Transcription Trainings with eScriptorium

Pauline Jacsont ; Elina Leblanc.
This study stems from the Desenrollando el cordel (Untangling the cordel) project, which focuses on editing 19th-century Spanish prints. It evaluates the impact of image enhancement methods on the automatic transcription of documents that are of low quality in terms of both printing and digitisation. We compare different methods (binarisation, deblurring) and present the results obtained when training models with the Kraken tool. We demonstrate that binarisation methods give better results than the others, and that combining several techniques does not significantly improve transcription predictions. This study shows the significance of using image enhancement methods with Kraken. It paves the way for further experiments with larger and more varied corpora to help future projects design their automatic transcription workflow.
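For readers unfamiliar with binarisation, the sketch below illustrates the general idea in Python with NumPy. It is an assumption for illustration only: the abstract does not say which binarisation algorithms the authors compared, and Otsu's global thresholding is used here simply because it is a common baseline. The function names `otsu_threshold` and `binarise` are hypothetical and are not part of Kraken or eScriptorium.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return Otsu's global threshold for an 8-bit greyscale image.

    Illustrative only: not necessarily the method used in the study.
    """
    # Normalised histogram of the 256 possible grey levels.
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    omega = np.cumsum(prob)        # probability of the "dark" class up to each level
    mu = np.cumsum(prob * levels)  # cumulative mean up to each level
    mu_total = mu[-1]              # mean grey level of the whole image

    # Between-class variance for every candidate threshold.
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom
    # Thresholds that leave one class empty are not valid candidates.
    sigma_b = np.where(denom > 0, sigma_b, 0.0)

    return int(np.argmax(sigma_b))


def binarise(gray: np.ndarray) -> np.ndarray:
    """Map pixels above the Otsu threshold to white (255), the rest to black (0)."""
    t = otsu_threshold(gray)
    return np.where(gray > t, 255, 0).astype(np.uint8)


if __name__ == "__main__":
    # Synthetic "page": dark ink (50) on the left, light paper (200) on the right.
    page = np.full((4, 6), 200, dtype=np.uint8)
    page[:, :3] = 50
    print(binarise(page))
```

A global threshold of this kind tends to struggle with uneven lighting or stained paper, which is one reason comparisons such as the one reported in this article usually consider several enhancement methods rather than a single one.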