Simon Gabay ; Thibault Clérice ; Christian Reul - OCR17: Ground Truth and Models for 17th c. French Prints (and hopefully more)

jdmdh:6492 - Journal of Data Mining & Digital Humanities, June 28, 2023, 2023 - https://doi.org/10.46298/jdmdh.6492

Authors: Simon Gabay; Thibault Clérice; Christian Reul

Machine learning begins with machine teaching: in this paper, we present the data that we have prepared to kick-start the training of reliable OCR models for 17th century prints written in French. Building a representative corpus is a major challenge: we need to gather documents from different decades and of different genres to cover as many type sizes, weights and styles as possible. Because historical prints contain glyphs and typefaces that have since disappeared, transcription is a complex task, for which we present guidelines. Finally, we provide preliminary results based on these training data and experiments aimed at improving them.


Volume: 2023
Section: Dataset
Published on: June 28, 2023
Accepted on: June 28, 2023
Submitted on: May 20, 2020
Keywords: OCR, 17th c. French, Training data, Corpus building, Data paper, 17th century, Data, [SHS] Humanities and Social Sciences, [INFO] Computer Science [cs], [INFO.INFO-NE] Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], [SHS.HIST] Humanities and Social Sciences/History, [SHS.INFO] Humanities and Social Sciences/Library and information sciences, [SHS.LITT] Humanities and Social Sciences/Literature
