Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction

Pit Schneider; Yves Maurer

doi:10.46298/jdmdh.8561

jdmdh:8561 - Journal of Data Mining & Digital Humanities, November 30, 2022, 2022 - https://doi.org/10.46298/jdmdh.8561

Authors: Pit Schneider¹; Yves Maurer¹

¹ National Library of Luxembourg

Iterating with new and improved OCR solutions forces decisions about which candidates to target for reprocessing. This applies especially when the underlying data collection is large and diverse in terms of fonts, languages, and periods of publication, and consequently in OCR quality. This article describes the efforts of the National Library of Luxembourg to support those targeting decisions, which are crucial for keeping computational overhead low and reducing the risk of quality degradation while making the OCR improvement more quantifiable. In particular, this work explains the library's methodology for quality assessment at the text block level. As an extension of this technique, it also presents a regression model that takes into account the enhancement potential of a new OCR engine. Both are promising approaches, especially for cultural institutions dealing with historical data of lower quality.

Comment: Journal of Data Mining and Digital Humanities; Minor revision

Source: arXiv.org:2110.01661

Volume: 2022

Section: Digital humanities in languages

Published on: November 30, 2022

Accepted on: November 30, 2022

Submitted on: October 8, 2021

Keywords: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, I.2.7

License: Attribution 4.0 International (CC BY 4.0)
