Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

Pit Schneider

doi:10.46298/jdmdh.7277

jdmdh:7277 - Journal of Data Mining & Digital Humanities, 4 November 2021, 2021 - https://doi.org/10.46298/jdmdh.7277

Authors: Pit Schneider

Text line segmentation is one of the preprocessing stages of modern optical character recognition (OCR) systems. The algorithmic approach proposed in this paper was designed for exactly this purpose. Its main characteristic is the combination of two different techniques: morphological image operations and horizontal histogram projections. The method was developed for application to a historical data collection that commonly exhibits quality issues such as degraded paper, blurred text, or noise. For that reason, the segmenter could be of particular interest to cultural institutions that want access to robust line bounding boxes for a given historical document. Because of its promising segmentation results and low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg as part of an initiative to reprocess its historical newspaper collection. The general contribution of this paper is to outline the approach and to evaluate the gains in accuracy and speed compared with the segmentation algorithm bundled with the open source OCR software used.
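To make the two ingredients named in the abstract concrete, the following is a minimal sketch of how a morphological operation and a horizontal histogram projection can be chained to obtain line bounding boxes. It is not the algorithm described in the paper: the function names (`dilate_horizontal`, `segment_lines`), the parameters (`kernel_width`, `min_ink`), and the choice of a plain horizontal dilation are all assumptions made for illustration, and the input is assumed to be an already binarised page where ink pixels are nonzero.

```python
import numpy as np


def dilate_horizontal(binary, width):
    """Binary dilation with a 1 x width horizontal structuring element.

    Hypothetical helper: smears ink sideways so that characters and words
    on the same line merge into more continuous blobs.
    """
    binary = binary.astype(bool)
    pad = width // 2
    padded = np.pad(binary, ((0, 0), (pad, pad)), mode="constant")
    out = np.zeros_like(binary)
    n_cols = binary.shape[1]
    for shift in range(width):
        # OR together shifted copies of the image: a pixel is set if any
        # pixel inside the horizontal window around it is set.
        out |= padded[:, shift:shift + n_cols]
    return out


def segment_lines(binary, kernel_width=5, min_ink=1):
    """Return line boxes as (x0, y0, x1, y1), with exclusive end coordinates.

    Assumed illustrative pipeline: dilate, project onto the vertical axis,
    then cut the page wherever the projection drops below `min_ink`.
    """
    binary = binary.astype(bool)
    smeared = dilate_horizontal(binary, kernel_width)
    profile = smeared.sum(axis=1)  # horizontal projection: ink count per row
    is_text = profile >= min_ink

    boxes = []
    start = None
    for y, flag in enumerate(is_text):
        if flag and start is None:
            start = y  # a text band begins
        elif not flag and start is not None:
            boxes.append(_band_to_box(binary, start, y))
            start = None
    if start is not None:  # band touching the bottom edge of the page
        boxes.append(_band_to_box(binary, start, len(is_text)))
    return boxes


def _band_to_box(binary, y0, y1):
    """Tighten a row band [y0, y1) to the columns that contain original ink."""
    cols = np.flatnonzero(binary[y0:y1].any(axis=0))
    return (int(cols[0]), y0, int(cols[-1]) + 1, y1)


# Usage on a tiny synthetic page with two "text lines".
page = np.zeros((12, 20), dtype=np.uint8)
page[2:4, 3:10] = 1
page[7:10, 5:15] = 1
boxes = segment_lines(page)
```

On real scans the threshold `min_ink` and the kernel size would need tuning, and noisy or skewed pages would likely require more than this single projection pass.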

Comment: Journal of Data Mining and Digital Humanities; Small adjustments

Source: arXiv.org:2103.08922

Volume: 2021

Section: HistoInformatique

Published on: 4 November 2021

Accepted on: 4 November 2021

Submitted on: 18 March 2021

Keywords: Computer Science - Computer Vision and Pattern Recognition, I.4.6

License: Attribution 4.0 International (CC BY 4.0)
