Simon Gabay ; Ariane Pinche ; Kelly Christensen ; Jean-Baptiste Camps - SegmOnto: A Controlled Vocabulary to Describe and Process Digital Facsimiles

jdmdh:12689 - Journal of Data Mining & Digital Humanities, December 17, 2024 - https://doi.org/10.46298/jdmdh.12689
SegmOnto: A Controlled Vocabulary to Describe and Process Digital Facsimiles

Authors: Simon Gabay; Ariane Pinche; Kelly Christensen; Jean-Baptiste Camps

Our initiative aims to design SegmOnto, a controlled vocabulary for describing the layout of textual sources. Taking a physical rather than strictly semantic approach, it is designed as a pragmatic and generic typology that covers most Western historical documents rather than answering project-specific needs. Harmonising layout description has two objectives. First, it facilitates the pooling of annotated data and therefore the training of better models for page segmentation, a crucial preliminary step for text recognition. Second, it allows the development of a shared post-processing pipeline that transforms ALTO or PAGE files into standard DH formats while preserving, as far as possible, the link between the extracted information and the digital facsimile. To demonstrate that SegmOnto can meet both objectives, we aggregate data from multiple projects to train a layout analysis model, and we propose a prototype of a generic pipeline for converting ALTO XML files into TEI XML.
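The conversion idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the sample ALTO fragment, the function names `alto_to_tei` and `_zone`, and the labels `MainZone` and `DefaultLine` are assumptions made for illustration, and each ALTO element is assumed to carry a single tag reference. The sketch shows one way to keep the link to the facsimile: every ALTO block or line becomes a TEI `zone` with pixel coordinates, and the transcribed text points back to it through `facs`. The required `teiHeader` is omitted for brevity.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"
TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature ALTO file; IDs, labels, and text are illustrative only.
SAMPLE_ALTO = f"""<alto xmlns="{ALTO_NS}">
  <Tags>
    <OtherTag ID="T1" LABEL="MainZone"/>
    <OtherTag ID="T2" LABEL="DefaultLine"/>
  </Tags>
  <Layout>
    <Page ID="p1" WIDTH="1000" HEIGHT="1500">
      <PrintSpace>
        <TextBlock ID="b1" TAGREFS="T1" HPOS="100" VPOS="200" WIDTH="800" HEIGHT="1000">
          <TextLine ID="l1" TAGREFS="T2" HPOS="110" VPOS="210" WIDTH="780" HEIGHT="40">
            <String CONTENT="Lorem ipsum"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def _zone(parent, element, labels):
    """Append a TEI <zone> mirroring the ALTO element's label and bounding box."""
    x, y = int(element.get("HPOS")), int(element.get("VPOS"))
    w, h = int(element.get("WIDTH")), int(element.get("HEIGHT"))
    return ET.SubElement(parent, f"{{{TEI_NS}}}zone", {
        XML_ID: element.get("ID"),
        # Assumption: TAGREFS holds exactly one tag ID.
        "type": labels.get(element.get("TAGREFS"), "unknown"),
        "ulx": str(x), "uly": str(y),
        "lrx": str(x + w), "lry": str(y + h),
    })


def alto_to_tei(alto_xml):
    """Convert an ALTO string into a TEI fragment (facsimile + text, no header)."""
    ns = {"a": ALTO_NS}
    root = ET.fromstring(alto_xml)
    # Map tag IDs (e.g. "T1") to their vocabulary labels (e.g. "MainZone").
    labels = {t.get("ID"): t.get("LABEL")
              for t in root.iterfind("a:Tags/a:OtherTag", ns)}

    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    facsimile = ET.SubElement(tei, f"{{{TEI_NS}}}facsimile")
    text = ET.SubElement(tei, f"{{{TEI_NS}}}text")
    body = ET.SubElement(text, f"{{{TEI_NS}}}body")

    for page in root.iterfind(".//a:Page", ns):
        surface = ET.SubElement(facsimile, f"{{{TEI_NS}}}surface", {
            "ulx": "0", "uly": "0",
            "lrx": page.get("WIDTH"), "lry": page.get("HEIGHT"),
        })
        for block in page.iterfind(".//a:TextBlock", ns):
            block_zone = _zone(surface, block, labels)
            # The text block points back to its zone on the facsimile.
            ab = ET.SubElement(body, f"{{{TEI_NS}}}ab", {
                "type": block_zone.get("type"),
                "facs": "#" + block.get("ID"),
            })
            for line in block.iterfind("a:TextLine", ns):
                _zone(block_zone, line, labels)
                lb = ET.SubElement(ab, f"{{{TEI_NS}}}lb",
                                   {"facs": "#" + line.get("ID")})
                # The line's text follows its <lb/> milestone.
                lb.tail = " ".join(s.get("CONTENT", "")
                                   for s in line.iterfind("a:String", ns))
    return tei
```

Calling `ET.tostring(alto_to_tei(SAMPLE_ALTO), encoding="unicode")` serialises the result. Because the zone `type` is copied directly from the ALTO tag label, files annotated with the same vocabulary could in principle share this conversion step unchanged.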


Published on: December 17, 2024
Accepted on: December 17, 2024
Submitted on: December 14, 2023
Keywords: Controlled vocabulary, Layout analysis, Text encoding initiative, Document modelling, ACM: I.: Computing Methodologies/I.4: IMAGE PROCESSING AND COMPUTER VISION/I.4.0: General/I.4.0.1: Image processing software, ACM: I.: Computing Methodologies/I.5: PATTERN RECOGNITION/I.5.4: Applications/I.5.4.0: Computer vision, [INFO.INFO-TT] Computer Science [cs]/Document and Text Processing, [INFO.INFO-CV] Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], [SHS.HIST] Humanities and Social Sciences/History, [SHS.LITT] Humanities and Social Sciences/Literature
