Thibault Clérice - Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin

jdmdh:5581 - Journal of Data Mining & Digital Humanities, April 7, 2020, Volume 2020 - https://doi.org/10.46298/jdmdh.5581
Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin

Authors: Thibault Clérice

Tokenization of modern and old Western European languages seems fairly simple, as it relies mostly on the presence of markers such as spaces and punctuation. However, when dealing with old sources such as manuscripts written in scripta continua, ancient epigraphy, or medieval manuscripts, (1) such markers are mostly absent and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying convolutional encoding to characters, followed by linear classification into word-boundary or in-word-sequence, is shown to be effective at tokenizing such inputs. Additionally, the software is released with a simple interface for tokenizing a corpus or generating a training set.
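
The framing in the abstract, where segmentation becomes a choice between word-boundary and in-word-sequence, can be illustrated with a minimal sketch of how a training set might be derived from already segmented text. This is not the released software's interface: the function names `encode` and `decode` are hypothetical, and the convention assumed here (one label per character, with 1 marking the last character of a word) is only one possible choice and may differ from the one used in the article. The convolutional model itself is not shown.

```python
from typing import List, Tuple


def encode(segmented: str) -> Tuple[str, List[int]]:
    """Turn space-separated text into (scripta continua string, labels).

    Assumed convention (hypothetical): label 1 means a word boundary
    follows this character, label 0 means the word continues.
    """
    chars: List[str] = []
    labels: List[int] = []
    for word in segmented.split():
        for position, char in enumerate(word):
            chars.append(char)
            # Only the final character of each word is a boundary.
            labels.append(1 if position == len(word) - 1 else 0)
    return "".join(chars), labels


def decode(continua: str, labels: List[int]) -> str:
    """Rebuild segmented text from characters and per-character labels."""
    if len(continua) != len(labels):
        raise ValueError("one label is required per character")
    pieces: List[str] = []
    for char, label in zip(continua, labels):
        pieces.append(char)
        if label == 1:
            pieces.append(" ")
    # The last word also ends in a boundary, so drop the trailing space.
    return "".join(pieces).rstrip()


if __name__ == "__main__":
    text, gold = encode("arma virumque cano")
    print(text)
    print(gold)
    print(decode(text, gold))
```

In such a setup, a classifier would be trained to predict the label sequence from the unspaced string, and `decode` would turn its predictions back into tokens.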


Volume: 2020
Section: Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities
Published on: April 7, 2020
Accepted on: April 7, 2020
Submitted on: June 18, 2019
Keywords: convolutional network, scripta continua, tokenization, Old French, word segmentation, [SHS.LANGUE] Humanities and Social Sciences/Linguistics, [SHS.CLASS] Humanities and Social Sciences/Classical studies, [INFO] Computer Science [cs]
