Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin

Tokenization of modern and old Western European languages seems to be fairly simple, as it relies on the presence of markers such as spaces and punctuation. However, when dealing with old sources such as manuscripts written in scripta continua, ancient epigraphy or medieval manuscripts, (1) such markers are mostly absent and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying a convolutional encoding of characters followed by linear classification into word-boundary or in-word classes is shown to be effective at tokenizing such inputs. Additionally, the software created for this article (Boudams) is released with a simple interface for tokenizing a corpus or generating a training set.


Introduction
Tokenization of spaceless strings is a task that is specifically difficult for computers as compared to "whathumanscando". Scripta continua is a western Latin writing phenomenon in which words are not separated by spaces. It disappeared around the 8th century (see Zanna [1998]), but spacing nevertheless remained erratic in the writing of later centuries, as Stutzmann [2016] explains (cf. Figure 1). The fluctuation of space width, or simply the presence of spaces, becomes an issue for OCR. Indeed, in the context of text mining of HTR or OCR output, lemmatization and tokenization of medieval Western languages are quite often pre-processing steps for further research, sustaining analyses such as authorship attribution or corpus linguistics, or simply allowing full-text search.
It must be stressed that the difficulty inherent to segmentation is different for scripta continua than for languages such as Chinese, for which an already impressive amount of work has been done. Indeed, the dimensionality alone of the Chinese character set differs from that of Latin alphabets, and the important presence of compound words is definitely an issue for segmentation. Chinese word segmentation has lately been driven by deep learning methods: Chen et al. [2015] define a process based on an LSTM model, while Yu et al. [2019] use a bi-directional GRU with a CRF. Indeed, while the issue with Chinese seems to lie in the decomposition of relatively fixed characters, Old French and Medieval Latin present heavy spelling variation. In Camps et al. [2017], Camps notes, in the same corpus, the existence of no fewer than 29 spellings of the word "cheval" (horse in Old and Modern French), whose occurrence counts range from 3907 to 1. This makes a dictionary-based approach rather difficult, as it would have to rely on a high number of different spellings, making the computation highly complex.

Encoding of input and decoding
The model used in this study is based on traditional text input encoding where each character is represented as an index. The output of the model is a mask that needs to be applied to the input: in the mask, characters are classified either as word boundary or as word content (cf. Table 1).

Table 1. Example of an input string, its mask and the resulting output

    Input String     Ladamehaitees'enparti
    Mask String      xSxxxSxxxxxSxxxSxxxxS
    Output String    La dame haitee s'en parti

For evaluation purposes, and to reduce the number of input classes, two options for data transcoding were used: a lower-case normalization and a reduction to the ASCII character set (cf. Figure 2). On this latter point, several issues were encountered with the transliteration of medieval palaeographic characters that were part of the original datasets, as they are poorly handled by the unidecode Python package: unidecode simply removes characters it does not understand. A derivative package named mufidecode was built for this reason (Clérice [2019b]): it takes precedence over the unidecode equivalency tables when the character is a known entity of the Medieval Unicode Font Initiative (MUFI, Initiative [2015]). A sketch of this normalization step is given below.
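The following is a minimal sketch of these two normalization options, assuming mufidecode exposes a mufidecode() function analogous to unidecode.unidecode(); the exact API of the package may differ:

    # Illustrative sketch of the normalization options described above.
    from mufidecode import mufidecode  # MUFI-aware transliteration (Clérice [2019b])

    def normalize(text: str, lower: bool = True, to_ascii: bool = True) -> str:
        """Optionally lower-case a string and reduce it to the ASCII character set."""
        if lower:
            text = text.lower()
        if to_ascii:
            # MUFI entities (e.g. medieval abbreviation marks) are resolved
            # through the MUFI tables first; other characters fall back to
            # unidecode-style transliteration instead of being dropped.
            text = mufidecode(text)
        return text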
Examples were generated automatically and are between 2 and 8 words in length. In order to recreate the conditions of OCR noise, full stop characters were added randomly (20% chance) between words. In order to augment the dataset, words were randomly (10% chance) copied from one sample to the next. If a minimum size of 7 characters was not met in the input sample, another word was added to the chain, independently of the maximum number of words. The examples, however, could not contain more than 100 characters. The word lengths in the resulting corpora were expected to vary as shown by Figure 3; the sampling procedure itself is sketched below. The corpora contained 193 different characters when not normalized, among which certain MUFI characters appeared only a few hundred times.
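The generation rules above can be summarized by the following simplified reimplementation; function and parameter names are illustrative, not the actual Boudams code:

    import random

    def make_samples(words, min_words=2, max_words=8, noise=0.2,
                     copy=0.1, min_chars=7, max_chars=100):
        """Yield ground-truth samples following the generation rules above."""
        carried = []  # words copied over from the previous sample (augmentation)
        i = 0
        while i < len(words):
            sample = carried + words[i:i + random.randint(min_words, max_words)]
            i += len(sample) - len(carried)
            # Below the minimum size, keep adding words regardless of max_words.
            while len("".join(sample)) < min_chars and i < len(words):
                sample.append(words[i])
                i += 1
            # Each word has a 10% chance of being copied into the next sample.
            carried = [w for w in sample if random.random() < copy]
            # Simulate OCR noise: full stops inserted between words (20% chance).
            noisy = []
            for w in sample:
                noisy.append(w)
                if random.random() < noise:
                    noisy.append(".")
            ground_truth = " ".join(noisy)
            if len(ground_truth.replace(" ", "")) <= max_chars:
                yield ground_truth

The spaceless input string of each sample is then obtained by stripping the spaces from the ground truth, and the mask by marking the last character of each token with "S".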

Example of Outputs
The following inputs have been tagged with the CNN P model. Batches are constructed around the regular expression \W with the regex package, which explains why inputs such as ".i." are automatically tagged as " . i . " by the tool. The input was stripped of its spaces before tagging; only the ground truth is shown, for readability.
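For clarity, the decoding step that turns a predicted mask back into a spaced string can be sketched as follows (a minimal sketch; the model call producing the mask is omitted):

    def apply_mask(text: str, mask: str) -> str:
        """Reinsert spaces into a spaceless string according to a predicted mask.

        The mask follows the scheme of Table 1: "S" marks the last character
        of a word (a word boundary), "x" marks word content.
        """
        out = []
        for char, label in zip(text, mask):
            out.append(char)
            if label == "S":
                out.append(" ")
        return "".join(out).rstrip()

    # Example from Table 1:
    # apply_mask("Ladamehaitees'enparti", "xSxxxSxxxxxSxxxSxxxxS")
    # -> "La dame haitee s'en parti"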

Evaluation on Latin data
For the following evaluations, the same process was deployed: the CNN without position was evaluated against the baseline both on a test set composed of excerpts from the texts of the training set and on an out-of-domain corpus composed of unseen texts. Evaluation was performed on three different categories of Latin texts (edited classical Latin (1); medieval Latin of charters (2); epigraphic Latin (3)), as they show different levels of difficulty: all present rich morphology, but medieval Latin displays spelling variation, while epigraphic Latin displays both spelling variation and a high number of abbreviations. The Latin data is much noisier than the Old French data, as it was less curated than the digital editions provided for Old French. It is part of the Perseus corpus (Crane et al. [2019]). The training, evaluation and test corpora contain prose works by Cicero and Suetonius. The out-of-domain corpus comes from Martial's Epigrammata, books 1 and 2, which should be fairly different from the test corpus in word order, vocabulary, etc. Both corpora were generated without noise and without word copying, with a maximum sample size of 150 characters.

• Input : DnFlClIuliani
• Output : D n Fl Cl Iuliani
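Scores of this kind can be illustrated with a simple per-character accuracy over gold and predicted masks; this is a minimal illustrative metric, not necessarily the exact scorer behind the reported figures:

    def mask_accuracy(gold: str, pred: str) -> float:
        """Per-character accuracy between a gold and a predicted mask string."""
        assert len(gold) == len(pred), "masks must align character by character"
        return sum(g == p for g, p in zip(gold, pred)) / len(gold)

    # mask_accuracy("xSxxxS", "xSxxxx")  -> 0.8333...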

Discussion
As opposed to being a graphical challenge, word segmentation in OCR from manuscripts can actually be treated as an NLP task. Word segmentation of some texts can be difficult even for humanists, as shown by the manuscript sample, and as such it seems that post-processing OCR or HTR output with tools like this one can enhance the data mining of raw datasets.
The negligible effects of the different normalization methods (lower-casing; ASCII reduction; both) were surprising. The presence of certain MUFI characters might provide enough information about segmentation, and they might occur in sufficient quantity, for normalization not to impact the network weights.
While the baseline performed unexpectedly well on the test corpora, the CNN model definitely performed better on the out-of-domain corpora. In this context, the proposed model deals better with unknown corpora than classical n-gram approaches do. In light of the high accuracy of the CNN model on the different corpora, the model should perform equally well no matter which Medieval Western European language it is applied to.

Conclusion
Achieving 0.99 accuracy on word segmentation with a corpus as large as 25,000 test samples seems to be the first step towards a more thorough data mining of OCRed manuscripts. Given the results, studying the importance of normalization and lower-casing should be the next step, as it will probably show greater influence on smaller corpora.

Acknowledgements
Boudams has been made possible by two open-source repositories from which I learned and copied bits of the implementation of certain modules, and without which none of this paper would have been possible: Manjavacas et al. [2019] and Trevett [2019]. This tool was originally intended for post-processing OCR for the presentation of Camps et al. [2019a] at DH2019 in Utrecht. The software was developed using PyTorch (Paszke et al. [2019]), NumPy (Oliphant [2006-]) and scikit-learn (Pedregosa et al. [2011]).