Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent lemmatiser based on neural networks and of a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test, and proves robust in out-of-domain tests, i.e. up to 20th c. novels.


I INTRODUCTION
While many lemmatisers and POS taggers have been trained, and sometimes conceived, for French (e.g. Tellier et al. [2012], Urieli [2013]. . . ), they usually focus on contemporary French, and tools for Ancien Régime French remain scarce. One important exception is the TreeTagger [Schmid, 1995] model developed by Diwersy et al. [2017] for the Presto project [Vigier and Blumenthal, 2013-2017]. They have prepared training data with c. 60,000 tokens annotated manually, using an adapted version of MULTEX [Ide and Veronis, 1994] and GRACE [Adda et al., 1998] for the parts of speech (POS), and the Lefff [Sagot, 2010], Morphalou [Romary et al., 2004] and LGeRM [Souvay and Pierrel, 2009] for the lemmas. Unfortunately, the training data are not publicly available (yet?) 1 , they are made of non-normalised texts from the 16th to the 18th c., which prevents any comparison task, and no detailed evaluation of the model is offered by its designers.

The corpus gathers comedies in verse by Antoine Le Métel d'Ouville, Jean de Rotrou (1609-1650), Paul Scarron (1610-1660), Pierre Corneille (1606-1684), Molière (1622-1673) and Thomas Corneille (1625-1709). All the plays were written between the 1630's and the 1670's, that is to say within c. 40 years.

III BUILDING AN ANNOTATED CORPUS
The annotation scheme has been conceived to cope with diachronically spread data, especially earlier periods such as middle and Renaissance French.

Choice of authority lists
Since annotation principles are rather complex, we have decided to publish guidelines separately [Gabay et al., 2020a], and we will summarise here our most important choices.
Regarding POS tags, on top of the aforementioned MULTEX, many possibilities exist: EAGLES [Leech and Wilson, 1996], UD-POS [Petrov et al., 2011] and CATTEX. While EAGLES and UD-POS have been developed as international standards, CATTEX has been designed specifically for French medieval texts.
We have decided to choose CATTEX, because we are interested in a long diachronic perspective, and we want to maintain interoperability with several existing corpora that already use it, such as the Base de français médiéval ("Medieval French Database", cf. Guillot-Barbance et al. [2017]) or the Geste corpus. The annotation manual of CATTEX09 (Guillot et al. [2013b]) offers a detailed list of tagging rules that we strictly observed. Three options are offered to the annotator: morphological tagging, morpho-syntactical tagging 5 , or both. While adding both labels would be ideal (to study processes such as adjectivisation, substantivisation. . . ), it remains far too costly in time, and we have therefore opted for simple morpho-syntactical tagging, which appeared at the time as an interesting middle ground.
Regarding lemmatisation, we have already mentioned LGeRM, the Lefff and Morphalou. The main interest of the last one, which we have chosen, is that not only does its v3.1 include the Lefff, but it is also used by the Frantext base (the data of which is partially available online, to be used as additional material for our model) and by the Trésor de la Langue Française informatisé. The LGeRM lexicon is irrelevant in our case, since it is an artificially archaised version of Morphalou designed to match 17th and 18th c. forms. Concerning proper names, we built a specific reference list, thanks to the characters and places index provided by Fièvre [2007], which we expanded when necessary.
Some of our choices diverge from those made by the authors of Morphalou. We were, for instance, more systematic in choosing the masculine singular form as a lemma, for nouns (baronne is lemmatised as baron) but not only (la (det. def.) as le, sa or ses (poss.) as son). Concerning personal pronouns, the singular masculine subject case (when relevant) has been used as lemma: direct regimen forms (le, la, les, as in il les donne) as well as indirect regimen forms (lui, elle(s)) have been lemmatised to the subject masculine singular (il); one single pronoun can indeed be subject (il), reflexive (se), direct object (le or en), indirect object (lui or y) or disjunctive (lui). Still in line with our diachronic approach, we kept the difference between the old partitive des (contracted form of de les) and the new non-definite plural article des, and encoded the contracted form au(x) as a+le.
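As an illustration, the pronoun and determiner conventions above can be expressed as a simple normalisation table. This is a hypothetical sketch, not the authors' code: the tag prefixes follow CATTEX (PRO for pronouns, DET for determiners), while the function and table names are invented for the example.

```python
# Illustrative sketch of the lemma conventions described above.
# All names are hypothetical; tags follow the CATTEX scheme.

PRONOUN_LEMMAS = {
    # regimen forms of the 3rd-person pronoun are lemmatised to "il"
    "le": "il", "la": "il", "les": "il",
    "lui": "il", "elle": "il", "elles": "il",
}

DETERMINER_LEMMAS = {
    # feminine / possessive determiners map to the masculine singular
    "la": "le", "sa": "son", "ses": "son",
}

def normalise_lemma(form: str, pos: str) -> str:
    """Return the conventional lemma for a form, given its POS tag.

    The POS tag disambiguates homographs such as "la"
    (pronoun -> "il", determiner -> "le")."""
    form = form.lower()
    if pos.startswith("PRO"):
        return PRONOUN_LEMMAS.get(form, form)
    if pos.startswith("DET"):
        return DETERMINER_LEMMAS.get(form, form)
    return form

print(normalise_lemma("la", "PROper"))  # -> il
print(normalise_lemma("la", "DETdef"))  # -> le
```

This also shows why POS tagging is needed before lemma normalisation: the same surface form receives different lemmas depending on its tag.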
It has to be noted that, since lemmas are not numbered in Morphalou, it has not been possible to introduce number-based disambiguation for homographs (e.g., son1 (poss. "his") vs. son2 (noun "sound")). Such a situation is however only partially problematic, since it remains possible to distinguish forms thanks to the POS or the morphology.

Text preprocessing and sampling
In order to limit model biases, each play of the corpus was sampled to create training and testing data. The texts of the Fièvre [2007] editions were tokenised using the TXM [Heiden, 2010] XML import. During the import, an XSL filter was used to retain only the characters' speeches, to the exclusion of all other material (stage directions, act and scene numbering. . . ). Out of these data, a three-tier sample was constituted, with the first 2,000 tokens of each of our 41 plays as training data (hereafter train set), the 100 median tokens as validation data (hereafter dev set), and the last 100 as testing data (hereafter test set). Case was not normalised, in order to keep information relevant to the identification of proper names.
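The three-tier sampling can be sketched as follows (a hypothetical helper, not the authors' script, assuming each play has already been tokenised into a list):

```python
# Sketch of the three-tier sampling described above: for each play,
# the first 2,000 tokens go to the train set, the 100 median tokens
# to the dev set, and the last 100 tokens to the test set.

def sample_play(tokens):
    """Split one play's token list into (train, dev, test) samples."""
    train = tokens[:2000]
    mid = len(tokens) // 2
    dev = tokens[mid - 50: mid + 50]   # the 100 median tokens
    test = tokens[-100:]               # the 100 last tokens
    return train, dev, test

# Toy play of 5,000 dummy tokens.
tokens = [f"tok{i}" for i in range(5000)]
train, dev, test = sample_play(tokens)
print(len(train), len(dev), len(test))  # 2000 100 100
```

Sampling from the beginning, middle and end of every play ensures that each author and each play contributes to all three sets.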
The complete XML and annotation workflow is presented in fig. 1.

Annotation and correction process
The annotation has been done in three phases. A first Pie lemmatisation model has been trained only on the Frantext Open Access data [ATILF-CNRS and Université de Lorraine, 1998-2018], and has been used to annotate a first sample of c. 40,000 tokens, in combination with an available Pie model for POS tags trained on Old French 7 . After being corrected, the same corpus has been used to train new models and annotate c. 40,000 further tokens that were, once again, corrected to create the final training corpus.
Lemmas and POS tags have all been corrected manually. This work was facilitated by the use of Pyrrha, a web-based correction interface able to perform batch corrections as well as to handle authority lists, allowing efficient collaborative work (fig. 2). Pyrrha also keeps track of all changes made to the corpus (fig. 3), and makes it possible to import, correct, share, and download corpora and authority lists.

Expanding annotation through available resources
While POS tags have all been systematically corrected, through both linear reading and batch corrections, this is not the case for the morphology, which has mostly only been batch-corrected, for reasons of time. Thus, we can guarantee that every POS tag has been proofread at least once (and usually multiple times), which is not the case for the morphology.
Indeed, to save time, morphological information was not added manually, but was instead projected using the lexicon of inflected forms of Morphalou [ATILF-CNRS and Université de Lorraine]. A simple algorithm then looked for matching forms inside Morphalou: when the form was unambiguous, the morphological information was directly retrieved; otherwise, the hand-corrected POS was used to assess the correct morphological information to retrieve. If none was found, an unknown morph tag was added.

We also reused the annotated texts of the Frantext Open Access corpus, 32 texts of which have been used to increase the training data (see appendix B). Marginal interventions have been made to correct some systematic errors, but also to ensure consistency with our annotation choices. For instance, for pronouns (labelled as CLO, CLS and PRO in the Frantext data), the lemmas je, me, m', M', moi, Moi were mapped to je; likewise, ils, elle, elles, le, la, les, lui, leur, eux, Ils, Lui, Elle, Elles to il, etc. On the other hand, some forms of celui and cela were originally lemmatised to il, and we changed the lemmatisation to celui and cela. Similarly, forms of chacun (or aucun) were lemmatised to un, and we changed them to chacun (or aucun).

6. Since the initial publication of this article, the effects of Unicode NFKD normalisation have also been tested, with an unclear effect on training accuracy [Gabay et al., 2020b].
7. A recent version of the model for Old French can be found as part of the web application Deucalion; the models are also directly usable through Pyrrha's interface. The most up-to-date version of both the Old French and Classical French models is provided, along with functionalities to ease the tagging of new documents, as part of the pie-extended Python package [Clérice, 2020]. The models can be procured using their linguistic code (fro for Old French, fr for Classical French) by running pie-extended download fr, and texts can be annotated with pie-extended tag fr MyFile.txt. They are also available online, on the École des chartes' Deucalion instance, at https://dh.chartes.psl.eu/deucalion/.
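The projection algorithm described above can be sketched as follows. This is a hypothetical reimplementation, not the authors' script, and the lexicon here is a toy stand-in for Morphalou's lexicon of inflected forms.

```python
# Sketch of the morphology-projection algorithm: look the form up in
# a Morphalou-style lexicon; if the form is ambiguous, use the
# hand-corrected POS to pick the right entry; otherwise fall back to
# an "unknown" morph tag.

# Toy stand-in for the lexicon: form -> list of (POS, morph) entries.
LEXICON = {
    "roi": [("NOMcom", "MASC.SG")],
    "porte": [("NOMcom", "FEM.SG"), ("VERcjg", "PRES.IND.SG.3")],
}

def project_morph(form: str, pos: str) -> str:
    entries = LEXICON.get(form.lower(), [])
    if len(entries) == 1:                      # unambiguous form
        return entries[0][1]
    matching = [m for p, m in entries if p == pos]
    if matching:                               # disambiguated by POS
        return matching[0]
    return "MORPH=unknown"                     # nothing found

print(project_morph("roi", "NOMcom"))    # MASC.SG
print(project_morph("porte", "NOMcom"))  # FEM.SG
print(project_morph("xyz", "ADJqua"))    # MORPH=unknown
```

In the simplest case the form alone suffices; the hand-corrected POS is only consulted when the lexicon offers several competing analyses.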
A few minor corrections of obvious errors were also made (e.g., saurer to savoir), especially regarding homograph forms of some lemmas (e.g., between verbal forms of défaire, "undo", and the noun défaite, "defeat", or between ver, "worm", and vers, "verse"). An additional adjustment has also been made regarding the ligature œ (cœur), which has been preferred over its decomposed form (coeur).

IV TRAINING SETUP
As previously mentioned, many tools are already available. TreeTagger [Schmid, 1995] remains one of the most widely used, even though it is outperformed by other solutions. For the French language, alternatives include Talismane [Urieli, 2013], as well as the more recent Marmot tagger and the Pie lemmatiser; we have decided to use the latter two.

Lemmatisation
Concerning Pie as a lemmatiser, we tested four different configurations (table 3):
1. base (sent-lm): the default configuration, based on the configuration that achieved the best accuracy described in , using sentence context, RNN character embeddings, as well as forward and backward language models 9 ;
2. wembs: the same as the previous one, but with the adjunction of word embeddings, initialised using pretrained embeddings;
3. bert: the same as the previous one, but using CamemBERT embeddings [Martin et al., 2019], reduced from 768 to 150 dimensions;
4. cnn+wembs: a configuration using CNN character embeddings, with word embeddings pretrained with skipgram on a larger unlemmatised corpus; this configuration is the one used for Cafiero and Camps [2019], with limited additional tuning.
For each configuration, due to the stochastic nature of the process, five models were trained, using early stopping with a threshold of 0.001 and a patience of 6, and the best one was retained.
For the configurations using pretrained word embeddings, these were pretrained using the Python implementation of Word2Vec, on a large corpus of 343 theatre texts from Fièvre [2007] and those of the Frantext Open Access presented supra, for a total of c. 7M tokens.
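The skipgram objective behind these embeddings pairs each token with its neighbours inside a context window; the training itself is typically handled by a library such as gensim's Word2Vec. The sketch below (hypothetical helper, not the authors' code) illustrates the pair extraction only:

```python
# Minimal illustration of the skipgram idea: every token becomes a
# "centre" word, paired with each neighbour inside the window.

def skipgram_pairs(tokens, window=2):
    """Return (centre, context) training pairs for skipgram."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

pairs = skipgram_pairs(["le", "roi", "vient"], window=1)
print(pairs)
# [('le', 'roi'), ('roi', 'le'), ('roi', 'vient'), ('vient', 'roi')]
```

The embedding model is then trained to predict the context word from the centre word, so words sharing contexts end up with similar vectors.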

POS tagging
For POS tagging, we trained both Marmot and Pie on the training data we produced, without further augmentation. The configurations were the following:
1. Marmot: the base configuration provided with Marmot, using the dev set during training and the test set for final evaluation;
2. Pie: the same configurations as for lemmatisation (base (sent-lm), wembs and bert, see above and table 3), but with a CRF output layer to predict part-of-speech tags;
3. +aux: for each of the POS-tagging configurations, we tried to see if we could obtain a gain in accuracy by using auxiliary tasks (the tasks used can be seen in bold in table 2, with the addition of Case, mainly used for personal pronouns). In a multi-task setting, we trained linear classifiers for each morphological feature, sharing weights with the main task (POS prediction).
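The multi-task setting can be illustrated with a minimal NumPy sketch. This is a conceptual illustration with hypothetical shapes and names, not Pie's actual architecture: each task only adds its own linear layer over a shared encoder representation, so gradients from every task flow into the shared encoder.

```python
# Conceptual sketch of multi-task POS tagging with auxiliary tasks:
# one shared representation, one linear classifier per task.
import numpy as np

rng = np.random.default_rng(0)
hidden = 8            # shared encoder output size (illustrative)
n_pos, n_case = 5, 3  # number of POS tags / Case values (illustrative)

# Shared encoder output for one token (in Pie, an RNN state).
h = rng.normal(size=hidden)

# Each task adds only its own linear layer on top of the shared h.
W_pos = rng.normal(size=(n_pos, hidden))
W_case = rng.normal(size=(n_case, hidden))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pos_probs = softmax(W_pos @ h)    # main task: POS prediction
case_probs = softmax(W_case @ h)  # auxiliary task: Case prediction

# During training, both losses backpropagate into the shared encoder,
# so the auxiliary task can act as a regulariser for the main one.
print(pos_probs.shape, case_probs.shape)  # (5,) (3,)
```

The hope, borne out in the +aux experiments, is that forcing the shared representation to also encode morphological features helps POS disambiguation.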

Calibration and in-domain tests
Results of the Pie lemmatisation training are presented in table 4 (Pie lemmatisation accuracies on the test set for the best model of each configuration). "Unknown tokens" are tokens never seen during training, while "ambiguous tokens" are forms that can correspond to different lemmas. "Unknown targets" are lemmas never seen in training, but that the neural network can still sometimes accurately predict, thanks to its character-level modelling.

Out-of-domain tests
To evaluate the ability of the best models to generalise to data from other periods, we designed two out-of-domain corpora. Since we want to evaluate generality with regard to diachronic, diaphasic and diagenic (i.e. gender-based) variation, we selected samples from 16th- to 20th-century texts, either from theatre plays or from a variety of genres outside theatre, literary or practical (administrative texts, correspondence, etc.), from male and female authors, in order to have:
• 20 samples of roughly 100 tokens for each century: 10 from theatre plays, 10 from a variety of other genres;
• roughly as many tokens written by men as by women for each century;
• a comparable distribution of tokens by genre for each century.
In addition, for the samples concerning 17th-century theatre, we excluded verse comedies in general, as well as the authors from whom our training corpus was drawn. A complete list is given in appendix C.
We evaluated the best models (wembs and wembs+aux) for lemmas and for POS on the out-of-domain data. The lemmatisation model proves to be relatively robust: globally, the loss of accuracy is of roughly 1 percentage point, while it is closer to 3 percentage points for the POS model. This difference can be explained by the difference between the training corpora: the use of significant additional data to improve the efficiency of the lemmatisation model seems to be reflected in its greater capacity to generalise. The same reason could also explain why the lemmatisation models transfer better to non-dramatic texts than the POS model.
In both cases, though, the best accuracies are reached for the 17th and 18th centuries, and, surprisingly, more specifically for 18th-century data in most cases. Accuracy progressively decreases going backward or forward in time.

Improvement over previous corpus
In order to evaluate the importance of the addition of the newly annotated data (Theatre), we set up a second training experiment in which we reused the best configuration found in 4.1 (wembs) over 5 iterations, using the same training, dev and test sets, except for the files from the Theatre sample in train and dev. Using the best model, we are able to measure the effect of the introduction of the Theatre data on top of the Frantext data. On lemmatisation, the results are clear: the overall accuracy with the same configuration goes from 97.30% to 99.09%, an increase of 1.79 points. We cannot compare unknown and ambiguous tokens, or unknown targets, as the support of these categories has moved: -88 unknown tokens, +34 ambiguous tokens, -23 unknown targets. The same impact can be seen on out-of-domain data (table 8), with a more visible effect on the Theatre corpus of the 18th century (+0.96 points). Strangely enough, neither the 17th century nor the Theatre genre benefits the most from the inclusion of 17th-century Theatre data, as the Not Theatre corpus and the 18th century are the most improved.

Most frequent confusions
The most frequent confusions of the best models on the out-of-domain data are presented in table 9 (a sample from the confusion matrix for the best Pie models on the out-of-domain data).
Regarding lemmatisation, some errors relate to homographs, such as the token le (regimen pronoun (il) or determiner (le)) or the token des (plural of the determiner un or partitive de_le). Some other errors are due to abbreviations not present in the training data (M. for monsieur). More interestingly for the search for potential improvements, many errors are related to rare character classes in the training data, such as capital letters or ligatures (œ).
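Confusion counts of this kind can be derived from aligned gold and predicted label sequences with only the standard library; the helper below is a hypothetical sketch, not the authors' evaluation code, and the example labels are invented.

```python
# Sketch: count (gold, predicted) label pairs where the model erred,
# and report the most frequent confusions.
from collections import Counter

def top_confusions(gold, pred, n=3):
    """Return the n most common (gold, predicted) error pairs."""
    errors = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return errors.most_common(n)

# Toy example with CATTEX-style tags.
gold = ["NOMcom", "ADJqua", "NOMcom", "VERppe", "NOMpro"]
pred = ["NOMcom", "NOMcom", "NOMcom", "ADJqua", "NOMcom"]
print(top_confusions(gold, pred))
```

Sorting the error pairs by frequency is what surfaces systematic weaknesses, such as the nominal-tag confusions discussed below, rather than isolated mistakes.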
For POS, the most frequent confusions are in nominal rather than verbal tags: in particular, confusions between common nouns (NOMcom), proper nouns (NOMpro), adjectives (ADJqua) and nominal forms of verbs (participles, VERppe, and infinitives, VERinf). Some errors are due to the morpho-syntactic nature of our annotation, which, for instance, labels substantivised adjectives as common nouns (le beau is NOMcom).

VI DISCUSSION
The scores of up to 99% for in-domain lemmatisation and 97% for POS tagging put our best models above the expected state of the art, or in its upper range (see I). Yet, these results deserved further investigation, through out-of-domain tests and more qualitative inspection.
Out-of-domain tests show that, even though they were trained on 17th-century classical theatre, the models reach their best accuracies on 18th-century texts. Such a fact seems counter-intuitive, but a plausible explanation might be as follows: the median date of the texts of the Frantext corpus is 1872, and their mean date is 1859. By adding roughly 80k tokens of 17th-century texts to the 2.3M tokens of Frantext, we may have slightly pulled the corpus closer to the 18th century.
In any case, through the careful construction of a labelled corpus and the use of a neural lemmatiser and tagger, we were able to develop models well suited to the annotation required for the stylometric analyses shown in Cafiero and Camps [2019]. We also believe that the models are useful for the annotation of 17th-century theatre and, beyond that, of Early Modern French texts in normalised spelling, with encouraging results regarding generalisation beyond the original scope of the experiments.

VII FURTHER RESEARCH
Looking at our results, the main lead for improvements should be a more efficient way to deal with rare character classes, such as capital letters, diacritics or ligatures. Several methods could be used to reduce the number of classes (e.g., Unicode decomposed normalisation) or, alternatively, the training set could be sufficiently extended to provide enough cases.
More generally, three possible directions could be followed in the coming years. The first is to expand the training corpus to other dramatic genres (tragedy, tragi-comedy. . . ) and to other genres in general (poetry, novels, short stories. . . ). The second would be to replace normalised tokens with non-normalised ones, and therefore offer a new model that takes full advantage of the ability of tools like Pie to deal with spelling variation in historical languages, thereby strengthening the ability of the models to deal with older varieties of French. The third is to dramatically expand the training data with 18th or 16th c. texts. Results in these directions, continuing the experiments described in this paper, will be presented at DTUC'20 [Gabay et al., 2020b].

AUTHOR CONTRIBUTIONS
P.F. encoded the corpus and all its metadata. F.C., J.-B. C. and S.G. designed the research project. The preprocessing of the texts, the initial setup of the Pyrrha corpus and of the authority lists were performed by J.-B. C. Correction of the training data and expansion of the authority lists was shared equally between F.C., J.-B. C. and S.G. Post-processing of the trained corpus, injection of morphological data, and correction of the Frantext data was done by J.-B. C., as well as the training, testing of the models and their further use to annotate unseen data. T.C. programmed modifications to Pie code to include Bert and participated in the training and benchmarking of models, as well as additional debugging of the annotation tools. All authors contributed to the writing of this paper.
The authors have no competing interests to declare.

MATERIALS AND DATA AVAILABILITY
The most up-to-date version of the models can be easily obtained and used thanks to the pie-extended Python package, available on PyPI (https://pypi.org/project/pie-extended/), with the command pie-extended download fr, and can be queried at https://tal.chartes.psl.eu/deucalion/.
The initial version created for Cafiero and Camps [2019] is available from the Science Advances website since the publication of the paper, on 27th Nov. 2019, at this address: https://advances.sciencemag.org/highwire/filestream/221312/field_highwire_adjunct_files/0/aax5489_Data_file_S1.zip. The initial version of the models is available on Zenodo (doi: 10.5281/zenodo.3353421).
The version of the best models described in this paper, as well as training, validation and test data can be found on Zenodo as well, as version 3 of the same repository (doi: 10.5281/zenodo.3828644).
Licensing is specified separately for each repository.

A PLAYS SAMPLED TO CREATE THE INITIAL TRAINING CORPUS
The following plays selected from Fièvre [2007] were sampled.