Jean-Baptiste Camps ; Simon Gabay ; Paul Fièvre ; Thibault Clérice ; Florian Cafiero - Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

jdmdh:6485 - Journal of Data Mining & Digital Humanities, 14 February 2021, 2021 - https://doi.org/10.46298/jdmdh.6485
Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

Authors: Jean-Baptiste Camps; Simon Gabay; Paul Fièvre; Thibault Clérice; Florian Cafiero

    This paper describes the process of building an annotated corpus and training models for classical French literature, with a focus on theatre, and particularly comedies in verse. It was originally developed as a preliminary step to the stylometric analyses presented in Cafiero and Camps [2019]. The use of a recent lemmatiser based on neural networks and of a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test, and proves robust in out-of-domain tests, i.e. on texts up to 20th-century novels.


    Volume: 2021
    Section: Digital humanities in languages
    Published on: 14 February 2021
    Accepted on: 14 February 2021
    Submitted on: 18 May 2020
    Keywords: Computer Science - Computation and Language
