Enrique Manjavacas ; Lauren Fonteyn - Adapting vs. Pre-training Language Models for Historical Languages

jdmdh:9152 - Journal of Data Mining & Digital Humanities, June 13, 2022, NLP4DH - https://doi.org/10.46298/jdmdh.9152

Authors: Enrique Manjavacas; Lauren Fonteyn

As large language models such as BERT are becoming increasingly popular in Digital Humanities (DH), the question has arisen as to how such models can be made suitable for application to specific textual domains, including that of 'historical text'. Large language models like BERT can be pre-trained from scratch on a specific textual domain and achieve strong performance on a series of downstream tasks. However, this is a costly endeavour, in terms of both the computational resources and the substantial amounts of training data it requires. An appealing alternative, then, is to employ existing 'general purpose' models (pre-trained on present-day language) and subsequently adapt them to a specific domain by further pre-training. Focusing on the domain of historical text in English, this paper demonstrates that pre-training on domain-specific (i.e. historical) data from scratch yields a generally stronger background model than adapting a present-day language model. We show this on the basis of a variety of downstream tasks, ranging from established tasks such as Part-of-Speech tagging, Named Entity Recognition and Word Sense Disambiguation, to ad-hoc tasks like Sentence Periodization, which are specifically designed to test historically relevant processing.


Volume: NLP4DH
Section: Digital humanities in languages
Published on: June 13, 2022
Accepted on: April 5, 2022
Submitted on: March 1, 2022
Keywords: [INFO.INFO-CL] Computer Science [cs] / Computation and Language [cs.CL]
