Adapting vs. Pre-training Language Models for Historical Languages

Enrique Manjavacas; Lauren Fonteyn

doi:10.46298/jdmdh.9152

jdmdh:9152 - Journal of Data Mining & Digital Humanities, June 13, 2022, NLP4DH - https://doi.org/10.46298/jdmdh.9152

Authors: Enrique Manjavacas (1, 2, 3); Lauren Fonteyn (1, 2, 3)

1 Leiden University
2 Universiteit Leiden
3 Universiteit Leiden = Leiden University

As large language models such as BERT become increasingly popular in Digital Humanities (DH), the question has arisen of how such models can be made suitable for specific textual domains, including that of 'historical text'. Large language models like BERT can be pre-trained from scratch on a specific textual domain and achieve strong performance on a series of downstream tasks. However, this is a costly endeavour, both in the computational resources and in the substantial amounts of training data it requires. An appealing alternative, then, is to take existing 'general purpose' models (pre-trained on present-day language) and adapt them to a specific domain by further pre-training. Focusing on the domain of historical text in English, this paper demonstrates that pre-training on domain-specific (i.e. historical) data from scratch yields a generally stronger background model than adapting a present-day language model. We show this on the basis of a variety of downstream tasks, ranging from established tasks such as Part-of-Speech tagging, Named Entity Recognition and Word Sense Disambiguation, to ad-hoc tasks like Sentence Periodization, which are specifically designed to test historically relevant processing.

Source: HAL:hal-03592137v3

Volume: NLP4DH

Section: Digital humanities in languages

Published on: June 13, 2022

Accepted on: April 5, 2022

Submitted on: March 1, 2022

Keywords: [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
