Exploring Historical Labor Markets: Computational Approaches to Job Title Extraction

Raven Adam; Klara Venglarova; Georg Vogeler

doi:10.46298/jdmdh.15038


jdmdh:15038 - Journal of Data Mining & Digital Humanities, April 2, 2025, NLP4DH - https://doi.org/10.46298/jdmdh.15038



Historical job advertisements provide invaluable insights into the evolution of labor markets and societal dynamics. However, extracting structured information, such as job titles, from these OCRed and unstructured texts presents significant challenges. This study evaluates four distinct computational approaches for job title extraction: a dictionary-based method, a rule-based approach leveraging linguistic patterns, a Named Entity Recognition (NER) model fine-tuned on historical data, and a text generation model designed to rewrite advertisements into structured lists.

Our analysis spans multiple versions of the ANNO dataset, including raw OCR, automatically postcorrected, and human-corrected text, as well as an external dataset of German historical job advertisements. Results demonstrate that the NER approach consistently outperforms other methods, showcasing robustness to OCR errors and variability in text quality. The text generation approach performs well on high-quality data but exhibits greater sensitivity to OCR-induced noise. While the rule-based method is less effective overall, it performs relatively well for ambiguous entities. The dictionary-based approach, though limited in precision, remains stable across datasets.

This study highlights the impact of text quality on extraction performance and underscores the need for adaptable, generalizable methods. Future work should focus on integrating hybrid approaches, expanding annotated datasets, and improving OCR correction techniques to enhance the extraction of structured information from historical texts. These advancements will enable deeper exploration of labor market trends and contribute to the broader field of digital humanities.
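As a rough illustration of the dictionary-based approach named in the abstract, the following minimal sketch matches tokens against a lookup list. It is not the authors' implementation: the word list, the function name `extract_job_titles`, and the example advertisement are all hypothetical. The sketch also hints at one plausible reason such a method can lose precision, namely that occupation words in German double as surnames.

```python
import re

# Hypothetical mini-dictionary of German occupation terms (an assumption,
# not the dictionary used in the study).
OCCUPATIONS = {"koch", "köchin", "kellner"}

# Split text into word tokens; \w covers umlauts for str patterns in Python 3.
TOKEN_RE = re.compile(r"\w+")


def extract_job_titles(text, dictionary=OCCUPATIONS):
    """Return (token, start, end) for every token whose lowercase form is in the dictionary."""
    hits = []
    for match in TOKEN_RE.finditer(text):
        if match.group().lower() in dictionary:
            hits.append((match.group(), match.start(), match.end()))
    return hits


# Invented example: "Koch" is a surname here, yet exact lookup still flags it.
ad = "Herr Koch sucht eine tüchtige Köchin und einen Kellner."
titles = [token for token, _, _ in extract_job_titles(ad)]
print(titles)
```

Exact lookup of this kind cannot tell the surname "Koch" from the occupation, which is the sort of ambiguity that context-aware approaches such as rule-based patterns or NER are meant to handle.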


Source: HAL:hal-04869347v2

Volume: NLP4DH

Published on: April 2, 2025

Accepted on: February 9, 2025

Submitted on: January 8, 2025

Keywords: historical newspapers, job advertisements, job title extraction, NER, occupations dictionary, text-generation, OCR quality effect, [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], [INFO.INFO-IR] Computer Science [cs]/Information Retrieval [cs.IR], [SHS.INFO] Humanities and Social Sciences/Library and information sciences

License: Attribution 4.0 International (CC BY 4.0)
