C. Annemieke Romein ; Tobias Hodel ; Femke Gordijn ; Joris J. van Zundert ; Alix Chagué et al. - Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Reusing Ground Truth Data, Referencing Models, and Acknowledging Contributions. Starting the Conversation on How We Could Get It Done

jdmdh:10403 - Journal of Data Mining & Digital Humanities, 18 March 2024, Documents historiques et reconnaissance automatique de texte - https://doi.org/10.46298/jdmdh.10403

Authors: Romein, C. Annemieke (1); Hodel, Tobias (2); Gordijn, Femke (3); Zundert, Joris J. van (3); Chagué, Alix (4); Lange, Milan van (5); Jensen, Helle Strandgaard (6); Stauder, Andy (7); Purcell, Jake (8); Terras, Melissa M. (9); Heuvel, Pauline van den (10); Keijzer, Carlijn (5); Rabus, Achim (11); Sitaram, Chantal (12); Bhatia, Aakriti (12); Depuydt, Katrien (13); Afolabi-Adeolu, Mary Aderonke (14); Anikina, Anastasiia (15); Bastianello, Elisa (16); Benzinger, Lukas Vincent (17); Bosse, Arno (18); Brown, David (19); Charlton, Ash (20); Dannevig, André Nilsson (21); Gelder, Klaas van (22); Go, Sabine C.P.J. (17); Goh, Marcus J.C. (17); Gstrein, Silvia (23); Hasan, Sewa (17); Heide, Stefan von der (24); Hindermann, Maximilian (25); Huff, Dorothee (26); Huysman, Ineke (3); Idris, Ali (17); Keijzer, Liesbeth (27); Kemper, Simon (27); Koenders, Sanne (17); Kuijpers, Erika (17); Rønsig Larsen, Lisette (28); Lepa, Sven (29); Link, Tommy O. (17); Nispen, Annelies van (5); Nockels, Joe (20); Noort, Laura M. van (17); Oosterhuis, Joost Johannes (30); Popken, Vivien (31); Estrella Puertollano, María (17); Puusaag, Joosep J. (17); Sheta, Ahmed (32); Stoop, Lex (33); Strutzenbladh, Ebba (34); Sijs, Nicoline van der (13); Spek, Jan Paul van der (33); Trouw, Barry Benaissa (33); Van Synghel, Geertrui (3); Vučković, Vladimir (17); Wilbrink, Heleen (35); Weiss, Sonia (7); Wrisley, David Joseph (36); Zweistra, Riet (33)

  • 1 Huygens Institute for the History and Culture of the Netherlands; Vrije Universiteit Amsterdam
  • 2 University of Bern
  • 3 Huygens Institute for the History and Culture of the Netherlands
  • 4 ALMAnaCH, Inria, Paris; Université de Montréal
  • 5 NIOD Institute for War, Holocaust, and Genocide Studies
  • 6 Aarhus Universitet / Aarhus University
  • 7 READ-COOP SCE
  • 8 American Historical Association
  • 9 University of Edinburgh
  • 10 Amsterdam City Archives
  • 11 Albert-Ludwigs-Universität: Freiburg im Breisgau
  • 12 KNAW Humanities Cluster Amsterdam
  • 13 Instituut voor de Nederlandse Taal
  • 14 Bonn Center for Dependency and Slavery Studies at the University of Bonn
  • 15 Universiteit van Amsterdam
  • 16 Bibliotheca Hertziana – Max Planck Institute for Art History
  • 17 Vrije Universiteit Amsterdam
  • 18 KNAW Humanities Cluster, Amsterdam
  • 19 Trinity College Dublin
  • 20 University of Edinburgh; National Library of Scotland
  • 21 National Archives of Norway
  • 22 Vrije Universiteit Brussel; State Archives Brussels
  • 23 University of Innsbruck; State Library of Tyrol
  • 24 CCS Content Conversion Specialists GmbH
  • 25 University of Basel
  • 26 University Library of Tübingen
  • 27 Dutch National Archives
  • 28 Danish National Archives
  • 29 Rahvusarhiiv Estonia
  • 30 University of Amsterdam
  • 31 Research Centre for Hanse and Baltic History (FGHO)
  • 32 Friedrich Alexander Universität Erlangen-Nürnberg
  • 33 independent citizen scientist
  • 34 University of Aberdeen
  • 35 Utrechts Archief
  • 36 NYU Abu Dhabi

This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition (HTR) infrastructures, as well as ways to reference and acknowledge contributions to the creation and enrichment of data within these systems. We discuss how one can place Ground Truth data in a repository and subsequently inform others through HTR-United. We also suggest appropriate citation methods for Automatic Text Recognition (ATR) data, models, and contributions made by volunteers. In addition, when using digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. These topics all relate to the proper acknowledgement of the labour put into digitising, transcribing, and sharing Ground Truth HTR data. They also point to broader issues surrounding the use of machine learning in archival and library contexts, and to how the community should begin to acknowledge and record both contributions and data provenance.


Volume: Documents historiques et reconnaissance automatique de texte
Published on: 18 March 2024
Accepted on: 8 December 2023
Submitted on: 30 November 2022
Keywords: Automatic Text Recognition, Handwritten Text Recognition, Data Publication, Open Data, Data Provenance, Data Curation, Ground Truth, Sharing

Files

Name: Exploring_Data_Provenance_11-3.pdf
md5: 9cabce0ad814a7add62b6616d4e559b2
Size: 3.83 MB

Publications

Reference
Romein, C. A., Hodel, T., Gordijn, F., Zundert, J. J. van, Chagué, A., Lange, M. van, Jensen, H. S., Stauder, A., Purcell, J., Terras, M. M., Heuvel, P. van den, Keijzer, C., Rabus, A., Sitaram, C., Bhatia, A., Depuydt, K., Afolabi-Adeolu, M. A., Anikina, A., Bastianello, E., … Zweistra, R. (2024). Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Reusing Ground Truth Data, Referencing Models, and Acknowledging Contributions. Starting the Conversation on How We Could Get It Done. Zenodo. 10.5281/ZENODO.10804745
