# Collecting, Preserving, and Disseminating Endangered Cultural Heritage for New Understandings through Multilingual Approaches

Special Issue of the Journal of Data Mining and Digital Humanities (ISSN: 2416-5999)

## Special Issue Guest Editors

Amel Fraisse
Univ. Lille, EA 4073 - GERiiCO - Groupement d’Etudes et de Recherche Interdisciplinaire en Information et Communication, F-59000 Lille, France.
Interests: library and information science, cultural and linguistic diversity, cultural heritage, multilingualism, digital humanities.

Ronald Jenn
Univ. Lille, EA 4074 - CECILLE - Centre d’Études en Civilisations Langues et Lettres Etrangères, F-59000 Lille, France.
Interests: translation studies, translated texts, digital humanities, Mark Twain.

Shelley Fisher Fishkin
Stanford University, English Department.
Interests: transnational American studies, literature, translation studies, Mark Twain.

## A Foreword to the Special Issue

In an increasingly globalized world, preserving knowledge diversity and cultural heritage is more important than ever. Multilingualism and multiculturalism are central to that effort. The rapid growth of digitally and publicly available knowledge resources raises the challenge of how to effectively preserve, analyze, understand, and disseminate that knowledge diversity. Indeed, over time, the gap between the languages of dominant nations or cultures and other languages has been growing. This Special Issue features a selection of papers addressing the challenges of collecting, preserving, and disseminating endangered knowledge and cultural heritage by presenting some of the latest research based on new computational approaches. Some of these papers are revised and reviewed versions of presentations given at the “Digital Humanities to Preserve Knowledge and Cultural Heritage” workshop (Stanford, April 15, 2019); others were submitted in response to the subsequent open call for papers.

The Special Issue presents original and high-quality papers, some technical and others survey in nature, addressing both theoretical and practical aspects, with emphasis on ethical and social implications. This Special Issue not only showcases promising research but also contributes to raising awareness about the importance of sustaining linguistic and cultural diversity, an endeavor in which Digital Humanities can be instrumental.

Documentation Studies pioneer Paul Otlet envisioned a universal compilation of knowledge, and the technology to make it globally available, as he explained in numerous essays on the collection and organization of the world's knowledge (Otlet, 1934). Recent advances in information technologies have greatly improved access to knowledge; however, the language barrier remains a key issue that knowledge and information systems still have to address (Hudon, 1997, 1998; Barát, 2008). Although information systems include knowledge encoded in vulnerable languages, their use and exploration need to be further developed.

Today, we witness the advent of new collections of digital documents openly available in a broad range of fields and languages. National, international, and academic institutions such as the Library of Congress [1] and Europeana [2] have developed ROARs (Registries of Open Access Repositories) and made millions of cultural items available in more than 470 languages, thereby providing research materials that can be used as input for studies concerned with sustaining endangered languages and cultures. The process can work both ways because, as demonstrated in “Individual vs. Collaborative methods of Crowdsourced Transcription”, the way material is presented in ROARs can itself be improved, in this specific case through annotations generated by crowdsourcing. Moving from a closed, discontinuous, and out-of-context model to open, continuous, and in-context models of knowledge creation and organization is an approach whose effectiveness has been demonstrated by wiki platforms such as Wikipedia. It relies upon a collaborative approach that promotes the right of all people to use information systems in their mother tongue, as advocated by UNESCO’s Information for All Programme (IFAP). The collaborative approach consists in giving up the idea of perfect and complete knowledge and publishing partial knowledge of variable quality, which is then improved incrementally as the information system is used, in an ongoing and continuous process. This new orientation permits the incremental augmentation of both quality and quantity. The best-known example is the Wikipedia community, in which knowledge is added and improved continuously by contributors. In the same line of thought, “A Collaborative Ecosystem for Digital Coptic Studies” demonstrates how an online collaborative platform in a specific domain can bring together scholars to produce searchable and annotated corpora.

“ekdosis: Using LuaLaTeX for Producing TEI xml Compliant Critical Editions and Highlighting Parallel Writings” presents a major technical improvement for editors of multilingual texts, as it enables the parallel typesetting, within the same document, of different languages and alphabets in any direction. “Spoken word corpus and dictionary definition for an African language” presents a major breakthrough for preserving predominantly oral cultures and ushering them into the digital age: its methodology transcribes hours of oral speech into an XML-encoded textual corpus.

[1] https://www.loc.gov

# Individual vs. Collaborative methods of Crowdsourced Transcription

## by Samantha Blickhan¹, Coleman Krawczyk², Daniel Hanson³, Amy Boyer¹, Andrea Simenstad³, and Victoria Van Hyning⁴

### ⁴ Library of Congress, USA

Abstract: While online crowdsourced text transcription projects have proliferated in the last decade, there is a need within the broader field to understand differences in project outcomes as they relate to task design, as well as to experiment with different models of online crowdsourced transcription that have not yet been explored. The experiment discussed in this paper involves the evaluation of newly-built tools on the Zooniverse.org crowdsourcing platform, attempting to answer the research question: "Does the current Zooniverse methodology of multiple independent transcribers and aggregation of results render higher-quality outcomes than allowing volunteers to see previous transcriptions and/or markings by other users? How does each methodology impact the quality and depth of analysis and participation?" To answer these questions, the Zooniverse team ran an A/B experiment on the project Anti-Slavery Manuscripts at the Boston Public Library. This paper will share results of this study, and also describe the process of designing the experiment and the metrics used to evaluate each transcription method. These include the comparison of aggregate transcription results with ground truth data; evaluation of annotation methods; the time it took for volunteers to complete transcribing each dataset; and the level of engagement with other project elements such as posting on the message board or reading supporting documentation. Particular focus will be given to the (at times) competing goals of data quality, efficiency, volunteer engagement, and user retention, all of which are of high importance for projects that focus on data from galleries, libraries, archives and museums. Ultimately, this paper aims to provide a model for impactful, intentional design and study of online crowdsourcing transcription methods, as well as shed light on the associations between project design, methodology and outcomes.

# A Collaborative Ecosystem for Digital Coptic Studies

## by Caroline Schroeder¹ and Amir Zeldes²

### ² Georgetown University, United States of America

Abstract: Scholarship on under-resourced languages brings with it a variety of challenges which make access to the full spectrum of source materials and their evaluation difficult. For Coptic in particular, large-scale analyses and any kind of quantitative work become difficult due to the fragmentation of manuscripts, the highly fusional nature of an incorporational morphology, and the complications of dealing with influences from Hellenistic-era Greek, among other concerns. Many of these challenges, however, can be addressed using Digital Humanities tools and standards. In this paper, we outline some of the latest developments in Coptic Scriptorium, a DH project dedicated to bringing Coptic resources online in uniform, machine-readable, and openly available formats. Collaborative web-based tools create online 'virtual departments' in which scholars dispersed sparsely across the globe can collaborate, and natural language processing tools counterbalance the scarcity of trained editors by enabling machine processing of Coptic text to produce searchable, annotated corpora.

# ekdosis: Using LuaLaTeX for Producing TEI xml Compliant Critical Editions and Highlighting Parallel Writings

## by Robert Alessi¹

### ¹ OM - Orient & Méditerranée : textes, Archéologie, Histoire

Abstract: ekdosis is a LuaLaTeX package written by R. Alessi designed for multilingual critical editions. It can be used to typeset texts and different layers of critical notes in any direction accepted by LuaTeX. Texts can be arranged in running paragraphs or on facing pages, in any number of columns which in turn can be synchronized or not. Database-driven encoding under LaTeX allows extraction of texts entered segment by segment according to various criteria: main edited text, variant readings, translations or annotated borrowings between texts. In addition to printed texts, ekdosis can convert .tex source files so as to produce TEI xml compliant critical editions. It will be published under the terms of the GNU General Public License (GPL) version 3.

# Spoken word corpus and dictionary definition for an African language

## by Wanjuku Nganga¹ and Ikechukwu Achebe²

### ² Igbo Archival Dictionary Project, Nnamdi Azikiwe University, Nigeria

Abstract: The preservation of languages is critical to maintaining and strengthening the cultures and identities of communities, and this is especially true for under-resourced languages with a predominantly oral culture. Most African languages have a relatively short literary past, and as such the task of dictionary making cannot rely on textual corpora as has been the standard practice in lexicography. This paper emphasizes the significance of the spoken word and the oral tradition as repositories of vocabulary, and argues that spoken word corpora greatly outweigh the value of printed texts for lexicography. We describe a methodology for creating a digital dialectal dictionary for the Igbo language from such a spoken word corpus. We also highlight the language technology tools and resources that have been created to support the transcription of thousands of hours of Igbo speech and the subsequent compilation of these transcriptions into an XML-encoded textual corpus of Igbo dialects. The methodology described in this paper can serve as a blueprint that can be adopted for other under-resourced languages that have predominantly oral cultures.

# Linguistic Fingerprints on Translation's Lens

## by J.D. Porter¹, Yulia Ilchuk², and Quinn Dombrowski²

### ² Stanford University, USA

Abstract: What happens to the language fingerprints of a work when it is translated into another language? While translation studies has often prioritized concepts of equivalence (of form and function), and of textual function, digital humanities methodologies can provide a new analytical lens onto ways that stylistic traces of a text’s source language can persist in a translated text.

This paper presents initial findings of a project undertaken by the Stanford Literary Lab, which has identified distinctive grammatical features in short stories that have been translated into English. While the phenomenon of “translationese” has been well established, particularly in corpus translation studies, we argue that digital humanities methods can be valuable for identifying specific traits for a vision of a world atlas of literary style.

## Call for papers

As Digital Humanities continues to gain momentum, the field is intersecting with an ever-widening range of disciplines, including Natural Language Processing, Library and Information Science, History, Literature, and Translation Studies, to name only a few. The growth of these fields within DH enables us to break new scientific ground. For example, the existing reservoir of public domain multilingual texts, once tracked and digitized, provides a new wealth of resources to sustain knowledge diversity, preserve our cultural heritage, and help us map the global circulation and reception of knowledge.

In the wake of recent research in this domain, the Journal of Data Mining and Digital Humanities will publish a special issue, "Collecting, Preserving and Disseminating Endangered Cultural Heritage for New Understandings Through Multilingual Approaches", featuring a selection of papers that present recent research on collecting, preserving, and disseminating endangered knowledge and cultural heritage.

We welcome submissions including but not limited to the following topics:

• Knowledge circulation and organization in a transnational context
• Digital Humanities, crowdsourcing and digital libraries
• Digital Humanities and the circulation of translated texts
• Natural Language Processing to preserve knowledge diversity and cultural heritage
• Collecting and aligning translated texts in and for under-resourced languages
• Multilingual corpora and their circulation
• Open data, open access and data preservation
• Collaboration and computing for endangered knowledge
• Ethics and data privacy issues in a global context

## Manuscript Submission Information

To download the journal template, go to the journal website JDMDH, click on "About the Journal", then "Submissions".

Submission is a two-step process: you first submit your paper to an open access repository (arXiv, HAL), which will provide you with a document identifier. You then go to the journal website and click on "Submit an article". You will be asked to select the repository you have chosen before you type in your document identifier.

All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website.

## Bibliography

Adler, Melissa A., Joseph T. Tennis, Daniel Martínez-Ávila, José Augusto Chaves Guimarães, Jens-Erik Mai, Ole Olesen-Bagneux, and Laura Skouvig. 2016. “Global/local Knowledge Organization: Contexts and Questions”. In: Proceedings of the Association for Information Science and Technology 53(1):1-4.

Arppe, Antti, Jordan Lachler, Trond Trosterud, Lene Antonsen, and Sjur N. Moshagen. 2016. “Basic language resource kits for endangered languages: A case study of Plains Cree”. In: Proceedings of the 2nd Workshop on Collaboration and Computing for Under-Resourced Languages: 1–8.

Barát, Ágnes H. 2008. “Knowledge Organization in the Cross-Cultural and Multicultural Society”. In: Advances in Knowledge Organization 11, Proceedings of the Tenth International ISKO Conference: 91–97.

Baron, Robert. 2012. “‘All Power to the Periphery’: The Public Folklore Thought of Alan Lomax”. In: Journal of Folklore Research, Vol. 49, No. 3. Indiana University Press: 275-317. Stable URL: https://www.jstor.org/stable/10.2979/jfolkrese.49.3.275

Beghtol, Clare. 1986. “Semantic validity: concepts of warrant in bibliographic classification systems”, Library Resources and Technical Services, Vol. 30 No. 2:109‐25.

Beghtol, Clare. 2001. “Relationships in classificatory structure and meaning”, in Bean, C.A. and Green, R. (Eds), Relationships in the Organization of Knowledge, Kluwer, Dordrecht: 99‐113.

Beghtol, Clare. 2002. “Universal Concepts, Cultural Warrant and Cultural Hospitality”. In: Challenges in Knowledge Representation and Organization for the 21st Century: Integration of Knowledge Across Boundaries, Proceedings of the Seventh International ISKO Conference: 45-49.

Beghtol, Clare. 2005. “Ethical Decision-Making for Knowledge Representation and Organization Systems for Global Use.” Journal of the American Society for Information Science and Technology 56 (9):903–12.

Dahlberg, Ingetraut. 1992. “Ethics and Knowledge Organization: In Memory of Dr. S.R. Ranganathan in His Centenary Year.” International Classification 19 (1):1–2.

Eveleigh, Alexandra. 2014. “Crowding Out the Archivist? Locating Crowdsourcing within the Broader Landscape of Participatory Archives”. In: Crowdsourcing our Cultural Heritage, ed. Mia Ridge: 211-229.

Fishkin, Shelley Fisher. 2011. “Deep Maps: A Brief for Digital Palimpsest Mapping Projects (DPMPs, or “Deep Maps”)”. In: Journal of Transnational American Studies, 3(2). URL: https://escholarship.org/uc/item/92v100t0

Fraisse, Amel. 2010. “Localisation interne et en contexte des logiciels commerciaux et libres”. Ph.D. thesis, Université de Grenoble, France. URL: https://tel.archives-ouvertes.fr/tel-00995093

Fraisse, Amel, Boitet, Christian, Blanchon, Hervé, Bellynck, Valérie. 2009. “A solution for in context and collaborative localization of most commercial and free software”. In: Proceedings of the 4th Language and technologies Conference, vol 1/1:536-540, Poznan, Poland.

Fraisse, Amel, Zheng Zhang, Alex Zhai, Ronald Jenn, Shelley Fisher Fishkin, Pierre Zweigenbaum, Laurence Favier, Widad Mustafa El Hadi. 2019. “A Sustainable and Open Access Knowledge Organization Model to Preserve Cultural Heritage and Language Diversity”. Information, 10(10), 303.

Harvey, Todd, Andrew Peart and Nathan Salsburg. 2017. “Alan Lomax and the "Grass Roots" Idea”. In: Chicago Review, Vol. 60/61, No. 4/1: 37-45, Stable URL: https://www.jstor.org/stable/44820515

Hudon, Michèle. 1997. “Multilingual Thesaurus Construction-Integrating the Views of Different Cultures in One Gateway to Knowledge and Concepts”. In: Information Services and Use 17: 11–123.

Hudon, Michèle. 1998. “Information access in a multilingual and multicultural environment”. Congrès de l'American Society of Indexers. Seattle (WA).

Krauwer, Steven. 2003. “The basic language resource kit (BLARK) as the first milestone for the language resources roadmap”. In: Proceedings of the International Workshop Speech and Computer.

López-Huertas, María. 2016. “The Integration of Culture in Knowledge Organization Systems.” In Advances in Knowledge Organization, Vol. 15: Knowledge Organization for a Sustainable World, Proceedings of the Fourteenth International ISKO Conference, Rio de Janeiro, Brazil, 13–28. International Society for Knowledge Organization.

Mustafa El Hadi, Widad. 2015. “Cultural Interoperability and Knowledge Organization Systems.” In Organização Do Conhecimento E Diversidade Cultural, Proceedings of the 3rd Brazilian ISKO-Conference, edited by José Augusto Chaves Guimarães and Vera Dodebei: 575–606. Marília, São Paulo: Fundação para o Desenvolvimento do Ensino, Pesquisa e Extensão (FUNDEPE).

Otlet, Paul. 1934. Traité de Documentation: Le livre sur le Livre: Théorie et Pratique, Mundaneum: Bruxelles, Belgium.

Ridge, Mia. (Ed.). 2014. Crowdsourcing our Cultural Heritage. Farnham: Ashgate.

Scannell, Kevin. 2007. “The Crúbadán project: Corpus building for under-resourced languages”. In: Building and Exploring Web Corpora, Proceedings of the 3rd Web as Corpus Workshop: 5–15.

Teets, Michael and Matthew Goldner. 2013. “Libraries’ Role in Curating and Exposing Big Data”. Future Internet, 5: 429–438.

Van Hyning, Victoria, Samantha Blickhan, Chris Lintott, and Laura Trouille. 2017. “Transforming Libraries and Archives through Crowdsourcing”. In: D-Lib Magazine 23(5/6).

Van Hyning, Victoria. 2019. “Harnessing Crowdsourcing for Scholarly and GLAM Purposes”. Literature Compass, 16(3-4). Available at https://doi.org/10.1111/lic3.12507.

Williams, Alex C., John F. Wallin, Haoyu Yu, Marco Perale, Hyrum D. Carroll, Anne-Francoise Lamblin, Lucy Fortson, Dirk Obbink, Chris J. Lintott, and James H. Brusuelas. 2014. “A computational pipeline for crowdsourced transcriptions of Ancient Greek papyrus fragments”. In: Proceedings of the International Conference on Big Data: 100–105.