Special Issue of Journal of Data Mining and Digital Humanities (ISSN: 2416-5999)
Amel Fraisse
Univ. Lille, EA 4073 - GERiiCO - Groupement d’Etudes et de Recherche Interdisciplinaire en Information et Communication, F-59000 Lille, France.
Website | E-Mail
Interests: library and information science, cultural and linguistic diversity, cultural heritage, multilingualism, digital humanities.
Ronald Jenn
Univ. Lille, EA 4074 - CECILLE - Centre d’Études en Civilisations Langues et Lettres Etrangères, F-59000 Lille, France.
Website | E-Mail
Interests: translation studies, translated texts, digital humanities, Mark Twain.
Shelley Fisher Fishkin
Stanford University, English Department.
Website | E-Mail
Interests: transnational American studies, literature, translation studies, Mark Twain.
In an increasingly globalized world, preserving knowledge diversity and cultural heritage is more important than ever. Multilingualism and multiculturalism are central to that effort. The rapid growth of digitally and publicly available knowledge resources poses the challenge of how to effectively preserve, analyze, understand, and disseminate that knowledge diversity. Indeed, over time, the gap between the languages of dominant nations or cultures and other languages has been growing. This Special Issue features a selection of papers addressing the challenges of collecting, preserving, and disseminating endangered knowledge and cultural heritage by presenting some of the latest research based on new computational approaches. Some of these papers are improved and revised versions of presentations given at the “Digital Humanities to Preserve Knowledge and Cultural Heritage” workshop (Stanford, April 15, 2019), while others answered the subsequent open Call for Papers.
The Special Issue presents original and high-quality papers, some technical and others survey in nature, addressing both theoretical and practical aspects, with an emphasis on ethical and social implications. This Special Issue not only showcases promising research but also contributes to raising awareness of the importance of sustaining linguistic and cultural diversity, an endeavor in which Digital Humanities can be instrumental.
Documentation Studies pioneer Paul Otlet envisioned a universal compilation of knowledge and the technology to make it globally available, as he explained in numerous essays on the collection and organization of the world's knowledge (Otlet, 1934). Recent advances in information technologies have greatly improved access to knowledge; however, the language barrier remains a key issue that knowledge and information systems have yet to address (Hudon, 1997, 1998; Hajdu Barat, 2008). Although information systems include knowledge encoded in vulnerable languages, their use and exploration need to be further developed.
Today, we witness the advent of new collections of digital documents openly available in a broad range of fields and languages. National, international, and academic institutions such as the Library of Congress[1] and Europeana[2] have developed ROARs (Registries of Open Access Repositories) and made millions of cultural items available in more than 470 languages, thereby providing research materials that can be used as input for studies concerned with sustaining endangered languages and cultures. The process can work both ways because, as demonstrated in “Individual vs. Collaborative methods of Crowdsourced Transcription”, the way material is presented in ROARs can itself be improved, in this case through annotations generated by crowdsourcing. Moving from a closed, discontinuous, and out-of-context model to open, continuous, and in-context models of knowledge creation and organization is an approach whose effectiveness has been demonstrated by wiki platforms, Wikipedia being the best-known example, where knowledge is added and improved continuously by contributors. The underlying concept relies on a collaborative approach that promotes the right of all people to use information systems in their mother tongue, as advocated by UNESCO's Information for All Programme (IFAP). The collaborative approach consists in giving up the idea of perfect and complete knowledge and instead publishing partial knowledge of variable quality, which is improved incrementally as the information system is used, an ongoing process that augments both quality and quantity.

In the same line of thought, “A Collaborative Ecosystem for Digital Coptic Studies” demonstrates how an online collaborative platform in a specific domain can bring scholars together to produce searchable and annotated corpora. “ekdosis: Using LuaLaTeX for Producing TEI XML Compliant Critical Editions and Highlighting Parallel Writings” presents a major technical improvement for multilingual editors, as it enables the parallel typesetting, within the same document, of different languages and alphabets in any direction. In “Spoken word corpus and dictionary definition for an African language” we are presented with a major breakthrough for preserving predominantly oral cultures and ushering them into the digital age: its methodology transcribes hours of oral speech into an XML-encoded textual corpus.
[1] https://www.loc.gov
[2] https://www.europeana.eu
Abstract: While online crowdsourced text transcription projects have proliferated in the last decade, there is a need within the broader field to understand differences in project outcomes as they relate to task design, as well as to experiment with different models of online crowdsourced transcription that have not yet been explored. The experiment discussed in this paper involves the evaluation of newly-built tools on the Zooniverse.org crowdsourcing platform, attempting to answer the research question: "Does the current Zooniverse methodology of multiple independent transcribers and aggregation of results render higher-quality outcomes than allowing volunteers to see previous transcriptions and/or markings by other users? How does each methodology impact the quality and depth of analysis and participation?" To answer these questions, the Zooniverse team ran an A/B experiment on the project Anti-Slavery Manuscripts at the Boston Public Library. This paper will share results of this study, and also describe the process of designing the experiment and the metrics used to evaluate each transcription method. These include the comparison of aggregate transcription results with ground truth data; evaluation of annotation methods; the time it took for volunteers to complete transcribing each dataset; and the level of engagement with other project elements such as posting on the message board or reading supporting documentation. Particular focus will be given to the (at times) competing goals of data quality, efficiency, volunteer engagement, and user retention, all of which are of high importance for projects that focus on data from galleries, libraries, archives and museums. Ultimately, this paper aims to provide a model for impactful, intentional design and study of online crowdsourcing transcription methods, as well as shed light on the associations between project design, methodology and outcomes.
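To make one of the evaluation steps above concrete, the short Python sketch below aggregates several independent transcriptions of the same line by simple majority vote and scores the consensus against a ground-truth string. The voting scheme, function names, and sample strings are illustrative assumptions made for this editorial, not the actual Zooniverse aggregation pipeline or its data.

from collections import Counter
from difflib import SequenceMatcher

def aggregate(transcriptions):
    """Pick, word position by word position, the most common token among the independent transcriptions (a crude consensus)."""
    tokenized = [t.split() for t in transcriptions]
    length = max(len(t) for t in tokenized)
    consensus = []
    for i in range(length):
        votes = Counter(t[i] for t in tokenized if i < len(t))
        consensus.append(votes.most_common(1)[0][0])
    return " ".join(consensus)

def similarity(candidate, ground_truth):
    """Character-level similarity ratio (1.0 means identical strings)."""
    return SequenceMatcher(None, candidate, ground_truth).ratio()

if __name__ == "__main__":
    independent = [
        "Boston, 4th mo. 12th 1842",
        "Boston 4th mo. 12th, 1842",
        "Boston, 4th mo 12th 1842",
    ]
    truth = "Boston, 4th mo. 12th 1842"
    consensus = aggregate(independent)
    print(consensus, round(similarity(consensus, truth), 3))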
Abstract: Scholarship on under-resourced languages brings with it a variety of challenges that make access to the full spectrum of source materials and their evaluation difficult. For Coptic in particular, large-scale analyses and any kind of quantitative work become difficult due to the fragmentation of manuscripts, the highly fusional nature of an incorporational morphology, and the complications of dealing with influences from Hellenistic-era Greek, among other concerns. Many of these challenges, however, can be addressed using Digital Humanities tools and standards. In this paper, we outline some of the latest developments in Coptic Scriptorium, a DH project dedicated to bringing Coptic resources online in uniform, machine-readable, and openly available formats. Collaborative web-based tools create online 'virtual departments' in which scholars dispersed sparsely across the globe can collaborate, and natural language processing tools counterbalance the scarcity of trained editors by enabling machine processing of Coptic text to produce searchable, annotated corpora.
Abstract: ekdosis is a LuaLaTeX package written by R. Alessi designed for multilingual critical editions. It can be used to typeset texts and different layers of critical notes in any direction accepted by LuaTeX. Texts can be arranged in running paragraphs or on facing pages, in any number of columns which in turn can be synchronized or not. Database-driven encoding under LaTeX allows extraction of texts entered segment by segment according to various criteria: main edited text, variant readings, translations or annotated borrowings between texts. In addition to printed texts, ekdosis can convert .tex source files so as to produce TEI XML compliant critical editions. It will be published under the terms of the GNU General Public License (GPL) version 3.
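As a rough illustration of the TEI target format mentioned above, the Python sketch below builds a single critical-apparatus entry using the standard TEI elements app, lem, and rdg. It shows only the generic structure such a conversion aims at, not ekdosis's actual output; the witness sigla and readings are invented for illustration.

import xml.etree.ElementTree as ET

def apparatus_entry(lemma, readings):
    """Build a TEI <app> element holding one <lem> and one <rdg> per witness."""
    app = ET.Element("app")
    lem = ET.SubElement(app, "lem")
    lem.text = lemma
    for witness, reading in readings.items():
        rdg = ET.SubElement(app, "rdg", wit=f"#{witness}")
        rdg.text = reading
    return app

if __name__ == "__main__":
    entry = apparatus_entry("lorem", {"A": "lorem", "B": "laurem"})
    # prints: <app><lem>lorem</lem><rdg wit="#A">lorem</rdg><rdg wit="#B">laurem</rdg></app>
    print(ET.tostring(entry, encoding="unicode"))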
Abstract: The preservation of languages is critical to maintaining and strengthening the cultures and identities of communities, and this is especially true for under-resourced languages with a predominantly oral culture. Most African languages have a relatively short literary past, and as such the task of dictionary making cannot rely on textual corpora as has been the standard practice in lexicography. This paper emphasizes the significance of the spoken word and the oral tradition as repositories of vocabulary, and argues that spoken word corpora far outweigh printed texts in value for lexicography. We describe a methodology for creating a digital dialectal dictionary for the Igbo language from such a spoken word corpus. We also highlight the language technology tools and resources that have been created to support the transcription of thousands of hours of Igbo speech and the subsequent compilation of these transcriptions into an XML-encoded textual corpus of Igbo dialects. The methodology described in this paper can serve as a blueprint that can be adopted for other under-resourced languages that have predominantly oral cultures.
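The sketch below suggests, at the smallest possible scale, what storing a transcription in an XML-encoded dialect corpus can look like: each utterance carries metadata about its dialect, speaker, and source recording. The element and attribute names and the sample values are illustrative assumptions made for this editorial, not the schema used by the Igbo project.

import xml.etree.ElementTree as ET

def encode_utterance(text, dialect, speaker, audio):
    """Wrap one transcribed utterance in an element recording its provenance."""
    utterance = ET.Element("utterance", dialect=dialect, speaker=speaker, audio=audio)
    utterance.text = text
    return utterance

if __name__ == "__main__":
    corpus = ET.Element("corpus", lang="ig")
    corpus.append(encode_utterance("(transcribed speech goes here)",
                                   dialect="Owerri", speaker="spk001",
                                   audio="rec_0001.wav"))
    print(ET.tostring(corpus, encoding="unicode"))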
Abstract: What happens to the language fingerprints of a work when it is translated into another language? While translation studies has often prioritized concepts of equivalence (of form and function) and of textual function, digital humanities methodologies can provide a new analytical lens onto the ways that stylistic traces of a text’s source language can persist in a translated text.
This paper presents initial findings of a project undertaken by the Stanford Literary Lab, which has identified distinctive grammatical features in short stories that have been translated into English. While the phenomenon of “translationese” has been well established particularly in corpus translation studies, we argue that digital humanities methods can be valuable for identifying specific traits for a vision of a world atlas of literary style.
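As a simplified illustration of this kind of stylistic comparison, the Python sketch below contrasts the relative frequencies of a handful of common function words in two short English sentences. Function-word counts are only a crude stand-in for the grammatical features the paper actually studies, and the marker list and sample sentences are invented for illustration.

import re
from collections import Counter

MARKERS = ["of", "the", "which", "that", "by", "and"]

def marker_profile(text):
    """Relative frequency (per 1,000 tokens) of each marker word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {m: 1000 * counts[m] / total for m in MARKERS}

def profile_distance(a, b):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(a[m] - b[m]) for m in MARKERS)

if __name__ == "__main__":
    translated = "The house of the father of the boy stood by the river."
    original = "The boy's father's house stood beside the river."
    print(round(profile_distance(marker_profile(translated), marker_profile(original)), 1))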
As Digital Humanities continues to gain momentum, the field is intersecting with an ever-widening range of disciplines, including Natural Language Processing, Library and Information Science, History, Literature, and Translation Studies, to name only a few. The growth of these fields within DH enables us to break new scientific ground. For example, the existing reservoir of public-domain multilingual texts, once tracked and digitized, provides a new wealth of resources to sustain knowledge diversity, preserve our cultural heritage, and help us map the global circulation and reception of knowledge.
In the wake of recent research in this domain, the Journal of Data Mining and Digital Humanities will publish a special issue, "Collecting, Preserving and Disseminating Endangered Cultural Heritage for New Understandings Through Multilingual Approaches", featuring a selection of papers presenting recent research that aims at collecting, preserving, and disseminating endangered knowledge and cultural heritage for new understandings through multilingual approaches.
We welcome submissions including but not limited to the following topics:
To download the journal template, go to the journal website (JDMDH) and click on "About the Journal", then "Submissions".
It is a two-step submission process: first, submit your paper to an open access repository (arXiv, HAL), which will provide you with a document identifier. Then go to the Journal website and click on "Submit an article". You will be asked to select the repository you have chosen before typing in your document identifier.
All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website.
Fraisse, Amel, Boitet, Christian, Blanchon, Hervé, Bellynck, Valérie. 2009. “A solution for in context and collaborative localization of most commercial and free software”. In: Proceedings of the 4th Language and Technology Conference, vol. 1/1: 536-540, Poznan, Poland.
Hudon, Michèle. 1998. “Information access in a multilingual and multicultural environment”. Conference of the American Society of Indexers, Seattle (WA).
Ridge, Mia. (Ed.). 2014. Crowdsourcing our Cultural Heritage. Farnham: Ashgate.