M. Zakaria Kurdi.
The goal of this work is to build a classifier that can identify text complexity within the context of teaching reading to English as a Second Language (ESL) learners. To present language learners with texts that are suitable to their level of English, a set of features that can describe the phonological, morphological, lexical, syntactic, discursive, and psychological complexity of a given text was identified. Using a corpus of 6171 texts, which had already been classified into four different levels of difficulty by ESL experts, different experiments were conducted with five machine learning algorithms. The results showed that the adopted linguistic features provide a good overall classification performance (F-Score = 0.97). A scalability evaluation was conducted to test whether such a classifier could be used within real applications, where it could, for example, be plugged into a search engine or a web-scraping module. In this evaluation, the texts in the test set are not only different from those in the training set but also of a different type (ESL texts vs. children's reading texts). Although the overall performance of the classifier decreased significantly (F-Score = 0.65), the confusion matrix shows that most of the classification errors occur between classes two and three (the middle-level classes) and that the system is robust in categorizing texts of classes one and four. This behavior can be explained by the difference in classification criteria between […]
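As an aside, the following sketch illustrates the general shape of such a feature-based pipeline. It is not the author's system: the two surface features and the random-forest model stand in for the paper's much richer phonological-to-psychological feature set and its five algorithms.

```python
# Illustrative sketch only: two toy surface features stand in for the paper's
# phonological, morphological, lexical, syntactic, discursive and psychological
# features; the model and metric choices are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def toy_features(text):
    """Stand-in feature vector: mean sentence length and type-token ratio."""
    sentences = [s for s in text.split('.') if s.strip()]
    tokens = text.lower().split()
    mean_sentence_length = len(tokens) / max(len(sentences), 1)
    type_token_ratio = len(set(tokens)) / max(len(tokens), 1)
    return [mean_sentence_length, type_token_ratio]

def macro_f1(texts, levels):
    """texts: list of str; levels: list of int difficulty labels (e.g. 1-4)."""
    X = np.array([toy_features(t) for t in texts])
    y = np.array(levels)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    # Macro-averaged F-score over 5 folds, mirroring the F-score reported above.
    return cross_val_score(clf, X, y, cv=5, scoring='f1_macro').mean()
```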
Chaya Liebeskind ; Shmuel Liebeskind.
In this study, we address the interesting task of classifying historical texts by their assumed period of writing. This task is useful in digital humanities studies, where many texts have unidentified publication dates. For years, the typical approach to temporal text classification was supervised, using machine-learning algorithms. These algorithms require careful feature engineering and considerable domain expertise to design a feature extractor that transforms the raw text into a feature vector from which the classifier could learn to classify any unseen valid input. Recently, deep learning has produced extremely promising results for various tasks in natural language processing (NLP). The primary advantage of deep learning is that the feature layers are not designed by human engineers; rather, the features are extrapolated from the data with a general-purpose learning procedure. We investigated deep learning models for period classification of historical texts. We compared three common models: paragraph vectors, convolutional neural networks (CNN) and recurrent neural networks (RNN), as well as conventional machine-learning methods. We demonstrate that the CNN and RNN models outperformed the paragraph vector model and the conventional supervised machine-learning algorithms. In addition, we constructed word embeddings for each time period and analyzed semantic changes of word meanings over time.
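For orientation, a minimal CNN text classifier of the kind compared here might look as follows; the vocabulary size, layer sizes and number of period classes are assumptions, not values taken from the study.

```python
# Minimal sketch of a CNN period classifier; hyperparameters are assumptions.
import torch
import torch.nn as nn

class PeriodCNN(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=100, num_periods=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, 128, kernel_size=5, padding=2)
        self.fc = nn.Linear(128, num_periods)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))
        x = x.max(dim=2).values                      # global max pooling over time
        return self.fc(x)                            # unnormalised period scores
```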
Thibault Clérice.
Tokenization of modern and old Western European languages seems to be fairly simple, as it relies mostly on the presence of markers such as spaces and punctuation. However, when dealing with older sources such as manuscripts written in scripta continua, ancient epigraphy, or medieval manuscripts, (1) such markers are mostly absent and (2) spelling variation and rich morphology make dictionary-based approaches difficult. We show that applying a convolutional encoding of characters followed by a linear classification of each position as word boundary or word-internal is effective at tokenizing such inputs. Additionally, the software is released with a simple interface for tokenizing a corpus or generating a training set.
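A minimal sketch of the character-level idea, assuming a per-character binary labelling into word-boundary vs. word-internal; the layer sizes are illustrative and not those of the released tool.

```python
# Sketch: convolve over characters, label every position as boundary or not.
import torch
import torch.nn as nn

class BoundaryTagger(nn.Module):
    def __init__(self, n_chars=128, embed_dim=32):
        super().__init__()
        self.embed = nn.Embedding(n_chars, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=7, padding=3)
        self.classify = nn.Linear(64, 2)             # 0 = word-internal, 1 = boundary

    def forward(self, char_ids):                     # (batch, seq_len)
        x = self.embed(char_ids).transpose(1, 2)     # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))
        return self.classify(x.transpose(1, 2))      # per-character logits
```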
Section:
Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities
Barbara McGillivray ; Gard Jenset ; Dominik Heil.
Open-ended survey data constitute an important basis in research as well as for making business decisions. Collecting and manually analysing free-text survey data is generally more costly than collecting and analysing survey data consisting of answers to multiple-choice questions. Yet free-text data allow for new content to be expressed beyond predefined categories and are a very valuable source of new insights into people's opinions. At the same time, surveys always make ontological assumptions about the nature of the entities that are researched, and this has vital ethical consequences. Human interpretations and opinions can only be properly ascertained in their richness using textual data sources; if these sources are analysed appropriately, the essential linguistic nature of humans and social entities is safeguarded. Natural Language Processing (NLP) offers possibilities for meeting this ethical business challenge by automating the analysis of natural language and thus allowing for insightful investigations of human judgements. We present a computational pipeline for analysing large amounts of responses to open-ended questions in surveys and extracting keywords that appropriately represent people's opinions. This pipeline addresses the need to perform such tasks outside the scope of both commercial software and bespoke analysis, exceeds the performance of state-of-the-art systems, and performs this task in a transparent way that allows for scrutinising and exposing […]
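By way of illustration only, a bare-bones keyword-extraction step over free-text responses could rank candidate terms by corpus-level TF-IDF salience, as sketched below; the authors' actual pipeline is more elaborate and is not reproduced here.

```python
# Hypothetical minimal keyword extraction over survey responses (not the
# authors' pipeline): rank unigrams and bigrams by aggregate TF-IDF weight.
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(responses, n=10):
    """responses: list of free-text survey answers."""
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    tfidf = vectorizer.fit_transform(responses)
    scores = tfidf.sum(axis=0).A1                    # aggregate salience per term
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]
```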
Section:
Project
Oumayma Bounou ; Tom Monnier ; Ilaria Pastrolin ; Xi SHEN ; Christine Benevent ; Marie-Françoise Limon-Bonnet ; François Bougard ; Mathieu Aubry ; Marc H. Smith ; Olivier Poncet et al.
The study of watermarks is a key step for archivists and historians as it enables them to reveal the origin of paper. Although highly practical, automatic watermark recognition comes with many difficulties and is still considered an unsolved challenge. Nonetheless, Shen et al. [2019] recently introduced a new approach to this specific task which showed promising results. Building upon this approach, this work proposes a new public web application dedicated to automatic watermark recognition, entitled Filigranes pour tous. The application not only hosts a detailed catalog of more than 17k watermarks manually collected from the French National Archives (Minutier central) or extracted from existing online resources (Briquet database), but it also enables non-specialists to identify a watermark from a simple photograph in a few seconds. Moreover, additional watermarks can easily be added by users, making it possible to enrich the existing catalog through crowdsourcing. Our web application is available at https://filigranes.inria.fr/.
Section:
Data deluge: which skills for which data?
Régis Schlagdenhauffen.
This article uses the Transkribus software to report on a "user experiment" in a French-speaking context. It is based on a semi-automated transcription project using the diary of the jurist Eugène Wilhelm (1866-1951). This diary presents two main challenges. The first is related to the time span covered by the writing process (66 years), which leads to variations in the form of the writing, which becomes increasingly "unreadable" over time. The second challenge is related to the concomitant use of two alphabets: Roman for everyday text and Greek for private matters. After presenting the project and the specificities related to the use of the tool, the experiment presented in this contribution is structured around two aspects. Firstly, I will summarise the main obstacles encountered and the solutions devised to overcome them. Secondly, I will come back to the collaborative transcription experiment carried out with students in the classroom, presenting the difficulties observed and the solutions found. In conclusion, I will propose an assessment of the use of this Handwritten Text Recognition software in a French-speaking context and in a teaching situation.
Section:
Digital humanities in languages
Caroline T. Schroeder ; Amir Zeldes.
Scholarship on underresourced languages brings with it a variety of challenges which make access to the full spectrum of source materials and their evaluation difficult. For Coptic in particular, large-scale analyses and any kind of quantitative work become difficult due to the fragmentation of manuscripts, the highly fusional nature of an incorporational morphology, and the complications of dealing with influences from Hellenistic-era Greek, among other concerns. Many of these challenges, however, can be addressed using Digital Humanities tools and standards. In this paper, we outline some of the latest developments in Coptic Scriptorium, a DH project dedicated to bringing Coptic resources online in uniform, machine-readable, and openly available formats. Collaborative web-based tools create online 'virtual departments' in which scholars dispersed sparsely across the globe can collaborate, and natural language processing tools counterbalance the scarcity of trained editors by enabling machine processing of Coptic text to produce searchable, annotated corpora.
Robert Alessi.
ekdosis is a LuaLaTeX package written by R. Alessi and designed for multilingual critical editions. It can be used to typeset texts and different layers of critical notes in any direction accepted by LuaTeX. Texts can be arranged in running paragraphs or on facing pages, in any number of columns, which in turn can be synchronized or not. Database-driven encoding under LaTeX allows the extraction of texts entered segment by segment according to various criteria: main edited text, variant readings, translations, or annotated borrowings between texts. In addition to printed texts, ekdosis can convert .tex source files so as to produce TEI XML-compliant critical editions. It will be published under the terms of the GNU General Public License (GPL) version 3.
Section:
Visualisation of intertextuality and text reuse
Wanjiku Nganga ; Ikechukwu Achebe.
The preservation of languages is critical to maintaining and strengthening the cultures and identities of communities, and this is especially true for under-resourced languages with a predominantly oral culture. Most African languages have a relatively short literary past, and as such the task of dictionary making cannot rely on textual corpora as has been the standard practice in lexicography. This paper emphasizes the significance of the spoken word and the oral tradition as repositories of vocabulary, and argues that for such languages the value of spoken-word corpora for lexicography greatly outweighs that of printed texts. We describe a methodology for creating a digital dialectal dictionary for the Igbo language from such a spoken-word corpus. We also highlight the language technology tools and resources that have been created to support the transcription of thousands of hours of Igbo speech and the subsequent compilation of these transcriptions into an XML-encoded textual corpus of Igbo dialects. The methodology described in this paper can serve as a blueprint for other under-resourced languages with predominantly oral cultures.
Section:
Digital humanities in languages
Jani Marjanen ; Jussi Kurunmäki ; Lidia Pivovarova ; Elaine Zosa.
Words with the suffix -ism are reductionist terms that help us navigate complex social issues by providing a simple one-word label for them. On the one hand, they are often associated with political ideologies; on the other, they are present in many other domains of language, especially culture, science, and religion. This has not always been the case. This paper studies isms in a historical record of digitized newspapers published in Finland from 1820 to 1917 in order to find out how the language of isms developed historically. We use diachronic word embeddings and affinity propagation clustering to trace how new isms entered the lexicon and how they relate to one another over time. We are able to show how they became more common and entered more and more domains. Still, the uses of isms as traditions for political action and thinking stand out in our analysis.
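A compact sketch of the two-step method named above (diachronic embeddings per time slice, then affinity propagation over the -ism vocabulary); the corpus slicing, preprocessing and parameter choices are assumptions, not the paper's settings.

```python
# Sketch: one embedding space per period, affinity propagation over -ism words.
from gensim.models import Word2Vec
from sklearn.cluster import AffinityPropagation

def cluster_isms(sentences_by_period):
    """sentences_by_period: dict mapping a period label to tokenised sentences."""
    clusters = {}
    for period, sentences in sentences_by_period.items():
        model = Word2Vec(sentences, vector_size=100, min_count=5, workers=4)
        isms = [w for w in model.wv.index_to_key if w.endswith('ism')]
        if len(isms) < 2:
            continue
        labels = AffinityPropagation(random_state=0).fit_predict(
            [model.wv[w] for w in isms])
        clusters[period] = dict(zip(isms, labels))
    return clusters
```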
Eva Andersen ; Maria Biryukov ; Roman Kalyakin ; Lars Wieneke.
Historians are confronted with an overabundance of sources and require new perspectives and tools to make use of large-scale corpora. Based on a use case from the history of psychiatry, this paper describes the work of an interdisciplinary team to tackle these challenges by combining different NLP tools with new visual interfaces that foster exploration of the corpus. The paper highlights several research challenges in the preparation and processing of the corpus and sketches new insights for historical research that were gained through the use of these tools.
Section:
HistoInformatics
Christin Beck ; Miriam Butt.
In this paper we present a case study in which Visual Analytics methods for interactive data exploration are applied to the study of historical linguistics. We discuss why diachronic linguistic data poses special challenges for Visual Analytics and show how these are handled in a collaboratively developed web-based tool: HistoBankVis. HistoBankVis allows immediate and efficient interaction with the underlying diachronic data, and we go through an investigation of the interplay between case marking and word order in Icelandic and Old Saxon to illustrate its features. We then discuss the challenges posed by the lack of annotation standardization across different corpora, as well as the problems we encountered with respect to errors, uncertainty and issues of data provenance. Overall, we conclude that the integration of Visual Analytics methodology into the study of language change has immense potential, but that the full realization of this potential will depend on whether issues of data interoperability and annotation standards can be resolved.
Benjamin Molineaux ; Bettelou Los ; Martti Mäkinen.
The advent of ever-larger and more diverse historical corpora for different historical periods and linguistic varieties has made it impossible to obtain simple, direct, and yet balanced representations of the core patterns in the data. In order to draw insights from heterogeneous and complex materials of this type, historical linguists have begun to reach for a growing number of data visualisation techniques, from the statistical to the cartographical, the network-based, and beyond. An exploration of the state of this art was the objective of a workshop at the 2018 International Conference on English Historical Linguistics, from which most of the materials of this Special Issue are drawn. This brief introductory paper outlines the background and relevance of this line of methodological research and presents a summary of the individual papers that make up the collection.
Thijs Lubbers ; Bettelou Los.
This paper offers a data-driven analysis of the development of English prose styles in a single genre (instructive writing) dealing with a single topic (the correct way of feeding a horse) in 13 texts with publication dates ranging from 1565 to 2009. The texts are subjected to three investigations that offer visualizations of the findings: (i) a correspondence analysis of POS-tag trigrams; (ii) an association plot analysis; (iii) hierarchical clustering (dendrograms). As the period selected (Early Modern English to Present-Day English) does not involve any major changes in English syntax, we expect to find developments that are predominantly stylistic.
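As a rough illustration of investigation (iii), texts can be clustered hierarchically on their POS-tag trigram profiles and drawn as a dendrogram; the tagging step and the thirteen texts themselves are of course not included here, and Ward linkage is an assumption rather than the paper's choice.

```python
# Sketch: hierarchical clustering of texts by POS-tag trigram frequencies.
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

def trigram_profile(pos_tags, vocab):
    """pos_tags: POS tags of one text; vocab: ordered list of trigram types."""
    counts = Counter(zip(pos_tags, pos_tags[1:], pos_tags[2:]))
    total = max(sum(counts.values()), 1)
    return [counts[t] / total for t in vocab]

def plot_dendrogram(tagged_texts, labels):
    vocab = sorted({tri for tags in tagged_texts
                    for tri in zip(tags, tags[1:], tags[2:])})
    X = np.array([trigram_profile(tags, vocab) for tags in tagged_texts])
    dendrogram(linkage(X, method='ward'), labels=labels)
    plt.show()
```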
Christian Hessle ; John Kirk.
This article addresses the issue of variation in the lexicon, specifically the hyponymy (or synonymy) among onomasiological responses for the same concept or referent, and asks how the range of responses from a national elicitation in Scotland seeking 'local' words should be judged. How do responses offered as 'local' square with their geographical distribution on the one hand, and with their status as 'Scots' or 'English', or as 'dialect' or 'standard', on the other? How are 'dialect' and 'standard' responses offered as 'local' responses by the same individual to be considered? Is the issue a straightforward dialect-standard binary opposition, or is there a third value between the two? Does that third value encompass a middle ground between dialect and standard, or include both? How is the absence of responses to be regarded? For elucidation of such linguistic issues, the article invokes the mathematical principle of the excluded middle. This study shows that it is possible and necessary to establish a theoretical framework for the digitisation of a historical data collection. The data for these reflections come from the lexical material in The Linguistic Atlas of Scotland ([Mather and Speitel, 1975]; [1977]), which is currently being digitised at the University of Vienna. This study presents three pilot studies from the North Mid Scots area: the atlas concepts of 'ankle', […]
Benjamin Molineaux ; Warren Maguire ; Vasilios Karaiskos ; Rhona Alcorn ; Joanna Kopaczyk ; Bettelou Los.
Alphabetic spelling systems rarely display perfectly consistent one-to-one relationships between graphic marks and speech sounds. This is particularly true for languages without a standard written form. Nevertheless, such non-standard spelling systems are far from being anarchic, as they take on a conventional structure resulting from shared communities and histories of practice. Elucidating said structure can be a substantial challenge for researchers presented with textual evidence alone, since attested variation may represent differences in sound structure as well as differences in the graphophonological mapping itself. In order to tease apart these factors, we present a tool, Medusa, that allows users to create visual representations of the relationship between sounds and spellings (sound substitution sets and spelling substitution sets). Our case study for the tool deals with a longstanding issue in the historical record of mediaeval Scots, where word-final <cht>, <ch>, <tht> and <th> appear to be interchangeable, despite representing reflexes of distinct pre-Scots sounds: [x], [xt] and [θ]. Focusing on the documentary record in the Linguistic Atlas of Older Scots ([LAOS, 2013]), our exploration surveys key graphemic categories, mapping their lexical distributions and taking us through evidence from etymology, phonological typology, palaeography and historical orthography. The result is a novel reconstruction of the underlying sound values for each […]
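For concreteness, the kind of mapping the tool visualises can be sketched as two inverse dictionaries built from aligned (sound, spelling) attestations; the alignment data themselves come from the corpus and are assumed here, and this is not the tool's own code.

```python
# Sketch: build spelling substitution sets per sound and sound substitution
# sets per spelling from aligned attestations (hypothetical input format).
from collections import defaultdict

def substitution_sets(pairs):
    """pairs: iterable of (sound, spelling) attestations, e.g. ('xt', 'cht')."""
    spellings_per_sound = defaultdict(set)
    sounds_per_spelling = defaultdict(set)
    for sound, spelling in pairs:
        spellings_per_sound[sound].add(spelling)
        sounds_per_spelling[spelling].add(sound)
    return spellings_per_sound, sounds_per_spelling
```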
Julia Schlüter ; Fabian Vetter.
Using the re-emergence of the /h/ onset from Early Modern to Present-Day English as a case study, we illustrate the making and the functions of a purpose-built web application named (an:a)lyzer for the interactive visualization of the raw n-gram data provided by Google Books Ngrams (GBN). The database has been compiled from the full text of over 4.5 million books in English, totalling over 468 billion words and covering roughly five centuries. We focus on bigrams consisting of words beginning with graphic <h> preceded by the indefinite article allomorphs a and an, which serve as a diagnostic of the consonantal strength of the initial /h/. The sheer size of this database allows us to attain a maximal diachronic resolution, to distinguish highly specific groups of <h>-initial lexical items, and even to trace the diffusion of the observed changes across individual lexical units. The functions programmed into the app enable us to explore the data interactively by filtering, selecting and viewing them according to various parameters that were manually annotated into the data frame. We also discuss limitations of the database, of the app and of the explorative data analysis. The app is publicly accessible online at https://osf.io/ht8se/.
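The diagnostic itself reduces to a simple ratio. The sketch below computes the share of an (vs. a) before a given <h>-initial word per decade, assuming the relevant bigram counts have already been extracted from the Google Books Ngrams files into (word, year, article, count) rows; it is not the app's code.

```python
# Hypothetical helper: an/(a+an) ratio per decade for one <h>-initial word,
# from pre-extracted bigram counts (word, year, article, count).
from collections import defaultdict

def an_ratio_by_decade(rows, word):
    totals = defaultdict(lambda: {'a': 0, 'an': 0})
    for h_word, year, article, count in rows:
        if h_word == word and article in ('a', 'an'):
            totals[year // 10 * 10][article] += count
    return {decade: c['an'] / (c['a'] + c['an'])
            for decade, c in sorted(totals.items())
            if c['a'] + c['an'] > 0}
```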
Martti Mäkinen.
Automated approaches to identifying the authorship of a text have become commonplace in stylometric studies. The current article applies an unsupervised stylometric approach to Middle English documents using the script Stylo in R, in an attempt to distinguish between texts from different dialectal areas. The approach is based on the distribution of character 3-grams generated from the texts of the corpus of Middle English Local Documents (MELD). The article adopts a middle ground in the study of Middle English spelling variation, between the concept of relational linguistic space and the real linguistic continuum of medieval England. Stylo can distinguish between Middle English dialects by using the less frequent character 3-grams.
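The article works with the Stylo package in R; purely as a language-neutral illustration of its first step, each text can be turned into a relative-frequency profile of character 3-grams that any clustering routine can consume.

```python
# Python analogue (not Stylo itself) of the character 3-gram profiling step.
from collections import Counter

def char_trigram_profile(text, vocab):
    """text: raw document string; vocab: ordered list of 3-gram types."""
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = max(sum(grams.values()), 1)
    return [grams[g] / total for g in vocab]
```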
Hermann Moisl.
Discovery of the chronological or geographical distribution of collections of historical text can be more reliable when based on multivariate rather than univariate data, because multivariate data provide a more complete description. Where the data are high-dimensional, however, their complexity can defy analysis using traditional philological methods. The first step in dealing with such data is to visualize them using graphical methods in order to identify any latent structure. If found, such structure facilitates the formulation of hypotheses which can be tested using a range of mathematical and statistical methods. Where, however, the dimensionality is greater than 3, direct graphical investigation is impossible. The present discussion offers a roadmap for overcoming this obstacle and is in three main parts: the first presents some fundamental data concepts, the second describes an example corpus and a high-dimensional data set derived from it, and the third outlines two approaches to visualizing that data set: dimensionality reduction and cluster analysis.
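A minimal sketch of the two routes outlined above, assuming a document-by-feature matrix X: project the rows to two dimensions for plotting, and cluster them to expose latent structure. PCA and k-means are stand-ins for the broader families of dimensionality-reduction and clustering methods discussed.

```python
# Sketch: dimensionality reduction for plotting plus cluster analysis.
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def reduce_and_cluster(X, n_clusters=3):
    coords = PCA(n_components=2).fit_transform(X)                     # 2-D coordinates
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)  # cluster ids
    return coords, labels
```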