Special issue conceived as a follow-up publication to the workshop on visualisations in historical linguistics, held as part of the ICEHL conference (27-31 August 2018), with additional articles resulting from a further call for papers.
This paper offers a data-driven analysis of the development of English prose styles in a single genre (instructive writing) dealing with a single topic (the correct way of feeding a horse) in 13 texts with publication dates ranging from 1565 to 2009. The texts are subjected to three investigations that offer visualizations of the findings: (i) a correspondence analysis of POS-tag trigrams; (ii) an association plot analysis; (iii) hierarchical clustering (dendrograms). As the period selected, Early Modern English to Present-Day English, does not involve any major changes in English syntax, we expect to find developments that are predominantly stylistic.
Automated approaches to identifying the authorship of a text have become commonplace in stylometric studies. The current article applies an unsupervised stylometric approach to Middle English documents using the script Stylo in R, in an attempt to distinguish between texts from different dialectal areas. The approach is based on the distribution of character 3-grams generated from the texts of the corpus of Middle English Local Documents (MELD). The article adopts the middle ground in the study of Middle English spelling variation, between the concept of relational linguistic space and the real linguistic continuum of medieval England. Stylo can distinguish between Middle English dialects by using the less frequent character 3-grams.
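As a minimal illustration of the feature type named in this abstract, the following Python sketch counts overlapping character 3-grams in a text and converts the counts to relative frequencies. It is not the Stylo implementation: the function name `char_trigram_freqs`, the normalisation step (lower-casing and collapsing whitespace), and the sample string are all assumptions made for illustration.

```python
from collections import Counter


def char_trigram_freqs(text: str) -> dict[str, float]:
    """Return relative frequencies of overlapping character 3-grams.

    Assumption: the text is lower-cased and runs of whitespace are collapsed
    to a single space; a real study would choose its own normalisation.
    """
    normalised = " ".join(text.lower().split())
    trigrams = [normalised[i:i + 3] for i in range(len(normalised) - 2)]
    counts = Counter(trigrams)
    total = sum(counts.values())
    if total == 0:
        return {}  # text shorter than three characters
    return {gram: n / total for gram, n in counts.items()}


# Invented sample string, not taken from MELD.
freqs = char_trigram_freqs("the the")
```

A frequency table of this kind, computed per document, gives one row of the document-by-feature matrix that a stylometric method would then compare across texts.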
This article addresses the issue of variation in the lexicon, specifically the hyponymy (or synonymy) among onomasiological responses for the same concept or referent, and how the range of responses from a national elicitation in Scotland seeking 'local' words should be judged. How do responses being offered as 'local' square with their geographical distribution on the one hand, and their status as 'Scots' or 'English', or as 'dialect' or 'standard' on the other? How are 'dialect' or 'standard' responses offered as 'local' responses from the same individual to be considered? Is the issue that of a straightforward dialect-standard binary opposition, or is there a third value between the two? Does that third value encompass a middle ground between dialect and standard, or include both? How is the absence of responses to be regarded? For elucidation of such linguistic issues, the article invokes the mathematical principle of the excluded middle. This study shows that it is possible and necessary to establish a theoretical framework for the digitalisation of a historical data collection. The data for these reflections come from the lexical material in The Linguistic Atlas of Scotland ([Mather and Speitel, 1975]; [1977]), which is currently being digitised at the University of Vienna. This study presents three pilot studies from the North Mid Scots area: the atlas concepts of 'ankle', […]
Discovery of the chronological or geographical distribution of collections of historical text can be more reliable when based on multivariate rather than on univariate data because multivariate data provide a more complete description. Where the data are high-dimensional, however, their complexity can defy analysis using traditional philological methods. The first step in dealing with such data is to visualize them using graphical methods in order to identify any latent structure. If found, such structure facilitates the formulation of hypotheses which can be tested using a range of mathematical and statistical methods. Where, however, the dimensionality is greater than 3, direct graphical investigation is impossible. The present discussion presents a roadmap of how this obstacle can be overcome, and is in three main parts: the first part presents some fundamental data concepts, the second describes an example corpus and a high-dimensional data set derived from it, and the third outlines two approaches to visualization of that data set: dimensionality reduction and cluster analysis.
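The second of the two approaches named in this abstract, cluster analysis, can be sketched in a few lines of Python. The example below is an assumption-laden illustration rather than the article's own procedure: it uses Euclidean distance and average linkage (one of several possible choices), and the four-dimensional rows and the labels A to D are invented. The recorded sequence of merges is the information a dendrogram would display.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(rows, labels):
    """Agglomerative clustering with average linkage.

    Returns the merges in the order performed, each as
    (labels_in_first_cluster, labels_in_second_cluster, distance).
    """
    clusters = [[i] for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Mean pairwise distance between members of the two clusters.
                pairs = [(p, q) for p in clusters[i] for q in clusters[j]]
                d = sum(euclidean(rows[p], rows[q]) for p, q in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((
            sorted(labels[p] for p in clusters[i]),
            sorted(labels[q] for q in clusters[j]),
            d,
        ))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Invented four-dimensional data: too many dimensions to plot directly.
labels = ["A", "B", "C", "D"]
rows = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
]
merges = average_linkage(rows, labels)
```

With these invented rows, A and B are merged first, then C and D, and the two pairs are joined last at a much greater distance, which is the kind of latent grouping the abstract describes looking for.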
Alphabetic spelling systems rarely display perfectly consistent one-to-one relationships between graphic marks and speech sounds. This is particularly true for languages without a standard written form. Nevertheless, such non-standard spelling systems are far from being anarchic, as they take on a conventional structure resulting from shared communities and histories of practice. Elucidating this structure can be a substantial challenge for researchers presented with textual evidence alone, since attested variation may represent differences in sound structure as well as differences in the graphophonological mapping itself. In order to tease apart these factors, we present a tool, Medusa, that allows users to create visual representations of the relationship between sounds and spellings (sound substitution sets and spelling substitution sets). Our case study for the tool deals with a longstanding issue in the historical record of mediaeval Scots, where word-final <cht>, <ch>, <tht> and <th> appear to be interchangeable, despite representing reflexes of distinct pre-Scots sounds: [x], [xt] and [θ]. Focusing on the documentary record in the Linguistic Atlas of Older Scots ([LAOS, 2013]), our exploration surveys key graphemic categories, mapping their lexical distributions and taking us through evidence from etymology, phonological typology, palaeography and historical orthography. The result is a novel reconstruction of the underlying sound values for each […]
Using the re-emergence of the /h/ onset from Early Modern to Present-Day English as a case study, we illustrate the making and the functions of a purpose-built web application named (an:a)-lyzer for the interactive visualization of the raw n-gram data provided by Google Books Ngrams (GBN). The database has been compiled from the full text of over 4.5 million books in English, totalling over 468 billion words and covering roughly five centuries. We focus on bigrams consisting of words beginning with graphic <h> preceded by the indefinite article allomorphs a and an, which serve as a diagnostic of the consonantal strength of the initial /h/. The sheer size of this database allows us to attain a maximal diachronic resolution, to distinguish highly specific groups of <h>-initial lexical items, and even to trace the diffusion of the observed changes across individual lexical units. The functions programmed into the app enable us to explore the data interactively by filtering, selecting and viewing them according to various parameters that were manually annotated into the data frame. We also discuss limitations of the database, of the app and of the explorative data analysis. The app is publicly accessible online at https://osf.io/ht8se/.
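The diagnostic described in this abstract, the choice between a and an before <h>-initial words, can be illustrated with a short Python sketch that computes the share of an per year from bigram counts. This is not the application's own code: the function name `an_share_by_year`, the tuple layout of the input, and all counts below are hypothetical and do not come from GBN.

```python
from collections import defaultdict


def an_share_by_year(bigram_counts):
    """Map each year to the share of 'an' among 'a'/'an' + word tokens.

    bigram_counts: iterable of (year, article, word, count) tuples
    (an assumed layout, chosen for this illustration).
    """
    totals = defaultdict(lambda: {"a": 0, "an": 0})
    for year, article, _word, count in bigram_counts:
        if article in ("a", "an"):
            totals[year][article] += count
    return {
        year: c["an"] / (c["a"] + c["an"])
        for year, c in sorted(totals.items())
        if c["a"] + c["an"] > 0
    }


# Invented counts for illustration only; these are not GBN figures.
sample = [
    (1700, "an", "history", 30),
    (1700, "a", "history", 10),
    (1900, "an", "history", 5),
    (1900, "a", "history", 45),
]
shares = an_share_by_year(sample)
```

A falling share of an over time for a given word or word group would be read, on the abstract's reasoning, as a sign that the initial /h/ had gained consonantal strength.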
In this paper we present a case study in which Visual Analytics methods for interactive data exploration are applied to the study of historical linguistics. We discuss why diachronic linguistic data poses special challenges for Visual Analytics and show how these are handled in a collaboratively developed web-based tool: HistoBankVis. HistoBankVis allows immediate and efficient interaction with the underlying diachronic data, and we go through an investigation of the interplay between case marking and word order in Icelandic and Old Saxon to illustrate its features. We then discuss challenges posed by the lack of annotation standardization across different corpora as well as the problems we encountered with respect to errors, uncertainty and issues of data provenance. Overall we conclude that the integration of Visual Analytics methodology into the study of language change has immense potential, but that its full realization will depend on whether issues of data interoperability and annotation standards can be resolved.
The advent of ever-larger and more diverse historical corpora for different historical periods and linguistic varieties has led to the impossibility of obtaining simple, direct, and yet balanced, representations of the core patterns in the data. In order to draw insights from heterogeneous and complex materials of this type, historical linguists have begun to reach for a growing number of data visualisation techniques, from the statistical, to the cartographical, the network-based and beyond. An exploration of the state of this art was the objective of a workshop at the 2018 International Conference on English Historical Linguistics, from which most of the materials of this Special Issue are drawn. This brief introductory paper outlines the background and relevance of this line of methodological research and presents a summary of the individual papers that make up the collection.