Visualisations in Historical Linguistics

A collection of papers from the Angus McIntosh Centre (AMC) Workshop on Visualisations in Historical Linguistics, organised by Rhona Alcorn, Benjamin Molineaux and Bettelou Los during the 20th International Conference on English Historical Linguistics (ICEHL XX).

Edited by:
Benjamin Molineaux, Bettelou Los and Martti Mäkinen


In 2018, in order not to compete with ICEHL XX, the Angus McIntosh Centre did not run its biennial Historical Linguistics Symposium as a free-standing event. In its place, the AMC organised a specialised workshop within the ICEHL programme: The AMC Workshop on Visualisations in Historical Linguistics.

The event brought together speakers presenting new methods and tools that help historical linguists gain insights through visual representations of data.

This special issue of the JDMDH originates in the papers presented at that workshop, together with one additional invited contribution.



1. Introduction (Benjamin Molineaux, Bettelou Los and Martti Mäkinen)

The advent of ever-larger and more diverse corpora for different historical periods and linguistic varieties has made it increasingly difficult to obtain simple, direct, and yet balanced visual representations of the core patterns in the data. In order to draw insights from heterogeneous and complex materials of this type, historical linguists have begun to reach for a growing number of data visualisation techniques, from the statistical to the cartographical, the network-based and beyond. An exploration of the state of this art was the objective of a workshop at the 2018 International Conference on English Historical Linguistics, from which most of the materials of this special issue are drawn. This brief introductory paper outlines the background and relevance of this line of methodological research and presents a summary of the individual papers that make up the collection.

2. How to visualize high-dimensional data: a roadmap (Hermann Moisl)

Discovery of the chronological or geographical distribution of collections of historical text can be more reliable when based on multivariate rather than univariate data, because multivariate data provide a more complete description. Where the data are high-dimensional, however, their complexity can defy analysis using traditional philological methods. The first step in dealing with such data is to visualize them using graphical methods in order to identify any latent structure. If found, such structure facilitates the formulation of hypotheses which can be tested using a range of mathematical and statistical methods. Where the dimensionality is greater than 3, however, direct graphical investigation is impossible. The present discussion offers a roadmap for overcoming this obstacle, in three main parts: the first presents some fundamental data concepts, the second describes an example corpus and a high-dimensional data set derived from it, and the third outlines two approaches to visualization of that data set: dimensionality reduction and cluster analysis.
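As a concrete illustration of the first of the two approaches named above, the following is a minimal Python sketch of dimensionality reduction via principal component analysis, projecting high-dimensional document vectors onto two plottable dimensions. The data, function names and corpus size here are invented for illustration and are not drawn from the paper.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X (documents x features) onto the first two
    principal components, i.e. the two directions of greatest variance."""
    Xc = X - X.mean(axis=0)              # centre each feature column
    # SVD of the centred matrix yields the principal axes directly
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                 # 2-D coordinates for plotting

# Hypothetical data: 6 documents described by 50 word-frequency counts
rng = np.random.default_rng(0)
docs = rng.poisson(5.0, size=(6, 50)).astype(float)
coords = pca_2d(docs)
print(coords.shape)  # (6, 2) -- ready for a scatter plot
```

A scatter plot of `coords` is then inspected for latent structure (clusters, gradients), which in turn suggests hypotheses for formal testing.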

3. Digitising Collections of Historical Linguistic Data: The Example of The Linguistic Atlas of Scotland (Christian Hessle and John Kirk)

This article addresses the issue of variation in the lexicon – specifically the hyponymy (or synonymy) among onomasiological responses for the same concept or referent – and how the range of responses from a national elicitation in Scotland seeking ‘local’ words should be judged. How do responses being offered as ‘local’ square with their geographical distribution on the one hand, and their status as ‘Scots’ or ‘English’, or as ‘dialect’ or ‘standard’ on the other? How are ‘dialect’ or ‘standard’ responses offered as ‘local’ responses from the same individual to be considered? Is the issue that of a straightforward dialect-standard binary opposition, or is there a third value between the two? Does that third value encompass a middle ground between dialect and standard, or include both? How is the absence of responses to be regarded? For elucidation of such linguistic issues, the article invokes the logical principle of the excluded middle. This study shows that it is possible and necessary to establish a theoretical framework for the digitisation of a historical data collection.

The data for these reflections come from the lexical material in The Linguistic Atlas of Scotland ([Mather and Speitel, 1975]; [1977]), which is currently being digitised at the University of Vienna. This study presents three pilot studies from the North Mid Scots area: the atlas concepts of ‘ankle’, ‘splinter’ and ‘the youngest of a brood or litter’. The original data are re-analysed in terms of lexical types or ‘lexemes’, and the results are presented in newly generated digital dot maps, produced separately for respondent age and gender. In the process, topological issues (such as those pertaining to the data) and topographical issues (such as those pertaining to geography and the physical terrain) are addressed.

4. Visual analytics for historical linguistics: opportunities and challenges (Christin Beck and Miriam Butt)

In this paper we present a case study in which Visual Analytics methods for interactive data exploration are applied to the study of historical linguistics. We discuss why diachronic linguistic data pose special challenges for Visual Analytics and show how these are handled in a collaboratively developed web-based tool: HistoBankVis. HistoBankVis allows immediate and efficient interaction with the underlying diachronic data, and we walk through an investigation of the interplay between case marking and word order in Icelandic and Old Saxon to illustrate its features. We then discuss challenges posed by the lack of annotation standardization across different corpora, as well as the problems we encountered with respect to errors, uncertainty and issues of data provenance. Overall we conclude that the integration of Visual Analytics methodology into the study of language change has immense potential, but that the full realization of that potential will depend on whether issues of data interoperability and annotation standards can be resolved.

5. Visualizing the development of prose styles in Horse Manuals from Early Modern English to Present-Day English (Thijs Lubbers and Bettelou Los)

This paper offers a data-driven analysis of the development of English prose styles in a single genre (instructive writing) dealing with a single topic (the correct way of feeding a horse) in 13 texts with publication dates ranging from 1565 to 2009. The texts are subjected to three investigations that offer visualizations of the findings: (i) a correspondence analysis of POS-tag trigrams; (ii) an association plot analysis; (iii) hierarchical clustering (dendrograms). As the period selected – Early Modern English to Present-Day English – does not involve any major changes in English syntax, we expect to find developments that are predominantly stylistic.
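The POS-tag trigram profiles underlying analyses of this kind can be sketched in a few lines of Python. The tag sequences below are invented for illustration (a Penn-style tagset is assumed); the paper's own pipeline and similarity measures may differ.

```python
from collections import Counter

def tag_trigrams(tags):
    """Count overlapping POS-tag trigrams in a tagged text."""
    return Counter(zip(tags, tags[1:], tags[2:]))

# Hypothetical tag sequences for two short texts
text_a = ["DT", "NN", "VBZ", "DT", "NN", "IN", "DT", "NN"]
text_b = ["DT", "JJ", "NN", "VBZ", "IN", "DT", "NN"]

prof_a, prof_b = tag_trigrams(text_a), tag_trigrams(text_b)

def cosine(p, q):
    """Cosine similarity between two trigram profiles; pairwise
    (dis)similarities of this kind feed hierarchical clustering."""
    keys = set(p) | set(q)
    dot = sum(p[k] * q[k] for k in keys)
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5
    return dot / (norm(p) * norm(q))

print(round(cosine(prof_a, prof_a), 2))  # 1.0 -- identical profiles
```

A matrix of such pairwise similarities, converted to distances, is the usual input to the agglomerative clustering that produces a dendrogram.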

6. Stylo visualisations of Middle English documents (Martti Mäkinen)

Automated approaches to identifying the authorship of a text have become commonplace in stylometric studies. The current article applies an unsupervised stylometric approach to Middle English documents using the Stylo package in R, in an attempt to distinguish between texts from different dialectal areas. The approach is based on the distribution of character 3-grams generated from the texts of the corpus of Middle English Local Documents (MELD). The article adopts the middle ground in the study of Middle English spelling variation, between the concept of relational linguistic space and the real linguistic continuum of medieval England. Stylo can distinguish between Middle English dialects using the less frequent character 3-grams.
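The two ingredients of this approach, character 3-gram profiles and a focus on the less frequent 3-grams, can be sketched as follows. This is a hedged Python illustration, not the Stylo implementation; the snippet of text and the cut-off of five most-frequent 3-grams are invented for the example.

```python
from collections import Counter

def char_3grams(text):
    """Count overlapping character 3-grams, with padding spaces so that
    word-initial and word-final spellings are captured too."""
    t = f" {text.strip()} "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def drop_most_frequent(profile, k):
    """Keep only 3-grams outside the k most frequent ones; the rarer
    n-grams tend to carry the dialect-specific spelling habits."""
    common = {g for g, _ in profile.most_common(k)}
    return Counter({g: n for g, n in profile.items() if g not in common})

# Hypothetical Middle English snippet (illustrative, not from MELD)
profile = char_3grams("the forsaide endenture")
rare = drop_most_frequent(profile, 5)
```

Profiles of this kind, one per document, are then compared with a stylometric distance measure to group documents by dialectal area.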

7. An interactive visualization of Google Books Ngrams with R and Shiny: Exploring a(n) historical increase in onset strength in a(n) huge database (Julia Schlüter and Fabian Vetter)

Using the re-emergence of the /h/ onset from Early Modern to Present-Day English as a case study, we illustrate the making and the functions of a purpose-built web application named (an:a)‑lyzer for the interactive visualization of the raw n-gram data provided by Google Books Ngrams (GBN). The database has been compiled from the full text of over 4.5 million books in English, totalling over 468 billion words and covering roughly five centuries. We focus on bigrams consisting of words beginning with graphic <h> preceded by the indefinite article allomorphs a and an, which serve as a diagnostic of the consonantal strength of the initial /h/. The sheer size of this database makes it possible to attain maximal diachronic resolution, to distinguish highly specific groups of <h>-initial lexical items, and even to trace the diffusion of the observed changes across individual lexical units. The functions programmed into the app enable us to explore the data interactively by filtering, selecting and viewing them according to various parameters that were manually annotated into the data frame. We also discuss limitations of the database and of the explorative data analysis.
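The a/an diagnostic described above can be sketched in miniature: count how often each article allomorph precedes an <h>-initial word, and take the share of an as an index of onset weakness. The function name, regular expressions and sample sentence below are illustrative assumptions; the app itself works on pre-extracted GBN bigram frequencies, not running text.

```python
import re

def an_ratio(text, prefix="h"):
    """Share of 'an' among indefinite-article tokens preceding words
    beginning with the given letter: a rough diagnostic of whether the
    onset is treated as consonantal (favouring 'a') or not."""
    a_hits = len(re.findall(rf"\ba {prefix}\w+", text, re.IGNORECASE))
    an_hits = len(re.findall(rf"\ban {prefix}\w+", text, re.IGNORECASE))
    total = a_hits + an_hits
    return an_hits / total if total else None

# Hypothetical sentence (not GBN data)
sample = "It was an historical moment, yet a happy one; an hotel stood near a hill."
print(an_ratio(sample))  # 0.5 -- two 'an h...' out of four article+<h> bigrams
```

Tracking this ratio per year and per lexical group is what yields the diachronic curves the app visualizes.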

8. Visualising pre-standard spelling practice: The interchange of <ch(t)> and <th(t)> in Older Scots (Benjamin Molineaux, Warren Maguire, Vasilios Karaiskos, Rhona Alcorn, Joanna Kopaczyk and Bettelou Los)

Alphabetic spelling systems rarely display perfectly consistent one-to-one relationships between graphic marks and speech sounds. This is particularly true for languages without a standard written form. Nevertheless, such non-standard spelling systems are far from being anarchic, as they take on a conventional structure resulting from shared communities and histories of practice. Elucidating said structure can be a substantial challenge for researchers presented with textual evidence alone, since attested variation may represent differences in sound structure as well as differences in the grapho-phonological mapping itself. In order to tease apart these factors, we present a tool — Medusa — that allows users to create visual representations of the relationship between sounds and spellings (sound substitution sets and spelling substitution sets). Our case study for the tool deals with a longstanding issue in the historical record of mediaeval Scots, where word-final <cht>, <ch>, <tht> and <th> appear to be interchangeable, despite representing reflexes of distinct pre-Scots sounds: [x], [xt] and [θ]. Focusing on the documentary record in the Linguistic Atlas of Older Scots (LAOS 2013), our exploration surveys key graphemic categories, mapping their lexical distributions and taking us through evidence from etymology, phonological typology, palaeography and historical orthography. The result is a novel reconstruction of the underlying sound values for each of the target items in the record, alongside a series of sound and spelling changes that account for the data.
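The two set types the abstract names can be built from (lexeme, spelling, sound) records by simple inverse groupings: spelling substitution sets collect the spellings that interchange for one sound, and sound substitution sets collect the sounds that share one spelling. The records below are invented, loosely modelled on the word-final <cht>/<tht> problem; they are not LAOS data and this is not the Medusa implementation.

```python
from collections import defaultdict

# Hypothetical (lexeme, attested spelling, reconstructed sound) records
records = [
    ("nicht", "cht", "xt"),   # 'night'
    ("nicht", "tht", "xt"),
    ("loch",  "ch",  "x"),
    ("loch",  "cht", "x"),
    ("worth", "th",  "θ"),
    ("worth", "tht", "θ"),
]

# Spelling substitution sets: which spellings interchange for one sound
spellings_for_sound = defaultdict(set)
# Sound substitution sets: which sounds share one spelling
sounds_for_spelling = defaultdict(set)

for lexeme, spelling, sound in records:
    spellings_for_sound[sound].add(spelling)
    sounds_for_spelling[spelling].add(sound)

print(sorted(spellings_for_sound["xt"]))   # ['cht', 'tht']
print(sorted(sounds_for_spelling["cht"]))  # ['x', 'xt']
```

Visualizing these sets side by side is what lets ambiguous spellings (here <cht> covering both [x] and [xt]) be separated from genuine sound mergers.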


Citations: [Surname, Forename. 2020. Special Issue on Visualisations in Historical Linguistics, Journal of Data Mining and Digital Humanities, p. 1-X.]