Visual analytics for historical linguistics: opportunities and challenges

In this paper we present a case study in which Visual Analytic methods for interactive data exploration are applied to the study of historical linguistics. We discuss why diachronic linguistic data poses special challenges for Visual Analytics and show how these are handled in a collaboratively developed web-based tool: HistoBankVis. HistoBankVis allows immediate and efficient interaction with the underlying diachronic data. To illustrate its features, we go through an investigation of the interplay between case marking and word order in Icelandic and Old Saxon. We then discuss challenges posed by the lack of annotation standardization across different corpora as well as the problems we encountered with respect to errors, uncertainty and issues of data provenance. Overall we conclude that the integration of Visual Analytics methodology into the study of language change has immense potential, but that the full realization of this potential will depend on whether issues of data interoperability and annotation standards can be resolved.


I INTRODUCTION
In this paper we discuss the potential of methods from Visual Analytics [Thomas and Cook, 2005] for the study of language change by focusing on a web-based tool we have built in collaboration with colleagues from computer science: HistoBankVis [Schätzle et al., 2017, Schätzle et al., 2019] is a multilayer visualization system developed specifically for historical linguistic research. HistoBankVis allows for interactive and exploratory access to a complex data set by using several interlinked visualization and filtering techniques in combination with a structured analysis process. We highlight the effectiveness of HistoBankVis by presenting a concrete test case which investigates syntactic change in Germanic, using historical corpora annotated according to the Penn Treebank format. Our tool goes a long way towards ameliorating several current methodological challenges for historical linguistics, which are discussed in section II.
As necessary background information, we provide an introduction to Visual Analytics in section III, describe the functionalities of our HistoBankVis system in section IV and show how it works with respect to investigating an interaction between dative case and word order in Icelandic. We use the IcePaHC corpus (Icelandic Parsed Historical Corpus; Wallenberg et al. [2011]) for this purpose. Like many existing corpora, IcePaHC is annotated broadly according to the Penn Treebank format [Marcus et al., 1993]. In order to compare our results for Icelandic with other Germanic languages, we identified further suitable corpora and experimented with these, most prominently the Penn Parsed Corpora of Historical English [Kroch and Taylor, 2000, Kroch et al., 2004, 2016] and the Heliand Parsed Database (HeliPaD; Walkden [2015, 2016]).
In seeking to extend our case studies, however, we encountered several challenges. One concerns the fact that although families of linguistic corpora are annotated according to broadly similar standards, whether this be according to the Universal Dependencies format [Nivre et al., 2016] as in PROIEL (Pragmatic Resources of Indo-European; Haug and Jøhndal [2008]) or the Penn Treebank style, the annotations are in fact not fully interoperable. In addition, we were confronted with further challenges, namely annotation errors and issues pertaining to data uncertainty and data provenance. We present and discuss these challenges in section V, setting the stage for future work which seeks to mitigate such problems in processes of corpus annotation and analysis. We conclude that while the integration of methodology from Visual Analytics into historical linguistic research has great potential, this potential will only be unlocked to its full extent once issues of annotation interoperability are comprehensively dealt with. This includes developing systematic methods of dealing with inconsistencies and errors as well as annotation uncertainty and data provenance.

II METHODOLOGICAL CHALLENGES FOR HISTORICAL LINGUISTICS
Over the past two decades, a multitude of digitized text corpora has been made available for historical linguistic research. These text corpora are often enhanced with elaborate linguistic annotations, including annotations for inflectional morphology, parts-of-speech, syntactic constituents, syntactic hierarchies and/or dependencies. Prototypical annotation standards are the Penn Treebank format [Marcus et al., 1993, Santorini, 2010] and the Universal Dependencies (UD) framework [Nivre et al., 2016]. A great advantage of such corpora is that they allow for the quantitative investigation of structurally complex phenomena. However, the intricacies involved in producing high quality linguistic annotation and the difficulty of understanding highly complex interactions between various linguistic and extra-linguistic features and structures over a temporal dimension pose myriad new challenges.
Linguistically annotated corpora have usually undergone a manual annotation process in addition to an automatic preprocessing. Although the manual annotation process is time-consuming, manual annotations allow for sophisticated and high quality annotations. The annotations often reflect a deep linguistic analysis, e.g., in the form of syntactic hierarchies, dependencies between phrase structure constituents and semantic information. This kind of linguistic information is typically stored ('banked') in treebanks. Examples of historical treebanks are the Penn Parsed Corpora of Historical English [Kroch and Taylor, 2000, Kroch et al., 2004, 2016], the Icelandic Parsed Historical Corpus (IcePaHC; Wallenberg et al. [2011]), the Heliand Parsed Database (HeliPaD; Walkden [2015, 2016]), and PROIEL (Pragmatic Resources of Indo-European; Haug and Jøhndal [2008]). While the latter contains annotations in the Universal Dependencies format, the Penn Parsed Corpora of Historical English, IcePaHC, and HeliPaD are annotated in the Penn Treebank style.
Figure 1 shows a sample Penn Treebank-annotated sentence from IcePaHC with annotations for clause type, constituents, noun type, grammatical relations, case marking, verb type, lemmas, tense, and voice. The sentence in Figure 1 is a matrix declarative clause (IP-MAT) with the pronominal dative subject mér 'me' (NP-SBJ for subject NP, PRO-D for pronoun and dative case) and the main verb finna 'find', which occurs in the middle form finnst 'think, seem' in present tense (VBPI). The availability of such elaborate syntactic annotations allows for the automatic extraction and quantitative investigation of intricate linguistic patterns over time.
Statistical methods for the quantitative analysis of extracted data have become a standard part of the methodological toolkit in historical linguistics. Typical examples include the calculation of correlations and/or dispersion statistics, multifactorial regression modeling, or the use of clustering methods (see Hilpert and Gries [2016] for an overview). Standard programming languages employed in historical corpus linguistics for the automatic extraction of the relevant linguistic patterns are Python and R [Bird et al., 2009, Baayen, 2008]. Additionally, off-the-shelf tools such as CorpusSearch [Randall, 2000] are available for the extraction of linguistic patterns from annotated corpora. Yet, the use of statistics, in particular of inferential statistics, is not always appropriate in historical linguistics, since data sparsity is a well-known problem of diachronic corpora.
Since multiple feature interactions have to be taken into account in the diachronic analysis of a single phenomenon, a multitude of high-dimensional data tables with different characteristics are usually generated. A prototypical historical linguistic data table is given in Table 1, with diachronic data extracted from IcePaHC showing the interaction between subject case marking (here NOM(INATIVE) vs. DAT(IVE)) and voice (active, passive, middle) across the Icelandic diachrony (see also Schätzle [2018] for similar and more detailed data). Another example of a prototypical historical linguistic data table is given in Table 2, showing data extracted by Taylor and Pintzuk [2011] from the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE, Taylor et al. [2003]) and the Penn-Helsinki Parsed Corpus of Middle English (PPCME2, Kroch and Taylor [2000]).

Finding significant patterns in such multidimensional tables is by no means easy. For one, the task is complex since identifying patterns and feature interactions across many such tables requires the pair-wise comparison of the relevant bits of information in the form of numbers and percentages, while keeping the temporal component in mind. Moreover, data sparsity is an issue and statistical significance is often calculated on the basis of only few occurrences of the actual observation. This may degrade the statistical conclusions for the data, making the comparison of results across the feature interactions extremely difficult. Meaningful patterns may also be obscured. In contrast, irrelevant patterns might be interpreted as significant [Taylor and Pintzuk, 2011, 91].
A further level of complexity is added by the fact that statistical calculations generally require the definition of fixed parameters, e.g., time periods. This is problematic when the selected time periods are too coarse or too fine-grained for the analysis such that transitional periods and interesting patterns therein are absorbed by the periodization. Addressing this issue, Schätzle and Booth [2019] developed DiaHClust, a data-driven method for identifying stages of language change based on hierarchical clustering (i.e., Variability-based Neighbor Clustering; Gries and Hilpert, 2008), which groups corpus data into time periods with respect to the relevant changing linguistic features. However, the technique relies on calculating differences between features which are known to have changed over time in the language, knowledge which is often not (yet) available.
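The idea of grouping only temporally adjacent data into periods can be illustrated with a deliberately simplified sketch. This is not the DiaHClust implementation: the function name, the use of a single numeric feature per text, the distance between cluster means as merge criterion, and the sample values are all assumptions made for illustration.

```python
def neighbor_clusters(points, k):
    """Merge chronologically adjacent clusters until k clusters remain.

    points: list of (year, value) pairs in chronological order, where value
    is some relative frequency of a changing feature (hypothetical input).
    Returns a list of (first_year, last_year) spans, one per cluster.
    """
    clusters = [[p] for p in points]

    def mean(cluster):
        return sum(value for _, value in cluster) / len(cluster)

    while len(clusters) > k:
        # Only neighbouring clusters may merge, so chronology is preserved.
        gaps = [
            abs(mean(clusters[i]) - mean(clusters[i + 1]))
            for i in range(len(clusters) - 1)
        ]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return [(c[0][0], c[-1][0]) for c in clusters]


# Invented values, not corpus data.
SAMPLE = [(1200, 0.10), (1300, 0.12), (1400, 0.11),
          (1700, 0.30), (1800, 0.32), (1950, 0.60)]
stages = neighbor_clusters(SAMPLE, 3)
```

The adjacency constraint is what distinguishes this family of methods from ordinary hierarchical clustering: texts from distant centuries are never grouped together merely because their feature values happen to be similar.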
The factors involved in a change are often elusive, either because the phenomenon has not yet been investigated by the community or because the matter is generally under dispute. Therefore, a researcher may have to investigate a multitude of different interactions in order to test existing hypotheses and to generate new ones, creating numerous high-dimensional data tables with different features and characteristics along with a new statistical model for each new hypothesis. This is a costly and time-consuming process, resulting in data which is difficult to navigate. Alternatively, at times, researchers calculate all possible statistical models for a given data set to test for multiple hypotheses at once. Again, this results in data that is difficult to navigate and misinterpretations of the data might occur, with irrelevant patterns surfacing as significant.
In the next section, we show how these methodological challenges for historical corpus linguistics can be overcome by integrating the use of Visual Analytics into diachronic investigations.

III VISUAL ANALYTICS FOR HISTORICAL LINGUISTICS
Visual Analytics (VA), "the science of analytical reasoning facilitated by interactive visual interfaces" [Thomas and Cook, 2005, 28], presents a significant methodological opportunity for historical linguistic research. VA methods combine automated algorithmic analyses with interactive visual components, integrating the human into the analysis loop [see Thomas and Cook, 2005, Keim et al., 2008]. The general aim of VA is to present potentially interesting and significant correlations in a high-dimensional data set saliently so as to enable significant patterns to emerge visually. The interactive and exploratory analysis process is guided by the VA Mantra: "Analyze first, show the important, zoom, filter and analyze further, details on demand" [Keim et al., 2008]. Figure 2 illustrates the coupled process of knowledge generation in VA, where the left hand side illustrates the parts involved in a visual analytics system and the right hand side depicts the reasoning process of the human, composed of exploration, verification, and knowledge generation loops [see Sacha et al., 2014]. Since historical linguistic data is inherently multidimensional, with complex feature interactions being the norm rather than the exception, historical linguistics constitutes an ideal test bed for VA applications. VA tools and techniques enable exploratory and interactive access to subspaces contained in historical linguistic data, i.e., significant correlation patterns embedded in a high-dimensional data space, which immensely facilitates the identification of language change and relevant interacting factors.
In recent years, sophisticated visualizations as developed within the field of computer science have increasingly been applied to the investigation of language change. Previous visualizations have mainly focused on the investigation of semantic change by visualizing the diachronic development of word senses. Examples are the scatterplot visualization developed by Rohrdantz et al. [2011], the pixel visualization by Rohrdantz et al. [2012] (see also Rohrdantz [2014]), and the similarity plots based on line charts by Jatowt et al. [2018]. Other approaches make use of parallel coordinate plots, e.g., Culy et al. [2011] employ Structured Parallel Coordinates for the investigation of diachronic changes in the use of modal verbs across different registers of academic discourse and Theron and Fontanillo [2015] visualize the diachrony of word meanings in historical dictionaries as parallel coordinates. There are also some visualizations which were designed for the investigation of syntactic change, focusing on the diachronic visualization of syntactic phenomena and potentially interacting factors. Examples are the glyph visualization by Butt et al. [2014] (see also Schätzle and Sacha [2016]) and the ParHistVis tool for the investigation of linguistic change in parallel corpora, which employs streamgraphs and Sankey diagrams [Kalouli et al., 2019].
In the following, we demonstrate the efficacy of using VA for historical linguistic research by introducing our HistoBankVis system and by applying the system to a concrete case study on syntactic change in Germanic.

The HistoBankVis system
HistoBankVis was first presented in Schätzle et al. [2017] and further extended and improved in Schätzle et al. [2019]. We developed HistoBankVis with the overall goal of providing a generically applicable system for the flexible investigation of the type of high-dimensional and complex data typically underlying historical linguistic work. Overall, HistoBankVis combines several interlinked visualization and filtering techniques with a structured analysis process, creating the iterative workflow shown in Figure 3. A tabular data set was chosen as the required input format for the system because integrating a data processing module for treebanks into HistoBankVis is difficult. The reason for this is that although corpora are in theory annotated according to mutually agreed upon standards, in practice annotations tend to diverge significantly enough to make work across several different treebanks with the same automatized tools difficult. We discuss this in more detail in Section V.
Bringing the data into a tabular form conforming to the csv (comma-separated values) format has the additional advantage that, in the backend of the system, the data can be stored in a relational SQL database (SQLite and PostgreSQL are supported). The SQL database allows efficient and fast access to the tabular data.
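A minimal sketch of this step, assuming a csv export with one row per sentence, might look as follows. The table name, the column names and the sample rows are hypothetical and are not taken from the HistoBankVis backend; Python's built-in sqlite3 module stands in for the database layer.

```python
import csv
import io
import sqlite3

# Hypothetical csv export: one row per sentence; column names are assumptions.
CSV_TEXT = """sentence_id,year,subject_case,word_order,voice
s1,1210,dat,VSO1,active
s2,1250,nom,SVO1,active
s3,1920,dat,SVO1,middle
s4,1985,nom,SVO1,active
"""


def load_csv(conn, text):
    """Create a 'sentences' table from csv text; return the number of rows."""
    rows = list(csv.DictReader(io.StringIO(text)))
    conn.execute(
        "CREATE TABLE sentences ("
        "sentence_id TEXT PRIMARY KEY, year INTEGER, "
        "subject_case TEXT, word_order TEXT, voice TEXT)"
    )
    # Named placeholders are filled from the dictionaries yielded by DictReader.
    conn.executemany(
        "INSERT INTO sentences VALUES "
        "(:sentence_id, :year, :subject_case, :word_order, :voice)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
n_rows = load_csv(conn, CSV_TEXT)
dative_count = conn.execute(
    "SELECT COUNT(*) FROM sentences WHERE subject_case = 'dat'"
).fetchone()[0]
```

Once the rows are in a relational table, every later analysis step reduces to a query over named columns, independently of the annotation scheme of the source treebank.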

Filtering component, dimension selection and time periods
Before visualization, the data set can be explored in the filtering component of the system, providing insights into the data quality. The user can build a task-specific data set by filtering for data points with specific (task-relevant) features and/or from a specific time period, and by selecting the dimensions which are to be investigated, see Figure 5. This is done via the visual construction of SQL-like filters, based on logical AND/OR functions. For each data point, e.g., sentence in the IcePaHC data set, detailed information about all extracted features and the underlying annotations can be accessed via mouse-click.

Figure 5: Sentence filter (top): The data is filtered with respect to the dimensions word order and subject case so that the resulting table only contains sentences which have the orders O1SV (Direct Object-Subject-Finite Verb), VSO1, SO1V, O1VS, VO1S, and SVO1, and a subject which is marked with dative case. As time range, data from the period 1900-2008 is selected. Result table (bottom): The filtered data set is displayed with respect to previously selected data dimensions, e.g., voice, word order, and verb.
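The filter shown in Figure 5 can be approximated by a parameterized SQL query. The sketch below is an assumption about how such a filter could be expressed, not the query that HistoBankVis generates: alternatives within one dimension are combined with OR (here via IN), and dimensions are combined with AND. The table layout and the sample rows are invented.

```python
import sqlite3


def build_filter(allowed, year_range):
    """Return (sql, params): OR within a dimension, AND across dimensions.

    allowed maps a column name to its permitted values. Column names are
    interpolated into the SQL text, so they must come from trusted code,
    never from user input; the values themselves are bound as parameters.
    """
    clauses, params = [], []
    for column, values in allowed.items():
        placeholders = ", ".join("?" for _ in values)
        clauses.append(f"{column} IN ({placeholders})")
        params.extend(values)
    clauses.append("year BETWEEN ? AND ?")
    params.extend(year_range)
    sql = ("SELECT sentence_id FROM sentences WHERE "
           + " AND ".join(clauses) + " ORDER BY sentence_id")
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences "
             "(sentence_id TEXT, year INTEGER, subject_case TEXT, word_order TEXT)")
conn.executemany("INSERT INTO sentences VALUES (?, ?, ?, ?)", [
    ("s1", 1210, "dat", "VSO1"),   # invented rows, not corpus data
    ("s2", 1920, "dat", "SVO1"),
    ("s3", 1950, "nom", "SVO1"),
    ("s4", 1985, "dat", "O1VS"),
    ("s5", 1990, "dat", "SV"),
])

sql, params = build_filter(
    {"word_order": ["O1SV", "VSO1", "SO1V", "O1VS", "VO1S", "SVO1"],
     "subject_case": ["dat"]},
    (1900, 2008),
)
matching = [row[0] for row in conn.execute(sql, params)]
```

With the invented rows above, only the two transitive dative-subject sentences from the period 1900-2008 remain in the result.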
For the subsequent visualization of the selected dimensions, the user has to choose time periods for the visual analysis. HistoBankVis supports a set of predefined periodization schemes for the Icelandic diachrony, but also allows for an individual definition of time periods by the user.
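Assigning each data point to a period is a simple lookup. The following sketch uses the five periods applied later in the Icelandic case study as the default scheme; the function name and the representation of a scheme as a list of (start, end) pairs are assumptions for illustration.

```python
# Periodization used in the Icelandic case study (inclusive year ranges).
PERIODS = [(1150, 1349), (1350, 1549), (1550, 1749), (1750, 1899), (1900, 2008)]


def period_label(year, periods=PERIODS):
    """Return the 'start-end' label of the period containing year, else None."""
    for start, end in periods:
        if start <= year <= end:
            return f"{start}-{end}"
    return None


# A user-defined scheme is just another list of (start, end) pairs.
coarse = [(1150, 1549), (1550, 2008)]
label_default = period_label(1920)
label_coarse = period_label(1920, coarse)
```

Keeping the scheme as plain data makes it easy to rerun the same analysis with a coarser or finer periodization.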

Visualization - Overview, difference, interaction
HistoBankVis has three visualization components which are combined in the data analysis process, providing different views of the data at different levels of detail. All visualizations are designed in D3.js as Scalable Vector Graphics.
Compact matrix visualization.
The first visualization component is a compact matrix which provides an overview of the data, see Figure 6. Each row and column of the matrix represents one time period. The matrix can be mirrored at the diagonal. This design in particular facilitates the comparison of the data in the first period to all other periods and the comparison of consecutive periods along the diagonal, letting patterns of change emerge visually. The compact matrix visualizes differences between the distributions of the selected dimensions across the individual time periods. A colormap encodes the size of the difference: red indicates a large difference, white a small one. This visualization offers a first at-a-glance look at which of the data dimensions are likely to be significantly different across time periods and therefore worthy of more detailed investigation. Two modes are available for computing differences: statistical significance and distance measure. For calculating statistical significance, we employ χ²-tests, mapping the p-values onto the colormap (red p = 0, white p ≥ 0.2). Statistically significant differences between time periods (at α = 0.05) are marked by a dot in the middle of the cell. When the necessary preconditions for χ² are not met, e.g., when the data is too sparse, a cross marks the corresponding cells. Alternatively, differences can be computed via Euclidean distance, whereby a high distance indicates a large difference.
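The two difference modes can be sketched in a few lines. The code below is an illustration under stated assumptions, not the HistoBankVis implementation: the counts are invented, the function names are hypothetical, and the sparsity check uses the common rule of thumb that every expected count should be at least 5, which the text does not specify. Only the χ² statistic is computed; turning it into a p-value would require a statistics library.

```python
import math

# Invented counts per period for one dimension (word order); not corpus data.
COUNTS = {
    "1150-1349": {"SVO1": 40, "VSO1": 60},
    "1750-1899": {"SVO1": 50, "VSO1": 50},
    "1900-2008": {"SVO1": 80, "VSO1": 20},
}
FEATURES = ["SVO1", "VSO1"]


def distribution(counts, features):
    """Relative frequency of each feature within one period."""
    total = sum(counts.get(f, 0) for f in features)
    return [counts.get(f, 0) / total for f in features]


def euclidean(counts_a, counts_b, features):
    """Euclidean distance between the feature distributions of two periods."""
    p = distribution(counts_a, features)
    q = distribution(counts_b, features)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def chi_square(counts_a, counts_b, features, min_expected=5):
    """Pearson chi-square statistic for a 2 x k table of two periods.

    Returns None when any expected count is below min_expected (assumed
    threshold), mirroring the cells that would be marked with a cross.
    """
    total_a = sum(counts_a.get(f, 0) for f in features)
    total_b = sum(counts_b.get(f, 0) for f in features)
    grand = total_a + total_b
    stat = 0.0
    for f in features:
        column = counts_a.get(f, 0) + counts_b.get(f, 0)
        for observed, row_total in ((counts_a.get(f, 0), total_a),
                                    (counts_b.get(f, 0), total_b)):
            expected = row_total * column / grand
            if expected < min_expected:
                return None
            stat += (observed - expected) ** 2 / expected
    return stat


periods = list(COUNTS)
distance_matrix = [[euclidean(COUNTS[a], COUNTS[b], FEATURES) for b in periods]
                   for a in periods]
first_vs_last = chi_square(COUNTS["1150-1349"], COUNTS["1900-2008"], FEATURES)
too_sparse = chi_square({"SVO1": 1, "VSO1": 2}, {"SVO1": 2, "VSO1": 1}, FEATURES)
```

Each cell of the resulting matrix corresponds to one pair of periods; mapping the values onto a white-to-red colour scale yields the overview described above.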

Difference histograms visualization.
The second visualization component is the difference histograms visualization. The difference histograms provide a more nuanced view on the diachrony of the investigated features and dimensions. Each time period is visualized as a composed bar chart, where the dimensions are encoded via different colors, allowing for a parallel inspection of the data from the different dimensions. Figure 7 provides an example. In this example, two dimensions are being investigated: word order and subject case. The bars representing the dimension subject case are blue; the dimension word order is orange. The distribution of the respective linguistic features is shown for each time period. The height of a bar corresponds to the percentage of data points (e.g., sentences) in which a feature occurs in comparison to all other features from the corresponding data dimension in the given time period. In order to facilitate the temporal comparison of features, differences between the features across the periods are visualized as separate bar charts below each feature bar. A green bar indicates a feature increase (e.g., as in the last row for the feature SVO1) with respect to the previous time period. In contrast, a red bar indicates that this feature has decreased in comparison to its distribution in the previous time period. The height of the bar reflects the size of the change. For example, in Figure 7 the word order SVO1 increases over time, whereas VSO1 decreases in the period 1900-2008 compared to the previous stage (1750-1899). More detailed information in the form of numbers can be displayed via mouse over. In addition to the comparison of each time period with its previous time range, the system supports further comparison modes, e.g., the comparison of each time period with the average of all preceding time periods.
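The quantities behind the difference histograms are percentages per period and their change relative to the previous period. The sketch below shows this computation under simple assumptions: the counts are invented, the function names are hypothetical, and only the default comparison mode (each period against its predecessor) is covered.

```python
def shares(counts):
    """Percentage of each feature within one dimension and one period."""
    total = sum(counts.values())
    return {feature: 100.0 * n / total for feature, n in counts.items()}


def differences(period_counts):
    """Change in percentage points relative to the previous period.

    period_counts: chronologically ordered list of (label, counts) pairs.
    Returns a list of (label, {feature: change}) for the second period onwards.
    """
    result = []
    for (_, prev), (label, curr) in zip(period_counts, period_counts[1:]):
        prev_shares, curr_shares = shares(prev), shares(curr)
        features = sorted(set(prev_shares) | set(curr_shares))
        change = {f: curr_shares.get(f, 0.0) - prev_shares.get(f, 0.0)
                  for f in features}
        result.append((label, change))
    return result


def bar_colour(change):
    """Green for an increase, red for a decrease, None for no change."""
    if change > 0:
        return "green"
    if change < 0:
        return "red"
    return None


# Invented counts, not corpus data.
WORD_ORDER = [
    ("1750-1899", {"SVO1": 50, "VSO1": 50}),
    ("1900-2008", {"SVO1": 80, "VSO1": 20}),
]
changes = differences(WORD_ORDER)
```

Other comparison modes, such as comparing each period with the average of all preceding periods, would only change which reference distribution is subtracted.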

Dimension interaction visualization.
Although the difference histograms provide an insight into the diachrony of individual features from different dimensions, correlations between changes occurring in the investigated data dimensions cannot be read off the composed bar charts directly. That is, whether a feature change in one dimension is indeed connected to a feature change in another dimension cannot yet be determined on the basis of the difference histograms. Therefore, HistoBankVis employs a third visualization component: the dimension interaction visualization. The dimension interactions implement the Parallel Sets technique [Bendix et al., 2005, Kosara et al., 2006] for the visualization of interrelations between features from multiple data dimensions. Parallel sets are based on parallel coordinates [Inselberg, 1985, 2009], but allow for a better investigation of frequency-based categorical data.
Parallel coordinates represent each data dimension as a vertical axis. The features are placed on the axes as coordinates. Related features between dimensions are connected by a line. Instead of connecting individual data points via polylines across different dimensions, parallel sets visualize connections between data dimensions via colored ribbons, enabling the representation of frequency-based interrelations. The size of a ribbon represents the share which a feature holds of a feature from another dimension. In the dimension interaction visualization component of HistoBankVis, each time period is visualized as a parallel sets visualization. For example, Figure 8 shows the dimension interaction between the dimension voice and the dimension word order in the period 1150-1349. The shares of the different voices (active, passive, middle) are mapped onto the shares they hold of the different possibilities of the dimension word order from left to right, allowing for a detailed investigation of interactions in the form of frequencies. Figure 8 shows that the majority of active clauses in the time period 1150-1349 have VSO1 word order, followed by SVO1. The same holds true for passives and middles, indicating that verb-initial word order was the most common in this time period, regardless of voice. So if one had hypothesized that voice was a determining factor for word order in this period of Icelandic, one would have been wrong. The advantage of the VA system is that the investigation of such a hypothesis can be carried out almost as quickly as its initial formulation (provided that all the data have already been fed into the system). The version of parallel sets implemented in HistoBankVis also provides for the flexible investigation of interrelations by allowing for the reordering of dimensions and features via drag and drop. For a better overview, the features on each vertical axis can furthermore be sorted alphabetically or according to size (ascending or descending). More detailed information on frequencies and feature correspondences is available via mouse interaction techniques, see the mouse over on the middle/VSO1 ribbon in Figure 8.
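The numbers encoded by the ribbons amount to a cross-tabulation of two categorical dimensions within one period. The sketch below computes, for every pair of features, the share that the target feature holds of the source feature. The function name, the record layout and the sample records are assumptions for illustration, not corpus data or HistoBankVis code.

```python
from collections import Counter


def ribbon_shares(records, source, target):
    """Share of each (source, target) feature pair within its source feature.

    records: list of dictionaries, one per sentence, for a single period.
    Returns {(source_feature, target_feature): share between 0 and 1}.
    """
    pair_counts = Counter((r[source], r[target]) for r in records)
    source_totals = Counter(r[source] for r in records)
    return {pair: n / source_totals[pair[0]] for pair, n in pair_counts.items()}


# Invented records for one period.
RECORDS = [
    {"voice": "active", "word_order": "VSO1"},
    {"voice": "active", "word_order": "VSO1"},
    {"voice": "active", "word_order": "VSO1"},
    {"voice": "active", "word_order": "SVO1"},
    {"voice": "middle", "word_order": "VSO1"},
    {"voice": "middle", "word_order": "SVO1"},
]
widths = ribbon_shares(RECORDS, "voice", "word_order")
```

Swapping the source and target arguments corresponds to reordering the axes: the same counts are then read as the share of each voice within a given word order.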

Hypothesis generation and feedback loop
A researcher may have to test several different hypotheses in attempting to understand the causes and mechanisms of language change. HistoBankVis has been designed specifically to foster an easy and seamless iterative process of hypothesis testing and generation. Once the researcher has identified a change in one data dimension or detected a potentially interesting correlation between several different features across time, the researcher can react to these insights immediately by feeding the knowledge gained back into the system and then interacting directly with the different VA parts of the system. This can be done within just minutes by filtering the data anew, choosing different data dimensions and/or investigating the data with respect to a different set of time periods.

Investigating syntactic change in Icelandic
We have been interested in the correlation between case marking and word order as part of a larger project and had been investigating this issue for Icelandic. Once HistoBankVis was developed, we began to work with this system and found that the dimension interaction in combination with the overall flexibility of the system indeed facilitated our diachronic investigations immensely. More than once, we were able to identify correlations we had not been able to otherwise anticipate given the current state of the art [see Schätzle, 2018]. We illustrate the general way of working with HistoBankVis in this section via a concrete case study which examines the interrelation between subject case and word order in Icelandic (see also Schätzle et al. [2017], Schätzle [2018], Schätzle et al. [2019]).
Icelandic is generally acknowledged to be the most conservative Germanic language in terms of syntactic change. Some changes that have been observed involve word order and case marking. For one, Icelandic follows the Germanic change from OV (Object-Verb) to VO (Verb-Object) in the verb phrase [Hróarsdóttir, 1996, 2000, Rögnvaldsson, 1996]. Moreover, a decrease of V1 (verb-first) order in matrix declarative sentences [see, e.g., Butt et al., 2014], and an increasing preference for subjects to occur in the prefinite position [Booth et al., 2017] have been attested. A further change affects the case marking system of the language in that subjects are increasingly marked with dative case [Barðdal, 2011, Schätzle et al., 2015].
While changes in word order and subject case marking have been observed, interrelations between the changes have only rarely been investigated. In this paper, we show how interrelations between changes in word order and subject case marking can be identified and examined within minutes using the HistoBankVis system. For our investigation, we use the IcePaHC data set as described in Section 4.1.1. By means of just a few clicks, we were able to uncover a previously unknown link between dative subjects, word order, lexical semantics and voice in the history of Icelandic.
To begin our investigations, we chose the dimensions subject case (nominative, accusative, dative, or genitive) and word order in the filtering component. In order to avoid overly complicating the picture in the initial stages of exploration, we decided to look at transitive sentences only. Thus, we filtered for sentences which contain a subject (S), a verb (V) and a direct object (O1, henceforth O) in the dimension word order. Moreover, we decided to investigate the diachrony of word order and subject case marking with respect to the following time periods: 1150-1349, 1350-1549, 1550-1749, 1750-1899, 1900-2008 (based on Haugen [1984]).
Moving on to the visualizations, the compact matrix, i.e., the matrix in Figure 6, showed at a glance that the distribution of word order and subject case changes over time, in particular in the last two time periods. The difference histograms provide a more nuanced view of these changes, see Figure 7. The difference histograms show that over time, SVO increases (green bars), while the other word order possibilities, in particular VSO, decrease (red bars). The most striking change with respect to word order occurs in the last time period (post-1900): While SVO increases significantly, VSO decreases substantially. At the same time, subject case marking changes as well: the use of dative subjects (sbj_DAT) increases slightly at the expense of nominative subjects (sbj_NOM).
Whether these changes are interlinked can now be easily investigated by means of the dimension interaction visualizations. Figure 9 shows the dimension interactions for subject case and word order in the first time period (top) and the period post-1900 (bottom). In the difference histogram for the period post-1900, most sentences occur together with the SVO order and have a subject with nominative case marking (Figure 7-bottom). The dimension interaction for this period provides more insights into how the different word order possibilities interact with the different options for subject case marking, see Figure 9-bottom. What becomes visible in the dimension interaction is a difference as to how the word orders are distributed across the subject cases: While the vast majority of nominative subjects are SVO, which results in the preponderance of SVO in the difference histogram, only approximately half of the dative subjects occur together with SVO. Dative subjects also frequently appear postverbally, i.e., in the orders VSO and OVS (the green ribbon above VSO1 in Figure 9-bottom). By comparing the dimension interactions, we found that the interrelation between subject case and word order has changed over time in IcePaHC. In the period 1150-1349, nominative subjects already preferably occur together with SVO. Yet, this preference is much smaller than in the last time period. Dative subjects, however, have a strong preference to occur together with the VSO order in the first time period, appearing only marginally with SVO. Although the cooccurrence frequency of dative subjects with SVO increases diachronically, it is only in the last time period that dative subjects begin to mainly occur together with SVO order.
In order to find an explanation for why dative subjects lag behind with respect to the overall developments, we decided to take a closer look at the dative subject sentences in the last time period. In order to do this, we simply went back to the filtering component and filtered for dative subject sentences only. This has already been illustrated in Figure 5. Since voice has been determined as a conditioning factor for dative subjects in Icelandic by the existing literature [Zaenen et al., 1985, Sigurðsson, 1989], we now also included the dimension voice in our investigation. Moreover, we looked at the main verbs involved in the clauses to provide insights into the role of lexical semantics, a further determining factor for dative subjects in the language (see, e.g., Jónsson [2003], Barðdal [2011]). In the result table, a large number of dative subject clauses appeared together with middle voice. In particular, the experiencer verb finna 'find, feel' lexicalized in its middle form finnast 'think, find, feel, seem' occurred most frequently. Overall, dative subjects were found most often together with experiencer predicates in the corpus, e.g., þykja 'think, seem' and líka 'like, please' (in addition to finnast).
We continued our investigation by visualizing the dimensions word order and voice with respect to the filtered data set. In the dimension interaction for the first time period, dative subjects occurred most frequently in active constructions, see Figure 8. In these constructions, as well as in the passive and middle voice, dative subjects were mainly found together with VSO. In the last time period, however, dative subjects occurred most often together with middle voice, and SVO order is most frequently used together with dative subjects in active and middle constructions, see Figure 10-bottom. Thus, the change from VSO to SVO as the preferred word order in sentences with dative subjects correlates with an increase of dative subjects together with middle voice. This increase is most striking between the last two time periods, compare Figure 10-top and Figure 10-bottom (see also Table 1). Furthermore, while dative subjects still preferably occur with VSO in active and passive constructions in the second to last time period, they already preferably occur in the SVO order in clauses with middle voice, see Figure 10-top. Hence, the increasing realization of dative subjects in the prefinite position, i.e., together with SVO order, not only correlates with the increase of dative subjects in middle constructions, but is moreover driven by this increase.
In sum, our investigation of subject case and word order in IcePaHC by means of HistoBankVis confirmed previous corpus investigations into the Icelandic diachrony, but also led to new insights by uncovering previously unknown interrelations between subject case and word order. We confirmed that subjects are increasingly realized in the prefinite position in Icelandic [cf. Booth et al., 2017]. Additionally, we showed that the usage of dative subjects increases over time (see also Schätzle et al. [2015]). The dimension interaction visualization as implemented in HistoBankVis allowed for a flexible and quick, but still detailed, analysis of interrelations between several different data dimensions. By means of the dimension interactions, we were able to show that dative subjects consistently lag behind with respect to the overall developments of word order. It is only around 1900 that they eventually begin to follow suit. This change is driven by an increased use of middle verbs, which are mainly experiencer predicates, together with a dative subject. The middle verbs in question appear mainly to have been lexicalized as experiencer verbs, as part of which the formerly locative oblique argument becomes reanalyzed as an experiencer subject [Schätzle, 2018]. This historical change at the lexical level, interconnected with the change in syntactic alignment, immediately explains the slower tendency for dative subjects to occur in the prefinite position: only once the dative experiencers are tightly coupled with the subject role and the SVO order has been more firmly established can the dative subjects conform to the overall word order settings.
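The counting that underlies a dimension interaction of this kind can be sketched as follows. This is a minimal illustration under our own assumptions, not the HistoBankVis implementation: the row layout, the function names `dimension_interaction` and `most_frequent`, and all values in `ROWS` are invented for the example and are not IcePaHC counts.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (period, subject_case, voice, word_order).
# The values are invented for illustration and are NOT IcePaHC counts.
ROWS = [
    ("1150-1349", "dat", "active", "VSO"),
    ("1150-1349", "dat", "active", "VSO"),
    ("1150-1349", "dat", "middle", "VSO"),
    ("1150-1349", "nom", "active", "SVO"),
    ("1900-2008", "dat", "middle", "SVO"),
    ("1900-2008", "dat", "middle", "SVO"),
    ("1900-2008", "dat", "active", "SVO"),
    ("1900-2008", "dat", "passive", "VSO"),
]


def dimension_interaction(rows, subject_case):
    """Count (voice, word_order) pairs per period for one subject case."""
    table = defaultdict(Counter)
    for period, case, voice, order in rows:
        if case == subject_case:  # filtering step: keep one subject case only
            table[period][(voice, order)] += 1
    return table


def most_frequent(table, period):
    """Return the most frequent (voice, word_order) pair in a period."""
    return table[period].most_common(1)[0][0]


dative = dimension_interaction(ROWS, "dat")
for period in sorted(dative):
    print(period, most_frequent(dative, period), dict(dative[period]))
```

Filtering before counting mirrors the workflow described above, in which the filter component restricts the data to dative subject sentences and the visualization then compares the remaining dimensions per time period.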

Subject case and word order in Germanic
As a language which developed a more fixed word order while retaining a rich case morphology, Icelandic is cross-linguistically atypical. Kiparsky [1997] observed that in the history of Germanic, the development of a rigid word order generally correlates with the loss of inflectional morphology.
In order to gain a better understanding of how the Icelandic changes fit in with the historical developments of the other Germanic languages, we need to investigate correlations between subject case and word order more broadly in Germanic. This can be done fairly easily via HistoBankVis, since all we need is a well-structured tabular data set which contains the relevant data dimensions for the analysis. In terms of cross-linguistic comparison, we initially decided to make use of the family of historical Penn treebanks available for Germanic, e.g., the Penn Parsed Corpora of Historical English and HeliPaD. These have all been annotated according to the same overall guidelines, so we assumed that comparability would be guaranteed. However, as discussed in section V, after having successfully worked with one new corpus (see below), we found we needed to take a step back and first address issues of data comparability.
We began our investigations by looking at the interaction between subject case and word order in HeliPaD, a corpus of Old Low German, i.e., Old Saxon. The HeliPaD annotation is similar to that of IcePaHC, containing sophisticated annotations for case marking and grammatical relations. We could thus automatically extract information about the verbs and verb types, subject and object case marking, and word order for each matrix declarative sentence on the basis of the annotation. However, although the annotations in both HeliPaD and IcePaHC comply with the guidelines of the Penn historical corpora in general, there are differences. In addition to providing annotations for case marking on nouns, HeliPaD annotates verbs and nouns for person and number, using the caret symbol (^) to delimit these morphological annotations from the parts-of-speech. For example, in Figure 11, PRO^D^3^SG stands for a third person singular pronoun marked with dative case. The HeliPaD corpus consists of one text only, i.e., the Heliand, stemming from around 830 CE. Although HeliPaD does not provide us with a diachronic perspective, we could gain insights into the interrelation between subject case and word order at this particular language stage by means of HistoBankVis. In analogy to the IcePaHC study, we filtered the data so that only transitive sentences were considered in the analysis. Then, we moved on to the visualizations of the data dimensions subject case and word order. Since HeliPaD lacks a diachronic component, the compact matrix is not suitable for presenting the data. However, the difference histogram and the dimension interaction visualization can be used for analysis. Both indicate that the word order is rather flexible in HeliPaD, with many different possibilities, see, e.g., the dimension interaction in Figure 12.
Yet, SVO seems to be the preferred word order option, occupying the largest proportion of the word order axis. With respect to subject case marking, subjects are almost exclusively nominative. Only very few accusative and dative subjects, and no genitive subjects, were found in the transitive clauses. Suggestively, the non-nominative subjects we found did not occur together with the SVO order.
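To make the extraction step for HeliPaD more concrete, the caret-delimited tag from Figure 11, written here with ASCII carets as PRO^D^3^SG, can be decomposed as in the following sketch. It rests on assumptions of ours: only the reading of this one tag (third person singular pronoun, dative) is given above, so the other case letters, the plural code and the names `Tag` and `parse_tag` are hypothetical and do not reproduce the HeliPaD tag set.

```python
from dataclasses import dataclass
from typing import Optional

# Only "D" = dative and "SG" = singular are confirmed by the example in the
# text; the remaining codes are ASSUMPTIONS made for illustration.
CASES = {"N": "nominative", "A": "accusative", "G": "genitive", "D": "dative"}
NUMBERS = {"SG": "singular", "PL": "plural"}


@dataclass
class Tag:
    pos: str
    case: Optional[str] = None
    person: Optional[int] = None
    number: Optional[str] = None


def parse_tag(tag: str) -> Tag:
    """Split a caret-delimited tag into part-of-speech and morphology.

    Features are recognized by their value rather than by their position,
    because the field order is only known from a single example.
    """
    pos, *features = tag.split("^")
    parsed = Tag(pos=pos)
    for feature in features:
        if feature in CASES:
            parsed.case = CASES[feature]
        elif feature in NUMBERS:
            parsed.number = NUMBERS[feature]
        elif feature.isdigit():
            parsed.person = int(feature)
        # Unknown features are ignored here; a real pipeline should log them.
    return parsed


print(parse_tag("PRO^D^3^SG"))
```

Recognizing features by value keeps the sketch tolerant of tags that carry fewer annotations, at the price of silently skipping codes it does not know.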
Given the small number of dative subjects in transitive sentences in the corpus, we decided to go back to the filtering component in order to gain insights about the data from a qualitative perspective. To do this, we simply disabled the previous filter settings and filtered for sentences with a dative subject instead. In total, we found seven sentences containing a dative subject. In the result table, which is shown in Figure 13, we looked more closely at these sentences with respect to the dimensions verb and word order. The dative subjects in HeliPaD do not show a clear preference for appearing in a particular position, thus differing from nominative subjects. This is similar to our findings for the earlier stages of Icelandic based on the IcePaHC data. Moreover, as in Icelandic, the predicates occurring together with a dative subject in HeliPaD are mainly experiencer verbs, e.g., thunkian 'think, seem, feel' (see the sentence in Figure 11) and likon 'please'. Overall, the findings for Old Saxon seem to line up with our findings for earlier Icelandic.
In order to be able to add a cross-linguistic and diachronic perspective, we aimed at investigating the relationship between subject case and word order in the Penn Parsed Corpora of Historical English, which include the Penn-Helsinki Parsed Corpus of Middle English (PPCME2; Kroch and Taylor [2000]), Early Modern English (PPCEME; Kroch et al. [2004]), and Modern British English (PPCMBE2; Kroch et al. [2016]). Moreover, we planned to add the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE; Taylor et al. [2003]), which uses the same annotation format, in order to be able to capture a larger span of the English diachrony. However, as already noted above, we encountered several problems during the process of data extraction. These problems are related to a strong variation in the annotation of grammatical relations and case marking across the corpora, resulting in issues of data comparability and interoperability. These issues are rooted in the annotation process, affecting areas of data uncertainty and provenance. Instead of providing an ad hoc solution which serves our present investigations only, but would not be transferable to investigations looking at different sets of linguistic factors and different treebanks, we aim, as part of future work, at developing a more unified approach to the processing and analysis of historical treebanks which takes into account issues of data comparability and interoperability as well as data uncertainty and provenance (see also Beck et al. [2020]). In the next section, we restrict ourselves to giving an overview of the relevant issues to provide a more concrete idea of what kind of problems such an approach would have to tackle. We use examples from IcePaHC, HeliPaD, the Penn Parsed Corpora of Historical English (PPCME2, PPCEME, and PPCMBE2) and YCOE for illustration. Moreover, we lay out how some of the issues can already be addressed by using a Visual Analytics tool such as HistoBankVis.

V UNCERTAINTY AND PROVENANCE ISSUES IN LINGUISTIC ANNOTATIONS
Historical corpora are generally annotated manually after a round of automatic pre-processing (see, e.g., Rögnvaldsson et al., 2012 on IcePaHC). While this allows for the precise annotation of quite complex structures and relations, manual annotations are still prone to errors. These errors might be caused by human inconsistency, e.g., when the same linguistic structure is annotated differently across a single corpus. This is particularly problematic with constructions that are rare, since the annotator, after having seen a large number of other sentences, might not be aware that a certain structure has been annotated in a different way before. Moreover, problems arise when constructions are undergoing a change over time and a uniform annotation cannot be maintained across the time stages covered by the corpus. This raises issues of data uncertainty, with consequences for the reproducibility and replicability of analyses.
A further source of concern is data provenance [Buneman et al., 2000, Cui and Widom, 2003]. The field of data provenance within computer science is concerned with understanding how to model, record, and share metadata about the origin of data and the further sharing or processing that data has undergone. We find little discussion of this topic within linguistics, even though data provenance is of major concern. Issues arise already with respect to the raw data collection in terms of the origin of the data and its authenticity and reliability. Data may have been copied incorrectly in manuscripts or papers, and once we proceed to the annotation of linguistic information, various steps in the annotation process may introduce errors into the annotated data. Keeping track of such potential sources of error via a systematically organized set of metadata seems to us to be a necessary next step within corpus linguistics.
So far, issues of data soundness have generally been addressed by conducting several rounds of annotation with different annotators, with intermediate cross-comparisons and validation of the resulting annotations, e.g., by calculating inter-annotator agreement. In this system, the creation of an annotated corpus is essentially a learning process consisting of a cycle of annotation, corpus correction, revision and reannotation.
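As an illustration of what calculating inter-annotator agreement can look like, the following sketch computes Cohen's kappa for two annotators who label the same items. The choice of Cohen's kappa is ours, since no specific measure is named above, and the labels in the usage example are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from each annotator's own
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one and the same label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented example: grammatical relation assigned to four dative NPs.
annotator_1 = ["SBJ", "SBJ", "OBJ", "OBJ"]
annotator_2 = ["SBJ", "OBJ", "OBJ", "OBJ"]
print(cohens_kappa(annotator_1, annotator_2))
```

In this toy example the annotators agree on three of four items (observed agreement 0.75), while chance agreement is 0.5, which yields a kappa of 0.5. Rare constructions contribute few items, so a corpus-wide score can hide exactly the inconsistencies discussed above.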
However, manual annotations and re-checking are time-consuming and costly activities, so often a version of a corpus is published after just one or very few of such cycle iterations. Typically, after publication of the corpus more errors are found and reported by the users. Ideally, the reported errors would be corrected and a new version of the corpus released; however, due to funding and time limitations, the official versions of the corpora often remain in a faulty state for a longer period of time, while different individual sites may end up maintaining different versions of the original annotated resource. This in turn almost inevitably leads to imperfect research results, with implications for the reproducibility and replicability of diachronic corpus studies. Although it is good practice to provide information about the version of a corpus when presenting a corpus study, the issues with respect to the comparability of research results persist. In addition, researchers investigating a certain phenomenon often 'cure' the corpus data with respect to their research endeavor to be able to deal with errors and inconsistencies contained in the data, e.g., by excluding the erroneous cases from their investigations or by correcting the annotations for the subset of the data.
For example, during our investigation of subject case marking in IcePaHC, we encountered a range of annotation mistakes, e.g., a number of dative objects were annotated as subjects by mistake in constructions with an empty (non-overt) expletive subject. We manually corrected these mistakes. Thus, replicating our studies on the basis of the official version of IcePaHC is rather difficult. But it is possible to replicate our experiments via the 'cured' data set provided as part of HistoBankVis. Since our correction of the data immediately raises issues of data provenance (who decided when which of the dative grammatical relations are to be counted as subjects?), we also provide the original annotation from IcePaHC for comparison as part of our 'cured' version. That is, in the result table, all extracted features for a sentence can be displayed along with the original underlying annotation. For example, Figure 14-right shows the annotation of a clause with the main verb breyta 'change', which takes a dative object, as provided by HistoBankVis. In the annotation, the dative argument has been erroneously annotated as a subject since there is no other overt subject candidate. The extracted features which are used in the analysis, however, have been corrected accordingly (see Figure 14-left) and can be inspected in comparison with the underlying annotation.
The example of dative subjects illustrates a further problem with annotated data: uncertainty in linguistic annotations. Uncertainty with respect to linguistic annotations can arise when structures are inherently ambiguous and the surrounding contexts are not informative enough to decide on an interpretation. This is particularly difficult in transitional periods, when structures are in the process of undergoing a functional change and vary between certain linguistic interpretations. As discussed above, we found that many instances of modern dative subjects arose from initial dative marked locatives/obliques in the process of a predicate gaining experiencer verb semantics. It is difficult for the annotator to decide exactly when the dative NP has transitioned to functioning as a subject rather than an object/oblique. Ambiguity and variation are part and parcel of historical change and annotating ambiguous structures is difficult. However, the standard method has been to decide on one of the possible options for analysis and to annotate that, rather than explicitly tagging/annotating instances of data uncertainty. In most cases, a stochastic decision is made and an ambiguous structure is tagged according to the option which occurs more frequently overall. However, if the ambiguity is not made explicit to the end-user of the annotated resource, such a priori decision making might introduce unnecessary artefacts into the resulting corpus data.
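One way to keep a correction, its provenance and any remaining uncertainty together in a single record is sketched below. This is a hypothetical data structure for illustration only and not the format used by HistoBankVis or IcePaHC: the class `RelationAnnotation`, its field names, the sentence identifiers and the free-text values are all our own assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RelationAnnotation:
    """Grammatical relation of one NP, with provenance and uncertainty."""
    sentence_id: str                      # hypothetical identifier
    original: str                         # label in the official release
    corrected: Optional[str] = None       # label after manual 'curing'
    corrected_by: Optional[str] = None    # who made the decision
    reason: Optional[str] = None          # why the label was changed
    alternatives: List[str] = field(default_factory=list)  # other readings

    @property
    def effective(self) -> str:
        """Label used in the analysis: the correction if there is one."""
        return self.corrected or self.original

    @property
    def is_cured(self) -> bool:
        return self.corrected is not None and self.corrected != self.original

    @property
    def is_uncertain(self) -> bool:
        return bool(self.alternatives)


# Invented records modelled on the two situations described above.
records = [
    RelationAnnotation(
        sentence_id="example-1",
        original="subject",
        corrected="object",
        corrected_by="authors",
        reason="dative object of breyta 'change' with empty expletive subject",
    ),
    RelationAnnotation(
        sentence_id="example-2",
        original="subject",
        alternatives=["oblique"],  # transitional structure, reading unclear
    ),
]

for record in records:
    print(record.sentence_id, record.effective, record.is_cured, record.is_uncertain)
```

Keeping `original` untouched next to `corrected` means that analyses can be rerun on either version, and an explicit `alternatives` list lets a tool display or filter ambiguous cases instead of hiding the annotator's a priori decision.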
Further issues arise with the annotation of changing structures across different stages of the language. For instance, the annotation of case marking in the historical English corpora poses such a problem. While YCOE annotates for case marking, the other Penn Corpora of Historical English do not. This is due to the fact that English lost its morphological case marking system over the course of time. Thus, the linguistic characteristics of a language stage determine the availability of annotations. While data analysis must find a way of coping with annotation inconsistencies that directly reflect a changing language structure, other design decisions appear to unnecessarily complicate automatized cross-corpora historical analysis.
Even though the family of English Penn corpora generally adheres to the same guidelines, we found severe inconsistencies across the English corpora, so that we were not able to apply a standardized approach to data processing via HistoBankVis without investing a considerable amount of time into the historical analysis of the data. Without such a solid historical analysis it is not possible for us to clean or standardize the existing corpora, as our "cleaning" would most likely introduce errors given that we are not experts in Old and Middle English.
For example, YCOE takes nominative case marking as a proxy for subjects. Thus, NPs are generally only annotated for case marking, but grammatical relations are not marked explicitly, see Figure 15. Only non-nominative subjects receive the extra subject tag -SBJ. These annotation decisions result from the fact that subjects and objects cannot be clearly demarcated in the Old English stage and nominative case marking might not always indicate the subject constituent (see, e.g., Allen, 1995). In the PPCME2 corpus for Middle English, on the other hand, case marking is no longer annotated, but grammatical relations are clearly marked, see Figure 16. In turn, the English Penn corpora differ from IcePaHC and HeliPaD, which annotate for both grammatical relations and case marking. Yet, the way in which the information is encoded again differs between the corpora, see Figures 1 and 11 respectively. Effecting the necessary changes to make these corpora comparable requires a deep linguistic analysis, which is time-consuming and requires expert knowledge.
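The mismatch described above can be made concrete with a small harmonization sketch that maps corpus-specific NP labels onto one common representation. It is a deliberately naive illustration under our own assumptions: the label strings (e.g., NP-NOM, NP-DAT-SBJ, NP-SBJ, NP-OB1) are simplified stand-ins rather than verified excerpts from the corpora, and the names `Argument` and `harmonize` are hypothetical.

```python
from typing import NamedTuple, Optional


class Argument(NamedTuple):
    relation: Optional[str]  # "subject", "object", or None if not recoverable
    case: Optional[str]      # None if the corpus does not annotate case
    inferred: bool           # True if the relation was guessed from case


# ASSUMED label inventories, simplified for illustration.
CASE_TAGS = {"NOM": "nominative", "ACC": "accusative",
             "DAT": "dative", "GEN": "genitive"}
RELATION_TAGS = {"SBJ": "subject", "OB1": "object", "OB2": "object"}


def harmonize(label: str, nominative_as_subject: bool = False) -> Argument:
    """Map a dash-delimited NP label to a common (relation, case) pair.

    With nominative_as_subject=True, a nominative NP without an explicit
    relation tag is treated as a subject, and the result is flagged as
    inferred so that the guess stays visible downstream.
    """
    parts = label.split("-")[1:]  # drop the phrase category, e.g. "NP"
    case = next((CASE_TAGS[p] for p in parts if p in CASE_TAGS), None)
    relation = next((RELATION_TAGS[p] for p in parts if p in RELATION_TAGS), None)
    inferred = False
    if relation is None and nominative_as_subject and case == "nominative":
        relation, inferred = "subject", True
    return Argument(relation, case, inferred)


print(harmonize("NP-NOM", nominative_as_subject=True))      # case-only scheme
print(harmonize("NP-DAT-SBJ", nominative_as_subject=True))  # explicit subject tag
print(harmonize("NP-SBJ"))                                  # relation-only scheme
```

The `inferred` flag is the important design choice here: a proxy such as "nominative means subject" is recorded as a guess rather than merged silently with explicit annotations. The sketch does not resolve cases such as non-nominative objects in a case-only scheme, which is precisely where the expert linguistic analysis mentioned above is needed.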
Overall, we conclude that although guidelines exist, there is a lack of a uniform standard for treebank creation. Often, inconsistencies are introduced to be able to deal with historically changing and/or ambiguous constructions. Moreover, inconsistencies across corpora might be the result of crucial differences in the linguistic systems of the different languages and language stages represented by a corpus. Although the resulting variation in the corpus annotation is often well-motivated from a linguistic perspective, it makes it difficult to process the annotated data in a standardized way, causing issues of data reproducibility and comparability of results. Moreover, data uncertainty is a core but only rarely addressed issue in historical linguistic work [cf. Merten and Seemann, 2018, Booth et al., 2020]. Here again, we note that Visual Analytics has as yet unexplored potential in addressing these issues, as there are promising lines of research on the visualization of data uncertainty (see, e.g., Bonneau et al. [2014] for an overview) as well as data provenance [e.g., Stitz et al., 2016, Herschel et al., 2017, Ben Lahmar et al., 2018]. To our knowledge, such methods have not yet been applied to linguistic research. Integrating methods from the fields of uncertainty and provenance visualization into linguistic annotation processes and into the analysis process could be a great opportunity for mitigating issues of uncertainty, provenance, reproducibility and replicability in linguistic research.

VI CONCLUSION
In this paper we introduced a Visual Analytics system named HistoBankVis and showed how it has the potential to greatly facilitate historical linguistic research by allowing for efficient and fast interactive exploration of the underlying data. This is coupled with visual presentations of the computed correlations and statistics. The parallel sets technique provides an overview of interrelations found between various linguistic features of the corpus, allowing the researcher to formulate and test different hypotheses with just a few clicks. HistoBankVis furthermore generates at-a-glance overviews while still providing the ability to interact with the individual data points and annotations from the original corpus. We showed how the access to the underlying data does justice to one issue of data provenance in that we provide access to both our corrected version of the corpus and the original annotations of the official release.
However, in experimenting with the family of Penn-style treebanks for historical English, we also found that we could not usefully and systematically extend our investigations because of issues of annotation interoperability across corpora. We discussed specific issues with respect to data uncertainty and annotation standards that have come up in our work and noted that these are general issues for any type of corpus work involving annotated data. These need to be solved in order to ensure replicability of results and analyses, and we suggest that here, again, Visual Analytics provides a promising way forward and should thus become part and parcel of the methodological corpus linguistic toolkit.

Figure 1: Sample annotation for a sentence from IcePaHC.

Figure 4: IcePaHC data set for the diachronic investigation of subject case and word order.

Figure 6: Compact matrix showing statistically significant differences between time periods.

Figure 7: Difference histograms for the dimensions subject case (blue) and word order (orange) in transitive sentences from IcePaHC.

Figure 8: Dimension interaction for voice and word order from 1150-1349 in IcePaHC.

Figure 9: Dimension interactions for subject case and word order in transitive sentences from IcePaHC in the periods 1150-1349 (top) and 1900-2008 (bottom).

Figure 10: Dimension interactions for voice and word order in transitive sentences with a dative subject in the periods 1750-1899 (top) and 1900-2008 (bottom).

Figure 11: Sample annotation for a sentence from HeliPaD.

Figure 12: Dimension interaction for subject case and word order in transitive sentences from HeliPaD.

Figure 13: Result table showing the dimensions word order and verb for sentences which have a dative subject in HeliPaD.

Figure 14: Original annotation and extracted features for a sentence from IcePaHC as provided by HistoBankVis.

Figure 15: Sample annotation for a sentence from YCOE.

Figure 16: Sample annotation for a sentence from PPCME2.

Table 2: Distribution of new and given objects across VO (Verb-Object) vs. OV (Object-Verb) order in AuxV (Auxiliary-Verb) clauses in Old and Middle English texts.