How to read the 52,000 pages of the British Journal of Psychiatry? A collaborative approach to source exploration

Historians are confronted with an overabundance of sources, which requires new perspectives and tools to make use of large-scale corpora. Based on a use case from the history of psychiatry, this paper describes the work of an interdisciplinary team that tackled these challenges by combining different NLP tools with new visual interfaces that foster the exploration of the corpus. The paper highlights several research challenges in the preparation and processing of the corpus and sketches new insights for historical research that were gained through the use of these tools.


INTRODUCTION
Contemporary historians face an overabundance of digitised and digital-born sources. As Roy Rosenzweig pointed out more than 15 years ago: "Surely, the injunction of traditional historians to look at 'everything' cannot survive in a digital era in which 'everything' has survived" (Rosenzweig 2003). Navigating, exploring and analysing these sources can form a major research obstacle for historians and humanists alike. In this paper we therefore want to discuss how to foster the process of corpus exploration through the application of Natural Language Processing (NLP) and interface design while closely supporting the research process. Our case study focuses on an ongoing PhD project concerned with the dissemination of psychiatric knowledge across Europe between 1843 and 1925 through five different psychiatric journals in different languages, with a total of about 250,000 pages. The sheer quantity of this material formed a severe obstacle to a valuable and thorough analysis of the sources without computational support, and to providing answers to the specific research questions. To mediate this, we set up a probing exercise to explore the feasibility of potential solutions to the problem, creating at the same time an interesting challenge for computer science due to the unstructured nature of the historical sources as data.
In the following we will briefly outline the specific challenges posed by historical psychiatric journals and describe the context of our case study. After this we will discuss how we performed a semi-automatic cleaning of the dataset and describe different approaches to topic modelling that enable the historian to find relevant material in the sources. We will then introduce histograph as an interface that enables corpus exploration. Here we will present a new type of visualisation that improves the practical usability of the topic modelling output for the sake of content exploration, as well as the addition of further content lenses or perspectives that support the exploration tasks of the historian. We will conclude the paper with a discussion of the historical findings and of how far the use of digital methods enabled additional insights beyond what is feasible through purely paper-based research. Overall we would like to highlight that the problem at hand mandated a highly interdisciplinary approach, bringing together the work and professional perspectives of a historian, a computer scientist and a software engineer. The mutual understanding that this project has created amongst its participants of the structure and content of the (digital) corpus, as well as of the different processes that were involved, ranging from data cleaning, NLP and topic modelling to the validation of the results and their ultimate integration into an interface (histograph), was integral to obtaining relevant results; hence the value of the collaboration between all the researchers involved cannot be overestimated.

Psychiatric journals and the case of the Asylum Journal
Psychiatric journals were at the time multipurpose publications. Aside from meeting reports, they included original scientific contributions and observations, along with literature reviews of domestic and foreign periodicals and books. In addition to these evident features, they also contained announcements of conferences and contests for prize essays, while also keeping track of the rotating job positions and deaths of asylum physicians and university professors. This variety posed a serious challenge to the research endeavour, as the exploration of analogue sources of this extent and diversity through close reading only is an almost impossible task. To tackle this issue we started a probing exercise in which only one journal was taken as a case study, in order to reduce the complexity that different languages and source structures would create. For this we chose the Asylum Journal, which was created in 1853 and is currently known as the British Journal of Psychiatry (BJP), and which was, and still is, published under the auspices of the Royal College of Psychiatrists (founded in 1841). This specific journal covers a period of 72 years, resulting in 52,167 pages to explore. As a first step, we collected the different issues of this journal in digital form. The material in question was downloaded in PDF format, along with plain text transcripts when available. If transcripts were unavailable, an optical character recognition (OCR) process was performed using ABBYY Recognition Server to process all missing segments and to produce plain text output. Both the existing text transcripts and the post-OCRed text showed significant recognition errors. We did not perform any post-OCR corrections; although this is an important aspect to take into account, we had neither the time nor the means to create an OCR gold standard for the corpus and apply thorough OCR cleaning. Currently the OCR errors do not seem to affect the creation of useful topic models. We do, however, plan to review this in more detail in the future.
The complex interaction of different transformations (PDF -> OCR -> plain text) on data of varying quality posed a challenge for the historian, particularly because, to work with such corpora, the historian needs to be able to manipulate digitised "originals" in order to use them in a consistent manner. Damerow and Wintergrün have put forward that historians always need full control of a corpus, as even within "a digital framework, historical research relies on trust in its sources" (Damerow and Wintergrün 2019). This trust in sources is a precarious balancing act for the historian. How can we control a large and, in terms of data quality, inconsistent digitised corpus, such as the Asylum Journal, and give the researcher as many exploration possibilities as possible to do historical research? This was one of the prevailing issues as well as the driving force within our project.

I NLP/TOPIC MODELLING
In this part we provide a motivation for the entire text analysis pipeline, from cleaning the raw data to topic modelling.

Corpus preprocessing
The discovery of domain-specific ideas in the corpus would call for an analysis of all relevant scientific publications, such as articles, book reviews, honorary lectures and, to a certain extent, asylum reports. However, manual inspection of our input data revealed other types of heterogeneous content in the corpus, such as financial accounts, obituaries, letters to the editor, etc. Importantly, these publications were almost as present in the journal volumes as the more relevant content types. The absence of clear separation marks between the publications and missing or incomplete tables of contents, coupled with less than perfect OCR quality, made it impossible for us to automatically extract only relevant publication types from the corpus. Following the practice of customising document length for topic modelling (Schofield, Magnusson, and Mimno 2017) and author style detection (Tschuggnall et al. 2017), we defined one page as the document unit and split the entire scope of 72 volumes into 52,396 pages. Next, we aimed to remove pages that bear little content, such as membership lists as well as drawings, sketches and tables. To identify such pages in an unsupervised way, we applied a corpus-statistics methodology (Gries 2009), exploiting the observation that such pages will have far fewer stopwords than a page with "regular" content. After tokenisation, we calculated the average number of stopwords per page in every given year, and removed all the pages that either had fewer stopwords than the threshold defined for the given year, or had fewer content-bearing words than stopwords while the number of content-bearing words was below a threshold set to 30. In this way we reduced the number of pages to 47,085.
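As an illustration, this filtering heuristic can be sketched in a few lines of Python (a minimal sketch, assuming NLTK's English stopword list and tokeniser; the pages data structure and the helper names are ours, not part of the actual pipeline):

    # Sketch of the stopword-based page filter described above.
    # Assumes the NLTK 'punkt' and 'stopwords' resources are installed.
    from collections import defaultdict
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    STOPWORDS = set(stopwords.words("english"))
    MIN_CONTENT = 30  # threshold for content-bearing words, as in the text

    def counts(text):
        tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
        stop = sum(1 for t in tokens if t in STOPWORDS)
        return stop, len(tokens) - stop  # (stopwords, content-bearing words)

    def filter_pages(pages):  # pages: list of (year, text) tuples
        # First pass: average stopword count per page for every year.
        by_year = defaultdict(list)
        for year, text in pages:
            by_year[year].append(counts(text))
        avg_stop = {y: sum(s for s, _ in c) / len(c) for y, c in by_year.items()}
        # Second pass: drop pages below the per-year stopword threshold, or
        # with fewer content words than stopwords while under MIN_CONTENT.
        kept = []
        for year, text in pages:
            stop, content = counts(text)
            if stop < avg_stop[year] or (content < stop and content < MIN_CONTENT):
                continue
            kept.append((year, text))
        return kept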

Topic modelling
The diversity of the corpus in terms of publication types prevented us from proceeding directly towards idea extraction. An intermediate step, which would help to structure the entire corpus into semantically distinct groups, was necessary. Doing so would allow us to identify document clusters potentially pertinent to the main research questions and suitable for more in-depth investigation; at the same time, we expected to isolate irrelevant clusters which could be excluded from further analysis. A widely used unsupervised text clustering technique which meets our requirements is "topic modelling", which aims at the unsupervised identification of latent topics within a collection of text documents. It has been employed in digital humanities in a variety of ways, such as to derive and analyse topics in eighteenth-century newspapers (D. J. Newman and Block 2006), to reason about Classics as a field (Mimno 2012), or to analyse the development of themes in the field of computational linguistics (Hall, Jurafsky, and Manning 2008). As opposed to those use cases, we do not resort to topic modelling in order to reason about a field as a whole; rather, we use it to get access to relevant material which enables further investigation. Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003) and Non-Negative Matrix Factorization (NMF) (Lee and Seung 1999; Greene and Cross 2016; Luo et al. 2017) are popular algorithms underlying topic modelling. Both methods, in their original definition, use stochastic initialisation, which leads to an instability of generated topics and makes experimental results hard to reproduce. Several studies have looked for ways to mitigate such instability issues. For LDA, approaches such as the selection of the most frequently assigned topics or the clustering of topics generated during repetitive runs (Riedl and Biemann 2012; Mäntylä, Claes, and Farooq 2018), freezing the topic labels or lists of top topic descriptors generated during model updates (Yang et al. 2016), or optimisation via differential evolution (Agrawal, Fu, and Menzies 2018) have been proposed. For NMF, initialisation with Non-negative Double Singular Value Decomposition (NNDSVD), which does not contain a stochastic element, has been shown to effectively reduce instability of the generated topics (Belford, Namee, and Greene 2018).
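The practical difference between the two initialisation strategies can be illustrated with a short sketch (using scikit-learn purely for the sake of the example; the project itself relies on the dynamic-nmf toolkit introduced below):

    # Deterministic vs. stochastic initialisation, illustrated with scikit-learn.
    from sklearn.decomposition import NMF, LatentDirichletAllocation

    # NNDSVD initialisation contains no random element: repeated runs on the
    # same term-document matrix X yield identical topics.
    nmf = NMF(n_components=8, init="nndsvd", max_iter=400)

    # LDA (like NMF with init="random") starts from a random state: topics
    # may differ between runs unless a seed is fixed explicitly.
    lda = LatentDirichletAllocation(n_components=8, random_state=None)

    # W = nmf.fit_transform(X)   # document-topic weights
    # H = nmf.components_        # topic-term weights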

Selection of the topic modelling algorithm
Being aware of these properties of LDA and NMF, we made preliminary runs with both and compared their performance in terms of the stability of the generated topics. We ran LDA and NMF on our data to generate sets of 2 to 10 topics, and repeated the runs 50 times for each of the methods. Then we compared the similarity between the topic sets generated by LDA and NMF across all the runs, applying the Average Jaccard Similarity measure (Greene, O'Callaghan, and Cunningham 2014), which accounts for both term overlap and term ranking. The results show that NMF stability outscores that of LDA on all topic sets, ranging from [0.36-0.45] for LDA versus [0.65-0.90] for NMF. This concern became relevant for us due to the interdisciplinary nature of our project, as we were aware that unstable topic sets make data exploration more difficult for the historian. Besides topic stability, we considered two more factors when choosing between LDA and NMF. One of them is the number of hyper-parameters to be specified. Both algorithms require a parameter k which indicates the number of desired topics. In addition, LDA requires two more parameters: α, which is responsible for the topic distribution over the corpus, and β, which controls the word distribution over topics. Even though "off the shelf" values are sometimes applied (e.g., the MALLET package defaults, which correspond to those suggested in (Steyvers, Griffiths, and Kintsch 2006)), it has been demonstrated that there are no reliable defaults and that parameter values should be learned from the given data set. This makes the use of LDA more demanding in terms of preparation. The second aspect concerns topic properties. It has been noticed that LDA tends to produce rather generic topics with substantial overlap between the topic descriptors. NMF, on the other hand, yields more specific topics and may be more suitable for the analysis of narrow "non-mainstream" domains (O'Callaghan et al. 2015). Furthermore, the historical research questions are concerned with temporal changes in psychiatric knowledge acquisition. This perspective calls for a technique that would allow us to follow the dynamic evolution of the identified topics. Various approaches have been proposed to track topic evolution over time. Many of them extend the LDA technique to associate topics with time frames (X. Wang and McCallum 2006; Blei and Lafferty 2006; Cui et al. 2011; Beykikhoshk et al. 2018). However, we opted for the NMF-based approach to maintain the integrity of the workflow. We use a toolkit designed by D. Greene and J. P. Cross (available at https://github.com/derekgreene/dynamic-nmf) to model topics sequentially, moving from discrete time frames (called "window topics") towards their combined representation over time (called "dynamic topics"). In the first phase, documents are organised into disjoint sets, with each set corresponding to one time window (e.g. the BJP of 1880). Topics generated from this input are called "window topics" and independently represent every time window. Each document in the window is scored with respect to each topic in the topic set. For the second, dynamic phase, the original corpus is represented in an abstract manner by "topic documents", where each topic document is composed of the top-ranked terms from each topic in the window topic model. The underlying assumption is that thematically close topics coming from different windows will share similar topic documents (a detailed description of the entire procedure is given in (Greene and Cross 2016)). Dynamic topics can be seen as a generalisation of the window topics, such that multiple individual window topics can be associated with a single dynamic topic; the span of windows which serves as the ground for the dynamic topic generation is defined by the user and may range from a subset of windows to the entire set. Each window topic is scored with respect to each dynamic topic, which quantifies their relatedness.
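To make the stability comparison above concrete, the Average Jaccard measure can be sketched as follows (a minimal sketch; the descriptor lists in the usage example are invented for illustration):

    # Average Jaccard similarity between two ranked descriptor lists
    # (Greene, O'Callaghan, and Cunningham 2014): the Jaccard overlap is
    # averaged over increasing list depths, so agreement at the top ranks
    # contributes to more of the partial scores.
    def average_jaccard(terms_a, terms_b):
        depths = range(1, min(len(terms_a), len(terms_b)) + 1)
        scores = [len(set(terms_a[:d]) & set(terms_b[:d])) /
                  len(set(terms_a[:d]) | set(terms_b[:d])) for d in depths]
        return sum(scores) / len(scores)

    average_jaccard(["paralysis", "syphilis", "brain"],
                    ["paralysis", "brain", "serum"])  # ~0.61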

Corpus preparation for the topic modelling with NMF
As mentioned earlier, one printed page is considered a document in our corpus. Documents which have been retained after the initial cleaning are lemmatised and lower-cased. The advantage of working with lemmatised text is that it helps to reduce the vocabulary size and ties morphologically distinct words with the same meaning into one lexical unit, thus making future topic descriptors more diverse. A document is considered valid if it contains at least 50 terms after the removal of stopwords and terms with fewer than 3 characters. Additionally, terms which occur in fewer than 10 documents are removed. A term-document matrix is constructed for each time window. In our experiments the window size is equal to 1 year, which corresponds to one yearly volume of the journal. Tf-idf term weighting and document length normalisation are applied. Overall, our entire data set consists of 72 time windows, spanning the period from 1854 to 1925. These are composed of 47,069 documents and 139,422 terms. Matrix size ranges from 160 documents in 1896 to 1,140 in 1881, with an average of 661.2 documents and 1,921.08 terms per window.
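A minimal sketch of the per-window matrix construction, assuming scikit-learn; the parameter values mirror the thresholds given in the text, while the variable docs_1880 is illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(
        stop_words="english",             # stopword removal
        token_pattern=r"(?u)\b\w{3,}\b",  # drop terms shorter than 3 characters
        min_df=10,                        # drop terms in fewer than 10 documents
        norm="l2",                        # document length normalisation
    )
    # docs_1880: the lemmatised, lower-cased pages of one yearly volume,
    # each retained only if it holds at least 50 terms after cleaning
    # X = vectorizer.fit_transform(docs_1880)  # term-document matrix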

Topic coherence evaluation and selection of the number of topics
We have mentioned earlier that topic modelling algorithms require the user to specify the number of topics k to be generated. As is often the case, this number is not known in advance. The dynamic-nmf toolkit allows the user to adjust the number of topics via the generation and "on the fly" evaluation of multiple topic sets. Topics are evaluated from the point of view of their semantic interpretability, or "coherence". The intuition behind this is that a coherent topic consists of descriptors that tend to co-occur or belong to the same semantic space in the reference corpus. Various metrics for topic coherence calculation have been proposed, including Pointwise Mutual Information (D. Newman et al. 2010) or log conditional probability (Mimno et al. 2011). The dynamic-nmf toolkit calculates topic coherence as follows: the coherence of an individual topic is calculated as the mean cosine similarity between the vectors corresponding to the topic descriptors, where the vector space is constructed using the Word2Vec algorithm (Mikolov et al. 2013). The coherence of the entire model is represented as the mean coherence across all the topics. In practice, we can plot model coherence values calculated for each topic set with k ∈ [k_min, k_max] and select the k with the highest score. The number of topics used in the experiments described here ranges from 5 to 10. This choice was initially motivated by the limited amount of time we had for the experiment and the number of domain experts that could evaluate the topics.

Selection of the reference corpus
The Word2Vec algorithm responsible for the construction of the vector space can be applied to external data which is not used for topic modelling (D. Newman et al. 2010), or to the modelled text itself (Mimno et al. 2011; Greene and Cross 2016). To select an appropriate strategy we explored three different Word2Vec models, generated from potentially relevant corpora: a model generated from our own data; a model generated from the full texts of the biomedical articles available via the PubMed Central (PMC) portal; and GloVe (Pennington, Socher, and Manning 2014), trained on Wikipedia, newswire and web-crawled sources. To choose which of these models best suits our task, we selected "syphilis" and "paralysis", keywords within the research scope of the historian, and generated lists of their closest semantic neighbours.
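Such neighbour lists can be produced with a few lines of gensim (a sketch; the hyper-parameter values are illustrative and pages stands for the cleaned documents described above):

    from gensim.models import Word2Vec

    sentences = [page.split() for page in pages]  # tokenised, cleaned pages
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=10)

    for keyword in ("syphilis", "paralysis"):
        print(keyword, model.wv.most_similar(keyword, topn=10))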
Comparing the lists (see Figure 1) we notice that, despite occasional commonalities between them, the PMC-based and GloVe models demonstrate a rather contemporary vocabulary which does not exist in the same form within our corpus. On the other hand, non-words that are present in abundance in our corpus due to OCR errors might not be found in the two other models. Both issues contribute to the "out of vocabulary" problem, i.e. words which occur in the test corpus but do not exist (or are not accounted for) in the training corpus. Based on these observations we trained a Word2Vec model on our own data, using the entire dataset of 47,069 documents, as we believed that it would be the most representative for evaluating the coherence of the topic models. As a side remark, we notice that the semantic neighbours of "paralysis" and "syphilis" yielded by the BJP-based model are almost the same, which is not the case for the corresponding lists generated by the PMC-based and GloVe models. This suggests that the BJP-based model reflects an appropriate nineteenth-century view of the two diseases: general paralysis and syphilis were throughout most of the nineteenth century seen as separate diseases, and only through continuous research did scientists over time begin to understand that the symptoms of general paralysis were caused by syphilis. The characteristics of general paralysis are now known as belonging to neurosyphilis, one of the forms that can manifest itself within the tertiary stage of syphilis. We could further assume that a combination of models generated from a variety of corpora might benefit some large-scale historical study of psychiatry, highlighting changes that happened over time.

Topic modelling experiments
With all the preliminaries defined, we proceeded with the window and dynamic topic generation. Window topics were reviewed by the historian from the point of view of their interpretability and their agreement with the automatic recommendation regarding the best number of topics. Evaluation at this stage has a two-fold role: to assess the quality of the topics and, importantly, to help find the best setup for the generation of dynamic topics, as these are built upon the per-window perspective. Among the window topics some were judged generic while others appeared more focused and could easily be labelled. This turned out to be a recurrent phenomenon observed throughout the entire time span. Figure 2 shows an 8-topic set, generated for the window 1915. Note, for example, how topic #5 (marked as 1915_5 in figure 2) is focused on biology, or how topic #1 (1915_1) brings together experimental methods, disease diagnostics, and particular diseases (syphilis and paralysis) which involve the brain and nervous system. On the other hand, topic #7 (1915_7) is generic in nature and most likely discusses the administrative practices within psychiatric institutions. It is interesting to note that in most cases the number of topics suggested by the toolkit, based on the topic coherence evaluation, corresponds to the expert's choice. In the few cases of disagreement, the expert mostly preferred more topics, as they offered a more fine-grained view of the domain and included topics that would otherwise have remained hidden. There was also an opposite example, where the expert suggested collapsing certain topics into one for the sake of logical generalisation. These observations guided our strategy in the next phase. Here, for every window which is supposed to participate in the dynamic topic construction, one topic model has to be selected out of the several generated within the requested range. All 72 windows, with topic models of up to 10 topics, were taken into account for the dynamic topic stage. For each year, we picked the model with the number of topics approved or suggested by the expert. In cases where explicit instruction was not available, we selected the model with the highest number of topics out of the three top-ranked topic partitionings. The reason why we preferred this solution to simply selecting the partitioning with the highest number of topics (i.e., k=10 in our experiments) is that, even though the expert inclined towards a higher number of topics, there were cases when too many topics led to over-specialisation of the model. The specific use of the dynamic topics will be further discussed in the historical case study.

Remarks on the number of window topics and window topic coherence
Even though our working set was composed of at most 10 topics per window, we did generate topic sets with larger k, thus creating sets of models with k in the range [4, 20]. By doing so we wanted to address two questions: a) whether there were topics which remained hidden from the researcher's view in the smaller partitioning; b) how much topic coherence scores help to identify the most appropriate number of topics. We analysed the partitionings into 10 and 20 topics from the researcher's and the model coherence perspectives. The researcher's analysis of topics generated via the various partitionings revealed the following: a) certain topics remained stable and were visible in both the 10- and the 20-topic partitioning. Specifically, these are topics related to biology (especially nerve, blood and brain cells and tissues, which is logical given the thematic orientation of the entire corpus), criminal insanity, general paralysis and psychoanalysis. Another omnipresent topic had to do with various aspects of psychiatric institutions. b) there is a constant interplay between the advantages and disadvantages of a growing number of topics. While a higher topic partitioning (k > 10) often offered a more detailed view of the field and rescued small yet important topics, there were also clear cases of over-specialisation and redundancy of certain topics. These observations lead to the second question of whether the coherence measure provides reliable guidance for the selection of the number of topics. As suggested in (Röder, Both, and Hinneburg 2015; Greene and Cross 2016), we analysed median coherence scores computed over the entire time span of the corpus. Figure 3 shows median coherence scores for the sets of topics generated with k ∈ [4,10] and k ∈ [4,20], calculated based on the 10 and 20 top-ranked topic descriptors for every partitioning. We observe that coherence scores depend mainly on the number of topic terms taken into consideration and that the fluctuation in the scores between the numbers of topics is only marginal. We also calculated the average number of best-suited topics for the 10- and 20-topic partitionings, and the impact of the partitioning change on the number of most appropriate topics suggested by the coherence score for every window over the whole time span. The change rate is calculated as the number of years in which the best number of suggested topics in the 20-topic partitioning differs from that in the 10-topic partitioning, divided by the number of years in the time span. We observe the following: a) the average number of topics does not exceed 5.9 even with the 20-topic partitioning (which is not in full agreement with the researcher's feedback described above); b) the average number of topics, as well as the change rate, depend strongly on how many topic descriptors participate in the coherence calculation. The change rate is 0.3 with the top 10 descriptors, and goes down to 0.1 with the top 20 descriptors. It also turned out that the mean difference between the maximal and minimal coherence values computed over the entire time span was 0.01 for the 10-topic and 0.03 for the 20-topic partitioning, with corresponding standard deviations of 0.09 and 0.11. This means that the values as such do not have discriminatory power. This observation corresponds to the researcher's remarks that several partitionings for a window could be accepted and seemed equally valid. Taking into account the researcher's feedback on the quality of the topic partitionings and their statistical properties, we can conclude that coherence scores can serve as guidance but cannot substitute the expert's assistance in selecting the best number of topics, even more so for the number of topics in the partitioning. Another observation has to do with the compatibility of the Word2Vec coherence measure with the very essence of topic modelling. One of the strengths of the latter is its ability to capture polysemy. On the contrary, Word2Vec constructs one vector per word, thus collapsing all possible meanings and contexts of a polysemic word. As a result, as long as the topic descriptors occur in contextual proximity, the topic will be considered coherent, irrespective of whether such polysemic words have been placed into one or multiple topics. This suggests that even though a Word2Vec-based measure is suitable for topic coherence evaluation, it is less suited for judging the best number of topics. It would be interesting to experiment with contextual word embeddings, such as those proposed within Flair (Akbik, Blythe, and Vollgraf 2018) or BERT (Devlin et al. 2019), and see if they can help approximate topic modelling results and thus help to estimate the most appropriate number of topics.
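The change rate defined above amounts to the following computation (a sketch; the dictionaries mapping years to the suggested best k are illustrative):

    # Change rate between the 10- and 20-topic partitionings: the share of
    # years in which the coherence-suggested best number of topics differs.
    def change_rate(best_k_10, best_k_20):
        years = best_k_10.keys()
        changed = sum(1 for y in years if best_k_10[y] != best_k_20[y])
        return changed / len(years)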

II HISTOGRAPH
The massive digitisation of sources during the past decade by (national) libraries and archives has produced an abundance of interfaces to search collections. These search capabilities have also evolved over the years, now consisting of a mixture of simple keyword queries, n-gram frequencies, NER and topic modelling. However, not all platforms make all these options available, and the most common feature is still keyword search. A specific characteristic of these kinds of repositories is that sources and tools are often merged: a researcher is required to use the digitised sources that these national and private institutions provide with the tools they offer. This aspect is quite revealing within the HathiTrust "data capsules": the data the researcher can use is encapsulated within a very specific framework and with very specific material. In essence, this does not have to be problematic if the researcher can find the material(s) he or she needs within one database. However, when this is not the case (journal X is available in digital library A and journal Y in digital library B), a consistent analysis becomes unattainable because digital libraries use different interfaces and a varied range of algorithms to structure and search their content. Furthermore, it is not always possible for the researcher to export data in a PDF, image or text format for analysis. Since the psychiatric journals in our possession came from multiple platforms, it was key that the interface and algorithms could operate on different materials, and that the researcher was able to decide which sources needed to be implemented within the tool in order to be systematically analysed. Based on our initial experiments with the historical dataset we decided to employ histograph, a tool initially built for graph-based exploration of multimedia collections (Novak et al. 2014; Wieneke et al. 2014; Düring, Marten, Wieneke, and Croce 2015).

Histograph is a tool for the exploration and collective annotation of historical source material. It allows users to browse and search documents but also applies advanced visualisation tools for content exploration. As such it offers, for example, the ability to use the results of face recognition and named entity extraction processes to visualise the co-occurrence of persons within documents in a social graph of relationships. This makes the discovery of unexpected patterns possible. The decision to use histograph as a platform for the exploration of the BJP corpus was based on two observations. First, while the raw topic modelling output was readable for the researcher, it required a constant switch between the output and the source documents (PDFs) to review the content, thereby slowing down the exploration process. An integration into histograph promised to speed up this process by enabling a direct link between the scores of the individual documents and the ability to view and access the relevant content, both in the form of a transcript and as a scan of the original page. Second, in discussions with the historian it became apparent that, even though topic modelling allowed us to structure the corpus, an efficient exploration strategy would demand multiple perspectives on the content which up to now were achieved through the use of different tools. In the spirit of building the historian's macroscope (Graham, Milligan, and Weingart 2015), it became our goal to build an exploration facility that would provide multiple perspectives on the whole corpus in one view, with the ability to identify and zoom into relevant sections that require a more detailed investigation. While the technologies and some of the visual approaches used overlap with projects such as AntConc (Anthony 2019), Paper Machines (Guldi and Johnson-Robertson 2019) or even Google's Ngram Viewer (Michel et al. 2011), the main difference lies in the focus on exploration: data and visualisations are not used to make statements about the content of the corpus on their own, but provide a tool to find "the needle in the haystack", or simply the occurrence of very specific information. Insofar, our approach is similar to other projects in the digital humanities such as the impresso project. Clearly, this approach falls within the broader domain of Information Retrieval, defined as "[…] finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)" (Manning, Raghavan, and Schütze 2008). While we also exploit conventional approaches such as full text search, we want to show in the following that we transcend this scope by incorporating a contextualised view that not only returns a list of potentially relevant documents but also lays out an integral map for exploration by providing additional indicators of relevance for the research question at hand. As a point of departure, we imported the corpus as individual documents, each containing a single page and linked in sequence with each other. With this approach we remained very close to the original concept of histograph, thereby enabling us to make use of various existing filters in the interface. As it turned out, the time granularity of the corpus (one issue per year) led to a very sparsely populated timeline that made the selection of individual documents more complex than necessary. To mitigate this issue, we decided to evenly map each individual issue to the whole year in question, starting with page one on the first of January up to the last page on the thirty-first of December. While this process could be considered ahistorical, in the sense that the individual pages were obviously not published on these specific dates within the year, it allowed an easier integration into the tool and had the added effect of making the number of pages per year more visible through the density of documents per year.
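The even mapping of pages to dates can be sketched as follows (an illustrative helper, not histograph's actual import code):

    from datetime import date, timedelta

    def page_dates(year, n_pages):
        # Spread the pages of one yearly volume evenly between 1 January
        # and 31 December, purely to populate the timeline view.
        # Assumes n_pages > 1.
        start = date(year, 1, 1)
        span = (date(year, 12, 31) - start).days
        return [start + timedelta(days=round(i * span / (n_pages - 1)))
                for i in range(n_pages)]

    # page_dates(1881, 1140) assigns the 1,140 pages of the 1881 volume
    # to dates spread across that year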

On the development of a new type of corpus visualisations
Based on the topic modelling results described in the previous chapter, our first task was to implement a visualisation that would enable users to cross-reference the topics with the individual pages. We also wanted to experiment with different means of using and visualising the topic modelling scores as structural indicators, based on the assumption that individual sections formed by pages within a journal would show coherence across topics, and to foster the exploration of individual as well as combined topics.
To this end we developed two visualisations to foster the corpus exploration. The first interface (topic view) gives quick access to the topic terms and provides filtering capabilities for topic scores and keyword mentions in the documents/pages, ordering by time or relevance, a display of documents that match the criteria on a timeline, as well as direct access to the different documents that match the selected criteria (see figure 4). While this interface builds on established mechanisms such as a filtering pane and direct display of the documents, surplus value emerges from access to the topic term list (whose terms can become keyword filters when selected) as well as the visualisation of hits across the corpus.

The second visualisation we developed aims at providing an extensive perspective on the overall corpus, with the ability to drill down into relevant sections by zooming, the added feature of adding multiple lenses on the corpus, and the display of matches in an integral view of the corpus. This view is currently titled Bucket of Explorables and is based on the idea of content "buckets" that aggregate a variable number of documents in a visual unit (see figure 5). At the largest zoom level, the full corpus is distributed into the available buckets based on one of two user-controlled methods: an equal number of documents per bucket, or an aggregation by year. The former method rests on the fact that comparing groups of the same size is statistically sound, while the latter is closer to the working practices of historians, for whom units of time have great significance. Selecting an individual bucket opens a preview of all documents in the bucket below the visualisation. With the "Zoom into current bin" feature, users are able to open a specific year or an aggregated selection of documents to get a more fine-grained view.

Initial experiments with different kinds of visualisations for the topic modelling scores led to the conclusion that bubble charts could provide a relevant point of departure for our task, as they allow two dimensions to be shown at the same time through colour and circle size, which we used to represent the relative intensity of a topic in a group of documents. Based on this initial view we added several other features through an iterative process in close cooperation with the historian. First, we experimented with a number of standard ways to calculate topic modelling scores for a group of documents. In the end we decided to let the user choose between two methods: mean value and maximum value. Grouping by the mean value of topic scores in a group of documents allows the researcher to see the general presence of the topic in the group, whereas the maximum value allows the user to pinpoint a group that contains at least one document with a high score for a particular topic and "zoom in" on this group for further exploration. Furthermore, we added an option to edit the names of the topics, which were initially only numbered in sequence, to make them more meaningful for the user and to foster the evaluation of the results.

The "Persons" view contains, for each bucket, an aggregated number of individual persons mentioned. This is based on the results of a named entity recognition (NER) and named entity linking (NEL) task that was performed during the import of the documents into histograph. For the NER task we used the flair framework and the 'ner-fast' model trained on the CoNLL-03 dataset. Identified entities were either linked to the Google Knowledge Graph, which yielded mainly Wikipedia references for our dataset, or, based on a string comparison, linked to a Nodegoat dataset procured from the 2TBI and TIC collaborative project. The settings of this layer permit users to filter people with certain characteristics, depending on the available data, which is in turn displayed in the visualisation. As an experimental feature we integrated the ability to exclude people based on their nationality, which allows, in the case of a British journal, to quickly identify contributions of, or references to, actors outside of the British community. While this feature could become useful to identify relevant sources on the transnational migration of ideas, the current quality of the data poses significant limitations on its practical use: linking to Wikipedia entries relies on information available in Wikipedia, which limits results to people that actually have a Wikipedia entry, and at the same time introduces noise in the form of false positives through non-contemporary persons that have (vaguely) similar names. To mitigate this issue, we are currently implementing mass annotation features that will speed up manual annotation and cleaning, but we plan to further investigate the development of NEL tools that take historical context into account.

The "Keyword mentions" layer allows users to plot the occurrence of user-defined keywords across the corpus and the selected bucket range. Users can combine keywords through logical operators (AND, OR) and can add multiple layers to visualise the occurrence over time across the corpus (see figure 6). Both visualisations were used by the historian, with a clear preference for the first view, as illustrated in the following chapter. Nevertheless, we believe that the second visualisation (Bucket of Explorables) shows significant potential for corpus exploration as it provides a unique and extendable view of a large number of pages over time. We foresee in particular further fine-tuning of the display of the documents within a bucket by including the ability to access document ranges with particular features. A first version of this additional filtering capability has been implemented, but initial testing demonstrated the need for more granular access to the filter definitions. We see further potential in the development of additional layers for the visualisation, allowing for example additional access to features of entities and more complex query mechanisms, which we will implement and evaluate in the future.
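The two aggregation methods offered to the user boil down to the following (a sketch; the page dictionaries are illustrative of the per-page topic scores produced by the modelling step):

    def bucket_score(pages, topic, method="mean"):
        # pages: list of dicts mapping topic labels to scores for one page
        scores = [p[topic] for p in pages]
        if method == "mean":
            return sum(scores) / len(scores)  # general presence of the topic
        return max(scores)  # flags buckets containing at least one strong hit

    # bucket_score(bucket_1876, "general paralysis", method="max")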

III CASE STUDY GENERAL PARALYSIS
Developing an algorithm and interface to address the overabundance of historical documents with less than optimal scanning and OCR quality meant that the problem not only needed to be unpacked in all its facets, which we outlined above, but also required a case study to validate our tool and to verify the usefulness of our approach and collaboration efforts.

Selecting general paralysis as a case study
Choosing a case study started with observing the different topics that had been generated by the topic modelling algorithm. The medical historian in our team evaluated the keyword output [case insanity disease mental symptom dementia form mania paralysis general disorder attack patient condition acute melancholia state treatment delusion epilepsy], and also explored and close-read multiple pages that were suggested by the system for this particular topic. This information correlated with a subject that we could identify and describe as "general paralysis", a common disease within psychiatric institutions during the nineteenth and early twentieth century. Furthermore, it was a stable and frequently present topic throughout the corpus (see figure 4). However, to thoroughly vet our approach, we also required material against which to test our selected case study. Before illustrating the usefulness of the chosen topic modelling algorithm and the visualisation facilities in histograph as an aid for researching the history of psychiatry, it is important to give the reader some background knowledge about general paralysis and the status of current historical research about this disease. General paralysis was regularly witnessed within asylums in the nineteenth century, and had by some contemporaries been dubbed "the disease of the nineteenth century" ("Bijdrage tot de statistiek der dementia paralytica in Nederland door v. C.", Psychiatrische bladen, 1884, volume 2, p. 55-56). Its physical characteristics ranged from speech and writing impairment and diminishing locomotion (i.e. difficulty with walking) to ataxia and seizures, ultimately progressing into complete paralysis. Some of its mental symptoms were the presence of (grandiose) delusions as well as the onset of dementia, resulting in a total loss of intellectual capabilities (a full overview of the range of symptoms can be found in (Davis 2008, 87-96)). Once a patient was diagnosed with the disease, it was a certain death sentence, as no real cure was available. Psychiatrists had difficulty disambiguating and understanding general paralysis because of its resemblance to other diseases. Its symptoms, causes and treatment remained for the larger part of the nineteenth and the beginning of the twentieth century a question mark for many physicians and psychiatrists, and were debated in extenso. Early on, physicians made a link with (an excess of) sexual "indulgences", but many other factors were also speculated to be the cause of general paralysis. Psychiatrists attributed it to a "fast life", prolonged mental efforts, excessive use of alcohol, heredity, syphilis, and even sunstroke, or a combination of these (see, e.g., "The Pathology of General Paresis. By W. H. O. Sankey, M.D. Lond., Medical Superintendent, Female Department, Middlesex County Asylum, Hanwell", The Journal of Mental Science, 1864, volume 9, number 48, p. 467-493; "Pseudo-General Paralysis. By Theo. B. Hyslop, M.D., Assistant Physician, Bethlem Royal Hospital", The Journal of Mental Science, 1896, volume 42, number 177, p. 314). It was only from 1913 onwards that general paralysis would become known as neurosyphilis, when its cause and the link with syphilis were acknowledged by the medical world. Before this revelation, general paralysis was for a considerable amount of time categorised as a separate disease entity with its own symptoms and disease pattern, while the relationship with syphilis drifted in and out of focus in the medical world across Europe.

Topic model keyword explorations and research validation
As mentioned earlier, the link between general paralysis and syphilis only slowly gained traction amongst physicians and psychiatrists, which is something we can also observe within the keyword lists of our topic models. In the topic called "general paralysis", the words "syphilitic" and "syphilis" are listed as keywords. Furthermore, another topic identified by our algorithm consisted of the keywords [paralysis, general, syphilis, reaction, fluid, case, spinal, positive, syphilitic, blood, serum, cerebro, test, negative, paralytic, disease, wassermann, cent, result, organism]. This topic could be identified by the historian as either "syphilis" or "general paralysis". In a first instance we categorised it as the former, since the Wassermann test used to detect syphilis (keywords: wassermann, serum, reaction, fluid, test) was a clear reference to this disease. At a later stage we renamed this topic "syphilis/general paralysis", because the two became so closely connected in the early twentieth century and keywords related to general paralysis were also visible within the list of terms. This primary exploration illustrates that the diagnostic and conceptual difficulties accompanying this disease shimmer through the keywords and topics proposed by the system. It also highlights that the annotation of topics with a keyword is not only a convenience at the level of the interface but also a critical step of historical interpretation of simple keyword lists that requires reflection. When we investigate the aspect of time for the topics "general paralysis" and "syphilis/general paralysis" via keyword tracking, a couple of interesting aspects come to the surface (figure 7). The "general paralysis" topic shows that words related to syphilis only show up sporadically from the 1890s onwards. In addition, the "syphilis/general paralysis" topic shows how keywords related to syphilis become more present from the 1890s onwards. Also note that, for our second topic, the system mostly marks years from the second half of the nineteenth century as relevant. In essence it is a continuation of the previously mentioned general paralysis topic, while at the same time illustrating that a change occurred in how psychiatrists talked about general paralysis and syphilis as (standalone) medical subject(s).
Figure 7. List of keywords per year for the topic "syphilis/general paralysis". The tracking of keyword changes through time was done using plain text files. These files were made by the NLP expert to explore and assess the correctness of part of our topic modelling algorithm. The tracking of keyword changes is not yet available within the histograph environment due to its specific internal structure.
While we could interpret these findings as an indication that the system is capable of capturing the delicacy of historical subjects, it should be noted that topic modelling remains a rather coarse tool. Due to its statistical nature, only very strong "signals" will create an output, and their interpretation is often not as straightforward as in our case.

Validating existing knowledge and beyond
In order to validate our system from a content point of view, we compared it with current historical research about general paralysis. We especially contrasted our outcomes with the findings of Juliet Hurn (Hurn 1998). Her thesis deals partially with the same corpus, which she had to investigate manually via the tables of contents or indexes of these journals. In addition, she asked similar questions about general paralysis and syphilis. The comparison of existing research with our explorations (keyword tracking, reading through the highest-scoring pages, etc.) proved to be consistent with the broad lines depicted in existing historiography. We were in particular able to confirm that the pages used by Hurn in her thesis were also picked up by our algorithm and scored accordingly for the specific topic. While confirmation of existing knowledge is a necessary precondition to demonstrate that new digital tools are capable of reproducing this knowledge, our goal was not limited to a reproduction of the status quo; we also wanted to identify how tools can enable researchers to contribute to existing research by identifying new and relevant content in large amounts of source material. The combination of topic modelling and the exploration of the corpus through histograph highlighted two particular domains where an application of the toolchain provided new inputs for historical research:
1. the identification of "hidden" pages that are not captured through classical approaches such as table of contents analysis;
2. the limiting of the search space by narrowing down the number of relevant pages through the combination of topic modelling and keyword search.

Exploring "hidden" pages
One of the added values of our topic modelling and page ranking algorithm is that it can supply us with more specific and relevant content. The basis of historical research is the sources that are available to the historian, and research is furthermore guided by the particular ways in which we have access to them (i.e. physical or digital; in full or only partially). To the best of our knowledge, Juliet Hurn only had the opportunity to investigate the BJP manually to acquire relevant content. This is not to say that researchers should not use the table of contents at all, far from it, as these finding aids give the researcher an idea of the original structure of the information. Nonetheless, human-made indexes and tables of contents can contain mistakes (keywords that are missing, wrong page number attributions, etc.). In addition, the historian could, while sifting through these indexes and tables of contents, accidentally skip over useful material. Furthermore, article titles do not capture the full breadth of an article, and it is difficult to judge from the title alone whether it could contain useful information. Instead, our system allows the researcher not only to rely on human-made indexes or tables of contents, but to trace the presence of themes and ideas in the full corpus.
In particular we were also able to identify pages in other sections of the journal which were not visible or accessible through analogue table of contents and index analysis. For example, the index of the year 1880 contains only a limited number of articles that would be considered relevant in a TOC analysis (figure 8), such as the article "On Syphilitic Epilepsy" shown in the original articles section of the journal.
Figure 8. Left: table of contents from the BJP of 1880 (26/113-115) with only one or two articles relating to general paralysis or syphilis. Right: within the section "psychological retrospect" we find information extracted from a French journal about general paralysis, which could not have been discovered via the table of contents.

When we explore the same year of the BJP with the topic modelling scores, we can identify pages that carry titles or headings that do not explicitly refer to general paralysis or syphilis, but are instead called, for example, "psychological retrospect". A strict table of contents analysis therefore misses this particular section of information about general paralysis. Such weak and "hidden" signals are just as important for forming an idea about the history of general paralysis and syphilis as the original or translated articles we can find in the BJP, especially since they give us a more nuanced view of where the information that was distributed amongst British physicians originated, as well as of what foreign physicians found important to inform British psychiatrists about. These features enable the historian to trace the finer and sometimes implicit connections and references that are made to general paralysis, expanding and fine-tuning our understanding of the topic at hand. This means that the information that Juliet Hurn derived from the journals and the information we extracted are somewhat different. Hurn, for example, placed an emphasis on French influences (Hurn 1998, 24), but the sources compiled by our algorithm suggest this should be re-assessed to a certain extent. The pages selected by our system show that information did not only arrive in Britain via France, but in some instances also from Germany. Not only did the BJP include reviews of German books, but also translations of German articles, which indicates the importance British psychiatrists placed upon this German research. All of this indicates that German influences should be studied more closely. The possible importance of this German influence will also resurface in the following section about limiting the search space. The possibility of finding "hidden" pages is a very important feature of histograph; however, it is difficult to quantify exactly how many pages are identified (and how many are still missed), because the creation of a gold standard that we could use to measure the number of hidden pages would require a stable set of characteristics. Such a definition is not possible because what qualifies as a hidden page, and how many there are, largely depends on the topic under study and the specific facets the historian is interested in at a given moment. Hence we would only be able to give a number of hidden versus visible pages for a very specific research subject and at a significant time investment of the historian, as they would have to perform close reading and annotation of the corpus to create this gold standard. While this time investment was not feasible within the limits of the project, we will explore and review this issue in future research.

Limiting the search space
To get a more precise idea of the development of the syphilis-general paralysis connection, it is useful to look specifically at those pages within the corpus that mention both words together. The "general paralysis" topic, for example, contains 46,847 pages that include information on this topic. As almost all pages of the corpus therefore have a score for the topic (even though this score diminishes tremendously towards the end of the set), further indicators become necessary to identify relevant pages within the corpus. The occurrence of specific keywords became a very practical indicator to limit the number of pages that need to be reviewed. Histograph fosters an iterative approach in which a user selects and tests different keywords directly in the interface and sees the results plotted across the full corpus in time.
By combining ["paralysis + syphilis"] or ["paralytic + syphilitic"] within the general paralysis topic, only pages are shown where both words occur.This not only leads to even more relevant pages to explore but also to less information that needs to be consulted.The "paralysis + syphilis" keywords resulted into 1103 pages, the "paralytic + syphilitic" keywords result in 1455 pages to explore (see figure 9).When analysing the high ranked pages in histograph, we see that some of these articles/pages were also investigated in former historical research.While the information on these particular pages are in line with current historiography, its content can in addition be used as a basis for further (re)examination through a more focused use of the "keyword mentions" feature in histograph.One specific article for example mentioned more than 85 psychiatrist who were concerned with syphilis and general paralysis. 39No historian has taken an interest in these different psychiatrists nor did they take (or have) the opportunity to investigate the relevance of these European psychiatrists throughout a substantial corpus.Using the "keyword mentions" in histograph allows us to limit the search space to target very specific pages where these persons are mentioned in the relevant context.In prospect, a better quality of NER/NEL that correctly annotates relevant persons could further streamline the process of drilling down to relevant pages.
To further experiment with this, we selected one name (Karl Friedrich Otto Westphal, an important German psychiatrist) from the earlier mentioned list. Through logical reasoning we know that Westphal's name will most likely be mentioned in relation to a very diverse range of topics. To counteract this, we opened the "general paralysis" topic and specified that the system should only search for documents where "Westphal", "syphilis" and "paralysis" co-occur on the same page. This resulted in an output of 13 pages. Opening the first page of this specific selection, we read the following: "[...] Several German writers, Meyer, Westphal, Oedmansson, and Griesinger, have specially studied this kind of case [general paralysis]: and, while they do not altogether agree as to their conclusions, the weight of evidence is in favour of the view that those are cases of general paralysis, who, having previously had syphilis, have the character of their symptoms influenced by it to some extent, a casual relation only existing between the two diseases". This statement was made in 1873 in a lecture by the British physician David Skae. This once more confirms that German physicians were actively involved in research on general paralysis and that, although we have now forgotten many of them, psychiatrists at the time were well aware of who across Europe participated in solving the medical questions surrounding this disease.

CONCLUSION AND FUTURE WORK
While historians and humanists are grateful for the availability of sources through online repositories and archives, this can nonetheless create various research obstacles. One of these is that researchers can easily lose focus and oversight while exploring large corpora, such as The British Journal of Psychiatry with its more than 50,000 pages. A potential remedy to these challenges lies in the development of new tools that are tightly aligned with the research questions at hand and support the researcher in the exploration task. As discussed in this paper, we experimented with different forms of natural language processing, in particular topic modelling, and with data visualisation to provide new modes of interrogating the corpus. Throughout this probing exercise, communication between the four team members was crucial for the completion of this project.
In terms of text pre-processing of the source documents, our current work did not include specific measures to clean up the text itself. We can observe that overall the topics make sense, but they contain occasional occurrences of non-words. Cleaning of the OCR errors would improve the text consistency and allow us to generate more informative topics. As for the development of new interfaces for historical research, our approaches to distant reading (Bucket of Explorables, keyword time tracking) can provide the historian with a first "lay of the land", a map to navigate a corpus through dominant topics, especially in cases where the researcher is not familiar with the corpus. Quick overviews like this provide a distant reading or abstraction that cannot be acquired through either analogue or digital close reading. Furthermore, these mechanisms can help to observe changes across multiple journals within the same country or across borders. This can aid researchers, for example, to identify the inception of technical terms and, later on via close reading, the particular definitions and points of view of psychiatrists and physicians that lie behind these terms. The acceptance of the general paralysis-syphilis connection, for example, is often seen as globally acknowledged between the late 1890s and early 1910s. Our system could help historians to be more precise about these developments in other journals or countries by making it easier to identify and [...]

4. In-depth testing of a topic model for configurations with a larger number of topics and for different language corpora. We recently started experimenting with the creation of 4 up to 20 different topics to better understand the impact of variable topic numbers on the exploration of corpora. Allowing for the option of different numbers of topics makes the exploration of certain themes even more fine-grained. Besides this experiment, we have also started to explore topic modelling for corpora in other languages (French). These experiments are still in an early stage. However, it became apparent rather quickly that certain topics (i.e. "the criminally insane" or "general paralysis") can be clearly observed in the British as well as the Belgian corpus, signifying a certain unity in psychiatrists' interests across different countries.

Figure 2. 8-topic partitioning of the BJP volume from 1915.

Figure 3. Median coherence scores calculated for the topic sets with k ∈ [4,10] and k ∈ [4,20], based on the 10 and 20 top-ranked topic descriptors, for the BJP in 1854-1925.

Figure 4. Topic view with topic terms, filtering features, timeline view and access to individual documents.

Figure 5. Aggregated topic model scores displayed in individual bins by year. The bucket for the year 1876 is highlighted and the associated documents for this bucket are displayed below the visualisation.

Figure 6. View of the corpus with multiple layers, which enables e.g. the identification of parts of the corpus that score high on certain topics while containing a given set of keywords.

Figure 9. Keyword exploration of co-occurrences of "syphilis" and "paralysis" in the topic view, plotted over time.