The Database of Byzantine Book Epigrams Project: Principles, Challenges, Opportunities

This paper presents an overview of the history, conceptualization, and development of the Database of Byzantine Book Epigrams, an ongoing research project hosted at Ghent University. It also offers a glimpse into current and future research threads carried out within the project, with an eye on long-term sustainability.

talked strictly about the medium of printed books. Medievalists have also been applying the term "paratext" to a wide range of other "dependent" texts, such as summaries, commentaries, etc. (see Andrist 2018). Our project, however, is concerned with poems that stand literally on this threshold between the materiality of the book and the world of the main text, and thus provide a space of communication between the author, scribe, and/or patron on the one hand, and the reader on the other.
The interest of these book epigrams lies in the fact that they are interconnected in many ways.
First, there is the material embedment. The epigrams are still to be found in their original context of use: visually and physically, we encounter them in the same shape as the original readers did. They occupy the liminal places of the book: its beginning or end, quire divisions, the ends of individual texts. Even when they offer no more than a one-line title of a work, it is meaningful that they do so in metrical shape and often in a visually distinctive script.
Hence, they are the place where codicology and palaeography interact with techniques and attitudes of reading.
Second, there is the social entanglement. These poems are the direct expression of all the actors and communities involved in book culture and text transmission. Their study allows us direct insight into the history of reading culture and into the social and economic background of manuscript production.

Moreover, the initial project was not adapted to the thickness and complexity of the contextual data relevant to a more comprehensive understanding of book epigrams.
In a second phase (2015-2020, funded by the Special Research Fund of Ghent University), the focus shifted from collecting data to exploration and analysis. The contextual datasets (persons, place in manuscript, geographical origin) underwent a major revision that resulted in a more encompassing, rational and granular treatment of the data. Text-critical problems were

II.1 Database Exploration and Navigation
The first version of the project's data platform was launched online in 2015, principally featuring objects (i.e., manuscripts) and epigrams. Technical development has continued ever since, and in 2019 we proudly announced the launch of the current version of the DBBE, including textual and metatextual data presented through a new web application and stored using a completely revised data model. This section of the paper aims to illustrate the appearance of the front end of the relational database and to clarify the principles that underlie its contents. It updates existing publications (e.g., Bernard & Demoen 2012) and draws on current practices, as described on the project's Help page.

The book epigrams recorded in DBBE are arranged according to a fundamental distinction between Occurrences and Types. Occurrences are all the instances of epigrams, exactly as they are found in manuscripts, including all kinds of idiosyncrasies in terms of orthography, punctuation, etc. The texts offered in Occurrence records are usually based on the diplomatic transcriptions made by DBBE team members on the basis of (reproductions of) manuscripts.
In some cases, however, we must still rely on manuscript catalogues or other related publications, which do not always apply rigorous standards and very often normalise the manuscript text.
Each of these Occurrences is linked to one or more overarching texts, called Types. The concept of "type" was introduced to refer to a reconstructed text that groups one or more Occurrences with an identical or similar text. Type records provide normalised texts: "readable" adaptations of the evidence (i.e., the Occurrences) found in manuscripts. The source of each Type text is always mentioned: either an external source, such as an existing critical edition or a manuscript catalogue, or "DBBE", especially for previously unedited Types or for edited texts that have been substantially updated by the DBBE team. In some instances, however, the identification of a Type as a normalised, readable version of one or more Occurrences is unsatisfactory, and a standardised Type text does no justice to the reality of the evidence.

These five categories constitute the menu items on the database website, each of which leads to the search page for the respective category. Guidelines on how to search these categories are meticulously described on the Search tips and tricks page; they will not be elaborated on here. Instead, in the next paragraphs, we will navigate through the different detail pages of the five categories and show how these pages and the data they contain are linked to each other.
As an illustration of Type detail pages, we will consider Type 3818 (Figure 1), a short poem summarising the content of the Byzantine novel Hysmine and Hysminias and a "typical" type: it poses no specific philological challenges and can be presented perfectly well within the current infrastructure of the DBBE. A very important section of any Type record is the list of its Occurrences and of related Types.
Looking at these lists, users immediately get a glimpse of the transmission of a particular epigram. For instance, all preserved Occurrences related to Type 3818 are written in manuscripts from the 14th century onwards; they all count three lines (a valuable piece of information, as we will see below in section II.2); and one of them (preserved in a 15th-century Parisian manuscript) was possibly added at a later stage, as it is dated to the 16th century. All this information is conveniently epitomised on the Type record itself and is presented more diffusely in each of the pages linked to in the screenshot in Figure 2.

The text of Occurrence records is presented as faithfully as possible, mostly by means of manuscript transcriptions. The editorial conventions we use can be found on the Search tips and tricks page. The detail pages of Occurrences and Types have the fields Metre(s), Genre(s), Subject(s), Comment, Bibliography, Number of verses and Acknowledgements in common; note, however, that for some of these information fields Occurrences can diverge from their Type. In addition, further essential information is given on Occurrence records; in the shared Acknowledgements field, for instance, we acknowledge all DBBE team members who worked on a record as well as anyone who has provided us with information. Person records include the following fields:

Date: either an exact 'born' and 'died' date, or date intervals to account for inexact dates. In addition, in order to be as objective as possible, we also provide 'attested' dates and intervals whenever known; these are based on references to the Person in primary sources.
Office: office(s) held by the person in question, transliterated rather than translated.
Identification: references to external identifiers or prosopographical databases.
Bibliography: bibliographical references.

The alternative readings provided in Occurrences 22293 and 24352, which do not belong to the Type's text field, are not searchable through the Type search page. A better visualisation of this kind of textual variation is one of the challenges being dealt with in the current, third phase of the project (see infra, III.2).
As mentioned above, the text of Occurrence 22293 is a faithful transcription from a manuscript in which the scribe actually merged two epigrams into one. This Occurrence is therefore linked both to Type 3654 and to Type 3878 (incipit Τοῦ Πνεύματος τὰ θεῖα τόξα καὶ βέλη), to which the first six lines of the poem correspond. An Occurrence record is thus designed to link a specific poem to one or more Types in a flexible way, in order to show how texts were freely combined.
In addition, a more refined system is in place to visualise portions of text at the level of the verse. Each line of any Occurrence is ideally linked to a so-called verse variant page, where parallels and deviations of single verses are clearly shown next to each other. The first verse of Occurrence 22293 (Τοῦ πνεύματος τὰ θεῖα τόξα καὶ βέλη), for instance, is to be found twenty-six times in DBBE, with a minimal degree of variation. First and foremost, such overviews are a helpful tool to collate verses, whose various attestations are conveniently displayed one after another in the first column, which is useful from a philological point of view. Secondly, these pages also help to visualise, to a certain extent, the textual fluidity typical of book epigrams. In the second and third columns of the grid appear, respectively, the place

The verse variant page corresponding to the first line of Type 4568 (Δαυϊτικὴ πέφυκα δέλτος ᾀσμάτων, "I am the book of David's songs", also referring to the Book of Psalms) includes twenty Occurrences. As visualised in Figure 9, the text of this verse varies significantly from manuscript to manuscript. In fact, the decision to group these verses together is based on the intuitive observation that their texts are similar enough to belong together. However fortunate the verse linking through verse variant pages may be, a more refined digital approach to the concept of similarity within the corpus, and its objective measurement, will only prove beneficial (see infra, III.2).
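To make the intuition of "similar enough to belong together" concrete, the sketch below groups transcribed verses into variant groups with a crude string-similarity threshold. This is illustrative only: in DBBE the grouping itself is an editorial decision (aided by suggestion tooling, see II.3), not an automatic rule, and the threshold here is arbitrary.

```python
from difflib import SequenceMatcher

# Group transcribed verses into variant groups by pairwise similarity.
def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

verses = [
    "Δαυϊτικὴ πέφυκα δέλτος ᾀσμάτων",
    "Δαυιτικη πεφυκα δελτος ασματων",      # unaccented variant spelling
    "Τοῦ πνεύματος τὰ θεῖα τόξα καὶ βέλη", # an unrelated verse
]

groups: list[list[str]] = []
for verse in verses:
    for group in groups:
        if similarity(verse, group[0]) > 0.7:  # 0.7 is an arbitrary cut-off
            group.append(verse)
            break
    else:
        groups.append([verse])

# The two spellings of the Psalter verse end up in one group,
# the unrelated verse in another.
```

A fixed cut-off like this is exactly what a more refined, objective measurement of similarity would improve upon.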

II.3 Technical Description
The data platform released in 2015 made it possible to publish a sizable amount of research data that had previously been available only to a select group of researchers. After extensive use of the platform, however, we realised that additional features that could improve its usability would be desirable, such as the possibility to link certain types of records and the ability to find records based on certain fields. The way the data model was set up made it impossible to implement some of these features. Since major changes were going to be made to the data model, it was decided to create a new web application as well, including both a public interface with search and detail pages and an editing interface. The new web application (launched in 2019) was developed by the Ghent Centre for Digital Humanities, which brings together Digital Humanities efforts in order to make the implementation and maintenance of projects more sustainable. The database is hosted on a Ghent University server, which offers a long-term guarantee of data safety and sustainability.

The current data model was primarily conceived to facilitate the linking of records. In this data model, almost every class hierarchically inherits from a base entity class, which makes it possible to create links between these classes with far fewer join tables. A special link table, "factoid", was introduced to generically model relations between objects and, where relevant, a related date, interval or location. Figure 10 depicts an example of a factoid. In the previous data model, a specific join table was used to define relations between Occurrences and Types.
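The base-entity-plus-factoid idea can be sketched as follows. All names are illustrative simplifications, not the actual DBBE schema:

```python
from dataclasses import dataclass
from typing import Optional

# Every class inherits from a common base entity, so one generic link
# table ("factoid") can model many kinds of relations between records.

@dataclass
class Entity:
    id: int

@dataclass
class Occurrence(Entity):
    incipit: str

@dataclass
class Type(Entity):
    text: str

@dataclass
class Factoid:
    subject_id: int                 # entity the factoid is about
    object_id: int                  # related entity
    relation: str                   # e.g. "reconstruction of", "located at"
    date: Optional[str] = None      # optional associated date
    location: Optional[str] = None  # optional associated location

occ = Occurrence(id=22293, incipit="Τοῦ πνεύματος τὰ θεῖα τόξα καὶ βέλη")
typ = Type(id=3654, text="(normalised Type text)")
# One generic link replaces a dedicated Occurrence-Type join table:
link = Factoid(subject_id=typ.id, object_id=occ.id, relation="reconstruction of")
```

The design pay-off is that adding a new kind of relation means adding a new relation label, not a new join table.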
In the current data model, these relations can be defined using the generic join table factoid. The following relations can currently be modelled as factoids: "completed at", "reconstruction of", "located at", "written", "related to", "subject of", "origination", "based on", "appears immediately after", "died", "born", and "attested". This system is also used to model different kinds of relations between two Types: "Is part of", "Variant (permutation of words)", "Variant (other metre)", "Variant (other subject)", "Consists of", "Same cycle", "Variant (other wordings)", and "Unknown". Additional "factoid types" can be added if it is ever necessary to model further relationships between objects.

Records from different tables representing the same data were detected and migrated to a single record in a single table. As an example, consider a person who is both the subject and the patron of an Occurrence. As demonstrated in Figure 11, in the previous data model a single person was represented by two different records, which could (and, in this specific example, do) contain different information. In the current data model, the data about this single person are stored in a single record, as visualised in Figure 12. This approach makes the data model more transparent, prevents data duplication, and reduces the chance of data inconsistencies.

Some objects, such as regions and manuscript contents, can be placed in a hierarchical structure (the so-called "parent-child system"). For example, the region of Apulia is part of the region of Southern Italy, which is part of the region of Italy. In the current data model, regions and genres are stored in such a way that any of the hierarchical levels can be linked to other objects.
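Such a hierarchy can be queried recursively, as the following minimal, self-contained sketch shows (illustrative table and column names, not the actual DBBE tables):

```python
import sqlite3

# Retrieve a region together with all of its parents via a recursive CTE.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER)")
con.executemany("INSERT INTO region VALUES (?, ?, ?)", [
    (1, "Italy", None),
    (2, "Southern Italy", 1),
    (3, "Apulia", 2),
])
rows = con.execute("""
    WITH RECURSIVE ancestors(id, name, parent_id) AS (
        SELECT id, name, parent_id FROM region WHERE name = 'Apulia'
        UNION ALL
        SELECT r.id, r.name, r.parent_id
        FROM region r JOIN ancestors a ON r.id = a.parent_id
    )
    SELECT name FROM ancestors
""").fetchall()
# rows contains Apulia plus its ancestors Southern Italy and Italy
```

Because the table stores only a `parent_id` per row, the same query works for any number of hierarchical levels.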
This allows filtering based on these different levels, where underlying levels are included in the results as well (a search for the region Southern Italy includes the results linked to the region Apulia; see infra). Figure 13 depicts the data model that makes it possible to have any number of hierarchical levels. A region and its parents can be retrieved using a recursive SQL query, as listed in Figure 14.

Data types "fuzzydate" and "fuzzyinterval" were introduced to enable the description of uncertainty in dates. In the old data model, uncertain dates were described by a start and end date in some places and by a textual description in others. In the current data model, a "fuzzydate" is defined by the earliest and latest possible date; a "fuzzyinterval" is defined by the earliest and latest possible date for the start of the interval and the earliest and latest possible date for its end. Using a "fuzzydate" instead of a textual description enables date-based searching and sorting. For example, it is possible to describe the date of birth of John XI Bekkos as between 1230 and 1240 by using (1230-01-01, 1240-12-31).

As described in detail in section II.2, verses are grouped together in groups of verse variants, and these groupings can be used to discover interesting links between Occurrences that might not be found by relying solely on the relations between Occurrences and Types. Figure 15 visualises some of the verses, and the Occurrences they are part of, for verse variant 15714.

In the new web application, which was needed to take full advantage of the changes in the data model, the front-end framework Vue.js is used to improve the user experience of the data platform. This makes it easier to display both the search filter configuration and the corresponding search results on a single page, and makes it possible to update the results immediately after filtering.
The search results contain links to detail pages with more information on certain objects, links to related objects, and relevant search pages with predefined filters. In the editing environment, the use of a front-end framework makes it possible to provide in-place validation of the entered data and enables an uncluttered representation of the data to be edited. It also allows the creation of tools that speed up specific tasks, such as the linking of verses into verse groups when editing Occurrences. Detail pages are rendered completely on the Symfony back-end to maintain findability by web crawlers. The initial request for a search or edit page results in a minimal HTML page that loads the JavaScript with the Vue.js application, which renders most of the web page. The information flow between the different components for detail page requests and for the initial edit and search page requests is visualised in Figure 16. On edit and search pages, user actions within the same page lead to data requests to the Symfony back-end, initiated by the Vue.js application, resulting in updates of the web page without a complete page reload. Figure 17 depicts the general architecture for these interactions on edit and search pages. The engine behind the search pages is Elasticsearch. Its aggregation feature is used to populate the dropdown menus used for faceted navigation. For each object that has hierarchical values in a field (see supra; e.g., manuscript contents), both the value itself and all its parents are added to the Elasticsearch document, ensuring that a search query on a parent value also finds all objects linked to children of that value.
Greek texts from Occurrences and Types are pre-processed before indexation: special characters (round, square, and angular brackets, vertical pipe, and plus) are removed; accents are removed using the ICU Analysis plugin; and all characters are converted to lowercase.
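The pre-processing just described can be approximated in a few lines. Accent removal here uses plain Unicode decomposition as a stand-in for the ICU Analysis plugin; the function name is ours, not part of the DBBE code base:

```python
import unicodedata

# Approximate the indexing normalisation: strip editorial characters,
# remove accents via NFD decomposition, and lowercase.
def normalise_greek(text: str) -> str:
    for ch in "()[]<>|+":
        text = text.replace(ch, "")
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped).lower()

print(normalise_greek("Τοῦ Πνεύματος [τὰ] θεῖα τόξα"))
# → του πνευματος τα θεια τοξα
```

Applying the same function to both the indexed texts and the user's query string is what makes accent- and bracket-insensitive matching work.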
The same operations are applied to the search strings entered by end users, leading to more predictable search results. Elasticsearch is also used extensively to help find verse variants. Two queries are executed to create a list of suggested verse variants: one retrieving the 10 groups containing the most similar verses, and one retrieving the 25 most similar ungrouped verses. In the editing environment, the editor can link any of the results returned by these two queries together in an existing or new verse variant group, as illustrated in Figure 18. An additional tool was developed right after the introduction of verse variants: it suggested verse variant groups for all verses in the complete corpus, so that a single click per group sufficed to create that group.

II.4 An Open Access Project
The DBBE project has been developed as an open access project from the start. In the current phase of the project, we will go even further by adhering to the FAIR principles (European Commission 2016; Wilkinson 2016), which prescribe best practices to make data easy to Find, easy to Access, Interoperable and easy to Reuse. Although we did not specifically focus on these principles in the past, they are already applied in the current platform.
First, data and metadata gathered in DBBE are easily findable, as records have been assigned unique identifiers and associated with permalinks. DBBE data can therefore be referred to in an easy and stable way. Furthermore, the URLs resulting from filtering on the search pages can be used to share search queries. A transparent referencing system has also been implemented to identify external resources, both online projects and publications. Table 5 contains a list of all external identifiers currently used in DBBE; most of these identifiers can be used to search for resources. The data are, moreover, improved also thanks to external input: while team members have access to the back-end interface and can edit data, the authorisation protocol put in place also allows external users to be granted read-only access to the back-end platform.

Table 5. Record category | External identifier | Can be used to search
Interoperability is a key element in ensuring that data can be integrated with other datasets. This aspect of data management has so far been fulfilled by means of mutual collaboration agreements with other relevant projects, which have resulted in reciprocal references. Among the online databases that refer to DBBE, it is worth mentioning the Pinakes database, where around 320 DBBE Type records are listed. The online platform Manuscripta Biblica, developed in the framework of the Paratexbib project, also provides links to DBBE, which is listed as an Inventory from which information is drawn.

III CURRENT AND FUTURE RESEARCH OPPORTUNITIES
In its current form, the database is very well suited to retrieving material and performing queries that proceed from a specific research question. Examples are: which epigrams can be found in Vat. gr. 1650? In which epigrams is the word εὐσέβεια ("piety") to be found? Which epigrams are typically found in Psalters of the 11th century? These are the kinds of queries for which a traditional relational database is eminently suited, and it will indeed fulfil the needs of a substantial part of the scholarly community.
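The first of these questions maps directly onto a relational query. The schema below is a hypothetical simplification for illustration, not the actual DBBE tables:

```python
import sqlite3

# "Which epigrams can be found in a given manuscript?" as a plain join.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE manuscript (id INTEGER PRIMARY KEY, shelfmark TEXT);
    CREATE TABLE occurrence (id INTEGER PRIMARY KEY, incipit TEXT,
                             manuscript_id INTEGER REFERENCES manuscript(id));
    INSERT INTO manuscript VALUES (1, 'Vat. gr. 1650');
    INSERT INTO occurrence VALUES (101, 'example incipit', 1);
""")
rows = con.execute("""
    SELECT o.incipit
    FROM occurrence o
    JOIN manuscript m ON o.manuscript_id = m.id
    WHERE m.shelfmark = 'Vat. gr. 1650'
""").fetchall()
# rows holds the incipits of all epigrams recorded for that manuscript
```

Queries of this shape, a filter plus one or two joins, are exactly what relational stores excel at.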
However, the corpus of book epigrams brings forward challenges and opportunities that exceed the limits of a traditional relational database. Specifically, relational data stores require a predefined database schema according to which the data have to be structured, which implies complex transformations of data into fixed tabular formats and potential information loss. Moreover, relational data stores are not efficient in handling complex and numerous relationships among data elements. The manifold interconnectedness of these paratexts and their material carriers, outlined above, as well as the large variety in contextual metadata, call for technologies that enable more complex research questions. Instead of generating simple lists of results, it would be interesting to analyse the corpus according to network patterns, nodes, and scales of similarity.
Such an interest is obvious when considering the formulaic nature of the corpus. More complex digital data mining will allow us to perform in-depth analyses to detect connections between patterns of a textual kind on the one hand, and patterns of contextual data

III.1 NLP Subproject
First, to cope with the problems caused by the inconsistent orthography of the Occurrences, a pipeline for linguistic annotation will be developed. Such a pre-processing pipeline consists of three parts: a tokenizer to split words and punctuation, a part-of-speech tagger to perform morphological analysis, and a lemmatiser to provide every word with its lemma. Such pipelines already exist for Ancient Greek (Crane 1991, Keersmaekers et al. 2019), but those are dictionary-based approaches that cannot deal with inconsistent orthography or out-of-vocabulary words. For example, a dictionary-based approach will be able to analyse βιβλίου as a genitive of βιβλίον ("book"), but the variant βοιβληου, as it occurs in Occurrence 32232, will not be recognised as such. To deal with this inconvenience, we will develop a new approach.

The graph database will contain nodes representing instances of Occurrences as well as nodes containing contextual and linguistic information.
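Why βοιβληου defeats a dictionary can be made concrete with a toy phonetic key that collapses the vowels and digraphs merged by Byzantine itacism. This is a hand-rolled illustration of the underlying problem, not the character-based model the project plans to develop:

```python
import re
import unicodedata

# Strip accents, then collapse itacistic vowels onto one key.
def phonetic_key(word: str) -> str:
    decomposed = unicodedata.normalize("NFD", word.lower())
    w = "".join(c for c in decomposed if not unicodedata.combining(c))
    w = w.replace("ου", "u")                # keep the digraph ου distinct
    w = re.sub(r"οι|ει|υι|η|υ|ι", "i", w)   # vowels that all sound as /i/
    w = re.sub(r"αι|ε", "e", w)             # αι merged with ε
    return w.replace("ω", "ο")              # ω merged with ο

# The dictionary-opaque spelling βοιβληου collapses onto the same key
# as the standard genitive βιβλίου:
assert phonetic_key("βιβλίου") == phonetic_key("βοιβληου")
```

A learned, character-level model generalises this hand-written rule set, which is why it copes better with out-of-vocabulary spellings.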
Three categories of relationships connecting these textual, contextual, and linguistic nodes will be introduced: (1) edges representing the structure of the Occurrences by connecting textual nodes, (2) links between textual nodes and the corresponding contextual or linguistic nodes, and (3) connections representing variations of texts or metadata by means of similarities. This new graph database will coexist with the current relational DBBE in a polyglot database system (Khine & Wang 2019). We will construct a pipeline to automate the extraction, transformation and loading (ETL) of data from the relational DBBE into the graph database, as well as a synchronisation procedure to guarantee consistency between the two databases. As a result, the graph database will provide users with an innovative visual instrument that facilitates the exploration and analysis of related textual and contextual data in the form of nodes.
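The core of such an ETL step is simple: rows become nodes, and join-table rows become labelled edges. Plain Python structures stand in for a real graph store below; the node and edge labels are illustrative, while the identifiers are those of the merged epigram discussed in section II.2:

```python
# Relational-to-graph ETL sketch.
occurrences = [{"id": 22293}]          # the Occurrence merging two epigrams
types = [{"id": 3654}, {"id": 3878}]   # the two Types it is linked to
occurrence_type_links = [(22293, 3654), (22293, 3878)]

graph = {"nodes": [], "edges": []}
for occ in occurrences:
    graph["nodes"].append(("occurrence", occ["id"]))
for typ in types:
    graph["nodes"].append(("type", typ["id"]))
for occ_id, type_id in occurrence_type_links:
    graph["edges"].append((("occurrence", occ_id), "instance of", ("type", type_id)))
```

In the graph, the one-to-many link that the relational model expresses through a join table becomes two directly traversable edges out of a single Occurrence node.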
Taking advantage of this new graph-like structure, we will develop three (sets of) advanced facilities supporting more complex research questions. First, we will create advanced similarity measures between subgraphs, based on the orthographic and semantic similarity measures between verses (see III.1) as well as on additional textual and contextual data. These similarity measures will allow us to quantify the similarity between, e.g., (subsets of) Occurrences or even entire Manuscript records. Second, we will develop relevance measures for subgraphs, which will make it possible to automatically detect the most relevant words or (half-)verses in a subset of Occurrences. Third, we will construct a set of pattern recognition techniques that find subgraphs of connected nodes (both textual and contextual) that often occur together, revealing hidden knowledge that cannot be detected by conventional querying techniques. For example, a pattern might reveal that a group of epigrams containing a specific set of words are all connected to the same person, origin, period and/or subject.
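A minimal version of the third facility is co-occurrence counting: how often does a word appear together with a piece of contextual metadata across epigrams? The data below are invented for illustration; real pattern mining would operate on subgraphs rather than flat pairs:

```python
from collections import Counter

# Count (word, metadata) co-occurrences across a toy set of epigrams.
epigrams = [
    {"words": {"δελτος", "ασματων"}, "subject": "David"},
    {"words": {"δελτος", "ψαλμων"},  "subject": "David"},
    {"words": {"τοξα", "βελη"},      "subject": "anonymous"},
]

cooccurrences = Counter()
for epigram in epigrams:
    for word in epigram["words"]:
        cooccurrences[(word, epigram["subject"])] += 1

# (δελτος, David) recurs: a tiny instance of the hidden patterns a
# graph-based search could surface at corpus scale.
```

Scaled up over thousands of Occurrences and many metadata dimensions, recurring pairs like this are the seeds of the "hidden knowledge" described above.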
The development of this subproject is informed by the idea of open data. We will investigate and develop intuitive and practical interfaces to make data publicly available. As a guideline, we will use the FAIR principles (see above, II.4). We will make sure that the new interfaces require little technical knowledge to use, but at the same time offer sufficient flexibility to query this novel database system.

III.3 Language Subproject
Above we have commented on the "formulaic" as well as "fluid" nature of book epigrams.
The main aim of the Language subproject is to gain a better understanding of this formulaic and fluid nature. In particular, we want to have a closer look at two sets of questions.
First, we want to get a better grasp of what the building blocks of Byzantine book epigrams were or, put differently, which kinds of formulaic sequences can be encountered in this corpus. We understand formulaic sequences in a broad sense, as 'the usual phrasings of a speech community' (Buerki 2016: 15), including idioms, collocations, formulas, proverbs, etc. Rather than applying such a predetermined typology, we want to describe the differences between the formulaic sequences retrieved from our corpus in terms of relevant parameters such as length, lexical specificity, combinability, and prosody. In order to do so, we want to make use of the technology that the DBBE project has at its disposal, such as the verse variant pages mentioned above, but at the same time we want to explore the novel NLP and graph-based technologies the other subprojects are developing.
The second question that we want to address is the relationship between formulaicity and creativity. As already mentioned, scribes did not always stick to fixed patterns, but combined (parts of) epigrams with each other, introduced novel lexical and morphological features, changed the order of words, etc. Our intention is to use graph-based technology, in combination with close reading of the texts, to detect and visualise outliers and to create a typology of the sorts of variation we find. At the same time, we also want to study the social motivations behind such alterations: previous research (e.g., Wray 2002) has argued that formulaic language has a fixed number of functions, such as reducing effort for the speaker/writer, marking structure in conversation or discourse, and manipulating the hearer/reader, including how the hearer/reader perceives the speaker/writer's identity. We want to explore in particular the interpersonal function of formulaic creativity in book epigrams, by looking more closely into aspects of both the micro-context (is there a connection with

III.4 Book Culture Subproject
The Book Culture subproject investigates the material entanglement of book epigrams in manuscripts. It focuses on book epigrams as the "nodes" between the material realization of texts and their intellectual or spiritual significance in society. Its main goal is to reveal attitudes in Byzantine culture towards book production and consumption, and to understand the social embedment of manuscripts. Book epigrams were a preferred forum for individuals, institutions, and communities to formulate their intentions when producing and/or consuming specific texts (often connected to donations). Book epigrams can thus contribute a valuable new dimension to existing research into the social contexts of Byzantine manuscripts and cultures of literacy. Moreover, the close relation that book epigrams (as paratexts) have with the main text in the manuscript is a source of valuable new insights into various reading strategies and levels of interpretation.
To grasp the complexity of this interconnectedness, this subproject interprets patterns of similarities between different sets of data (textual and contextual), for which a graph database is a promising new tool. In other words, instead of analysing one subset of data or metadata, this project analyses relationships between those sets, and typical patterns that emerge from bundles of relationships.
(1) A first set of relations is that between book epigrams and main texts (i.e., the main works gathered in a manuscript in which book epigrams occur). Typical patterns of discourse that book epigrams present about a certain group of main texts (be it Psalters or a canonical classical Greek author) may reveal attitudes towards the interpretation of texts and the authority of certain texts.
(2) A second set of relations is that between book epigrams and the materiality of a manuscript. Attitudes towards reading (especially in the case of foundational or contested texts) have their impact on the material outlook of the manuscript: its design and structure, the types of script, the use of ink and decoration, all of which are data meticulously preserved in the database.
(3) A third set of relations is that between paratexts and historical contexts. How can we relate patterns of textual similarities (as established in the NLP and Language Subprojects) to concrete historical contexts, and identifiable social, intellectual and spiritual communities?
This question pertains to the more concrete historical metadata registered in DBBE together with each record of metrical paratexts: regional provenance or specific monastery, networks of related scribes and/or patrons, etc.
Based on the newly developed database tools to visualise and analyse large sets of data, this subproject constitutes the ideal test case for applying the novel computational approach to an analytical model in which philological, palaeographical and codicological facts are interpreted within a wider cultural, spiritual, and historical framework. Since book epigrams often thematize the relationships between textual, contextual and extratextual data, they are an excellent research object for such an enterprise.

IV CONCLUSIONS
The experience of the DBBE project, whose achievements and challenges have been described in this paper, can constitute an excellent reference point for other projects in the Digital Humanities. Now in its third round of funding, the project has an exceptionally long lifespan. This has allowed us to adjust the scope of DBBE in order to answer new, sometimes unexpected research questions. Moreover, the research approaches adopted to collect and analyse textual and metatextual data have evolved over the years, along with the ongoing technical development.
Although our project has a clear scope and focuses on a specific corpus, the approach we have used can potentially be extended to other corpora as well. The DBBE data platform has been designed to accommodate textual and metatextual data concerning evidence from languages other than Greek. Moreover, the conceptual design and the code base of the data platform have inspired and contributed to other projects (even not textually oriented ones) developed by the Ghent Centre for Digital Humanities, and vice versa.
Like many other projects that have put a digital platform at the centre of their research activity, we are also faced with the crucial question of long-term sustainability. To what extent will a digitized corpus, compiled according to current practices and standards, still be useful, beneficial, and appealing within a few years, let alone decades? How to respond to the