TEI-encoding of text reuses in the BIBLINDEX Project

This paper discusses markup strategies for the identification and description of text reuses in a corpus of patristic texts related to the BIBLINDEX Project, an online index of biblical references in Early Christian Literature and the Middle Ages. In addition to the development of a database that can be queried by canonical biblical or patristic references, a sample corpus of patristic texts has been encoded following the guidelines of the TEI (Text Encoding Initiative), in order to give users of the https://www.biblindex.info platform direct access to quoted and quoting text passages.


INTRODUCTION
BIBLINDEX [http1] is a project led by the French Institut des Sources Chrétiennes [http2], part of the HiSoMA research center in Lyon, with the support of the LIRIS (Laboratoire d'Informatique en Image et Systèmes d'Information) and the LIG (Laboratoire d'Informatique de Grenoble), two computer science labs in the Rhône-Alpes region (France). It aims to create an exhaustive online index of biblical quotations and allusions in Early Christian Literature, both Western and Eastern texts, which should eventually cover the whole of Late Antiquity and the early Middle Ages.
After introducing the BIBLINDEX Project, this paper describes a new initiative to expand the data available by directly encoding biblical text reuse in patristic texts, using a TEI encoding strategy that can account for the concepts and categories theoretically defined by the BIBLINDEX Project council. We present the way our TEI encoding choices were defined in order to give a closer account of the text reuse phenomenon, bearing in mind that these choices have to be compliant with the data already stored in the database and with the printed edition prepared at the same time. Firstly, the issue of text reuse delimitation will be addressed, that is to say both the delimitation of the biblical text segment to which the text reuse refers and the delimitation of the text segment considered as a reuse in the patristic work. Secondly, the issue of their characterization will be studied. In doing this, we extend a methodological reflection from [Mellerin, 2013]¹.

Description
BIBLINDEX currently offers a comprehensive inventory of quotations and allusions providing bibliographical information. It includes a broad sweep of references ranging from verbatim quotations to loose allusions. Each record consists of a series of numbers indicating the chapter and verse of the biblical text, its location in the patristic work and the corresponding page and line numbers in the reference edition. The web interface lists the authors included in the database in a drop-down menu. The search facilities return all the entries for a biblical verse or text passage, which can be further restricted by date, geographical region and author or work. Bibliographical details for the edition as well as the page and line number of the reference are provided for each online entry and can be used to filter the maximal list provided in the search results.

The available corpus
The project is based on the resumption of Biblia Patristica, a collection of printed indexes of biblical citations in the writings of Greek and Latin Church Fathers from the first centuries, published by a CNRS team, the CADP (Centre for Patristics Analysis and Documentation) in Strasbourg, between 1965 and 2000 [Allenbach, 1967] [Junod, 1995]. Sources Chrétiennes has been entrusted with the CADP database and further unpublished data. The seven published volumes of [Biblia Patristica, 1975-2000], along with a supplementary volume of biblical references in Philo of Alexandria, cover the first three centuries, along with part of the fourth:
• volume 1, which appeared in 1975, covers the beginnings of extracanonical Christian literature up to Clement of Alexandria and Tertullian;
• volume 2 indexes the literature of the third century, apart from Origen,
• to whom volume 3 is exclusively devoted;
• coverage of the fourth century begins with volume 4, which includes the writings of Eusebius of Caesarea, Cyril of Jerusalem, and Epiphanius of Salamis;
• volume 5 covers Basil of Caesarea, Gregory of Nazianzus, Gregory of Nyssa, and Amphilochius of Iconium;
• volume 6 turns to the Latin writers Hilary of Poitiers, Ambrose of Milan, and Ambrosiaster;
• volume 7 on Didymus the Blind appeared in 2000.
The 270,000 published citation references have now been made freely available online as BIBLINDEX. The searchable database and website also give access to around 100,000 further unpublished entries for authors including Athanasius, John Chrysostom, Theodoret of Cyr, Procopius of Gaza, Jerome and works marked as Spuria and dubia related to the authors of the published volumes. This data has been verified in exactly the same way as for the published volumes; only the last proofreading step is still missing.
In addition to this material, about 550,000 references prepared for the CADP by its usual external contributors, volunteer European researchers who are experts in different patristic fields with extensive knowledge of the Bible (among them: J. Fleury, G. Valayer, B. Outtier), have been digitized and normalized, i.e. converted into a format consistent with the already available data, and are now ready to be put online. This data is "unverified", that is to say it did not benefit from the thorough scientific and technical verification usually carried out by the permanent CADP staff, but all the references have been established by reliable experts. These citations are taken from 3,000 diverse texts, written in different places and at different times. Among them are Catenae, hagiographical and liturgical texts, and works of Ps.-Chrysostomus, Ephrem the Syrian, Cyril of Alexandria, Gregory the Great and Maximus the Confessor.
The BIBLINDEX team is now focusing on the systematic treatment of the remaining works from the fourth and fifth centuries, especially those of Augustine, in order to ensure the exhaustive coverage of this period [Mellerin, 2017-1]. Some statistical studies have already been carried out on the corpus currently available [Bady, 2015]; [Mellerin, 2014; 2016; 2017-2; 2017-3]. Moreover, a prototype geovisualization interface [http3] displaying temporal and spatial data about authors and works has already been developed by the LIG.

Methodological issues
Work continues to augment the BIBLINDEX database with citation references in the format described above (verse number - location in the work). This data has its value: the ability to search by verse reference makes it easy to access synthetic information on the various uses of a biblical passage, and the more references are available in the database, the more illuminating statistical approaches will become. Besides, these references in the form of numbers are, and will always be, freely available online, as they do not depend on any copyright protection issue.
However, the references contained in the database are, of course, the result of careful analysis of the patristic and biblical texts concerned. Unfortunately, all trace of this analytical step for the inherited data has typically been lost. The precise delimitation of each text reuse is no longer available: the information consists solely of a line number, which only gives a clue to the beginning of the text reuse. The annotations made by the analysts, and thus the motivations behind their choices, have also been lost. Paradoxically, these reference statements, conceived by the CADP as a research outcome, constitute one of the starting points for our research today: the references listed in the database have to be re-injected into the patristic texts in order to be verified and refined. When it comes to adding new data to the database, the analysis will start from the text, with a TEI-compliant markup stage supported by an automatic tool specifically designed for the recognition of text reuse in BIBLINDEX² [Gesche-Egyed-Zsigmond-Calabretto, 2017]. This time we intend to retain data from all the stages leading to the statement of a text reuse: identification, close delimitation, characterization.
To allow access to the corpus by canonical biblical reference, patristic author's name or work title on the one hand, and access to the text of a reuse in its context on the other, the project architecture has two main components: a MySQL³ database and a collection of texts encoded in TEI-XML. This dichotomy reflects the history of the corpus construction; we are now considering the option of a native XML database.
All the metadata about the authors (dates, biography, additional documents, bibliography, etc.) and about the works (date of composition, topics, summary, etc.) are stored in the MySQL database. In order to make them easily available for common use as Linked Open Data, links to the BnF (Bibliothèque nationale de France) authorities are foreseen where they exist⁴, but currently this only concerns about a third of the works in our corpus⁵. Each text reuse corresponds to a record in an association table between a biblical passage and a passage from a patristic text. The data is entered using a list of abbreviations; the start and end line numbers can be stated. Moreover, to allow the use of several biblical referencing systems beyond the Bible of Jerusalem, as has been the case until now, biblical texts written in different ancient languages (Hebrew, Greek, Latin, Syriac), as well as modern translations in English and French, were added to the database and closely aligned to one another. This additional biblical data was segmented at the verse level and sometimes at the level of parts of verses. As a result of this work, a multilingual concordance between Bibles was added to the BIBLINDEX platform and is freely searchable online [http5]. The biblical text itself is now also stored in the MySQL database, which makes it easy to display the verses sought by the user.
The patristic texts are stored as TEI-XML files, which will not be available online in their entirety because of complex copyright issues, but will be freely searchable⁶. At present there are only a few test samples, namely three complete works written by Bernard of Clairvaux⁷ [Bernard of Clairvaux, 1990, 2009, 2010]. Each work is stored in a separate file. The TEI Header of each file contains the unique identifier of the work, which is also recorded in the database. This link enables all the relevant metadata for a work to be extracted from the database and automatically re-injected into the relevant XML file. All information regarding the basic structure of the texts is embedded in the XML files: the logical structure specific to a work (book, chapter, section, paragraph numbers, ...) is marked up with the corresponding TEI tags; the physical structure specific to the reference edition (page and line numbers) is expressed with milestone elements (empty tags marking a boundary point).

Encoding scheme
Our encoding scheme follows the TEI guidelines. Each patristic work is encoded in a separate file which contains the following key details. The TEI Header title statement <fileDesc/titleStmt> is used to encode the full and abbreviated title of the work and the name of the author (including a link to the BIBLINDEX page for the author). The <body> part consists of a single element <div>. Each <div type="work"> element can be divided into chapters, <div type="chapter"> elements. The chapters contain a mandatory title and one or more paragraphs.
4 BIBLINDEX also aims at being part of a shared project among the patristic community: building a common and freely available referencing system for authors and works, based on the model provided by the Clavis Patrum. As the BnF, in the context of the Biblissima project [http4], has already begun to classify ancient authors and makes this data available online with persistent IDs, it seems more useful to help improve this existing database than to build an independent one.
5 A large part of the works we have to deal with are fragments, published in reviews as single articles, and/or whose authors are unknown; for the time being, the BnF mainly identifies entire works, published in books, whose authorship is known with reasonable certainty.
Several authors may be associated with a work, according to the certainty of the attribution; the works have different kinds of relationships with each other. The structure of the database makes it possible to create workgroups, to specifically manage the collections of fragments and the works that quote each other. Each work is linked to one or more reference editions, classified according to their date and relevance. The metadata on the 823 authors and 7,585 works [http6] gives some idea of the complexity of the data and their relations.
6 The internet user will be allowed to see about 5 lines of each text at a time, but texts will not be available for download as full files. As the original print editions of the patristic texts are copyrighted material, it is not possible for us to build from the outset a full open-access architecture comparable to that of the Leipzig Open Fragmentary Text Series (LOFTS) project [http7], which provides open editions of ancient texts linked to born-digital editions of fragmentary works. Nevertheless, our incipient collaboration with the TXM research team [http8] aims to follow such a model on a delimited test corpus.
7 Bernard of Clairvaux is indeed often referred to as the last Church Father. His work has been chosen as a test case, despite his late period of life compared to most other Fathers, because he "speaks Bible": the approximately 35,000 biblical reuses found in his work are really embedded in the course of his speech and provide a wealth of information to describe non-literal quotations. Besides, a great part of the biblical analysis on this specific corpus has already been carried out at Sources Chrétiennes.

Delimitation of the biblical targets
Biblical targets fall into one of two categories, depending on whether or not they are linked to precise verse numbers. The first category, described in [Mellerin, 2013:21-24], gathers vague allusions to larger biblical subcorpora such as Johannine writings or healing narratives, which cannot easily be characterized by verse numbers, as well as common biblical formulations, names of characters (e.g. the disciples of Christ, David) and global reminiscences of heterogeneous episodes (e.g. the life of the Hebrew people in the desert). The database contains several records of such subcorpora and named entities in dedicated tables, which will be expanded in the course of our work. In the TEI files, pointers are built from database keys: instead of targeting a canonical reference, they use the identifiers of these records, and/or the <persName> and <placeName> tags.
[Irenaeus, Adversus Haereses:II, 24, 4 (244, l. 145)]:
<rs type="bookgroup" key="PentateuchID">In quinque libris legem populo <persName key="MosesID">Moyses</persName> tradidit</rs>
It is the second, and by far the larger, category of text reuse that we will focus on in this paper. This category comprises the citations that can be linked to canonical references. At present, any reference to the biblical text is made to a whole verse or set of verses, defined by the name of the Bible edition used as the versification reference, the biblical book, the chapter number(s) and the verse number(s)⁸. This universal referencing system is convenient, insofar as it allows access to the data in BIBLINDEX using the verse numbering, but it cannot express the specific details when only a few words of a verse are actually targeted. This system therefore has to be refined when applied to text reuse detected in the XML files: it has to be capable of identifying text reuse at the word level.
We considered targeting only verse numbers and expressing the precise words concerned using the <app> and <rdg> tags, following the parallel segmentation method of a critical apparatus. In this system, the quoting Church Father's text would be treated as a witness to the biblical text. This solution has the benefit of exposing the variants clearly in a portion of a verse; but it also requires the insertion of the biblical text into the patristic text file and makes it difficult to take word order into account. It therefore seems preferable to create pointers that get down to the word level, expressed in the biblical URI itself.
The canonical expression of biblical references has been designed accordingly. Each URI constructed targets a verse in the Bible parser of the BIBLINDEX website, allowing the visualization of the text in its biblical context. The second example above is represented in the XML as follows⁹:
<seg type="bRef" xml:id="T01">
  <bibl type="biblical">
    <ref cRef="Vg:Mc:12:8:6-8">Mc 12, 8</ref>
    <ptr target="http://www.biblindex.info/en/biblical/content/ref/Vg_Mc_12_8:6-8" targetLang="lat"/>
  </bibl>
</seg>
Clicking the link expressed in the target opens the relevant page in the Bible parser, represented below in Figure 1. The detailed expression of these canonical references for more complex cases is being developed using semantically meaningful identifiers compliant with the "Canonical Text Service" (CTS)¹⁰: this allows a future reformulation of our URIs with CTS.
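The correspondence between the cRef value and the pointer target in the example above can be sketched in a few lines of Python. This is only an illustration inferred from the two URIs printed in this paper (Vg:Mc:12:8:6-8 and LXX:Ex:13:18:15-26); the names BiblicalRef, parse_cref and to_uri are hypothetical and do not describe the actual BIBLINDEX implementation. Only references of the form version:book:chapter:verse[:first-last] are handled.

```python
import re
from typing import NamedTuple, Optional


class BiblicalRef(NamedTuple):
    """Hypothetical container for a word-level canonical reference."""
    version: str               # e.g. "Vg", "LXX"
    book: str                  # abbreviated book title, e.g. "Mc"
    chapter: int
    verse: int
    first_word: Optional[int]  # None when the whole verse is targeted
    last_word: Optional[int]


# Assumed shape: version:book:chapter:verse, optionally followed by :n or :n-m
_CREF = re.compile(r"^(\w+):(\w+):(\d+):(\d+)(?::(\d+)(?:-(\d+))?)?$")


def parse_cref(cref: str) -> BiblicalRef:
    """Split a cRef string such as 'Vg:Mc:12:8:6-8' into its components."""
    m = _CREF.match(cref)
    if m is None:
        raise ValueError(f"unrecognized canonical reference: {cref!r}")
    version, book, chapter, verse, first, last = m.groups()
    first_word = int(first) if first else None
    last_word = int(last) if last else first_word
    return BiblicalRef(version, book, int(chapter), int(verse),
                       first_word, last_word)


def to_uri(ref: BiblicalRef, lang: str = "en") -> str:
    """Build a pointer target following the pattern of the printed examples."""
    base = f"http://www.biblindex.info/{lang}/biblical/content/ref/"
    target = f"{ref.version}_{ref.book}_{ref.chapter}_{ref.verse}"
    if ref.first_word is not None:
        target += f":{ref.first_word}"
        if ref.last_word != ref.first_word:
            target += f"-{ref.last_word}"
    return base + target


ref = parse_cref("Vg:Mc:12:8:6-8")
print(to_uri(ref))
```

Keeping the word range as a separate, optional component means that a verse-level reference and a word-level reference to the same verse share a common prefix, which is convenient when reuses have to be grouped by verse.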
This precise word-level targeting makes it possible to relate a single text reuse to several biblical sources. A further advantage is that a passage of a text written in one language may be related to a Bible written in another, which is not its usual reference Bible. [Jerome, Epistula 36 ad Damasum 12 (59, l. 17)], whose text is mostly related to the Vulgata, sometimes quotes the Septuagint¹¹.
<seg type="scripturalQ" xml:id="B01">
  <seg type="insertion">ubi Septuaginta posuerunt</seg> quinta autem generatione ascenderunt filii Israhel de terra Aegypti</seg>
<note type="scripturalNote" subtype="translation_latin">
  <seg type="bRef">
    <bibl type="biblical">
      <ref cRef="LXX:Ex:13:18:15-26"/>
      <ptr target="http://www.biblindex.info/fr/biblical/content/ref/LXX_Ex_13_18:15-26"/>
    </bibl>
  </seg>
</note>
In addition to the reference Bibles stored in the BIBLINDEX database, "Bible" entities under construction can be targeted. In Greek, for example, the translation made from the Hebrew by Aquila (2nd c. AD) is assumed to be a work on the whole Bible, but has to be reconstructed from fragments. If reuse of Aquila's text is identified, it will be related to a virtual target, prefixed AQ instead of LXX, with a verse number equivalent to the LXX verse number. The entity "Aquila's version" will be gradually reconstructed a posteriori by compiling the discovered instances of text reuse. Similarly, in the Latin tradition, each time an author quotes the Vetus Latina and not a text form similar to that of the Vulgate, a record in the entity "Vetus Latina" will be created. Table 1 lists further examples of (partial) biblical versions we have already identified in our corpora which could be (partially) reconstructed using our text reuse discoveries.

Description of the textual reality expressed by the canonical reference
This information regarding biblical verses and words described in the previous section will be supplemented by specifying the nature of the relationship between the text reuse and the target.

Delimitation of the patristic text reuse
"The text reuse object considered is a text unit defined as a part of a patristic text in which an author refers, explicitly or implicitly, through quotation, mention or reminiscence, to one or more parts of the biblical text. Multiple biblical references may be found in a single unit. In each text unit, the author's thought and the biblical contents are nested together in such a way that they are often impossible to separate. This is why a unit is not constituted by the exact words of the Bible which also appear in the Fathers' text, but is defined as the coincidence between a patristic sense unit and a biblical one. There may be no common words between the text unit and the biblical referential, for example in the case of a complete paraphrase of the biblical text. A single word, if relevant, can be a biblical reference and a unit." [Allenbach, 1975:20-21].

Delimitation of the quotation
We first had to decide whether empty elements serving as boundary delimiters (milestones) or container elements (block or segment elements) should be used to indicate the beginning and end of each text reuse. The use of milestones and generic anchors avoids problems of overlapping hierarchies in the encoding; but non-empty elements are better suited to describing a segment of text precisely with the help of attributes and types.
The following examples, taken from [Bernard of Clairvaux, De Diligendo Deo], show what the encoding would look like using each of these two methods.
The first example highlights a case of overlapping reuses.
The second example presents a case of a quotation interrupted by an insertion.
[Bernard of Clairvaux, De diligendo Deo 31:138,
ANCHORS
<anchor xml:id="anchor1"/>Comedite<anchor xml:id="anchor2"/> inquit <anchor xml:id="anchor3"/>amici et bibite et inebriamini carissimi<anchor xml:id="anchor4"/>
<span from="#anchor1" to="#anchor2" xml:id="seg1a"/>
<span from="#anchor3" to="#anchor4" xml:id="seg1b"/>
<join target="#seg1a #seg1b" result="seg" type="biblical_occurrence" xml:id="seg1" source="#Vg:Ct:5:36-41"/>
SEG TAGS
<seg type="scripturalQ" source="#Vg:Ct:5:36-41">Comedite <seg type="insertion">inquit</seg> amici et bibite et inebriamini carissimi</seg>
This time, the encoding using anchors is cumbersome; the encoding using <seg> elements is faster and more intuitive. As it is clearly better for simple cases and interrupted text reuses, and only as complex as the encoding with milestones in cases of overlaps, the encoding with <seg> tags has been chosen and applied to the whole corpus.
The semantically neutral <seg> tag was retained rather than the quote-specific tags proposed by the TEI, <q>, <cit> and <quote>: <q>, as a graphical marker, is not relevant to the ancient texts in general; <cit> assumes that the source is explicitly mentioned, which is rarely the case in our corpus; <quote> assumes the explicit intention to quote, which is also irrelevant in most cases. Moreover, we have left aside the tags specifying the status of the enunciation, <said>, <mentioned>, <soCalled>, because these topics find their place in the commentary notes.
In its simplest case, the reuse is composed exclusively of words quoted exactly, from a single biblical passage and without any discursive interruption. However, such text reuse rarely occurs in our corpus; more often, reuses are interrupted by inserted segments.
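One practical consequence of the container-based encoding is that the biblical words and the inserted words of a reuse can be separated with a simple tree traversal. The following Python sketch, using only the standard library, reads the <seg> version of the De diligendo Deo example; the function name split_reuse is hypothetical, and the snippet assumes a fragment without the TEI namespace.

```python
import xml.etree.ElementTree as ET

# The <seg> encoding of the example, as printed in this paper.
SAMPLE = (
    '<seg type="scripturalQ" source="#Vg:Ct:5:36-41">Comedite '
    '<seg type="insertion">inquit</seg> amici et bibite et '
    'inebriamini carissimi</seg>'
)


def split_reuse(seg):
    """Return (quoted_words, inserted_words) for one scripturalQ segment."""
    quoted, inserted = [], []
    if seg.text:
        quoted.extend(seg.text.split())
    for child in seg:
        words = "".join(child.itertext()).split()
        if child.tag == "seg" and child.get("type") == "insertion":
            inserted.extend(words)   # author's own words inside the reuse
        else:
            quoted.extend(words)     # any other child still counts as quoted text
        if child.tail:
            quoted.extend(child.tail.split())  # text following the child
    return quoted, inserted


quoted, inserted = split_reuse(ET.fromstring(SAMPLE))
print(quoted)
print(inserted)
```

The same result could be obtained from the anchor-based encoding, but only after resolving the from/to pointers of each <span> and the targets of the <join>, which illustrates why that encoding is heavier to process as well as to write.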

Delimitation of the inserted segments
A text reuse may include introductory texts, binding texts and paraphrased sections, which, while part of the text reuse from the author's point of view, do not refer to the biblical target. The content that directly refers to the biblical target must be isolated, and the other parts may be annotated with specific attributes. The next example (which is the same text reuse instance as the previous example) presents an instance of text reuse that includes an explicit insertion formula¹⁷ mixed with the biblical text quoted.
The text parts where a biblical text reuse is observed are marked with the element <seg>. This element includes the words borrowed from the Bible, but also introductory formulas and possible insertions. However, this system of sub-segmentation is not sufficient in the most complex cases, in particular when text reuses containing references to several biblical texts and with various insertions must be handled. Let us take as an example [Jerome, Prologue to Samuel-Kings]. The introduction formula is common to all texts, and the terms taken from the book of Exodus are nested and each present in several verses. The encoding of such a text by segmentation and sub-segmentation is an insoluble puzzle. Due to the frequency of such complex cases in our corpus, we have decided to generalize the encoding by tokenization of words; it makes the management of simple cases more complicated but is essential for complex ones. A preliminary step of processing the text files has been planned. The files will be imported into TXM [http8] associated with TreeTagger [http10] to automate the preliminary marking of each word. This phase will also include lemmatization.
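To give an idea of what word-level tokenization produces, the following Python sketch wraps each word of a passage in a TEI <w> element carrying its own xml:id, so that later annotations can point at individual words. It is a simplified stand-in written for this illustration, not the TXM/TreeTagger pipeline itself: the function name tokenize, the "w" identifier prefix and the neutral <seg> wrapper are assumptions, and no lemmatization is performed.

```python
import re
import xml.etree.ElementTree as ET

# Qualified name of the xml:id attribute in ElementTree notation.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def tokenize(text, prefix="w"):
    """Wrap every word of `text` in a <w> element with a sequential xml:id."""
    seg = ET.Element("seg")
    for n, word in enumerate(re.findall(r"\w+", text), start=1):
        w = ET.SubElement(seg, "w", {XML_ID: f"{prefix}{n}"})
        w.text = word
        w.tail = " "  # keep a readable space between tokens when serialized
    return seg


tokens = tokenize("Comedite inquit amici et bibite et inebriamini carissimi")
for w in tokens:
    print(w.get(XML_ID), w.text)
```

Once every word has an identifier, a reuse no longer needs to be a contiguous container: it can be described as a list of word identifiers, which is what makes nested and interleaved references tractable.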

Characterization of the patristic text reuse
A text reuse is partially characterized by the segment containing the quoted biblical text. Thus, we can specify whether it is a lemma, that is to say a biblical section specifically commented on in a commentary, a homily or a biblical chain (catena), or a reuse of this lemma in the text.
Quotation of Ap:1:9 in the Letter of the Gallic Martyrs to Pope Eleutherios, itself quoted by [Eusebius of Caesarea, Historia Ecclesiastica V, 4, 2:28, l. 5]:
<cit>
  <quote> (…) <seg type="scripturalQ" source="#LXX:Ap:1:9:3-7">τὸν ἀδελφὸν ἡμῶν καὶ κοινωνόν</seg> (…) </quote>
  <ref cRef="ID Bx.work">Gallic Martyrs, Letter to Eleutherios</ref>
</cit>
Most of the characterization elements, however, will be provided by the analysis of the segments inserted in the text reuse, mentioned in the previous section (2.1.2). As BIBLINDEX aims to remain as neutral as possible, and rather provide the user with materials for his/her own hermeneutic path, we decided not to include degrees of intentionality, even though they are at the heart of the text reuse ontologies commonly used (CITO, LAWDI, etc.): each analyst can indicate his/her own interpretation of the author's act of quoting in the commentary notes.
Among the formulas used to insert the biblical text in the reuse context, we have distinguished between neutral insertions (e.g. one reads, he says), and explicit ones, which show in one way or another that the author is aware that his text is borrowed from Scripture. The latter may be general mentions (e.g. according to the Scriptures) or more explicit attributions such as a more or less precise author's name (the Prophet, the Apostle, Job, etc.) or other clues, sometimes even canonical references. The quotations that are explicitly introduced can be categorised as correct, false (in case of a mistake due to the author, for instance), agraphon (words attributed to Jesus without being in any gospel), or unknown, which covers words not attested in any biblical book but quoted as belonging to Scripture by the patristic author.

III. SUMMARY OF THE SELECTED ENCODING
The following is a complete encoding scheme for all these descriptive elements.

Special case of overlapping quotations
In some cases of overlapping text reuse the <join> element may be used to link two discontinuous <seg> elements that form a single text reuse.

Special case of quotation reuse
When text reuse is repeated, it will be specified in the <span> element using the values of the 'ana' attribute and specifying the 'sameAs' attribute.

Scriptural notes
Each text reuse is marked with a scriptural note (each <seg type="scripturalQ"> is followed by an element <note type="scripturalNote">). The <note type="scripturalNote"> element consists of a <seg type="bRef"> element containing the canonical reference and a <link> element connecting the quoting patristic text and the quoted biblical text and specifying the nature of that relationship. There are as many <seg> elements associated with a <link> element as there are canonical biblical references associated with the words encoded. The <ref> element contained in the <bibl> element gives the canonical reference in written form. The <link> element establishes the relationship between a quoting text segment and one or more quoted biblical passages. It also specifies the nature of this relationship: exact or inaccurate, explicit or implicit quote.
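As an illustration of how these pieces fit together, the following Python sketch assembles a scriptural note with the standard library. The element names and the Mc 12, 8 values come from the examples printed in this paper; the function name scriptural_note and the identifier "S01" for the quoting <span> are hypothetical. The attribute that records the nature of the relationship (exact or inaccurate, explicit or implicit) is deliberately left out, since its name is not given in the text above.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:id attribute in ElementTree notation.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def scriptural_note(span_id, bref_id, cref, label, uri):
    """Build a <note type="scripturalNote"> for one canonical reference."""
    note = ET.Element("note", {"type": "scripturalNote"})
    seg = ET.SubElement(note, "seg", {"type": "bRef", XML_ID: bref_id})
    bibl = ET.SubElement(seg, "bibl", {"type": "biblical"})
    ref = ET.SubElement(bibl, "ref", {"cRef": cref})
    ref.text = label                      # canonical reference in written form
    ET.SubElement(bibl, "ptr", {"target": uri})
    # The link pairs the quoting words with the quoted biblical text.
    ET.SubElement(note, "link", {"target": f"#{span_id} #{bref_id}"})
    return note


note = scriptural_note(
    span_id="S01",   # hypothetical xml:id of the quoting <span>
    bref_id="T01",
    cref="Vg:Mc:12:8:6-8",
    label="Mc 12, 8",
    uri="http://www.biblindex.info/en/biblical/content/ref/Vg_Mc_12_8:6-8",
)
print(ET.tostring(note, encoding="unicode"))
```

A reuse that draws on several biblical passages would simply receive one <seg type="bRef"> and one <link> per passage within the same note.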

Conclusion
Trying to take into account every possible scenario, we have defined the markup scheme given above. It is currently implemented on our sample corpus of Bernard of Clairvaux's works. The encoding of other texts in real production conditions will allow us to clarify and refine this scheme. Thanks to this markup, all the queries we have thought of are made possible. For instance, one can search for all reuses of a specific verse or set of verses in a defined patristic corpus; for all the quotations characterized by an explicit introduction; or for the verses quoted near a specific verse. It is also possible to find a specific lemma in different verses, even when text reuses are embedded within each other. The different ways in which a specific author inserts a biblical passage in the course of his speech can be studied, and several authors can be compared, etc. The next step is to design the search interfaces, in order to express all of these possible queries in the most user-friendly way possible. In the meantime, any feedback from you, the readers, is welcome: please send a message to biblindex.sc@mom.fr if you have any comments or ideas for improvement.
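The first of these queries, retrieving all reuses of a given verse, can be sketched as follows in Python. The sample fragment combines two segments printed earlier in this paper purely for demonstration; the function name reuses_of is hypothetical, the fragment is assumed to carry no TEI namespace, and matching is done on the value of the source attribute as it appears in those examples.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: two reuses from different works placed side by side.
SAMPLE = """<div type="chapter">
  <p><seg type="scripturalQ" source="#LXX:Ap:1:9:3-7">τὸν ἀδελφὸν ἡμῶν καὶ κοινωνόν</seg></p>
  <p><seg type="scripturalQ" source="#Vg:Ct:5:36-41">Comedite
    <seg type="insertion">inquit</seg> amici</seg></p>
</div>"""


def reuses_of(root, reference):
    """Return the text of every scripturalQ segment pointing at `reference`.

    `reference` is a prefix such as "LXX:Ap:1:9"; word ranges are ignored.
    """
    prefix = "#" + reference
    hits = []
    for seg in root.iter("seg"):
        if seg.get("type") != "scripturalQ":
            continue
        source = seg.get("source", "")
        # Require a full component match so that 1:9 does not match 1:90.
        if source == prefix or source.startswith(prefix + ":"):
            hits.append(" ".join("".join(seg.itertext()).split()))
    return hits


root = ET.fromstring(SAMPLE)
print(reuses_of(root, "LXX:Ap:1:9"))
```

Queries on insertion types or on neighbouring verses would follow the same pattern, filtering on other attributes or on the position of the matching segments within the document.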

Table 2. Element <seg>. This element <seg type="scripturalQ"> contains an element <span> establishing words related to the biblical text.

Table 7. First element <seg> in case of disjunction. Encoding of the second text reuse segment.

Table 8. Second element <seg> in case of disjunction.

Table 14. Canonical reference (<ref>): of the Bible]:[abbreviated title of the book]:[chapter number].[verse number]:[ordering number of words in the verse]

Table 15. Canonical reference (<ptr>). The <ptr> element contains the URN of the biblical target in BIBLINDEX. The <note> element indicates the source of the biblical text quoted.

Table 17. Canonical reference (<link>): :id of the <span> element giving the words of the quoting text] #[xml:id of the <seg> element defining the biblical text quoted]