Special Issue on Computer-Aided Processing of Intertextuality in Ancient Languages


1. Measuring and Mapping Intergeneric Allusion in Latin Poetry using Tesserae

Patrick J. Burns.
Most intertextuality in classical poetry is unmarked, that is, it lacks objective signposts to make readers aware of the presence of references to existing texts. Intergeneric relationships can pose a particular problem as scholarship has long privileged intertextual relationships between works of the same genre. This paper treats the influence of Latin love elegy on Lucan’s epic poem, Bellum Civile, by looking at two features of unmarked intertextuality: frequency and distribution. I use the Tesserae project to generate a dataset of potential intertexts between Lucan’s epic and the elegies of Tibullus, Propertius, and Ovid, which are then aggregated and mapped in Lucan’s text. This study draws two conclusions: 1. measurement of intertextual frequency shows that the elegists contribute fewer intertexts than, for example, another epic poem (Virgil’s Aeneid), though far more than the scholarly record on elegiac influence in Lucan would suggest; and 2. mapping the distribution of intertexts confirms previous scholarship on the influence of elegy on the Bellum Civile by showing concentrations of matches, for example, in Pompey and Cornelia’s meeting before Pharsalus (5.722-815) or during the affair between Caesar and Cleopatra (10.53-106). By looking at both frequency and proportion, we can demonstrate systematically the generic enrichment of Lucan’s Bellum Civile with respect to Latin love elegy.
Section: Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities

2. Editing New Testament Arabic Manuscripts in a TEI-base: fostering close reading in Digital Humanities

Claire Clivaz ; Sara Schulthess ; Martial Sankar.
If one is convinced that "quantitative research provides data not interpretation" [Moretti, 2005, 9], close reading should thus be considered as not only the necessary bridge between big data and interpretation but also the core duty of the Humanities. To test its potential in a neglected field – the Arabic manuscripts of the Letters of Paul of Tarsus – an enhanced, digital edition has been in development as a progression of a Swiss National Fund project. This short paper presents the development of this edition and perspectives regarding a second project. Based on the Edition Visualization Technology tool, the digital edition provides a transcription of the Arabic text, a standardized and vocalized version, as well as a French translation, with all texts encoded in TEI XML. Thanks to another Swiss National Foundation subsidy, a new research project on the unique trilingual (Greek-Latin-Arabic) New Testament manuscript, Marciana Library Gr. Z. 11 (379), 12th century, is currently underway. This project includes new features such as "Textlink", "Hotspot" and notes: HumaReC.
Section: Project presentations

3. Visualizing linguistic variation in a network of Latin documents and scribes

Timo Korkiakangas ; Matti Lassila.
This article explores whether and how network visualization can benefit philological and historical-linguistic study. This is illustrated with a corpus-based investigation of scribes' language use in a lemmatized and morphologically annotated corpus of documentary Latin (Late Latin Charter Treebank, LLCT2). We extract four continuous linguistic variables from LLCT2 and utilize a gradient colour palette in Gephi to visualize the variable values as node attributes in a trimodal network which consists of the documents, writers, and writing locations underlying the same corpus. We call this network the "LLCT2 network". The geographical coordinates of the location nodes form an approximate map, which allows for drawing geographical conclusions. The linguistic variables are examined both separately and as a sum variable, and the visualizations are presented as static images and as interactive Sigma.js visualizations. The variables represent different domains of language competence of scribes who learnt written Latin practically as a second language. The results show that the network visualization of linguistic features helps in observing patterns which support linguistic-philological argumentation and which risk passing unnoticed with traditional methods. However, the approach is subject to the same limitations as all visualization techniques: the human eye can only perceive a certain, relatively small amount of information at a time.
Section: Visualisation of intertextuality and text reuse

4. Text Alignment in Ancient Greek and Georgian: A Case-Study on the First Homily of Gregory of Nazianzus

Tamara Pataridze ; Bastien Kindt.
This paper discusses the word level alignment of lemmatised bitext consisting of the Oratio I of Gregory of Nazianzus in its Greek model and Georgian translation. This study shows how the direct and empirical observations offered by an aligned text enable an accurate analysis of techniques of translation and many philological parameters of the text.

5. Version Variation Visualization (VVV): Case Studies on the Hebrew Haggadah in English

Tom Cheesman ; Avraham Roos.
The ‘Version Variation Visualization’ project has developed online tools to support comparative, algorithm-assisted investigations of a corpus of multiple versions of a text, e.g. variants, translations, adaptations (Cheesman, 2015, 2016; Cheesman et al., 2012, 2012-13, 2016; Thiel, 2014; links: www.tinyurl.com/vvvex). A segmenting and aligning tool allows users to 1) define arbitrary segment types, 2) define arbitrary text chunks as segments, and 3) align segments between a ‘base text’ (a version of the ‘original’ or translated text), and versions of it. The alignment tool can automatically align recurrent defined segment types in sequence. Several visual interfaces in the prototype installation enable exploratory access to parallel versions, to comparative visual representations of versions’ alignment with the base text, and to the base text visually annotated by an algorithmic analysis of variation among versions of segments. Data can be filtered, viewed and exported in diverse ways. Many more modes of access and analysis can be envisaged. The tool is language neutral. Experiments so far mostly use modern texts: German Shakespeare translations. Roos is working on a collection of approx. 100 distinct English-language translations of a Hebrew text with ancient Hebrew and Aramaic passages: the Haggadah (Roos, 2015).
Section: Visualisation of intertextuality and text reuse

6. A Hackathon for Classical Tibetan

Orna Almogi ; Lena Dankin ; Nachum Dershowitz ; Lior Wolf.
We describe the course of a hackathon dedicated to the development of linguistic tools for Tibetan Buddhist studies. Over a period of five days, a group of seventeen scholars, scientists, and students developed and compared algorithms for intertextual alignment and text classification, along with some basic language tools, including a stemmer and word segmenter.
Section: Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities

7. A Classification of Manuscripts Based on A New Quantitative Method. The Old Latin Witnesses of John's Gospel as Text Case

David Pastorelli.
A new method for grouping manuscripts in clusters is presented with the calculation of distances between readings, then between witnesses. A classification algorithm ("Hierarchical Ascendant Clustering"), achieved through computer-aided processing, enables the construction of trees illustrating the textual taxonomy obtained. This method is applied to the Old Latin witnesses of the Gospel of John, and, in order to provide a study of a reasonable size, to a chapter as a whole (chapter 14). The result basically confirms the text-types identified by Bonifatius Fischer, founder of the Vetus Latina Institute, while it invalidates the classification adopted by the current edition of the Vetus Latina of the Gospel of John.
Section: Managing different types of text re-uses
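
The two steps named in the abstract above (distances between witnesses, then bottom-up clustering) can be sketched as follows. This is only an illustration under stated assumptions: the distance used here is the plain share of disagreeing variation units, the linkage is average linkage, and the witness labels W1 to W4 and their readings are invented toy data. None of these choices is taken from the paper itself.

```python
from itertools import combinations


def witness_distance(a, b):
    """Share of variation units where two witnesses disagree (0.0 to 1.0)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_distance(c1, c2, readings):
    """Average-linkage distance between two clusters of witness labels."""
    total = sum(witness_distance(readings[a], readings[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(readings):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [(name,) for name in sorted(readings)]
    history = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], readings),
        )
        history.append((c1, c2, cluster_distance(c1, c2, readings)))
        merged = tuple(sorted(c1 + c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return history


# Invented toy data: four hypothetical witnesses, five variation units,
# each reading coded as a single letter.
toy = {
    "W1": "aabab",
    "W2": "aabaa",
    "W3": "bbabb",
    "W4": "bbaba",
}

for left, right, distance in agglomerate(toy):
    print(left, right, round(distance, 2))
```

The merge history is the information a dendrogram draws: each entry records which two groups were joined and at what distance.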

8. Processing Tools for Greek and Other Languages of the Christian Middle East

Bastien Kindt.
This paper presents some computer tools and linguistic resources of the GREgORI project. These developments allow automated processing of texts written in the main languages of the Christian Middle East, such as Greek, Arabic, Syriac, Armenian and Georgian. The main goal is to provide scholars with tools (lemmatized indexes and concordances) making corpus-based linguistic information available. It focuses on the questions of text processing, lemmatization, information retrieval, and bitext alignment.
Section: Project presentations

9. Recurrent Pattern Modelling in a Corpus of Armenian Manuscript Colophons

Emmanuel Van Elverdinghe.
Colophons of Armenian manuscripts are replete with yet untapped riches. Formulae are not the least among them: these recurrent stereotypical patterns conceal many clues as to the schools and networks of production and diffusion of books in Armenian communities. This paper proposes a methodology for exploiting these sources, as elaborated in the framework of a PhD research project about Armenian colophon formulae. Firstly, the reader is briefly introduced to the corpus of Armenian colophons and then, to the purposes of our project. In the third place, we describe our methodology, relying on lemmatization and modelling of patterns into automata. Fourthly and finally, the whole process is illustrated by a basic case study, the occasion of which is taken to outline the kind of results that can be achieved by combining this methodology with a philologico-historical approach to colophons.
Section: Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities

10. Preprocessing Greek Papyri for Linguistic Annotation

Marja Vierros ; Erik Henriksson.
Greek documentary papyri form an important direct source for Ancient Greek. It has been exploited surprisingly little in Greek linguistics due to a lack of good tools for searching linguistic structures. This article presents a new tool and digital platform, “Sematia”, which enables transforming the digital texts available in TEI EpiDoc XML format to a format which can be morphologically and syntactically annotated (treebanked), and where the user can add new metadata concerning the text type, writer and handwriting of each act of writing. An important aspect in this process is to take into account the original surviving writing vs. the standardization of language and supplements made by the editors. This is performed by creating two different layers of the same text. The platform is in its early development phase. Ongoing and future developments, such as tagging linguistic variation phenomena as well as queries performed within Sematia, are discussed at the end of the article.
Section: Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities

11. From manuscript catalogues to a handbook of Syriac literature: Modeling an infrastructure for Syriaca.org

Nathan P. Gibson ; David A. Michelson ; Daniel L. Schwartz.
Despite increasing interest in Syriac studies and growing digital availability of Syriac texts, there is currently no up-to-date infrastructure for discovering, identifying, classifying, and referencing works of Syriac literature. The standard reference work (Baumstark's Geschichte) is over ninety years old, and the perhaps 20,000 Syriac manuscripts extant worldwide can be accessed only through disparate catalogues and databases. The present article proposes a tentative data model for Syriaca.org's New Handbook of Syriac Literature, an open-access digital publication that will serve as both an authority file for Syriac works and a guide to accessing their manuscript representations, editions, and translations. The authors hope that by publishing a draft data model they can receive feedback and incorporate suggestions into the next stage of the project.

12. Dealing with all types of quotations (and their parallels) in a closed corpus: The methodology of the Project The literary tradition in the third and fourth centuries CE: Grammarians, rhetoricians and sophists as sources of Graeco-Roman literature

Lucía Rodríguez-Noriega.
The Project The literary tradition in the third and fourth centuries CE: Grammarians, rhetoricians and sophists as sources of Graeco-Roman literature (FFI2014-52808-C2-1-P) aims to trace and classify all types of quotations, both explicit (with or without mention of the author and/or title) and hidden, in a corpus comprising the Greek grammarians, rhetoricians and "sophists" of the third and fourth centuries CE. At the same time, we try to detect whether or not these are first-hand quotations, and if our quoting authors (28 in all) are, in turn, secondary sources for the same citations in later authors. We also study the philological (textual) aspects of the quotations in their context, and the problems of limits they sometimes pose. Finally, we are interested in the function of the quotation in the citing work. This is the first time that such a comprehensive study of this corpus has been attempted. This paper explains our methodology, and how we store all these data in our electronic card-file.
Section: Project presentations

13. Bioinformatics and Classical Literary Study

Pramit Chaudhuri ; Joseph P. Dexter.
This paper describes the Quantitative Criticism Lab, a collaborative initiative between classicists, quantitative biologists, and computer scientists to apply ideas and methods drawn from the sciences to the study of literature. A core goal of the project is the use of computational biology, natural language processing, and machine learning techniques to investigate authorial style, intertextuality, and related phenomena of literary significance. As a case study in our approach, here we review the use of sequence alignment, a common technique in genomics and computational linguistics, to detect intertextuality in Latin literature. Sequence alignment is distinguished by its ability to find inexact verbal similarities, which makes it ideal for identifying phonetic echoes in large corpora of Latin texts. Although especially suited to Latin, sequence alignment in principle can be extended to many other languages.
Section: Project presentations
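
To make the idea of inexact matching by sequence alignment concrete, here is a minimal sketch of a local alignment score over characters. It assumes the Smith-Waterman scheme with arbitrary scoring values (match 2, mismatch -1, gap -1); the abstract does not say which alignment algorithm or scores the Quantitative Criticism Lab uses, so everything below is illustrative.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman score of the best local alignment between two strings."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ch_a == ch_b else mismatch)
            # A cell never drops below zero, so an alignment can restart anywhere.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# Two spellings of the same phrase still score highly despite one differing letter.
print(local_alignment_score("arma virumque", "arma uirumque"))
```

Because a single mismatch only lowers the score slightly instead of breaking the match, near-identical strings remain detectable, which is the property the abstract highlights.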

14. QuotationFinder - Searching for Quotations and Allusions in Greek and Latin Texts and Establishing the Degree to Which a Quotation or Allusion Matches Its Source

Luc Herren.
The software programs generally used with the TLG (Thesaurus Linguae Graecae) and the CLCLT (CETEDOC Library of Christian Latin Texts) CD-ROMs are not well suited for finding quotations and allusions. QuotationFinder uses more sophisticated criteria as it ranks search results based on how closely they match the source text, listing search results with literal quotations first and loose verbal parallels last.
Section: Project presentations
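
Ranking candidate passages by closeness to a source text, as described above, might look like the following sketch. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; QuotationFinder's actual criteria are not given in the abstract, and the function name `rank_by_closeness` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def rank_by_closeness(source, candidates):
    """Return (similarity, candidate) pairs, closest match to the source first."""
    scored = [(SequenceMatcher(None, source, c).ratio(), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


source = "in principio erat verbum"
candidates = [
    "principium verbi",
    "in principio erat verbum",
    "verbum erat in principio",
]

for similarity, text in rank_by_closeness(source, candidates):
    print(f"{similarity:.2f}  {text}")
```

A literal quotation gets the maximum ratio of 1.0 and therefore comes first, with looser parallels following in descending order of similarity.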

15. Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning

Mike Kestemont ; Jeroen De Gussem.
In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization. These are both basic, yet foundational preprocessing steps in applications such as text re-use detection. Nevertheless, they are generally complicated by the considerable orthographic variation which is typical of medieval Latin. In Digital Classics, these tasks are traditionally solved in a (i) cascaded and (ii) lexicon-dependent fashion. For example, a lexicon is used to generate all the potential lemma-tag pairs for a token, and next, a context-aware PoS-tagger is used to select the most appropriate tag-lemma pair. Apart from the problems with out-of-lexicon items, error percolation is a major downside of such approaches. In this paper we explore the possibility of solving these tasks elegantly using a single, integrated approach. For this, we make use of a layered neural network architecture from the field of deep representation learning.
Section: Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities

16. Computer-Assisted Processing of Intertextuality in Ancient Languages

Mark Hedges ; Anna Jordanous ; K. Faith Lawrence ; Charlotte Roueché ; Charlotte Tupman.
The production of digital critical editions of texts using TEI is now a widely-adopted procedure within digital humanities. The work described in this paper extends this approach to the publication of gnomologia (anthologies of wise sayings), which formed a widespread literary genre in many cultures of the medieval Mediterranean. These texts are challenging because they were rarely copied straightforwardly; rather, sayings were selected, reorganised, modified or re-attributed between manuscripts, resulting in a highly interconnected corpus for which a standard approach to digital publication is insufficient. Focusing on Greek and Arabic collections, we address this challenge using semantic web techniques to create an ecosystem of texts, relationships and annotations, and consider a new model – organic, collaborative, interconnected, and open-ended – of what constitutes an edition. This semantic web-based approach allows scholars to add their own materials and annotations to the network of information and to explore the conceptual networks that arise from these interconnected sayings.
Section: Project presentations

17. TEI-encoding of text reuses in the BIBLINDEX Project

Elysabeth Hue-Gay ; Laurence Mellerin ; Emmanuelle Morlock.
This paper discusses markup strategies for the identification and description of text reuses in a corpus of patristic texts related to the BIBLINDEX Project, an online index of biblical references in Early Christian Literature. In addition to the development of a database that can be queried by canonical biblical or patristic references, a sample corpus of patristic texts has been encoded following the guidelines of the TEI (Text Encoding Initiative), in order to provide direct access to quoted and quoting text passages to the users of the https://www.biblindex.info platform.
Section: Managing different types of text re-uses

18. Digital Greek Patristic Catena (DGPC). A brief presentation

Athanasios Paparnakis ; Constantinos Domouchtsis.
The project is to develop a database, which is planned to include all available information on the use of the Bible in the patristic works of Migne's Patrologia Graeca. Utilization of the data will be available through a web page equipped with necessary tools for developing data mining techniques and other methods of analysis. The main aim of the project is to revive the catenae, the ancient exegetical tool for biblical interpretation.
Section: Project presentations

19. Interactive Tools and Tasks for the Hebrew Bible: From Language Learning to Textual Criticism

Nicolai Winther-Nielsen.
This contribution to a special issue on “Computer-aided processing of intertextuality” in ancient texts will illustrate how using digital tools to interact with the Hebrew Bible offers promising new perspectives for visualizing the texts and for performing tasks in education and research. This contribution explores how the corpus of the Hebrew Bible created and maintained by the Eep Talstra Centre for Bible and Computer can support new methods for modern knowledge workers within the field of digital humanities and theology, how these methods can be applied to ancient texts, and how this can be envisioned as a new field of digital intertextuality. The article first describes how the corpus was used to develop the Bible Online Learner as a persuasive technology to enhance language learning with, in, and around a database that acts as the engine driving interactive tasks for learners. Intertextuality in this case is a matter of active exploration and ongoing practice. Furthermore, interactive corpus-technology has an important bearing on the task of textual criticism as a specialized area of research that depends increasingly on the availability of digital resources. Commercial solutions developed by software companies like Logos and Accordance offer a market-based intertextuality defined by the production of advanced digital resources for scholars and students as useful alternatives to often inaccessible and expensive printed versions. It is reasonable to expect that in the future interactive […]
Section: Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities

20. Intertextual Pointers in the Text Alignment Network

Joel Kalvesmaki.
The Text Alignment Network (TAN) is a suite of XML encoding formats intended to serve anyone who wishes to encode, exchange, and study multiple versions of texts (e.g., translations, paraphrases), and annotations on those texts (e.g., quotations, word-for-word correspondences). This article focuses on TAN’s innovative intertextual pointers, which, I argue, provide an unprecedented level of readability, interoperability, and semantic context. Because TAN is a new, experimental format, this article provides a brief introduction to the format and concludes with comments on progress and future prospects.
Section: Project presentations

21. Encoding (inter)textual insertions in Latin "grammatical commentary"

Bruno Bureau ; Christian Nicolas ; Ariane Pinche.
The ancient commentaries provide a large sample of quotations from classical or biblical texts for which Latin grammarians developed a complex system of insertion of quoted texts. The paper examines how to encode these places using TEI XML, and focuses on difficult cases, such as inaccurate quotations, or quotations of partly or wholly lost texts.
Section: Managing different types of text re-uses

22. Identification of Parallel Passages Across a Large Hebrew/Aramaic Corpus

Avi Shmidman ; Moshe Koppel ; Ely Porat.
We propose a method for efficiently finding all parallel passages in a large corpus, even if the passages are not quite identical due to rephrasing and orthographic variation. The key ideas are the representation of each word in the corpus by its two most infrequent letters, finding matched pairs of strings of four or five words that differ by at most one word and then identifying clusters of such matched pairs. Using this method, over 4600 parallel pairs of passages were identified in the Babylonian Talmud, a Hebrew-Aramaic corpus of over 1.8 million words, in just over 30 seconds. Empirical comparisons on sample data indicate that the coverage obtained by our method is essentially the same as that obtained using slow exhaustive methods.
Section: Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities
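
The first two ideas in the abstract above (reducing each word to its two rarest letters, then matching short word strings that differ by at most one word) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the alphabetical tie-breaking, the wildcard-masking trick for tolerating one differing word, the fixed window of four words, and the English sample text are all assumptions, and the final clustering step is left out.

```python
from collections import Counter, defaultdict
from itertools import combinations


def letter_frequencies(words):
    """Count how often each letter occurs across the whole corpus."""
    return Counter(ch for word in words for ch in word)


def word_key(word, freq):
    """Reduce a word to its two rarest letters (ties broken alphabetically)."""
    rarest = sorted(set(word), key=lambda ch: (freq[ch], ch))[:2]
    return "".join(sorted(rarest))


def matched_windows(words, n=4):
    """Return start-index pairs of n-word windows differing in at most one key."""
    freq = letter_frequencies(words)
    keys = [word_key(w, freq) for w in words]
    buckets = defaultdict(set)
    for i in range(len(keys) - n + 1):
        window = keys[i:i + n]
        # Mask each position in turn, so that windows differing in one word
        # still land in a shared bucket.
        for gap in range(n):
            masked = tuple(window[:gap] + ["*"] + window[gap + 1:])
            buckets[masked].add(i)
    pairs = set()
    for starts in buckets.values():
        for i, j in combinations(sorted(starts), 2):
            if j - i >= n:  # ignore overlapping windows
                pairs.add((i, j))
    return sorted(pairs)


text = ("in the beginning god created heaven xx yy zz "
        "in the beginning lord created heaven").split()
print(matched_windows(text))
```

Hashing the masked windows into buckets avoids comparing every window with every other one, which is one plausible way to keep such a search fast on a large corpus.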