ekdosis: Using LuaLaTeX for Producing TEI xml Compliant Critical Editions and Highlighting Parallel Writings

ekdosis is a LuaLaTeX package written by R. Alessi and designed for multilingual critical editions. It can be used to typeset texts and different layers of critical notes in any direction accepted by LuaTeX. Texts can be arranged in running paragraphs or on facing pages, in any number of columns which in turn can be synchronized or not. Database-driven encoding under LaTeX allows extraction of texts entered segment by segment according to various criteria: main edited text, variant readings, translations or annotated borrowings between texts. In addition to printed texts, ekdosis can convert .tex source files so as to produce TEI xml compliant critical editions. It will be published under the terms of the GNU General Public License (GPL) version 3.


I INTRODUCTION
The name of the software that is presented here, ekdosis, derives from a Greek action noun, ἔκδοσις, the meaning of which is "publishing a book", and also, in a concrete sense, "a publication, treatise". For us moderns, this term refers to a long tradition of scholarly work consisting in establishing from manuscript evidence the texts of Greek and Latin classics that were handed down through the Middle Ages to the time of the first printed editions. It is not quite sufficient to mention that no original of such texts survives: as Jean Irigoin vividly pointed out in the inaugural lecture that he delivered at the Collège de France in 1986, Sophocles witnessed the building of the Parthenon. As a result, what can still be seen or read from the remains preserved on the Acropolis, at the Louvre or the British Museum, is pretty much the same as what Sophocles could already see or read in his time. But the modern reader of the same Sophocles' preserved tragedies has to rely on manuscripts the most ancient of which was copied over a millennium and a half after Sophocles died.¹ Scientific textual criticism gradually emerged from the last years of the XVIIth century out of the rejection of the idea that any improvement of classical texts should be conducted on the basis of vulgate texts transmitted in printed books.² It must be noted that the first impulse came from the Textus receptus of the New Testament. Not only did scholars have to deal with a very large number of versions and scholia and countless variant readings, but in many respects they were facing unprecedented problems posed by the existence of early patristic citations and ancient translations in various languages, some of which even showed that certain passages of the received Greek text had been deliberately falsified.
Contrary to what one might think, the foregoing remarks are far from being unrelated to digital humanities. Whether they are printed or digital, modern critical editions always exhibit reconstructed texts, and both convey the legacy of those early scholarly attainments. The reconstructed texts come either from manuscripts under the title of the edited work (direct tradition) or from evidence of explicit citations or parallel passages preserved in other authors or from translations in other languages made on the basis of manuscripts which may or may not have survived (indirect tradition). Both kinds of editions come equipped with an apparatus criticus in which all the evidence that was used to build the edited text is mentioned. However, this is where the two kinds of editions, printed and digital, part ways. Surely, reading a traditional and well-formed apparatus criticus requires specific training and is only meant for experienced readers: all notes are grouped together in the form of a paragraph of its own, freely composed in Latin, at least when it comes to editing a classical text. The conventions are many, whether they concern Latin technical abbreviated terms, using lower-case Greek letters to indicate intermediate lost manuscripts and upper-case Roman letters to indicate preserved manuscripts, or using spaces for grouping the sigla into families and exponents for pointing to traceable corrections, where, to take an example, number 1 always refers to first-hand corrections, number 2 to a second hand and so forth. Another source of difficulty arises from the style of the apparatus: some are set out fully and display the words of the adopted text and the part of the tradition they belong to as well as the variant readings, while others only mention the deviant readings which are rejected by the editor.
Arguably, familiarizing oneself with these conventional rules is not unlike learning a language equipped with terms, grammar rules and style embellishments. This language came into existence out of over three centuries of tradition and cultural facts and is immediately accessible to the human mind's natural ability to use language and interpret conventional symbols. It may be true that reading an apparatus criticus requires training and effort. Nevertheless, this task is part of the curriculum and remains the natural, traditional way to go. But it is quite inaccessible to a computer, unless every item of information has been encoded in the rather dumb format that is suited to machines.
On the other hand, critical editions in print have their own limitations. For example, editors of classical texts are used to saving space by not reporting trivial mistakes committed by medieval scribes, and rightfully so. However, as for ancient authors who wrote in subdialects, the manuscripts may display forms that are not considered acceptable by modern linguistics even though such forms may go back to the archetype of the preserved witnesses. The question therefore arises as to which text was in circulation in ancient times. We may have grounds to normalize these texts in modern critical editions. But do we have the right to remove those allegedly 'faulty' variants from the apparatus to save space and ease reading, thus making them disappear from the tradition for ever?
Another limitation of editions in print comes from "fluid forms of transmission", as in the case of technical and popular literature.³ To take here just one example, we have a large number of works from Late Antiquity and the Middle Ages in which the primary intention of the scholars was to provide useful, practical or scientific information. Depending on the intellectual milieu or geographic area in which such texts were actually circulating, they were adapted to the needs of students, namely expanded, abridged or rephrased. Their content was also constantly updated as science progressed, including by their original authors themselves, which could lead to the existence of several recensions of the text in their lifetime. It goes without saying that putting such texts in print (and which ones exactly?) may become quite an intricate business, for it is much more about circulation and method of redaction than archetype and authorship.
One final limitation of editions in print is about the indirect tradition of transmitted texts.⁴ Strictly speaking, references to texts quoted by the author of the edited text do not constitute indirect tradition. But they may be collected into the critical notes as an additional layer called apparatus fontium. As for references to the edited text by other authors, they are collected in what is called an apparatus testium. It may happen that such quotations attest variant readings from the manuscripts of the direct tradition or even provide new variants of their own. These testimonia can be found in the same language as the edited text or in another language. To take a common example of such phenomena, many Greek books of technical literature were translated into Syriac and/or Arabic throughout Late Antiquity and the Middle Ages. Out of necessity, but also for obvious reasons of space available on the pages, editions in print give no access to indirect tradition or translations of the main text in other languages. After all, only deviant readings are of interest to establish the edited text.
The various limitations described above are damaging in more than one respect. First, every detail that the editor decided to discard, regardless of its relevance to the purpose of the edition, is lost permanently, as in the case of the dialectal coloring of ancient books. Second, passages collected as indirect tradition are only available as references in the apparatus testium. This may be acceptable for short passages quoted by other authors who wrote in the same language as the primary author. However, as translations cannot be compared to the original text, the reader is prevented from turning to the major ones to better understand difficult passages. But there is more to say in this respect. As a matter of fact, it is only natural that in many cases only original texts are considered worthy of interest. In comparison, translations are looked at as satellite texts. As a result, unless they are made by someone considered a prominent author in a different cultural area, translations are likely to gradually lose interest.
To conclude on these issues, print publications and digital editions are often contrasted as if they belonged to two different worlds.⁵ It is commonly said that the content of editions in print is the result of the binding of the book itself as an object, whereas digital editions, in which format and presentation are by definition separated from content, are free from limitations coming from such bindings. To sum up from the foregoing considerations, this statement needs to be qualified: as already seen above, the apparatus criticus must be looked at as a brilliant production of mind refined by centuries of scholarly tradition (and surely tradition must go on), arguably not as compact paragraphs that require special and painful training to be 'decoded'. On the other hand, what editions in print do not provide are what D.J. Mastronarde and R.J. Tarrant have called "actionable texts for use in digital research",⁶ namely database-driven texts allowing the reader to select annotations and display or arrange translations, parallel passages or borrowings in a variety of ways.
ekdosis can be seen as an attempt at combining the two approaches as will be illustrated by the following.

II USED PROGRAMMING LANGUAGES
At the time of writing, v1.0 of ekdosis, which is about to be released as a LuaLaTeX package under the terms of the GNU General Public License (GPL) version 3, is written mostly in TeX and Lua (see fig. 1 for details). The tradition established by great printed editions explains why LaTeX has been primarily preferred as a typesetting system, with the intent of reviving a tradition, now almost forgotten, in which the brilliance of the presentation was not dissociated from the quality of the academic work. It is worth recalling here that both TeX and Lua come from an academic source. Besides, LaTeX, contrary to xml, is a natural language designed for use in typesetting complex documents, including as many languages as one wishes to have printed in any writing direction, such as critical editions where footnotes and other kinds of annotations can be particularly abundant. Document processing through compilation is also an argument in favor of LaTeX, not to mention, as has just been said, that it is meant to serve complexity: in addition to fine PDF output, compilation can produce with no additional effort, from a single source file, various types of outputs, such as OpenDocument, html or TEI xml files.
By using Lua as an additional scripting language, notably for pattern matching and string handling functions, any portion of the code, as it is compiled by LaTeX, can be intercepted and passed on to Lua functions for further processing and then returned to LaTeX. Some functions are designed to support document processing through LaTeX while others act as a "LaTeX to TEI xml" converter, as will be demonstrated shortly by several examples. The main features of the software are as follows:
1. Multilingual critical editions: ekdosis can be used to typeset texts and various layers of critical notes⁸ in any direction accepted by LuaTeX, which makes it adapted to rare or lesser-known languages. Texts can be arranged in running paragraphs or on facing pages, in any number of columns which in turn can be synchronized or not. It is also suitable for complex layouts, such as Arabic poetry or images where three-way alignment is required,⁹ diagrams, etc.
2. Database-driven encoding under LaTeX, which allows extraction of texts entered segment by segment according to various criteria: main edited text, variant readings or translated texts and annotated borrowings between texts. From a given main text, whether it be critically edited or not, ekdosis can select, display and arrange in aligned columns other recensions or parallel texts.
3. Academic background: ekdosis comes in support of a seminar to be held from fall 2020 onwards about the scientific controversies raised by the many discussions about medical education which took place in the Greek, Arabic and Byzantine worlds.¹⁰ The texts to be discussed in the seminar are characterized by their complexity, as some of them are made out of several recensions, while most of them feature borrowings (either explicit or not), translations (from Greek into Arabic or vice versa), parallel writings and commentaries.
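The interception mechanism mentioned at the start of this section, in which LaTeX hands a piece of its input to a Lua function and receives the processed result back, can be sketched as follows. This is only a minimal illustration and is not taken from the ekdosis code: the macro \printsigla and the Lua function print_sigla are hypothetical names, and the sketch assumes nothing beyond LuaLaTeX and the standard luacode package.

```latex
% Compile with lualatex. Illustrative sketch, not ekdosis code.
\documentclass{article}
\usepackage{luacode}

\begin{luacode*}
-- Hypothetical helper: receives a comma-separated list of sigla from
-- LaTeX, drops the separators and hands the result back to TeX.
function print_sigla(list)
  -- gsub returns two values; the parentheses keep only the new string.
  local joined = (list:gsub("%s*,%s*", ""))
  tex.sprint(joined)
end
\end{luacode*}

% \luaescapestring protects quotes and backslashes in the argument
% before it is embedded in the Lua string literal.
\newcommand{\printsigla}[1]{%
  \directlua{print_sigla("\luaescapestring{#1}")}}

\begin{document}
Witnesses: \printsigla{A, M, B}  % typesets AMB
\end{document}
```

The same round trip (TeX argument, Lua string processing, text returned with tex.sprint) is what makes it possible, in principle, to feed one source both to the typesetter and to a converter.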

III MAIN FEATURES
These features will be illustrated here through an excerpt from the Latin text of Caesar's Gallic War (Book VI, 13.1) as it is read in the French edition of the Budé series.¹¹ As fig. 2 shows, since lines are not numbered, the notes in the apparatus refer to the sectional divisions of the edited text. That said, the apparatus criticus can be readily scrutinized in light of the […]
As can be seen, \DeclareWitness requires three mandatory arguments enclosed between curly braces, used to specify consecutively: the identifier of the witness to be used as xml:id,¹³ the rendition of the siglum to be used in the apparatus criticus and a short description used to build the conspectus siglorum to be printed at the forefront of the edited text.¹⁴ Finally, other items of information can be specified in a further optional argument enclosed between square brackets. The three mandatory arguments of \DeclareHand specify in turn: the unique xml:id to be used, the xml:id of the witness the hand is related to and lastly the rendition to be printed in the apparatus criticus. Finally, families of witnesses can be declared as shorthands:¹⁵
\DeclareShorthand{a}{α}{A,M,B,R,S,L,N}
where the first argument of \DeclareShorthand is the xml:id, the second its rendition to be used in the apparatus criticus and the third a comma-separated list of declared witnesses.
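Put together, the three declaration commands described above might appear in a preamble as in the following sketch. The argument order follows the description given here; the descriptions of the witnesses are placeholders rather than actual manuscript data, the optional argument of \DeclareWitness is left out, and the released package may differ in detail from what is assumed.

```latex
% Compile with lualatex. Descriptions are placeholders, not real data.
\documentclass{article}
\usepackage{ekdosis}

% \DeclareWitness{<xml:id>}{<rendition of the siglum>}{<description>}
\DeclareWitness{A}{A}{Placeholder description of witness A}
\DeclareWitness{M}{M}{Placeholder description of witness M}
\DeclareWitness{B}{B}{Placeholder description of witness B}
\DeclareWitness{R}{R}{Placeholder description of witness R}
\DeclareWitness{S}{S}{Placeholder description of witness S}
\DeclareWitness{L}{L}{Placeholder description of witness L}
\DeclareWitness{N}{N}{Placeholder description of witness N}

% \DeclareHand{<xml:id>}{<xml:id of the related witness>}{<rendition>}
% The exponent 1 conventionally marks a first-hand correction.
\DeclareHand{A1}{A}{A\textsuperscript{1}}

% \DeclareShorthand{<xml:id>}{<rendition>}{<declared witnesses>}
% Printing the rendition requires a font that provides Greek letters.
\DeclareShorthand{a}{α}{A,M,B,R,S,L,N}

\begin{document}
The edited text goes here.
\end{document}
```

Keeping the xml:id (first argument) distinct from the printed rendition (second argument) is what lets the same declaration serve both the apparatus in print and the TEI output.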
Once […] 4.
¹⁵ Another approach, which will not be described here for the sake of simplicity, is to assign the xml:id of the family as the container of the individual witnesses that are part of the family.
Close examination of the two commands \lem and \rdg that were used in listing III.1 shows that while some optional 'name-value' arguments such as wit have TEI equivalent attributes, others do not, such as alt (listing III.1, ll. 90, 93-4 and 101) and nordg (l. 95).¹⁶ The principle is that ekdosis uses some of them for PDF or TEI output only, and others for both outputs. Obviously, wit is one of the latter, although values are rendered differently in PDF and in TEI. On the other hand, alt is used to introduce an alternate way of inserting words in the apparatus criticus in print, while keeping safe what is to be found in TEI output. Moreover, as can be seen on lines 93-4, alt has been used to insert a subvariant in the lemma part, as a consequence of which nordg was required (l. 95) to remove from PDF output words that would have been otherwise redundant.
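The interplay of wit, alt and nordg described above can be illustrated by a short body fragment. It is a sketch under several assumptions: the \app wrapper around \lem and \rdg is the one known from the released package and is not named in this text; the identifiers a, b, A, M and A1 are assumed to have been declared in the preamble (b being a hypothetical second family); and the readings are given for illustration only and should not be relied on as a collation.

```latex
% Body fragment. Assumes A, M, A1, a and b are declared in the preamble.
In omni Gallia eorum hominum qui
\app{
  \lem[wit=a]{aliquo}
  % alt: abbreviated wording printed in the apparatus; the full
  % reading "in aliquo" is what is kept for the TEI output.
  \rdg[wit=b, alt=in al-]{in aliquo}
}
sunt numero atque honore genera sunt duo. Nam plebes paene seruorum
habetur loco, quae
\app{
  % alt inserts the subvariant of hand A1 into the printed lemma ...
  \lem[wit={A,M}, alt={nihil audet (aut et A\textsuperscript{1}) per se}]
      {nihil audet per se}
  % ... so nordg keeps this reading out of the PDF apparatus, where it
  % would be redundant, while it remains available for TEI output.
  \rdg[wit=A1, nordg]{nihil aut et per se}
}
nulli adhibetur consilio.
```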

IV ALIGNMENT AND SEGMENTATION
An alert reader will have noticed from listing III.2 two <div> elements at lines 178-9. As a matter of fact, lines 178 and 204 from the TEI output come from \begin{latin} ... \end{latin} in the corresponding LaTeX source file (listing III.1, lines 85 and 103), while TEI lines 179, 180 and 203 are generated by one single command provided by ekdosis, \ekddiv (listing III.1, l. 86). Moreover, \begin{segment} ... \end{segment} (listing III.1, lines 87 and 102) has been converted into <seg> elements by ekdosis (listing III.2, lines 182 and 201).
These features call for two remarks. First, ekdosis knows where any opened TEI element that is allowed to nest recursively, such as <div>, <lg> and the like, is to be closed, even though, as in the case of the \ekddiv command, there is no explicit indication of the point where the closure occurs. Thoroughly scanning LaTeX source files with Lua functions which involve complex string matching, reverse string matching and recursions was required, as LaTeX 'open' commands such as \chapter or \section only act as milestones, contrary to nested TEI elements. ekdosis converts these commands into TEI 'numbered' textual divisions, namely <div1> to <div7>. Moreover, in addition to LaTeX standard textual divisions, \ekddiv was needed to meet the requirements of classical and literary texts the divisions of which depend on many different received traditions. As can be seen from listing III.1, line 86, \ekddiv processes a comma-separated list of 'name-value' arguments. Some, such as head, correspond to TEI subsequent elements, while others correspond to attributes of <div> elements. As to depth, which has no TEI equivalent, ekdosis uses its value to build un-numbered TEI divisions allowed to nest recursively, or not allowed to, in accordance with their declared hierarchic depth. This mechanism gives the flexibility that is needed in printed editions of classical texts. Finally, it must be noted that format and presentation have been carefully separated: in the PDF output (fig. 3), chap. XIII, which is printed at the beginning of the current paragraph in Roman capital letters followed by a dot and an en quad, is merely encoded in \ekddiv as head=XIII with no dot. Other commands, which are not discussed here, allow one to set the format of any textual division that is used.
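The structure discussed in the first remark can be summarized by the following body fragment, a sketch that follows the description of listing III.1 given here. Only head=XIII is taken from the text; the values of depth, n and type are assumptions chosen for illustration, and the latin environment is assumed to have been defined beforehand by the alignment settings in the preamble.

```latex
% Body fragment; the latin environment is assumed to be already defined.
\begin{latin}                  % converted into a TEI <div> with an xml:id
  % head is encoded without the dot that is printed in the PDF output;
  % depth, n and type carry assumed values here.
  \ekddiv{head=XIII, depth=2, n=6.13, type=section}
  \begin{segment}              % converted into a TEI <seg> with an xml:id
    In omni Gallia eorum hominum qui aliquo sunt numero atque honore
    genera sunt duo.
  \end{segment}
\end{latin}
```

Note that nothing in the source marks where the division opened by \ekddiv ends: as explained above, the closing point is worked out by ekdosis itself.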
The second remark is about the two environments \begin{latin} ... \end{latin} and \begin{segment} ... \end{segment} (listing III.1, ll. 85, 87, 102 and 103): as can be seen from the TEI output (listing III.2, ll. 178 and 182), the corresponding <div> and <seg> elements have been given xml:id attributes by ekdosis. This important feature allows for alignment of parallel texts from multilingual corpora. To return to the example of Caesar's Gallic War VI, 13.1 presented here, the alignment has been set in the preamble of the LaTeX source file by means of \SetEkdosisAlignment (listing IV.1).
It is worth mentioning that \SetEkdosisAlignment provides further options suited for corpora in which translations or parallel passages are sufficient in number to be provided in separate files.¹⁷ However, this point will not be discussed here, as covering all the possible arrangements would lead to much complexity and take too long.
In listing IV.1, tcols and lcols stand for "total number of columns" and "columns to be printed on the left-hand page"¹⁸ respectively. The next option, texts, defines names of environments that are to receive texts to be aligned, namely: latin, english and french. Further sub-options can be specified between square brackets, such as xml:lang attributes. Then, the apparatus option, just as texts, takes a semicolon-separated list of previously defined environments that shall receive at least one layer of apparatus criticus. As already said (see point 1 of the list of features above), several layers of critical notes can be defined.¹⁹ Finally, segmentation=auto instructs ekdosis to automatically increment the xml:id attributes associated with each segment of text delimited by the LaTeX environment \begin{segment} ... \end{segment} as in listing III.1, ll. 87 and 102.
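For orientation, a setting consistent with the options just described might look like the sketch below. The command name and the option names are those given in this text (the released package may name them differently); the values of tcols and lcols, the language codes and the choice of latin as the only environment receiving an apparatus are assumptions made for the purpose of illustration.

```latex
% Preamble fragment; values are assumed, option names follow the text.
\SetEkdosisAlignment{
  tcols=3,            % total number of columns (three aligned texts)
  lcols=1,            % columns to be printed on the left-hand page
  % semicolon-separated environments, sub-options in square brackets
  texts=latin[xml:lang="la"];
        english[xml:lang="en"];
        french[xml:lang="fr"],
  apparatus=latin,    % environments that receive an apparatus criticus
  segmentation=auto   % xml:id of each segment incremented automatically
}
```

With such a declaration, one edited text and two translations would be laid out across a double page, with the apparatus attached to the edited text only.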
The complete LaTeX body text that was used to build Caesar's Gallic War, VI, 13 […] The three environments that have been set with \SetEkdosisAlignment (see listing IV.1) have been assigned automatically generated xml:ids, with the language attributes that were specified as further optional arguments, as can be noted from listing IV.3, ll. 178, 205 and 217. Then, as the correspondences are between spans of text, LaTeX segment environments have been translated into TEI <seg> elements, each of which has again been assigned an automatically incremented xml:id. Finally, the segments found in these separate domains are connected together by means of <linkGrp> and <link> elements. This technique naturally applies to alignment of parallel texts in multilingual corpora, but is also well suited for parallel redactions and borrowings which can be further annotated using the \note command which ekdosis translates into TEI <note> elements with associated type attributes.²²

V CONCLUDING REMARKS
Fig. 4 shows Caesar's Gallic War, VI, 13.1 critical edition in print as it is typeset by ekdosis from the LaTeX source file that has been commented on here. Arguably, this is how it should be read in a "non-actionable" form. Of course, displaying in like manner more texts or more translations would soon become impossible, not to say irrelevant. Yet ekdosis can select a handful of versions out of many and display them properly in print while building a database meant to support queries and extraction of data. Such is the spirit in which it was written.

Figure 1: Programming languages used in ekdosis