Processing Tools for Greek and Other Languages of the Christian Middle East

This paper presents some computer tools and linguistic resources of the GREgORI project. These developments allow automated processing of texts written in the main languages of the Christian Middle East, such as Greek, Arabic, Syriac, Armenian and Georgian. The main goal is to provide scholars with tools (lemmatized indexes and concordances) that make corpus-based linguistic information available. The paper focuses on text processing, lemmatization, information retrieval, and bitext alignment.

analysis software or data retrieval system. In the second case, the analysis is focused on translation methods. Lexical data of a source text are related to lexical data of a target text in order to highlight how lexical features of a source language (related to a specific cultural area) are spread and received in a target language (linked with another cultural area). In both cases, the main purpose is to provide researchers with linguistic information drawn exclusively from corpus-based evidence.
This paper focuses on the Unitex corpus processor, on the one hand, and on specific outputs such as monolingual or bilingual concordances, indexes and other lexicographical lists provided as PDF documents, on the other hand. Since two contributions included in the present issue, [Pataridze and Kindt, 2017] (Text Alignment in Ancient Greek and Georgian: A Case-Study on the First Homily of Gregory of Nazianzus) and [Van Elverdinghe, 2017] (Recurrent Pattern Modelling in a Corpus of Armenian Manuscript Colophons), build on the developments presented here, this paper should be regarded as a "general introduction" to these two more specific studies. Given that these contributions concern Armenian and Georgian, this paper deals with Greek and Syriac.

I. TEXT PROCESSING AND LEMMATIZING
Unitex is an open source corpus processor [Paumier, 2016 and http://unitexgramlab.org/fr]. It supports the Unicode encoding standard and is language independent. This software suite works with specific resources, such as dictionaries [Paumier, 2016:43-74] and grammars [Paumier, 2016:93-160]. Dictionaries provide word-forms labeled with lemmata and grammatical tags (part-of-speech tags, inflectional tags). Grammars are user-friendly graphical representations of lexical or syntactic patterns based on the formalism of recursive transition networks.

Unitex
The following lines illustrate how Unitex allows the user to open, process, and explore a Greek corpus currently composed of two hundred and forty-nine letters of Gregory of Nazianzus (329-390) published by [Gallay, 1964-1967] and [Gallay and Jourjon, 1976]. This data includes 45,971 word-forms (11,318 different forms; 4,452 lemmas). This corpus is part of a larger corpus entitled Corpus Epistularum Patrum Cappadocum - Corpus épistolographique des Pères cappadociens, bringing together the letters of Basilius of Caesarea, Gregory of Nyssa, Gregory of Nazianzus, and Firmus of Caesarea, now available on the website of the GREgORI project.
When the user opens a corpus, the interface proposes a three-step preprocessing: 1) sentence splitting, 2) normalization of special words or expressions, and 3) lexical lookup.
Unitex treats the sentence as a linguistic unit. The texts are split into sentences on the basis of punctuation. This first step is followed by a normalization task aimed at improving the analysis of some words or expressions. In Greek, this step deals with crasis and with multiword expressions containing a lexical unit that, out of context, has more than one analysis but is unambiguous in the given sequence.
The third step of the preprocessing involves the lexicon lookup. The GREgORI project develops its own dictionaries for each processed language. The dictionaries are Unicode text files listing the word-forms. The Greek lexicon records 441,464 word-forms connected with 67,347 lemmata and, to be used with Unitex, is converted into DELAF format [Paumier, 2016:43-46]. Each word-form is provided with a one-line lexical entry using a precise syntax: inflected form, lemma corresponding to the form, part-of-speech tag of the lemma (noun, verb, adjective, adverb, pronoun, etc.) and inflectional tags describing the inflected form (case, gender, number, voice, tense, mood, person, etc.). For instance: τό,ὁ.DET:Nns:Ans; ἔδαφος,ἔδαφος.N+Com:Nns:Vns:Ans, etc. The DELAF formalism is also used for the other languages, such as Armenian որդիք,որդի.N+Com, Syriac
It should be noted that the lexical lookup does not take into account the context in which the words appear. Therefore, when a word corresponds to more than one lemma, all the possibilities are retained by the process. This is the case, for example, of the inflected form φύσει, which corresponds to both the verb φύω "engender" and the noun φύσις "nature". This is a case of lexical ambiguity (see sections 1.3 and 1.5, below).
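The entry syntax and the context-free lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by Unitex or by the GREgORI project: the names `DelafEntry`, `parse_delaf_line` and `build_index` are hypothetical, the parser ignores the escaping rules of the real DELAF format, and the two φύσει lines are simplified entries without inflectional tags.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DelafEntry:
    form: str    # inflected form as found in the text
    lemma: str   # lemma the form belongs to
    pos: str     # part-of-speech code, e.g. "N", "V", "DET"
    features: list = field(default_factory=list)     # e.g. ["Com"]
    inflections: list = field(default_factory=list)  # e.g. ["Nns", "Ans"]


def parse_delaf_line(line: str) -> DelafEntry:
    """Parse one simplified entry: form,lemma.POS+feature:infl1:infl2"""
    form, rest = line.strip().split(",", 1)
    lemma, codes = rest.split(".", 1)
    grammatical, *inflections = codes.split(":")
    pos, *features = grammatical.split("+")
    # An empty lemma field is read as "identical to the inflected form".
    return DelafEntry(form, lemma or form, pos, features, inflections)


def build_index(lines):
    """Map each inflected form to every entry recorded for it (no context)."""
    index = defaultdict(list)
    for line in lines:
        entry = parse_delaf_line(line)
        index[entry.form].append(entry)
    return index


# The first two lines are quoted from the text; the last two are simplified
# illustrations of the ambiguous form discussed above (tags omitted).
SAMPLE_LINES = [
    "τό,ὁ.DET:Nns:Ans",
    "ἔδαφος,ἔδαφος.N+Com:Nns:Vns:Ans",
    "φύσει,φύω.V",
    "φύσει,φύσις.N",
]

if __name__ == "__main__":
    for form, entries in build_index(SAMPLE_LINES).items():
        print(form, [(e.lemma, e.pos) for e in entries])
```

A form that maps to more than one entry in the index is exactly what the text calls a lexical ambiguity: the lookup keeps every analysis and leaves the choice to a later step.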

Information retrieval
After the preprocessing, the user can search the text for words or expressions by using the "Locate Pattern" menu [Paumier, 2016:84-89]. The results are displayed as "Key Word In Context" concordances. Queries may concern inflected forms, lemmata, or part-of-speech tags, as well as combinations of these arguments, as shown in table 1 below.

Locate Pattern | Result
εἴχομεν | The concordance of the word-form εἴχομεν
<ἔχω> | The concordance of all the inflected forms of the verb ἔχω
<V> | The concordance of all the inflected forms of lemmata tagged as verb
<I+Neg><ἔχω> | The concordance of all the lemmata tagged as negation followed by an inflected form of the verb ἔχω

Table 1. Samples of queries.
Fig. 1 below gives a part of the concordance corresponding to the query <I+Neg><ἔχω>. Note once again that, at this stage of the process, before lexical disambiguation, queries on both <φύω> and <φύσις> provide concordances including the word-form φύσει.
Figure 1. Concordance of the lemmata tagged as negation and followed by an inflected form of the verb ἔχω.
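To make the four query types of table 1 concrete, the following Python sketch matches such patterns against a short, already looked-up token sequence. It is a toy model under stated assumptions, not the Unitex "Locate Pattern" engine: the functions `matches` and `locate`, the tag inventory `KNOWN_CODES`, and the three-token sequence (which is not taken from the corpus) are all hypothetical.

```python
# Assumed tag inventory, used only to tell a tag query (<V>) from a lemma query (<ἔχω>).
KNOWN_CODES = {"V", "N", "I", "DET"}

# Toy sequence of (form, lemma, codes) triples; not a sentence from the corpus.
TOKENS = [
    ("οὐκ", "οὐ", {"I", "Neg"}),
    ("εἴχομεν", "ἔχω", {"V"}),
    ("ἔχει", "ἔχω", {"V"}),
]


def matches(item, token):
    """Test one pattern item against one token."""
    form, lemma, codes = token
    if item.startswith("<") and item.endswith(">"):
        inner = item[1:-1]
        parts = inner.split("+")
        if parts[0] in KNOWN_CODES:
            # Tag query such as <V> or <I+Neg>: every listed code must be present.
            return set(parts) <= codes
        # Otherwise a lemma query such as <ἔχω>.
        return inner == lemma
    # A bare item is an inflected form, compared literally.
    return item == form


def locate(pattern, tokens):
    """Return the start indexes where the pattern items match consecutive tokens."""
    n = len(pattern)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if all(matches(p, t) for p, t in zip(pattern, tokens[i:i + n]))
    ]


if __name__ == "__main__":
    for query in (["εἴχομεν"], ["<ἔχω>"], ["<V>"], ["<I+Neg>", "<ἔχω>"]):
        print(query, locate(query, TOKENS))
```

Each returned index would correspond to one line of a "Key Word In Context" concordance; printing the tokens to the left and right of the match gives the context columns.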

Text Automaton
Unitex provides a visual representation of each sentence of the processed text, called "automaton" [Paumier, 2016:161-198], as shown in fig. 2. It begins with an arrow and ends with a square inscribed in a circle. Each word-form of the sentence appears in a box. Links between boxes symbolize the continuity of the sentence. Each word-form is accompanied by the corresponding lemma and part-of-speech tag, information that comes from the lexical lookup. When a word corresponds to more than one lemma, two or more boxes appear on the same horizontal axis, making the different interpretations of the word explicit. The automaton in fig. 3 shows the lexical ambiguity of the word-form πέτρας, inflected form of Πέτρα "Petra" (the city of Petra), or of πέτρα "stone". The processed text is fully lemmatized, without lexical ambiguities, when the automaton shows only one box, i.e. only one interpretation, for each word of each sentence of the text. At this stage, it is possible to generate valid statistics and to produce lemmatized concordances or other outputs. Lexical ambiguities are processed automatically or manually. In the second case, the user can use the lemmatization interface of Unitex, as shown in section 1.5 below.

Information retrieval by grammars
Grammars are powerful tools for describing sequences of words. The query <I+Neg><ἔχω> used above can be represented graphically as a grammar, as shown in fig. 4. The graphical interface of Unitex allows users to draw grammars themselves. Applying this grammar to the text with Unitex yields, once again, the concordance of fig. 1. However, grammars make it possible to search for more complex patterns, covering a wide range of possible queries. The grammar of fig. 5 returns the concordance of all the sequences consisting of a negation followed by an inflected form of the verb ἔχω, just like the previous one, but allows, between these two arguments, for the optional presence of one or several adverbs or particles (the sign <E> matches an empty sequence).
Grammars can also produce outputs in the form of tags inserted in the results (fig. 6). These tags follow or surround the sequences displayed in the concordance. This grammar provides the same concordance as above, but the inflected forms of ἔχω are sorted alphabetically (owing to the asterisk placed before <ἔχω>). Here, the use of variables makes it possible to repeat in the output the part of the sequence enclosed in brackets, as shown in the concordance of fig. 7.
In fact, the preprocessing described above (section 1.1) relies on this kind of grammar to analyze crasis (fig. 8 and 9) and expressions (fig. 10). In that case, the sequences are fully replaced by the content of the grammar. Since Unitex uses the Unicode encoding and is language independent, as noted above, grammars of this kind can be adapted to analyze similar linguistic facts in Armenian or Georgian. As regards Armenian, a case study is provided by [Van Elverdinghe, 2017], in this issue.
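The logic of the grammar with an optional middle part (negation, then zero or more adverbs or particles, then a form of ἔχω) can be approximated procedurally. The Python sketch below is a hand-written equivalent of that single path, not a recursive transition network and not Unitex code. The function `locate_neg_verb`, the filler codes "ADV" and "PART", and the toy token sequence are assumptions made for the example.

```python
# Toy sequence of (form, lemma, codes) triples; not a sentence from the corpus.
TOKENS = [
    ("οὐκ", "οὐ", {"I", "Neg"}),
    ("ἔτι", "ἔτι", {"ADV"}),
    ("ἔχω", "ἔχω", {"V"}),
    ("οὐκ", "οὐ", {"I", "Neg"}),
    ("εἴχομεν", "ἔχω", {"V"}),
]


def locate_neg_verb(tokens, verb_lemma, fillers=("ADV", "PART")):
    """Find negation + optional fillers + a form of verb_lemma.

    Returns (start, end) index pairs, with end exclusive.
    """
    filler_codes = set(fillers)
    spans = []
    for start, (_, _, codes) in enumerate(tokens):
        if not {"I", "Neg"} <= codes:
            continue
        pos = start + 1
        # This loop may run zero times, which plays the role of the <E> branch.
        while pos < len(tokens) and tokens[pos][2] & filler_codes:
            pos += 1
        if pos < len(tokens) and tokens[pos][1] == verb_lemma:
            spans.append((start, pos + 1))
    return spans


if __name__ == "__main__":
    for start, end in locate_neg_verb(TOKENS, TOKENS[2][1]):
        print(" ".join(form for form, _, _ in tokens[start:end])
              if (tokens := TOKENS) else "")
```

A graphical grammar expresses the same idea declaratively, which is why it can be edited without programming; the procedural version is shown only to make the matching behaviour explicit.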

Lemmatization
Unitex provides a lemmatization interface (fig. 11) for resolving lexical ambiguities manually. A special window displays the concordance and the text automaton of the sentences containing lexical ambiguities. By clicking on superfluous boxes (for instance, boxes with lemma "ὁ.DET" and lemma "φύω.V"), the user can remove inadequate interpretations and resolve ambiguities in order to obtain a fully lemmatized and disambiguated text, as shown in fig. 12. All linguistic resources, dictionaries and grammars, can be created, updated and corrected by the users, without any special knowledge of computer programming. The linguistic data is not encapsulated in a "black box" and remains accessible and readable.
At the end of the process, all lemmatized and disambiguated data are exported and gathered in a relational database system. Then, the publishing tools (see section III) make it possible to generate the special outputs in PDF format, such as concordances, indexes and other lexicographical lists.
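The effect of removing boxes can be modelled with a very small data structure: one list of candidate readings per word-form, from which rejected readings are deleted until a single one remains. The Python sketch below is a simplified stand-in for the text automaton, not its actual representation in Unitex. The names `ambiguous_positions` and `remove_reading` and the two-word sample are hypothetical.

```python
# One "box list" per word-form: each reading is a (lemma, POS) pair.
# Toy data; the two readings of φύσει are those mentioned in the text.
SENTENCE = [
    {"form": "φύσει", "readings": [("φύω", "V"), ("φύσις", "N")]},
    {"form": "εἴχομεν", "readings": [("ἔχω", "V")]},
]


def ambiguous_positions(sentence):
    """Indexes of the word-forms that still have more than one reading."""
    return [i for i, box in enumerate(sentence) if len(box["readings"]) > 1]


def remove_reading(sentence, position, lemma):
    """Delete the readings with the given lemma, as a click on a box would."""
    kept = [r for r in sentence[position]["readings"] if r[0] != lemma]
    if not kept:
        # Never leave a word-form without any interpretation.
        raise ValueError("cannot remove the only remaining reading")
    sentence[position]["readings"] = kept


if __name__ == "__main__":
    print(ambiguous_positions(SENTENCE))
    remove_reading(SENTENCE, 0, SENTENCE[0]["readings"][0][0])
    print(ambiguous_positions(SENTENCE), SENTENCE[0]["readings"])
```

In this model, a text counts as fully lemmatized once `ambiguous_positions` returns an empty list for every sentence, which mirrors the "one box per word" criterion given earlier.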

II. TEXT ALIGNMENT
The GREgORI project processes texts in Greek, Arabic, Syriac, Armenian and Georgian using appropriate resources for each language. In this context, we are interested in comparing the lexical data of a source text with those of its ancient or modern translations, the target texts, in order to provide scholars with corpus-based information about translation methods. This task requires alignment software such as mkAlign [Fleury, 2012 and http://www.tal.univ-paris3.fr/mkAlign/]. The mkAlign software allows users to pair words or expressions of a source text with words or expressions of a target text.
The mkAlign interface is divided into two columns. The user loads the source text in the left column and the target text in the right one. A first alignment is provided automatically on the basis of the punctuation or of the textual references of the edition (chapters, paragraphs, etc.). The sample provided here concerns the Greek text of Homily 13 by Gregory of Nazianzus, aligned with its Syriac (S2) version [Schmidt, 2002]. Then, the user can move words or expressions to align the relevant translation units. Word-by-word alignment is not always possible, as shown in fig. 13, below, where the Greek form αἰρόμεναι is aligned with three Syriac words. Aligned data are afterwards exported as bitext files in the Translation Memory eXchange format (.tmx) and gathered in the relational database system. The bitext alignment is currently done manually. In the near future, the bilingual data will be used as translation memories in order to automate the alignment of recurring translation units.
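As an illustration of what a bitext file in the .tmx format contains, the Python sketch below writes aligned pairs as translation units with one source and one target variant each. It is a minimal sketch under stated assumptions and does not reproduce the exact output of mkAlign. The function `build_tmx`, the header values, and the language codes "grc" and "syc" are choices made for the example, and the Syriac side is a placeholder string because the three Syriac words are not quoted in the text.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="grc", tgt_lang="syc"):
    """Serialize (source, target) translation units as a TMX 1.4 string."""
    tmx = ET.Element("tmx", version="1.4")
    # Header values are illustrative placeholders, not those written by mkAlign.
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch",
        "creationtoolversion": "0",
        "segtype": "phrase",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    # One Greek form aligned with a three-word target segment (placeholder text).
    print(build_tmx([("αἰρόμεναι", "SYRIAC_WORD_1 SYRIAC_WORD_2 SYRIAC_WORD_3")]))
```

Because a translation unit pairs two segments rather than two words, a one-to-three alignment such as the one in fig. 13 needs no special treatment: the three target words simply share one segment.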

III. CONCORDANCES, INDEXES AND LEXICOGRAPHICAL LISTS
Specific tools are developed by the Centre de traitement automatique du langage (Cental) at the UCL (https://uclouvain.be/fr/instituts-recherche/ilc/cental) in order to produce concordances, indexes and other types of lexicographical lists. The purpose of these developments is to extract the lexical data from the databases and inject them into the frameworks corresponding to the output required by the user. Fig. 14 shows the lemmatized concordance of the verbs included in the letters of Gregory of Nazianzus. In this PDF document, lemmata are displayed in green (accompanied by a frequency rate), POS tags in red and word-forms in blue, with downstream and upstream contexts. The references to the edition appear on the left of the downstream context. The bookmarks list the lemmata and allow the reader to browse the document. Fig. 15 gives a sample of the alphabetical list of these verbs and Fig. 16 lists these verbs according to their frequency in the corpus. Fig. 17 offers a sample of a bilingual Greek-Syriac concordance of Homily 13 by Gregory of Nazianzus. This document displays all the lexical data both for the source text and for the target text, including roots (in red) for Syriac. Such a bilingual concordance is the so-called "scholar concordance". Fig. 18 provides the lexical data (lemmata and POS tags) only for the source text. Such a bilingual concordance is a so-called "light concordance", which is more easily readable for the user.
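The alphabetical list and the frequency list of verbs mentioned above (fig. 15 and 16) both derive from the same count of lemmata. The Python sketch below shows one way to compute them from lemmatized records. It is not the Cental publishing tool: the function names, the record shape and the toy data are hypothetical, and the sort key (stripping diacritics before comparing) is only a rough approximation of Greek alphabetical order.

```python
import unicodedata
from collections import Counter

# Toy lemmatized records as (lemma, POS) pairs; not counts from the corpus.
RECORDS = [
    ("ἔχω", "V"),
    ("φύω", "V"),
    ("ἔχω", "V"),
    ("φύσις", "N"),
    ("γράφω", "V"),
]


def sort_key(lemma):
    """Lowercase and strip accents and breathings so that sorting follows the letters."""
    decomposed = unicodedata.normalize("NFD", lemma.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lemma_lists(records, pos="V"):
    """Return (counts, alphabetical list, frequency list) for one part of speech."""
    counts = Counter(lemma for lemma, tag in records if tag == pos)
    alphabetical = sorted(counts, key=sort_key)
    # Most frequent first; ties are broken alphabetically.
    by_frequency = sorted(counts, key=lambda lem: (-counts[lem], sort_key(lem)))
    return counts, alphabetical, by_frequency


if __name__ == "__main__":
    counts, alphabetical, by_frequency = lemma_lists(RECORDS)
    for lemma in by_frequency:
        print(lemma, counts[lemma])
```

Keeping the count separate from the two orderings means that the same extracted data can feed several outputs, which is the division of labour the text describes between the database and the publishing tools.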

CONCLUSION
For many years, all these technical developments and linguistic data have been used to provide scholars with lemmatized concordances. These concordances were formerly published by Brepols Publishers on microfiches in the Thesaurus Patrum Graecorum series (see [Coulie, 1996; Kindt, 2010]). This use of microfiches was, however, rightly characterized as a "doubtful aspect" by [Trapp, 2008:97]. Concordances and other lexical outputs are now directly available on the website of the GREgORI project. Some collaborators have made use of these computerized resources to build lemmatized indexes (index nominum, index verborum, etc.) included in their editions of ancient texts (see for instance [Schmidt, 2001], [Auwers, 2005], [Sanpeur, 2007], [Amato, 2009], [Auwers, 2011] and more recently [Calzolari, 2017]). The GREgORI project also takes part in international research programs, such as the Syriac Galen Palimpsest project led by the University of Manchester or the study undertaken at the University of Lausanne and devoted to the comparative analysis of the Iliad and its Byzantine translation contained in the manuscript Genavensis 44.
At present, a web-based interface for consulting lemmatized concordances online is being prepared in collaboration with the Cental. A beta version of this interface is used by Peeters Publishers (Leuven) in order to pave the way for a forthcoming digital version of the well-known Corpus Scriptorum Christianorum Orientalium, the series dealing with some of the most important ancient sources written in languages of the Christian Middle East.

Figure 5. Grammar predicting the optional presence of an adverb or a particle between the arguments <I+Neg> and <ἔχω>.

Figure 6. Grammar providing outputs and alphabetical sorting of inflected forms of the last verbal argument.

Figure 7. Concordance with outputs and inflected forms of ἔχω in alphabetical order.

Figure 12. Lemmatization interface with elimination of useless boxes.

Fig. 14. Lemmatized concordance of the verbs in the corpus of the letters by Gregory of Nazianzus.