Preprocessing Greek Papyri for Linguistic Annotation

Greek documentary papyri form an important direct source for Ancient Greek. They have been exploited surprisingly little in Greek linguistics due to a lack of good tools for searching linguistic structures. This article presents a new tool and digital platform, “Sematia”, which transforms the digital texts available in TEI EpiDoc XML format into a format that can be morphologically and syntactically annotated (treebanked), and in which the user can add new metadata concerning the text type, writer and handwriting of each act of writing. An important aspect of this process is to take into account the original surviving writing vs. the standardization of language and supplements made by the editors. This is done by creating two different layers of the same text. The platform is in its early development phase. Future developments, such as tagging linguistic variation phenomena as well as queries performed within Sematia, are discussed at the end of the article.
Keywords: Greek; papyri; linguistic annotation; treebank; dependency grammar; TEI EpiDoc XML; MySQL; Python; JavaScript


INTRODUCTION
Greek papyri from Egypt have preserved larger and smaller pieces of Greek as it was written by ancient speakers from ca. 300 BCE to 700 CE. Different registers and styles are found within a variety of text types; the vernacular becomes visible in private letters and official phraseology in contracts. Therefore, the papyrological corpus forms an important direct source for Greek linguists. The documentary papyrological corpus is freely available in digital form in the [Papyrological Navigator] (PN) platform, which also allows users to search both text strings and metadata (such as date and provenance). The search possibilities do not, however, lend themselves easily to querying linguistic structures or variation in spelling or morphosyntax. Partly for this reason, the papyrological corpus has received little attention in the majority of linguistic research on Ancient Greek.
A research project of author 1 ("SEMATIA: Linguistic Annotation of the Greek Documentary Papyri – Detecting and Determining Contact-Induced, Dialectal and Stylistic Variation", funded by the Academy of Finland) sought methods to make better use of the papyri for purposes of linguistic research. In this first phase we needed a way to preprocess the papyri into a form which could be linguistically annotated. The Sematia tool presented in this article results from this project, but the tool is still being developed further. A new research project ["Act of the Scribe: Transmitting Linguistic Knowledge and Scribal Practices in Graeco-Roman Antiquity"], where author 1 is currently a researcher, concentrates on scribes, their level of competence and their linguistic skills. We study the mechanisms of language production in order to separate the technical effects from the linguistic and cognitive processes. This enables us to pinpoint the scribe's part in language change. We have added to Sematia the possibility of entering new metadata especially for the purposes of that project. We approach the texts by dividing them into "acts of writing" in order to distinguish each writer within one text.
Sometimes a text is a product of one writer only, but in many cases two or more different people have written in one document, attested by the change of handwriting.

I BACKGROUND
In this section, we first briefly describe the digital papyrological corpus used in this project, as well as the nature of a papyrus text, in order to illustrate the basic requirements for preprocessing the data. Then, in 1.2, we summarize the linguistic annotation process, which is essential for the later discussion of how we plan to utilize treebanks in this project. Lastly, in 1.3, in order to motivate the way in which we address the texts, we briefly discuss what we mean by linguistic variation.

The Papyrological Corpus in Digital Form
The platform Papyrological Navigator (PN) is the most important digital tool for papyrologists and anyone using papyri, potsherds and wooden tablets as primary sources for their studies of the Ancient World. It is an umbrella term under which several databases with different scopes are linked together. Its history goes back to 1982, when a papyrological text corpus in digital form was formed at the Packard Humanities Institute, resulting in a CD-ROM (PHI #7 Duke Databank of Documentary Papyri). A more detailed history is given on the information page at PN. At the moment the Duke Databank text corpus is open source and available online via the Papyrological Navigator, and the texts have been migrated into [TEI EpiDoc XML] form. New publications are added to the corpus, old entries can be corrected and new data added via the Papyrological Editor by the papyrological community (the workflow is curated by an editorial team). Thus, the corpus is kept in an up-to-date, reliable state. Currently, it hosts ca. 70,000 Greek texts, 2,000 Latin texts and 1,000 Coptic texts. A word count is not available, and texts vary from very short to extremely long. The PN also includes a search interface, where the texts, metadata, translations and images can be searched using different parameters.

The Nature of a Papyrus Text and its realization in TEI EpiDoc XML
Papyri, like inscriptions, are seldom preserved in perfect condition. This results in gaps (lacunae) within the text. The ink may have faded in places, or the handwriting might be difficult to read, with the result that the editor cannot always be certain how to read each letter. Moreover, many texts contain a large number of abbreviations, because they come from the pens of professional scribes working with texts of an administrative nature. These features are marked in the paper editions according to the editorial conventions called the Leiden System, commonly agreed upon in 1931. For example, a lacuna is marked with square brackets, abbreviations are expanded within parentheses and uncertain letters have a dot under them. For a full list, see [Schubert 2009, 202-203]. EpiDoc XML marks the same phenomena with TEI-compatible tags within the text, e.g. <gap> for the lacunae and <unclear> for the uncertain letters. The display in the PN shows the text in a traditional Leiden System layout (with the apparatus criticus below the text), but the text is stored in the GitHub repository in XML form.
Example 1. The first two lines of P.Petra 1 6 in PN display layout (A) and in EpiDoc XML (B):
(A) [display layout not reproduced here]
(B) <lb n="1"/><g type="stauros"/> <choice><reg>γνῶσις</reg><orig>γνο͂σις</orig></choice> <choice><reg>ὧν</reg><orig>ὁ͂ν</orig></choice> <choice><reg>ἀπώλε <lb n="2" break="no"/>σα</reg><orig>ἀπόλε<lb n="2" break="no"/>σα</orig></choice> <choice><reg>ἐγώ</reg><orig>ἐγὸ</orig></choice> Ἐπιφάνιος
Although Example 1 exhibits no gaps or uncertain letters, it shows another feature that is highly relevant to our project and to linguists in general, namely, editorial corrections. Within the <choice> tag, the <orig> tag shows which form the ancient writer actually wrote on the papyrus, and <reg> what the editor considers to be the regular or standard form that was meant. A linguist is usually interested in the forms that the writer originally wrote, since they give us information on language change, phonology and the vernacular. However, with regard to our project it is highly important that the edited text contains the assumed standard forms, too. Using that information, the lemmatization and the comparison between the original and standard forms are much easier to perform. Of course, in several cases we may hesitate over what, in fact, is the standard we should be comparing with, and over whether we agree with the editor's interpretation of what the original writer was aiming at. For discussions on this topic, see [Colvin 2009] and briefly [Vierros 2012, 25].
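The relation between <orig> and <reg> can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the fragment is a simplified version of Example 1 without the TEI namespace or line breaks, and the function name choice_pairs is hypothetical, not part of Sematia.

```python
import xml.etree.ElementTree as ET

# Simplified fragment modelled on Example 1. Real EpiDoc files declare the
# TEI namespace; it is left out here to keep the sketch short.
FRAGMENT = (
    "<ab>"
    "<choice><reg>γνῶσις</reg><orig>γνο͂σις</orig></choice> "
    "<choice><reg>ἐγώ</reg><orig>ἐγὸ</orig></choice> "
    "Ἐπιφάνιος"
    "</ab>"
)


def choice_pairs(xml_text):
    """Return (original, regularized) string pairs for every <choice>."""
    root = ET.fromstring(xml_text)
    pairs = []
    for choice in root.iter("choice"):
        orig = choice.find("orig")
        reg = choice.find("reg")
        # Skip <choice> elements that do not hold an orig/reg pair.
        if orig is not None and reg is not None:
            pairs.append(("".join(orig.itertext()), "".join(reg.itertext())))
    return pairs


pairs = choice_pairs(FRAGMENT)
for original, regularized in pairs:
    print(f"{original} (written) -> {regularized} (regularized)")
```

Each pair gives the written form next to the editor's regularization, which is the information a comparison of nonstandard and standard spellings starts from.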

Treebanks
For Ancient Greek literature, two (constantly growing) linguistically annotated treebank corpora exist, as mentioned by [Haug 2014]: the Ancient Greek Dependency Treebank (currently ca. 558,000 tokens of Homer, Hesiod and tragedies) and the PROIEL treebank (currently ca. 230,000 tokens of the New Testament, Herodotus and later Greek); see also [Universal Dependencies]. These treebanks follow the Dependency Grammar originally used for Czech in the Prague Dependency Treebank, outlined in [Hajič 1999]. The suitability of treebanks for historical linguistic research, as well as of dependency grammar for Ancient Greek, has recently been discussed by [Haug 2015]. The most reasonable solution, in our opinion, was to follow the same annotation framework with the papyrological material. In this way we can make use of the best practices and annotation infrastructure of those projects, as well as gain maximal synergy between the corpora of literary and documentary texts.
In the annotation process each word is supplied with a tag including its lemma, postag (i.e. a string containing the part of speech and morphological analysis of the form), syntactic role and a reference to the head word. The analysis is performed according to the Guidelines for the annotation of Ancient Greek (see [Bamman and Crane 2008] and [Celano 2014] for versions 1.1 and 2, respectively). The annotation tool we have used is an editor called [Arethusa] in the [Perseids] platform. Arethusa divides the text into sentences at certain punctuation marks (full stop, colon) and gives each sentence and each word within the sentence an ID number. It employs the [Morpheus] tool to provide each word with a lemma and a morphological analysis. This means that lemmatizing and morphological analysis are performed semi-automatically in the Arethusa editor; the human annotator must evaluate the correctness of the analyses where several options are possible, as in the case of homonyms, and add forms in cases where the tool does not recognize the lemma (e.g. many Egyptian names in the papyri). The syntactic roles and dependencies have to be analysed by the human annotator and entered manually, because a syntactic parser for Ancient Greek is still a desideratum; the first attempts have been reported by [Mambrini and Passarotti 2012].
Example 2. Treebanked sentence in XML format.
The "postag" is a nine-place string marking each word form with 1) part of speech 2) person 3) number 4) tense 5) mood 6) voice 7) gender 8) case 9) degree of comparison, using certain agreed letters and numerals; e.g. "n" stands for nominative and "g" for genitive in the 8th place of the string, which marks "case".
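As a rough illustration of the positional encoding, the following Python sketch splits a postag into its nine named fields. Several details are assumptions made for the example: the hyphen as the placeholder for an unused position, the sample string "n-s---mg-", and the function name decode_postag. Only the two case codes named in the text are spelled out; the remaining codes are left as raw letters rather than guessed.

```python
# Names of the nine positions, in the order given in the text.
POSTAG_FIELDS = (
    "part_of_speech", "person", "number", "tense", "mood",
    "voice", "gender", "case", "degree",
)

# Only the case codes mentioned in the text; a complete table would have to
# be taken from the annotation guidelines.
CASE_CODES = {"n": "nominative", "g": "genitive"}

# Assumption: unused positions are filled with a hyphen.
EMPTY = "-"


def decode_postag(postag):
    """Map a nine-character postag onto its named positions."""
    if len(postag) != len(POSTAG_FIELDS):
        raise ValueError(
            f"expected {len(POSTAG_FIELDS)} characters, got {len(postag)}"
        )
    decoded = {
        name: (None if code == EMPTY else code)
        for name, code in zip(POSTAG_FIELDS, postag)
    }
    # Spell out the case where the code is known, otherwise keep the raw code.
    decoded["case"] = CASE_CODES.get(decoded["case"], decoded["case"])
    return decoded


example = decode_postag("n-s---mg-")
print(example)
```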

Linguistic Variation
The documentary papyri include many different types of linguistic variation, which often cannot be found in the literary texts preserved via the manuscript tradition. Variation means the existence of competing linguistic forms either within one single speech community or within a language as a whole. When we witness a change in a language, it is normally preceded by a great deal of synchronic variation, that is, many variants compete until one of them becomes popular and consistent. Studying the variants as such tells us a great deal not only about language change and the processes leading to it, but also about the community: where the people come from, and with whom they have interacted (contact-induced variation). Some of the variants in papyri can be categorized as "scribal errors", a category which is not always treated consistently. It may include mere slips of the pen, but sometimes even a difference of one letter may be an important phonological variant signalling changes in pronunciation. For example, the genitive singular of the word "wheat" (standard: πυροῦ) is written in two different nonstandard ways in the potsherds from Narmouthis (the potsherds, ostraca, are included in the papyrological corpus): πουροῦ (OGN I, 42 and 47) and ποιροῦ (OGN I 46 and 86). The latter (ποιροῦ) attests the merging of /y/ and /oi/, which was an internal development in Greek in the Roman period, whereas the former (πουροῦ) rather shows transfer from Egyptian, which did not have the front vowel /y/; /u/ and /y/ were often confused by Egyptians writing Greek, see [Dahlgren, forthcoming].
In addition to spelling variants, we wish to present a couple of examples of morphosyntactic variation in order to make our treatment of the papyri more understandable. The first is the phrase-initial inflection strategy. Greek is an inflecting language where morphological case agreement is essential. Certain examples of case incongruence were earlier considered mainly "bad Greek", but were shown by [Vierros 2012] to represent a pragmatic strategy of the scribes: they only inflected the phrase-initial words and left the rest of the words belonging to the same phrase in the nominative case. The strategy also reflected the native language of the writers, Egyptian, which did not have case inflection. Also, the relative pronouns of the same writers were inflected according to the wrong head, thus evidencing contact-induced transfer from Egyptian.
A different type of dilemma is presented by some spellings that prevent us from making direct assumptions about what form the ancient writer aimed for. [Leiwo 2010] discusses, for example, how the phrase καλῶς ποιήσεις (a way of saying "please", "you do well…") is used, i.e. which form of a verb can act as its complement. Usually, an aorist participle complement denotes what is being asked. However, in the ostraca from Mons Claudianus, a form πέμψε is used (O.Claud. II 243, 2-3). In this particular case, it is difficult to say how it should be interpreted: taken at face value, πέμψε would be the aorist indicative 3rd person singular form of the verb "to send", and this is how the automatic morphological tool would classify it. In the sentence it cannot be a 3rd person form, since the phrase is directive. We could interpret it in two different ways. It could be an aorist imperative 2nd person singular, πέμψον, because unstressed /e/ and /o/ could be confused, especially by Egyptian native speakers, and the final /n/ could easily be dropped. This is how the editors wish to read it. However, the infinitive form πέμψαι would also be a phonologically possible interpretation here, because <αι> and <ε> are confused in the papyri all the time. All the forms discussed above were probably pronounced in the same way: /pémpsə/. The annotator may wish to mark up both options, the infinitive and the imperative, because the question here is whether the infinitive form was an accepted variant with this directive phrase or not.

II PREPROCESSING THE PAPYRI
In this section, we first present the idea of layering as a solution to preprocessing the papyrological data. Second, 2.2 contains a detailed description of how each XML tag is treated in the selection or deselection of elements for each layer. The technical side of building the platform and tool, for which author 2 was responsible, is described in 2.3.

Layers in Sematia
As mentioned in 1.1.1, the XML tags in the papyrus texts encode important information. The tags are located inside the text and between words and letters. Similarly, the choices and apparatus entries for one word follow each other. In the treebank editor, the word is the basic element that the editor tries to identify automatically. The EpiDoc XML texts cannot therefore be uploaded to the treebank editor as such, because the tags break up the words, and the apparatus choices would all be included side by side if we only removed the tags. For the study of linguistic variation, we need first and foremost to know what the ancient author really wrote (and what is extant of what he wrote). However, the standard variant is useful to have for the sake of comparison. Moreover, the fragmentary nature of many texts makes the syntactic structure discontinuous, and therefore the editor's supplements may help in building a solid syntactic tree of a sentence which is otherwise broken.
For these reasons, it seemed justified to create two different layers of the same text, each of which is treebanked separately. First, the original layer contains only what has been preserved in the papyrus, in the form in which the ancient writer wrote it. For abbreviated words, for example, only the part that was written is taken into the original layer, to prevent us from annotating case inflection that the ancient writer did not produce. The standard layer, on the other hand, includes all the editorial work: the expanded abbreviations, the supplements and the standardized forms of misspelled words are all accepted. In this way, we get two different treebanks of one act of writing, and a comparison can be made between them to see where the morphology differs.
Since treebanking does not allow us to mark all features relating to linguistic variation, we decided to add a third layer, where a new variation mark-up is added to the treebank XML. This very much concerns phonology and spelling, but can also benefit morphosyntactic analyses. The variation layer is discussed in chapter IV (Future developments).
An important division of one document is performed before the layering. The change of handwriting, <handShift>, indicates a different person penning the letters. Thus, each act of writing gets its own layers and eventually its own treebanks. Also, the new metadata we enter (discussed in III) concerns each act of writing.
One caveat may be mentioned, although the present article is not the correct place to take the discussion very far. The original layer, in fact, contains some editorial work too, i.e., it does not present a so-called diplomatic transcript. The writing on the papyrus is usually without word divisions (in scriptio continua) and does not contain diacritical marks (accents, breathings, or iota subscripts). The word divisions and diacritics are part of the editor's interpretation and make the text readable. We have not moved towards a diplomatic transcript in the original layer, for the sake of readability as well as to facilitate the automatic lemmatization and morphological analysis. If the annotator disagrees with some word divisions or diacritics, s/he has the possibility to make a change in the text in the Arethusa tool. However, in that case the interpretation should be well supported, and the same correction should be suggested to the Papyrological Navigator.

How tags were treated
This chapter consists of a full discussion of how the EpiDoc XML tags are treated when creating the original vs. the standard layer. It was important to keep the word count, i.e., the tokenization, the same in both layers, so that a word-for-word comparison between the layers is possible by using the word IDs. For the sake of tokenization, we use "dummy" elements to replace the parts not included in a layer. Another reason for using dummy elements is to help the annotator notice the missing parts of the text. The annotator will clearly see that something is missing, either between the words or at the end of an abbreviation, when s/he sees the dummy element. For this reason, the dummy element is written in capital letters.
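Why identical tokenization matters can be sketched in a few lines of Python. This is a hypothetical illustration, not Sematia's code: the token lists are typed by hand from Example 1, word IDs are assumed to start at 1, and the function name compare_layers is invented for the example.

```python
def compare_layers(original_tokens, standard_tokens):
    """Return (word_id, original, standard) for every differing token pair.

    Both layers must have the same number of tokens, so that the position of
    a token can serve as a shared word ID (assumed here to start at 1).
    """
    if len(original_tokens) != len(standard_tokens):
        raise ValueError("layers must contain the same number of tokens")
    return [
        (word_id, orig, std)
        for word_id, (orig, std) in enumerate(
            zip(original_tokens, standard_tokens), start=1
        )
        if orig != std
    ]


# Tokens of the two layers for the text of Example 1, typed by hand.
NAME = "Ἐπιφάνιος"
original_layer = ["γνο͂σις", "ὁ͂ν", "ἀπόλεσα", "ἐγὸ", NAME]
standard_layer = ["γνῶσις", "ὧν", "ἀπώλεσα", "ἐγώ", NAME]

differences = compare_layers(original_layer, standard_layer)
for word_id, orig, std in differences:
    print(word_id, orig, std)
```

If a supplement or abbreviation were simply dropped from one layer instead of being replaced by a dummy element, the positions would shift and this kind of pairing by word ID would no longer line up.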

Editorial corrections: <choice>, <reg>, <orig>, <corr> and <sic>
The element <choice> usually contains two alternatives. First, <reg> gives the standardized, regularized version, and is thus selected for the standard layer. On the other hand, <orig> contains what was originally written on the papyrus, and is naturally selected for the original layer. E.g. from <choice><reg>γνῶσις</reg><orig>γνο͂σις</orig></choice> we choose γνῶσις for the standard layer and γνο͂σις for the original layer. Sometimes the editor may have suggested two different possible regularizations, or another scholar may have suggested a new interpretation. In those cases, the platform allows the user to choose one of the options for the text which will be annotated (see below 2.3.3).

Abbreviations: <expan>, <ex>
Words are abbreviated in different ways in the papyri. Sometimes only the end of the word is left unwritten (and there usually is some sort of abbreviation mark at the breaking point). In TEI EpiDoc XML, the <expan> tag surrounds the whole abbreviated word in its expanded form and, within the <expan> tag, the part which was left unwritten is surrounded by the <ex> tag. For example, when the word στερεοῦ is abbreviated by leaving out the ending οῦ, it is written στερε(οῦ) according to the Leiden System, but in EpiDoc XML it is marked: <expan>στερε<ex>οῦ</ex></expan>. In this case, we take the whole word in expanded form into the standard layer (στερεοῦ), and for the original layer we choose only what was written on the papyrus, i.e. στερε, followed by the dummy for abbreviation, A. Thus in the original layer we get στερεA. The annotator now immediately sees that the scribe has not written the ending of the word, and can annotate the word for its lemma and other features that are visible, but not, in this instance, for its morphological case. Some words have been abbreviated only with a certain abbreviation mark. One of the most common is the sign for ἔτος, "year". In this case the word is most often expanded in the genitive and marked within parentheses in the Leiden System: (ἔτους). The markup is: <expan><ex>ἔτους</ex></expan>. The whole word in expanded form, ἔτους, is chosen for the standard layer, and for the original layer it is substituted with the marker A. The annotator may be confident enough to lemmatize the word as ἔτος, but otherwise the morphological analysis should be left open.
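The treatment of abbreviations can be sketched as follows in Python. The sketch rests on assumptions: the input is a single <expan> element without the TEI namespace, a word written only with a symbol is modelled as an <expan> containing nothing but <ex>, and the function name render_expan is hypothetical. Sematia itself performs the layering in the browser with JavaScript, so this only illustrates the rule, not the implementation.

```python
import xml.etree.ElementTree as ET

ABBREVIATION_DUMMY = "A"


def render_expan(expan, layer):
    """Render one <expan> element for the 'original' or 'standard' layer."""
    if layer not in ("original", "standard"):
        raise ValueError(f"unknown layer: {layer}")
    parts = [expan.text or ""]
    for child in expan:
        if child.tag == "ex" and layer == "original":
            # The unwritten part is replaced by the dummy element.
            parts.append(ABBREVIATION_DUMMY)
        else:
            # Standard layer (or any other child): keep the text as edited.
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


word = ET.fromstring("<expan>στερε<ex>οῦ</ex></expan>")
symbol = ET.fromstring("<expan><ex>ἔτους</ex></expan>")

print(render_expan(word, "standard"))    # expanded word
print(render_expan(word, "original"))    # written part plus dummy
print(render_expan(symbol, "original"))  # dummy only
```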

Supplements and omissions: <supplied>, <surplus>
When there is a hole in the papyrus, it may be possible for the editor to make an educated assessment of what was probably written in the gap and to restore it. Especially if the gap is short (only a few letters) or if the missing part is in a formulaic part of a text, parallel documents help in restoring the text. When text is restored in the lacuna, it is written inside square brackets in the Leiden System, and in TEI EpiDoc XML it is marked with the tag <supplied> with the reason attribute "lost". The markup can extend over word boundaries. For example:
μ[ε]λίχρως = μ<supplied reason="lost">ε</supplied>λίχρως
ὄντ[ος ἐ]ν = ὄντ<supplied reason="lost">ος ἐ</supplied>ν
We choose the restorations for the standard layer without brackets, that is, we get μελίχρως and, in the latter example, two words: ὄντος ἐν. This way, the linguistic annotation tool correctly recognizes these words. For the original layer, however, the supplements are not taken in, since we cannot be sure that the editor is right; the ancient writer could have written a nonstandard variant even in a short space. The supplement receives the dummy marker SU in the original layer: μSUλίχρως and, in the case of two words, both get their own marker: ὄντSU SUν. Especially when there are several words in a lacuna, it is important that each word (and punctuation mark) is counted in the same way in both layers in order to keep the tokenization the same.
Another type of supplement occurs when the editor of the papyrus thinks that the ancient writer has, by mistake, not written something we would expect. The editor can add what was omitted using angle brackets in the Leiden System; in EpiDoc XML it is rendered with the <supplied> tag with the reason attribute "omitted":
ἀπ<ε>γραψάμην = ἀπ<supplied reason="omitted">ε</supplied>γραψάμην
Again, we choose the supplement for the standard layer as the editor suggests: ἀπεγραψάμην. For the original layer the supplement is replaced with the dummy marker OM, i.e. ἀπOMγραψάμην.
The opposite case is <surplus>, which indicates text which the original writer wrote, but which the editor considers superfluous. This surplus text is replaced with the marker SR in the standard layer but included as such in the original layer.
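The rules for supplements and surplus text can be put together in a small Python sketch. It is a simplified illustration under several assumptions: the text has already been split into (kind, text) segments by hand, the regular expression is a stand-in for whatever tokenization Sematia and Arethusa actually share, surplus text is masked token by token (the text above does not say whether it is), and the names mask and render are hypothetical.

```python
import re

# Dummy markers named in the text.
DUMMIES = {"lost": "SU", "omitted": "OM", "surplus": "SR"}

# Assumed token definition: a run of word characters (including combining
# diacritics) or a single punctuation mark.
TOKEN = re.compile(r"[\w\u0300-\u036f]+|[^\w\s\u0300-\u036f]")


def mask(text, dummy):
    """Replace every token in text with the dummy, keeping the whitespace."""
    return TOKEN.sub(dummy, text)


def render(segments, layer):
    """Join (kind, text) segments into the text of one layer.

    kind is 'text', 'lost', 'omitted' or 'surplus'; layer is 'original'
    or 'standard'.
    """
    out = []
    for kind, text in segments:
        if kind == "text":
            out.append(text)
        elif kind == "surplus":
            # Written by the scribe, rejected by the editor.
            out.append(text if layer == "original" else mask(text, DUMMIES[kind]))
        else:
            # Supplied by the editor, not extant (or never written).
            out.append(text if layer == "standard" else mask(text, DUMMIES[kind]))
    return "".join(out)


lacuna = [("text", "ὄντ"), ("lost", "ος ἐ"), ("text", "ν")]
print(render(lacuna, "standard"))  # two full words
print(render(lacuna, "original"))  # each supplied word part becomes SU
```

Because the whitespace inside the supplement is preserved, both layers yield the same number of whitespace-separated tokens, which is the property the word-for-word comparison depends on.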

No supplements in lacuna: <gap>
When there is a lacuna in the papyrus for which the editor has not been able to suggest a supplement, this <gap> is replaced with the dummy element G in both the standard and the original layer. The reason is that, also when annotating the standard layer, the annotator should see that the sentence is not whole.

Uncertain letters: <unclear>
The 'conscience' of a papyrologist, the underdot, signals that a letter is only partially preserved or so faded that the editor cannot be certain beyond doubt which letter the ancient writer wrote. The editor makes an assumption based on the visible ink traces and writes the letter assumed to have been written on the papyrus, but puts a dot under the letter in the edition. In EpiDoc XML those letters are marked with the tag <unclear>. In the standard layer it was an easy decision to include the uncertain letters in the same way as the supplemented letters. However, it was difficult to decide how to address the problem in the original layer, since we need the letter without markers interfering with the word recognition in the annotating environment. We decided to take the uncertain letters into the original layer in the same way as into the standard one. This may sometimes result in annotating a word which will later be read as another word. However, that may happen even in cases where the editor has not used underdots. Moreover, the annotator need not annotate the word at all if s/he does not trust the reading. The annotator has the possibility to change the text in the annotating framework, as mentioned previously in 2.1.

The apparatus: <app>
In the same way as above with <choice> (2.2.1), the apparatus criticus entries can include several options that the editor or other scholars suggest for the readings. Tags are, e.g., <app type="alternative"> or <app type="editorial">. We have again decided to give the power of decision to the user; s/he can choose the best alternative to be included in the text which will be uploaded to the annotation tool.

Technical realisation
The "Sematia" tool was conceived as a new digital platform to separate layers of linguistic variants from an EpiDoc XML document, discussed in 1.1.1. Furthermore, it was designed to act as the database for storing and eventually querying treebanks associated with each layer, as well as for handling metadata concerning the texts (discussed in III). In order to reach these goals, Sematia is now being developed from the ground up as a user-oriented web application in Python and JavaScript, backed by a MySQL database. The main advantages of using Python as the back-end language (vs., e.g., PHP or Ruby) are its excellent text processing capabilities and the popular Natural Language Toolkit [nltk] library, which will make it easy to integrate data analysis tools into the application in the future. Sematia, though still in an early development phase, is available for testing at http://sematia.hum.helsinki.fi. The complete source code can be found at https://github.com/ezhenrik/sematia/.

Data structure
As mentioned in the Introduction, Sematia's current development is tied to the research project "Act of the Scribe", which focuses on the scribal production of documentary papyri. Sematia's relational database was structured according to the needs of this project, requiring separate tables not only for documents and layers, but for the different "acts of writing" (or "hands") as well. Most importantly, each of the three layers should be associated with a single hand, separated by <handShift> elements in the TEI, instead of the whole document. In order to make the system capable of handling multiple users' documents, Sematia also needed to have an elementary login system. To match these requirements, we created a simple hierarchical database model that looks roughly as follows (fields in parentheses):
• User (name)
• Document (user_id, XML, HTML, metadata)
• Hand (document_id, no, metadata)
• Layer (hand_id, type, treebankXML, settings)
In this model, each Layer record is linked to a single Hand, which, in turn, is a child of a Document record that belongs to one of Sematia's Users. The Document table has fields for the source XML, the HTML-converted version (see 2.3.2 below) and metadata that pertain to the whole document. To avoid data duplication, no textual data is contained in the Hand table; its only purpose is to store a handful of metadata regarding each act of writing.
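The hierarchy can be mirrored in a few Python dataclasses. This is a sketch for orientation only, not Sematia's actual schema: the foreign keys (user_id, document_id, hand_id) are modelled as nesting rather than as database columns, the metadata and settings fields are assumed to be simple dictionaries, and all class and field names beyond those listed above are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Layer:
    type: str                      # "original", "standard" or "variation"
    treebank_xml: str = ""         # treebank XML produced by the annotation tool
    settings: Dict[str, str] = field(default_factory=dict)


@dataclass
class Hand:
    no: int                        # running number of the act of writing
    metadata: Dict[str, str] = field(default_factory=dict)
    layers: List[Layer] = field(default_factory=list)


@dataclass
class Document:
    xml: str                       # source EpiDoc XML
    html: str = ""                 # HTML-converted version
    metadata: Dict[str, str] = field(default_factory=dict)
    hands: List[Hand] = field(default_factory=list)


@dataclass
class User:
    name: str
    documents: List[Document] = field(default_factory=list)


# One user with one single-hand document and its three layers.
user = User(name="annotator")
document = Document(xml="<TEI/>")
hand = Hand(no=1, layers=[Layer(type=t) for t in ("original", "standard", "variation")])
document.hands.append(hand)
user.documents.append(document)
print(user)
```

Note that the Hand object carries no text of its own, in line with the design decision described above; the text lives only in the Document.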

Importing documents
We wanted to minimize user effort when creating layers with Sematia by automating the workflow wherever possible. Thus, when the user imports a document (by entering the document URI), the system automatically creates the correct number of Hand metadata records, as determined by the number of <handShift> elements in the XML, along with the original, standard and variation Layer records for each Hand.
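The record creation at import time might be sketched as below. The sketch makes an assumption that the text does not spell out, namely that the text before the first <handShift> counts as the first hand, so that n <handShift> elements yield n + 1 hands. The XML is a namespace-free toy fragment, and plan_hand_records is a hypothetical name; the records are plain dictionaries rather than database rows.

```python
import xml.etree.ElementTree as ET

LAYER_TYPES = ("original", "standard", "variation")


def plan_hand_records(xml_text):
    """Return one record per hand, each with its three layer records.

    Assumption: the text before the first <handShift> belongs to hand 1,
    so n <handShift> elements give n + 1 hands.
    """
    root = ET.fromstring(xml_text)
    shifts = sum(1 for _ in root.iter("handShift"))
    return [
        {"no": no, "layers": [{"type": layer_type} for layer_type in LAYER_TYPES]}
        for no in range(1, shifts + 2)
    ]


records = plan_hand_records('<ab>first hand <handShift new="m2"/>second hand</ab>')
for record in records:
    print(record["no"], [layer["type"] for layer in record["layers"]])
```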
Since the actual layering happens client-side with JavaScript (see 2.3.4), we also need to convert the XML tree recursively to an HTML string that can be displayed and manipulated in the browser window. A template is used in the conversion (not reproduced here). Instead of the perhaps more typical XSLT method of performing the transformation, we decided to use Python's ElementTree API here. We considered this the more elegant solution, as it takes up fewer lines of code and does the conversion dynamically without the need for a separate XSL file.
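A recursive conversion of this kind could look roughly like the following Python sketch, which uses the ElementTree API. It does not reproduce the template used in Sematia: the choice of <span> elements, the use of the tag name as CSS class and of data-* attributes for the XML attributes are assumptions made for the illustration, as is the function name to_html. Namespaced tags and attributes are not handled.

```python
import xml.etree.ElementTree as ET
from html import escape


def to_html(element):
    """Recursively convert an XML element into an HTML <span> string.

    Hypothetical mapping: the tag name becomes the CSS class and each XML
    attribute becomes a data-* attribute.
    """
    attrs = "".join(
        f' data-{escape(name, quote=True)}="{escape(value, quote=True)}"'
        for name, value in element.attrib.items()
    )
    inner = escape(element.text or "")
    for child in element:
        inner += to_html(child)
        # Text that follows the child inside the parent element.
        inner += escape(child.tail or "")
    return f'<span class="{escape(element.tag, quote=True)}"{attrs}>{inner}</span>'


source = '<choice><reg>ἐγώ</reg><orig>ἐγὸ</orig></choice>'
print(to_html(ET.fromstring(source)))
```

Keeping the tag names and attributes in the HTML is what would allow client-side code to select, hide or replace elements per layer afterwards.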

Document metadata
If the imported document is stored in one of PN's databases, Sematia will try to populate the Document-level metadata fields automatically via PN's Apache Solr API, available at http://papyri.info/solr/select/?q=id:[document id]. At the time of writing, Sematia is configured to fetch date and provenance metadata from this public API. Each metadata field, in Document as well as Hand records, can also be edited through the web interface.

Creating the layers
The layering process is described in the following steps. The layer is created client-side in the browser, using the jQuery JavaScript library, from the HTML version of the document that was created at the import stage. The web page where the layering happens is divided into three windows, the first of which is used for the transformation, the second for printing the output and the third for uploading and viewing treebanks.
1.The HTML version of the whole document gets loaded from the database into the transformation window.2. The loaded document is formatted regardless of layer type, e.g: <lb>-elements (linebreak) that have the attribute break="no" are removed in order to prevent unintended word breaks, and otherwise converted to a single space.Some CSS-styling is applied as a means to highlight different elements to guide the user.3. The elements outside the hand that this particular layer belongs to are tagged inactive with a data-attribute and hidden with CSS.This is conspicuously redundant as the whole document has already been loaded into the DOM; we have nevertheless stuck to this solution due to the fact that <handShift> elements may be found almost anywhere in the XML hierarchy, making it difficult to split the actual file without breaking the element tree.4. The layer is enabled following the rules discussed in 2.2, by marking each element with a data-attribute either for exclusion or with the replaced value.For example, if the layer type is original, each <ex> element gets a new data-layerValue attribute "A", and the children of the <choice> element are marked for exclusion, except for <orig>.The <supplied> element is a special case as it may consist of several words as well as punctuation marks.In the original layer, we want to replace each word and punctuation with "SU" or "OM" (depending on the "reason" attribute) to maintain the same word count in all layers.Likewise, we had to make sure that the tokenization would work the same way in both Sematia's layering tool and Arethusa's treebanking service.For these reasons, the regular expressions used to split up words in Sematia have to follow Arethusa's tokenization rules as far as possible.For example, Arethusa has been configured to deal with crasis (e.g.κἀγώ, "I too") by treating the merged words as separate.In Sematia, a similar mechanism is currently under development.5.In some cases, the editor of the papyrus has 
provided multiple readings of the same text part, contained in <choice> or <app> elements in the TEI.Only one of the readings should end up in the layer, making it necessary for the user to make the selection manually.This feature was implemented simply by adding a click event listener to the elements that may have multiple readings, allowing the user to choose the preferred one.When the user later returns to view or edit the same layer, these manual settings are loaded from the database.6. Lastly, text values are collected from the elements in the transformation window using the data-attributes and manual settings mentioned above, coupled with a few additional rules, and printed in the output window.The user may now copy this text to the clipboard and import it to Arethusa for annotation.The Treebank XML produced by Arethusa, in turn, may be imported to Sematia via the input form in the treebank window.
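Two of the rules above, the treatment of <lb> in step 2 and the placeholders for <supplied> in step 4, can be sketched in Python (Sematia itself performs them in the browser with jQuery). The mapping of reason="lost" to "SU" and reason="omitted" to "OM", the simplified tokenizing expression and the function names are assumptions; in particular, the expression does not reproduce Arethusa's tokenization rules.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Assumed mapping; the text only says the choice depends on "reason".
PLACEHOLDERS = {"lost": "SU", "omitted": "OM"}

# Simplified tokenizer: a run of word characters or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def join_lines(elem):
    """Flatten an element to text, applying the <lb> rule of step 2."""
    parts = []

    def walk(node, is_root=False):
        if node.tag == "lb":
            # break="no" marks a word continuing across the line break.
            if node.get("break") != "no":
                parts.append(" ")
        else:
            parts.append(node.text or "")
            for child in node:
                walk(child)
        if not is_root:
            parts.append(node.tail or "")

    walk(elem, is_root=True)
    return "".join(parts)


def supplied_placeholders(text, reason):
    """Replace each word and punctuation mark with a placeholder token."""
    placeholder = PLACEHOLDERS[reason]
    # NFC keeps a letter and its diacritics together as one word character.
    tokens = TOKEN_RE.findall(unicodedata.normalize("NFC", text))
    return " ".join(placeholder for _ in tokens)
```

Since every supplied word and punctuation mark yields exactly one placeholder, the original layer keeps the same token count as the standard layer. The sketch ignores the whitespace that usually surrounds <lb> in real EpiDoc files.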

Metadata in existing databases
The metadata which concern the actual papyrus document can be found via the Papyrological Navigator from several different databases. For example, the Heidelberger Gesamtverzeichnis der griechischen Papyrusurkunden Ägyptens (HGV) has collected information on the date and provenance of the text, the original title and the subject matter (in German); similarly, the Trismegistos portal adds metadata on the people involved and the places mentioned, to mention a few aspects. For the needs of the project "Act of the Scribe" we wish to add metadata that would help in the identification of the writers as well as the linguistic register. In addition to that, the date and provenance are extracted automatically for each document from the PN, as discussed in 2.3.2.

Metadata to be added
The new metadata always concern one act of writing; that is, each writer in a papyrus gets their own set of metadata fields. The metadata are divided into four sections: Handwriting, Writer and author, Text type, and Addressee.

Handwriting
The printed editions of papyri quite often include some sort of description of the handwriting, at least for the main hand of the text. Moreover, later research may have identified the hand as the same as in some other text, or made other observations on it. However, if the current user of Sematia has seen the original text or a photograph of it, they can add their own evaluation of the handwriting. We included four subfields for describing handwriting.
The first two, "Description in the edition" and "Custom description", are free text fields that mainly serve the user as a reference. The third field is a drop-down list for the level of professionalism with four options to choose from: Not known, Professional, Nonprofessional and Practised letterhand. The first is applicable when there is no description or photograph and no possibility to check the original. The last option falls between the professional and the nonprofessional: a person who is accustomed to writing, but has obviously not received scribal training. The fourth subfield is reserved for entering a list of texts where the same handwriting is found. This list is stored as a JSON string in the database and may be used in the future for connecting the acts of writing by the same person in queries.
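The four subfields could be represented roughly as follows. The class, attribute and method names are hypothetical; only the four professionalism options and the JSON storage of the text list come from the description above.

```python
import json
from dataclasses import dataclass, field

PROFESSIONALISM = (
    "Not known",
    "Professional",
    "Nonprofessional",
    "Practised letterhand",
)


@dataclass
class Handwriting:
    """Handwriting metadata for one act of writing (hypothetical names)."""

    edition_description: str = ""
    custom_description: str = ""
    professionalism: str = "Not known"
    same_hand_texts: list = field(default_factory=list)

    def __post_init__(self):
        # Mirror the drop-down list: only the four listed options are valid.
        if self.professionalism not in PROFESSIONALISM:
            raise ValueError(f"unknown option: {self.professionalism!r}")

    def same_hand_json(self):
        """Serialize the list of related texts for storage as a JSON string."""
        return json.dumps(self.same_hand_texts, ensure_ascii=False)
```

Storing the list as JSON keeps the Hand table to a single record per act of writing, while json.loads restores the list when the same hand needs to be traced across texts.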

Writer and author
In our project, we are interested in distinguishing the linguistic acts of the actual writer (usually a scribe who has received more or less education) from those of the author of the text, who may have dictated the text or given written and/or oral instructions. Moreover, in official contracts there may be a scribal official 'responsible' for the text, e.g. a notary who may even sign the document with his own name but is not the actual writer of the document, like the agoranomoi from Pathyris discussed by [Vierros 2012]. For these reasons, we have three categories which can be filled in if the information is available, and left blank if not: "Actual writer", "Scribal official" and "Author". For each one, there are two fields to be filled in: Name and Title. Later, when the corpus has a sufficient number of texts, this information can be used, for example, for connecting people with similar titles to similar uses of language, or even for finding texts that have been written or authored by the same person.

Text type and Addressee
The genre of the text naturally has an influence on the language used. A private letter belongs to a different register than a notarial contract. The addressee has a similar impact: the text is more formal if written to a superior than if written to a peer or subordinate. Therefore, it is important to gather this metadata when possible. We have added a drop-down list for the text type that tries to cover the basic text types found in the papyri while limiting the list to quite general categories (e.g. "contract" with certain subfields, "letter" with certain subfields, among others). For the addressee, we wanted a general description selected from a drop-down list: "official", "private" or "not known/applicable". Each of the first two options has the subfields "superordinate", "peer" and "subordinate". In addition, there are fields for the addressee's name and title.

Variation layer
Research on linguistic variation, discussed above in 1.3, is the driving force behind building the Sematia corpus. Quite a number of such phenomena can be queried by comparing the original and standard layers. For example, if we are interested in morphological case agreement, the standard layer includes the grammatically 'correct' versions and the original has the variant forms. A search comparing, e.g., the case coding included in the postag of each word reveals when a word has been written in an unexpected case (and similar comparisons can be made for mood, person, tense, etc.). The biggest missing block of linguistic information concerns phonology, since spelling is not taken into account in the existing Treebank templates. This issue is to some extent addressed in the new database of Text Irregularities within the [Trismegistos] platform compiled by [Depauw and Stolk 2015]. Their data concerns the whole Duke Databank of Documentary Papyri and is collected phoneme by phoneme, based on the editorial corrections (i.e. the tags within the <choice> element, cf. 2.2.1). The tool is neither finalized nor public at the time of writing, but it will be an important addition to the phonological studies of the papyrological material. However, it is not as accurate as we would wish for our purposes. Therefore, we plan to include at least a phonological tagset in the variation layer in Sematia. The treebank XML of the original layer would be duplicated and a variation tag added for those words where variation exists.
The case studies of linguistic variation discussed in 1.3 could have, for example, the following annotation for variation implemented into the word element.
For πουροῦ: <var type="pho" value="ου" not="υ" pre="π" fol="ρ" >
For πέμψε: <var type="pho" value="ε" not="?" pre="ψ" fol="#">
The preceding (pre) and following (fol) letters of the variant are given because phonological research is normally interested in the immediate surroundings of the letter (sound). As regards ambiguous analyses at the morphological or syntactic level, variation types for morphology and syntax could also be used when needed. For example, πέμψε, as discussed in 1.3, could be marked with something like:
<var type="syn" value="inf" or value="imp">
However, the tagset in all its depth is still under consideration.
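As an illustration of how the pre and fol values of the phonological examples above could be derived, the following sketch builds a var element from a word form and the position of the variant. The function name and its parameters are hypothetical, and using "#" for a word boundary at the start of a word (not only at the end, as in the πέμψε example) is an assumption.

```python
import unicodedata
import xml.etree.ElementTree as ET


def phonological_var(form, start, end, expected):
    """Describe form[start:end] as a variant spelling of `expected`."""
    # NFC makes each accented letter a single character, so that the
    # indices count letters rather than combining marks.
    form = unicodedata.normalize("NFC", form)
    pre = form[start - 1] if start > 0 else "#"
    fol = form[end] if end < len(form) else "#"
    return ET.Element(
        "var",
        {
            "type": "pho",
            "value": form[start:end],
            "not": expected,
            "pre": pre,
            "fol": fol,
        },
    )
```

For instance, phonological_var("πουροῦ", 1, 3, "υ") yields the attribute values of the first example above, and ET.tostring can serialize the element for insertion into the duplicated treebank XML.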

Queries
Several tools for querying treebanked data already exist. For example, both Ancient Greek Treebank corpora can be queried with [SETS Treebank Search] or [PML Tree Query Engine] (see also [Universal Dependencies]). Moreover, the PROIEL corpus is available in the INESS query interface. They employ somewhat different query languages, but all support detailed and complicated linguistic queries on the treebanked data. As the development of Sematia progresses, we will be exploring the possibility of using existing tools for querying our corpus, and the ways in which existing querying methods could be integrated into Sematia itself.
The interface needs to include the possibility for comparative queries between the original and standard layers.Moreover, we need to connect the searches with our new metadata and the variation layer.
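A comparative query of this kind can be sketched as a word-by-word comparison of two treebank XML documents. The sketch assumes <word> elements with id, form and postag attributes and a nine-character positional postag with case in the eighth position, as in the Ancient Greek Dependency Treebank tagset; the function name and the returned tuples are hypothetical. Pairing the words by position relies on both layers having the same word count.

```python
import xml.etree.ElementTree as ET

# Assumption: nine-character positional postag, case in the eighth position.
CASE_INDEX = 7


def case_mismatches(original_xml, standard_xml):
    """Return (id, form, original case, standard case) where the cases differ."""
    original_words = ET.fromstring(original_xml).iter("word")
    standard_words = ET.fromstring(standard_xml).iter("word")
    mismatches = []
    # Words are paired by position in the two layers.
    for orig, std in zip(original_words, standard_words):
        orig_tag = orig.get("postag", "")
        std_tag = std.get("postag", "")
        if len(orig_tag) <= CASE_INDEX or len(std_tag) <= CASE_INDEX:
            continue  # skip words without a full postag
        if orig_tag[CASE_INDEX] != std_tag[CASE_INDEX]:
            mismatches.append(
                (orig.get("id"), orig.get("form"),
                 orig_tag[CASE_INDEX], std_tag[CASE_INDEX])
            )
    return mismatches
```

The same comparison extends to mood, person or tense by changing the index, and the resulting word ids could then be joined with the Hand metadata to group variants by writer.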

Conclusion
In this article, we have described a process in which individual texts from the corpus of documentary Greek papyri can be preprocessed for the purposes of linguistic annotation. The annotation follows the same framework as other corpora of Ancient Greek texts. For the first time, we can automatically separate the original text written by the ancient writer from the editorial interpretation. The original layer can be studied in its own right as well as compared with the standardized version. We have not disregarded the results of the hard editorial work devoted to these texts in the previous centuries, as they form the parallel layer of the text. The layers enable the comparison of the linguistic variants abundant in the papyri with the scholarly standard forms. The tool is currently optimized for retrieving texts from the Papyrological Navigator, but nothing prevents it from being adapted to other texts encoded in EpiDoc XML, such as many epigraphic corpora.