An Ontology based Smart Management of Linguistic Knowledge

Natural language processing provides a very significant contribution to various application areas such as multilingual big data, information retrieval, data integration and multilingual web. However, handling linguistic knowledge to develop such lingware applications is a crucial issue, especially for linguistic novice users. To deal with this issue, a "smart" linguistic knowledge management may help the users to understand the meaning, scope and especially the use of related techniques and algorithms. In this paper, (1) we propose a semantic processing of linguistic knowledge based on a multilingual linguistic domain ontology, called LingOnto. Compared to related work, LingOnto handles not only linguistic data, but also linguistic processing functionalities and linguistic processing features. Besides, it allows, via a reasoning engine, inferring new linguistic knowledge and assisting in the process of proposing lingware applications. This is particularly useful for novice users, but can also provide new perspectives for the expert ones. LingOnto covers the French, English and Arabic languages. (2) We also propose an assisted user friendly ontology visualization tool called LingGraph. It facilitates the interaction with LingOnto. It offers an easy to use interface for users not familiar with ontologies. It is based on

However, handling linguistic knowledge to propose such lingware applications is a crucial issue. To deal with this issue, a smart linguistic knowledge management may help the users to understand the meaning, scope and especially the use of linguistic knowledge. This is particularly useful for novice users, but can also provide new perspectives for the expert ones.
Various linguistic registries and glossaries have been proposed. Unfortunately, such efforts provide a poor and imprecise semantic description, which is not sufficient for most lingware applications [7]. Besides, they do not support multilingualism. Ontologies are more useful as they provide more precise and semantically richer results [5]. However, most of the proposed ontologies represent only the linguistic data (e.g., Lexical unit and Part Of Speech (POS)) and neglect the linguistic processing functionalities (e.g., segmentation and POS tagging) and the linguistic processing features (e.g., processing level and analysis type). Moreover, they do not offer a reasoning engine that assists the users in understanding the linguistic knowledge and proposing lingware applications. Besides, they are hard to use for users who are less familiar or unfamiliar with ontologies, as they do not offer an ontology visualization tool to facilitate the interaction with them. Finally, most of these ontologies do not support multilingualism.
In this paper, we present a "smart" management of linguistic knowledge. To this end, (1) we propose a multilingual ontology called LingOnto, which covers the different aspects of the NLP domain. It aims to make a wide range of linguistic data, linguistic processing functionalities and linguistic processing features easily accessible to the users. Moreover, LingOnto enables reasoning, via a SWRL-based reasoning engine, about the aforementioned knowledge in order to guide the users in selecting valid NLP pipelines. For example, if the user is developing an annotation tool, they will be guided through each processing functionality choice, where only functionalities that are valid for the annotation task in the processing pipeline are made available for selection. LingOnto covers the French, English and Arabic languages. It is designed to be used by users who are not necessarily ontology experts. (2) To make LingOnto accessible to such users, we propose a user friendly ontology visualization tool called LingGraph. It offers an understandable visualization of LingOnto to both ontology and non-ontology expert users. LingGraph is based on a smart search functionality which relies on a SPARQL pattern-based approach. It extracts and visualizes the ontological view of LingOnto restricted to the components corresponding to the user's needs.
In order to evaluate LingOnto, we experiment with it in the context of lingware engineering. In particular, it is applied to a framework for identifying valid NLP pipelines.
The current paper is organized as follows. Section 2 presents some related works. Section 3 presents the multilingual linguistic domain ontology LingOnto. Section 4 presents the proposed user friendly ontology visualization tool LingGraph. The evaluation of the performance of LingOnto is presented in Section 5. Finally, Section 6 draws conclusions and future research directions.

Related Work
The present work is closely related to the following research areas: (1) linguistic knowledge representation and (2) ontology visualization.

Linguistic Knowledge Representation
Various approaches focusing on linguistic knowledge representation have been proposed. We distinguish two main categories: (1) registries-based approaches and (2) ontologies-based approaches.

Registries-Based Approaches
The SIL glossary of linguistic terms [8] represents information based on glossaries and bibliographies proposed to support linguistic research. This glossary supports only French and English linguistic terms. Moreover, it gives only the equivalent(s) of a linguistic term in the other language (i.e., it gives English glosses for French linguistic terms and French glosses for English linguistic terms). Furthermore, the relations between linguistic terms are unspecified or too general to derive the meaning of a linguistic concept within the NLP domain [9].
In [10], the authors propose WordNet, which is a large lexical database that consists of a set of synsets (i.e., sets of synonyms). These are related by semantic relations such as synonymy, hyponymy and meronymy. However, these relations are not used in a consistent way as they present redundancy [10]. Moreover, WordNet provides a poor classification of the types of numbers (i.e., Real, Rational, Natural and Integer numbers are all subsumed by Number, while they subsume each other) [11].
The ISOcat data category registry [12] defines only linguistic data at several levels, such as syntactic, morphosyntactic, terminological and lexical. However, navigating through it is a tedious task since it provides a wide range of different "views" and "groups" that specify linguistic data in a specific language data model. In this regard, the ISOcat data category registry has no underlying data model that represents linguistic data in an interrelating holistic structure.
In an attempt to define linguistic terms in a stricter manner, the CLARIN concept registry [13] takes over the work of the ISOcat data category registry. However, the latter still provides very limited structural and relational information [13].
We note that in all the above-mentioned linguistic registries, the structure of the data models, which represent the linguistic data entries in alphabetical order (e.g., the SIL glossary) or according to linguistic views (e.g., ISOcat), is not sufficient for ensuring comprehensive knowledge about linguistic data in the NLP domain. Moreover, they focus only on representing the linguistic data aspect and neglect the processing one. Finally, they define a flat semantic structure providing very unspecific relations between concepts such as "is a" or "has kinds" [9].

Ontologies-Based Approaches
In [14], the authors propose the lemon ontology, which represents lexical data on the Semantic Web. It emerges from a combination, review and extension of prior models such as LingInfo [15], LexOnto [16] and SKOS [17]. Its successor, OntoLex, is the result of opening lemon to the community under the umbrella of the W3C Ontology-Lexica community group, in order to extend and formally modularise it. OntoLex develops specifications for a lexicon-ontology model that can be used to provide rich linguistic grounding for domain ontologies. Rich linguistic grounding includes the representation of morphological and syntactic properties of lexical entries as well as the syntax-semantics interface (i.e., the meaning of these lexical entries with respect to the ontology in question). However, both lemon and OntoLex focus only on representing the linguistic data aspect and neglect the processing one. Moreover, they do not propose a reasoning mechanism.
In [18], the authors propose the General Ontology for Linguistic Description (GOLD). It provides a taxonomy of nearly 600 concepts, 76 object properties and 7 data properties. However, most of the object properties interrelate only two concepts, which leaves the majority of the concepts unrelated. Moreover, this ontology does not aim to capture the semantics of terms. It mainly classifies morphological notations, such as expressions, grammar, and meta-concepts [11]. The development of this ontology was stopped in 2010.
In [19], the authors propose the Ontologies of Linguistic Annotation (OLiA), which is based on the ISOcat data category registry and the GOLD ontology. It focuses only on modeling annotation schemes and their linking with reference categories. Conceptually, the OLiA ontology is closely related to the OntoTag ontologies proposed by [20]. One important difference is that the OntoTag ontologies consider only the languages of the Iberian peninsula (in particular Spanish).
In [21], the authors propose the Oxford Global Languages Ontology (OGL), which is developed to model and integrate only multilingual linguistic data from Oxford Dictionaries [22]. It includes elements to account for a range of information found in dictionaries, from inflected forms to semantic relations, pragmatic features and etymological data. This ontology allows linkage with lemon content and ontologies of linguistic description. The emphasis of the approach is on representing grammatical information with cross-linguistic validity and on maintaining grammar traditions in different languages as key points. However, the OGL ontology does not represent linguistic processing functionalities and features.
In [23], the authors propose LexInfo, which is an extensive ontology of types, values and properties derived partially from ISOcat. Currently, the elements of this ontology capture information from the morphosyntactic, syntactic, syntactic-semantic, semantic and pragmatic levels of linguistic description. However, LexInfo covers only the linguistic data and does not offer a reasoning mechanism.
In [11], the author proposes a linguistic ontology for the Arabic language, which is a formal representation of the concepts that the Arabic terms convey. This ontology is considered as an "Arabic WordNet" as it uses the same structure. It currently consists of about 1,000 well investigated concepts in addition to 11,000 concepts that are partially validated. However, this ontology does not support multilingualism as it considers only the Arabic language.
We note that all the above-mentioned ontologies focus only on representing the linguistic data aspect and neglect the processing one. Furthermore, they do not propose a reasoning mechanism. Besides, they are hard to use for users who are less familiar or unfamiliar with ontologies, as they do not offer an ontology visualization tool to facilitate the interaction with them. Finally, most of these ontologies do not support multilingualism.

Ontology Visualization
In the literature, various ontology visualization tools have been proposed. However, most of them are designed to be used only by ontology experts and they overlook the importance of the usability and understandability requirements. According to [24], the generated visualizations "are hard to read for casual users". For instance, GrOWL and SOVA are intended to offer an understandable visualization by defining notations using different symbols, colors, and node shapes for each ontology key-element. However, the proposed notations contain many abbreviations and symbols from Description Logic. As a consequence, the generated visualizations are not suitable for non-ontology expert users. OWLViz, OntoTrack [25], KC-Viz and OntoViz show only specific element(s) of the ontology. For instance, OWLViz and KC-Viz visualize only the class hierarchy of the ontology and OntoViz shows only inheritance relationships between the graph nodes. This differs from TGViz Tab [26] and NavigOWL [27], which provide visualizations representing all the key elements of the ontology. However, these tools do not make a clear visual distinction between the different ontology key-elements. For instance, they use a plain node-link diagram where all the links and nodes look the same except for their color. This issue has a bad impact on the understandability of the generated visualization.
Only very few visualization tools are designed to be used by non-ontology experts, such as OWLeasyViz [28], Protégé VOWL [24] and WebVOWL [24]. However, these tools are either not available for download or use some Semantic Web terms, which has a bad impact on the understandability of the generated visualization, especially for non-ontology expert users.
Most of these tools offer a basic keyword-based search interaction technique. It is based on a simple matching between the ontology's elements and the keyword that the users are looking for. However, they do not offer an advanced search that extracts a combination of components according to the user's need.

LingOnto: A Multilingual Linguistic Domain Ontology
In this section, we present our ontology-based smart management of linguistic knowledge called LingOnto. It is freely available online. The current version of LingOnto covers the English, French and Arabic languages. Compared to related work, it handles not only linguistic data, but also linguistic processing functionalities and linguistic processing features. Besides, it allows, via a reasoning engine, inferring new linguistic knowledge and assisting in the process of proposing lingware applications (e.g., it helps the users to avoid incoherency errors by assisting them in selecting only compatible linguistic processing functionalities).

Representing Linguistic Knowledge
We rely on the design principles defined by [29], which are objective criteria for proposing and evaluating ontology designs, such as clarity, coherence, minimal encoding bias and minimal ontological commitments. Following these principles, we define the top-level concepts of our ontology, which are linguistic data, linguistic processing functionalities and linguistic processing features. These are discussed further in the following sections.

Linguistic Data Classification
Referring to the ISOcat standard, we identify a set of linguistic data concepts. We choose this registry for the following reasons:
• It covers more terms of linguistic data categories compared to other resources. For instance, it holds 115 possible values of "PartOfSpeech" such as (Adjective), (Verb), (Noun) and (Adverb), while the GOLD ontology has only 81 values.
• It defines linguistic data categories at several levels such as syntactic (e.g., noun phrase, verb phrase and prepositional phrase), morphosyntactic (e.g., number and gender), terminological (e.g., processes, properties and functions) and lexical categories (e.g., nouns, verbs, adjectives and adverbs).
• It supports various languages. For instance, it provides a description of usage in language-specific contexts, including definitions, usage notes and/or lists of values.
For each extracted linguistic data concept, we identify the concepts which are related to it as well as the names of the associated relations. Fig. 1 shows an excerpt of LingOnto, illustrating the classification of some Arabic linguistic data. Indeed, in contrast to English sentences, which are fundamentally in the (subject-verb) order, Arabic sentences can be nominal (subject-verb) or verbal (verb-subject), with a free order. Thus, we define an "is a" object property relating the ("Phrase") class and the ("Noun Phrase") and ("Verbal Phrase") classes. Furthermore, in the French and English languages, the affix is classified into prefixes, suffixes, infixes, circumfixes, and superfixes. However, in the Arabic language, the affix is classified only into prefixes, suffixes and infixes. Consequently, we define an "is a" object property between the ("Prefix"), ("Suffix") and ("Infix") classes and the ("Affix") class. Moreover, Arabic differs phonetically, morphologically, syntactically and semantically from the English and French languages. For instance, Arabic has a rich and complex inflectional morphology involving gender, number, person, aspect, mood, case, state and voice, as well as the cliticization of a number of pronouns and particles (e.g., conjunctions, prepositions and the definite article). Syntactically, Arabic sentences tend to be very long, with a complex syntax compared to the English and French languages (e.g., a single verbal sentence can consist of more than 50 lexical units).
Fig. 1 The classification of some Arabic linguistic data

Linguistic Processing Functionalities Classification
Referring to well-known NLP toolkits such as Apache OpenNLP [30], Stanford CoreNLP [31], FreeLing [32] and LingPipe [33], and to two language processing platforms, Language Grid [34] and GATE [35], we identify a set of linguistic processors such as POS Tagger, Lemmatizer, Morphological Analyzer and Chunker. Some of these linguistic processors often implement one or two linguistic processing functionalities. For instance, a Morphological Analyzer processor for the French and English languages usually implements the Paragraph splitting, Sentence splitting, Tokenization, POS tagging and Lemmatization processing functionalities. Nevertheless, a Morphological Analyzer processor for the Arabic language, especially for analysing undiacritized texts, implements the Paragraph splitting, Sentence splitting, Tokenization, Diacritization, POS tagging and Lemmatization processing functionalities. Therefore, automatic diacritization is an essential processing functionality for many Arabic lingware applications. Moreover, Arabic sentence components can be swapped without affecting the structure or meaning. This leads to more syntactic and semantic ambiguity than in the English and French languages.
According to [36], hierarchical inter-dependencies exist between the linguistic processing functionalities. Indeed, a linguistic processing functionality used to perform a given analysis at one level may require, as input, the results of other analyses related to a lower level. For instance, to annotate a French text, the text must be tokenized, the sentences should be clearly separated from each other and their morphological properties have to be analyzed before starting the parsing functionality. Consequently, we identify the object property "Requires". As shown in Fig. 2, the ("Tokenization") class is in relation with the ("Sentence Splitting") class through the object property "Requires". Moreover, each linguistic processing functionality uses various linguistic data as inputs and others as outputs. Hence, we propose the object properties "Has Input" and "Has Output". For instance, as shown in Fig. 2, the ("Tokenization") class is in relation with the ("Sentence") class through the "Has Input" object property. It is also in relation with the ("Lexical unit") class through the "Has Output" object property.
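The hierarchical dependency expressed by "Requires" can be illustrated with a minimal Python sketch that orders functionalities so that each one comes after those it requires. Only the fact that Tokenization requires Sentence Splitting comes from the text; the other two entries, the `REQUIRES` dictionary and the `processing_order` helper are assumptions introduced for illustration and are not part of LingOnto.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Maps each functionality to the set of functionalities it requires.
# Only "Tokenization requires Sentence Splitting" is stated in the text;
# the two other entries are assumptions added for illustration.
REQUIRES = {
    "Tokenization": {"Sentence Splitting"},
    "Morphological Analysis": {"Tokenization"},  # assumed
    "Parsing": {"Morphological Analysis"},       # assumed
}

def processing_order(requires):
    """Return the functionalities so that each one follows those it requires."""
    return list(TopologicalSorter(requires).static_order())

print(processing_order(REQUIRES))
# ['Sentence Splitting', 'Tokenization', 'Morphological Analysis', 'Parsing']
```

Because the assumed dependencies form a single chain, the order is unique; with branching dependencies several valid orders would exist.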

Linguistic Processing Features Classification
The linguistic processing functionalities are characterized by several linguistic features. LingOnto models these features to ease the process of proposing lingware applications, as they help identify incoherence between linguistic processing functionalities. We present in Table 1 some examples of the linguistic processing features. The English, French and Arabic languages are based on the same linguistic processing features. Indeed, according to [37], a comparative study of English, French and Arabic sentences shows that it is possible, from the linguistic viewpoint, to adopt the same typology of ellipses (i.e., Gapping, Right-node Raising, Coordination Reduction) for the Arabic language as the one proposed for the English and French languages.
Fig. 3 shows the proposed classification of the linguistic processing features. Each processing level is characterized by its related phenomena. Hence, we define the object property "has Phenomenon" between the ("Processing Level") and ("Phenomena") classes. Moreover, each phenomenon has its subphenomena. For example, the ellipsis phenomenon can be a nominal ellipsis or an ellipsis of the whole sentence. For this reason, we define the "refined into" reflexive object property. The linguistic phenomenon also has the relations "supported By" and "treated By" with the ("Formalism") and ("Approach") classes, respectively. Each formalism has an analysis type to solve any linguistic phenomenon. For example, the sentence "Jean dropped the plate. It shattered loudly." illustrates the Anaphora phenomenon. In this sentence, the pronoun "it" is an anaphor and it points to the left toward its antecedent "the plate". Finally, each processing level uses a linguistic resource related to a phenomenon. Hence, we define the object property "has Resource" relating the ("Processing Level") and ("Linguistic Resource") classes.
Fig. 3 The classification of some linguistic processing features

Reasoning about Linguistic Knowledge
LingOnto proposes a set of SWRL rules to reason about linguistic knowledge, infer new data and assist the users in understanding the NLP domain. Compared to axioms, SWRL rules offer a predefined list of built-ins that facilitate the expression of rules, such as:
• For comparison: swrlb:notEqual, swrlb:lessThan and swrlb:greaterThan.
We categorize the proposed SWRL rules into two categories: (1) SWRL rules for lingware applications development assistant and (2) SWRL rules for NLP domain understanding assistant.

SWRL Rules for Lingware Applications Development Assistant
LingOnto proposes a set of SWRL rules that assist the users in selecting compatible linguistic processing functionalities in order to identify valid NLP pipelines. Fig. 4 shows some examples.
• Rule R1 states that if a processing functionality "x" requires a processing functionality "y" and a processing functionality "z" requires the processing functionality "x", then a "requires" relation between the processing functionalities "z" and "y" is inferred. This rule means that processing functionalities can be chained in an NLP pipeline only if each one requires the other.
• Rule R2 states that if a processing functionality "x" has as input a linguistic data "i" and a processing functionality "y" has as output the linguistic data "i", then a "requires" relation between the processing functionalities "x" and "y" is inferred. This rule means that if a linguistic processing functionality requires, as input, the results of another processing functionality, then the two can be chained in an NLP pipeline.
• Rule R3 states that if a processing functionality "x" requires a processing functionality "y", the processing functionality "x" uses a linguistic resource "j", and the processing functionalities "x" and "y" belong to the same linguistic processing level, then a "use" relation between the processing functionality "y" and the linguistic resource "j" is inferred. This rule means that two chained processing functionalities that belong to the same linguistic level should use the same linguistic resource.
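As an illustration only, the effect of rules R1 to R3 can be sketched in plain Python by forward chaining over (subject, property, object) triples. This is not the SWRL encoding shown in Fig. 4: the property names `requires`, `hasInput`, `hasOutput`, `uses` and `belongsTo`, the function `apply_rules`, and all example facts other than the Tokenization input and output are assumptions made for the sketch.

```python
def apply_rules(facts):
    """Forward-chain rules R1-R3 until no new triple can be derived."""
    facts = set(facts)
    while True:
        requires = {(s, o) for s, p, o in facts if p == "requires"}
        has_input = {(s, o) for s, p, o in facts if p == "hasInput"}
        has_output = {(s, o) for s, p, o in facts if p == "hasOutput"}
        uses = {(s, o) for s, p, o in facts if p == "uses"}
        level = {s: o for s, p, o in facts if p == "belongsTo"}
        new = set()
        # R1: requires(x, y) and requires(z, x) => requires(z, y)
        for z, x in requires:
            for x2, y in requires:
                if x == x2:
                    new.add((z, "requires", y))
        # R2: hasInput(x, i) and hasOutput(y, i) => requires(x, y)
        for x, i in has_input:
            for y, i2 in has_output:
                if i == i2:
                    new.add((x, "requires", y))
        # R3: requires(x, y), uses(x, j), same processing level => uses(y, j)
        for x, y in requires:
            if x in level and level[x] == level.get(y):
                for x2, j in uses:
                    if x2 == x:
                        new.add((y, "uses", j))
        if new <= facts:  # fixpoint reached
            return facts
        facts |= new

# Example facts; apart from the Tokenization input/output, all are hypothetical.
FACTS = {
    ("Tokenization", "hasInput", "Sentence"),
    ("Tokenization", "hasOutput", "Lexical unit"),
    ("Sentence Splitting", "hasOutput", "Sentence"),
    ("POS tagging", "hasInput", "Lexical unit"),
    ("POS tagging", "belongsTo", "Morphological Level"),
    ("Tokenization", "belongsTo", "Morphological Level"),
    ("POS tagging", "uses", "Lexicon"),
}
INFERRED = apply_rules(FACTS) - FACTS
print(sorted(INFERRED))
```

On these facts, R2 derives that Tokenization requires Sentence Splitting and that POS tagging requires Tokenization; R1 then derives that POS tagging requires Sentence Splitting, and R3 that Tokenization uses the same (hypothetical) lexicon as POS tagging.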

SWRL Rules for NLP Domain Understanding Assistant
LingOnto proposes a set of SWRL rules to assist the users in understanding the meaning of different linguistic knowledge. Fig. 5 shows some examples.
• Rule R4 states that if a phrase "x" has as main part a verb "y", then the phrase "x" is a verb phrase. This rule means that if a phrase contains both the verb and either a direct or indirect object, then it represents a verb phrase.
• Rule R5 states that if an affix "x" surrounds a stem "y", then the affix "x" is a circumfix. This rule means that if an affix has two parts, where one is placed at the start of a word and the other at the end, then this affix represents a circumfix.
• Rule R6 states that if a lexical unit "x" has a neuter gender, then the lexical unit "x" is in English. Since we work only with three languages, which are French, English and Arabic, if the gender of a word is neuter then the word can only be written in English.
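A small Python sketch of how rules R4 to R6 classify individuals is given below. It follows the textual explanation of each rule (for R5, the affix is the element classified as a circumfix) and is not the SWRL syntax of Fig. 5; the property names, the `classify` function and the individuals `ph1`, `v1`, `af1`, `st1` and `lu1` are hypothetical.

```python
def classify(facts):
    """Apply rules R4-R6 once and return the inferred triples."""
    types = {(s, o) for s, p, o in facts if p == "type"}
    inferred = set()
    for s, p, o in facts:
        # R4: a phrase whose main part is a verb is a verb phrase.
        if p == "hasMainPart" and (s, "Phrase") in types and (o, "Verb") in types:
            inferred.add((s, "type", "Verb Phrase"))
        # R5: an affix that surrounds a stem is a circumfix.
        if p == "surrounds" and (s, "Affix") in types and (o, "Stem") in types:
            inferred.add((s, "type", "Circumfix"))
        # R6: among French, English and Arabic, a neuter lexical unit is English.
        if p == "hasGender" and o == "neuter" and (s, "Lexical unit") in types:
            inferred.add((s, "language", "English"))
    return inferred

# Hypothetical individuals used only to exercise the three rules.
FACTS = {
    ("ph1", "type", "Phrase"), ("v1", "type", "Verb"),
    ("ph1", "hasMainPart", "v1"),
    ("af1", "type", "Affix"), ("st1", "type", "Stem"),
    ("af1", "surrounds", "st1"),
    ("lu1", "type", "Lexical unit"), ("lu1", "hasGender", "neuter"),
}
print(sorted(classify(FACTS)))
```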

LingGraph: Ontology Visualization Tool of LingOnto
The LingOnto domain ontology is designed to be used by users who are not necessarily ontology experts. Visualizations are usually proposed to help in this regard by assisting in sense-making. However, the large amount of linguistic knowledge covered by LingOnto makes the visualization hard to comprehend due to visual clutter and information overload. To overcome this issue, we propose a user friendly ontology visualization tool called LingGraph. It is freely available online. The main aim of this tool is to offer an understandable visualization to both ontology and non-ontology expert users. To support the large amount of linguistic knowledge covered by LingOnto, LingGraph is based on a smart search functionality which relies on a SPARQL pattern-based approach. It extracts and visualizes an ontological view of LingOnto restricted to the components corresponding to the user's needs. Moreover, it offers an easy-to-understand wording. For instance, it does not use a Semantic Web vocabulary. LingGraph is mainly designed to be integrated into a linguistic framework. It can be integrated into other applications for non-ontology experts and it can be used as a standalone application by ontology experts.

Graph-based Visualization
LingGraph visualizes the ontology, formalized in OWL 2, as a graph. It is based on a force field algorithm, which has two main advantages. (1) It ensures an optimal use of the screen. It displays the nodes in such a way that those that are closely connected are shown in the center of the visualization, while the ones that are less connected are placed at the edges. (2) It improves the readability of the graph by avoiding crossing links and displaying all the key elements of the ontology. Moreover, it allows representing the object properties between the concerned nodes by using labeled links. In order to be differentiated from the instances, the classes are displayed in a larger size.
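The text does not specify which force-based algorithm LingGraph uses. Purely as an illustration of the general idea, the sketch below follows the classic Fruchterman-Reingold scheme, in which all nodes repel each other and linked nodes attract each other; the function `force_layout`, its parameters and the example graph (built from the Tokenization relations mentioned earlier) are assumptions.

```python
import math
import random

def force_layout(nodes, edges, iterations=200, size=1.0, seed=0):
    """Return {node: (x, y)} computed by a simple force-directed layout."""
    rng = random.Random(seed)            # fixed seed: reproducible layout
    pos = {n: [rng.uniform(0, size), rng.uniform(0, size)] for n in nodes}
    k = size / math.sqrt(len(nodes))     # ideal distance between nodes
    temperature = size / 10              # maximum move per iteration
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsive force between every pair of nodes.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[a][0] += dx / dist * force
                disp[a][1] += dy / dist * force
                disp[b][0] -= dx / dist * force
                disp[b][1] -= dy / dist * force
        # Attractive force along every edge.
        for a, b in edges:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[a][0] -= dx / dist * force
            disp[a][1] -= dy / dist * force
            disp[b][0] += dx / dist * force
            disp[b][1] += dy / dist * force
        # Move each node along its net force, capped by the temperature.
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            step = min(length, temperature)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temperature *= 0.95              # cool down so the layout settles
    return {n: (x, y) for n, (x, y) in pos.items()}

NODES = ["Tokenization", "Sentence Splitting", "Sentence", "Lexical unit"]
EDGES = [("Tokenization", "Sentence Splitting"),
         ("Tokenization", "Sentence"),
         ("Tokenization", "Lexical unit")]
LAYOUT = force_layout(NODES, EDGES)
for name, (x, y) in LAYOUT.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

A production tool would also keep nodes inside the drawing area and render labeled links; those parts are omitted here.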

Smart Search Interaction Functionality
The smart search interaction functionality is based on a SPARQL pattern-based approach. The aim is to extract and visualize an excerpt ontological view of LingOnto which contains only the components corresponding to the user's need. This need is materialized by a set of predefined search criteria C = (C1, ..., Cn) such as "Abstraction Level", "Processing Level" and "Language".
We ask some users (expert and novice users) to fill in a pre-questionnaire about what they need to know as linguistic knowledge. We notice that their needs are very regular, as all of them search for the abstraction level (e.g., linguistic data and/or processing functionalities and features) of one or more given processing levels and/or one or more given languages. This observation leads us to propose an approach based on a set of SPARQL patterns P = (P1, ..., Pk).

Pattern Definition
A pattern P is a couple (G, Q) such that:
• G is a connected RDF graph, which describes the general structure of the pattern and represents a family of queries;
• Q represents the qualifying elements that characterize the pattern and will be taken into account during the mapping of the user query and the considered pattern. A qualifying element can either be a vertex (representing a class or a datatype) or an edge (representing an object property or a datatype property) of G.
Fig. 6 displays a pattern covering the need: [C1 = "Abstraction Level", CP1/1 = "Processing Functionalities"], [C2 = "Processing Level", CP1/2 = "Lexical Level", CP2/2 = "Morphological Level"], [C3 = "Language", CP1/3 = "Arabic"]. In this pattern, the vertices C1 and C2 and the arc r1 are called qualifying elements. Each vertex defines a selected criterion Ci (i.e., the vertex C1 defines the selected criterion "Abstraction Level" and the vertex C2 defines the selected criterion "Processing Level"). Each vertex must be replaced by a resource in order to turn the pattern into a query. This means that, to obtain the query graph corresponding to the user's need, each vertex must be substituted by the selected preferences of the concerned search criterion. Each preference CPj/i (j ∈ [1, n]) has a corresponding concept in LingOnto having the same name. This process is called an instantiation.

Pattern Instantiation
In this section, we explain the instantiation of a qualifying element of a pattern.In other words, we will see how the query graph is transformed when one of its qualifying elements is brought closer to an element of the user's need.
For every qualifying element q of a pattern p = (G, Q) and every resource α extracted from the user request (which can be either a class, an instance, or a property), we denote by I(p, q, α) = (G′, Q′) the pattern obtained after the instantiation of q by the resource α in the pattern p. This instantiation is only possible if q and α are compatible:
• q is a class and α an instance of q. Then the instantiation of the qualifying concept consists in replacing the URI of the class by the URI of the instance.
• q is a datatype and α a value corresponding to the type q. Then the instantiation of the qualifying concept consists in replacing the URI of the datatype by the value α.
• q is a property and α the same property or one of its sub-properties. Then the instantiation of the qualifying edge consists in replacing the URI of the edge by the URI of the property α.
The instantiation of the pattern shown in Fig. 6 leads, after substitution of each qualifying element by the selected preferences, to the query graph shown in Fig. 7.
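The instantiation step can be sketched in Python as follows. The sketch covers the class/instance and property/sub-property cases and leaves out the datatype case; the `Pattern` class, the `instantiate` function, the lookup tables and the choice to remove q from the qualifying elements once it is instantiated are all assumptions made for illustration, and the arc name `r1` is kept as an opaque label.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    graph: frozenset       # G: triples (subject, predicate, object)
    qualifying: frozenset  # Q: vertices or edges of G that can be instantiated

def instantiate(pattern, q, alpha, instance_of, sub_property_of):
    """Return I(p, q, alpha): the pattern with q replaced by alpha, if compatible."""
    if q not in pattern.qualifying:
        raise ValueError(f"{q!r} is not a qualifying element")
    is_instance = instance_of.get(alpha) == q                       # class / instance
    is_property = alpha == q or sub_property_of.get(alpha) == q     # (sub-)property
    if not (is_instance or is_property):
        raise ValueError(f"{alpha!r} is not compatible with {q!r}")
    new_graph = frozenset(
        tuple(alpha if term == q else term for term in triple)
        for triple in pattern.graph
    )
    # Assumption: an instantiated element no longer counts as qualifying.
    return Pattern(new_graph, pattern.qualifying - {q})

# Hypothetical lookup table: which criterion (class) each preference belongs to.
INSTANCE_OF = {
    "Processing Functionalities": "Abstraction Level",
    "Lexical Level": "Processing Level",
}
PATTERN = Pattern(
    graph=frozenset({("Abstraction Level", "r1", "Processing Level")}),
    qualifying=frozenset({"Abstraction Level", "Processing Level", "r1"}),
)
STEP1 = instantiate(PATTERN, "Processing Level", "Lexical Level", INSTANCE_OF, {})
STEP2 = instantiate(STEP1, "Abstraction Level", "Processing Functionalities",
                    INSTANCE_OF, {})
print(sorted(STEP2.graph))
```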

Generation of the SPARQL Query
A question mark in front of an element means that this element is one of the objects of the query.Therefore, we find the qualifying vertices associated with these query elements in the SELECT clause of our SPARQL query.
For each query element preceded by a question mark: if the qualifying vertex in question refers to a class or a datatype, it has already been replaced by a variable in the previous step, so we add this same variable to the SELECT clause. Otherwise (the qualifying vertex refers to a relation), it is a request for specialization or generalization of a relation. In this case, the qualifying vertex is replaced in the query graph by a variable, explicitly declared as a sub-property or super-property of the referenced relation by two triple patterns combined as alternatives with UNION; this variable is also added to the SELECT clause.
We have thus identified all the elements of the graph on which the query is based and obtained the definitive query graph, which will form the content of the WHERE clause of our query. Fig. 8 shows the generated SPARQL query corresponding to the query graph in Fig. 7.
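The assembly of the SELECT and WHERE clauses can be sketched as follows. This is not the query of Fig. 8: the `to_sparql` function, the `ling:` prefix, its IRI and the property names are hypothetical, and the UNION handling for sub-properties and super-properties is omitted.

```python
def to_sparql(query_graph, prefixes):
    """Build a SELECT query from (subject, predicate, object) triple patterns."""
    variables = []
    for triple in query_graph:
        for term in triple:
            # Terms starting with "?" are the objects of the query.
            if term.startswith("?") and term not in variables:
                variables.append(term)
    prefix_lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    where_lines = [f"  {s} {p} {o} ." for s, p, o in query_graph]
    return "\n".join(
        prefix_lines
        + ["SELECT " + " ".join(variables), "WHERE {"]
        + where_lines
        + ["}"]
    )

# Hypothetical namespace and property names, used only for this example.
PREFIXES = {"ling": "http://example.org/lingonto#"}
QUERY_GRAPH = [
    ("?functionality", "ling:hasProcessingLevel", "ling:LexicalLevel"),
    ("?functionality", "ling:hasLanguage", "ling:Arabic"),
]
QUERY = to_sparql(QUERY_GRAPH, PREFIXES)
print(QUERY)
```

Each variable is listed once in the SELECT clause, in order of first appearance, and every triple pattern of the query graph becomes one line of the WHERE clause.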
Fig. 8 The generated SPARQL query associated to the request graph shown in Fig. 7

Experimentation
We apply the proposed ontology LingOnto to a linguistic framework for identifying valid NLP pipelines. To ensure an understandable visualization of LingOnto, we integrate our ontology visualization tool LingGraph into this framework. Then, we evaluate the efficiency of our ontology in identifying valid NLP pipelines.

Application to an NLP Pipelines Identification Framework
Lingware applications are defined as a sequence of many individual components that solve real-world problems [40]. However, combining multiple components in a particular order into a processing pipeline is a tedious task, which can be a barrier for domain experts and especially for novice ones. LingOnto is applied to a framework for identifying valid NLP pipelines. It targets novice users in the lingware engineering area.
As shown in Fig. 9, the user starts by selecting the preferences "Lexical level" and "Morphological level" as a Processing Level, "Arabic" as a Language and "Linguistic processing" as an Abstraction level.Consequently, based on the smart search interaction functionality, an excerpt ontological view corresponding to the expressed need is generated.
Then, the user starts the process of identifying an NLP pipeline related to the target lingware application.Consequently, the framework offers, under "Next choices", a set of possible processing functionalities which can be added after each selected functionality.This list is generated based on the predefined SWRL rules.For instance, Fig. 10 shows that after a "Pos-tagging" functionality, only "NER", "Dependency-parsing" or "Tokenization" functionalities may be added.These latter can be added to the pipeline by double-clicking on them.
If the user selects a processing functionality out of the list under "Next Choices", the framework displays an error message "Incompatible Functionalities" and indicates using the red color an alternative valid pipeline.As shown in Fig. 11, the ("Diacritization") functionality can be added to the pipeline only after ("Pos tagging") and ("NER") functionalities.The final NLP pipeline is shown in Fig. 12.
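The "Next choices" behaviour can be illustrated with a small sketch. It replaces the SWRL rules and the reasoning engine with a plain lookup table, which is an assumption made purely for illustration. The successors of "Pos-tagging" come from the Fig. 10 example, while the entry for "NER" and the rule for an empty pipeline are hypothetical.

```python
# Allowed successors per functionality. In the framework this knowledge is
# inferred from SWRL rules; a hard-coded table stands in for it here.
NEXT_CHOICES = {
    "Pos-tagging": {"NER", "Dependency-parsing", "Tokenization"},  # Fig. 10 example
    "NER": {"Diacritization"},  # assumption, loosely based on the Fig. 11 example
}


def next_choices(pipeline):
    """Return the functionalities that may follow the last selected one."""
    if not pipeline:
        # Assumption: any functionality known to the table may start a pipeline.
        return set(NEXT_CHOICES)
    return NEXT_CHOICES.get(pipeline[-1], set())


def add_functionality(pipeline, candidate):
    """Return a new pipeline with the candidate appended, or reject it."""
    if candidate not in next_choices(pipeline):
        raise ValueError("Incompatible Functionalities")
    return pipeline + [candidate]


pipeline = add_functionality(["Pos-tagging"], "NER")
pipeline = add_functionality(pipeline, "Diacritization")
print(pipeline)
```

Selecting "Diacritization" directly after "Pos-tagging" raises the error in this sketch, which mirrors the message displayed by the framework. Suggesting an alternative valid pipeline is not covered here.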

Evaluation
In this section, we evaluate the performance of LingOnto in identifying valid NLP pipelines associated with lingware applications. This evaluation consists of three steps:
• Step 1: we propose 63 lingware applications, which have to be solved by identifying their corresponding NLP pipelines using LingOnto. We classify these applications into (1) low-level and (2) high-level applications. Then, we classify the applications in each group according to the language (i.e., French, English and Arabic). Table 2 shows some examples.
• Step 2: we recruit three linguistic experts. The first one is a member of the Arabic Natural Language Processing Research Group (ANLP-RG) of the MIRACL laboratory (Tunisia, Sfax). The second is a member of the CEDRIC laboratory (France, Paris). The last expert is a member of the Formal linguistics laboratory (France, Paris). We ask each expert to provide, manually, all the possible pipeline(s) which may solve each lingware application related to their native language (i.e., French, English and Arabic).
• Step 3: we identify, using the linguistic framework, all the possible NLP pipeline(s) corresponding to each lingware application identified in Step 1.
Then, the experts provide their feedback on each generated pipeline ("Valid" or "Not valid"). The experts may also provide a textual explanation.
We use the precision and recall metrics [41] to evaluate the performance of LingOnto. Recall measures the proportion of valid NLP pipelines identified using the linguistic framework among the pipelines identified by the domain expert. Precision measures the proportion of valid pipelines identified using the linguistic framework within the total number of pipelines identified by the framework. We evaluate the performance of the linguistic framework in identifying valid pipelines associated with the proposed low-level and high-level applications, as shown in Fig. 13 and Fig. 14.
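As a minimal sketch of these two metrics, the function below compares the pipelines generated by the framework with the expert's pipelines for one application. It assumes that a generated pipeline counts as valid when it exactly matches one of the expert's pipelines, which simplifies the manual "Valid" or "Not valid" judgement described above. The pipelines in the example are toy data, not results from the evaluation.

```python
def precision_recall(generated, expert):
    """Compute precision and recall for one lingware application.

    generated: pipelines produced by the framework (each a tuple of names).
    expert:    pipelines provided manually by the domain expert.
    """
    generated, expert = set(generated), set(expert)
    # Simplifying assumption: valid means "also listed by the expert".
    valid = generated & expert
    precision = len(valid) / len(generated) if generated else 0.0
    recall = len(valid) / len(expert) if expert else 0.0
    return precision, recall


# Toy data for illustration only.
generated = [
    ("Tokenization", "Pos-tagging"),
    ("Tokenization", "Pos-tagging", "NER"),
    ("Pos-tagging", "Tokenization"),
    ("NER", "Tokenization"),
]
expert = [
    ("Tokenization", "Pos-tagging"),
    ("Tokenization", "Pos-tagging", "NER"),
    ("Tokenization", "Pos-tagging", "Dependency-parsing"),
]
precision, recall = precision_recall(generated, expert)
print(f"precision={precision:.2f} recall={recall:.2f}")
```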
The precision and recall metrics indicate that LingOnto is efficient in identifying valid NLP pipelines for both high and low processing levels. Indeed, as shown in Fig. 13, the overall mean precision values for the English and French languages (86.3% and 92.3%) are close. This similarity is explained by the lexical similarity between these languages (similarity in both form and meaning): they have the same alphabet, sometimes use similar grammatical structures and have several lexical units in common.

Fig. 13 Recall and Precision performance for low-level lingware applications

However, the overall mean precision values for these languages (86.3% and 92.3%) are higher than the one for the Arabic language (78%). This gap is explained by the fact that the Arabic language differs morphologically, syntactically and semantically from the English and French languages. For instance, syntactically, Arabic sentences tend to be long, with complex syntax, and their components can be reordered without affecting the structure or meaning. These issues lead to syntactic and semantic ambiguity. Besides, the NLP toolkits and frameworks used to build LingOnto are more mature for the English and French languages than for the Arabic language.
Furthermore, this gap affects the performance of LingOnto in identifying valid pipelines for high-level Arabic applications, as shown in Fig. 14. This is explained by the fact that the high-level applications depend on the low-level ones. For instance, syntactic analysis, such as parsing, usually requires lexical units to be clearly delineated and part-of-speech tagging or morphological analysis to be performed first. In practice, this means that texts must be tokenized, their sentences clearly separated from each other, and their morphological properties analyzed before the parsing process.

Conclusion and Future Work
This paper addresses the issue of assisting users in understanding the different aspects of the linguistic domain and easing the process of proposing lingware applications. We propose an ontology-based smart management of linguistic knowledge. Compared to available works, this ontology represents linguistic data, linguistic processing functionalities and linguistic processing features. Furthermore, it allows reasoning about this knowledge via a SWRL-based reasoning engine. Currently, three languages are supported: English, French and Arabic. LingOnto is designed to be used mainly by linguistic users, who are usually not familiar with ontologies. To address this issue, we propose LingGraph, a user-friendly ontology visualization tool. It is designed to be used by both ontology experts and non-experts. To support an understandable visualization, LingGraph is based on a "smart" search functionality that relies on a SPARQL pattern-based approach. This approach extracts and visualizes an excerpt ontological view from LingOnto containing only the components corresponding to the user's needs. Finally, we evaluate the performance of LingOnto in identifying valid NLP pipelines for 63 proposed lingware applications. The results show that the proposed ontology is efficient in identifying valid NLP pipelines.
For future research, we plan to propose a collaborative ontology maintenance tool. It will rely on a Java API built on the Jena library. It will allow linguistic experts, independently of their native language, to manipulate LingOnto (i.e., adding and updating concepts). However, this tool should handle various types of imperfection, such as the imperfection of the proposed knowledge and conflicts between the experts' opinions. Moreover, in case of disagreement between the experts' opinions, we plan to use Cohen's Kappa coefficient to measure inter-rater agreement. Besides, we suggest exploiting the NLP domain experts' feedback to improve the NLP pipelines identified as Not valid. In addition, we plan to execute the valid pipelines by discovering concrete linguistic web services that match each required linguistic processing functionality in the pipeline. Finally, we plan to have the LingOnto ontology referenced by the Linked Open Vocabularies (LOV) platform.
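As an illustration of the agreement measure mentioned above, the following sketch computes Cohen's Kappa for two experts who label the same pipelines as "Valid" or "Not valid". The function name and the toy labels are assumptions for the example, and the planned tool may compute the coefficient differently.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each rater's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / n**2
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Toy feedback from two hypothetical experts on four generated pipelines.
expert_1 = ["Valid", "Valid", "Not valid", "Valid"]
expert_2 = ["Valid", "Not valid", "Not valid", "Valid"]
print(cohens_kappa(expert_1, expert_2))
```

A value of 1 indicates perfect agreement, while a value close to 0 indicates agreement no better than chance.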

Fig. 2 The classification of some Arabic linguistic processing functionalities

Fig. 4 Example of SWRL rules for lingware applications development assistant

Fig. 5 Example of SWRL rules for NLP domain understanding assistant

Fig. 6 An example of a pattern

Fig. 7 Query graph resulting from the instantiation of the pattern of Fig. 6

Fig. 14 Recall and Precision performance for high-level lingware applications

Table 1 Examples of Linguistic Processing Features.
Feature           Examples
Analysis Type     Structural, Thematic, Syntagmatic, Top-down, Bottom-Up, Profound, and Surfacing or Chunking
Approach          Linguistic, Statistic, and Hybrid
Formalism         Unification Grammar and Resolution Algorithm
Resource          WordNet-LMF and GermaNet
Language          English, Arabic, and French
Treatment Type    Analysis, Generation, and Hybrid

Table 2 Examples of proposed lingware applications