2019

The volume also comprises papers published in special issues.


1. Leveraging Concepts in Open Access Publications

Andrea Bertino ; Luca Foppiano ; Laurent Romary ; Pierre Mounier.
This paper addresses the integration of a Named Entity Recognition and Disambiguation (NERD) service within a group of open access (OA) publishing digital platforms and considers its potential impact on both research and scholarly publishing. The software powering this service, called entity-fishing, was initially developed by Inria in the context of the EU FP7 project CENDARI and provides automatic entity recognition and disambiguation using the Wikipedia and Wikidata data sets. The application is distributed under an open-source licence and has been deployed as a web service on DARIAH's infrastructure, hosted by the French Huma-Num. In the paper, we focus on the specific issues related to its integration into five OA platforms specialized in the publication of scholarly monographs in the social sciences and humanities (SSH), as part of the work carried out within the EU H2020 project HIRMEOS (High Integration of Research Monographs in the European Open Science infrastructure). In the first section, we give a brief overview of the current status and evolution of OA publications, considering specifically the challenges that OA monographs are encountering. In the second part, we show how the HIRMEOS project aims to address these challenges by optimizing five OA digital platforms for the publication of monographs from the SSH and ensuring their interoperability. In sections three and four we give a comprehensive description of the entity-fishing service, focusing on its […]
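
As an illustration of how such a service is typically consumed, the following is a minimal Python sketch of a client querying an entity-fishing-style NERD endpoint. The URL, payload shape, and response fields are assumptions based on common deployments of entity-fishing, not details confirmed by the paper.

```python
# Minimal sketch of querying an entity-fishing-style NERD web service.
# Endpoint, payload, and response fields are assumed for illustration;
# consult the entity-fishing documentation for the actual API.
import json
import requests

NERD_URL = "https://nerd.huma-num.fr/nerd/service/disambiguate"  # assumed

def disambiguate(text: str, lang: str = "en") -> list:
    """Send raw text to the service; return recognized, disambiguated entities."""
    query = {"text": text, "language": {"lang": lang}}
    # entity-fishing-style services commonly expect a multipart "query" field.
    resp = requests.post(NERD_URL, files={"query": (None, json.dumps(query))})
    resp.raise_for_status()
    return resp.json().get("entities", [])

for ent in disambiguate("Montaigne retired to his tower near Bordeaux."):
    # Each entity typically carries text offsets plus Wikipedia/Wikidata IDs.
    print(ent.get("rawName"), ent.get("wikidataId"), ent.get("nerd_score"))
```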

2. A Hackathon for Classical Tibetan

Orna Almogi ; Lena Dankin ; Nachum Dershowitz ; Lior Wolf.
We describe the course of a hackathon dedicated to the development of linguistic tools for Tibetan Buddhist studies. Over a period of five days, a group of seventeen scholars, scientists, and students developed and compared algorithms for intertextual alignment and text classification, along with some basic language tools, including a stemmer and word segmenter.
Section: Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities
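
The kind of intertextual alignment described can be sketched in a few lines: Classical Tibetan delimits syllables with the tsheg mark (U+0F0B), so shared passages between two texts can be found by aligning syllable sequences. The function below is an invented illustration, not the hackathon's implementation.

```python
# Illustrative sketch (not the hackathon's code): find shared passages
# between two Classical Tibetan texts by aligning syllable sequences.
from difflib import SequenceMatcher

def syllables(text: str) -> list:
    """Split on the tsheg (U+0F0B), the Tibetan syllable delimiter."""
    return [s for s in text.split("\u0f0b") if s]

def shared_passages(a: str, b: str, min_len: int = 4):
    """Yield maximal matching runs of at least min_len syllables."""
    sa, sb = syllables(a), syllables(b)
    matcher = SequenceMatcher(None, sa, sb, autojunk=False)
    for m in matcher.get_matching_blocks():
        if m.size >= min_len:
            yield "\u0f0b".join(sa[m.a : m.a + m.size])
```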

3. Computer Analysis of Architecture Using Automatic Image Understanding

Fan Wei ; Yuan Li ; Lior Shamir.
In the past few years, computer vision and pattern recognition systems have become increasingly powerful, expanding the range of automatic tasks enabled by machine vision. Here we show that computer analysis of building images can support quantitative analysis of architecture and quantify similarities between the architectural styles of cities. Images of buildings from 18 cities in three countries were acquired using Google StreetView and used to train a machine vision system to identify the location of the imaged building from the visual content of the image. Experimental results show that the computer analysis can automatically identify the geographical location of a StreetView image. More importantly, the algorithm was able to group the cities and countries and provide a phylogeny of the similarities between architectural styles as captured by StreetView images. These results demonstrate that computer vision and pattern recognition algorithms can perform the complex cognitive task of analyzing images of buildings, and can be used to measure and quantify visual similarities and differences between architectural styles. This experiment provides a new paradigm for studying architecture, based on a quantitative approach that can enhance traditional manual observation and analysis. The source code used for the analysis is open and publicly available.
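
One way to picture such a pipeline is sketched below: embed building images as feature vectors, train a city classifier, and turn the classifier's confusion structure into a between-city distance matrix from which a style phylogeny can be clustered. All names and model choices here are illustrative assumptions; the paper relies on its own image-analysis method.

```python
# Hypothetical sketch of the pipeline described above, not the paper's code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from scipy.cluster.hierarchy import linkage

def city_phylogeny(X_train, y_train, X_test, y_test, cities):
    """X_*: image feature vectors; y_*: city labels; cities: label order."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    cm = confusion_matrix(y_test, clf.predict(X_test), labels=cities)
    p = cm / cm.sum(axis=1, keepdims=True)  # how often city i is called city j
    sim = (p + p.T) / 2                     # symmetrize confusion rates
    dist = 1.0 - sim                        # confusable cities = similar styles
    np.fill_diagonal(dist, 0.0)
    # Condensed upper-triangle distances feed hierarchical clustering,
    # whose dendrogram plays the role of the "phylogeny" of styles.
    return linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
```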

4. Transcribing Foucault’s handwriting with Transkribus

Marie-Laure Massot ; Arianna Sforzini ; Vincent Ventresque.
The Foucault Fiches de Lecture (FFL) project aims both to explore and to make available online a large set of Michel Foucault's reading notes (organized citations, references, and comments) held at the BnF since 2013. To this end, the team is digitizing, describing, and enriching the reading notes that the philosopher gathered while preparing his books and lectures, thus providing a new corpus that will allow a new approach to his work. In order to release the manuscripts online and to produce the data collectively, the team is also developing a collaborative platform based on RDF technologies and designed to link together archival content and bibliographic data. The project is financed by the ANR (2017-2020) and coordinated by Michel Senellart, professor of philosophy at the ENS Lyon, and benefits from partnerships with the ENS/PSL and the BnF. In addition, a collaboration with the European READ/Transkribus project has been started to produce automatic transcriptions of the reading notes.
Section: Digital libraries and virtual exhibitions
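
To make the RDF-based linking concrete, here is a minimal, purely illustrative sketch in Python with rdflib: a digitized reading note is connected to a bibliographic record by a Dublin Core property. The FFL namespace and the identifiers are invented for the example; the project's actual data model is not described in the abstract.

```python
# Invented illustration of linking archival content to bibliographic data
# with RDF; namespaces and identifiers are placeholders, not project data.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS

FFL = Namespace("http://example.org/ffl/")  # hypothetical project namespace

g = Graph()
note = FFL["note/box12-folio034"]      # a digitized reading note
book = FFL["bib/placeholder-edition"]  # stand-in bibliographic record

g.add((note, DCTERMS.description, Literal("Citation with marginal comment")))
g.add((note, DCTERMS.references, book))  # the note cites this edition
print(g.serialize(format="turtle"))
```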

5. Mapping the Bentham Corpus: Concept-based Navigation

Pablo Ruiz Fabo ; Thierry Poibeau.
British philosopher and reformer Jeremy Bentham (1748-1832) left over 60,000 folios of unpublished manuscripts. The Bentham Project, at University College London, is creating a TEI version of the manuscripts via crowdsourced transcription verified by experts. We present here an interface to navigate these largely unedited manuscripts, together with the language technologies used to enrich the corpus and facilitate navigation, i.e. Entity Linking against the DBpedia knowledge base and keyphrase extraction. The challenges of tagging a historical, domain-specific corpus with a contemporary knowledge base are discussed. The concepts extracted were used to create interactive co-occurrence networks that serve as a map of the corpus and, along with a search index, help to navigate it. These corpus representations were integrated in a user interface. The interface was evaluated by domain experts with satisfactory results; for example, they found the distributional semantics methods exploited here useful for retrieving related passages for scholarly editing of the corpus.
Section: Data deluge: which skills for which data?
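
The two enrichment steps named in the abstract above, entity linking and co-occurrence networks, can be sketched as follows. The snippet uses the public DBpedia Spotlight endpoint and networkx purely for illustration; the paper's own tool chain may differ.

```python
# Hedged sketch: link paragraph text to DBpedia, then build a concept
# co-occurrence network (edge weight = shared paragraphs). Illustrative only.
import itertools
import requests
import networkx as nx

SPOTLIGHT = "https://api.dbpedia-spotlight.org/en/annotate"

def link_entities(paragraph: str, confidence: float = 0.5) -> set:
    """Return the set of DBpedia URIs Spotlight finds in a paragraph."""
    resp = requests.get(SPOTLIGHT,
                        params={"text": paragraph, "confidence": confidence},
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    return {r["@URI"] for r in resp.json().get("Resources", [])}

def cooccurrence_graph(paragraphs: list) -> nx.Graph:
    """Connect two concepts whenever they appear in the same paragraph."""
    g = nx.Graph()
    for para in paragraphs:
        for u, v in itertools.combinations(sorted(link_entities(para)), 2):
            if g.has_edge(u, v):
                g[u][v]["weight"] += 1
            else:
                g.add_edge(u, v, weight=1)
    return g
```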

6. Causal reasoning and symbolic relationships in Medieval Illuminations

Djibril Diarra ; Martine Clouzot ; Christophe Nicolle.
This work applies knowledge engineering techniques to medieval illuminations. An illumination is considered as a knowledge graph that certain elites in the Middle Ages used to represent themselves as a social group and to exhibit the events of their lives and their cultural values. The graph is based on combinations of symbolic elements linked to each other by semantic relations. These combinations were used to encode visual metaphors and influential messages whose interpretation is sometimes difficult for non-experts. Our work aims to describe the meaning of these elements through logical modelling using ontologies. To achieve this, we construct logical reasoning rules and simulate them using artificial intelligence mechanisms. The goal is to facilitate the interpretation of illuminations and, in a future evolution of current social media, to provide a logical formalisation for new encoding and information transmission services.
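
As a toy illustration of rule-based inference over symbolic elements (the symbols and the rule below are invented for the example, not taken from the authors' ontology), combinations of depicted symbols can entail a metaphorical reading:

```python
# Toy forward chaining over (subject, predicate, object) triples, with a
# single variable ?i; symbols and the rule are invented for illustration.
facts = {("illum42", "depicts", "lion"),
         ("illum42", "depicts", "crown")}

# Rule: depicts(i, lion) AND depicts(i, crown) -> conveys(i, royal_power)
rules = [({("?i", "depicts", "lion"), ("?i", "depicts", "crown")},
          ("?i", "conveys", "royal_power"))]

def forward_chain(facts: set, rules: list) -> set:
    """Repeatedly apply rules until no new triples are inferred."""
    inferred, changed = set(facts), True
    while changed:
        changed = False
        for body, (_, hp, ho) in rules:
            for subj in {s for s, _, _ in inferred}:
                grounded = {(subj, p, o) for _, p, o in body}
                head = (subj, hp, ho)
                if grounded <= inferred and head not in inferred:
                    inferred.add(head)
                    changed = True
    return inferred

print(forward_chain(facts, rules) - facts)
# -> {('illum42', 'conveys', 'royal_power')}
```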

7. The Artist Libraries Project in the Labex Les passés dans le présent

Félicie Faizand de Maupeou ; Ségolène Le Men.
The creation of the Artist Libraries Project was sparked by the observation that artists' libraries are still not well known, even though many art historians are interested in this kind of archive for the value it adds to understanding the person behind the artist and his or her creative process. The problem is that these libraries are rarely physically preserved. To remedy this dispersion, we built an online database and a website, www.lesbibliothequesdartistes.org, that house this valuable source in the form of lists of books and their electronic versions. The first data, on Monet's library, have been made available, and several additional artist libraries from the 19th and 20th centuries are planned for 2019. By gathering all these bibliographical data in a central database, it is possible to explore a single library and to compare several. This article explains how we built the database and the website, and how the implementation of these IT tools has raised questions about the use of this resource as an archive on the one hand, and its value for art history on the other.
Section: Digital libraries and virtual exhibitions

8. Individual vs. Collaborative Methods of Crowdsourced Transcription

Samantha Blickhan ; Coleman Krawczyk ; Daniel Hanson ; Amy Boyer ; Andrea Simenstad ; Victoria van Hyning.
While online crowdsourced text transcription projects have proliferated in the last decade, there is a need within the broader field to understand differences in project outcomes as they relate to task design, as well as to experiment with different models of online crowdsourced transcription that have not yet been explored. The experiment discussed in this paper involves the evaluation of newly built tools on the Zooniverse.org crowdsourcing platform, attempting to answer the research question: "Does the current Zooniverse methodology of multiple independent transcribers and aggregation of results render higher-quality outcomes than allowing volunteers to see previous transcriptions and/or markings by other users? How does each methodology impact the quality and depth of analysis and participation?" To answer these questions, the Zooniverse team ran an A/B experiment on the project Anti-Slavery Manuscripts at the Boston Public Library. This paper will share the results of this study and describe the process of designing the experiment, as well as the metrics used to evaluate each transcription method. These include the comparison of aggregate transcription results with ground-truth data; evaluation of annotation methods; the time it took volunteers to transcribe each dataset; and the level of engagement with other project elements, such as posting on the message board or reading supporting documentation. Particular focus will be given to the (at times) […]
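
To make the contrast concrete, here is one simple, invented way to aggregate multiple independent transcriptions of the same line: an exact-match majority vote with a token-level fallback. It is a deliberate simplification for illustration, not the Zooniverse aggregation pipeline.

```python
# Simplified illustration of aggregating independent transcriptions;
# not the Zooniverse pipeline.
from collections import Counter

def consensus(transcriptions: list) -> str:
    """Return a consensus string from several volunteers' attempts."""
    text, votes = Counter(transcriptions).most_common(1)[0]
    if votes > len(transcriptions) / 2:
        return text  # an outright majority agrees on the full line
    # Fallback: vote token by token, padding shorter tokenizations.
    token_lists = [t.split() for t in transcriptions]
    width = max(len(toks) for toks in token_lists)
    padded = [toks + [""] * (width - len(toks)) for toks in token_lists]
    winners = (Counter(col).most_common(1)[0][0] for col in zip(*padded))
    return " ".join(w for w in winners if w)

print(consensus(["the quick brown fox",
                 "the quick browne fox",
                 "the quick brown fox"]))  # -> "the quick brown fox"
```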