Jeux de données

Study on the Domain Adaption of Korean Speech Act using Daily Conversation Dataset and Petition Corpus

Youngsook Song ; Won Ik Cho.
In Korean, quantitative speech act studies have usually been conducted on single utterances with unspecified sources. In this study, we annotate sentences from the National Institute of Korean Language's Messenger Corpus and the National Petition Corpus, as well as example sentences from an academic paper on contemporary Korean vlogging, and check the discrepancy between human annotation and model prediction. In particular, for sentences with differences in locutionary and illocutionary forces, we analyze the causes of errors to see if stylistic features used in a particular domain affect the correct inference of speech act. Through this, we see the necessity to build and analyze a balanced corpus in various text domains, taking into account cases with different usage roles, e.g., messenger conversations belonging to private conversations and petition corpus/vlogging script that have an unspecified audience.

OCR17: Ground Truth and Models for 17th c. French Prints (and hopefully more)

Simon Gabay ; Thibault Clérice ; Christian Reul.
Machine learning begins with machine teaching: in the following paper, we present the data that we have prepared to kick-start the training of reliable OCR models for 17th century prints written in French. The construction of a representative corpus is a major challenge: we need to gather documents from different decades and of different genres to cover as many sizes, weights and styles as possible. Historical prints containing glyphs and typefaces that have now disappeared, transcription is a complex act, for which we present guidelines. Finally, we provide preliminary results based on these training data and experiments to improve them.