LREC 2020 Paper Dissemination (3/10)

LREC 2020 was not held in Marseille this year and only the Proceedings were published.

The ELRA Board and the LREC 2020 Programme Committee now feel that those papers should be disseminated again, in a thematic-oriented way, shedding light on specific “topics/sessions”.

Packages with several sessions will be disseminated every Tuesday for 10 weeks, from Nov 10, 2020 until the end of January 2021.

Each session displays papers’ title and authors, with corresponding abstract (for ease of reading) and url, in like manner as the Book of Abstracts we used to print and distribute at LRECs.

We hope that you discover interesting, even exciting, work that may be useful for your own research.

Group of papers sent on November 24, 2020

Links to each session

Emotion Recognition and Generation


EmoEvent: A Multilingual Emotion Corpus based on different Events

Flor Miriam Plaza del Arco, Carlo Strapparava, L. Alfonso Urena Lopez and Maite Martin

In recent years emotion detection in text has become more popular due to its potential applications in fields such as psychology, marketing, political science, and artificial intelligence, among others. While opinion mining is a well-established task with many standard data sets and well-defined methodologies, emotion mining has received less attention due to its complexity. In particular, the annotated gold standard resources available are not enough. In order to address this shortage, we present a multilingual emotion data set based on different events that took place in April 2019. We collected tweets from the Twitter platform. Then one of seven emotions, six Ekman’s basic emotions plus the “neutral or other emotions”, was labeled on each tweet by 3 Amazon MTurkers. A total of 8,409 in Spanish and 7,303 in English were labeled. In addition, each tweet was also labeled as offensive or no offensive.  We report some linguistic statistics about the data set in order to observe the difference between English and Spanish speakers when they express emotions related to the same events. Moreover, in order to validate the effectiveness of the data set, we also propose a machine learning approach for automatically detecting emotions in tweets for both languages, English and Spanish.


MuSE: a Multimodal Dataset of Stressed Emotion

Mimansa Jaiswal, Cristian-Paul Bara, Yuanhang Luo, Mihai Burzo, Rada Mihalcea and Emily Mower Provost

Endowing automated agents with the ability to provide support, entertainment and interaction with human beings requires sensing of the users’ affective state. These affective states are impacted by a combination of emotion inducers, current psychological state, and various conversational factors. Although emotion classification in both singular and dyadic settings is an established area, the effects of these additional factors on the production and perception of emotion is understudied. This paper presents a new dataset, Multimodal Stressed Emotion (MuSE), to study the multimodal interplay between the presence of stress and expressions of affect. We describe the data collection protocol, the possible areas of use, and the annotations for the emotional content of the recordings. The paper also presents several baselines to measure the performance of multimodal features for emotion and stress classification.


Affect inTweets: A Transfer Learning Approach

Linrui Zhang, Hsin-Lun Huang, Yang Yu and Dan Moldovan

People convey sentiments and emotions through language. To understand these affectual states is an essential step towards understanding natural language. In this paper, we propose a transfer-learning based approach to inferring the affectual state of a person from their tweets. As opposed to the traditional machine learning models which require considerable effort in designing task specific features, our model can be well adapted to the proposed tasks with a very limited amount of fine-tuning, which significantly reduces the manual effort in feature engineering. We aim to show that by leveraging the pre-learned knowledge, transfer learning models can achieve competitive results in the affectual content analysis of tweets, compared to the traditional models. As shown by the experiments on SemEval-2018 Task 1: Affect in Tweets, our model ranking 2nd, 4th and 6th place in four of its subtasks proves the effectiveness of our idea.


Annotation of Emotion Carriers in Personal Narratives

Aniruddha Tammewar, Alessandra Cervone, Eva-Maria Messner and Giuseppe Riccardi

We are interested in the problem of understanding personal narratives (PN) – spoken or written – recollections of facts, events, and thoughts. For PNs, we define emotion carriers as the speech or text segments that best explain the emotional state of the narrator. Such segments may span from single to multiple words, containing for example verb or noun phrases. Advanced automatic understanding of PNs requires not only the prediction of the narrator’s emotional state but also to identify which events (e.g. the loss of a relative or the visit of grandpa) or people (e.g. the old group of high school mates) carry the emotion manifested during the personal recollection. This work proposes and evaluates an annotation model for identifying emotion carriers in spoken personal narratives. Compared to other text genres such as news and microblogs, spoken PNs are particularly challenging because a narrative is usually unstructured, involving multiple sub-events and characters as well as thoughts and associated emotions perceived by the narrator. In this work, we experiment with annotating emotion carriers in speech transcriptions from the Ulm State-of-Mind in Speech (USoMS) corpus, a dataset of PNs in German. We believe this resource could be used for experiments in the automatic extraction of emotion carriers from PN, a task that could provide further advancements in narrative understanding.


Towards Interactive Annotation for Hesitation in Conversational Speech

Jane Wottawa, Marie Tahon, Apolline Marin and Nicolas Audibert

Manual annotation of speech corpora is expensive in both human resources and time. Furthermore, recognizing affects in spontaneous, non acted speech presents a challenge for humans and machines. The aim of the present study is to automatize the labeling of hesitant speech as a marker of expressed uncertainty. That is why, the NCCFr-corpus was manually annotated for ‘degree of hesitation’ on a continuous scale between -3 and 3 and the affective dimensions ‘activation, valence and control’. In total, 5834 chunks of the NCCFr-corpus were manually annotated. Acoustic analyses were carried out based on these annotations. Furthermore, regression models were trained in order to allow automatic prediction of hesitation for speech chunks that do not have a manual annotation. Preliminary results show that the number of filled pauses as well as vowel duration increase with the degree of hesitation, and that automatic prediction of the hesitation degree reaches encouraging RMSE results of 1.6.


Abusive language in Spanish children and young teenager’s conversations: data preparation and short text classification with contextual word embeddings

Marta R. Costa-jussà, Esther González, Asuncion Moreno and Eudald Cumalat

Abusive texts are reaching the interests of the scientific and social community.  How to automatically detect them is onequestion that is gaining interest in the natural language processing community.  The main contribution of this paper is toevaluate the quality of the recently developed ”Spanish Database for cyberbullying prevention” for the purpose of trainingclassifiers on detecting abusive short texts.  We compare classical machine learning techniques to the use of a more ad-vanced model: the contextual word embeddings in the particular case of classification of abusive short-texts for the Spanishlanguage. As contextual word embeddings, we use Bidirectional Encoder Representation from Transformers (BERT), pro-posed at the end of 2018.  We show that BERT mostly outperforms classical techniques.  Far beyond the experimentalimpact of our research, this project aims at planting the seeds for an innovative technological tool with a high potentialsocial impact and aiming at being part of the initiatives in artificial intelligence for social good.


IIIT-H TEMD Semi-Natural Emotional Speech Database from Professional Actors and Non-Actors

Banothu Rambabu, Kishore Kumar Botsa, Gangamohan Paidi and Suryakanth V Gangashetty

A fundamental essence for emotional speech analysis towards emotion recognition is a good database. Database collected from natural scenarios consists of spontaneous emotions, but there are several issues in collection of such database. Other than the privacy and legal related concerns, there is no control over environment at the background. As it is difficult to collect data from natural scenarios, many research groups have collected data from semi-natural or designed procedures. In this paper, a new emotional speech database named IIIT-H TEMD (International Institute of Information Technology-Hyderabad Telugu Emotional Database) is collected using designed drama situations from actors and non-actors. Utterances are manually annotated using a hybrid strategy by giving the context to one of the listeners. As some of the data collection studies in the literature recommend for actors, analysis of actors data versus non-actors data is carried out for their significance. The total size of the dataset is about 5 hours, which makes it an useful resource for the emotional speech analysis.


The POTUS Corpus, a Database of Weekly Addresses for the Study of Stance in Politics and Virtual Agents

Thomas Janssoone, Kévin Bailly, Gaël Richard and Chloé Clavel

One of the main challenges in the field of Embodied Conversational Agent (ECA) is to generate socially believable agents. The common strategy for agent behaviour synthesis is to rely on dedicated corpus analysis. Such a corpus is composed of multimedia files of socio-emotional behaviors which have been annotated by external observers. The underlying idea is to identify interaction information for the agent’s socio-emotional behavior by checking whether the intended socio-emotional behavior is actually perceived by humans. Then, the annotations can be used as learning classes for machine learning algorithms applied to the social signals. This paper introduces the POTUS Corpus composed of high-quality audio-video files of political addresses to the American people. Two protagonists are present in this database. First, it includes speeches of former president Barack Obama to the American people. Secondly, it provides videos of these same speeches given by a virtual agent named Rodrigue. The ECA reproduces the original address as closely as possible using social signals automatically extracted from the original one. Both are annotated for social attitudes, providing information about the stance observed in each file. It also provides the social signals automatically extracted from Obama’s addresses used to generate Rodrigue’s ones.


GoodNewsEveryone: A Corpus of News Headlines Annotated with Emotions, Semantic Roles, and Reader Perception

Laura Ana Maria Bostan, Evgeny Kim and Roman Klinger

Most research on emotion analysis from text focuses on the task of emotion classification or emotion intensity regression. Fewer works address emotions as a phenomenon to be tackled with structured learning, which can be explained by the lack of relevant datasets. We fill this gap by releasing a dataset of 5000 English news headlines annotated via crowdsourcing with their associated emotions, the corresponding emotion experiencers and textual cues, related emotion causes and targets, as well as the reader’s perception of the emotion of the headline. This annotation task is comparably challenging, given the large number of classes and roles to be identified. We therefore propose a multiphase annotation procedure in which we first find relevant instances with emotional content and then annotate the more fine-grained aspects. Finally, we develop a baseline for the task of automatic prediction of semantic role structures and discuss the results. The corpus we release enables further research on emotion classification, emotion intensity prediction, emotion cause detection, and supports further qualitative studies.


SOLO: A Corpus of Tweets for Examining the State of Being Alone

Svetlana Kiritchenko, Will Hipson, Robert Coplan and Saif M. Mohammad

The state of being alone can have a substantial impact on our lives, though experiences with time alone diverge significantly among individuals. Psychologists distinguish between the concept of solitude, a positive state of voluntary aloneness, and the concept of loneliness, a negative state of dissatisfaction with the quality of one’s social interactions. Here, for the first time, we conduct a large-scale computational analysis to explore how the terms associated with the state of being alone are used in online language. We present SOLO (State of Being Alone), a corpus of over 4 million tweets collected with query terms solitude, lonely, and loneliness. We use SOLO to analyze the language and emotions associated with the state of being alone. We show that the term solitude tends to co-occur with more positive, high-dominance words (e.g., enjoy, bliss) while the terms lonely and loneliness frequently co-occur with negative, low-dominance words (e.g., scared, depressed), which confirms the conceptual distinctions made in psychology. We also show that women are more likely to report on negative feelings of being lonely as compared to men, and there are more teenagers among the tweeters that use the word lonely than among the tweeters that use the word solitude.


PoKi: A Large Dataset of Poems by Children

Will Hipson and Saif M. Mohammad

Child language studies are crucial in improving our understanding of child well-being; especially in determining the factors that impact happiness, the sources of anxiety, techniques of emotion regulation, and the mechanisms to cope with stress. However, much of this research is stymied by the lack of availability of large child-written texts. We present a new corpus of child-written text, PoKi, which includes about 62 thousand poems written by children from grades 1 to 12. PoKi is especially useful in studying child language because it comes with information about the age of the child authors (their grade). We analyze the words in PoKi along several emotion dimensions (valence, arousal, dominance) and discrete emotions (anger, fear, sadness, joy). We use non-parametric regressions to model developmental differences from early childhood to late-adolescence. Results show decreases in valence that are especially pronounced during mid-adolescence, while arousal and dominance peaked during adolescence. Gender differences in the developmental trajectory of emotions are also observed. Our results support and extend the current state of emotion development research.


AlloSat: A New Call Center French Corpus for Satisfaction and Frustration Analysis

Manon Macary, Marie Tahon, Yannick Estève and Anthony Rousseau

We present a new corpus, named AlloSat, composed of real-life call center conversations in French that is continuously annotated in frustration and satisfaction. This corpus has been set up to develop new systems able to model the continuous aspect of semantic and paralinguistic information at the conversation level. The present work focuses on the paralinguistic level, more precisely on the expression of emotions. In the call center industry, the conversation usually aims at solving the caller’s request. As far as we know, most emotional databases contain static annotations in discrete categories or in dimensions such as activation or valence. We hypothesize that these dimensions are not task-related enough. Moreover, static annotations do not enable to explore the temporal evolution of emotional states. To solve this issue, we propose a corpus with a rich annotation scheme enabling a real-time investigation of the axis frustration  /  satisfaction. AlloSat regroups 303 conversations with a total of approximately 37 hours of audio, all recorded in real-life environments collected by Allo-Media (an intelligent call tracking company). First regression experiments, with audio features, show that the evolution of frustration / satisfaction axis can be retrieved automatically at the conversation level.


Learning the Human Judgment for the Automatic Evaluation of Chatbot

Shih-Hung Wu and Sheng-Lun Chien

It is hard to evaluate the quality of the generated text by a generative dialogue system. Currently, dialogue evaluation relies on human judges to label the quality of the generated text. It is not a reusable mechanism that can give consistent evaluation for system developers. We believe that it is easier to get consistent results on comparing two generated dialogue by two systems and it is hard to give a consistent quality score on only one system at a time. In this paper, we propose a machine learning approach to reduce the effort of human evaluation by learning the human judgment on comparing two dialogue systems. Training from the human labeling result, the evaluation model learns which generative models is better in each dialog context. Thus, it can be used for system developers to compare the fine-tuned models over and over again without the human labor. In our experiment we find the agreement between the learned model and human judge is 70%. The experiment is conducted on comparing two attention based GRU-RNN generative models.


Korean-Specific Emotion Annotation Procedure Using N-Gram-Based Distant Supervision and Korean-Specific-Feature-Based Distant Supervision

Young-Jun Lee, Chae-Gyun Lim and Ho-Jin Choi

Detecting emotions from texts is considerably important in an NLP task, but it has the limitation of the scarcity of manually labeled data. To overcome this limitation, many researchers have annotated unlabeled data with certain frequently used annotation procedures. However, most of these studies are focused mainly on English and do not consider the characteristics of the Korean language. In this paper, we present a Korean-specific annotation procedure, which consists of two parts, namely n-gram-based distant supervision and Korean-specific-feature-based distant supervision. We leverage the distant supervision with the n-gram and Korean emotion lexicons. Then, we consider the Korean-specific emotion features. Through experiments, we showed the effectiveness of our procedure by comparing with the KTEA dataset. Additionally, we constructed a large-scale emotion-labeled dataset, Korean Movie Review Emotion (KMRE) Dataset, using our procedure. In order to construct our dataset, we used a large-scale sentiment movie review corpus as the unlabeled dataset. Moreover, we used a Korean emotion lexicon provided by KTEA. We also performed an emotion classification task and a human evaluation on the KMRE dataset.


Semi-Automatic Construction and Refinement of an Annotated Corpus for a Deep Learning Framework for Emotion Classification

Jiajun Xu, Kyosuke Masuda, Hiromitsu Nishizaki, Fumiyo Fukumoto and Yoshimi Suzuki

In the case of using a deep learning (machine learning) framework for emotion classification, one significant difficulty faced is the requirement of building a large, emotion corpus in which each sentence is assigned emotion labels. As a result, there is a high cost in terms of time and money associated with the construction of such a corpus. Therefore, this paper proposes a method of creating a semi-automatically constructed emotion corpus. For the purpose of this study sentences were mined from Twitter using some emotional seed words that were selected from a dictionary in which the emotion words were well-defined. Tweets were retrieved by one emotional seed word, and the retrieved sentences were assigned emotion labels based on the emotion category of the seed word. It was evident from the findings that the deep learning-based emotion classification model could not achieve high levels of accuracy in emotion classification because the semi-automatically constructed corpus had many errors when assigning emotion labels. In this paper, therefore, an approach for improving the quality of the emotion labels by automatically correcting the errors of emotion labels is proposed and tested. The experimental results showed that the proposed method worked well, and the classification accuracy rate was improved to 55.1% from 44.9% on the Twitter emotion classification task.


CEASE, a Corpus of Emotion Annotated Suicide notes in English

Soumitra Ghosh, Asif Ekbal and Pushpak Bhattacharyya

A suicide note is usually written shortly before the suicide and it provides a chance to comprehend the self-destructive state of mind of the deceased.  From a psychological point of view, suicide notes have been utilized for recognizing the motive behind the suicide. To the best of our knowledge, there is no openly accessible suicide note corpus at present, making it challenging for the researchers and developers to deep dive into the area of mental health assessment and suicide prevention. In this paper, we create a fine-grained emotion annotated corpus (CEASE) of suicide notes in English and develop various deep learning models to perform emotion detection on the curated dataset. The corpus consists of 2393 sentences from around 205 suicide notes collected from various sources. Each sentence is annotated with a particular emotion class from a set of 15 fine-grained emotion labels, namely (forgiveness, happiness_peacefulness, love,  pride,  hopefulness, thankfulness,  blame,  anger,  fear,  abuse,  sorrow,  hopelessness,  guilt,  information, instructions). For the evaluation, we develop an ensemble architecture, where the base models correspond to three supervised deep learning models, namely Convolutional Neural Network (CNN), Gated Recurrent Unit (GRU) and Long Short Term Memory (LSTM). We obtain the highest test accuracy of 60.17% and cross-validation accuracy of 60.32%


Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems

Oliver Guhr, Anne-Kathrin Schumann, Frank Bahrmann and Hans Joachim Böhme

This paper describes the training of a general-purpose German sentiment classification model. Sentiment classification is an important aspect of general text analytics. Furthermore, it plays a vital role in dialogue systems and voice interfaces that depend on the ability of the system to pick up and understand emotional signals from user utterances. The presented study outlines how we have collected a new German sentiment corpus and then combined this corpus with existing resources to train a broad-coverage German sentiment model. The resulting data set contains 5.4 million labelled samples. We have used the data to train both, a simple convolutional and a transformer-based classification model and compared the results achieved on various training configurations. The model and the data set will be published along with this paper.


An Event-comment Social Media Corpus for Implicit Emotion Analysis

Sophia Yat Mei Lee and Helena Yan Ping Lau

The classification of implicit emotions in text has always been a great challenge to emotion processing. Even though the majority of emotion expressed implicitly, most previous attempts at emotions have focused on the examination of explicit emotions. The poor performance of existing emotion identification and classification models can partly be attributed to the disregard of implicit emotions. In view of this, this paper presents the development of a Chinese event-comment social media emotion corpus. The corpus deals with both explicit and implicit emotions with more emphasis being placed on the implicit ones. This paper specifically describes the data collection and annotation of the corpus. An annotation scheme has been proposed for the annotation of emotion-related information including the emotion type, the emotion cause, the emotion reaction, the use of rhetorical question, the opinion target (i.e. the semantic role in an event that triggers an emotion), etc. Corpus data shows that the annotated items are of great value to the identification of implicit emotions. We believe that the corpus will be a useful resource for both explicit and implicit emotion classification and detection as well as event classification.


An Emotional Mess! Deciding on a Framework for Building a Dutch Emotion-Annotated Corpus

Luna De Bruyne, Orphee De Clercq and Veronique Hoste

Seeing the myriad of existing emotion models, with the categorical versus dimensional opposition the most important dividing line, building an emotion-annotated corpus requires some well thought-out strategies concerning framework choice. In our work on automatic emotion detection in Dutch texts, we investigate this problem by means of two case studies. We find that the labels joy, love, anger, sadness and fear are well-suited to annotate texts coming from various domains and topics, but that the connotation of the labels strongly depends on the origin of the texts. Moreover, it seems that information is lost when an emotional state is forcedly classified in a limited set of categories, indicating that a bi-representational format is desirable when creating an emotion corpus.


PO-EMO: Conceptualization, Annotation, and Modeling of Aesthetic Emotions in German and English Poetry

Thomas Haider, Steffen Eger, Evgeny Kim, Roman Klinger and Winfried Menninghaus

Most approaches to emotion analysis of social media, literature, news, and other domains focus exclusively on basic emotion categories as defined by Ekman or Plutchik. However, art (such as literature) enables engagement in a broader range of more complex and subtle emotions. These have been shown to also include mixed emotional responses. We consider emotions in poetry as they are elicited in the reader, rather than what is expressed in the text or intended by the author. Thus, we conceptualize a set of aesthetic emotions that are predictive of aesthetic appreciation in the reader, and allow the annotation of multiple labels per line to capture mixed emotions within their context. We evaluate this novel setting in an annotation experiment both with carefully trained experts and via crowdsourcing. Our annotation with experts leads to an acceptable agreement of k = .70, resulting in a consistent dataset for future large scale analysis. Finally, we conduct first emotion classification experiments based on BERT, showing that identifying aesthetic emotions is challenging in our data, with up to .52 F1-micro on the German subset. Data and resources are available at


Learning Word Ratings for Empathy and Distress from Document-Level User Responses

João Sedoc, Sven Buechel, Yehonathan Nachmany, Anneke Buffone and Lyle Ungar

Despite the excellent performance of black box approaches to modeling sentiment and emotion, lexica (sets of informative words and associated weights) that characterize different emotions are indispensable to the NLP community because they allow for interpretable and robust predictions. Emotion analysis of text is increasing in popularity in NLP; however, manually creating lexica for psychological constructs such as empathy has proven difficult. This paper automatically creates empathy word ratings from document-level ratings. The underlying problem of learning word ratings from higher-level supervision has to date only been addressed in an ad hoc fashion and has not used deep learning methods. We systematically compare a number of approaches to learning word ratings from higher-level supervision against a Mixed-Level Feed Forward Network (MLFFN), which we find performs best, and use the MLFFN to create the first-ever empathy lexicon. We then use Signed Spectral Clustering to gain insights into the resulting words. The empathy and distress lexica are publicly available at:


Evaluation Methodologies

Back to Top


Evaluation of Sentence Representations in Polish

Slawomir Dadas, Michał Perełkiewicz and Rafał Poświata

Methods for learning sentence representations have been actively developed in recent years. However, the lack of pre-trained models and datasets annotated at the sentence level has been a problem for low-resource languages such as Polish which led to less interest in applying these methods to language-specific tasks. In this study, we introduce two new Polish datasets for evaluating sentence embeddings and provide a comprehensive evaluation of eight sentence representation methods including Polish and multilingual models. We consider classic word embedding models, recently developed contextual embeddings and multilingual sentence encoders, showing strengths and weaknesses of specific approaches. We also examine different methods of aggregating word vectors into a single sentence vector.


Identification of Primary and Collateral Tracks in Stuttered Speech

Rachid Riad, Anne-Catherine Bachoud-Lévi, Frank Rudzicz and Emmanuel Dupoux

Disfluent speech has been previously addressed from two main perspectives:  the clinical perspective focusing on diagnostic, and the Natural Language Processing (NLP) perspective aiming at modeling these events and detect them for downstream tasks. In addition, previous works often used different metrics depending on whether the input features are text or speech, making it difficult to compare the different contributions. Here, we introduce a new evaluation framework for disfluency detection inspired by the clinical and NLP perspective together with the theory of performance from (Clark, 1996) which distinguishes between primary and collateral tracks. We introduce a novel forced-aligned disfluency dataset from a corpus of semi-directed interviews, and present baseline results directly comparing the performance of text-based features (word and span information) and speech-based (acoustic-prosodic information). Finally, we introduce new audio features inspired by the word-based span features. We show experimentally that using these features outperformed the baselines for speech-based predictions on the present dataset.


How to Compare Automatically Two Phonological Strings: Application to Intelligibility Measurement in the Case of Atypical Speech

Alain Ghio, Muriel Lalain, Laurence Giusti, Corinne Fredouille and Virginie Woisard

Atypical speech productions, regardless of their origins (accents, learning, pathology), need to be assessed with regard to “typical” or “expected” productions. Evaluation is necessarily based on comparisons between linguistic forms produced and linguistic forms expected. In the field of speech disorders, the intelligibility of a patient is evaluated in order to measure the functional impact of his/her pathology on his/her oral communication. The usual method is to transcribe orthographic linguistic forms perceived and to assign a global and imprecise rating based on their correctness or incorrect. To obtain a more precise evaluation of the production deviations, we propose a measurement method based on phonological transcriptions. An algorithm computes automatically and finely the distances between the phonological forms produced and expected from cost matrices based on the differences of features between phonemes. A first test of this method among a large population of healthy speakers and patients treated for cancer of the oral and pharyngeal cavities has proved its validity.


Evaluating Text Coherence at Sentence and Paragraph Levels

Sennan Liu, Shuang Zeng and Sujian Li

In this paper, to evaluate text coherence, we propose the paragraph ordering task as well as conducting sentence ordering. We collected four distinct corpora from different domains on which we investigate the adaptation of existing sentence ordering methods to a paragraph ordering task. We also compare the learnability and robustness of existing models by artificially creating mini datasets and noisy datasets respectively and verifying the efficiency of established models under these circumstances. Furthermore, we carry out human evaluation on the rearranged passages from two competitive models and confirm that WLCS-l is a better metric performing significantly higher correlations with human rating than τ , the most prevalent metric used before. Results from these evaluations show that except for certain extreme conditions, the recurrent graph neural network-based model is an optimal choice for coherence modeling.


HardEval: Focusing on Challenging Tokens to Assess Robustness of NER

Gabriel Bernier-Colborne and Phillippe Langlais

To assess the robustness of NER systems, we propose an evaluation method that focuses on subsets of tokens that represent specific sources of errors: unknown words and label shift or ambiguity. These subsets provide a system-agnostic basis for evaluating specific sources of NER errors and assessing room for improvement in terms of robustness. We analyze these subsets of challenging tokens in two widely-used NER benchmarks, then exploit them to evaluate NER systems in both in-domain and out-of-domain settings. Results show that these challenging tokens explain the majority of errors made by modern NER systems, although they represent only a small fraction of test tokens. They also indicate that label shift is harder to deal with than unknown words, and that there is much more room for improvement than the standard NER evaluation procedure would suggest. We hope this work will encourage NLP researchers to adopt rigorous and meaningful evaluation methods, and will help them develop more robust models.


An Evaluation Dataset for Identifying Communicative Functions of Sentences in English Scholarly Papers

Kenichi Iwatsuki, Florian Boudin and Akiko Aizawa

Formulaic expressions, such as ‘in this paper we propose’, are used by authors of scholarly papers to perform communicative functions; the communicative function of the present example is ‘stating the aim of the paper’. Collecting such expressions and pairing them with their communicative functions would be highly valuable for various tasks, particularly for writing assistance. However, such collection and paring in a principled and automated manner would require high-quality annotated data, which are not available. In this study, we address this shortcoming by creating a manually annotated dataset for detecting communicative functions in sentences. Starting from a seed list of labelled formulaic expressions, we retrieved new sentences from scholarly papers in the ACL Anthology and asked multiple human evaluators to label communicative functions. To show the usefulness of our dataset, we conducted a series of experiments that determined to what extent sentence representations acquired by recent models, such as word2vec and BERT, can be employed to detect communicative functions in sentences.


An Automatic Tool For Language Evaluation

Fabio Fassetti and Ilaria Fassetti

The aim of evaluating children speech and language is to measure their communication skills. In particular, the speech language pathologist is interested in determining the child’s impairments in the areas of language, articulation, voice, fluency and swallowing. In literature some standardized tests have been proposed to assess and screen developmental language impairments but they require manual laborious transcription, annotation and calculation. This work is very time demanding and, also, may introduce several kinds of errors in the evaluation phase and non-uniform evaluations. In order to help therapists, a system performing automated evaluation is proposed. Providing as input the correct sentence and the sentence produced by patients, the technique evaluates the level of the verbal production and returns a score. The main phases of the method concern an ad-hoc transformation of the produced sentence in the reference sentence and in the evaluation of the cost of this transformation. Since the cost function is related to many weights, a learning phase is defined to automatically set such weights.


Which Evaluations Uncover Sense Representations that Actually Make Sense?

Jordan Boyd-Graber, Fenfei Guo, Leah Findlater and Mohit Iyyer

Text representations are critical for modern natural language processing. One form of text representation, sense-specific embeddings, reflect a word’s sense in a sentence better than single-prototype word embeddings tied to each type. However, existing sense representations are not uniformly better: although they work well for computer-centric evaluations, they fail for human-centric tasks like inspecting a language’s sense inventory. To expose this discrepancy, we propose a new coherence evaluation for sense embeddings. We also describe a minimal model (Gumbel Attention for Sense Induction) optimized for discovering interpretable sense representations that are more coherent than existing sense embeddings.


Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections

Yi-An Lai, Xuan Zhu, Yi Zhang and Mona Diab

Summarizing data samples by quantitative measures has a long history, with descriptive statistics being a case in point. However, as natural language processing methods flourish, there are still insufficient characteristic metrics to describe a collection of texts in terms of the words, sentences, or paragraphs they comprise. In this work, we propose metrics of diversity, density, and homogeneity  that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric holds desired properties and resonates with human intuitions. Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with text classification performance of a renowned model, BERT, which could inspire future applications.


Towards Few-Shot Event Mention Retrieval: An Evaluation Framework and A Siamese Network Approach

Bonan Min, Yee Seng Chan and Lingjun Zhao

Automatically analyzing events in a large amount of text is crucial for situation awareness and decision making. Previous approaches treat event extraction as “one size fits all” with an ontology defined a priori. The resulted extraction models are built just for extracting those types in the ontology. These approaches cannot be easily adapted to new event types nor new domains of interest. To accommodate personalized event-centric information needs, this paper introduces the few-shot Event Mention Retrieval (EMR) task: given a user-supplied query consisting of a handful of event mentions, return relevant event mentions found in a corpus. This formulation enables “query by example”, which drastically lowers the bar of specifying event-centric information needs. The retrieval setting also enables fuzzy search. We present an evaluation framework leveraging existing event datasets such as ACE. We also develop a Siamese Network approach, and show that it performs better than ad-hoc retrieval models in the few-shot EMR setting.


Linguistic Appropriateness and Pedagogic Usefulness of Reading Comprehension Questions

Andrea Horbach, Itziar Aldabe, Marie Bexte, Oier Lopez de Lacalle and Montse Maritxalar

Automatic generation of reading comprehension questions is a topic receiving growing interest in the NLP community, but there is currently no consensus on evaluation metrics and many approaches focus on linguistic quality only while ignoring the pedagogic value and appropriateness of questions. This paper overcomes such weaknesses by a new evaluation scheme where questions from the questionnaire are structured in a hierarchical way to avoid confronting human annotators with evaluation measures that do not make sense for a certain question. We show through an annotation study that our scheme can be applied, but that expert annotators with some level of expertise are needed. We also created and evaluated two new evaluation data sets from the biology domain for Basque and German, composed of questions written by people with an educational background, which will be publicly released. Results show that manually generated questions are in general both of higher linguistic as well as pedagogic quality and that among the human generated questions, teacher-generated ones tend to be most useful.


Dataset Reproducibility and IR Methods in Timeline Summarization

Leo Born, Maximilian Bacher and Katja Markert

Timeline summarization (TLS) generates a dated overview of real-world events based on event-specific corpora. The two standard datasets for this task were collected using Google searches for news reports on given events. Not only is this IR method not reproducible at different search times, it also uses components (such as document popularity) that are not always available for any large news corpus. It is unclear how TLS algorithms fare when provided with event corpora collected with varying IR methods. We therefore construct event-specific corpora from a large static background corpus, the newsroom dataset, using differing, relatively simple IR methods based on raw text alone. We show that the choice of IR method plays a crucial role in the performance of various TLS algorithms. A weak TLS algorithm can even match a stronger one by employing a stronger IR method in the data collection phase. Furthermore, the results of TLS systems are often highly sensitive to additional sentence filtering. We consequently advocate for integrating IR into the development of TLS systems and having a common static background corpus for evaluation of TLS systems.


Database Search vs. Information Retrieval: A Novel Method for Studying Natural Language Querying of Semi-Structured Data

Stefanie Nadig, Martin Braschler and Kurt Stockinger

The traditional approach of querying a relational database is via a formal language, namely SQL. Recent developments in the design of natural language interfaces to databases show promising results for querying either with keywords or with full natural language queries and thus render relational databases more accessible to non-tech savvy users. Such enhanced relational databases basically use a search paradigm which is commonly used in the field of information retrieval. However, the way systems are evaluated in the database and the information retrieval communities often differs due to a lack of common benchmarks. In this paper, we provide an adapted benchmark data set that is based on a test collection originally used to evaluate information retrieval systems. The data set contains 45 information needs developed on the Internet Movie Database (IMDb), including corresponding relevance assessments. By mapping this benchmark data set to a relational database schema, we enable a novel way of directly comparing database search techniques with information retrieval. To demonstrate the feasibility of our approach, we present an experimental evaluation that compares SODA, a keyword-enabled relational database system, against the Terrier information retrieval system and thus lays the foundation for a future discussion of evaluating database systems that support natural language interfaces.


Why Attention is Not Explanation: Surgical Intervention and Causal Reasoning about Neural Models

Christopher Grimsley, Elijah Mayfield and Julia R.S. Bursten

As the demand for explainable deep learning grows in the evaluation of language technologies, the value of a principled grounding for those explanations grows as well. Here we study the state-of-the-art in explanation for neural models for NLP tasks from the viewpoint of philosophy of science. We focus on recent evaluation work that finds brittleness in explanations obtained through attention mechanisms. We harness philosophical accounts of explanation to suggest broader conclusions from these studies. From this analysis, we assert the impossibility of causal explanations from attention layers over text data. We then introduce NLP researchers to contemporary philosophy of science theories that allow robust yet non-causal reasoning in explanation, giving computer scientists a vocabulary for future research.


Have a Cake and Eat it Too: Assessing Discriminating Performance of an Intelligibility Index Obtained from a Reduced Sample Size

Anna Marczyk, Alain Ghio, Muriel Lalain, Marie Rebourg, Corinne Fredouille and Virginie Woisard

This paper investigates random vs. phonetically motivated reduction of linguistic material used in an intelligibility task in speech disordered populations and the subsequent impact on the discrimination classifier quantified by the area under the receiver operating characteristics curve (AUC of ROC). The comparison of obtained accuracy indexes shows that when the sample size is reduced based on a phonetic criterium—here, related to phonotactic complexity—, the classifier has a higher ranking ability than when the linguistic material is arbitrarily reduced. Crucially, downsizing the linguistic sample to about 30% of the original dataset does not diminish the discriminatory performance of the classifier. This result is of significant interest to both clinicians and patients as it validates a tool that is both reliable and efficient.


Evaluation Metrics for Headline Generation Using Deep Pre-Trained Embeddings

Abdul Moeed, Yang An, Gerhard Hagerer and Georg Groh

With the explosive growth in textual data, it is becoming increasingly important to summarize text automatically.  Recently, generative language models have shown promise in abstractive text summarization tasks. Since these models rephrase text and thus use similar but different words as found in the summarized text, existing metrics such as ROUGE that use n-gram overlap may not be optimal. Therefore we evaluate two embedding-based evaluation metrics that are applicable to abstractive summarization: Fr ́echet embedding distance,  which has been introduced recently, and angular embedding similarity, which is our proposed metric. To demonstrate the utility of both metrics, we analyze the headline generation capacity of two state-of-the-art language models: GPT-2 and ULMFiT. In particular, our proposed metric shows close relation with human judgments in our experiments and has overall better correlations with them. To provide reproducibility, the source code plus human assessments of our experiments is available on GitHub.


LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation

Gustavo Aguilar, Sudipta Kar and Thamar Solorio

Recent trends in NLP research have raised an interest in linguistic code-switching (CS); modern approaches have been proposed to solve a wide range of NLP tasks on multiple language pairs. Unfortunately, these proposed methods are hardly generalizable to different code-switched languages. In addition, it is unclear whether a model architecture is applicable for a different task while still being compatible with the code-switching setting. This is mainly because of the lack of a centralized benchmark and the sparse corpora that researchers employ based on their specific needs and interests. To facilitate research in this direction, we propose a centralized benchmark for Linguistic Code-switching Evaluation (LinCE) that combines eleven corpora covering four different code-switched language pairs (i.e., Spanish-English, Nepali-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic) and four tasks (i.e., language identification, named entity recognition, part-of-speech tagging, and sentiment analysis). As part of the benchmark centralization effort, we provide an online platform where researchers can submit their results while comparing with others in real-time. In addition, we provide the scores of different popular models, including LSTM, ELMo, and multilingual BERT so that the NLP community can compare against state-of-the-art systems. LinCE is a continuous effort, and we will expand it with more low-resource languages and tasks.


Paraphrase Generation and Evaluation on Colloquial-Style Sentences

Eetu Sjöblom, Mathias Creutz and Yves Scherrer

In this paper, we investigate paraphrase generation in the colloquial domain. We use state-of-the-art neural machine translation models trained on the Opusparcus corpus to generate paraphrases in six languages: German, English, Finnish, French, Russian, and Swedish. We perform experiments to understand how data selection and filtering for diverse paraphrase pairs affects the generated paraphrases. We compare two different model architectures, an RNN and a Transformer model, and find that the Transformer does not generally outperform the RNN. We also conduct human evaluation on five of the six languages and compare the results to the automatic evaluation metrics BLEU and the recently proposed BERTScore. The results advance our understanding of the trade-offs between the quality and novelty of generated paraphrases, affected by the data selection method. In addition, our comparison of the evaluation methods shows that while BLEU correlates well with human judgments at the corpus level, BERTScore outperforms BLEU in both corpus and sentence-level evaluation.


Analyzing Word Embedding Through Structural Equation Modeling

Namgi Han, Katsuhiko Hayashi and Yusuke Miyao

Many researchers have tried to predict the accuracies of extrinsic evaluation by using intrinsic evaluation to evaluate word embedding. The relationship between intrinsic and extrinsic evaluation, however, has only been studied with simple correlation analysis, which has difficulty capturing complex cause-effect relationships and integrating external factors such as the hyperparameters of word embedding. To tackle this problem, we employ partial least squares path modeling (PLS-PM), a method of structural equation modeling developed for causal analysis. We propose a causal diagram consisting of the evaluation results on the BATS, VecEval, and SentEval datasets, with a causal hypothesis that linguistic knowledge encoded in word embedding contributes to solving downstream tasks. Our PLS-PM models are estimated with 600 word embeddings, and we prove the existence of causal relations between linguistic knowledge evaluated on BATS and the accuracies of downstream tasks evaluated on VecEval and SentEval in our PLS-PM models. Moreover, we show that the PLS-PM models are useful for analyzing the effect of hyperparameters, including the training algorithm, corpus, dimension, and context window, and for validating the effectiveness of intrinsic evaluation.


Evaluation of Lifelong Learning Systems

Yevhenii Prokopalo, Sylvain Meignier, Olivier Galibert, Loic Barrault and Anthony Larcher

Current intelligent systems need the expensive support of machine learning experts to sustain their performance level when used on a daily basis. To reduce this cost, i.e. remaining free from any machine learning expert, it is reasonable to implement lifelong (or continuous) learning intelligent systems that will continuously adapt their model when facing changing execution conditions. In this work, the systems are allowed to refer to human domain experts who can provide the system with relevant knowledge about the task. Nowadays, the fast growth of lifelong learning systems development rises the question of their evaluation. In this article we propose a generic evaluation methodology for the specific case of lifelong learning systems. Two steps will be considered. First, the evaluation of human-assisted learning (including active and/or interactive learning) outside the context of lifelong learning. Second, the system evaluation across time, with propositions of how a lifelong learning intelligent system should be evaluated when including human assisted learning or not.


Interannotator Agreement for Lexico-Semantic Annotation of a Corpus

Elżbieta Hajnicz

This paper examines the procedure for lexico-semantic annotation of the Basic Corpus of Polish Metaphors that is the first step for annotating metaphoric expressions occurring in it. The procedure involves correcting the morphosyntactic annotation of part of the corpus that is automatically annotated on the morphosyntactic level. The main procedure concerns annotation of adjectives, adverbs, nouns and verbs (including gerunds and participles), including abbreviations of the words that belong to the above classes. It is composed of three steps: deciding whether a particular occurrence of a word is asemantic (e.g. anaphoric or strictly grammatical), whether we are dealing with a multi-word expression, reciprocal usages of the się marker and pluralia tantum, which may involve annotation with two lexical units (having two different lemmas) for a single token. We propose an interannotator agreement statistics adequate for this procedure. Finally, we discuss the preliminary results of annotation of a fragment of the corpus.


An In-Depth Comparison of 14 Spelling Correction Tools on a Common Benchmark

Markus Näther

Determining and correcting spelling and grammar errors in text is an important but surprisingly difficult task. There are several reasons why this remains challenging. Errors may consist of simple typing errors like deleted, substituted, or wrongly inserted letters, but may also consist of word confusions where a word was replaced by another one. In addition, words may be erroneously split into two parts or get concatenated. Some words can contain hyphens, because they were split at the end of a line or are compound words with a mandatory hyphen. In this paper, we provide an extensive evaluation of 14 spelling correction tools on a common benchmark. In particular, the evaluation provides a detailed comparison with respect to 12 error categories. The benchmark consists of sentences from the English Wikipedia, which were distorted using a realistic error model. Measuring the quality of an algorithm with respect to these error categories requires an alignment of the original text, the distorted text and the corrected text provided by the tool. We make our benchmark generation and evaluation tools publicly available.


Sentence Level Human Translation Quality Estimation with Attention-based Neural Networks

Yu Yuan and Serge Sharoff

This paper explores the use of Deep Learning methods for automatic estimation of quality of human translations.  Automatic estimation can provide useful feedback for translation teaching, examination and quality control.  Conventional methods for solving this task rely on manually engineered features and external knowledge.  This paper presents an end-to-end neural model without feature engineering, incorporating a cross attention mechanism to detect which parts in sentence pairs are most relevant for assessing quality.  Another contribution concerns oprediction of fine-grained scores for measuring different aspects of translation quality, such as terminological accuracy or idiomatic writing.  Empirical results on a large human annotated dataset show that the neural model outperforms feature-based methods significantly.  The dataset and the tools are available.


Evaluating Language Tools for Fifteen EU-official Under-resourced Languages

Diego Alves, Gaurish Thakkar and Marko Tadić

This article presents the results of the evaluation campaign of language tools available for fifteen EU-official under-resourced languages. The evaluation was conducted within the MSC ITN CLEOPATRA action that aims at building the cross-lingual event-centric knowledge processing on top of the application of linguistic processing chains (LPCs) for at least 24 EU-official languages. In this campaign, we concentrated on three existing NLP platforms (Stanford CoreNLP, NLP Cube, UDPipe) that all provide models for under-resourced languages and in this first run we covered 15 under-resourced languages for which the models were available. We present the design of the evaluation campaign and present the results as well as discuss them. We considered the difference between reported and our tested results within a single percentage point as being within the limits of acceptable tolerance and thus consider this result as reproducible. However, for a number of languages, the results are below what was reported in the literature, and in some cases, our testing results are even better than the ones reported previously. Particularly problematic was the evaluation of NERC systems. One of the reasons is the absence of universally or cross-lingually applicable named entities classification scheme that would serve the NERC task in different languages analogous to the Universal Dependency scheme in parsing task. To build such a scheme has become one of our the future research directions.


Word Embedding Evaluation for Sinhala

Dimuthu Lakmal, Surangika Ranathunga, Saman Peramuna and Indu Herath

This paper presents the first ever comprehensive evaluation of different types of word embeddings for Sinhala language. Three standard word embedding models, namely, Word2Vec (both Skipgram and CBOW), FastText, and Glove are evaluated under two types of evaluation methods: intrinsic evaluation and extrinsic evaluation. Word analogy and word relatedness evaluations were performed in terms of intrinsic evaluation, while sentiment analysis and part-of-speech (POS) tagging were conducted as the extrinsic evaluation tasks. Benchmark datasets used for intrinsic evaluations were carefully crafted considering specific linguistic features of Sinhala. In general, FastText word embeddings with 300 dimensions reported the finest accuracies across all the evaluation tasks, while Glove reported the lowest results.


Stress Test Evaluation of Transformer-based Models in Natural Language Understanding Tasks

Carlos Aspillaga, Andrés Carvallo and Vladimir Araujo

There has been significant progress in recent years in the field of Natural Language Processing thanks to the introduction of the Transformer architecture. Current state-of-the-art models, via a large number of parameters and pre-training on massive text corpus, have shown impressive results on several downstream tasks. Many researchers have studied previous (non-Transformer) models to understand their actual behavior under different scenarios, showing that these models are taking advantage of clues or failures of datasets and that slight perturbations on the input data can severely reduce their performance. In contrast, recent models have not been systematically tested with adversarial-examples in order to show their robustness under severe stress conditions. For that reason, this work evaluates three Transformer-based models (RoBERTa, XLNet, and BERT) in Natural Language Inference (NLI) and Question Answering (QA) tasks to know if they are more robust or if they have the same flaws as their predecessors. As a result, our experiments reveal that RoBERTa, XLNet and BERT are more robust than recurrent neural network models to stress tests for both NLI and QA tasks. Nevertheless, they are still very fragile and demonstrate various unexpected behaviors, thus revealing that there is still room for future improvement in this field.


Brand-Product Relation Extraction Using Heterogeneous Vector Space Representations

Arkadiusz Janz, Łukasz Kopociński, Maciej Piasecki and Agnieszka Pluwak

Relation Extraction is a fundamental NLP task. In this paper we investigate the impact of underlying text representation on the performance of neural classification models in the task of Brand-Product relation extraction. We also present the methodology of preparing annotated textual corpora for this task and we provide valuable insight into the properties of Brand-Product relations existing in textual corpora. The problem is approached from a practical angle of applications Relation Extraction in facilitating commercial Internet monitoring.


A Tale of Three Parsers: Towards Diagnostic Evaluation for Meaning Representation Parsing

Maja Buljan, Joakim Nivre, Stephan Oepen and Lilja Øvrelid

We discuss methodological choices in contrastive and diagnostic evaluation in meaning representation parsing, i.e. mapping from natural language utterances to graph-based encodings of its semantic structure. Drawing inspiration from earlier work in syntactic dependency parsing, we transfer and refine several quantitative diagnosis techniques for use in the context of the 2019 shared task on Meaning Representation Parsing (MRP). As in parsing proper, moving evaluation from simple rooted trees to general graphs brings along its own range of challenges. Specifically, we seek to begin to shed light on relative strenghts and weaknesses in different broad families of parsing techniques. In addition to these theoretical reflections, we conduct a pilot experiment on a selection of top-performing MRP systems and one of the five meaning representation frameworks in the shared task. Empirical results suggest that the proposed methodology can be meaningfully applied to parsing into graph-structured target representations, uncovering hitherto unknown properties of the different systems that can inform future development and cross-fertilization across approaches.


Information Extraction, Information Retrieval

Back to Top


Headword-Oriented Entity Linking: A Special Entity Linking Task with Dataset and Baseline

Mu Yang, Chi-Yen Chen, Yi-Hui Lee, Qian-hui Zeng, Wei-Yun Ma, Chen-Yang Shih and Wei-Jhih Chen

In this paper, we design headword-oriented entity linking (HEL), a specialized entity linking problem in which only the headwords of the entities are to be linked to knowledge bases; mention scopes of the entities do not need to be identified in the problem setting. This special task is motivated by the fact that in many articles referring to specific products, the complete full product names are rarely written; instead, they are often abbreviated to shorter, irregular versions or even just to their headwords, which are usually their product types, such as “stick” or “mask” in a cosmetic context. To fully design the special task, we construct a labeled cosmetic corpus as a public benchmark for this problem, and propose a product embedding model to address the task, where each product corresponds to a dense representation to encode the different information on products and their context jointly. Besides, to increase training data, we propose a special transfer learning framework in which distant supervision with heuristic patterns is first utilized, followed by supervised learning using a small amount of manually labeled data. The experimental results show that our model provides a strong benchmark performance on the special task.


TableBank: Table Benchmark for Image-based Table Detection and Recognition

Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou and Zhoujun Li

We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and Latex documents on the internet. Existing research for image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labeled examples, which is difficult to generalize on real-world applications. With TableBank that contains 417K high quality labeled tables, we build several strong baselines using state-of-the-art models with deep neural networks. We make TableBank publicly available and hope it will empower more deep learning approaches in the table detection and recognition task. The dataset and models can be downloaded from


WIKIR: A Python Toolkit for Building a Large-scale Wikipedia-based English Information Retrieval Dataset

Jibril Frej, Didier Schwab and Jean-Pierre Chevallet

Over the past years, deep learning methods allowed for new state-of-the-art results in ad-hoc information retrieval. However such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g. Robust04, ClueWeb09) have at most 250 annotated queries, the recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g. DUET, Conv-KNRM) are trained and evaluated on data collected from commercial search engines not publicly available for academic research which is a problem for reproducibility and the advancement of research. In this paper, we propose WIKIR: an open-source toolkit to automatically build large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR59k: a large-scale publicly available dataset that contains 59,252 queries and 2,617,003 (query, relevant documents) pairs.


Constructing a Public Meeting Corpus

Koji Tanaka, Chenhui Chu, Haolin Ren, Benjamin Renoust, Yuta Nakashima, Noriko Takemura, Hajime Nagahara and Takao Fujikawa

In this paper, we propose a full pipeline of analysis of a large corpus about a century of public meeting in historical Australian news papers, from construction to visual exploration. The corpus construction method is based on image processing and OCR. We digitize and transcribe texts of the specific topic of public meeting. Experiments show that our proposed method achieves a F-score of 87.8% for corpus construction. As a result, we built a content search tool for temporal and semantic content analysis.


Annotating and Extracting Synthesis Process of All-Solid-State Batteries from Scientific Literature

Fusataka Kuniyoshi, Kohei Makino, Jun Ozawa and Makoto Miwa

The synthesis process is essential for achieving computational experiment design in the field of inorganic materials chemistry. In this work, we present a novel corpus of the synthesis process for all-solid-state batteries and an automated machine reading system for extracting the synthesis processes buried in the scientific literature. We define the representation of the synthesis processes using flow graphs, and create a corpus from the experimental sections of 243 papers. The automated machine-reading system is developed by a deep learning-based sequence tagger and simple heuristic rule-based relation extractor. Our experimental results demonstrate that the sequence tagger with the optimal setting can detect the entities with a macro-averaged F1 score of 0.826, while the rule-based relation extractor can achieve high performance with a macro-averaged F1 score of 0.887.


WEXEA: Wikipedia EXhaustive Entity Annotation

Michael Strobl, Amine Trabelsi and Osmar Zaiane

Building predictive models for information extraction from text, such as named entity recognition or the extraction of semantic relationships between named entities in text, requires a large corpus of annotated text. Wikipedia is often used as a corpus for these tasks where the annotation is a named entity linked by a hyperlink to its article. However, editors on Wikipedia are only expected to link these mentions in order to help the reader to understand the content, but are discouraged from adding links that do not add any benefit for understanding an article. Therefore, many mentions of popular entities (such as countries or popular events in history), or previously linked articles, as well as the article’s entity itself, are not linked. In this paper, we discuss WEXEA, a Wikipedia EXhaustive Entity Annotation system, to create a text corpus based on Wikipedia with exhaustive annotations of entity mentions, i.e. linking all mentions of entities to their corresponding articles. This results in a huge potential for additional annotations that can be used for downstream NLP tasks, such as Relation Extraction. We show that our annotations are useful for creating distantly supervised datasets for this task. Furthermore, we publish all code necessary to derive a corpus from a raw Wikipedia dump, so that it can be reproduced by everyone.


Handling Entity Normalization with no Annotated Corpus: Weakly Supervised Methods Based on Distributional Representation and Ontological Information

Arnaud Ferré, Robert Bossy, Mouhamadou Ba, Louise Deléger, Thomas Lavergne, Pierre Zweigenbaum and Claire Nédellec

Entity normalization (or entity linking) is an important subtask of information extraction that links entity mentions in text to categories or concepts in a reference vocabulary. Machine learning based normalization methods have good adaptability as long as they have enough training data per reference with a sufficient quality. Distributional representations are commonly used because of their capacity to handle different expressions with similar meanings. However, in specific technical and scientific domains, the small amount of training data and the relatively small size of specialized corpora remain major challenges. Recently, the machine learning-based CONTES method has addressed these challenges for reference vocabularies that are ontologies, as is often the case in life sciences and biomedical domains. And yet, its performance is dependent on manually annotated corpus. Furthermore, like other machine learning based methods, parametrization remains tricky. We propose a new approach to address the scarcity of training data that extends the CONTES method by corpus selection, pre-processing and weak supervision strategies, which can yield high-performance results without any manually annotated examples. We also study which hyperparameters are most influential, with sometimes different patterns compared to previous work. The results show that our approach significantly improves accuracy and outperforms previous state-of-the-art algorithms.


HBCP Corpus: A New Resource for the Analysis of Behavioural Change Intervention Reports

Francesca Bonin, Martin Gleize, Ailbhe Finnerty, Candice Moore, Charles Jochim, Emma Norris, Yufang Hou, Alison J. Wright, Debasis Ganguly, Emily Hayes, Silje Zink, Alessandra Pascale, Pol Mac Aonghusa and Susan Michie

Due to the fast pace at which research reports in behaviour change are published, researchers, consultants and policymakers would benefit from more automatic ways to process these reports. Automatic extraction of the reports’ intervention content, population, settings and their results etc. are essential in synthesising and summarising the literature. However, to the best of our knowledge, no unique resource exists at the moment to facilitate this synthesis. In this paper, we describe the construction of a corpus of published behaviour change intervention evaluation reports aimed at smoking cessation. We also describe and release the annotation of 57 entities, that can be used as an off-the-shelf data resource for tasks such as entity recognition, etc. Both the corpus and the annotation dataset are being made available to the community.


Cross-lingual Structure Transfer for Zero-resource Event Extraction

Di Lu, Ananya Subburathinam, Heng Ji, Jonathan May, Shih-Fu Chang, Avi Sil and Clare Voss

Most of the current cross-lingual transfer learning methods for Information Extraction (IE) have been only applied to name tagging. To tackle more complex tasks such as event extraction we need to transfer graph structures (event trigger linked to multiple arguments with various roles) across languages. We develop a novel share-and-transfer framework to reach this goal with three steps: (1) Convert each sentence in any language to language-universal graph structures; in this paper we explore two approaches based on universal dependency parses and complete graphs, respectively. (2) Represent each node in the graph structure with a cross-lingual word embedding so that all sentences in multiple languages can be represented with one shared semantic space. (3) Using this common semantic space, train event extractors from English training data and apply them to languages that do not have any event annotations. Experimental results on three languages (Spanish, Russian and Ukrainian) without any annotations show this framework achieves comparable performance to a state-of-the-art supervised model trained from more than 1,500 manually annotated event mentions.


Cross-Domain Evaluation of Edge Detection for Biomedical Event Extraction

Alan Ramponi, Barbara Plank and Rosario Lombardo

Biomedical event extraction is a crucial task in order to automatically extract information from the increasingly growing body of biomedical literature. Despite advances in the methods in recent years, most event extraction systems are still evaluated in-domain and on complete event structures only. This makes it hard to determine the performance of intermediate stages of the task, such as edge detection, across different corpora. Motivated by these limitations, we present the first cross-domain study of edge detection for biomedical event extraction. We analyze differences between five existing gold standard corpora, create a standardized benchmark corpus, and provide a strong baseline model for edge detection. Experiments show a large drop in performance when the baseline is applied on out-of-domain data, confirming the need for domain adaptation methods for the task. To encourage research efforts in this direction, we make both the data and the baseline available to the research community:


Semantic Annotation for Improved Safety in Construction Work

Paul Thompson, Tim Yates, Emrah Inan and Sophia Ananiadou

Risk management is a vital activity to ensure employee safety in construction projects. Various documents provide important supporting evidence, including details of previous incidents, consequences and mitigation strategies. Potential hazards may depend on a complex set of project-specific attributes, including activities undertaken, location, equipment used, etc. However, finding evidence about previous projects with similar attributes can be problematic, since information about risks and mitigations is usually hidden within and may be dispersed across a range of different free text documents.  Automatic named entity recognition (NER), which identifies mentions of concepts in free text documents, is the first stage in structuring knowledge contained within them. While developing NER methods generally relies on annotated corpora, we are not aware of any such corpus targeted at concepts relevant to construction safety.  In response, we have designed a novel named entity annotation scheme and associated guidelines for this domain, which covers hazards, consequences, mitigation strategies and project attributes. Four health and safety experts used the guidelines to annotate a total of 600 sentences from accident reports; an average inter-annotator agreement rate of 0.79 F-Score shows that our work constitutes an important first step towards developing tools for detailed semantic analysis of construction safety documents.


Social Web Observatory: A Platform and Method for Gathering Knowledge on Entities from Different Textual Sources

Leonidas Tsekouras, Georgios Petasis, George Giannakopoulos and Aris Kosmopoulos

Within this work we describe a framework for the collection and summarization of information from the Web in an entity-driven manner. The framework consists of a set of appropriate workflows and the Social Web Observatory platform, which implements those workflows, supporting them through a language analysis pipeline. The pipeline includes text collection/crawling, identification of different entities, clustering of texts into events related to entities, entity-centric sentiment analysis, but also text analytics and visualization functionalities. The latter allow the user to take advantage of the gathered information as actionable knowledge: to understand the dynamics of the public opinion for a given entity over time and across real-world events. We describe the platform and the analysis functionality and evaluate the performance of the system, by allowing human users to score how the system fares in its intended purpose of summarizing entity-centered information from different sources in the Web.


Development of a Corpus Annotated with Medications and their Attributes in Psychiatric Health Records

Jaya Chaturvedi, Natalia Viani, Jyoti Sanyal, Chloe Tytherleigh, Idil Hasan, Kate Baird, Sumithra Velupillai, Robert Stewart and Angus Roberts

Free text fields within electronic health records (EHRs) contain valuable clinical information which is often missed when conducting research using EHR databases. One such type of information is medications which are not always available in structured fields, especially in mental health records. Most use cases that require medication information also generally require the associated temporal information (e.g. current or past) and attributes (e.g. dose, route, frequency). The purpose of this study is to develop a corpus of medication annotations in mental health records. The aim is to provide a more complete picture behind the mention of medications in the health records, by including additional contextual information around them, and to create a resource for use when developing and evaluating applications for the extraction of medications from EHR text. Thus far, an analysis of temporal information related to medications mentioned in a sample of mental health records has been conducted. The purpose of this analysis was to understand the complexity of medication mentions and their associated temporal information in the free text of EHRs, with a specific focus on the mental health domain.


Do not let the history haunt you: Mitigating Compounding Errors in Conversational Question Answering

Angrosh Mandya, James O’ Neill, Danushka Bollegala and Frans Coenen

The Conversational Question Answering (CoQA) task involves answering a sequence of inter-related conversational questions about a contextual paragraph. Although existing approaches employ human-written ground-truth answers for answering conversational questions at test time, in a realistic scenario, the CoQA model will not have any access to ground-truth answers for the previous questions, compelling the model to rely upon its own previously predicted answers for answering the subsequent questions.  In this paper, we find that compounding errors occur when using previously predicted answers at test time, significantly lowering the performance of CoQA systems.  To solve this problem, we propose a sampling strategy that dynamically selects between target answers and model predictions during training, thereby closely simulating the situation at test time. Further, we analyse the severity of this phenomena as a function of the question type, conversation length and domain type.


CLEEK: A Chinese Long-text Corpus for Entity Linking

Weixin Zeng, Xiang Zhao, Jiuyang Tang, Zhen Tan and Xuqian Huang

Entity linking, as one of the fundamental tasks in natural language processing, is crucial to knowledge fusion, knowledge base construction and update. Nevertheless, in contrast to the research on entity linking for English text, which undergoes continuous development, the Chinese counterpart is still in its infancy. One prominent issue lies in publicly available annotated datasets and evaluation benchmarks, which are lacking and deficient. In specific, existing Chinese corpora for entity linking were mainly constructed from noisy short texts, such as microblogs and news headings, where long texts were largely overlooked, which yet constitute a wider spectrum of real-life scenarios. To address the issue, in this work, we build CLEEK, a Chinese corpus of multi-domain long text for entity linking, in order to encourage advancement of entity linking in languages besides English. The corpus consists of 100 documents from diverse domains, and is publicly accessible. Moreover, we devise a measure to evaluate the difficulty of documents with respect to entity linking, which is then used to characterize the corpus. Additionally, the results of two baselines and seven state-of-the-art solutions on CLEEK are reported and compared. The empirical results validate the usefulness of CLEEK and the effectiveness of proposed difficulty measure.


The Medical Scribe: Corpus Development and Model Performance Analyses

Izhak Shafran, Nan Du, Linh Tran, Amanda Perry, Lauren Keyes, Mark Knichel, Ashley Domin, Lei Huang, Yu-hui Chen, Gang Li, Mingqiu Wang, Laurent El Shafey, Hagen Soltau and Justin Stuart Paul

There is a growing interest in creating tools to assist in clinical note generation using the audio of provider-patient encounters. Motivated by this goal and with the help of providers and medical scribes, we developed an annotation scheme to extract relevant clinical concepts. We used this annotation scheme to label a corpus of about 6k clinical encounters. This was used to train a state-of-the-art tagging model. We report ontologies, labeling results, model performances, and detailed analyses of the results. Our results show that the entities related to medications can be extracted with a relatively high accuracy of 0.90 F-score, followed by symptoms at 0.72 F-score, and conditions at 0.57 F-score. In our task, we not only identify where the symptoms are mentioned but also map them to canonical forms as they appear in the clinical notes. Of the different types of errors, in about 19-38% of the cases, we find that the model output was correct, and about 17-32% of the errors do not impact the clinical note. Taken together, the models developed in this work are more useful than the F-scores reflect, making it a promising approach for practical applications.


A Contract Corpus for Recognizing Rights and Obligations

Ruka Funaki, Yusuke Nagata, Kohei Suenaga and Shinsuke Mori

A contract is a legal document executed by two or more parties. It is important for these parties to precisely understand their rights and obligations that are described in the contract. However, understanding the content of a contract is sometimes difficult and costly, particularly if the contract is long and complicated. Therefore, a language-processing system that can present information concerning rights and obligations found within a given contract document would help a contracting party to make better decisions. As a step toward the development of such a language-processing system, in this paper, we describe the annotated corpus of contract documents that we built. Our corpus is annotated so that a language-processing system can recognize a party’s rights and obligations. The annotated information includes the parties involved in the contract, the rights and obligations of the parties, the conditions and the exceptions under which these rights and obligations to take effect. The corpus was built based on 46 English contracts and 25 Japanese contracts drafted by lawyers. We explain how we annotated the corpus and the statistics of the corpus. We also report the results of the experiments for recognizing rights and obligations.


Recognition of Implicit Geographic Movement in Text

Scott Pezanowski and Prasenjit Mitra

Analyzing the geographic movement of humans, animals, and other phenomena is a growing field of research. This research has benefited urban planning, logistics, animal migration understanding, and much more. Typically, the movement is captured as precise geographic coordinates and time stamps with Global Positioning Systems (GPS). Although some research uses computational techniques to take advantage of implicit movement in descriptions of route directions, hiking paths, and historical exploration routes, innovation would accelerate with a large and diverse corpus. We created a corpus of sentences labeled as describing geographic movement or not and including the type of entity moving. Creating this corpus proved difficult without any comparable corpora to start with, high human labeling costs, and since movement can at times be interpreted differently. To overcome these challenges, we developed an iterative process employing hand labeling, crowd voting for confirmation, and machine learning to predict more labels. By merging advances in word embeddings with traditional machine learning models and model ensembling, prediction accuracy is at an acceptable level to produce a large silver-standard corpus despite the small gold-standard corpus training set. Our corpus will likely benefit computational processing of geography in text and spatial cognition, in addition to detection of movement.


Extraction of the Argument Structure of Tokyo Metropolitan Assembly Minutes: Segmentation of Question-and-Answer Sets

Keiichi Takamaru, Yasutomo Kimura, Hideyuki Shibuki, Hokuto Ototake, Yuzu Uchida, Kotaro Sakamoto, Madoka Ishioroshi, Teruko Mitamura and Noriko Kando

In this study, we construct a corpus of Japanese local assembly minutes. All speeches in an assembly were transcribed into a local assembly minutes based on the local autonomy law. Therefore, the local assembly minutes form an extremely large amount of text data. Our ultimate objectives were to summarize and present the arguments in the assemblies, and to use the minutes as primary information for arguments in local politics. To achieve this, we structured all statements in assembly minutes. We focused on the structure of the discussion, i.e., the extraction of question and answer pairs. We organized the shared task “QA Lab-PoliInfo” in NTCIR 14. We conducted a “segmentation task” to identify the scope of one question and answer in the minutes as a sub task of the shared task. For the segmentation task, 24 runs from five teams were submitted. Based on the obtained results, the best recall was 1.000, best precision was 0.940, and best F-measure was 0.895.


A Term Extraction Approach to Survey Analysis in Health Care

Cécile Robin, Mona Isazad Mashinchi, Fatemeh Ahmadi Zeleti, Adegboyega Ojo and Paul Buitelaar

The voice of the customer has for a long time been a key focus of businesses in all domains. It has received a lot of attention from the research community in Natural Language Processing (NLP) resulting in many approaches to analyzing customers feedback  ((aspect-based) sentiment analysis, topic modeling, etc.). In the health domain, public and private bodies are increasingly prioritizing patient engagement for assessing the quality of the service given at each stage of the care. Patient and customer satisfaction analysis relate in many ways. In the domain of health particularly, a more precise and insightful analysis is needed to help practitioners locate potential issues and plan actions accordingly. We introduce here an approach to patient experience with the analysis of free text questions from the 2017 Irish National Inpatient Survey campaign using term extraction as a means to highlight important and insightful subject matters raised by patients. We evaluate the results by mapping them to a manually constructed framework following the Activity, Resource, Context (ARC) methodology (Ordenes, 2014) and specific to the health care environment, and compare our results against manual annotations done on the full 2017 dataset based on those categories.


A Scientific Information Extraction Dataset for Nature Inspired Engineering

Ruben Kruiper, Julian F.V. Vincent, Jessica Chen-Burger, Marc P.Y. Desmulliez and Ioannis Konstas

Nature has inspired various ground-breaking technological developments in applications ranging from robotics to aerospace engineering and the manufacturing of medical devices. However, accessing the information captured in scientific biology texts is a time-consuming and hard task that requires domain-specific knowledge. Improving access for outsiders can help interdisciplinary research like Nature Inspired Engineering. This paper describes a dataset of 1,500 manually-annotated sentences that express domain-independent relations between central concepts in a scientific biology text, such as trade-offs and correlations. The arguments of these relations can be Multi Word Expressions and have been annotated with modifying phrases to form non-projective graphs. The dataset allows for training and evaluating Relation Extraction algorithms that aim for coarse-grained typing of scientific biological documents, enabling a high-level filter for engineers.


Automated Discovery of Mathematical Definitions in Text

Natalia Vanetik, Marina Litvak, Sergey Shevchuk and Lior Reznik

Automatic definition extraction from texts is an important task that has numerous applications in several natural language processing fields such as summarization, analysis of scientific texts, automatic taxonomy generation, ontology generation, concept identification, and question answering. For definitions that are contained within a single sentence, this problem can be viewed as a binary classification of sentences into definitions and non-definitions. Definitions in scientific literature can be generic (Wikipedia) or more formal (mathematical articles). In this paper, we focus on automatic detection of one-sentence definitions in mathematical texts, which are difficult to separate from surrounding text. We experiment with several data representations, which include sentence syntactic structure and word embeddings, and apply deep learning methods such as convolutional neural network (CNN) and recurrent neural network (RNN), in order to identify mathematical definitions. Our experiments demonstrate the superiority of CNN and its combination with RNN, applied on the syntactically-enriched input representation. We also present a new dataset for definition extraction from mathematical texts. We demonstrate that the use of this dataset for training learning models improves the quality of definition extraction when these models are then used for other definition datasets. Our experiments with different domains approve that mathematical definitions require special treatment, and that using cross-domain learning is inefficient.


WN-Salience: A Corpus of News Articles with Entity Salience Annotations

Chuan Wu, Evangelos Kanoulas, Maarten de Rijke and Wei Lu

Entities can be found in various text genres, ranging from tweets and web pages to user queries submitted to web search engines. Existing research either considers all entities in the text equally important, or heuristics are used to measure their salience. We believe that a key reason for the relatively limited work on entity salience is the lack of appropriate datasets. To support research on entity salience, we present a new dataset, the WikiNews Salience dataset (WN-Salience), which can be used to benchmark tasks such as entity salience detection and salient entity linking. WN-Salience is built on top of Wikinews, a Wikimedia project whose mission is to present reliable news articles. Entities in Wikinews articles are identified by the authors of the articles and are linked to Wikinews categories when they are salient or to Wikipedia pages otherwise. The dataset is built automatically, and consists of approximately 7,000 news articles, and 90,000 in-text entity annotations. We compare the WN-Salience dataset against existing datasets on the task and analyze their differences. Furthermore, we conduct experiments on entity salience detection; the results demonstrate that WN-Salience is a challenging testbed that is complementary to existing ones.


Event Extraction from Unstructured Amharic Text

ephrem tadesse, Rosa Tsegaye and Kuulaa Qaqqabaa

In information extraction, event extraction is one of the types that extract the specific knowledge of certain incidents from texts.  Event extraction has been done on different languages text but not on one of the Semitic language, Amharic. In this study, we present a system that extracts an event from unstructured Amharic text. The system has designed by the integration of supervised machine learning and rule-based approaches. We call this system a hybrid system. The system uses the supervised machine learning to detect events from the text and the handcrafted and the rule-based rules to extract the event from the text.  For the event extraction, we have been using event arguments.   Event arguments identify event triggering words or phrases that clearly express the occurrence of the event. The event argument attributes can be verbs, nouns, sometimes adjectives (such as ̃rg/wedding) and time as well. The hybrid system has compared with the standalone rule-based method that is well known for event extraction. The study has shown that the hybrid system has outperformed the standalone rule-based method.


Comparing Machine Learning and Deep Learning Approaches on NLP Tasks for the Italian Language

Bernardo Magnini, Alberto Lavelli and Simone Magnolini

We present a comparison between deep learning and traditional machine learning methods for various NLP tasks in Italian. We carried on experiments using  available datasets (e.g., from the Evalita shared tasks) on two sequence tagging tasks  (i.e., named entities recognition and nominal entities recognition) and four classification tasks (i.e., lexical relations among words, semantic relations among sentences, sentiment analysis and text classification). We show that deep learning approaches outperform traditional machine learning algorithms in sequence tagging, while for classification tasks that heavily rely on semantics approaches based on feature engineering are still competitive. We think that a similar analysis could be carried out for other languages to provide an assessment of machine learning / deep learning models across different languages.


MyFixit: An Annotated Dataset, Annotation Tool, and Baseline Methods for Information Extraction from Repair Manuals

Nima Nabizadeh, Dorothea Kolossa and Martin Heckmann

Text instructions are among the most widely used media for learning and teaching. Hence, to create assistance systems that are capable of supporting humans autonomously in new tasks, it would be immensely productive, if machines were enabled to extract task knowledge from such text instructions. In this paper, we, therefore, focus on information extraction (IE) from the instructional text in repair manuals. This brings with it the multiple challenges of information extraction from the situated and technical language in relatively long and often complex instructions. To tackle these challenges, we introduce a semi-structured dataset of repair manuals. The dataset is annotated in a large category of devices, with information that we consider most valuable for an automated repair assistant, including the required tools and the disassembled parts at each step of the repair progress. We then propose methods that can serve as baselines for this IE task: an unsupervised method based on a bags-of-n-grams similarity for extracting the needed tools in each repair step, and a deep-learning-based sequence labeling model for extracting the identity of disassembled parts. These baseline methods are integrated into a semi-automatic web-based annotator application that is also available along with the dataset.


Towards Entity Spaces

Marieke van Erp and Paul Groth

Entities are a central element of knowledge bases and are important input to many knowledge-centric tasks including text analysis. For example, they allow us to find documents relevant to a specific entity irrespective of the underlying syntactic expression within a document. However, the entities that are commonly represented in knowledge bases are often a simplification of what is truly being referred to in text. For example, in a knowledge base, we may have an entity for Germany as a country but not for the more fuzzy concept of Germany that covers notions of German Population, German Drivers, and the German Government. Inspired by recent advances in contextual word embeddings, we introduce the concept of entity spaces – specific representations of a set of associated entities with near-identity. Thus, these entity spaces provide a handle to an amorphous grouping of entities. We developed a proof-of-concept for English showing how, through the introduction of entity spaces in the form of disambiguation pages, the recall of entity linking can be improved.


Love Me, Love Me, Say (and Write!) that You Love Me: Enriching the WASABI Song Corpus with Lyrics Annotations

Michael Fell, Elena Cabrio, Elmahdi Korfed, Michel Buffa and Fabien Gandon

We present the WASABI Song Corpus, a large corpus of songs enriched with metadata extracted from  music databases on  the  Web, and resulting from the processing of song lyrics and from audio analysis. More specifically, given that lyrics encode an important part of the semantics of a song, we focus here on the description of the methods we proposed to extract relevant information from the lyrics, as their structure segmentation, their topic, the explicitness of the lyrics content, the salient passages of a song and the emotions conveyed. The creation of the resource is still ongoing: so far, the corpus contains 1.73M songs with lyrics (1.41M unique lyrics)  annotated at different levels with the output of the above mentioned methods. Such corpus labels and the provided methods can be exploited by music search engines and music professionals (e.g. journalists, radio presenters) to better handle large collections of lyrics, allowing an  intelligent browsing, categorization and segmentation recommendation of songs.


Evaluating Information Loss in Temporal Dependency Trees

Mustafa Ocal and Mark Finlayson

Temporal Dependency Trees (TDTs) have emerged as an alternative to full temporal graphs for representing the temporal structure of texts, with a key advantage being that TDTs can be straightforwardly computed using adapted dependency parsers. Relative to temporal graphs, the tree form of TDTs naturally omits some fraction of temporal relationships, which intuitively should decrease the amount of temporal information available, potentially increasing temporal indeterminacy of the global ordering. We demonstrate a new method for quantifying this indeterminacy that relies on solving temporal constraint problems to extract timelines, and show that TDTs result in up to a 109% increase in temporal indeterminacy over their corresponding temporal graphs for the three corpora we examine. On average, the increase in indeterminacy is 32%, and we show that this increase is a result of the TDT representation eliminating on average only 2.4% of total temporal relations. This result suggests that small differences can have big effects in temporal graphs, and the use of TDTs must be balanced against their deficiencies, with tasks requiring an accurate global temporal ordering potentially calling for use of the full temporal graph


Populating Legal Ontologies using Semantic Role Labeling

Llio Humphreys, Guido Boella, Luigi Di Caro, Livio Robaldo, Leon van der Torre, Sepideh Ghanavati and Robert Muthuri

This paper is concerned with the goal of maintaining legal information and compliance systems: the ‘resource consumption bottleneck’ of creating semantic technologies manually. The use of automated information extraction techniques could significantly reduce this bottleneck. The research question of this paper is: How to address the resource bottleneck problem of creating specialist knowledge management systems? In particular, how to semi-automate the extraction of norms and their elements to populate legal ontologies? This paper shows that the acquisition paradox can be addressed by combining state-of-the-art general-purpose NLP modules with pre- and post-processing using rules based on domain knowledge. It describes a Semantic Role Labeling based information extraction system to extract norms from legislation and represent them as structured norms in legal ontologies. The output is intended to help make laws more accessible, understandable, and searchable in legal document management systems such as Eunomos (Boella et al., 2016).


PST 2.0 – Corpus of Polish Spatial Texts

Michał Marcińczuk, Marcin Oleksy and Jan Wieczorek

In the paper, we focus on modeling spatial expressions in texts. We present the guidelines used to annotate the PST 2.0 (Corpus of Polish Spatial Texts) — a corpus designed for training and testing the tools for spatial expression recognition. The corpus contains a set of texts gathered from texts collected from travel blogs available under Creative Commons license. We have defined our guidelines based on three existing specifications for English (SpatialML, SpatialRole Labelling from SemEval-2013 Task 3 and ISO-Space1.4 from SpaceEval 2014). We briefly present the existing specifications and discuss what modifications have been made to adapt the guidelines to the characteristics of the Polish language. We also describe the process of data collection and manual annotation, including inter-annotator agreement calculation and corpus statistics. In the end, we present detailed statistics of the PST 2.0 corpus, which include the number of components, relations, expressions, and the most common values of spatial indicators, motion indicators, path indicators, distances, directions, and regions.


Natural Language Premise Selection: Finding Supporting Statements for Mathematical Text

Deborah Ferreira and André Freitas

Mathematical text is written using a combination of words and mathematical expressions. This combination, along with a specific way of structuring sentences makes it challenging for state-of-art NLP tools to understand and reason on top of mathematical discourse. In this work, we propose a new NLP task, the natural premise selection, which is used to retrieve supporting definitions and supporting propositions that are useful for generating an informal mathematical proof for a particular statement. We also make available a dataset, NL-PS, which can be used to evaluate different approaches for the natural premise selection task. Using different baselines, we demonstrate the underlying interpretation challenges associated with the task.


Odinson: A Fast Rule-based Information Extraction Framework

Marco A. Valenzuela-Escárcega, Gus Hahn-Powell and Dane Bell

We present Odinson, a rule-based information extraction framework, which couples a simple yet powerful pattern language that can operate over multiple representations of text, with a runtime system that operates in near real time. In the Odinson query language, a single pattern may combine regular expressions over surface tokens with regular expressions over graphs such as syntactic dependencies. To guarantee the rapid matching of these patterns, our framework indexes most of the necessary information for matching patterns, including directed graphs such as syntactic dependencies, into a custom Lucene index. Indexing minimizes the amount of expensive pattern matching that must take place at runtime. As a result, the runtime system matches a syntax-based graph traversal in 2.8 seconds in a corpus of over 134 million sentences, nearly 150,000 times faster than its predecessor.


The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources

Jennifer D’Souza, Anett Hoppe, Arthur Brack, Mohmad Yaser Jaradeh, Sören Auer and Ralph Ewerth

We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. We describe the creation of such a multidisciplinary corpus and highlight the obtained findings in terms of the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of the domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark obtainable for automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated 3-step entity resolution procedure for human annotation of the scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of Babelfy returned encyclopedic links and lexicographic senses for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts as well as their semantic disambiguation in a wide-ranging setting as STEM is reasonable.


MathAlign: Linking Formula Identifiers to their Contextual Natural Language Descriptions

Maria Alexeeva, Rebecca Sharp, Marco A. Valenzuela-Escárcega, Jennifer Kadowaki, Adarsh Pyarelal and Clayton Morrison

Extending machine reading approaches to extract mathematical concepts and their descriptions is useful for a variety of tasks, ranging from mathematical information retrieval to increasing accessibility of scientific documents for the visually impaired. This entails segmenting mathematical formulae into identifiers and linking them to their natural language descriptions. We propose a rule-based approach for this task, which extracts LaTeX representations of formula identifiers and links them to their in-text descriptions, given only the original PDF and the location of the formula of interest. We also present a novel evaluation dataset for this task, as well as the tool used to create it.


Domain Adapted Distant Supervision for Pedagogically Motivated Relation Extraction

Oscar Sainz, Oier Lopez de Lacalle, Itziar Aldabe and Montse Maritxalar

In this paper we present a relation extraction system that given a text extracts pedagogically motivated relation types, as a previous step to obtaining a semantic representation of the text which will make possible to automatically generate questions for reading comprehension. The system maps pedagogically motivated relations with relations from ConceptNet and deploys Distant Supervision for relation extraction. We run a study on a subset of those relationships in order to analyse the viability of our approach. For that, we build a domain-specific relation extraction system and explore two relation extraction models: a state-of-the-art model based on transfer learning and a discrete feature based machine learning model. Experiments show that the neural model obtains better results in terms of F-score and we yield promising results on the subset of relations suitable for pedagogical purposes. We thus consider that distant supervision for relation extraction is a valid approach in our target domain, i.e. biology.


Temporal Histories of Epidemic Events (THEE): A Case Study in Temporal Annotation for Public Health

Jingcheng Niu, Victoria Ng, Gerald Penn and Erin E. Rees

We present a new temporal annotation standard, THEE-TimeML, and a corpus TheeBank enabling precise temporal information extraction (TIE) for event-based surveillance (EBS) systems in the public health domain. Current EBS must estimate the occurrence time of each event based on coarse document metadata such as document publication time. Because of the complicated language and narration style of news articles, estimated case outbreak times are often inaccurate or even erroneous. Thus, it is necessary to create annotation standards and corpora to facilitate the development of TIE systems in the public health domain to address this problem.We will discuss the adaptations that have proved necessary for this domain as we present THEE-TimeML and TheeBank. Finally, we document the corpus annotation process, and demonstrate the immediate benefit to public health applications brought by the annotations.


Exploiting Citation Knowledge in Personalised Recommendation of Recent Scientific Publications

Anita Khadka, Iván Cantador and Miriam Fernandez

In this paper we address the problem of providing personalised recommendations of recent scientific publications to a particular user, and explore the use of citation knowledge to do so. For this purpose, we have generated a novel dataset that captures authors’ publication history and is enriched with different forms of paper citation knowledge, namely citation graphs, citation positions, citation contexts, and citation types.  Through a number of empirical experiments on such dataset, we show that the exploitation of the extracted knowledge, particularly the type of citation, is a promising approach for recommending recently published papers that may not be cited yet.  The dataset, which we make publicly available, also represents a valuable resource for further investigation on academic information retrieval and filtering.


A Platform for Event Extraction in Hindi

Sovan Kumar Sahoo, Saumajit Saha, Asif Ekbal and Pushpak Bhattacharyya

Event Extraction is an important task in the widespread field of Natural Language Processing (NLP). Though this task is adequately addressed in English with sufficient resources, we are unaware of any benchmark setup in Indian languages. Hindi is one of the most widely spoken languages in the world. In this paper, we present an Event Extraction framework for Hindi language by creating an annotated resource for benchmarking, and then developing deep learning based models to set as the baselines. We crawl more than seventeen hundred disaster related Hindi news articles from the various news sources. We also develop deep learning based models for Event Trigger Detection and Classification, Argument Detection and Classification and Event-Argument Linking.


Rad-SpatialNet: A Frame-based Resource for Fine-Grained Spatial Relations in Radiology Reports

Surabhi Datta, Morgan Ulinski, Jordan Godfrey-Stovall, Shekhar Khanpara, Roy F. Riascos-Castaneda and Kirk Roberts

This paper proposes a representation framework for encoding spatial language in radiology based on frame semantics. The framework is adopted from the existing SpatialNet representation in the general domain with the aim to generate more accurate representations of spatial language used by radiologists. We describe Rad-SpatialNet in detail along with illustrating the importance of incorporating domain knowledge in understanding the varied linguistic expressions involved in different radiological spatial relations. This work also constructs a corpus of 400 radiology reports of three examination types (chest X-rays, brain MRIs, and babygrams) annotated with fine-grained contextual information according to this schema. Spatial trigger expressions and elements corresponding to a spatial frame are annotated. We apply BERT-based models (BERT-Base and BERT- Large) to first extract the trigger terms (lexical units for a spatial frame) and then to identify the related frame elements. The results of BERT- Large are decent, with F1 of 77.89 for spatial trigger extraction and an overall F1 of 81.61 and 66.25 across all frame elements using gold and predicted spatial triggers respectively. This frame-based resource can be used to develop and evaluate more advanced natural language processing (NLP) methods for extracting fine-grained spatial information from radiology text in the future.


NLP Analytics in Finance with DoRe: A French 250M Tokens Corpus of Corporate Annual Reports

Corentin Masson and Patrick Paroubek

Recent advances in neural computing and word embeddings for semantic processing open many new applications areas which had been left unaddressed so far because of inadequate language understanding capacity. But this new kind of approaches rely even more on training data to be operational. Corpora for financial applications exists, but most of them concern stock market prediction and are in English. To address this need for the French language and regulation oriented applications which require a deeper understanding of the text content, we hereby present “DoRe”, a French and dialectal French Corpus for NLP analytics in Finance, Regulation and Investment. This corpus is composed of: (a) 1769 Annual Reports from 336 companies among the most capitalized companies in: France (Euronext Paris) & Belgium (Euronext Brussels), covering a time frame from 2009 to 2019, and (b) related MetaData containing information for each company about its ISIN code, capitalization and sector. This corpus is designed to be as modular as possible in order to allow for maximum reuse in different tasks pertaining to Economics, Finance and Regulation. After presenting existing resources, we relate the construction of the DoRe corpus and the rationale behind our choices, concluding on the spectrum of possible uses of this new resource for NLP applications.


The Language of Brain Signals: Natural Language Processing of Electroencephalography Reports

Ramon Maldonado and Sanda Harabagiu

Brain signals are captured by clinical electroencephalography (EEG) which is an excellent tool for probing neural function. When EEG tests are performed, a textual EEG report is generated by the neurologist to document the findings, thus using language that describes the brain signals and its clinical correlations. Even with the impetus provided by the BRAIN initiative (, there are no annotations available in texts that capture language describing the brain activities and their correlations with various pathologies. In this paper we describe an annotation effort carried out on a large corpus of EEG reports, providing examples of EEG-specific and clinically relevant concepts. In addition, we detail our annotation schema for brain signal attributes. We also discuss the resulting annotation of long-distance relations between concepts in EEG reports. By exemplifying a self-attention joint-learning to predict similar annotations in the EEG report corpus, we discuss the promising results, hoping that our effort will inform the design of novel knowledge capture techniques that will include the language of brain signals.

  • LREC 2020 Paper Dissemination (1/10)
  • LREC 2020 Paper Dissemination (2/10)
  • LREC 2020 Paper Dissemination (3/10)
  • LREC 2020 Paper Dissemination (4/10)
  • LREC 2020 Paper Dissemination (5/10)
  • LREC 2020 Paper Dissemination (6/10)
  • LREC 2020 Paper Dissemination (7/10)
  • LREC 2020 Paper Dissemination (8/10)
  • LREC 2020 Paper Dissemination (9/10)
  • LREC 2020 Paper Dissemination (10/10)