LREC 2020 Paper Dissemination (1/10)

LREC 2020 was not held in Marseille this year and only the Proceedings were published.

The ELRA Board and the LREC 2020 Programme Committee now feel that those papers should be disseminated again, in a thematic-oriented way, shedding light on specific “topics/sessions”.

Packages with several sessions will be disseminated every Tuesday for 10 weeks, from Nov 10, 2020 until the end of January 2021.

Each session displays papers’ title and authors, with corresponding abstract (for ease of reading) and url, in like manner as the Book of Abstracts we used to print and distribute at LRECs.

We hope that you discover interesting, even exciting, work that may be useful for your own research.

Group of papers sent on November 10, 2020

Links to each session

Session Anaphora, Coreference


Neural Mention Detection

Juntao Yu, Bernd Bohnet and Massimo Poesio

Mention detection is an important preprocessing step for annotation and interpretation in applications such as NER and coreference resolution, but few stand-alone neural models have been proposed able to handle the full range of mentions. In this work, we propose and compare three neural network-based approaches to mention detection. The first approach is based on the mention detection part of a state of the art coreference resolution system; the second uses ELMO embeddings together with a bidirectional LSTM and a biaffine classifier; the third approach uses the recently introduced BERT model. Our best model (using a biaffine classifier) achieves gains of up to 1.8 percentage points on mention recall when compared with a strong baseline in a HIGH RECALL coreference annotation setting. The same model achieves improvements of up to 5.3 and 6.2 p.p. when compared with the best-reported mention detection F1 on the CONLL and CRAC coreference data sets respectively in a HIGH F1 annotation setting. We then evaluate our models for coreference resolution by using mentions predicted by our best model in start-of-the-art coreference systems. The enhanced model achieved absolute improvements of up to 1.7 and 0.7 p.p. when compared with our strong baseline systems (pipeline system and end-to-end system) respectively. For nested NER, the evaluation of our model on the GENIA corpora shows that our model matches or outperforms state-of-the-art models despite not being specifically designed for this task.


A Cluster Ranking Model for Full Anaphora Resolution

Juntao Yu, Alexandra Uma and Massimo Poesio

Anaphora resolution (coreference) systems designed for the CONLL 2012 dataset typically cannot handle key aspects of the full anaphora resolution task such as the identification of singletons and of certain types of non-referring expressions (e.g., expletives), as these aspects are not annotated in that corpus. However, the recently released dataset for the CRAC 2018 Shared Task can now be used for that purpose. In this paper, we introduce an architecture to simultaneously identify non-referring expressions (including expletives, predicative {\NP}s, and other types)  and build coreference chains, including singletons.  Our cluster-ranking system uses an attention mechanism to determine the relative importance of the mentions in the same cluster.  Additional classifiers are used to identify singletons and non-referring markables. Our contributions are as follows. First all, we report the first result on the CRAC data using system mentions; our result is 5.8% better than the shared task baseline system, which used gold mentions. Second, we demonstrate that the availability of singleton clusters and non-referring expressions can lead to substantially improved performance on non-singleton clusters as well. Third, we show that despite our model not being designed specifically for the CONLL data, it achieves a score equivalent to that of the state-of-the-art system by Kantor and Globerson (2019)  on that dataset.


Mandarinograd: A Chinese Collection of Winograd Schemas

Timothée Bernard and Ting Han

This article introduces Mandarinograd, a corpus of Winograd Schemas in Mandarin Chinese. Winograd Schemas are particularly challenging anaphora resolution problems, designed to involve common sense reasoning and to limit the biases and artefacts commonly found in natural language understanding datasets. Mandarinograd contains the schemas in their traditional form, but also as natural language inference instances (ENTAILMENT or NO ENTAILMENT pairs) as well as in their fully disambiguated candidate forms. These two alternative representations are often used by modern solvers but existing datasets present automatically converted items that sometimes contain syntactic or semantic anomalies. We detail the difficulties faced when building this corpus and explain how weavoided the anomalies just mentioned. We also show that Mandarinograd is resistant to a statistical method based on a measure of word association.


On the Influence of Coreference Resolution on Word Embeddings in Lexical-semantic Evaluation Tasks

Alexander Henlein and Alexander Mehler

Coreference resolution (CR) aims to find all spans of a text that refer to the same entity. The F1-Scores on these task have been greatly improved by new developed End2End-approaches and transformer networks. The inclusion of CR as a pre-processing step is expected to lead to improvements in downstream tasks.  The paper examines this effect with respect to word embeddings. That is, we analyze the effects of CR on six different embedding methods and evaluate them in the context of seven lexical-semantic evaluation tasks and instantiation/hypernymy detection. Especially in the last tasks we hoped for a significant increase in performance.We show that all word embedding approaches do not benefit significantly from pronoun substitution. The measurable improvements are only marginal (around 0.5% in most test cases). We explain this result with the loss of contextual information, reduction of the relative occurrence of rare words and the lack of pronouns to be replaced.


NoEl: An Annotated Corpus for Noun Ellipsis in English

Payal Khullar, Kushal Majmundar and Manish Shrivastava

Ellipsis resolution has been identified as an important step to improve the accuracy of mainstream Natural Language Processing (NLP) tasks such as information retrieval, event extraction, dialog systems, etc. Previous computational work on ellipsis resolution has focused on one type of ellipsis, namely Verb Phrase Ellipsis (VPE) and a few other related phenomenon. We extend the study of ellipsis by presenting the No(oun)El(lipsis) corpus – an annotated corpus for noun ellipsis and closely related phenomenon using the first hundred movies of Cornell Movie Dialogs Dataset. The annotations are carried out in a standoff annotation scheme that encodes the position of the licensor, the antecedent boundary, and Part-of-Speech (POS) tags of the licensor and antecedent modifier. Our corpus has 946 instances of exophoric and endophoric noun ellipsis, making it the biggest resource of noun ellipsis in English, to the best of our knowledge. We present a statistical study of our corpus with novel insights on the distribution of noun ellipsis, its licensors and antecedents. Finally, we perform the tasks of detection and resolution of noun ellipsis with different classifiers trained on our corpus and report baseline results.


An Annotated Dataset of Coreference in English Literature

David Bamman, Olivia Lewke and Anya Mansoor

We present in this work a new dataset of coreference annotations for works of literature in English, covering 29,103 mentions in 210,532 tokens from 100 works of fiction published between 1719 and 1922.  This dataset differs from previous coreference corpora in containing documents whose average length (2,105.3 words) is four times longer than other benchmark datasets (463.7 for OntoNotes), and contains examples of difficult coreference problems common in literature.  This dataset allows for an evaluation of cross-domain performance for the task of coreference resolution, and analysis into the characteristics of long-distance within-document coreference.


GerDraCor-Coref: A Coreference Corpus for Dramatic Texts in German

Janis Pagel and Nils Reiter

Dramatic texts are a highly structured literary text type. Their quantitative analysis so far has relied on analysing structural properties (e.g., in the form of networks). Resolving coreferences is crucial for an analysis of the content of the character speech, but developing automatic coreference resolution (CR) systems depends on the existence of annotated corpora. In this paper, we present an annotated corpus of German dramatic texts, a preliminary analysis of the corpus as well as some baseline experiments on automatic CR. The analysis shows that with respect to the reference structure, dramatic texts are very different from news texts, but more similar to other dialogical text types such as interviews. Baseline experiments show a performance of 28.8 CoNLL score achieved by the rule-based CR system CorZu. In the future, we plan to integrate the (partial) information given in the dramatis personae into the CR model.


A Study on Entity Resolution for Email Conversations

Parag Pravin Dakle, Takshak Desai and Dan Moldovan

This paper investigates the problem of entity resolution for email conversations and presents a seed annotated corpus of email threads labeled with entity coreference chains. Characteristics of email threads concerning reference resolution are first discussed, and then the creation of the corpus and annotation steps are explained. Finally, performance of the current state-of-the-art deep learning models on

the seed corpus is evaluated and qualitative error analysis on the predictions obtained is presented.


Model-based Annotation of Coreference

Rahul Aralikatte and Anders Søgaard

Humans do not make inferences over texts, but over models of what texts are about. When annotators are asked to annotate coreferent spans of text, it is therefore a somewhat unnatural task. This paper presents an alternative in which we preprocess documents, linking entities to a knowledge base, and turn the coreference annotation task — in our case limited to pronouns — into an annotation task where annotators are asked to assign pronouns to entities. Model-based annotation is shown to lead to faster annotation and higher inter-annotator agreement, and we argue that it also opens up an alternative approach to coreference resolution. We present two new coreference benchmark datasets, for English Wikipedia and English teacher-student dialogues, and evaluate state-of-the-art coreference resolvers on them.


French Coreference for Spoken and Written Language

Rodrigo Wilkens, Bruno Oberle, Frédéric Landragin and Amalia Todirascu

Coreference resolution aims at identifying and grouping all mentions referring to the same entity. In French, most systems run different setups, making their comparison difficult. In this paper, we present an extensive comparison of several coreference resolution systems for French. The systems have been trained on two corpora (ANCOR for spoken language and Democrat for written language) annotated with coreference chains, and augmented with syntactic and semantic information. The models are compared with different configurations (e.g. with and without singletons). In addition, we evaluate mention detection and coreference resolution apart. We present a full-stack model that outperforms other approaches. This model allows us to study the impact of mention detection errors on coreference resolution. Our analysis shows that mention detection can be improved by focusing on boundary identification while advances in the pronoun-noun relation detection can help the coreference task. Another contribution of this work is the first end-to-end neural French coreference resolution model trained on Democrat (written texts), which compares to the state-of-the-art systems for oral French.


Cross-lingual Zero Pronoun Resolution

Abdulrahman Aloraini and Massimo Poesio

In languages like Arabic, Chinese, Italian, Japanese, Korean, Portuguese, Spanish, and many others, predicate arguments in certain syntactic positions are not realized instead of being realized as overt pronouns, and are thus called zero- or null-pronouns. Identifying and resolving such omitted arguments is crucial to machine translation, information extraction and other NLP tasks, but depends heavily on semantic coherence and lexical relationships. We propose a BERT-based cross-lingual model for zero pronoun resolution, and evaluate it on the Arabic and Chinese portions of OntoNotes 5.0. As far as we know, ours is the first neural model of zero-pronoun resolution for Arabic; and our model also outperforms the state-of-the-art for Chinese. In the paper we also evaluate BERT feature extraction and fine-tune models on the task, and compare them with our model. We also report on an investigation of BERT layers indicating which layer encodes the most suitable representation for the task.


Exploiting Cross-Lingual Hints to Discover Event Pronouns

Sharid Loáiciga, Christian Hardmeier and Asad Sayeed

Non-nominal co-reference is much less studied than nominal coreference, partly because of the lack of annotated corpora. We explore the possibility to exploit parallel multilingual corpora as a means of cheap supervision for the classification of three different readings of the English pronoun ‘it’: entity, event or pleonastic, from their translation in several languages. We found that the ‘event’ reading is not very frequent, but can be easily predicted provided that the construction used to translate the ‘it’ example is a pronoun as well. These cases, nevertheless, are not enough to generalize to other types of non-nominal reference.


MuDoCo: Corpus for Multidomain Coreference Resolution and Referring Expression Generation

Scott Martin, Shivani Poddar and Kartikeya Upasani

This paper proposes a new dataset, MuDoCo, composed of authored dialogs between a fictional user and a system who are given tasks to perform within six task domains. These dialogs are given rich linguistic annotations by expert linguists for several types of reference mentions and named entity mentions, either of which can span multiple words, as well as for coreference links between mentions. The dialogs sometimes cross and blend domains,  and the users exhibit complex task switching behavior such as re-initiating a previous task in the dialog by referencing the entities within it.  The dataset contains a total of 8,429 dialogs with an average of 5.36 turns per dialog.  We are releasing this dataset to encourage research in the field of coreference resolution, referring expression generation and identification within realistic, deep dialogs involving multiple domains. To demonstrate its utility, we also propose two baseline models for the downstream tasks: coreference resolution and referring expression generation.


Session Cognitive Methods

Back to Top

Affection Driven Neural Networks for Sentiment Analysis

Rong Xiang, Yunfei Long, Mingyu Wan, Jinghang Gu, Qin Lu and Chu-Ren Huang

Deep neural network models have played a critical role in sentiment analysis with promising results in the recent decade. One of the essential challenges, however, is how external sentiment knowledge can be effectively utilized. In this work, we propose a novel affection-driven approach to incorporating affective knowledge into neural network models. The  affective knowledge is obtained in the form of a lexicon under the Affect Control Theory  (ACT), which is represented by vectors of three-dimensional attributes in Evaluation, Potency, and Activity (EPA). The EPA vectors are mapped to an affective influence value and then integrated into Long Short-term Memory (LSTM) models to highlight affective terms. Experimental results show a consistent improvement of our approach over conventional LSTM models by 1.0% to 1.5% in accuracy on three large benchmark datasets. Evaluations across a variety of algorithms have also proven the effectiveness of leveraging affective terms for deep model enhancement.


The Alice Datasets: fMRI & EEG Observations of Natural Language Comprehension

Shohini Bhattasali, Jonathan Brennan, Wen-Ming Luh, Berta Franzluebbers and John Hale

The Alice Datasets are a set of datasets based on magnetic resonance data and electrophysiological data, collected while participants heard a story in English. Along with the datasets and the text of the story, we provide a variety of different linguistic and computational measures ranging from prosodic predictors to predictors capturing hierarchical syntactic information. These ecologically valid datasets can be easily reused to replicate prior work and to test new hypotheses about natural language comprehension in the brain.


Modelling Narrative Elements in a Short Story: A Study on Annotation Schemes and Guidelines

Elena Mikhalkova, Timofei Protasov, Polina Sokolova, Anastasiia Bashmakova and Anastasiia Drozdova

Text-processing algorithms that annotate main components of a story-line are presently in great need of corpora and well-agreed annotation schemes. The Text World Theory of cognitive linguistics offers a model that generalizes a narrative structure in the form of world building elements (characters, time and space) as well as text worlds themselves and switches between them. We have conducted a survey on how text worlds and their elements are annotated in different projects and proposed our own annotation scheme and instructions. We tested them, first, on the science fiction story “We Can Remember It for You Wholesale” by Philip K. Dick. Then we corrected the guidelines and added computer annotation of verb forms with the purpose to get a higher raters’ agreement and tested them again on the short story “The Gift of the Magi” by O. Henry. As a result, the agreement among the three raters has risen. With due revision and tests, our annotation scheme and guidelines can be used for annotating narratives in corpora of literary texts, criminal evidence, teaching materials, quests, etc.


Cortical Speech Databases For Deciphering the Articulatory Code

Harald Höge

The paper relates to following ‘AC-hypotheses’: The articulatory code (AC) is a neural code exchanging multi-item messages between the short-term memory and cortical areas as the vSMC and STG. In these areas already neurons active in the presence of articulatory features have been measured. The AC codes the content of speech segmented in chunks and is the same for both modalities – speech perception and speech production. Each AC-message is related to a syllable. The items of each message relate to coordinated articulatory gestures composing the syllable. The mechanism to transport the AC and to segment the auditory signal is based on Ɵ/γ-oscillations, where a Ɵ-cycle has the duration of a Ɵ-syllable. The paper describes the findings from neuroscience, phonetics and the science of evolution leading to the AC-hypotheses. The paper proposes to verify the AC-hypotheses by measuring the activity of all ensembles of neurons coding and decoding the AC. Due to state of the art, the cortical measurements to be prepared, done and further processed need a high effort from scientists active in different areas. We propose to launch a project to produce cortical speech databases with cortical recordings synchronized with the speech signal allowing to decipher the articulatory code.


ZuCo 2.0: A Dataset of Physiological Recordings During Natural Reading and Annotation

Nora Hollenstein, Marius Troendle, Ce Zhang and Nicolas Langer

We recorded and preprocessed ZuCo 2.0, a new dataset of simultaneous eye-tracking and electroencephalography during natural reading and during annotation. This corpus contains gaze and brain activity data of 739 English sentences, 349 in a normal reading paradigm and 390 in a task-specific paradigm, in which the 18 participants actively search for a semantic relation type in the given sentences as a linguistic annotation task. This new dataset complements ZuCo 1.0 by providing experiments designed to analyze the differences in cognitive processing between natural reading and annotation. The data is freely available here:


Linguistic, Kinematic and Gaze Information in Task Descriptions: The LKG-Corpus

Tim Reinboth, Stephanie Gross, Laura Bishop and Brigitte Krenn

Data from neuroscience and psychology suggest that sensorimotor cognition may be of central importance to language. Specifically, the linguistic structure of utterances referring to concrete actions may reflect the structure of the sensorimotor processing underlying the same action. To investigate this, we present the Linguistic, Kinematic and Gaze information in task descriptions Corpus (LKG-Corpus), comprising multimodal data on 13 humans, conducting take, put, and push actions, and describing these actions with 350 utterances. Recorded are audio, video, motion and eye-tracking data while participants perform an action and describe what they do. The dataset is annotated with orthographic transcriptions of utterances and information on: (a) gaze behaviours, (b) when a participant touched an object, (c) when an object was moved, (d) when a participant looked at the location s/he would next move the object to, (e) when the participant’s gaze was stable on an area. With the exception of the annotation of stable gaze, all annotations were performed manually. With the LKG-Corpus, we present a dataset that integrates linguistic, kinematic and gaze data with an explicit focus on relations between action and language. On this basis, we outline applications of the dataset to both basic and applied research.


The ACQDIV Corpus Database and Aggregation Pipeline

Anna Jancso, Steven Moran and Sabine Stoll

We present the ACQDIV corpus database and aggregation pipeline, a tool developed as part of the European Research Council (ERC) funded project ACQDIV, which aims to identify the universal cognitive processes that allow children to acquire any language. The corpus database represents 15 corpora from 14 typologically maximally diverse languages. Here we give an overview of the project, database, and our extensible software package for adding more corpora to the current language sample. Lastly, we discuss how we use the corpus database to mine for universal patterns in child language acquisition corpora and we describe avenues for future research.


Providing Semantic Knowledge to a Set of Pictograms for People with Disabilities: a Set of Links between WordNet and Arasaac: Arasaac-WN

Didier Schwab, Pauline Trial, Céline Vaschalde, Loïc Vial, Emmanuelle Esperanca-Rodier and Benjamin Lecouteux

This article presents a resource that links WordNet, the widely known lexical and semantic database, and Arasaac, the largest freely available database of pictograms. Pictograms are a tool that is more and more used by people with cognitive or communication disabilities. However, they are mainly used manually via workbooks, whereas caregivers and families would like to use more automated tools (use speech to generate pictograms, for example). In order to make it possible to use pictograms automatically in NLP applications, we propose a database that links them to semantic knowledge. This resource is particularly interesting for the creation of applications that help people with cognitive disabilities, such as text-to-picto, speech-to-picto, picto-to-speech… In this article, we explain the needs for this database and the problems that have been identified. Currently, this resource combines approximately 800 pictograms with their corresponding WordNet synsets and it is accessible both through a digital collection and via an SQL database. Finally, we propose a method with associated tools to make our resource language-independent: this method was applied to create a first text-to-picto prototype for the French language. Our resource is distributed freely under a Creative Commons license at the following URL:


Orthographic Codes and the Neighborhood Effect: Lessons from Information Theory

Stéphan Tulkens, Dominiek Sandra and Walter Daelemans

We consider the orthographic neighborhood effect: the effect that words with more orthographic similarity to other words are read faster. The neighborhood effect serves as an important control variable in psycholinguistic studies of word reading, and explains variance in addition to word length and word frequency. Following previous work, we model the neighborhood effect as the average distance to neighbors in feature space for three feature sets: slots, character ngrams and skipgrams. We optimize each of these feature sets and find evidence for language-independent optima, across five megastudy corpora from five alphabetic languages. Additionally, we show that weighting features using the inverse of mutual information (MI) improves the neighborhood effect significantly for all languages. We analyze the inverse feature weighting, and show that, across languages, grammatical morphemes get the lowest weights. Finally, we perform the same experiments on Korean Hangul, a non-alphabetic writing system, where we find the opposite results: slower responses as a function of denser neighborhoods, and a negative effect of inverse feature weighting. This raises the question of whether this is a cognitive effect, or an effect of the way we represent Hangul orthography, and indicates more research is needed.


Understanding the Dynamics of Second Language Writing through Keystroke Logging and Complexity Contours

Elma Kerz, Fabio Pruneri, Daniel Wiechmann, Yu Qiao and Marcus Ströbel

The purpose of this paper is twofold: [1] to introduce, to our knowledge, the largest available resource of keystroke logging (KSL) data generated by Etherpad (, an open-source, web-based collaborative real-time editor, that captures the dynamics of second language (L2) production and [2] to relate the behavioral data from KSL to indices of syntactic and lexical complexity of the texts produced obtained from a tool that implements a sliding window approach capturing the progression of complexity within a text. We present the procedures and measures developed to analyze a sample of 14,913,009 keystrokes in 3,454 texts produced by 512 university students (upper-intermediate to advanced L2 learners of English) (95,354 sentences and 18,32,027 words) aiming to achieve a better alignment between keystroke-logging measures and underlying cognitive processes, on the one hand, and L2 writing performance measures, on the other hand. The resource introduced in this paper is a reflection of increasing recognition of the urgent need to obtain ecologically valid data that have the potential to transform our current understanding of mechanisms underlying the development of literacy (reading and writing) skills.


Design of BCCWJ-EEG: Balanced Corpus with Human Electroencephalography

Yohei Oseki and Masayuki Asahara

The past decade has witnessed the happy marriage between natural language processing (NLP) and the cognitive science of language. Moreover, given the historical relationship between biological and artificial neural networks, the advent of deep learning has re-sparked strong interests in the fusion of NLP and the neuroscience of language. Importantly, this inter-fertilization between NLP, on one hand, and the cognitive (neuro)science of language, on the other, has been driven by the language resources annotated with human language processing data. However, there remain several limitations with those language resources on annotations, genres, languages, etc. In this paper, we describe the design of a novel language resource called BCCWJ-EEG, the Balanced Corpus of Contemporary Written Japanese (BCCWJ) experimentally annotated with human electroencephalography (EEG). Specifically, after extensively reviewing the language resources currently available in the literature with special focus on eye-tracking and EEG, we summarize the details concerning (i) participants, (ii) stimuli, (iii) procedure, (iv) data preprocessing, (v) corpus evaluation, (vi) resource release, and (vii) compilation schedule. In addition, potential applications of BCCWJ-EEG to neuroscience and NLP will also be discussed.


Using the RUPEX Multichannel Corpus in a Pilot fMRI Study on Speech Disfluencies

Katerina Smirnova, Nikolay Korotaev, Yana Panikratova, Irina Lebedeva, Ekaterina Pechenkova and Olga Fedorova

In modern linguistics and psycholinguistics speech disfluencies in real fluent speech are a well-known phenomenon. But it’s not still clear which components of brain systems are involved into its comprehension in a listener’s brain. In this paper we provide a pilot neuroimaging study of the possible neural correlates of speech disfluencies perception, using a combination of the corpus and functional magnetic-resonance imaging (fMRI) methods. Special technical procedure of selecting stimulus material from Russian multichannel corpus RUPEX allowed to create fragments in terms of requirements for the fMRI BOLD temporal resolution. They contain isolated speech disfluencies and their clusters. Also, we used the referential task for participants fMRI scanning. As a result, it was demonstrated that annotated multichannel corpora like RUPEX can be an important resource for experimental research in interdisciplinary fields. Thus, different aspects of communication can be explored through the prism of brain activation.


Session Collaborative Resource Construction & Crowdsourcing

Back to Top

Construction of an Evaluation Corpus for Grammatical Error Correction for Learners of Japanese as a Second Language

Aomi Koyama, Tomoshige Kiyuna, Kenji Kobayashi, Mio Arai and Mamoru Komachi

The NAIST Lang-8 Learner Corpora (Lang-8 corpus) is one of the largest second-language learner corpora. The Lang-8 corpus is suitable as a training dataset for machine translation-based grammatical error correction systems. However, it is not suitable as an evaluation dataset because the corrected sentences sometimes include inappropriate sentences. Therefore, we created and released an evaluation corpus for correcting grammatical errors made by learners of Japanese as a Second Language (JSL). As our corpus has less noise and its annotation scheme reflects the characteristics of the dataset, it is ideal as an evaluation corpus for correcting grammatical errors in sentences written by JSL learners. In addition, we applied neural machine translation (NMT) and statistical machine translation (SMT) techniques to correct the grammar of the JSL learners’ sentences and evaluated their results using our corpus. We also compared the performance of the NMT system with that of the SMT system.


Effective Crowdsourcing of Multiple Tasks for Comprehensive Knowledge Extraction

Sangha Nam, Minho Lee, Donghwan Kim, Kijong Han, Kuntae Kim, Sooji Yoon, Eun-kyung Kim and Key-Sun Choi

Information extraction from unstructured texts plays a vital role in the field of natural language processing. Although there has been extensive research into each information extraction task (i.e., entity linking, coreference resolution, and relation extraction), data are not available for a continuous and coherent evaluation of all information extraction tasks in a comprehensive framework. Given that each task is performed and evaluated with a different dataset, analyzing the effect of the previous task on the next task with a single dataset throughout the information extraction process is impossible. This paper aims to propose a Korean information extraction initiative point and promote research in this field by presenting crowdsourcing data collected for four information extraction tasks from the same corpus and the training and evaluation results for each task of a state-of-the-art model. These machine learning data for Korean information extraction are the first of their kind, and there are plans to continuously increase the data volume. The test results will serve as an initiative result for each Korean information extraction task and are expected to serve as a comparison target for various studies on Korean information extraction using the data collected in this study.


Developing a Corpus of Indirect Speech Act Schemas

Antonio Roque, Alexander Tsuetaki, Vasanth Sarathy and Matthias Scheutz

Resolving Indirect Speech Acts (ISAs), in which the intended meaning of an utterance is not identical to its literal meaning, is essential to enabling the participation of intelligent systems in peoples’ everyday lives.  Especially challenging are those cases in which the interpretation of such ISAs depends on context.  To test a system’s ability to perform ISA resolution we need a corpus, but developing such a corpus is difficult, especialy given the contex-dependent requirement. This paper addresses the difficult problems of constructing a corpus of ISAs, taking inspiration from relevant work in using corpora for reasoning tasks. We present a formal representation of ISA Schemas required for such testing, including a measure of the difficulty of a particular schema.  We develop an approach to authoring these schemas using corpus analysis and crowdsourcing, to maximize realism and minimize the amount of expert authoring needed.  Finally, we describe several characteristics of collected data, and potential future work.


Quality Estimation for Partially Subjective Classification Tasks via Crowdsourcing

Yoshinao Sato and Kouki Miyazawa

The quality estimation of artifacts generated by creators via crowdsourcing has great significance for the construction of a large-scale data resource. A common approach to this problem is to ask multiple reviewers to evaluate the same artifacts. However, the commonly used majority voting method to aggregate reviewers’ evaluations does not work effectively for partially subjective or purely subjective tasks because reviewers’ sensitivity and bias of evaluation tend to have a wide variety. To overcome this difficulty, we propose a probabilistic model for subjective classification tasks that incorporates the qualities of artifacts as well as the abilities and biases of creators and reviewers as latent variables to be jointly inferred. We applied this method to the partially subjective task of speech classification into the following four attitudes: agreement, disagreement, stalling, and question. The result shows that the proposed method estimates the quality of speech more effectively than a vote aggregation, measured by correlation with a fine-grained classification by experts.


Crowdsourcing in the Development of a Multilingual FrameNet: A Case Study of Korean FrameNet

Younggyun Hahm, Youngbin Noh, Ji Yoon Han, Tae Hwan Oh, Hyonsu Choe, Hansaem Kim and Key-Sun Choi

Using current methods, the construction of multilingual resources in FrameNet is an expensive and complex task. While crowdsourcing is a viable alternative, it is difficult to include non-native English speakers in such efforts as they often have difficulty with English-based FrameNet tools. In this work, we investigated cross-lingual issues in crowdsourcing approaches for multilingual FrameNets, specifically in the context of the newly constructed Korean FrameNet. To accomplish this, we evaluated the effectiveness of various crowdsourcing settings whereby certain types of information are provided to workers, such as English definitions in FrameNet or translated definitions. We then evaluated whether the crowdsourced results accurately captured the meaning of frames both cross-culturally and cross-linguistically, and found that by allowing the crowd workers to make intuitive choices, they achieved a quality comparable to that of trained FrameNet experts (F1 > 0.75). The outcomes of this work are now publicly available as a new release of Korean FrameNet 1.1.


Towards a Reliable and Robust Methodology for Crowd-Based Subjective Quality Assessment of Query-Based Extractive Text Summarization

Neslihan Iskender, Tim Polzehl and Sebastian Möller

The intrinsic and extrinsic quality evaluation is an essential part of the summary evaluation methodology usually conducted in a traditional controlled laboratory environment. However, processing large text corpora using these methods reveals expensive from both the organizational and the financial perspective. For the first time, and as a fast, scalable, and cost-effective alternative, we propose micro-task crowdsourcing to evaluate both the intrinsic and extrinsic quality of query-based extractive text summaries. To investigate the appropriateness of crowdsourcing for this task, we conduct intensive comparative crowdsourcing and laboratory experiments, evaluating nine extrinsic and intrinsic quality measures on 5-point MOS scales. Correlating results of crowd and laboratory ratings reveals high applicability of crowdsourcing for the factors overall quality, grammaticality, non-redundancy, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness. Further, we investigate the effect of the number of repetitions of assessments on the robustness of mean opinion score of crowd ratings, measured against the increase of correlation coefficients between crowd and laboratory. Our results suggest that the optimal number of repetitions in crowdsourcing setups, in which any additional repetitions do no longer cause an adequate increase of overall correlation coefficients, lies between seven and nine for intrinsic and extrinsic quality factors.


A Seed Corpus of Hindu Temples in India

Priya Radhakrishnan

Temples are an integral part of culture and heritage of India and are centers of religious practice for practicing Hindus. A scientific study of temples can reveal valuable insights into Indian culture and heritage. However to the best of our knowledge, learning resources that aid such a study are either not publicly available or non-existent. In this endeavour we present our initial efforts to create a corpus of Hindu temples in India. In this paper, we present a simple, re-usable platform that creates temple corpus from web text on temples. Curation is improved using classifiers trained on textual data in Wikipedia articles on Hindu temples. The training data is verified by human volunteers. The temple corpus consists of 4933 high accuracy facts about 573 temples. We make the corpus and the platform freely available. We also test the re-usability of the platform by creating a corpus of museums in India. We believe the temple corpus will aid scientific study of temples and the platform will aid in construction of similar corpuses. We believe both these will significantly contribute in promoting research on culture and heritage of a region.


Do You Believe It Happened? Assessing Chinese Readers’ Veridicality Judgments

Yu-Yun Chang and Shu-Kai Hsieh

This work collects and studies Chinese readers’ veridicality judgments to news events (whether an event is viewed as happening or not). For instance, in “The FBI alleged in court documents that Zazi had admitted having a handwritten recipe for explosives on his computer”, do people believe that Zazi had a handwritten recipe for explosives? The goal is to observe the pragmatic behaviors of linguistic features under context which affects readers in making veridicality judgments. Exploring from the datasets, it is found that features such as event-selecting predicates (ESP), modality markers, adverbs, temporal information, and statistics have an impact on readers’ veridicality judgments. We further investigated that modality markers with high certainty do not necessarily trigger readers to have high confidence in believing an event happened. Additionally, the source of information introduced by an ESP presents low effects to veridicality judgments, even when an event is attributed to an authority (e.g. “The FBI”). A corpus annotated with Chinese readers’ veridicality judgments is released as the Chinese PragBank for further analysis.


Creating Expert Knowledge by Relying on Language Learners: a Generic Approach for Mass-Producing Language Resources by Combining Implicit Crowdsourcing and Language Learning

Lionel Nicolas, Verena Lyding, Claudia Borg, Corina Forascu, Karën Fort, Katerina Zdravkova, Iztok Kosem, Jaka Čibej, Špela Arhar Holdt, Alice Millour, Alexander König, Christos Rodosthenous, Federico Sangati, Umair ul Hassan, Anisia Katinskaia, Anabela Barreiro, Lavinia Aparaschivei and Yaakov HaCohen-Kerner

We introduce in this paper a generic approach to combine implicit crowdsourcing and language learning in order to mass-produce language resources (LRs) for any language for which a crowd of language learners can be involved. We present the approach by explaining its core paradigm that consists in pairing specific types of LRs with specific exercises, by detailing both its strengths and challenges, and by discussing how much these challenges have been addressed at present. Accordingly, we also report on on-going proof-of-concept efforts aiming at developing the first prototypical implementation of the approach in order to correct and extend an LR called ConceptNet based on the input crowdsourced from language learners. We then present an international network called the European Network for Combining Language Learning with Crowdsourcing Techniques (enetCollect) that provides the context to accelerate the implementation of this generic approach. Finally, we exemplify how it can be used in several language learning scenarios to produce a multitude of NLP resources and how it can therefore alleviate the long-standing NLP issue of the lack of LRs.


MAGPIE: A Large Corpus of Potentially Idiomatic Expressions

Hessel Haagsma, Johan Bos and Malvina Nissim

Given the limited size of existing idiom corpora, we aim to enable progress in automatic idiom processing and linguistic analysis by creating the largest-to-date corpus of idioms for English. Using a fixed idiom list, automatic pre-extraction, and a strictly controlled crowdsourced annotation procedure, we show that it is feasible to build a high-quality corpus comprising more than 50K instances, an order of a magnitude larger than previous resources. Crucial ingredients of crowdsourcing were the selection of crowdworkers, clear and comprehensive instructions, and an interface that breaks down the task in small, manageable steps. Analysis of the resulting corpus revealed strong effects of genre on idiom distribution, providing new evidence for existing theories on what influences idiom usage. The corpus also contains rich metadata, and is made publicly available.


CRWIZ: A Framework for Crowdsourcing Real-Time Wizard-of-Oz Dialogues

Francisco Javier Chiyah Garcia, José Lopes, Xingkun Liu and Helen Hastie

Large corpora of task-based and open-domain conversational dialogues are hugely valuable in the field of data-driven dialogue systems. Crowdsourcing platforms, such as Amazon Mechanical Turk, have been an effective method for collecting such large amounts of data. However, difficulties arise when task-based dialogues require expert domain knowledge or rapid access to domain-relevant information, such as databases for tourism. This will become even more prevalent as dialogue systems become increasingly ambitious, expanding into tasks with high levels of complexity that require collaboration and forward planning, such as in our domain of emergency response. In this paper, we propose CRWIZ: a framework for collecting real-time Wizard of Oz dialogues through crowdsourcing for collaborative, complex tasks. This framework uses semi-guided dialogue to avoid interactions that breach procedures and processes only known to experts, while enabling the capture of a wide variety of interactions.


Effort Estimation in Named Entity Tagging Tasks

Inês Gomes, Rui Correia, Jorge Ribeiro and João Freitas

Named Entity Recognition (NER) is an essential component of many Natural Language Processing pipelines. However, building these language dependent models requires large amounts of annotated data. Crowdsourcing emerged as a scalable solution to collect and enrich data in a more time-efficient manner. To manage these annotations at scale, it is important to predict completion timelines and compute fair pricing for workers in advance. To achieve these goals, we need to know how much effort will be taken to complete each task. In this paper, we investigate which variables influence the time spent on a named entity annotation task by a human. Our results are two-fold: first, the understanding of the effort-impacting factors which we divided into cognitive load and input length; and second, the performance of the prediction itself. On the latter, through model adaptation and feature engineering, we attained a Root Mean Squared Error (RMSE) of 25.68 words per minute with a Nearest Neighbors model.


Using Crowdsourced Exercises for Vocabulary Training to Expand ConceptNet

Christos Rodosthenous, Verena Lyding, Federico Sangati, Alexander König, Umair ul Hassan, Lionel Nicolas, Jolita Horbacauskiene, Anisia Katinskaia and Lavinia Aparaschivei

In this work, we report on a crowdsourcing experiment conducted using the V-TREL vocabulary trainer which is accessed via a Telegram chatbot interface to gather knowledge on word relations suitable for expanding ConceptNet. V-TREL is built on top of a generic architecture implementing the implicit crowdsourding paradigm in order to offer vocabulary training exercises generated from the commonsense knowledge-base ConceptNet and — in the background — to collect and evaluate the learners’ answers to extend ConceptNet with new words. In the experiment about 90 university students learning English at C1 level, based on Common European Framework of Reference for Languages (CEFR), trained their vocabulary with V-TREL over a period of 16 calendar days. The experiment allowed to gather more than 12,000 answers from learners on different question types. In this paper we present in detail the experimental setup and the outcome of the experiment, which indicates the potential of our approach for both crowdsourcing data as well as fostering vocabulary skills.


Session Computer-Assisted Language Learning (CALL)

Back to Top

Predicting Multidimensional Subjective Ratings of Children’ Readings from the Speech Signals for the Automatic Assessment of Fluency

Gérard Bailly, Erika Godde, Anne-Laure Piat-Marchand and Marie-Line Bosse

The objective of this research is to estimate multidimensional subjective ratings of the reading performance of young readers from signal-based objective measures. We here combine linguistic features (number of correct words, repetitions, deletions, insertions uttered per minute . . . ) with phonetic features. Expressivity is particularly difficult to predict since there is no unique golden standard. We here propose a novel framework for performing such an estimation that exploits multiple references performed by adults and demonstrate its efficiency using recordings of 273 pupils.


Constructing Multimodal Language Learner Texts Using LARA: Experiences with Nine Languages

Elham Akhlaghi, Branislav Bédi, Fatih Bektaş, Harald Berthelsen, Matthias Butterweck, Cathy Chua, Catia Cucchiarin, Gülşen Eryiğit, Johanna Gerlach, Hanieh Habibi, Neasa Ní Chiaráin, Manny Rayner, Steinþór Steingrímsson and Helmer Strik

LARA (Learning and Reading Assistant) is an open source platform whose purpose is to support easy conversion of plain texts into multimodal online versions suitable for use by language learners. This involves semi-automatically tagging the text, adding other annotations and recording audio. The platform is suitable for creating texts in multiple languages via crowdsourcing techniques that can be used for teaching a language via reading and listening. We present results of initial experiments by various collaborators where we measure the time required to produce substantial LARA resources, up to the length of short novels, in Dutch, English, Farsi, French, German, Icelandic, Irish, Swedish and Turkish. The first results are encouraging. Although there are some startup problems, the conversion task seems manageable for the languages tested so far. The resulting enriched texts are posted online and are freely available in both source and compiled form.


A Dataset for Investigating the Impact of Feedback on Student Revision Outcome

Ildiko Pilan, John Lee, Chak Yan Yeung and Jonathan Webster

We present an annotation scheme and a dataset of teacher feedback provided for texts written by non-native speakers of English. The dataset consists of student-written sentences in their original and revised versions with teacher feedback provided for the errors. Feedback appears both in the form of open-ended comments and error category tags. We focus on a specific error type, namely linking adverbial (e.g. however, moreover) errors. The dataset has been annotated for two aspects: (i) revision outcome establishing whether the re-written student sentence was correct and (ii) directness, indicating whether teachers provided explicitly the correction in their feedback. This dataset allows for studies around the characteristics of teacher feedback and how these influence students’ revision outcome. We describe the data preparation process and we present initial statistical investigations regarding the effect of different feedback characteristics on revision outcome. These show that open-ended comments and mitigating expressions appear in a higher proportion of successful revisions than unsuccessful ones, while directness and metalinguistic terms have no effect. Given that the use of this type of data is relatively unexplored in natural language processing (NLP) applications, we also report some observations and challenges when working with feedback data.


Creating Corpora for Research in Feedback Comment Generation

Ryo Nagata, Kentaro Inui and Shin’ichiro Ishikawa

In this paper, we report on datasets that we created for research in feedback comment generation — a task of automatically generating feedback comments such as a hint or an explanatory note for writing learning. There has been almost no such corpus open to the public and accordingly there has been a very limited amount of work on this task. In this paper, we first discuss the principle and guidelines for feedback comment annotation. Then, we describe two corpora that we have manually annotated with feedback comments (approximately 50,000 general comments and 6,700 on preposition use). A part of the annotation results is now available on the web, which will facilitate research in feedback comment generation


Using Multilingual Resources to Evaluate CEFRLex for Learner Applications

Johannes Graën, David Alfter and Gerold Schneider

The Common European Framework of Reference for Languages (CEFR) defines six levels of learner proficiency,  and links them to particular communicative abilities.  The CEFRLex project aims at compiling lexical resources that link single words and multi-word expressions to particular CEFR levels.  The resources are thought to reflect second language learner needs as they are compiled from CEFR-graded textbooks and other learner-directed texts. In this work, we investigate the applicability of CEFRLex resources for building language learning applications. Our main concerns were that vocabulary in language learning materials might be sparse, i.e. that not all vocabulary items that belong to a particular level would also occur in materials for that level, and, on the other hand, that vocabulary items might be used on lower-level materials if required by the topic (e.g. with a simpler paraphrasing or translation). Our results indicate that the English CEFRLex resource is in accordance with external resources that we jointly employ as gold standard. Together with other values obtained from monolingual and parallel corpora, we can indicate which entries need to be adjusted to obtain values that are even more in line with this gold standard. We expect that this finding also holds for the other languages


Immersive Language Exploration with Object Recognition and Augmented Reality

Benny Platte, Anett Platte, Christian Roschke, Rico Thomanek, Thony Rolletschke, Frank Zimmer and Marc Ritter

The use of Augmented Reality (AR) in teaching and learning contexts for language is still young. The ideas are endless, the concrete educational offers available emerge only gradually. Educational opportunities that were unthinkable a few years ago are now feasible. We present a concrete realization: an executable application for mobile devices with which users can explore their environment interactively in different languages. The software recognizes up to 1000 objects in the user’s environment using a deep learning method based on Convolutional Neural Networks and names this objects accordingly. Using Augmented Reality the objects are superimposed with 3D information in different languages. By switching the languages, the user is able to interactively discover his surrounding everyday items in all languages. The application is available as Open Source.


A Process-oriented Dataset of Revisions during Writing

Rianne Conijn, Emily Dux Speltz, Menno van Zaanen, Luuk Van Waes and Evgeny Chukharev-Hudilainen

Revision plays a major role in writing and the analysis of writing processes. Revisions can be analyzed using a product-oriented approach (focusing on a finished product, the text that has been produced) or a process-oriented approach (focusing on the process that the writer followed to generate this product). Although several language resources exist for the product-oriented approach to revisions, there are hardly any resources available yet for an in-depth analysis of the process of revisions. Therefore, we provide an extensive dataset on revisions made during writing (accessible via This dataset is based on keystroke data and eye tracking data of 65 students from a variety of backgrounds (undergraduate and graduate English as a first language and English as a second language students) and a variety of tasks (argumentative text and academic abstract). In total, 7,120 revisions were identified in the dataset. For each revision, 18 features have been manually annotated and 31 features have been automatically extracted. As a case study, we show two potential use cases of the dataset. In addition, future uses of the dataset are described.


Automated Writing Support Using Deep Linguistic Parsers

Luís Morgado da Costa, Roger V P Winder, Shu Yun Li, Benedict Christopher Lin Tzer Liang, Joseph Mackinnon and Francis Bond

This paper introduces a new web system that integrates English Grammatical Error Detection (GED) and course-specific stylistic guidelines to automatically review and provide feedback on student assignments. The system is being developed as a pedagogical tool for  English Scientific Writing.  It uses both general NLP methods and high precision parsers to check student assignments before they are submitted for grading. Instead of generalized error detection, our system aims to identify, with high precision, specific classes of problems that are known to be common among engineering students. Rather than correct the errors, our system generates constructive feedback to help students identify and correct them on their own.  A preliminary evaluation of the system’s in-class performance has shown measurable improvements in the quality of student assignments.


TLT-school: a Corpus of Non Native Children Speech

Roberto Gretter, Marco Matassoni, Stefano Bannò and Falavigna Daniele

This paper describes “TLT-school” a corpus of speech utterances collected in schools of northern Italy for assessing the performance of students learning both English and German. The corpus was recorded in the years 2017 and 2018 from students aged between nine and sixteen years, attending primary, middle and high school. All utterances have been scored, in terms of some predefined proficiency indicators, by human experts. In addition, most of utterances recorded in 2017 have been manually transcribed carefully. Guidelines and procedures used for manual transcriptions of utterances will be described in detail, as well as results achieved by means of an automatic speech recognition system developed by us. Part of the corpus is going to be freely distributed to scientific community particularly interested both in non-native speech recognition and automatic assessment of second language proficiency.


Toward a Paradigm Shift in Collection of Learner Corpora

Anisia Katinskaia, Sardana Ivanova and Roman Yangarber

We present the first version of the longitudinal Revita Learner Corpus (ReLCo), for Russian. In contrast to traditional learner corpora, ReLCo is collected and annotated fully automatically, while students perform exercises using the Revita language-learning platform. The corpus currently contains 8 422 sentences exhibiting several types of errors—grammatical, lexical, orthographic, etc.—which were committed by learners during practice and were automatically annotated by Revita. The corpus provides valuable information about patterns of learner errors and can be used as a language resource for a number of research tasks, while its creation is much cheaper and faster than for traditional learner corpora. A crucial advantage of ReLCo that it grows continually while learners practice with Revita, which opens the possibility of creating an unlimited learner resource with longitudinal data collected over time. We make the pilot version of the Russian ReLCo publicly available.


Quality Focused Approach to a Learner Corpus Development

Roberts Darģis, Ilze Auziņa, Kristīne Levāne-Petrova and Inga Kaija

The paper presents quality focused approach to a learner corpus development. The methodology was developed with multiple design considerations put in place to make the annotation process easier and at the same time reduce the amount of mistakes that could be introduced due to inconsistent text correction or carelessness. The approach suggested in this paper consists of multiple parts: comparison of digitized texts by several annotators, text correction, automated morphological analysis, and manual review of annotations. The described approach is used to create Latvian Language Learner corpus (LaVA) which is part of a currently ongoing project Development of Learner corpus of Latvian: methods, tools and applications.


An Exploratory Study into Automated Précis Grading

Orphee De Clercq and Senne Van Hoecke

Automated writing evaluation is a popular research field, but the main focus has been on evaluating argumentative essays. In this paper, we consider a different genre, namely précis texts. A précis is a written text that provides a coherent summary of main points of a spoken or written text. We present a corpus of English précis texts which all received a grade assigned by a highly-experienced English language teacher and were subsequently annotated following an exhaustive error typology. With this corpus we trained a machine learning model which relies on a number of linguistic, automatic summarization and AWE features. Our results reveal that this model is able to predict the grade of précis texts with only a moderate error margin.


Session Conversational Systems/Dialogue/Chatbots/Human-Robot Interaction

Back to Top

Adjusting Image Attributes of Localized Regions with Low-level Dialogue

Tzu-Hsiang Lin, Alexander Rudnicky, Trung Bui, Doo Soon Kim and Jean Oh

Natural Language Image Editing (NLIE) aims to use natural language instructions to edit images.  Since novices are inexperienced with image editing techniques, their instructions are often ambiguous and contain high-level abstractions which require complex editing steps. Motivated by this inexperience aspect, we aim to smooth the learning curve by teaching the novices to edit images using low-level command terminologies. Towards this end, we develop a task-oriented dialogue system to investigate low-level instructions for NLIE.  Our system grounds language on the level of edit operations, and suggests options for users to choose from. Though compelled to express in low-level terms, user evaluation shows that 25\% of users found our system easy-to-use, resonating with our motivation. Analysis shows that users generally adapt to utilizing the proposed low-level language interface. We also identified object segmentation as the key factor to user satisfaction. Our work demonstrates advantages of low-level, direct language-action mapping approach that can be applied to other problem domains beyond image editing such as audio editing or industrial design.


Alignment Annotation for Clinic Visit Dialogue to Clinical Note Sentence Language Generation

Wen-wai Yim, Meliha Yetisgen, Jenny Huang and Micah Grossman

For every patient’s visit to a clinician, a clinical note is generated documenting their medical conversation, including complaints discussed, treatments, and medical plans. Despite advances in natural language processing, automating clinical note generation from a clinic visit conversation is a largely unexplored area of research. Due to the idiosyncrasies of the task, traditional methods of corpus creation are not effective enough approaches for this problem. In this paper, we present an annotation methodology that is content- and technique- agnostic while associating note sentences to sets of dialogue sentences. The sets can further be grouped with higher order tags to mark sets with related information. This direct linkage from input to output decouples the annotation from specific language understanding or generation strategies. Here we provide data statistics and qualitative analysis describing the unique annotation challenges. Given enough annotated data, such a resource would support multiple modeling methods including information extraction with template language generation, information retrieval type language generation, or sequence to sequence modeling.


MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines

Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, Adarsh Kumar, Anuj Goyal, Peter Ku and Dilek Hakkani-Tur

MultiWOZ 2.0 (Budzianowski et al., 2018) is a recently released multi-domain dialogue dataset spanning 7 distinct domains and containing over 10,000 dialogues. Though immensely useful and one of the largest resources of its kind to-date, MultiWOZ 2.0 has a few shortcomings. Firstly, there are substantial noise in the dialogue state annotations and dialogue utterances which negatively impact the performance of state-tracking models. Secondly, follow-up work (Lee et al., 2019) has augmented the original dataset with user dialogue acts. This leads to multiple co-existent versions of the same dataset with minor modifications. In this work we tackle the aforementioned issues by introducing MultiWOZ 2.1. To fix the noisy state annotations, we use crowdsourced workers to re-annotate state and utterances based on the original utterances in the dataset. This correction process results in changes to over 32% of state annotations across 40% of the dialogue turns. In addition, we fix 146 dialogue utterances by canonicalizing slot values in the utterances to the values in the dataset ontology. To address the second problem, we combined the contributions of the follow-up works into MultiWOZ 2.1. Hence, our dataset also includes user dialogue acts as well as multiple slot descriptions per dialogue state slot. We then benchmark a number of state-of-the-art dialogue state tracking models on the MultiWOZ 2.1 dataset and show the joint state tracking performance on the corrected state annotations. We are publicly releasing MultiWOZ 2.1 to the community, hoping that this dataset resource will allow for more effective models across various dialogue subproblems to be built in the future.


A Comparison of Explicit and Implicit Proactive Dialogue Strategies for Conversational Recommendation

Matthias Kraus, Fabian Fischbach, Pascal Jansen and Wolfgang Minker

Recommendation systems aim at facilitating information retrieval for users by taking into account their preferences. Based on previous user behaviour, such a system suggests items or provides information that a user might like or find useful. Nonetheless, how to provide suggestions is still an open question. Depending on the way a recommendation is communicated influences the user’s perception of the system. This paper presents an empirical study on the effects of proactive dialogue strategies on user acceptance. Therefore, an explicit strategy based on user preferences provided directly by the user, and an implicit proactive strategy, using autonomously gathered information, are compared. The results show that proactive dialogue systems significantly affect the perception of human-computer interaction. Although no significant differences are found between implicit and explicit strategies, proactivity significantly influences the user experience compared to reactive system behaviour. The study contributes new insights to the human-agent interaction and the voice user interface design. Furthermore, we discover interesting tendencies that motivate futurework.


Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque

Arantxa Otegi, Aitor Agirre, Jon Ander Campos, Aitor Soroa and Eneko Agirre

Conversational Question Answering (CQA) systems meet user information needs by having conversations with them, where answers to the questions are retrieved from text. There exist a variety of datasets for English, with tens of thousands of training examples, and pre-trained language models have allowed to obtain impressive results. The goal of our research is to test the performance of CQA systems under low-resource conditions which are common for most non-English languages: small amounts of native annotations and other limitations linked to low resource languages, like lack of crowdworkers or smaller wikipedias. We focus on the Basque language, and present the first non-English CQA dataset and results. Our experiments show that it is possible to obtain good  results with low amounts of native data thanks to cross-lingual transfer, with quality comparable to those obtained for English. We also discovered that dialogue history models are not directly transferable to another language, calling for further research. The dataset is publicly available.


Construction and Analysis of a Multimodal Chat-talk Corpus for Dialog Systems Considering Interpersonal Closeness

Yoshihiro Yamazaki, Yuya Chiba, Takashi Nose and Akinori Ito

There are high expectations for multimodal dialog systems that can make natural small talk with facial expressions, gestures, and gaze actions as next-generation dialog-based systems. Two important roles of the chat-talk system are keeping the user engaged and establishing rapport. Many studies have conducted user evaluations of such systems, some of which reported that considering the relationship with the user is an effective way to improve the subjective evaluation. To facilitate research of such dialog systems, we are currently constructing a large-scale multimodal dialog corpus focusing on the relationship between speakers. In this paper, we describe the data collection and annotation process, and analysis of the corpus collected in the early stage of the project. This corpus contains 19,303 utterances (10 hours) from 19 pairs of participants. A dialog act tag is annotated to each utterance by two annotators. We compare the frequency and the transition probability of the tags between different closeness levels to help construct a dialog system for establishing a relationship with the user.


BLISS: An Agent for Collecting Spoken Dialogue Data about Health and Well-being

Jelte van Waterschoot, Iris Hendrickx, Arif Khan, Esther Klabbers, Marcel de Korte, Helmer Strik, Catia Cucchiarini and Mariët Theune

An important objective in health-technology is the ability to gather information about people’s well-being. Structured interviews can be used to obtain this information, but are time-consuming and not scalable. Questionnaires provide an alternative way to extract such information, though typically lack depth. In this paper, we present our first prototype of the BLISS agent, an artificial intelligent agent which intends to automatically discover what makes people happy and healthy. The goal of Behaviour-based Language-Interactive Speaking Systems (BLISS) is to understand the motivations behind people’s happiness by conducting a personalized spoken dialogue based on a happiness model. We built our first prototype of the model to collect 55 spoken dialogues, in which the BLISS agent asked questions to users about their happiness and well-being. Apart from a description of the BLISS architecture, we also provide details about our dataset, which contains over 120 activities and 100 motivations and is made available for usage.


The JDDC Corpus: A Large-Scale Multi-Turn Chinese Dialogue Dataset for E-commerce Customer Service

Meng Chen, Ruixue Liu, Lei Shen, Shaozu Yuan, Jingyan Zhou, Youzheng Wu, Xiaodong He and Bowen Zhou

Human conversations are complicated and building a human-like dialogue agent is an extremely challenging task. With the rapid development of deep learning techniques, data-driven models become more and more prevalent which need a huge amount of real conversation data. In this paper, we construct a large-scale real scenario Chinese E-commerce conversation corpus, JDDC, with more than 1 million multi-turn dialogues, 20 million utterances, and 150 million words. The dataset reflects several characteristics of human-human conversations, e.g., goal-driven, and long-term dependency among the context. It also covers various dialogue types including task-oriented, chitchat and question-answering. Extra intent information and three well-annotated challenge sets are also provided. Then, we evaluate several retrieval-based and generative models to provide basic benchmark performance on the JDDC corpus. And we hope JDDC can serve as an effective testbed and benefit the development of fundamental research in dialogue task.


“Cheese!”: a Corpus of Face-to-face French Interactions. A Case Study for Analyzing Smiling and Conversational Humor

Béatrice Priego-Valverde, Brigitte Bigi and Mary Amoyal

Cheese! is a conversational corpus. It consists of 11 French face-to-face conversations lasting around 15 minutes each. Cheese! is a duplication of an American corpus (ref) in order to conduct a cross-cultural comparison of participants’ smiling behavior in humorous and non-humorous sequences in American English and French conversations. In this article, the methodology used to collect and enrich the corpus is presented: experimental protocol, technical choices, transcription, semi-automatic annotations, manual annotations of smiling and humor. An exploratory study investigating the links between smile and humor is then proposed. Based on the analysis of two interactions, two questions are asked: (1) Does smile frame humor? (2) Does smile has an impact on its success or failure?

If the experimental design of Cheese! has been elaborated to study specifically smiles and humor in conversations, the high quality of the dataset obtained, and the methodology used are also replicable and can be applied to analyze many other conversational activities and other multimodal modalities.


The Margarita Dialogue Corpus: A Data Set for Time-Offset Interactions and Unstructured Dialogue Systems

Alberto Chierici, Nizar Habash and Margarita Bicec

Time-Offset Interaction Applications (TOIAs) are systems that simulate face-to-face conversations between humans and digital human avatars recorded in the past. Developing a well-functioning TOIA involves several research areas: artificial intelligence, human-computer interaction, natural language processing, question answering, and dialogue systems. The first challenges are to define a sensible methodology for data collection and to create useful data sets for training the system to retrieve the best answer to a user’s question. In this paper, we present three main contributions: a methodology for creating the knowledge base for a TOIA, a dialogue corpus, and baselines for single-turn answer retrieval. We develop the methodology using a two-step strategy. First, we let the avatar maker list pairs by intuition, guessing what possible questions a user may ask to the avatar. Second, we record actual dialogues between random individuals and the avatar-maker. We make the Margarita Dialogue Corpus available to the research community. This corpus comprises the knowledge base in text format, the video clips for each answer, and the annotated dialogues.


How Users React to Proactive Voice Assistant Behavior While Driving

Maria Schmidt, Wolfgang Minker and Steffen Werner

Nowadays Personal Assistants (PAs) are available in multiple environments and become increasingly popular to use via voice. Therefore, we aim to provide proactive PA suggestions to car drivers via speech. These suggestions should be neither obtrusive nor increase the drivers’ cognitive load, while enhancing user experience. To assess these factors, we conducted a usability study in which 42 participants perceive proactive voice output in a Wizard-of-Oz study in a driving simulator. Traffic density was varied during a highway drive and it included six in-car-specific use cases. The latter were presented by a proactive voice assistant and in a non-proactive control condition. We assessed the users’ subjective cognitive load and their satisfaction in different questionnaires during the interaction with both PA variants. Furthermore, we analyze the user reactions: both regarding their content and the elapsed response times to PA actions. The results show that proactive assistant behavior is rated similarly positive as non-proactive behavior. Furthermore, the participants agreed to 73.8% of proactive suggestions. In line with previous research, driving-relevant use cases receive the best ratings, here we reach 82.5% acceptance. Finally, the users reacted significantly faster to proactive PA actions, which we interpret as less cognitive load compared to non-proactive behavior.


Emotional Speech Corpus for Persuasive Dialogue System

Sara Asai, Koichiro Yoshino, Seitaro Shinagawa, Sakriani Sakti and Satoshi Nakamura

Expressing emotion is known as an efficient way to persuade one’s dialogue partner to accept one’s claim or proposal. Emotional expression in speech can express the speaker’s emotion more directly than using only emotion expression in the text, which will lead to a more persuasive dialogue. In this paper, we built a speech dialogue corpus in a persuasive scenario that uses emotional expressions to build a persuasive dialogue system with emotional expressions. We extended an existing text dialogue corpus by adding variations of emotional responses to cover different combinations of broad dialogue context and a variety of emotional states by crowd-sourcing. Then, we recorded emotional speech consisting of of collected emotional expressions spoken by a voice actor. The experimental results indicate that the collected emotional expressions with their speeches have higher emotional expressiveness for expressing the system’s emotion to users.


Multimodal Analysis of Cohesion in Multi-party Interactions

Reshmashree Bangalore Kantharaju, Caroline Langlet, Mukesh Barange, Chloé Clavel and Catherine Pelachaud

Group cohesion is an emergent phenomenon that describes the tendency of the group members’ shared commitment to group tasks and the interpersonal attraction among them. This paper presents a multimodal analysis of group cohesion using a corpus of multi-party interactions. We utilize 16 two-minute segments annotated with cohesion from the AMI corpus. We define three layers of modalities: non-verbal social cues, dialogue acts and interruptions. The initial analysis is performed at the individual level and later, we combine the different modalities to observe their impact on perceived level of cohesion. Results indicate that occurrence of laughter and interruption are higher in high cohesive segments. We also observe that, dialogue acts and head nods did not have an impact on the level of cohesion by itself. However, when combined there was an impact on the perceived level of cohesion. Overall, the analysis shows that multimodal cues are crucial for accurate analysis of group cohesion.


Treating Dialogue Quality Evaluation as an Anomaly Detection Problem

Rostislav Nedelchev, Ricardo Usbeck and Jens Lehmann

Dialogue systems for interaction with humans have been enjoying increased popularity in the research and industry fields. To this day, the best way to estimate their success is through means of human evaluation and not automated approaches, despite the abundance of work done in the field. In this paper, we investigate the effectiveness of perceiving dialogue evaluation as an anomaly detection task. The paper looks into four dialogue modeling approaches and how their objective functions correlate with human annotation scores. A high-level perspective exhibits negative results. However, a more in-depth look shows some potential for using anomaly detection for evaluating dialogues.


Evaluation of Argument Search Approaches in the Context of Argumentative Dialogue Systems

Niklas Rach, Yuki Matsuda, Johannes Daxenberger, Stefan Ultes, Keiichi Yasumoto and Wolfgang Minker

We present an approach to evaluate argument search techniques in view of their use in argumentative dialogue systems by assessing quality aspects of the retrieved arguments. To this end, we introduce a dialogue system that presents arguments by means of a virtual avatar and synthetic speech to users and allows them to rate the presented content in four different categories (Interesting, Convincing, Comprehensible, Relation). The approach is applied in a user study in order to compare two state of the art argument search engines to each other and with a system based on traditional web search. The results show a significant advantage of the two search engines over the baseline. Moreover, the two search engines show significant advantages over each other in different categories, thereby reflecting strengths and weaknesses of the different underlying techniques.


PATE: A Corpus of Temporal Expressions for the In-car Voice Assistant Domain

Alessandra Zarcone, Touhidul Alam and Zahra Kolagar

The recognition and automatic annotation of temporal expressions (e.g. “Add an event for tomorrow evening at eight to my calendar”) is a key module for AI voice assistants, in order to allow them to interact with apps (for example, a calendar app). However, in the NLP literature, research on temporal expressions has focused mostly on data from the news, from the clinical domain, and from social media. The voice assistant domain is very different than the typical domains that have been the focus of work on temporal expression identification, thus requiring a dedicated data collection. We present a crowdsourcing method for eliciting natural-language commands containing temporal expressions for an AI voice assistant, by using pictures and scenario descriptions. We annotated the elicited commands (480) as well as the commands in the Snips dataset following the TimeML/TIMEX3 annotation guidelines, reaching a total of 1188 annotated commands. The commands can be later used to train the NLU components of an AI voice assistant.


Mapping the Dialog Act Annotations of the LEGO Corpus into ISO 24617-2 Communicative Functions

Eugénio Ribeiro, Ricardo Ribeiro and David Martins de Matos

ISO 24617-2, the ISO standard for dialog act annotation, sets the ground for more comparable research in the area. However, the amount of data annotated according to it is still reduced, which impairs the development of approaches for automatic recognition. In this paper, we describe a mapping of the original dialog act labels of the LEGO corpus, which have been neglected, into the communicative functions of the standard. Although this does not lead to a complete annotation according to the standard, the 347 dialogs provide a relevant amount of data that can be used in the development of automatic communicative function recognition approaches, which may lead to a wider adoption of the standard. Using the 17 English dialogs of the DialogBank as gold standard, our preliminary experiments have shown that including the mapped dialogs during the training phase leads to improved performance while recognizing communicative functions in the Task dimension.


Estimating User Communication Styles for Spoken Dialogue Systems

Juliana Miehle, Isabel Feustel, Julia Hornauer, Wolfgang Minker and Stefan Ultes

We present a neural network approach to estimate the communication style of spoken interaction, namely the stylistic variations elaborateness and directness, and investigate which type of input features to the estimator are necessary to achive good performance. First, we describe our annotated corpus of recordings in the health care domain and analyse the corpus statistics in terms of agreement, correlation and reliability of the ratings. We use this corpus to estimate the elaborateness and the directness of each utterance. We test different feature sets consisting of dialogue act features, grammatical features and linguistic features as input for our classifier and perform classification in two and three classes. Our classifiers use only features that can be automatically derived during an ongoing interaction in any spoken dialogue system without any prior annotation. Our results show that the elaborateness can be classified by only using the dialogue act and the amount of words contained in the corresponding utterance. The directness is a more difficult classification task and additional linguistic features in form of word embeddings improve the classification results. Afterwards, we run a comparison with a support vector machine and a recurrent neural network classifier.


The ISO Standard for Dialogue Act Annotation, Second Edition

Harry Bunt, Volha Petukhova, Emer Gilmartin, Catherine Pelachaud, Alex Fang, Simon Keizer and Laurent Prévot

ISO standard 24617-2 for dialogue act annotation, established in 2012, has in the past few years been used both in corpus annotation and in the design of components for spoken and multimodal dialogue systems. This has brought some inaccuracies and undesirbale limitations of the standard to light, which are addressed in a proposed second edition. This second edition allows a more accurate annotation of dependence relations and rhetorical relations in dialogue. Following the ISO 24617-4 principles of semantic annotation, and borrowing ideas from EmotionML, a triple-layered plug-in mechanism is introduced which allows dialogue act descriptions to be enriched with information about their semantic content, about accompanying emotions, and other information, and allows the annotation scheme to be customised by adding application-specific dialogue act types.


The AICO Multimodal Corpus – Data Collection and Preliminary Analyses

Kristiina Jokinen

This paper describes data collection and the first explorative research on the AICO Multimodal Corpus. The corpus contains eye-gaze, Kinect, and video recordings of human-robot and human-human interactions, and was collected to study cooperation, engagement and attention of human participants in task-based as well as in chatty type interactive situations. In particular, the goal was to enable comparison between human-human and human-robot interactions, besides studying multimodal behaviour and attention in the different dialogue activities. The robot partner was a humanoid Nao robot, and it was expected that its agent-like behaviour would render humanrobot interactions similar to human-human interaction but also high-light important differences due to the robot’s limited conversational capabilities. The paper reports on the preliminary studies on the corpus, concerning the participants’ eye-gaze and gesturing behaviours,which were chosen as objective measures to study differences in their multimodal behaviour patterns with a human and a robot partner.


A Corpus of Controlled Opinionated and Knowledgeable Movie Discussions for Training Neural Conversation Models

Fabian Galetzka, Chukwuemeka Uchenna Eneh and David Schlangen

Fully data driven Chatbots for non-goal oriented dialogues are known to suffer from inconsistent behaviour across their turns, stemming from a general difficulty in controlling parameters like their assumed background personality and knowledge of facts. One reason for this is the relative lack of labeled data from which personality consistency and fact usage could be learned together with dialogue behaviour. To address this, we introduce a new labeled dialogue dataset in the domain of movie discussions, where every dialogue is based on pre-specified facts and opinions. We thoroughly validate the collected dialogue for adherence of the participants to their given fact and opinion profile, and find that the general quality in this respect is high. This process also gives us an additional layer of annotation that is potentially useful for training models. We introduce as a baseline an end-to-end trained self-attention decoder model trained on this data and show that it is able to generate opinionated responses that are judged to be natural and knowledgeable and show attentiveness.


A French Medical Conversations Corpus Annotated for a Virtual Patient Dialogue System

Fréjus A. A. Laleye, Gaël de Chalendar, Antonia Blanié, Antoine Brouquet and Dan Behnamou

Data-driven  approaches  for  creating  virtual  patient  dialogue  systems  require  the  availability  of  large  data  specific to  the  language,domain and clinical cases studied.  Based on the lack of dialogue corpora in French for medical education, we propose an annotatedcorpus of dialogues including medical consultation interactions between doctor and patient. In this work, we detail the building processof the proposed dialogue corpus, describe the annotation guidelines and also present the statistics of its contents.  We then conducted aquestion categorization task to evaluate the benefits of the proposed corpus that is made publicly available.


Getting To Know You: User Attribute Extraction from Dialogues

Chien-Sheng Wu, Andrea Madotto, Zhaojiang Lin, Peng Xu and Pascale Fung

User attributes provide rich and useful information for user understanding, yet structured and easy-to-use attributes are often sparsely populated. In this paper, we leverage dialogues with conversational agents, which contain strong suggestions of user information, to automatically extract user attributes. Since no existing dataset is available for this purpose, we apply distant supervision to train our proposed two-stage attribute extractor, which surpasses several retrieval and generation baselines on human evaluation. Meanwhile, we discuss potential applications (e.g., personalized recommendation and dialogue systems) of such extracted user attributes, and point out current limitations to cast light on future work.


Augmenting Small Data to Classify Contextualized Dialogue Acts for Exploratory Visualization

Abhinav Kumar, Barbara Di Eugenio, Jillian Aurisano and Andrew Johnson

Our goal is to develop an intelligent assistant to support users explore data via visualizations. We have collected a new  corpus  of conversations, CHICAGO-CRIME-VIS, geared towards supporting data visualization exploration, and  we have annotated it for a variety of features, including contextualized dialogue acts. In this paper, we  describe  our   strategies and their evaluation for dialogue act classification. We highlight how thinking aloud affects interpretation of dialogue acts in our setting  and how to best capture that information. A key component of our strategy is data augmentation as applied  to the training data, since our corpus is inherently small. We ran experiments with the Balanced Bagging Classifier (BAGC), Condiontal Random Field (CRF), and several Long Short Term Memory (LSTM) networks, and found that all of them improved compared to the baseline (e.g., without the data augmentation pipeline). CRF outperformed the other classification algorithms, with the LSTM networks showing modest improvement, even after obtaining a performance boost from domain-trained word embeddings. This result is of note because  training a CRF is far less resource-intensive than training deep learning models, hence given  a similar if not better performance, traditional methods may still be preferable in order to lower resource consumption.


RDG-Map: A Multimodal Corpus of Pedagogical Human-Agent Spoken Interactions.

Maike Paetzel, Deepthi Karkada and Ramesh Manuvinakurike

This paper presents a multimodal corpus of 209 spoken game dialogues between a human and a remote-controlled artificial agent. The interactions involve people collaborating with the agent to identify countries on the world map as quickly as possible, which allows studying rapid and spontaneous dialogue with complex anaphoras, disfluent utterances and incorrect descriptions. The corpus consists of two parts: 8 hours of game interactions have been collected with a virtual unembodied agent online and 26.8 hours have been recorded with a physically embodied robot in a research lab. In addition to spoken audio recordings available for both parts, camera recordings and skeleton-, facial expression- and eye-gaze tracking data have been collected for the lab-based part of the corpus.  this paper, we introduce the pedagogical reference resolution game (RDG-Map) and the characteristics of the corpus collected. We also present an annotation scheme we developed in order to study the dialogue strategies utilized by the players. Based on a subset of 330 minutes of interactions annotated so far, we discuss initial insights into these strategies as well as the potential of the corpus for future research.


MPDD: A Multi-Party Dialogue Dataset for Analysis of Emotions and Interpersonal Relationships

Yi-Ting Chen, Hen-Hsen Huang and Hsin-Hsi Chen

A dialogue dataset is an indispensable resource for building a dialogue system. Additional information like emotions and interpersonal relationships labeled on conversations enables the system to capture the emotion flow of the participants in the dialogue. However, there is no publicly available Chinese dialogue dataset with emotion and relation labels. In this paper, we collect the conversions from TV series scripts, and annotate emotion and interpersonal relationship labels on each utterance. This dataset contains 25,548 utterances from 4,142 dialogues. We also set up some experiments to observe the effects of the responded utterance on the current utterance, and the correlation between emotion and relation types in emotion and relation classification tasks.


 “Alexa in the wild” – Collecting Unconstrained Conversations with a Modern Voice Assistant in a Public Environment

Ingo Siegert

Datasets featuring modern voice assistants such as Alexa, Siri, Cortana and others allow an easy study of human-machine interactions. But data collections offering an unconstrained, unscripted public interaction are quite rare. Many studies so far have focused on private usage, short pre-defined task or specific domains. This contribution presents a dataset providing a large amount of unconstrained public interactions with a voice assistant. Up to now around 40 hours of device directed utterances were collected during a science exhibition touring through Germany. The data recording was part of an exhibit that engages visitors to interact with a commercial voice assistant system (Amazon’s ALEXA), but did not restrict them to a specific topic. A specifically developed quiz was starting point of the conversation, as the voice assistant was presented to the visitors as a possible joker for the quiz. But the visitors were not forced to solve the quiz with the help of the voice assistant and thus many visitors had an open conversation. The provided dataset – Voice Assistant Conversations in the wild (VACW) – includes the transcripts of both visitors requests and Alexa answers, identified topics and sessions as well as acoustic characteristics automatically extractable from the visitors’ audio files.


EDA: Enriching Emotional Dialogue Acts using an Ensemble of Neural Annotators

Chandrakant Bothe, Cornelius Weber, Sven Magg and Stefan Wermter

The recognition of emotion and dialogue acts enriches conversational analysis and help to build natural dialogue systems. Emotion interpretation makes us understand feelings and dialogue acts reflect the intentions and performative functions in the utterances. However, most of the textual and multi-modal conversational emotion corpora contain only emotion labels but not dialogue acts. To address this problem, we propose to use a pool of various recurrent neural models trained on a dialogue act corpus, with and without context. These neural models annotate the emotion corpora with dialogue act labels, and an ensemble annotator extracts the final dialogue act label. We annotated two accessible multi-modal emotion corpora: IEMOCAP and MELD. We analyzed the co-occurrence of emotion and dialogue act labels and discovered specific relations. For example, Accept/Agree dialogue acts often occur with the Joy emotion, Apology with Sadness, and Thanking with Joy. We make the Emotional Dialogue Acts (EDA) corpus publicly available to the research community for further study and analysis.


PACO: a Corpus to Analyze the Impact of Common Ground in Spontaneous Face-to-Face Interaction

Mary Amoyal, Béatrice Priego-Valverde and Stephane Rauzy

PAC0 is a French audio-video conversational corpus made of 15 face-to-face dyadic interactions, lasting around 20 min each. This compared corpus has been created in order to explore the impact of the lack of personal common ground (Clark, 1996) on participants collaboration during conversation and specifically on their smile during topic transitions. We have constituted this conversational corpus ” PACO” by replicating the experimental protocol of “Cheese!” (Priego-valverde & al.,2018). The only difference that distinguishes these two corpora is the degree of CG of the interlocutors: in Cheese! interlocutors are friends, while in PACO they do not know each other. This experimental protocol allows to analyze how the participants are getting acquainted. This study brings two main contributions. First, the PACO conversational corpus enables to compare the impact of the interlocutors’ common ground. Second, the semi-automatic smile annotation protocol allows to obtain reliable and reproducible smile annotations while reducing the annotation time by a factor 10.


Dialogue Act Annotation in a Multimodal Corpus of First Encounter Dialogues

Costanza Navarretta and Patrizia Paggio

This paper deals with the annotation of dialogue acts in a multimodal corpus of first encounter dialogues, i.e. face-to- face dialogues in which two people who meet for the first time talk with no particular purpose other than just talking. More specifically, we describe the method used to annotate dialogue acts in the corpus, including the evaluation of the annotations. Then, we present descriptive statistics of the annotation, particularly focusing on which dialogue acts often follow each other across speakers and which dialogue acts overlap with gestural behaviour. Finally, we discuss how feedback is expressed in the corpus by means of feedback dialogue acts with or without co-occurring gestural behaviour, i.e. multimodal vs. unimodal feedback.


A Conversation-Analytic Annotation of Turn-Taking Behavior in Japanese Multi-Party Conversation and its Preliminary Analysis

Mika Enomoto, Yasuharu Den and Yuichi Ishimoto

In this study, we propose a conversation-analytic annotation scheme for turn-taking behavior in multi-party conversations. The annotation scheme is motivated by a proposal of a proper model of turn-taking incorporating various ideas developed in the literature of conversation analysis. Our annotation consists of two sets of tags: the beginning and the ending type of the utterance. Focusing on the ending-type tags, in some cases combined with the beginning-type tags, we emphasize the importance of the distinction among four selection types: i) selecting other participant as next speaker, ii) not selecting next speaker but followed by a switch of the speakership, iii) not selecting next speaker and followed by a continuation of the speakership, and iv)being inside a multi-unit turn. Based on the annotation of Japanese multi-party conversations, we analyze how syntactic and prosodic features of utterances vary across the four selection types. The results show that the above four-way distinction is essential to account for the distributions of the syntactic and prosodic features, suggesting the insufficiency of previous turn-taking models that do not consider the distinction between i) and ii) or between ii) or iii).


Understanding User Utterances in a Dialog System for Caregiving

Yoshihiko Asao, Julien Kloetzer, Junta Mizuno, Dai Saiki, Kazuma Kadowaki and Kentaro Torisawa

A dialog system that can monitor the health status of seniors has a huge potential for solving the labor force shortage in the caregiving industry in aging societies. As a part of efforts to create such a system, we are developing two modules that are aimed to correctly interpret user utterances: (i) a yes/no response classifier, which categorizes responses to health-related yes/no questions that the system asks; and (ii) an entailment recognizer, which detects users’ voluntary mentions about their health status. To apply machine learning approaches to the development of the modules, we created large annotated datasets of 280,467 question-response pairs and 38,868 voluntary utterances. For question-response pairs, we asked annotators to avoid direct “yes” or “no” answers, so that our data could cover a wide range of possible natural language responses. The two modules were implemented by fine-tuning a BERT model, which is a recent successful neural network model. For the yes/no response classifier, the macro-average of the average precisions (APs) over all of our four categories (Yes/No/Unknown/Other) was 82.6% (96.3% for “yes” responses and 91.8% for “no” responses), while for the entailment recognizer it was 89.9%.


Designing Multilingual Interactive Agents using Small Dialogue Corpora

Donghui Lin, Masayuki Otani, Ryosuke Okuno and Toru Ishida

Interactive dialogue agents like smart speakers have become more and more popular in recent years. These agents are being developed on machine learning technologies that use huge amounts of language resources. However, many entities in specialized fields are struggling to develop their own interactive agents due to a lack of language resources such as dialogue corpora, especially when the end users need interactive agents that offer multilingual support. Therefore, we aim at providing a general design framework for multilingual interactive agents in specialized domains that, it is assumed, have small or non-existent dialogue corpora. To achieve our goal, we first integrate and customize external language services for supporting multilingual functions of interactive agents. Then, we realize context-aware dialogue generation under the situation of small corpora. Third, we develop a gradual design process for acquiring dialogue corpora and improving the interactive agents. We implement a multilingual interactive agent in the field of healthcare and conduct experiments to illustrate the effectiveness of the implemented agent.


Multimodal Corpus of Bidirectional Conversation of Human-human and Human-robot Interaction during fMRI Scanning

Birgit Rauchbauer, Youssef Hmamouche, Brigitte Bigi, Laurent Prévot, Magalie Ochs and Thierry Chaminade

In this paper we present investigation of real-life, bi-directional conversations. We introduce the multimodal corpus derived from these natural conversations alternating between human-human and human-robot interactions. The human-robot interactions were used as a control condition for the social nature of the human-human conversations. The experimental set up consisted of conversations between the participant in a functional magnetic resonance imaging (fMRI) scanner and a human confederate or conversational robot outside the scanner room, connected via bidirectional audio and unidirectional videoconferencing (from the outside to inside the scanner). A cover story provided a framework for natural, real-life conversations about images of an advertisement campaign. During the conversations we collected a multimodal corpus for a comprehensive characterization of bi-directional conversations. In this paper we introduce this multimodal corpus which includes neural data from functional magnetic resonance imaging (fMRI), physiological data (blood flow pulse and respiration), transcribed conversational data, as well as face and eye-tracking recordings. Thus, we present a unique corpus to study human conversations including neural, physiological and behavioral data.


The Brain-IHM Dataset: a New Resource for Studying the Brain Basis of Human-Human and Human-Machine Conversations

Magalie Ochs, Roxane Bertrand, Aurélie Goujon, Deirdre Bolger, Anne-Sophie Dubarry and Philippe Blache

This paper presents an original dataset of controlled interactions, focusing on the study of feedback items. It consists on recordings of different conversations between a doctor and a patient, played by actors. In this corpus, the patient is mainly a listener and produces different feedbacks, some of them being (voluntary) incongruent. Moreover, these conversations have been re-synthesized in a virtual reality context, in which the patient is played by an artificial agent. The final corpus is made of different movies of human-human conversations plus the same conversations replayed in a human-machine context, resulting in the first human-human/human-machine parallel corpus. The corpus is then enriched with different multimodal annotations at the verbal and non-verbal levels. Moreover, and this is the first dataset of this type, we have designed an experiment during which different participants had to watch the movies and give an evaluation of the interaction. During this task, we  recorded participant’s brain signal. The Brain-IHM dataset is then conceived with a triple purpose: 1/ studying feedbacks by comparing congruent vs. incongruent feedbacks 2/ comparing human-human and human-machine production of feedbacks 3/ studying the brain basis of feedback perception.


Dialogue-AMR: Abstract Meaning Representation for Dialogue

Claire Bonial, Lucia Donatelli, Mitchell Abrams, Stephanie M. Lukin, Stephen Tratz, Matthew Marge, Ron Artstein, David Traum and Clare Voss

This paper describes a schema that enriches Abstract Meaning Representation (AMR) in order to provide a semantic representation for facilitating Natural Language Understanding (NLU) in dialogue systems. AMR offers a valuable level of abstraction of the propositional content of an utterance; however, it does not capture the illocutionary force or speaker’s intended contribution in the broader dialogue context (e.g., make a request or ask a question), nor does it capture tense or aspect. We explore dialogue in the domain of human-robot interaction, where a conversational robot is engaged in search and navigation tasks with a human partner. To address the limitations of standard AMR, we develop an inventory of speech acts suitable for our domain, and present “Dialogue-AMR”, an enhanced AMR that represents not only the content of an utterance, but the illocutionary force behind it, as well as tense and aspect. To showcase the coverage of the schema, we use both manual and automatic methods to construct the “DialAMR” corpus—a corpus of human-robot dialogue annotated with standard AMR and our enriched Dialogue-AMR schema. Our automated methods can be used to incorporate AMR into a larger NLU pipeline supporting human-robot dialogue.



Relation between Degree of Empathy for Narrative Speech and Type of Responsive Utterance in Attentive Listening

Koichiro Ito, Masaki Murata, Tomohiro Ohno and Shigeki Matsubara

Nowadays, spoken dialogue agents such as communication robots and smart speakers listen to narratives of humans. In order for such an agent to be recognized as a listener of narratives and convey the attitude of attentive listening, it is necessary to generate responsive utterances. Moreover, responsive utterances can express empathy to narratives and showing an appropriate degree of empathy to narratives is significant for enhancing speaker’s motivation. The degree of empathy shown by responsive utterances is thought to depend on their type. However, the relation between responsive utterances and degrees of the empathy has not been explored yet. This paper describes the classification of responsive utterances based on the degree of empathy in order to explain that relation. In this research, responsive utterances are classified into five levels based on the effect of utterances and literature on attentive listening. Quantitative evaluations using 37,995 responsive utterances showed the appropriateness of the proposed classification.


Intent Recognition in Doctor-Patient Interviews

Robin Rojowiec, Benjamin Roth and Maximilian Fink

Learning to interview patients to find out their disease is an essential part of the training of medical students. The practical part of this training has traditionally relied on paid actors that play the role of a patient to be interviewed. This process is expensive and severely limits the amount of practice per student. In this work, we present a novel data set and methods based on Natural Language Processing, for making progress towards modern applications and e-learning tools that support this training by providing language-based user interfaces with virtual patients. A data set of german transcriptions from live doctor-patient interviews was collected. These transcriptions are based on audio recordings of exercise sessions within the university and only the doctor’s utterances could be transcribed. We annotated each utterance with an intent inventory characterizing the purpose of the question or statement. For some intent classes, the data only contains a few samples, and we apply Information Retrieval and Deep Learning methods that are robust with respect to small amounts of training data for recognizing the intent of an utterance and providing the correct response. Our results show that the models are effective and they provide baseline performance scores on the data set for further research.


BrainPredict: a Tool for Predicting and Visualising Local Brain Activity

Youssef Hmamouche, Laurent Prévot, Magalie Ochs and Thierry Chaminade

In this paper, we present a tool allowing dynamic prediction and visualization of an individual’s local brain activity during a conversation. The prediction module of this tool is based on classifiers trained using a corpus of human-human and human-robot conversations including fMRI recordings. More precisely, the module takes as input behavioral features computed from raw data, mainly the participant and the interlocutor speech but also the participant’s visual input and eye movements. The visualisation module shows in real-time the dynamics of brain active areas synchronised with the behavioral raw data. In addition, it shows which integrated behavioral features are used to predict the activity in individual brain areas.


MTSI-BERT: A Session-aware Knowledge-based Conversational Agent

Matteo Antonio Senese, Giuseppe Rizzo, Mauro Dragoni and Maurizio Morisio

In the last years, the state of the art of NLP research has made a huge step forward.  Since the release of ELMo (Peters et al., 2018), a new race for the leading scoreboards of all the main linguistic tasks has begun.  Several models have been published achieving promising results in all the major NLP applications, from question answering to text classification, passing through named entity recognition. These great research discoveries coincide with an increasing trend for voice-based technologies in the customer care market. One of the next biggest challenges in this scenario will be the handling of multi-turn conversations, a type of conversations that differs from single-turn by the presence of multiple related interactions.  The proposed work is an attempt to exploit one of these new milestones to handle multi-turn conversations. MTSI-BERT is a BERT-based model achieving promising results in intent classification, knowledge base action prediction and end of dialogue session detection, to determine the right moment to fulfill the user request. The study about the realization of PuffBot, an intelligent chatbot to support and monitor people suffering from asthma, shows how this type of technique could be an important piece in the development of future chatbots.


Predicting Ratings of Real Dialogue Participants from Artificial Data and Ratings of Human Dialogue Observers

Kallirroi Georgila, Carla Gordon, Volodymyr Yanov and David Traum

We collected a corpus of dialogues in a Wizard of Oz (WOz) setting in the Internet of Things (IoT) domain. We asked users participating in these dialogues to rate the system on a number of aspects, namely, intelligence, naturalness, personality, friendliness, their enjoyment, overall quality, and whether they would recommend the system to others. Then we asked dialogue observers, i.e., Amazon Mechanical Turkers (MTurkers), to rate these dialogues on the same aspects. We also generated simulated dialogues between dialogue policies and simulated users and asked MTurkers to rate them again on the same aspects. Using linear regression, we developed dialogue evaluation functions based on features from the simulated dialogues and the MTurkers’ ratings, the WOz dialogues and the MTurkers’ ratings, and the WOz dialogues and the WOz participants’ ratings. We applied all these dialogue evaluation functions to a held-out portion of our WOz dialogues, and we report results on the predictive power of these different types of dialogue evaluation functions. Our results suggest that for three conversational aspects (intelligence, naturalness, overall quality) just training evaluation functions on simulated data could be sufficient.

  • LREC 2020 Paper Dissemination (1/10)
  • LREC 2020 Paper Dissemination (2/10)
  • LREC 2020 Paper Dissemination (3/10)
  • LREC 2020 Paper Dissemination (4/10)
  • LREC 2020 Paper Dissemination (5/10)
  • LREC 2020 Paper Dissemination (6/10)
  • LREC 2020 Paper Dissemination (7/10)
  • LREC 2020 Paper Dissemination (8/10)
  • LREC 2020 Paper Dissemination (9/10)
  • LREC 2020 Paper Dissemination (10/10)