LREC 2020 Paper Dissemination (10/10)

LREC 2020 was not held in Marseille this year and only the Proceedings were published.

The ELRA Board and the LREC 2020 Programme Committee now feel that those papers should be disseminated again, in a theme-oriented way, shedding light on specific “topics/sessions”.

Packages with several sessions will be disseminated every Tuesday for 10 weeks, from Nov 10, 2020 until the end of January 2021.

Each session lists the title and authors of every paper, with the corresponding abstract (for ease of reading) and URL, in the same manner as the Book of Abstracts we used to print and distribute at LRECs.

We hope that you discover interesting, even exciting, work that may be useful for your own research.

Group of papers sent on January 26, 2021



Abstractive Document Summarization without Parallel Data

Nikola I. Nikolov and Richard Hahnloser

Abstractive summarization typically relies on large collections of paired articles and summaries. However, in many cases, parallel data is scarce and costly to obtain. We develop an abstractive summarization system that relies only on large collections of example summaries and non-matching articles. Our approach consists of an unsupervised sentence extractor that selects salient sentences to include in the final summary, as well as a sentence abstractor that is trained on pseudo-parallel and synthetic data, that paraphrases each of the extracted sentences. We perform an extensive evaluation of our method: on the CNN/DailyMail benchmark, on which we compare our approach to fully supervised baselines, as well as on the novel task of automatically generating a press release from a scientific journal article, which is well suited for our system. We show promising performance on both tasks, without relying on any article-summary pairs.


GameWikiSum: a Novel Large Multi-Document Summarization Dataset

Diego Antognini and Boi Faltings

Today’s research progress in the field of multi-document summarization is obstructed by the small number of available datasets. Since the acquisition of reference summaries is costly, existing datasets contain only hundreds of samples at most, resulting in heavy reliance on hand-crafted features or necessitating additional, manually annotated data. The lack of large corpora therefore hinders the development of sophisticated models. Additionally, most publicly available multi-document summarization corpora are in the news domain, and no analogous dataset exists in the video game domain. In this paper, we propose GameWikiSum, a new domain-specific dataset for multi-document summarization, which is one hundred times larger than commonly used datasets, and in another domain than news. Input documents consist of long professional video game reviews as well as references of their gameplay sections in Wikipedia pages. We analyze the proposed dataset and show that both abstractive and extractive models can be trained on it. We release GameWikiSum for further research:


Summarization Corpora of Wikipedia Articles

Dominik Frefel

In this paper we propose a process to extract summarization corpora from Wikipedia articles. Applied to the German language we create a corpus of 240,000 texts. We use ROUGE scores for the extraction and evaluation of our corpus. For this we provide a ROUGE metric implementation adapted to the German language. The extracted corpus is used to train three abstractive summarization models which we compare to different baselines. The resulting summaries sound natural and cover the input text very well.

The corpus can be downloaded at


Language Agnostic Automatic Summarization Evaluation

Christopher Tauchmann and Margot Mieskes

So far work on automatic summarization has dealt primarily with English data. Accordingly, evaluation methods were primarily developed with this language in mind. In our work, we present experiments of adapting available evaluation methods such as ROUGE and PYRAMID to non-English data. We base our experiments on various English and non-English homogeneous benchmark data sets as well as a non-English heterogeneous data set. Our results indicate that ROUGE can indeed be adapted to non-English data — both homogeneous and heterogeneous. Using a recent implementation of performing an automatic PYRAMID evaluation, we also show its adaptability to non-English data.


Two Huge Title and Keyword Generation Corpora of Research Articles

Erion Çano and Ondřej Bojar

Recent developments in sequence-to-sequence learning with neural networks have considerably improved the quality of automatically generated text summaries and document keywords, stipulating the need for even bigger training corpora. Metadata of research articles are usually easy to find online and can be used to perform research on various tasks. In this paper, we introduce two huge datasets for text summarization (OAGSX) and keyword generation (OAGKX) research, containing 34 million and 23 million records, respectively. The data were retrieved from the Open Academic Graph which is a network of research profiles and publications. We carefully processed each record and also tried several extractive and abstractive methods of both tasks to create performance baselines for other researchers. We further illustrate the performance of those methods previewing their outputs. In the near future, we would like to apply topic modeling on the two sets to derive subsets of research articles from more specific disciplines.


A Multi-level Annotated Corpus of Scientific Papers for Scientific Document Summarization and Cross-document Relation Discovery

Ahmed AbuRa’ed, Horacio Saggion and Luis Chiruzzo

Related work sections or literature reviews are an essential part of every scientific article being crucial for paper reviewing and assessment. The automatic generation of related work sections can be considered an instance of the multi-document summarization problem. In order to allow the study of this specific problem, we have developed a manually annotated, machine readable data-set of related work sections, cited papers (e.g. references) and sentences, together with an additional layer of papers citing the references. We additionally present experiments on the identification of cited sentences, using as input citation contexts. The corpus alongside the gold standard are made available for use by the scientific community.


Abstractive Text Summarization based on Language Model Conditioning and Locality Modeling

Dmitrii Aksenov, Julian Moreno-Schneider, Peter Bourgonje, Robert Schwarzenberg, Leonhard Hennig and Georg Rehm

We explore to what extent knowledge about the pre-trained language model that is used is beneficial for the task of abstractive summarization. To this end, we experiment with conditioning the encoder and decoder of a Transformer-based neural model on the BERT language model. In addition, we propose a new method of BERT-windowing, which allows chunk-wise processing of texts longer than the BERT window size. We also explore how locality modeling, i.e., the explicit restriction of calculations to the local context, can affect the summarization ability of the Transformer. This is done by introducing 2-dimensional convolutional self-attention into the first layers of the encoder. The results of our models are compared to a baseline and the state-of-the-art models on the CNN/Daily Mail dataset. We additionally train our model on the SwissText dataset to demonstrate usability on German. Both models outperform the baseline in ROUGE scores on two datasets and show their superiority in a manual qualitative analysis.


A Data Set for the Analysis of Text Quality Dimensions in Summarization Evaluation

Margot Mieskes, Eneldo Loza Mencía and Tim Kronsbein

Automatic evaluation of summarization focuses on developing a metric to represent the quality of the resulting text. However, text quality is represented in a variety of dimensions ranging from grammaticality to readability and coherence. In our work, we analyze the dependencies between a variety of quality dimensions on automatically created multi-document summaries and which dimensions automatic evaluation metrics such as ROUGE, PEAK or JSD are able to capture. Our results indicate that variants of ROUGE are correlated to various quality dimensions and that some automatic summarization methods achieve higher quality summaries than others with respect to individual summary quality dimensions. Our results also indicate that differentiating between quality dimensions facilitates inspection and fine-grained comparison of summarization methods and their characteristics. We make the data from our two summarization quality evaluation experiments publicly available in order to facilitate the future development of specialized automatic evaluation methods.


Summarization Beyond News: The Automatically Acquired Fandom Corpora

Benjamin Hättasch, Nadja Geisler, Christian M. Meyer and Carsten Binnig

Large state-of-the-art corpora for training neural networks to create abstractive summaries are mostly limited to the news genre, as it is expensive to acquire human-written summaries for other types of text at a large scale. In this paper, we present a novel automatic corpus construction approach to tackle this issue as well as three new large open-licensed summarization corpora based on our approach that can be used for training abstractive summarization models. Our constructed corpora contain fictional narratives, descriptive texts, and summaries about movies, television, and book series from different domains. All sources use a creative commons (CC) license, hence we can provide the corpora for download. In addition, we also provide a ready-to-use framework that implements our automatic construction approach to create custom corpora with desired parameters like the length of the target summary and the number of source documents from which to create the summary. The main idea behind our automatic construction approach is to use existing large text collections (e.g., thematic wikis) and automatically classify whether the texts can be used as (query-focused) multi-document summaries and align them with potential source texts. As a final contribution, we show the usefulness of our automatic construction approach by running state-of-the-art summarizers on the corpora and through a manual evaluation with human annotators.


Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment

Lorenzo De Mattei, Michele Cafagna, Felice Dell’Orletta and Malvina Nissim

We automatically generate headlines that are expected to comply with the specific styles of two different Italian newspapers. Through a data alignment strategy and different training/testing settings, we aim at decoupling content from style and preserve the latter in generation. In order to evaluate the generated headlines’ quality in terms of their specific newspaper-compliance, we devise a fine-grained evaluation strategy based on automatic classification. We observe that our models do indeed learn newspaper-specific style. Importantly, we also observe that humans aren’t reliable judges for this task, since although familiar with the newspapers, they are not able to discern their specific styles even in the original human-written headlines. The utility of automatic evaluation goes therefore beyond saving the costs and hurdles of manual annotation, and deserves particular care in its design.


Align then Summarize: Automatic Alignment Methods for Summarization Corpus Creation

Paul Tardy, David Janiszek, Yannick Estève and Vincent Nguyen

Summarizing texts is not a straightforward task.

Before even considering text summarization, one should determine what kind of summary is expected. How much should the information be compressed? Is it relevant to reformulate or should the summary stick to the original phrasing? State-of-the-art on automatic text summarization mostly revolves around news articles. We suggest that considering a wider variety of tasks would lead to an improvement in the field, in terms of generalization and robustness. We explore meeting summarization: generating reports from automatic transcriptions. Our work consists in segmenting and aligning transcriptions with respect to reports, to get a suitable dataset for neural summarization. Using a bootstrapping approach, we provide pre-alignments that are corrected by human annotators, making a validation set against which we evaluate automatic models. This consistently reduces annotators’ efforts by providing iteratively better pre-alignment and maximizes the corpus size by using annotations from our automatic alignment models. Evaluation is conducted on publicmeetings, a novel corpus of aligned public meetings. We report automatic alignment and summarization performances on this corpus and show that automatic alignment is relevant for data annotation since it leads to large improvement of almost +4 on all ROUGE scores on the summarization task.


A Summarization Dataset of Slovak News Articles

Marek Suppa and Jergus Adamec

As a well-established NLP task, single-document summarization has seen significant interest in the past few years. However, most of the work has been done on English datasets. This is particularly noticeable in the context of evaluation where the dominant ROUGE metric assumes its input to be written in English. In this paper we aim to address both of these issues by introducing a summarization dataset of articles from a popular Slovak news site and proposing small adaptations to the ROUGE metric that make it better suited for Slovak texts. Several baselines are evaluated on the dataset, including an extractive approach based on the Multilingual version of the BERT architecture. To the best of our knowledge, the presented dataset is the first large-scale news-based summarization dataset for text written in the Slovak language. It can be reproduced using the utilities available at


DaNewsroom: A Large-scale Danish Summarisation Dataset

Daniel Varab and Natalie Schluter

Dataset development for automatic summarisation systems is notoriously English-oriented. In this paper we present the first large-scale non-English language dataset specifically curated for automatic summarisation. The document-summary pairs are news articles and manually written summaries in the Danish language. There has previously been no work done to establish a Danish summarisation dataset, nor any published work on the automatic summarisation of Danish. We provide therefore the first automatic summarisation dataset for the Danish language (large-scale or otherwise). To support the comparison of future automatic summarisation systems for Danish, we include system performance on this dataset of strong well-established unsupervised baseline systems, together with an oracle extractive summariser, which is the first account of automatic summarisation system performance for Danish. Finally, we make all code for automatically acquiring the data freely available and make explicit how this technology can easily be adapted in order to acquire automatic summarisation datasets for further languages.



Text Mining


Diverging Divergences: Examining Variants of Jensen Shannon Divergence for Corpus Comparison Tasks

Jinghui Lu, Maeve Henchion and Brian Mac Namee

Jensen-Shannon divergence (JSD) is a distribution similarity measurement widely used in natural language processing. In corpus comparison tasks, where keywords are extracted to reveal the divergence between different corpora (for example, social media posts from proponents of different views on a political issue), two variants of JSD have emerged in the literature. One of these uses a weighting based on the relative sizes of the corpora being compared. In this paper we argue that this weighting is unnecessary and, in fact, can lead to misleading results. We recommend that this weighted version is not used. We base this recommendation on an analysis of the JSD variants and experiments showing how they impact corpus comparison results as the relative sizes of the corpora being compared change.
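The difference between the two variants can be illustrated with a minimal Python sketch. It assumes that the weighted variant is the generalized Jensen-Shannon divergence, in which each corpus distribution is weighted by its share of the combined token count; the paper's exact formulation may differ, and the function names and toy corpora are illustrative only.

```python
import math
from collections import Counter


def word_distribution(tokens, vocab):
    """Relative frequency of each vocabulary word in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total for w in vocab]


def kl(p, m):
    """Kullback-Leibler divergence KL(p || m) in bits; zero-probability terms contribute 0."""
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)


def jsd(p, q, weight_p=0.5):
    """Jensen-Shannon divergence between p and q.

    weight_p = 0.5 gives the usual unweighted JSD. Any other value gives the
    generalized (weighted) form, assumed here to stand for the size-weighted variant.
    """
    if not 0.0 < weight_p < 1.0:
        raise ValueError("weight_p must lie strictly between 0 and 1")
    weight_q = 1.0 - weight_p
    mixture = [weight_p * pi + weight_q * qi for pi, qi in zip(p, q)]
    return weight_p * kl(p, mixture) + weight_q * kl(q, mixture)


# Toy corpora of very different sizes (hypothetical data).
corpus_a = "tax cut tax cut jobs tax".split() * 50
corpus_b = "climate jobs climate tax".split()
vocab = sorted(set(corpus_a) | set(corpus_b))
p = word_distribution(corpus_a, vocab)
q = word_distribution(corpus_b, vocab)

size_share_a = len(corpus_a) / (len(corpus_a) + len(corpus_b))
print("unweighted JSD:   ", jsd(p, q))
print("size-weighted JSD:", jsd(p, q, weight_p=size_share_a))
```

In this sketch the weighted score depends on the relative corpus sizes as well as on the word distributions, which is the dependence the abstract warns about.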


TopicNet: Making Additive Regularisation for Topic Modelling Accessible

Victor Bulatov, Vasiliy Alekseev, Konstantin Vorontsov, Darya Polyudova, Eugenia Veselova, Alexey Goncharov and Evgeny Egorov

This paper introduces TopicNet, a new Python module for topic modeling. This package, distributed under the MIT license, focuses on bringing additive regularization topic modelling (ARTM) to non-specialists using a general-purpose high-level language. The module features include powerful model visualization techniques, various training strategies, semi-automated model selection, support for user-defined goal metrics, and a modular approach to topic model training. Source code and documentation are available at


SC-CoMIcs: A Superconductivity Corpus for Materials Informatics

Kyosuke Yamaguchi, Ryoji Asahi and Yutaka Sasaki

This paper describes a novel corpus tailored for the text mining of superconducting materials in Materials Informatics (MI), named SuperConductivity Corpus for Materials Informatics (SC-CoMIcs). Different from biomedical informatics, there exist very few corpora targeting Materials Science and Engineering (MSE). Especially, there is no sizable corpus which can be used to assist the search of superconducting materials. A team of materials scientists and natural language processing experts jointly designed the annotation and constructed a corpus consisting of 1,000 manually annotated MSE abstracts related to superconductivity. We conducted experiments on the corpus with a neural Named Entity Recognition (NER) tool. The experimental results show that NER performance over the corpus is around 77% in terms of micro-F1, which is comparable to human annotator agreement rates. Using the trained NER model, we automatically annotated 9,000 abstracts and created a term retrieval tool based on the term similarity. This tool can find superconductivity terms relevant to a query term within a specified Named Entity category, which demonstrates the power of our SC-CoMIcs, efficiently providing knowledge for Materials Informatics applications from rapidly expanding publications.


GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

Masato Hagiwara and Masato Mita

The lack of large-scale datasets has been a major hindrance to the development of NLP tasks such as spelling correction and grammatical error correction (GEC). As a complementary new resource for these tasks, we present the GitHub Typo Corpus, a large-scale, multilingual dataset of misspellings and grammatical errors along with their corrections harvested from GitHub, a large and popular platform for hosting and sharing git repositories. The dataset, which we have made publicly available, contains more than 350k edits and 65M characters in more than 15 languages, making it the largest dataset of misspellings to date. We also describe our process for filtering true typo edits based on learned classifiers on a small annotated subset, and demonstrate that typo edits can be identified with F1 ~ 0.9 using a very simple classifier with only three features. The detailed analyses of the dataset show that existing spelling correctors merely achieve an F-measure of approx. 0.5, suggesting that the dataset serves as a new, rich source of spelling errors that complement existing datasets.


Annotation of Adverse Drug Reactions in Patients’ Weblogs

Yuki Arase, Tomoyuki Kajiwara and Chenhui Chu

Adverse drug reactions are a severe problem that significantly degrades quality of life, or even threatens the life of patients. Patient-generated texts available on the web have been gaining attention as a promising source of information in this regard. While previous studies annotated such patient-generated content, they only reported on limited information, such as whether a text described an adverse drug reaction or not. Further, they only annotated short texts of a few sentences crawled from online forums and social networking services. The dataset we present in this paper is unique for the richness of annotated information, including detailed descriptions of drug reactions with full context. We crawled patients’ weblog articles shared on an online patient-networking platform and annotated the effects of drugs therein reported. We identified spans describing drug reactions and assigned labels for related drug names, standard codes for the symptoms of the reactions, and types of effects. As a first dataset, we annotated 677 drug reactions with these detailed labels based on 169 weblog articles by Japanese lung cancer patients. Our annotation dataset is made publicly available at our web site for further research on the detection of adverse drug reactions and more broadly, on patient-generated text processing.


Beyond Citations: Corpus-based Methods for Detecting the Impact of Research Outcomes on Society

Rezvaneh Rezapour, Jutta Bopp, Norman Fiedler, Diana Steffen, Andreas Witt and Jana Diesner

This paper proposes, implements and evaluates a novel, corpus-based approach for identifying categories indicative of the impact of research via a deductive (top-down, from theory to data) and an inductive (bottom-up, from data to theory) approach. The resulting categorization schemes differ in substance. Research outcomes are typically assessed by using bibliometric methods, such as citation counts and patterns, or alternative metrics, such as references to research in the media. Shortcomings with these methods are their inability to identify impact of research beyond academia (bibliometrics) and considering text-based impact indicators beyond those that capture attention (altmetrics). We address these limitations by leveraging a mixed-methods approach for eliciting impact categories from experts, project personnel (deductive) and texts (inductive). Using these categories, we label a corpus of project reports per category schema, and apply supervised machine learning to infer these categories from project reports. The classification results show that we can predict deductively and inductively derived impact categories with 76.39% and 78.81% accuracy (F1-score), respectively. Our approach can complement solutions from bibliometrics and scientometrics for assessing the impact of research and studying the scope and types of advancements transferred from academia to society.


Toxic, Hateful, Offensive or Abusive? What Are We Really Classifying? An Empirical Analysis of Hate Speech Datasets

Paula Fortuna, Juan Soler and Leo Wanner

The field of the automatic detection of hate speech and related concepts has raised a lot of interest in the last years. Different datasets were annotated and classified by means of applying different machine learning algorithms. However, few efforts were made to clarify the applied categories and homogenize different datasets. Our study takes up this demand. We analyze six different publicly available datasets in this field with respect to their similarity and compatibility. We conduct two different experiments. First, we try to make the datasets compatible and represent the dataset classes as Fast Text word vectors, analyzing the similarity between different classes in an intra- and inter-dataset manner. Second, we submit the chosen datasets to the Perspective API Toxicity classifier, achieving different performances depending on the categories and datasets. One of the main conclusions of these experiments is that many different definitions are being used for equivalent concepts, which makes most of the publicly available datasets incompatible. Grounded in our analysis, we provide guidelines for future dataset collection and annotation.


Unsupervised Argumentation Mining in Student Essays

Isaac Persing and Vincent Ng

State-of-the-art systems for argumentation mining are supervised, thus relying on training data containing manually annotated argument components and the relationships between them. To eliminate the reliance on annotated data, we present a novel approach to unsupervised argument mining. The key idea is to bootstrap from a small set of argument components automatically identified using simple heuristics in combination with reliable contextual cues. Results on Stab and Gurevych’s corpus of 402 essays show that our unsupervised approach rivals two supervised baselines in performance and achieves 73.5-83.7% of the performance of a state-of-the-art neural approach.


Aspect-Based Sentiment Analysis as Fine-Grained Opinion Mining

Gerardo Ocampo Diaz, Xuanming Zhang and Vincent Ng

We show how the general fine-grained opinion mining concepts of opinion target and opinion expression are related to aspect-based sentiment analysis (ABSA) and discuss their benefits for resource creation over popular ABSA annotation schemes. Specifically, we first discuss why opinions modeled solely in terms of (entity, aspect) pairs inadequately capture the meaning of the sentiment originally expressed by authors and how opinion expressions and opinion targets can be used to avoid the loss of information. We then design a meaning-preserving annotation scheme and apply it to two popular ABSA datasets, the 2016 SemEval ABSA Restaurant and Laptop datasets. Finally, we discuss the importance of opinion expressions and opinion targets for next-generation ABSA systems. We make our datasets publicly available for download.


Predicting Item Survival for Multiple Choice Questions in a High-Stakes Medical Exam

Victoria Yaneva, Le An Ha, Peter Baldwin and Janet Mee

One of the most resource-intensive problems in the educational testing industry relates to ensuring that newly-developed exam questions can adequately distinguish between students of high and low ability. The current practice for obtaining this information is the costly procedure of pretesting: new items are administered to test-takers and then the items that are too easy or too difficult are discarded. This paper presents the first study towards automatic prediction of an item’s probability to “survive” pretesting (item survival), focusing on human-produced MCQs for a medical exam. Survival is modeled through a number of linguistic features and embedding types, as well as features inspired by information retrieval. The approach shows promising first results for this challenging new application and for modeling the difficulty of expert-knowledge questions.



Textual Entailment and Paraphrasing


Discourse Component to Sentence (DC2S): An Efficient Human-Aided Construction of Paraphrase and Sentence Similarity Dataset

Won Ik Cho, Jong In Kim, Young Ki Moon and Nam Soo Kim

Assessing the similarity of sentences and detecting paraphrases is an essential task both in theory and practice, but achieving a reliable dataset requires high resource. In this paper, we propose a discourse component-based paraphrase generation for the directive utterances, which is efficient in terms of human-aided construction and content preservation. All discourse components are expressed in natural language phrases, and the phrases are created considering both speech act and topic so that the controlled construction of the sentence similarity dataset is available. Here, we investigate the validity of our scheme using the Korean language, a language with diverse paraphrasing due to frequent subject drop and scramblings. With 1,000 intent argument phrases and thus generated 10,000 utterances, we make up a sentence similarity dataset of practically sufficient size. It contains five sentence pair types, including paraphrase, and displays a total volume of about 550K. To emphasize the utility of the scheme and dataset, we measure the similarity matching performance via conventional natural language inference models, also suggesting the multi-lingual extensibility.


Japanese Realistic Textual Entailment Corpus

Yuta Hayashibe

We perform the textual entailment (TE) corpus construction for the Japanese Language with the following three characteristics: First, the corpus consists of realistic sentences; that is, all sentences are spontaneous or almost equivalent. It does not need manual writing which causes hidden biases. Second, the corpus contains adversarial examples. We collect challenging examples that cannot be solved by a recent pre-trained language model. Third, the corpus contains explanations for a part of non-entailment labels. We perform the reasoning annotation where annotators are asked to check which tokens in hypotheses are the reason why the relations are labeled. This makes it easy to validate the annotation and analyze system errors. The resulting corpus consists of 48,000 realistic Japanese examples. It is the largest among publicly available Japanese TE corpora. Additionally, to our knowledge, it is the first Japanese TE corpus that includes reasons for the annotation. We are planning to distribute this corpus to the NLP community at the time of publication.


Improving the Precision of Natural Textual Entailment Problem Datasets

Jean-Philippe Bernardy and Stergios Chatzikyriakidis

In this paper, we propose a method to modify natural textual entailment problem datasets so that they better reflect a more precise notion of entailment. We apply this method to a subset of the Recognizing Textual Entailment datasets. We thus obtain a new corpus of entailment problems, which has the following three characteristics: 1. it is precise (does not leave out implicit hypotheses) 2. it is based on “real-world” texts (i.e. most of the premises were written for purposes other than testing textual entailment). 3. its size is 150. Broadly, the method that we employ is to make any missing hypotheses explicit using a crowd of experts. We discuss the relevance of our method in improving existing NLI datasets to be more fit for precise reasoning and we argue that this corpus can be the basis for a first step towards wide-coverage testing of precise natural-language inference systems.


Comparative Study of Sentence Embeddings for Contextual Paraphrasing

Louisa Pragst, Wolfgang Minker and Stefan Ultes

Paraphrasing is an important aspect of natural-language generation that can produce more variety in the way specific content is presented. Traditionally, paraphrasing has been focused on finding different words that convey the same meaning. However, in human-human interaction, we regularly express our intention with phrases that are vastly different regarding both word content and syntactic structure. Instead of exchanging only individual words, the complete surface realisation of a sentence is altered while still preserving its meaning and function in a conversation. This kind of contextual paraphrasing has not yet received a lot of attention from the scientific community despite its potential for the creation of more varied dialogues. In this work, we evaluate several existing approaches to sentence encoding with regard to their ability to capture such context-dependent paraphrasing. To this end, we define a paraphrase classification task that incorporates contextual paraphrases, perform dialogue act clustering, and determine the performance of the sentence embeddings in a sentence swapping task.


HypoNLI: Exploring the Artificial Patterns of Hypothesis-only Bias in Natural Language Inference

Tianyu Liu, Zheng Xin, Baobao Chang and Zhifang Sui

Many recent studies have shown that for models trained on datasets for natural language inference (NLI), it is possible to make correct predictions by merely looking at the hypothesis while completely ignoring the premise. In this work, we manage to derive adversarial examples in terms of the hypothesis-only bias and explore eligible ways to mitigate such bias. Specifically, we extract various phrases from the hypotheses (artificial patterns) in the training sets, and show that they have been strong indicators to the specific labels. We then figure out ‘hard’ and ‘easy’ instances from the original test sets whose labels are opposite to or consistent with those indications. We also set up baselines including both pretrained models (BERT, RoBERTa, XLNet) and competitive non-pretrained models (InferSent, DAM, ESIM).

Apart from the benchmark and baselines, we also investigate two debiasing approaches which exploit the artificial pattern modeling to mitigate such hypothesis-only bias: down-sampling and adversarial training. We believe those methods can be treated as competitive baselines in NLI debiasing tasks.
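
As a rough illustration of the pattern extraction and the hard/easy split described above, the sketch below uses single hypothesis words as stand-in patterns on toy data. The function names, thresholds and examples are assumptions made for illustration, not the authors' setup.

```python
from collections import Counter, defaultdict

def find_indicator_patterns(train, min_count=2, min_ratio=0.8):
    """Map each hypothesis word to the label it strongly indicates.

    train: list of (hypothesis, label) pairs; premises are deliberately ignored.
    """
    label_counts = defaultdict(Counter)
    for hypothesis, label in train:
        for word in set(hypothesis.lower().split()):
            label_counts[word][label] += 1
    patterns = {}
    for word, counts in label_counts.items():
        label, top = counts.most_common(1)[0]
        total = sum(counts.values())
        if total >= min_count and top / total >= min_ratio:
            patterns[word] = label
    return patterns

def split_hard_easy(test, patterns):
    """'Easy' instances agree with every matched pattern; 'hard' ones contradict at least one."""
    easy, hard = [], []
    for hypothesis, label in test:
        indicated = {patterns[w] for w in set(hypothesis.lower().split()) if w in patterns}
        if not indicated:
            continue  # no pattern fires, so the instance is neither hard nor easy
        (easy if indicated == {label} else hard).append((hypothesis, label))
    return easy, hard

# Toy data (hypothetical): "nobody" co-occurs only with the contradiction label.
train = [
    ("nobody is sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("a person is outside", "entailment"),
    ("a person is moving", "entailment"),
]
test = [("nobody is here", "contradiction"), ("nobody is here", "entailment")]
patterns = find_indicator_patterns(train)
easy, hard = split_hard_easy(test, patterns)
```

A hypothesis-only model that has memorised "nobody" as a contradiction cue would get the easy instance right and the hard one wrong, which is the behaviour such a split is meant to expose.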


SAPPHIRE: Simple Aligner for Phrasal Paraphrase with Hierarchical Representation

Masato Yoshinaka, Tomoyuki Kajiwara and Yuki Arase

We present SAPPHIRE, a Simple Aligner for Phrasal Paraphrase with HIerarchical REpresentation. Monolingual phrase alignment is a fundamental problem in natural language understanding and also a crucial technique in various applications such as natural language inference and semantic textual similarity assessment. Previous methods for monolingual phrase alignment are language-resource intensive; they require large-scale synonym/paraphrase lexica and high-quality parsers. Different from them, SAPPHIRE depends only on a monolingual corpus to train word embeddings. Therefore, it is easily transferable to specific domains and different languages. Specifically, SAPPHIRE first obtains word alignments using pre-trained word embeddings and then expands them to phrase alignments by bilingual phrase extraction methods. To estimate the likelihood of phrase alignments, SAPPHIRE uses phrase embeddings that are hierarchically composed of word embeddings. Finally, SAPPHIRE searches for a set of consistent phrase alignments on a lattice of phrase alignment candidates. It achieves search-efficiency by constraining the lattice so that all the paths go through a phrase alignment pair with the highest alignment score. Experimental results using the standard dataset for phrase alignment evaluation show that SAPPHIRE outperforms the previous method and establishes the state-of-the-art performance.


TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages

Yves Scherrer

This paper presents TaPaCo, a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with between 200 and 250,000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists. The dataset is available at
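
The graph construction and traversal described above can be sketched as follows. This is a minimal illustration that assumes paraphrase sets correspond to same-language sentences within one connected component; the abstract does not specify the traversal, and the data layout and names here are hypothetical.

```python
from collections import defaultdict

def paraphrase_sets(sentences, links):
    """Group same-language sentences that are connected through equivalence links.

    sentences: dict mapping sentence id -> (language, text)
    links: iterable of (id_a, id_b) pairs meaning "these two mean the same thing"
    """
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    seen, result = set(), []
    for start in sentences:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        # Within a component, sentences in the same language are paraphrase candidates.
        by_language = defaultdict(list)
        for node in component:
            language, text = sentences[node]
            by_language[language].append(text)
        result.extend(
            (language, sorted(texts))
            for language, texts in by_language.items()
            if len(texts) > 1
        )
    return result

# Toy data: two English sentences linked through a shared German translation.
sentences = {
    1: ("en", "I'm tired."),
    2: ("de", "Ich bin müde."),
    3: ("en", "I am tired."),
    4: ("en", "Hello."),
}
links = [(1, 2), (2, 3)]
sets = paraphrase_sets(sentences, links)
```

The two English sentences are never linked directly; they end up in one set only because both are linked to the same German sentence. The filtering and pruning steps mentioned in the abstract are not modelled here.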


Automated Fact-Checking of Claims from Wikipedia

Aalok Sathe, Salar Ather, Tuan Manh Le, Nathan Perry and Joonsuk Park

Automated fact checking is becoming increasingly vital as both truthful and fallacious information accumulate online. Research on fact checking has benefited from large-scale datasets such as FEVER and SNLI. However, such datasets suffer from limited applicability due to the synthetic nature of claims and/or evidence written by annotators that differ from real claims and evidence on the internet. To this end, we present WikiFactCheck-English, a dataset of 124k+ triples consisting of a claim, context and an evidence document extracted from English Wikipedia articles and citations, as well as 34k+ manually written claims that are refuted by the evidence documents. This is the largest fact checking dataset consisting of real claims and evidence to date; it will allow the development of fact checking systems that can better process claims and evidence in the real world. We also show that for the NLI subtask, a logistic regression system trained using existing and novel features achieves peak accuracy of 68%, providing a competitive baseline for future work. Also, a decomposable attention model trained on SNLI significantly underperforms the models trained on this dataset, suggesting that models trained on manually generated data may not be sufficiently generalizable or suitable for fact checking real-world claims.


Towards the Necessity for Debiasing Natural Language Inference Datasets

Mithun Paul Panenghat, Sandeep Suntwal, Faiz Rafique, Rebecca Sharp and Mihai Surdeanu

Modeling natural language inference is a challenging task. With large annotated data sets available it has now become feasible to train complex neural network based inference methods which achieve state-of-the-art performance. However, it has been shown that these models also learn from the subtle biases inherent in these datasets (Gururangan et al., 2018). In this work we explore two techniques for delexicalization that modify the datasets in such a way that we can control the importance that neural-network based methods place on lexical entities. We demonstrate that the proposed methods not only maintain the performance in-domain but also improve performance in some out-of-domain settings. For example, when using the delexicalized version of the FEVER dataset, the in-domain performance of a state-of-the-art neural network method dropped only by 1.12% while its out-of-domain performance on the FNC dataset improved by 4.63%. We release the delexicalized versions of three common datasets used in natural language inference. These datasets are delexicalized using two methods: one which replaces the lexical entities in an overlap-aware manner, and a second, which additionally incorporates semantic lifting of nouns and verbs to their WordNet hypernym synsets.


A French Corpus for Semantic Similarity

Rémi Cardon and Natalia Grabar

Semantic similarity is an area of Natural Language Processing that is useful for several downstream applications, such as machine translation, natural language generation, information retrieval, or question answering. The task consists in assessing the extent to which two sentences express or do not express the same meaning. To do so, corpora with graded pairs of sentences are required. The grade is positioned on a given scale, usually going from 0 (completely unrelated) to 5 (equivalent semantics). In this work, we introduce such a corpus for French, the first that we know of. It comprises 1,010 sentence pairs with grades from five annotators. We describe the annotation process, analyse these data, and perform a few experiments for the automatic grading of semantic similarity.


Developing Dataset of Japanese Slot Filling Quizzes Designed for Evaluation of Machine Reading Comprehension

Takuto Watarai and Masatoshi Tsuchiya

This paper describes a dataset of Japanese slot filling quizzes that we are developing for the evaluation of machine reading comprehension. The dataset consists of quizzes automatically generated from Aozora Bunko, and each quiz is defined as a 4-tuple: a context passage, a query holding a slot, an answer character and a set of possible answer characters. The query is generated from the original sentence, which appears immediately after the context passage in the target book, by replacing the answer character with the slot. The set of possible answer characters consists of the answer character and the other characters who appear in the context passage. Because the context passage and the query share the same context, a machine which precisely understands the context may select the correct answer from the set of possible answer characters. The unique point of our approach is that we focus on characters of target books as slots to generate queries from original sentences, because they play important roles in narrative texts and precisely understanding their relationships is necessary for reading comprehension. To extract characters from target books, manually created dictionaries of characters are employed because some characters appear as common nouns, not as named entities.
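
The 4-tuple described above can be sketched as follows. This is a minimal illustration on English toy text rather than Aozora Bunko data; the `Quiz` class, the `make_quiz` helper, the slot marker and the plain substring matching against the character dictionary are all assumptions.

```python
from dataclasses import dataclass

SLOT = "____"  # placeholder for the slot; the marker itself is an assumption

@dataclass
class Quiz:
    context: str      # context passage
    query: str        # following sentence with the answer character replaced by the slot
    answer: str       # answer character
    candidates: list  # answer character plus other characters found in the context

def make_quiz(context, next_sentence, characters):
    """Build one quiz from a passage and the sentence that follows it, or return None."""
    mentioned = [c for c in characters if c in next_sentence]
    if not mentioned:
        return None  # the following sentence names no known character
    answer = mentioned[0]
    others = [c for c in characters if c in context and c != answer]
    if not others:
        return None  # a single-candidate quiz would be trivial (an assumption of this sketch)
    query = next_sentence.replace(answer, SLOT, 1)
    return Quiz(context, query, answer, [answer] + others)

# `characters` stands in for a manually created character dictionary.
characters = ["Alice", "Bob"]
context = "Alice met Bob at the station. Bob was late."
quiz = make_quiz(context, "Alice smiled anyway.", characters)
```

Here `quiz.query` is `"____ smiled anyway."` and the candidate set is `["Alice", "Bob"]`. Japanese text would need dictionary lookup over segmented text rather than substring tests.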



Tools, Systems, Applications


Detecting Negation Cues and Scopes in Spanish

Salud María Jiménez-Zafra, Roser Morante, Eduardo Blanco, María Teresa Martín Valdivia and L. Alfonso Ureña López

In this work we address the processing of negation in Spanish. We first present a machine learning system that processes negation in Spanish. Specifically, we focus on two tasks: i) negation cue detection and ii) scope identification. The corpus used in the experimental framework is the SFU Corpus. The results for cue detection outperform state-of-the-art results, whereas for scope detection this is the first system that performs the task for Spanish. Moreover, we provide a qualitative error analysis aimed at understanding the limitations of the system and showing which negation cues and scopes are straightforward to predict automatically, and which ones are challenging.


TIARA: A Tool for Annotating Discourse Relations and Sentence Reordering

Jan Wira Gotama Putra, Simone Teufel, Kana Matsumura and Takenobu Tokunaga

This paper introduces TIARA, a new publicly available web-based annotation tool for discourse relations and sentence reordering.

Annotation tasks such as these, which are based on relations between large textual objects, are inherently hard to visualise without cluttering the display or confusing the annotators. TIARA deals with the visual complexity during the annotation process by systematically simplifying the layout, and by offering interactive visualisation, including coloured links, indentation, and dual-view. TIARA’s text view allows annotators to focus on the analysis of logical sequencing between sentences. A separate tree view allows them to review their analysis in terms of the overall discourse structure. The dual-view gives it an edge over other discourse annotation tools and makes it particularly attractive as an educational tool (e.g., for teaching students how to argue more effectively). As it is based on standard web technologies and can be easily customised to other annotation schemes, it can be easily used by anybody. Apart from the project it was originally designed for, in which hundreds of texts were annotated by three annotators, TIARA has already been adopted by a second discourse annotation study, which uses it in the teaching of argumentation.


Infrastructure for Semantic Annotation in the Genomics Domain

Mahmoud El-Haj, Nathan Rutherford, Matthew Coole, Ignatius Ezeani, Sheryl Prentice, Nancy Ide, Jo Knight, Scott Piao, John Mariani, Paul Rayson and Keith Suderman

We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature, with an NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, in which context it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.


Correcting the Autocorrect: Context-Aware Typographical Error Correction via Training Data Augmentation

Kshitij Shah and Gerard de Melo

In this paper, we explore the artificial generation of typographical errors based on real-world statistics. We first draw on a small set of annotated data to compute spelling error statistics. These are then invoked to introduce errors into substantially larger corpora. The generation methodology allows us to generate particularly challenging errors that require context-aware error detection. We use it to create a set of English language error detection and correction datasets. Finally, we examine the effectiveness of machine learning models for detecting and correcting errors based on this data.
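
The two-step idea above (estimate error statistics from a small annotated set, then use them to corrupt a larger corpus) can be sketched as follows. This is a simplified illustration that only models character substitutions; the function names, toy pairs and error rate are assumptions, and the context-dependent errors the abstract mentions are not modelled.

```python
import random
from collections import Counter, defaultdict

def substitution_stats(pairs):
    """Count character substitutions observed in (correct, misspelled) pairs of equal length."""
    stats = defaultdict(Counter)
    for correct, typo in pairs:
        if len(correct) != len(typo):
            continue  # insertions and deletions are ignored in this sketch
        for c, t in zip(correct, typo):
            if c != t:
                stats[c][t] += 1
    return stats

def inject_errors(text, stats, rate, rng):
    """Replace characters with observed confusions, each with probability `rate`."""
    out = []
    for ch in text:
        if ch in stats and rng.random() < rate:
            options, weights = zip(*stats[ch].items())
            out.append(rng.choices(options, weights=weights, k=1)[0])
        else:
            out.append(ch)
    return "".join(out)

# Toy annotated data (hypothetical).
annotated = [("their", "thier"), ("great", "graet"), ("cat", "cst")]
stats = substitution_stats(annotated)
clean = "the cat ate"
noisy = inject_errors(clean, stats, rate=1.0, rng=random.Random(0))
```

Sampling replacements in proportion to their observed counts keeps the synthetic errors close to the distribution of the annotated data; a realistic `rate` would be far below the 1.0 used here to make the effect visible.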


KidSpell: A Child-Oriented, Rule-Based, Phonetic Spellchecker

Brody Downs, Oghenemaro Anuyah, Aprajita Shukla, Jerry Alan Fails, Sole Pera, Katherine Wright and Casey Kennington

For help with their spelling errors, children often turn to spellcheckers integrated in software applications like word processors and search engines. However, existing spellcheckers are usually tuned to the needs of traditional users (i.e., adults) and generally prove unsatisfactory for children. Motivated by this issue, we introduce KidSpell, an English spellchecker oriented to the spelling needs of children. KidSpell applies (i) an encoding strategy for mapping both misspelled words and spelling suggestions to their phonetic keys and (ii) a selection process that prioritizes candidate spelling suggestions that closely align with the misspelled word based on their respective keys. To assess the effectiveness of KidSpell, we compare the model’s performance against several popular, mainstream spellcheckers in a number of offline experiments using existing and novel datasets. The results of these experiments show that KidSpell outperforms existing spellcheckers, as it accurately prioritizes relevant spelling corrections when handling misspellings generated by children in both essay writing and online search tasks. As a byproduct of our study, we create two new datasets comprised of spelling errors generated by children from hand-written essays and web search inquiries, which we make available to the research community.
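
The two components named above, a phonetic key encoding and a key-based ranking of suggestions, can be sketched as follows. The encoding rules below are a toy stand-in invented for illustration and are not KidSpell's actual rules; the word list and function names are likewise hypothetical.

```python
def phonetic_key(word):
    """Toy phonetic key: merge similar-sounding spellings, drop non-initial vowels, collapse repeats."""
    w = word.lower()
    for old, new in (("ph", "f"), ("ck", "k"), ("c", "k"), ("z", "s")):
        w = w.replace(old, new)
    key = w[:1]
    for ch in w[1:]:
        if ch in "aeiouy" or ch == key[-1]:
            continue
        key += ch
    return key

def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def suggest(misspelling, dictionary, limit=3):
    """Rank dictionary words by how closely their key matches the misspelling's key."""
    key = phonetic_key(misspelling)
    ranked = sorted(dictionary, key=lambda w: (edit_distance(key, phonetic_key(w)), w))
    return ranked[:limit]

suggestions = suggest("elefant", ["element", "elephant", "eleven"])
```

With these toy rules, "elefant" and "elephant" share the key `elfnt`, so "elephant" is ranked first even though "element" is equally close in plain spelling distance.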


ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation

Suteera Seeha, Ivan Bilan, Liliana Mamani Sanchez, Johannes Huber, Michael Matuschek and Hinrich Schütze

We propose ThaiLMCut, a semi-supervised approach for Thai word segmentation which utilizes a bi-directional character language model (LM) as a way to leverage useful linguistic knowledge from unlabeled data. After the language model is trained on substantial unlabeled corpora, the weights of its embedding and recurrent layers are transferred to a supervised word segmentation model which continues fine-tuning them on a word segmentation task. Our experimental results demonstrate that applying the LM always leads to a performance gain, especially when the amount of labeled data is small. In such cases, the F1 Score increased by up to 2.02%. Even on a big labeled dataset, a small gain can still be obtained. The approach has also been shown to be very beneficial for out-of-domain settings with a gain in F1 Score of up to 3.13%. Finally, we show that ThaiLMCut can outperform other open source state-of-the-art models achieving an F1 Score of 98.78% on the standard benchmark, InterBEST2009.


CCOHA: Clean Corpus of Historical American English

Reem Alatrash, Dominik Schlechtweg, Jonas Kuhn and Sabine Schulte im Walde

Modelling language change is an increasingly important area of interest within the fields of sociolinguistics and historical linguistics. In recent years, there has been a growing number of publications whose main concern is studying changes that have occurred within the past centuries. The Corpus of Historical American English (COHA) is one of the most commonly used large corpora in diachronic studies in English. This paper describes methods applied to the downloadable version of the COHA corpus in order to overcome its main limitations, such as inconsistent lemmas and malformed tokens, without compromising its qualitative and distributional properties. The resulting corpus CCOHA contains a larger number of cleaned word tokens which can offer better insights into language change and allow for a larger variety of tasks to be performed.


Outbound Translation User Interface Ptakopět: A Pilot Study

Vilém Zouhar and Ondřej Bojar

It is not uncommon for Internet users to have to produce a text in a foreign language they have very little knowledge of and are unable to verify the translation quality. We call the task “outbound translation” and explore it by introducing an open-source modular system Ptakopět. Its main purpose is to inspect human interaction with MT systems enhanced with additional subsystems, such as backward translation and quality estimation. We follow up with an experiment on (Czech) human annotators tasked to produce questions in a language they do not speak (German), with the help of Ptakopět. We focus on three real-world use cases (communication with IT support, describing administrative issues and asking encyclopedic questions) from which we gain insight into different strategies users take when faced with outbound translation tasks. Round trip translation is known to be unreliable for evaluating MT systems but our experimental evaluation documents that it works very well for users, at least on MT systems of mid-range quality.


Seshat: a Tool for Managing and Verifying Annotation Campaigns of Audio Data

Hadrien Titeux, Rachid Riad, Xuan-Nga Cao, Nicolas Hamilakis, Kris Madden, Alejandrina Cristia, Anne-Catherine Bachoud-Lévi and Emmanuel Dupoux

We introduce Seshat, a new, simple and open-source software to efficiently manage annotations of speech corpora. The Seshat software allows users to easily customise and manage annotations of large audio corpora while ensuring compliance with the formatting and naming conventions of the annotated output files. In addition, it includes procedures for checking the content of annotations following specific rules that can be implemented in personalised parsers. Finally, we propose a double-annotation mode, for which Seshat computes automatically an associated inter-annotator agreement with the gamma measure taking into account the categorisation and segmentation discrepancies.


Dragonfly: Advances in Non-Speaker Annotation for Low Resource Languages

Cash Costello, Shelby Anderson, Caitlyn Bishop, James Mayfield and Paul McNamee

Dragonfly is an open source software tool that supports annotation of text in a low resource language by non-speakers of the language. Using semantic and contextual information, non-speakers of a language familiar with the Latin script can produce high quality named entity annotations to support construction of a name tagger. We describe a procedure for annotating low resource languages using Dragonfly that others can use, which we developed based on our experience annotating data in more than ten languages. We also present performance comparisons between models trained on native speaker and non-speaker annotations.


Natural Language Processing Pipeline to Annotate Bulgarian Legislative Documents

Svetla Koeva, Nikola Obreshkov and Martin Yalamov

The paper presents the Bulgarian MARCELL corpus, part of a recently developed multilingual corpus representing the national legislation in seven European countries, and the NLP pipeline that turns the web crawled data into a structured, linguistically annotated dataset. The Bulgarian data is web crawled, extracted from the original HTML format, filtered by document type, tokenised, sentence split, tagged and lemmatised with a fine-grained version of the Bulgarian Language Processing Chain, dependency parsed with NLP-Cube, annotated with named entities (persons, locations, organisations and others), noun phrases, IATE terms and EuroVoc descriptors. An orchestrator process has been developed to control the NLP pipeline performing an end-to-end data processing and annotation starting from the documents identification and ending in the generation of statistical reports. The Bulgarian MARCELL corpus consists of 25,283 documents (at the beginning of November 2019), which are classified into eleven types.


CLDFBench: Give Your Cross-Linguistic Data a Lift

Robert Forkel and Johann-Mattis List

While the amount of cross-linguistic data is constantly increasing, most datasets produced today and in the past cannot be considered FAIR (findable, accessible, interoperable, and reusable). To remedy this and to increase the comparability of cross-linguistic resources, it is not enough to set up standards and best practices for data to be collected in the future. We also need consistent workflows for the “retro-standardization” of data that has been published during the past decades and centuries. With the Cross-Linguistic Data Formats initiative, first standards for cross-linguistic data have been presented and successfully tested. So far, however, CLDF creation was hampered by the fact that it required a considerable degree of computational proficiency. With cldfbench, we introduce a framework for the retro-standardization of legacy data and the curation of new datasets that drastically simplifies the creation of CLDF by providing a consistent, reproducible workflow that rigorously supports version control and long term archiving of research data and code. The framework is distributed in the form of a Python package along with usage information and examples for best practice. This study introduces the new framework and illustrates how it can be applied by showing how a resource containing structural and lexical data for Sinitic languages can be efficiently retro-standardized and analyzed.


KonText: Advanced and Flexible Corpus Query Interface

Tomáš Machálek

We present an advanced, highly customizable corpus query interface KonText built on top of core libraries of the open-source corpus search engine NoSketch Engine (NoSkE). The aim is to overcome some limitations of the original NoSkE user interface and provide integration capabilities allowing connection of the basic search service with other language resources (LRs). The introduced features are based on long-term feedback given by the users and researchers of the Czech National Corpus (CNC) along with other LRs providers running KonText as a part of their services. KonText is a fully operational and mature software deployed at the CNC since 2014 that currently handles thousands of user queries per day.


Word at a Glance: Modular Word Profile Aggregator

Tomáš Machálek

Word at a Glance (WaG) is a word profile aggregator that provides means for exploring individual words, their comparison and translation, based on existing language resources and related software services. It is designed as a building kit-like application that fetches data from different sources and compiles them into a single, comprehensible and structured web page. WaG can be easily configured to support many tasks, but in general, it is intended to be used not only by language experts but also the general public.


RKorAPClient: An R Package for Accessing the German Reference Corpus DeReKo via KorAP

Marc Kupietz, Nils Diewald and Eliza Margaretha

Making corpora accessible and usable for linguistic research is a huge challenge in view of (too) big data, legal issues and a rapidly evolving methodology. This does not only affect the design of user-friendly graphical interfaces to corpus analysis tools, but also the availability of programming interfaces supporting access to the functionality of these tools from various analysis and development environments. RKorAPClient is a new research tool in the form of an R package that interacts with the Web API of the corpus analysis platform KorAP, which provides access to large annotated corpora, including the German reference corpus DeReKo with 45 billion tokens. In addition to optionally authenticated KorAP API access, RKorAPClient provides further processing and visualization features to simplify common corpus analysis tasks. This paper introduces the basic functionality of RKorAPClient and exemplifies various analysis tasks based on DeReKo, that are bundled within the R package and can serve as a basic framework for advanced analysis and visualization approaches.


CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing

Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima Taji, Mai Oudah, Bashar Alhafni, Go Inoue, Fadhl Eryani, Alexander Erdmann and Nizar Habash

We present CAMeL Tools, a collection of open-source tools for Arabic natural language processing in Python. CAMeL Tools currently provides utilities for pre-processing, morphological modeling, Dialect Identification, Named Entity Recognition and Sentiment Analysis. In this paper, we describe the design of CAMeL Tools and the functionalities it provides.


ReSiPC: a Tool for Complex Searches in Parallel Corpora

Antoni Oliver and Bojana Mikelenić

In this paper, a tool specifically designed to allow for complex searches in large parallel corpora is presented. The formalism for the queries is very powerful as it uses standard regular expressions that allow for complex queries combining word forms, lemmata and POS-tags. As queries are performed over POS-tags, at least one of the languages in the parallel corpus should be POS-tagged. Searches can be performed in one of the languages or in both languages at the same time. The program is able to POS-tag the corpora using the Freeling analyzer through its Python API. ReSiPC is developed in Python version 3 and it is distributed under a free license (GNU GPL). The tool can be used to provide data for contrastive linguistics research and an example of use in a Spanish-Croatian parallel corpus is presented. ReSiPC is designed for queries in POS-tagged corpora, but it can be easily adapted for querying corpora containing other kinds of information.
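
To make the idea of regular-expression queries over word forms, lemmata and POS-tags concrete, the sketch below encodes each token as `form/lemma/POS` and searches the resulting string. The encoding, the simplified tag names and the helper functions are assumptions for illustration and do not reflect ReSiPC's actual query formalism or Freeling's tagset.

```python
import re

def encode(tokens):
    """Join (form, lemma, pos) triples into one searchable string."""
    return " ".join(f"{form}/{lemma}/{pos}" for form, lemma, pos in tokens)

def search(query, tagged_sentences):
    """Return the captured groups of every match of `query` in every sentence."""
    pattern = re.compile(query)
    hits = []
    for tokens in tagged_sentences:
        for match in pattern.finditer(encode(tokens)):
            hits.append(match.groups())
    return hits

FIELD = r"[^/\s]+"  # one field of a token: no slash, no whitespace

# A noun immediately followed by an adjective; captures both word forms.
NOUN_ADJ = rf"(?<!\S)({FIELD})/{FIELD}/NOUN ({FIELD})/{FIELD}/ADJ(?!\S)"
# Any form of the lemma "dormir" used as a verb.
DORMIR = rf"(?<!\S)({FIELD})/dormir/VERB(?!\S)"

corpus = [
    [("los", "el", "DET"), ("gatos", "gato", "NOUN"),
     ("negros", "negro", "ADJ"), ("duermen", "dormir", "VERB")],
    [("la", "el", "DET"), ("casa", "casa", "NOUN"),
     ("es", "ser", "VERB"), ("grande", "grande", "ADJ")],
]
noun_adj_hits = search(NOUN_ADJ, corpus)
dormir_hits = search(DORMIR, corpus)
```

The first query finds only "gatos negros", because in the second sentence a verb stands between noun and adjective. Tokens that themselves contain a slash would need escaping in this encoding.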


HitzalMed: Anonymisation of Clinical Text in Spanish

Salvador Lima Lopez, Naiara Perez, Laura García-Sardiña and Montse Cuadros

HitzalMed is a web-framed tool that performs automatic detection of sensitive information in clinical texts using machine learning algorithms reported to be competitive for the task. Moreover, once sensitive information is detected, different anonymisation techniques are implemented that are configurable by the user: for instance, substitution, where sensitive items are replaced by same category text in an effort to generate a new document that looks as natural as the original one. The tool is able to get data from different document formats and outputs downloadable anonymised data. This paper presents the anonymisation and substitution technology and the demonstrator which is publicly available at


The xtsv Framework and the Twelve Virtues of Pipelines

Balázs Indig, Bálint Sass and Iván Mittelholcz

We present xtsv, an abstract framework for building NLP pipelines. It covers several kinds of functionalities which can be implemented at an abstract level. We survey these features and argue that all are desired in a modern pipeline. The framework has a simple yet powerful internal communication format which is essentially tsv (tab separated values) with header plus some additional features. We put emphasis on the capabilities of the presented framework, for example its ability to allow new modules to be easily integrated or replaced, or the variety of its usage options. When a module is put into xtsv, all functionalities of the system are immediately available for that module, and the module can be a part of an xtsv pipeline. The design also allows convenient investigation and manual correction of the data flow from one module to another. We demonstrate the power of our framework with a successful application: a concrete NLP pipeline for Hungarian called e-magyar text processing system (emtsv) which integrates Hungarian NLP tools in xtsv. All the advantages of the pipeline come from the inherent properties of the xtsv framework.
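
The general idea of modules communicating through TSV with a header can be sketched as follows: each module reads the columns it needs by name and appends its own output column, so modules can be chained or swapped. This is a generic illustration, not the xtsv API; the `run_module` helper and the column names are hypothetical, and sentence boundaries and the format's additional features are not handled.

```python
def run_module(tsv_text, source_field, target_field, func):
    """Append a column `target_field` computed by `func` from column `source_field`."""
    lines = tsv_text.rstrip("\n").split("\n")
    header = lines[0].split("\t")
    if source_field not in header:
        raise ValueError(f"missing input column: {source_field}")
    src = header.index(source_field)
    out = ["\t".join(header + [target_field])]
    for line in lines[1:]:
        cells = line.split("\t")
        out.append("\t".join(cells + [func(cells[src])]))
    return "\n".join(out) + "\n"

# A two-module pipeline: the second module consumes the first module's output column.
raw = "form\nA\nkutya\nugat\n"
step1 = run_module(raw, "form", "lower", str.lower)
step2 = run_module(step1, "lower", "length", lambda s: str(len(s)))
```

Because every intermediate result is plain TSV, the data passed between modules can be inspected or corrected by hand before the next module runs, which matches the property the abstract highlights.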


A Web-based Collaborative Annotation and Consolidation Tool

Tobias Daudert

Annotation tools are a valuable asset for the construction of labelled textual datasets. However, they tend to have a rigid structure, closed back-end and front-end, and are built in a non-user-friendly way. These shortcomings hinder their use in annotation tasks requiring varied text formats, prevent researchers from optimising the tool for the annotation task, and keep people with little programming knowledge from easily modifying the tool, rendering it unusable for a large cohort. Targeting these needs, we present a web-based collaborative annotation and consolidation tool (AWOCATo), capable of supporting varied textual formats. AWOCATo is based on three pillars: (1) Simplicity, built with a modular architecture employing easy to use technologies; (2) Flexibility, the JSON configuration file allows an easy adaptation to the annotation task; (3) Customizability, parameters such as labels, colours, or consolidation features can be easily customized. These features allow AWOCATo to support a range of tasks and domains, filling the gap left by the absence of annotation tools that can be used by people with and without programming knowledge, including those who wish to easily adapt a tool to less common tasks. AWOCATo is available for download at


Data Query Language and Corpus Tools for Slot-Filling and Intent Classification Data

Stefan Larson, Eric Guldan and Kevin Leach

Typical machine learning approaches to developing task-oriented dialog systems require the collection and management of large amounts of training data, especially for the tasks of intent classification and slot-filling. Managing this data can be cumbersome without dedicated tools to help the dialog system designer understand the nature of the data. This paper presents a toolkit for analyzing slot-filling and intent classification corpora. It includes (1) a new lightweight and readable data and file format for intent classification and slot-filling corpora, (2) a new query language for searching intent classification and slot-filling corpora, and (3) tools for understanding the structure and makeup of such corpora. We apply our toolkit to several well-known NLU datasets, and demonstrate that our toolkit can be used to uncover interesting and surprising insights. By releasing our toolkit to the research community, we hope to enable others to develop more robust and intelligent slot-filling and intent classification models.


SHR++: An Interface for Morpho-syntactic Annotation of Sanskrit Corpora

Amrith Krishna, Shiv Vidhyut, Dilpreet Chawla, Sruti Sambhavi and Pawan Goyal

We propose a web-based annotation framework, SHR++, for morpho-syntactic annotation of corpora in Sanskrit. SHR++ is designed to generate annotations for the word-segmentation, morphological parsing and dependency analysis tasks in Sanskrit. It incorporates analyses and predictions from various tools designed for processing texts in Sanskrit, and utilises them to ease the cognitive load of the human annotators. Specifically, SHR++ uses Sanskrit Heritage Reader, a lexicon driven shallow parser for enumerating all the phonetically and lexically valid word splits along with their morphological analyses for a given string. This would help the annotators in choosing the solutions, rather than performing the segmentations by themselves. Further, predictions from a word segmentation tool are added as suggestions that can aid the human annotators in their decision making. Our evaluation shows that enabling this segmentation suggestion component reduces the annotation time by 20.15%. SHR++ can be accessed online at and the codebase, for the independent deployment of the system elsewhere, is hosted at


KOTONOHA: A Corpus Concordance System for Skewer-Searching NINJAL Corpora

Teruaki Oka, Yuichi Ishimoto, Yutaka Yagi, Takenori Nakamura, Masayuki Asahara, Kikuo Maekawa, Toshinobu Ogiso, Hanae Koiso, Kumiko Sakoda and Nobuko Kibe

The National Institute for Japanese Language and Linguistics, Japan (NINJAL, Japan), has developed several types of corpora. For each corpus NINJAL provided an online search environment, ‘Chunagon’, which is a morphological-information-annotation-based concordance system made publicly available in 2011. NINJAL has now provided a skewer-search system ‘Kotonoha’ based on the ‘Chunagon’ systems. This system enables querying of multiple corpora by certain categories, such as register type and period.


Gamification Platform for Collecting Task-oriented Dialogue Data

Haruna Ogawa, Hitoshi Nishikawa, Takenobu Tokunaga and Hikaru Yokono

Demand for massive language resources is increasing as the data-driven approach has established a leading position in Natural Language Processing. However, creating dialogue corpora is still a difficult task due to the complexity of the human dialogue structure and the diversity of dialogue topics. Though crowdsourcing is widely used to assemble such data, it presents problems such as less-motivated workers. We propose a platform for collecting task-oriented situated dialogue data by using gamification. Combining a video game with data collection brings benefits such as motivating workers and reducing cost. Our platform enables data collectors to create their original video game in which they can collect dialogue data of various types of tasks by using the logging function of the platform. Also, the platform provides an annotation function that enables players to annotate their own utterances. The annotation can be gamified as well. We aim at high-quality annotation by introducing such a self-annotation method. We implemented a prototype of the proposed platform and conducted a preliminary evaluation, obtaining promising results in terms of both dialogue data collection and self-annotation.


Improving the Production Efficiency and Well-formedness of Automatically-Generated Multiple-Choice Cloze Vocabulary Questions

Ralph Rose

Multiple-choice cloze (fill-in-the-blank) questions are widely used in knowledge testing and are commonly used for testing vocabulary knowledge. Word Quiz Constructor (WQC) is a Java application that is designed to produce such test items automatically from the Academic Word List (Coxhead, 2000) and using various online and offline resources. The present work evaluates recently added features of WQC to see whether they improve the production quality and well-formedness of vocabulary quiz items over previously implemented features in WQC. Results of a production test and a well-formedness survey using Amazon Mechanical Turk show that newly-introduced features (Linsear Write readability formula and Google Books NGrams frequency list) significantly improve the production quality of items over previous features (Automated Readability Index and frequency list derived from the British Academic Written English corpus). Items are produced faster and stem sentences are shorter in length without any degradation in their well-formedness. Approximately 90% of such items are judged well-formed, surpassing the rate of manually-produced items.
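
For readers unfamiliar with the readability measures named in this abstract, the following is a minimal Python sketch of the standard Automated Readability Index formula. It is not taken from WQC (a Java application whose code is not shown here); the function name and the simple regex-based word and sentence splitting are assumptions made for illustration.

```python
import re


def automated_readability_index(text: str) -> float:
    """Standard ARI: 4.71 * chars/words + 0.5 * words/sentences - 21.43.

    Tokenization here is a deliberately naive assumption: words are runs of
    letters/digits, sentences are split on '.', '!' and '?'.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(len(w) for w in words)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43


if __name__ == "__main__":
    # Shorter words and shorter sentences yield a lower (easier) score.
    print(round(automated_readability_index("The cat sat. The dog ran."), 2))
```

A stem sentence with a lower score would be judged easier to read; how WQC thresholds or combines such scores is not stated in the abstract.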


Improving Sentence Boundary Detection for Spoken Language Transcripts

Ines Rehbein, Josef Ruppenhofer and Thomas Schmidt

This paper presents experiments on sentence boundary detection in transcripts of spoken dialogues. Segmenting spoken language into sentence-like units is a challenging task, due to disfluencies, ungrammatical or fragmented structures and the lack of punctuation. In addition, one of the main bottlenecks for many NLP applications for spoken language is the small size of the training data, as the transcription and annotation of spoken language is far more time-consuming and labour-intensive than processing written language. We therefore investigate the benefits of data expansion and transfer learning and test different ML architectures for this task. Our results show that data expansion is not straightforward and even data from the same domain does not always improve results. They also highlight the importance of modelling, i.e. of finding the best architecture and data representation for the task at hand. For the detection of boundaries in spoken language transcripts, we achieve a substantial improvement when framing the boundary detection problem as a sentence pair classification task, as compared to a sequence tagging approach.


MorphAGram, Evaluation and Framework for Unsupervised Morphological Segmentation

Ramy Eskander, Francesca Callejas, Elizabeth Nichols, Judith Klavans and Smaranda Muresan

Computational morphological segmentation has been an active research topic for decades as it is beneficial for many natural language processing tasks. With the high cost of manually labeling data for morphology and the increasing interest in low-resource languages, unsupervised morphological segmentation has become essential for processing a typologically diverse set of languages, whether high-resource or low-resource. In this paper, we present and release MorphAGram, a publicly available framework for unsupervised morphological segmentation that uses Adaptor Grammars (AG) and is based on the work presented by Eskander et al. (2016). We conduct an extensive quantitative and qualitative evaluation of this framework on 12 languages and show that the framework achieves state-of-the-art results across languages of different typologies (from fusional to polysynthetic and from high-resource to low-resource).


CTAP for Italian: Integrating Components for the Analysis of Italian into a Multilingual Linguistic Complexity Analysis Tool

Nadezda Okinina, Jennifer-Carmen Frey and Zarah Weiss

As linguistic complexity research is a very actively developing field, an increasing number of text analysis tools are being created that use natural language processing techniques for the automatic extraction of quantifiable measures of linguistic complexity. While most tools are designed to analyse only one language, the CTAP open source linguistic complexity measurement tool is capable of processing multiple languages, making cross-lingual comparisons possible. Although it was originally developed for English, the architecture has been extended to support multi-lingual analyses. Here we present the Italian component of CTAP, describe its implementation and compare it to the existing linguistic complexity tools for Italian. Offering general text length statistics and features for lexical, syntactic, and morpho-syntactic complexity (including measures of lexical frequency, lexical diversity, lexical and syntactical variation, part-of-speech density), CTAP is currently the most comprehensive linguistic complexity measurement tool for Italian and the only one allowing the comparison of Italian texts to multiple other languages within one tool.


Do you Feel Certain about your Annotation? A Web-based Semantic Frame Annotation Tool Considering Annotators’ Concerns and Behaviors

Regina Stodden, Behrang QasemiZadeh and Laura Kallmeyer

In this system demonstration paper, we present an open-source web-based application with a responsive design for modular semantic frame annotation (SFA). Besides letting experienced and inexperienced users do suggestion-based and slightly-controlled annotations, the system keeps track of the time and changes during the annotation process and stores the users’ confidence in the current annotation. This collected metadata can be used to get insights regarding the difficulty of an annotation with the same type or frame or can be used as an input of an annotation cost measurement for an active learning algorithm. The tool was already used to build a manually annotated corpus with semantic frames and their arguments for task 2 of SemEval 2019 regarding unsupervised lexical frame induction (QasemiZadeh et al., 2019). Although English sentences from the Wall Street Journal corpus of the Penn Treebank were annotated for this task, it is also possible to use the proposed tool for the annotation of sentences in other languages.


Seq2SeqPy: A Lightweight and Customizable Toolkit for Neural Sequence-to-Sequence Modeling

Raheel Qader, François Portet and Cyril Labbe

We present Seq2SeqPy, a lightweight toolkit for sequence-to-sequence modeling that prioritizes simplicity and the ability to customize the standard architectures easily. The toolkit supports several known architectures such as Recurrent Neural Networks, Pointer Generator Networks, and the Transformer model. We evaluate the toolkit on two datasets and show that it performs similarly to or even better than a very widely used sequence-to-sequence toolkit.


Profiling-UD: a Tool for Linguistic Profiling of Texts

Dominique Brunato, Andrea Cimino, Felice Dell’Orletta, Giulia Venturi and Simonetta Montemagni

In this paper, we introduce Profiling-UD, a new text analysis tool inspired by the principles of linguistic profiling that can support language variation research from different perspectives. It allows the extraction of more than 130 features, spanning different levels of linguistic description. Beyond the large number of features that can be monitored, a main novelty of Profiling-UD is that it has been specifically devised to be multilingual since it is based on the Universal Dependencies framework. In the second part of the paper, we demonstrate the effectiveness of these features in a number of theoretical and applicative studies in which they were successfully used for text and author profiling.


EstNLTK 1.6: Remastered Estonian NLP Pipeline

Sven Laur, Siim Orasmaa, Dage Särg and Paul Tammo

The goal of the EstNLTK Python library is to provide a unified programming interface for natural language processing in Estonian. As such, previous versions of the library have been immensely successful both in academic and industrial circles. However, they also contained serious structural limitations: it was hard to add new components and there was a lack of fine-grained control needed for back-end programming. These issues have been explicitly addressed in the EstNLTK library while preserving the intuitive interface for novices. We have remastered the basic NLP pipeline by adding many data cleaning steps that are necessary for analyzing real-life texts, and state-of-the-art components for morphological analysis and fact extraction. Our evaluation on unlabelled data shows that the remastered basic NLP pipeline outperforms both the previous version of the toolkit and the neural models of StanfordNLP. In addition, EstNLTK contains a new interface for storing, processing and querying text objects in a Postgres database, which greatly simplifies processing of large text collections. EstNLTK is freely available under the GNU GPL version 2 license, which is standard for academic software.


A Tree Extension for CoNLL-RDF

Christian Chiarcos and Luis Glaser

The technological bridges between knowledge graphs and natural language processing are of utmost importance for the future development of language technology. CoNLL-RDF is a technology that provides such a bridge for popular one-word-per-line formats as widely used in NLP (e.g., the CoNLL Shared Tasks), annotation (Universal Dependencies, Unimorph), corpus linguistics (Corpus WorkBench, CWB) and digital lexicography (SketchEngine): every empty-line separated table (usually a sentence) is parsed into a graph, can be freely manipulated and enriched using W3C-standardized RDF technology, and then be serialized back into a TSV format, RDF or other formats. An important limitation is that CoNLL-RDF provides native support for word-level annotations only. This does include dependency syntax and semantic role annotations, but neither phrase structures nor text structure. We describe the extension of the CoNLL-RDF technology stack for two vocabulary extensions of CoNLL-TSV, the PTB bracket notation used in earlier CoNLL Shared Tasks and the extension with XML markup elements featured by CWB and SketchEngine. In order to represent the necessary extensions of the CoNLL vocabulary in an adequate fashion, we employ the POWLA vocabulary for representing and navigating in tree structures.
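
The idea of turning each empty-line separated table into a graph can be illustrated with a small Python sketch. This is not the CoNLL-RDF implementation or its API: the function names, the column labels (`ID`, `WORD`, `POS`) and the identifier scheme `s<sentence>_<row>` are hypothetical, and plain tuples stand in for RDF triples.

```python
from typing import Dict, Iterator, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), a stand-in for RDF


def read_sentences(conll_text: str, columns: List[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one list of rows per empty-line separated table (usually a sentence)."""
    rows: List[Dict[str, str]] = []
    for line in conll_text.splitlines():
        line = line.strip()
        if line.startswith("#"):  # skip comment lines
            continue
        if not line:  # an empty line closes the current table
            if rows:
                yield rows
                rows = []
            continue
        rows.append(dict(zip(columns, line.split("\t"))))
    if rows:  # last table may lack a trailing empty line
        yield rows


def to_triples(sentence: List[Dict[str, str]], sent_id: int) -> List[Triple]:
    """Turn every non-empty cell into a (word, column, value) triple."""
    triples: List[Triple] = []
    for row in sentence:
        word = f"s{sent_id}_{row['ID']}"  # hypothetical identifier scheme
        for column, value in row.items():
            if column != "ID" and value != "_":
                triples.append((word, column, value))
    return triples


if __name__ == "__main__":
    sample = "1\tDogs\tNOUN\n2\tbark\tVERB\n\n1\tYes\tINTJ\n"
    for i, sentence in enumerate(read_sentences(sample, ["ID", "WORD", "POS"]), start=1):
        print(to_triples(sentence, i))
```

Because every row maps to exactly one word node, a flat representation like this covers word-level annotations only, which is the limitation the abstract describes for phrase and text structure.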


Lemmatising Verbs in Middle English Corpora: The Benefit of Enriching the Penn-Helsinki Parsed Corpus of Middle English 2 (PPCME2), the Parsed Corpus of Middle English Poetry (PCMEP), and A Parsed Linguistic Atlas of Early Middle English (PLAEME)

Carola Trips and Michael Percillier

This paper describes the lemmatisation of three annotated corpora of Middle English (the Penn-Helsinki Parsed Corpus of Middle English 2 (PPCME2), the Parsed Corpus of Middle English Poetry (PCMEP), and A Parsed Linguistic Atlas of Early Middle English (PLAEME)), which is a prerequisite for systematically investigating the argument structures of verbs of the given time. Creating this tool and enriching existing parsed corpora of Middle English is part of the project Borrowing of Argument Structure in Contact Situations (BASICS), which seeks to explain to what extent verbs copied from Old French had an impact on the grammar of Middle English. First, we lemmatised the PPCME2 by (1) creating an inventory of form-lemma correspondences linking forms in the PPCME2 to lemmas in the MED, and (2) inserting this lemma information into the corpus (precision: 94.85%, recall: 98.92%). Second, we enriched the PCMEP and PLAEME, which adopted the annotation format of the PPCME2, with verb lemmas to undertake studies that fill the well-known data gap in the subperiod (1250–1350) of the PPCME2. The case study of reflexives shows that with our method we gain much more reliable results in terms of diachrony, diatopy and contact-induced change.


CoCo: A Tool for Automatically Assessing Conceptual Complexity of Texts

Sanja Stajner, Sergiu Nisioi and Ioana Hulpuș

Traditional text complexity assessment usually takes into account only syntactic and lexical text complexity. The task of automatic assessment of conceptual text complexity, important for maintaining the reader’s interest and for text adaptation for struggling readers, has only been proposed recently. In this paper, we present CoCo, a tool for automatic assessment of conceptual text complexity, based on the current state-of-the-art unsupervised approach. We make the code and API freely available for research purposes, and describe the code and the possibility for its personalization and adaptation in detail. We compare the current implementation with the state of the art, discussing the influence of the choice of entity linker on the performance of the tool. Finally, we present results obtained on two widely used text simplification corpora, discussing the full potential of the tool.


PyVallex: A Processing System for Valency Lexicon Data

Jonathan Verner and Anna Vernerová

PyVallex is a Python-based system for presenting, searching/filtering, editing/extending and automatic processing of machine-readable lexicon data originally available in a text-based format. The system consists of several components: a parser for the specific lexicon format used in several valency lexicons, a data-validation framework, a regular-expression-based search engine, a map-reduce style framework for querying the lexicon data and a web-based interface integrating complex search and some basic editing capabilities. PyVallex provides most of the typical functionalities of a Dictionary Writing System (DWS), such as multiple presentation modes for the underlying lexical database, automatic evaluation of consistency tests, and a mechanism of merging updates coming from multiple sources. The editing functionality is currently limited to the client-side interface and edits of existing lexical entries, but additional script-based operations on the database are also possible. The code is published under the open source MIT license and is also available in the form of a Python module for integrating into other software.


Editing OntoLex-Lemon in VocBench 3

Manuel Fiorelli, Armando Stellato, Tiziano Lorenzetti, Andrea Turbati, Peter Schmitz, Enrico Francesconi, Najeh Hajlaoui and Brahim Batouche

OntoLex-Lemon is a collection of RDF vocabularies for specifying the verbalization of ontologies in natural language. Beyond its original scope, OntoLex-Lemon, as well as its predecessor Monnet lemon, found application in the Linguistic Linked Open Data cloud to represent and interlink language resources on the Semantic Web. Unfortunately, generic ontology and RDF editors were considered inconvenient to use with OntoLex-Lemon because of its complex design patterns and other peculiarities, including indirection, reification and subtle integrity constraints. This perception led to the development of dedicated editors, trading the flexibility of RDF in combining different models (and the features already available in existing RDF editors) for a more direct and streamlined editing of OntoLex-Lemon patterns. In this paper, we investigate the benefits gained by extending an already existing RDF editor, VocBench 3, with capabilities closely tailored to OntoLex-Lemon, and the challenges that such an extension implies. The outcome of this investigation is twofold: a vertical assessment of a new editor for OntoLex-Lemon and, in the broader scope of RDF editor design, a new perspective on which flexibility and extensibility characteristics an editor should meet in order to cover new core modeling vocabularies, for which OntoLex-Lemon represents a use case.


MALT-IT2: A New Resource to Measure Text Difficulty in Light of CEFR Levels for Italian L2 Learning

Luciana Forti, Giuliana Grego Bolli, Filippo Santarelli, Valentino Santucci and Stefania Spina

This paper presents a new resource for automatically assessing text difficulty in the context of Italian as a second or foreign language learning and teaching. It is called MALT-IT2, and it automatically classifies inputted texts according to the CEFR level they are most likely to belong to. After an introduction to the field of automatic text difficulty assessment, and an overview of previous related work, we describe the rationale of the project, and the corpus and computational system it is based on. Experiments were conducted in order to investigate the reliability of the system. The results show that the system is able to obtain a good prediction accuracy, while a further analysis was conducted in order to identify the categories of features that most influenced the predictions.


Fintan – Flexible, Integrated Transformation and Annotation eNgineering

Christian Fäth, Christian Chiarcos, Björn Ebbrecht and Maxim Ionov

We introduce the Flexible and Integrated Transformation and Annotation eNgineering (Fintan) platform for converting heterogeneous linguistic resources to RDF. With its modular architecture, workflow management and visualization features, Fintan facilitates the development of complex transformation pipelines by integrating generic RDF converters and augmenting them with extended graph processing capabilities: existing converters can be easily deployed to the system by means of an ontological data structure which renders their properties and the dependencies between transformation steps. Development of subsequent graph transformation steps for resource transformation, annotation engineering or entity linking is further facilitated by a novel visual rendering of SPARQL queries. A graphical workflow manager makes it easy to manage the converter modules and combine them into new transformation pipelines. Employing the stream-based graph processing approach first implemented with CoNLL-RDF, we address common challenges and scalability issues when transforming resources and showcase the performance of Fintan by means of a purely graph-based transformation of the Universal Morphology data to RDF.


Contemplata, a Free Platform for Constituency Treebank Annotation

Jakub Waszczuk, Ilaine Wang, Jean-Yves Antoine and Anaïs Halftermeyer

This paper describes Contemplata, an annotation platform that offers a generic solution for treebank building as well as treebank enrichment with relations between syntactic nodes. Contemplata is dedicated to the annotation of constituency trees. The framework includes support for syntactic parsers, which provide automatic annotations to be manually revised. The balanced strategy of annotation between automatic parsing and manual revision reduces the annotator workload, which favours data reliability. The paper presents the software architecture of Contemplata, describes its practical use and finally gives two examples of annotation projects that were conducted on the platform.


Interchange Formats for Visualization: LIF and MMIF

Kyeongmin Rim, Kelley Lynch, Marc Verhagen, Nancy Ide and James Pustejovsky

Promoting interoperable computational linguistics (CL) and natural language processing (NLP) application platforms and interchangeable data formats has contributed to improving the discoverability and accessibility of openly available NLP software. In this paper, we discuss the enhanced data visualization capabilities that are also enabled by inter-operating NLP pipelines and interchange formats. For adding openly available visualization tools and graphical annotation tools to the Language Applications Grid (LAPPS Grid) and Computational Linguistics Applications for Multimedia Services (CLAMS) toolboxes, we have developed interchange formats that can carry annotations and metadata for text and audiovisual source data. We describe those data formats and present case studies where we successfully adopt open-source visualization tools and combine them with CL tools.


Developing NLP Tools with a New Corpus of Learner Spanish

Sam Davidson, Aaron Yamada, Paloma Fernandez Mira, Agustina Carando, Claudia H. Sanchez Gutierrez and Kenji Sagae

The development of effective NLP tools for the L2 classroom depends largely on the availability of large annotated corpora of language learner text. While annotated learner corpora of English are widely available, large learner corpora of Spanish are less common. Those Spanish corpora that are available do not contain the annotations needed to facilitate the development of tools beneficial to language learners, such as grammatical error correction. As a result, the field has seen little research in NLP tools designed to benefit Spanish language learners and teachers. We introduce COWS-L2H, a freely available corpus of Spanish learner data which includes error annotations and parallel corrected text to help researchers better understand L2 development, to examine teaching practices empirically, and to develop NLP tools to better serve the Spanish teaching community. We demonstrate the utility of this corpus by developing a neural-network-based grammatical error correction system for Spanish learner writing.


DeepNLPF: A Framework for Integrating Third Party NLP Tools

Francisco Rodrigues, Rinaldo Lima, William Domingues, Robson Fidalgo, Adrian Chifu, Bernard Espinasse and Sébastien Fournier

Natural Language Processing (NLP) of textual data is usually broken down into a sequence of several subtasks, where the output of one of the subtasks becomes the input to the following one, which constitutes an NLP pipeline. Many third-party NLP tools are currently available, each performing distinct NLP subtasks. However, it is difficult to integrate several NLP toolkits into a pipeline due to many problems, including different input/output representations or formats, distinct programming languages, and tokenization issues. This paper presents DeepNLPF, a framework that enables easy integration of third-party NLP tools, allowing the user to preprocess natural language texts at lexical, syntactic, and semantic levels. The proposed framework also provides an API for complete pipeline customization including the definition of input/output formats, integration plugin management, transparent multiprocessing execution strategies, corpus-level statistics, and database persistence. Furthermore, the DeepNLPF user-friendly GUI allows its use even by a non-expert NLP user. We conducted a runtime performance analysis showing that DeepNLPF not only easily integrates existing NLP toolkits but also significantly reduces processing time compared to executing the same NLP pipeline in a sequential manner.
