ELDA is involved in a number of Language Resources production projects, including projects for the production of annotated data and of corpora for MT evaluation.
Building a multilingual conversational telephone speech corpus
Objective of the project: build a multilingual corpus based on phone call conversations.
Languages: Arabic, English, French, German, Italian, Korean, Mandarin and Cantonese Chinese, Portuguese, Russian, and some other languages.
Orthographic transcription of spoken language is always a tricky operation, but Cantonese proved especially complicated in this case because the transcription was made in simplified Chinese characters instead of the traditional characters normally used for Cantonese. As a result, the transcription process also involved a degree of translation: some Cantonese syllables cannot be written with simplified characters and had to be rendered in Standard Chinese.
Objective of the project: develop an automatic subtitling system primarily for deaf people, but also for hearing people with a limited command of French (for example students of French language or foreigners immersed in French culture).
For this annotation project, conducted for LISN (formerly LIMSI), ELDA handled the manual transcription of 20 hours of TV shows, broadcasts, documentaries, and series. Special attention was paid to strict compliance with the norms of modern French spelling. The system is based on artificial intelligence methods (recurrent neural networks and deep learning).
Objective of the project: produce a bilingual corpus to build an automatic translation system for some African vernacular languages, in partnership with the LIA (Laboratoire Informatique d’Avignon, France).
Language: Tamasheq, a language mostly spoken in Niger and Mali, is the first language addressed by the project.
The main difficulty lay in the language itself. Tamasheq is a spoken language in which a number of economic, scientific, and political terms do not exist, so native speakers must replace them with their equivalents in French, Arabic, or other African languages. The objective nevertheless remained demanding: respecting all the nuances of interpretation on the one hand, and enriching the vocabulary of the automatic translation system on the other. For this purpose, a couple of native speakers were put in charge of translating Niger’s broadcast news from Tamasheq into French.
POS-tagged Corpus in French and Spanish
Objective of the project: annotate 64k sentences in French and 64k sentences in Spanish with Part-Of-Speech (POS) tags, and 8k sentences with their corresponding phonetisation (grapheme-phoneme).
Languages: French and Spanish
The project was carried out in partnership with the Universities of Catalonia and of Barcelona, along with LISN (formerly LIMSI, France). Its final objective was to develop the speech capabilities (Speech Recognition, Speech Synthesis) of robots greeting tourists in Tokyo during the 2020 Olympic Games.
Audio Data Annotation for Speaker Identification
Objective of the project: annotate 300 hours of audio files from the French channel LCP programmes.
The project started in December 2019 and consisted in annotating audio data on the basis of video files so as to identify every speaker. Carried out in the framework of the European research program Chist-ERA, in collaboration with the LNE (Laboratoire national de métrologie et d'essais, France), the University of Le Mans (France), the Polytechnic University of Catalonia (Spain) and IDIAP (Switzerland), the project is on-going.
The annotated corpus is to be used by the LNE’s Artificial Intelligence Systems Evaluation team and its partners for research and evaluation of automatic Lifelong Learning systems for speaker recognition. The corpus will be made available in the ELRA Catalogue.
Building a Speaker Identification Corpus for French
Objective of the project: annotate data so as to identify the speakers while segmenting the speech portions.
Two main tasks were carried out by the ELDA team: identifying and collecting audio streams selected to cover a wide variety of topics (at least 50 different contexts) for 400 different speakers, then annotating the data so as to identify the speakers (name, gender) while segmenting the speech portions.
The project consists of finding, collecting and annotating the audio recordings of at least 400 speakers, mainly male speakers from politics and media. For each speaker, at least 50 audio recordings from different contexts (different programmes and dates, etc.) must be collected, and each recording must be at least 30 seconds long. Speakers must speak French; accents are allowed. The recordings are taken from television or radio programmes.
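The collection requirements above can be expressed as a simple check over recording metadata. The following Python sketch is illustrative only: the `Recording` record, its field names and the helper functions are hypothetical, and it assumes that "50 recordings from different contexts" means 50 distinct contexts per speaker.

```python
from collections import defaultdict
from dataclasses import dataclass

# Thresholds taken from the project description.
MIN_SPEAKERS = 400
MIN_RECORDINGS_PER_SPEAKER = 50
MIN_DURATION_SECONDS = 30.0


@dataclass(frozen=True)
class Recording:
    """Hypothetical metadata record for one collected audio file."""
    speaker: str
    context: str  # assumed to encode programme and date
    duration_seconds: float


def speakers_meeting_requirements(
    recordings,
    min_recordings=MIN_RECORDINGS_PER_SPEAKER,
    min_duration=MIN_DURATION_SECONDS,
):
    """Return the speakers with enough long-enough recordings in distinct contexts."""
    contexts_by_speaker = defaultdict(set)
    for rec in recordings:
        # Recordings shorter than the minimum duration do not count.
        if rec.duration_seconds >= min_duration:
            contexts_by_speaker[rec.speaker].add(rec.context)
    return {
        speaker
        for speaker, contexts in contexts_by_speaker.items()
        if len(contexts) >= min_recordings
    }


def corpus_is_complete(recordings, min_speakers=MIN_SPEAKERS, **thresholds):
    """True when enough speakers satisfy the per-speaker requirements."""
    qualified = speakers_meeting_requirements(recordings, **thresholds)
    return len(qualified) >= min_speakers
```

Such a check could be run during collection to list the speakers who still lack recordings.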
This project is part of a research program, in collaboration with the national police.
The annotated corpus will be used for research and evaluation of automatic voice comparison systems.
Improve a conversational system
Objective of the project: get natural questions in English to improve the customer’s conversation system.
The objective is achieved through rephrasing the questions of the American corpus CoQA (Conversational Question Answering), produced by the Stanford NLP Group. This corpus contains 127k+ questions with answers collected from 8k+ conversations. A conversation is a paragraph on a given topic followed by several questions pertaining to the topic. On average, each conversation is made of 15 Questions/Answers. As they stand, the questions of CoQA are all in-context questions: they are conversation-dependent and may therefore depend on the previous questions.
For the project, however, each question needs to be rephrased in 3 different ways, as well as in an “out-of-context” way, for example by replacing pronouns, anaphors and coreferences. The expected outcome is the processing of questions from 4k conversations.
CoQA’s conversations cover 7 different domains:
• children’s stories from MCTest
• literature from Project Gutenberg
• middle and high school English exams from RACE
• news articles from CNN
• articles from Wikipedia
• Reddit articles from the Writing Prompts dataset
• science articles from AI2 Science Questions.
Production of corpora for MT system
Objective of the project: build parallel data (English-Arabic) in 6 specific topics (Terrorism, Security, Migration, Health Policy, Election, Foreign Policy) in order to train, improve and evaluate a Machine Translation system.
Languages: English, Arabic
Using existing tools, the ELDA team crawled parallel web content and then aligned English-Arabic parallel sentences in the given topics. Two types of data normalization were carried out: one rule-based, and one using an AI model trained on data.
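To give an idea of what a rule-based normalization step can look like, here is a minimal Python sketch. The actual rules used in the project are not described here, so every rule below (Unicode NFKC normalization, removal of the Arabic tatweel and of short-vowel diacritics, whitespace collapsing) is an assumption chosen for illustration, and the `normalize` function is hypothetical.

```python
import re
import unicodedata

# Assumed rules, for illustration only.
TATWEEL = "\u0640"  # Arabic elongation character
ARABIC_DIACRITICS_RE = re.compile("[\u064B-\u0652]")  # fathatan .. sukun
WHITESPACE_RE = re.compile(r"\s+")


def normalize(text):
    """Apply a few deterministic normalization rules to one sentence."""
    # Unify compatibility characters (ligatures, full-width forms, etc.).
    text = unicodedata.normalize("NFKC", text)
    # Drop elongation marks and short-vowel diacritics in Arabic text.
    text = text.replace(TATWEEL, "")
    text = ARABIC_DIACRITICS_RE.sub("", text)
    # Collapse runs of whitespace and trim both ends.
    return WHITESPACE_RE.sub(" ", text).strip()
```

Applying the same function to both sides of a sentence pair keeps the English and Arabic data consistently formatted before training.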
For the named entities, English and Arabic models were used in order to identify sentences containing at least one named entity. The proposed pre-annotation was then validated or corrected.
In the end, the following corpora were delivered:
1. Training corpus containing ~ 360K parallel sentences (~ 17 million tokens in English)
2. Test corpus of 6.2k parallel sentences
3. Named entities corpus with ~ 1500 named entities.
Objective of the project: collect and annotate Tweets in 3 languages (Arabizi, French and English) for 3 predefined themes (Hooliganism, Racism, Terrorism). The project was completed in 2020.
Languages: English, French, Arabizi
Tools were developed by the ELDA team for the collection of tweets and their annotation.
For the collection, a tool was developed in Python (based on the “GetOldTweets3” library) which took as input information such as the language (EN/FR) and a keyword list. With this tool, a maximum of 10k tweets per (keyword, language) pair were collected for English and French. For Arabizi, a specific process was set up: a vocabulary was created from a corpus of Arabizi SMS (Moroccan and Tunisian) by selecting the 1000 most frequent words; then, for each word in the vocabulary and in the keyword list, the tweets containing that word were downloaded (places = Morocco, Tunisia, Algeria). Only Arabizi tweets containing at least 5 Arabizi words were kept.
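The Arabizi vocabulary and filtering steps can be sketched in Python as follows. This is not the project's tool: the tokenizer, the function names and the choice to count every matching token (rather than distinct words) are assumptions made for illustration, and the download step is left out.

```python
import re
from collections import Counter

# Assumed tokenizer: Arabizi mixes Latin letters and digits (e.g. "3lik"),
# so digits are kept inside tokens.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into letter/digit tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(sms_corpus, size=1000):
    """Return the `size` most frequent tokens of an Arabizi SMS corpus."""
    counts = Counter(token for message in sms_corpus for token in tokenize(message))
    return {word for word, _ in counts.most_common(size)}


def is_arabizi_tweet(tweet, vocabulary, min_words=5):
    """Keep a tweet only if it contains at least `min_words` vocabulary tokens."""
    hits = sum(1 for token in tokenize(tweet) if token in vocabulary)
    return hits >= min_words
```

In this sketch, `build_vocabulary` would be run once on the SMS corpus, and `is_arabizi_tweet` applied to every downloaded tweet.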
For the annotation, a tool running on Django was developed in order to provide the following annotations for each tweet in a given sequence:
• Theme: with 5 possible annotations (Hooliganism, Racism, Terrorism, Others, Incomprehensible)
• Topic: the annotator can add a new topic if it does not exist in the proposed list
• Opinion: 3 possible annotations (Negative, Neutral, Positive)
In total, 17103 sequences (~ 585 K tweets) were annotated, including the themes “Others” and “Incomprehensible”. Among these, 4578 sequences (~ 127 K tweets) had at least 20 tweets annotated with the 3 predefined themes (Hooliganism, Racism, Terrorism); 1866 sequences with an opinion change and 8733 hateful tweets were also obtained.
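The annotation scheme and the sequence selection can be illustrated with a short Python sketch. The class and function names are hypothetical and do not come from the Django tool. The definition of an "opinion change" is also an assumption: here, a sequence changes opinion when two consecutive tweets carry different opinion labels.

```python
from dataclasses import dataclass
from enum import Enum


class Theme(Enum):
    HOOLIGANISM = "Hooliganism"
    RACISM = "Racism"
    TERRORISM = "Terrorism"
    OTHERS = "Others"
    INCOMPREHENSIBLE = "Incomprehensible"


class Opinion(Enum):
    NEGATIVE = "Negative"
    NEUTRAL = "Neutral"
    POSITIVE = "Positive"


PREDEFINED_THEMES = frozenset({Theme.HOOLIGANISM, Theme.RACISM, Theme.TERRORISM})


@dataclass(frozen=True)
class TweetAnnotation:
    """Hypothetical record holding the three annotations of one tweet."""
    theme: Theme
    topic: str  # free text, since annotators can add new topics
    opinion: Opinion


def has_enough_on_theme_tweets(sequence, minimum=20):
    """True if at least `minimum` tweets carry one of the predefined themes."""
    on_theme = sum(1 for ann in sequence if ann.theme in PREDEFINED_THEMES)
    return on_theme >= minimum


def has_opinion_change(sequence):
    """Assumed definition: two consecutive tweets have different opinions."""
    opinions = [ann.opinion for ann in sequence]
    return any(prev != cur for prev, cur in zip(opinions, opinions[1:]))
```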