- Language Resources
- Legal Issues
- ELRA/ELDA Projects
- Evaluation Campaigns
LRs in the ELRA Catalogue this month
One new written corpus is now available in the ELRA Catalogue.
ELRA-W0318 Danish Gigaword Corpus
This corpus consists of over a billion words of Danish text collected from various websites. Domains are distributed as follows: Legal (308.8 million words), Social Media (261.4 million words), Subtitles (130.1 million words), Debates (108.4 million words), Conversations (0.7 million words), Web (101.02 million words), Encyclopedia (55.6 million words), Literature (31.3 million words), Manuals (2.6 million words), Books (2.1 million words), Religion (0.6 million words), News (40 million words), Other (1.2 million words).
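As a quick sanity check on the "over a billion words" figure, the per-domain counts listed above can be summed. The short Python sketch below does this. The names used (`DOMAIN_MILLIONS`, `total_millions`) are illustrative only and are not part of the corpus distribution; the figures are copied from the description above, with 600k words written as 0.6 million.

```python
from decimal import Decimal

# Per-domain sizes in millions of words, copied from the catalogue description.
# The dictionary name and layout are illustrative, not part of the corpus itself.
DOMAIN_MILLIONS = {
    "Legal": Decimal("308.8"),
    "Social Media": Decimal("261.4"),
    "Subtitles": Decimal("130.1"),
    "Debates": Decimal("108.4"),
    "Conversations": Decimal("0.7"),
    "Web": Decimal("101.02"),
    "Encyclopedia": Decimal("55.6"),
    "Literature": Decimal("31.3"),
    "Manuals": Decimal("2.6"),
    "Books": Decimal("2.1"),
    "Religion": Decimal("0.6"),  # given as 600k words in the description
    "News": Decimal("40"),
    "Other": Decimal("1.2"),
}


def total_millions(domains):
    """Sum the per-domain sizes exactly, using Decimal to avoid float rounding."""
    return sum(domains.values(), Decimal("0"))


total = total_millions(DOMAIN_MILLIONS)
largest = max(DOMAIN_MILLIONS, key=DOMAIN_MILLIONS.get)

print(f"Total: {total} million words")
print(f"Largest domain: {largest}")
```

With the figures as listed, the domains add up to 1,043.82 million words, which is consistent with the stated size of over a billion words, and Legal is the largest domain.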
The International Standard Language Resource Number (ISLRN) provides Language Resources (LRs) with unique identifiers using a standardised nomenclature. This aims to ensure that LRs are correctly identified and, consequently, properly referenced when they are used in R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers.
Figures of the month
- 22 new ISLRN numbers assigned in January 2022
- 13 new ISLRN providers in January 2022
- A total of 3199 ISLRN numbers assigned since January 2014
- A total of 249 distinct languages
The latest LRs for which an ISLRN number was requested and accepted in January are as follows:
Special focus on the ELRA Catalogue
FID Linguistik & ELRA team up to grant licences for written corpora to German researchers
In cooperation with ELRA, FID Linguistik has developed a licensing model based on a pay-per-use principle to grant licences for individual corpora from the ELRA catalogue to German researchers. The objective is to support German linguists’ research activities by giving them access to commercialized language resources.
With the financial support of the German Research Foundation (DFG), the service is free of charge and intended exclusively for researchers affiliated to an academic institution in Germany.
Read the full Press Release.
ELRA Legal Issues publications
ELRA also contributes papers and use cases in the field of Language Technologies. Some of them are available in the Legal Issues Papers section of the ELRA website.
At the end of 2021, the ELRC legal helpdesk produced a use case analysing the legal conditions that apply when audio, video and dialogue subtitles from emergency calls embedded in a German TV show are re-used to develop AI models.
Legal and Ethical Issues Workshop @ LREC2022 - Call for Papers
ELRA is co-organising the Legal and Ethical Issues Workshop that will take place at LREC 2022 on 24 June. The Call for Papers was issued and circulated in early January 2022. This year, to better respond to the needs of the international language resources community, a single workshop will tackle Legal and Ethical Issues in Language Resources, with a specific focus on building bridges between legal issues and technology.
Developments on the Data Governance Act
On November 25, 2020, the European Commission adopted its proposal for a new Data Governance Act. Its main goals are the following:
- Making public sector data available for re-use, in situations where such data is subject to rights of others.
- Sharing of data among businesses, against remuneration in any form.
- Allowing personal data to be used with the help of a "personal data-sharing intermediary", designed to help individuals exercise their rights under the General Data Protection Regulation (GDPR).
- Allowing data use on altruistic grounds.
Legislative developments on the AI Act
On January 25, 2022, the two leading committees of the European Parliament, the Committee on the Internal Market and Consumer Protection (IMCO) and the Committee on Civil Liberties, Justice and Home Affairs (LIBE), held their first joint meeting to exchange views on the proposed Artificial Intelligence Act.
The statements of the representatives of each committee ahead of this meeting are available online.
WhatsApp to clarify how they process personal data
Further to a complaint filed by the European Consumer Organisation (BEUC) in July 2021, the European Commission sent a letter to WhatsApp in January 2022 asking the company to provide more information on its compliance with EU regulations on consumer protection and privacy.
For more information read here.
UNESCO Recommendations on Ethical AI
On November 23, 2021, UNESCO adopted a recommendation on ethical Artificial Intelligence. This recommendation aims to help AI bring its advantages to society while mitigating the risks. It calls for action on four pillars: protection of data, banning social credit and mass surveillance, monitoring of AI systems, and protection of the environment.
The full text of the recommendation is available here.
Lockdown on Google Analytics
On February 10, 2022, the CNIL, the French data protection authority, issued a formal notice to comply to a website using Google Analytics for audience measurement. Following the “Schrems II” case, in which the Court of Justice of the European Union invalidated the Privacy Shield for data transfers to the United States, all such transfers are being re-evaluated by European regulatory authorities.
In this case, the CNIL decided that the data transfers made when using Google Analytics did not provide adequate safeguards to protect the data from being accessed by foreign intelligence services, and it therefore ordered the website to comply within one month.
The full text of the notice is available here (in French).
Panel on Legal and Ethical stakes of Language Technologies
This panel was held at the Forum on Innovation, Technologies and Plurilingualism on February 9, 2022 (10:10 - 11:10).
The first topic discussed was the cost of the multilingualism needed to provide citizens with accurate information in their dealings with administrations.
The second topic was the ethical aspects of Artificial Intelligence, covering the following points:
- Quality of Artificial Intelligence
- Bias and Discrimination issues related to Artificial Intelligence
- Possibility of deception by machines impersonating humans
- Modification of work practices involving micro-work done by annotators
- Distribution of competence among participants and new actors on the market
- Sustainability of AI systems
- Business practices and corporate AI ethics
Information on the on-going projects
European Language Equality
The European Language Equality (ELE) project is developing a strategic research, innovation and implementation agenda as well as a roadmap to achieve full digital language equality in Europe by 2030. ELE brings together 52 partners representing all European countries, research and industry, together with major pan-European initiatives. The project’s approach builds on wide-ranging contributions from individuals to define its strategic agenda.
ELE has carried out a thorough identification task resulting in the listing and description of more than 6,000 datasets and tools for all European countries. ELDA has collaborated with LISN on the identification for the French language. These metadata descriptions have recently been imported into the European Language Grid catalogue (see below).
Reports on the definition of digital language equality and the state of the art in language technology and language-centric AI are already available here.
The consortium is currently working on 31 language reports that analyse the digital status of European languages and their risk of digital extinction. ELDA has revised five of them. These reports will help identify the needs and the remaining efforts required to achieve digital language equality by 2030.
European Language Grid
The European Language Grid (ELG) project has entered its final six months of project runtime (a 36-month project extended by a further six months for pandemic reasons) and is working towards the third and final release of its platform, due at the end of March 2022. The final report on the status of datasets, models, identified gaps, produced resources and their exploitation within ELG (D5.3) has just been submitted to the EC by ELDA (as WP leader). At present, the ELG catalogue hosts 8,873 datasets and 2,734 tools and services. This content includes the 4,127 metadata records for language resources and the 2,215 for tools that were identified under ELG’s sister project ELE. Both projects work in very close collaboration, and ELE feeds ELG with the findings from its desk research for strategy and catalogue population.
For procedural reasons ELG deliverables are not publicly available yet, but they will be in the coming months.
Multilingual Anonymisation for Public Administrations (MAPA)
The MAPA project ended in December 2021. A de-identification toolkit has been built focusing on the health and legal domains and covering the 24 CEF languages. The system makes use of Named Entity Recognition (NER) techniques for sensitive information detection. The NER mechanism has been trained and tested on data produced within the project. ELDA has led the data production work within the initiative and the following language resources have been built for all 24 languages:
- A 1-million-sentence monolingual raw corpus.
- An annotated corpus with EUR-LEX data: 2,000 sentences of parallel data for all 24 languages, annotated with named entities.
- A list of person names: 10,000 combined names (given name + surname), except for those languages requiring specific morphological changes and where given names and surnames are provided separately.
Furthermore, additional annotated data have been produced for further system training for English, French, Greek, Italian, Maltese and Spanish, comprising both clinical cases and legal text.
All datasets are available through the ELRC-SHARE repository and, shortly, through the ELG catalogue. The INCEpTION platform has been used for data annotation.
The MAPA toolkit can be easily deployed as its engines are fully dockerised. MAPA has been tested in use cases such as the Spanish Ministry of Justice and Complaints Watch by DG-Justice, which have shown interest in using it. The EC’s eTranslation platform has also tested the toolkit and provided very positive feedback.
ELDA is currently assembling all trained models and preparing the deployment of the de-identification system for public use. More details on how to access it will follow in the next Newsletter.
ELDA will be organising a Workshop on the topic of de-identification of sensitive language resources at LREC 2022.
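To illustrate the general idea of the NER-based de-identification described above, here is a minimal Python sketch. It is not the MAPA toolkit or its API: the `EntitySpan` class, the `mask_entities` function and the example sentence are hypothetical, and the entity spans are supplied by hand where a real system would obtain them from a trained NER model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """A detected sensitive entity: character offsets plus a label (hypothetical type)."""
    start: int
    end: int
    label: str


def mask_entities(text, spans):
    """Replace each span with a bracketed label placeholder.

    Spans are processed from right to left so that replacing one span
    does not shift the character offsets of the spans still to be handled.
    Assumes the spans do not overlap.
    """
    result = text
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        result = result[:span.start] + f"[{span.label}]" + result[span.end:]
    return result


# Invented example sentence; the offsets are computed from the text itself
# here, standing in for the output of an NER model.
text = "Maria Lopez was admitted in Madrid."
person_start = text.index("Maria Lopez")
city_start = text.index("Madrid")
spans = [
    EntitySpan(person_start, person_start + len("Maria Lopez"), "PERSON"),
    EntitySpan(city_start, city_start + len("Madrid"), "LOCATION"),
]

masked = mask_entities(text, spans)
print(masked)
```

In this sketch the detection step and the replacement step are kept separate, so the same masking function could be reused with spans from any detector. Replacing entities with label placeholders is only one possible strategy and is an assumption made here for illustration.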
IberLEF 2022 Tasks on Sentiment, Stance and Opinions
- ABSAPT: Aspect-Based Sentiment Analysis in Portuguese
- PoliticEs: Spanish Author Profiling for Political Ideology
- Rest-Mex 2022: Recommendation System, Sentiment Analysis and Covid Semaphore Prediction for Mexican Tourist Texts
IberLEF 2022 Tasks on Harmful Content
- DA-VINCIS@IberLEF2022: Detection of Aggressive and Violent INCIdents from Social Media in Spanish
- DETESTS (DETEction and classification of racial Stereotypes in Spanish)
- EXIST 2022: sEXism Identification in Social neTworks
IberLEF 2022 Tasks on Information Extraction and Paraphrase identification
- LivingNER: Named-Entity Recognition and entity linking for living being mentions
- PAR-MEX: Paraphrase Identification in Mexican Spanish
IberLEF 2022 Tasks on Question Answering and Machine Reading
- QuALES: Question Answering Learning from Examples in Spanish
- ReCoRES: Reading Comprehension and Reasoning Explanation for Spanish
BUCC 2022 Shared Task: Bilingual Term Alignment in Comparable Specialised Corpora
MedVidQA 2022: Shared Task on Medical Video Question Answering
ReproGen 2022: Shared Task on Reproducibility of Evaluations in NLG
News from ELRA
The LREC Conference is ELRA’s biennial flagship event organized since 1998 with the support of institutions and organizations involved in HLT.
LREC 2022, the 13th edition, will take place from June 20 to 25, 2022 at the Palais du Pharo, in Marseilles (France). Paper submission to the Main Conference is now closed, with 1300 papers submitted. The review is in progress, and authors will be notified in early April 2022. Submission to the workshops is now ongoing. Information on all workshops, including the submission information, and all tutorials is available at https://lrec2022.lrec-conf.org...
Language Resources and Evaluation Journal
Volume 55, issue 4 was published in December 2021 and contains 11 articles.
- LRE Journal Call for Papers: Special Issue on Translation Platforms
This call for papers was launched at the beginning of January 2022 and focuses on current developments of Translation Platforms, their interaction with the distinct translation players, their integration of language resources, and more.
- March 15, 2022: Paper submission
- May 31, 2022: Acceptance/rejection notification
- January 31, 2023: tentative publication date
For more information, please read this page.
ELRA Press Releases
On January 24, 2022, the PR announcing the opening of the OpenSLR European mirror at https://openslr.elda.org/ was circulated.
The full PR can be found here.
Job @ ELDA
Technical Engineer position open at ELDA
ELDA is currently seeking to fill an immediate vacancy for a Technical Engineer (m/f).
Under the supervision of the CEO, the Technical Engineer in Speech and Multimodal Technologies will be in charge of conducting the activities related to the production of language resources. (S)he will manage language resources production projects and co-ordinate R&D projects, while also being hands-on whenever required by the development team. Responsibilities include designing/specifying language resources, setting up production frameworks and platforms, carrying out quality control and assessment, and recruiting and managing project-dedicated human resources. (S)he will also contribute to the improvement or updating of the current language resources production workflows.
The position is based in Paris.
Please check the profile details here: http://www.elra.info/en/opportunities/
News from the community
Forum on Innovation, Technologies and plurilingualism
In the framework of the French Presidency of the Council of the European Union, the Ministry of Culture - General Delegation for the French language and languages of France (DGLFLF) - organized the online Forum "Innovation, Technologies and plurilingualism" from 7 to 9 February 2022.
The 3-day event brought together French and European policymakers, practitioners, and stakeholders in the fields of translation, language technology, digital technology, and artificial intelligence, with the aim of addressing the role technologies can play in supporting and fostering multilingualism in Europe.
The forum, held fully online, was a success, as shown by the key figures below:
- 900 registrations from all over Europe and beyond
- 400 participants at the Opening session
- 70 nationalities represented
- 95 speakers
- Interpretation in 4 languages (French, English, German, and French Sign Language)
- Simultaneous subtitling in 22 official languages of the European Union
The Programme of the event is available here: https://bit.ly/3sc2GgZ
The 6th ELRC Conference will be held on March 31, 2022 via Zoom.
The event will provide an overview of the latest achievements of the European Language Resource Coordination, as well as of the efforts devoted to Large Language Models, Low-resource NMT, and Multimodal Language Data across Europe.
The full programme is available at https://lr-coordination.eu/6thELRC.