ELRA Newsletter
Issue #10 | January 2025
Language Resources
LRs @ELRA
Language Resources in the ELRA Catalogue
We are happy to announce that, since May 2024, 29 new speech resources have been released and are now available in our catalogue. Moreover, 48 speech corpora are now available at reduced fees.
1. New Language Resources
Speech Resources
a. ALLIES Corpus
ISLRN: 697-328-151-668-9
A comprehensive full-form lexicon of Egyptian Arabic general vocabulary (DiaLEX-EA) including 78 million entries for 31,000 lemmas with all inflected forms, enclitics, proclitics, case endings, declensions, and conjugated forms. Each entry is accompanied by a full and accurate diacriticization (vocalization) as well as an extensive coverage of variants. The lexicon is ideally suited to support natural language processing applications for Egyptian Arabic, especially morphological analysis and speech technology.
Quantity and size: 75,204,644 lines / 11,217 MB (11.0 GB)
DiaLEX – Emirati (DiaLEX-UA)
ISLRN: 836-793-503-213-8
A comprehensive full-form lexicon of Emirati Arabic general vocabulary (DiaLEX-UA) including 28 million entries for 29,000 lemmas with all inflected forms, enclitics, proclitics, case endings, declensions, and conjugated forms. Each entry is accompanied by a full and accurate diacriticization (vocalization) as well as an extensive coverage of variants. The lexicon is ideally suited to support natural language processing applications for Emirati Arabic, especially morphological analysis and speech technology.
Quantity and size: 24,976,871 lines / 3,841 MB (3.8 GB)
DiaLEX – Saudi Arabian Hijazi (DiaLEX-HA)
ISLRN: 849-157-479-216-3
A comprehensive full-form lexicon of Hijazi Arabic general vocabulary (DiaLEX-HA) including 21 million entries for 30,000 lemmas with all inflected forms, enclitics, proclitics, case endings, declensions, and conjugated forms. Each entry is accompanied by a full and accurate diacriticization (vocalization) as well as an extensive coverage of variants. The lexicon is ideally suited to support natural language processing applications for Hijazi Arabic, especially morphological analysis and speech technology.
Quantity and size: 20,247,655 lines / 2,835 MB (2.8 GB)
Speech Resources
ÌròyìnSpeech
ISLRN: 012-405-700-001-6
A modern, high-fidelity, multi-speaker, Yorùbá read speech corpus suitable for Speech Synthesis, Automatic Speech Recognition and Computational Linguistics research. The subject matter is drawn from the Broadcast News domain as well as fictional texts, delivering a multi-purpose, contemporary speech dataset. The corpus consists of 34,000 read sentences, i.e. 42 hours of audio recorded at 48 kHz in 16-bit linear PCM WAV format, for a total of ca. 12.5 gigabytes.
Slovak Autistic and Non-Autistic Child Speech Corpus (SANACS)
ISLRN: 016-848-885-785-1
The SANACS Corpus contains 67 recorded sessions of interactions between two native Slovak speakers. In 37 sessions an autistic child interacts with a neurotypical adult experimenter, and in 30 control sessions a neurotypical child interacts with the same neurotypical adult experimenter. The children were 6-12 years old (mean 9.2). In all sessions, the two participants are involved in a collaborative, task-oriented communication based on the Maps Task. Most tasks consist of six trials: one practice and two real trials in which the experimenter is the describer and the child the follower, and then one practice and two real trials in which the roles are switched, so that the child is the describer and the experimenter the follower.
2. Reduced fees for the following written corpus
Wojood – A corpus for nested Arabic Named Entity Recognition
ISLRN:
Wojood consists of about 550,000 tokens (Modern Standard Arabic and dialect) that are manually annotated with 21 entity types (person, group of people, occupation, organization, geopolitical entity, location, facility, event, date, time, language, website, law, product, cardinal number, ordinal number, percent, quantity, unit, money, currency). It covers multiple domains (Media, History, Culture, Health, Finance, ICT, Law, Elections, Politics, Migration, Terrorism, social media) and was annotated with nested entities. The corpus contains about 75K entities, 22.5% of which are nested. The corpus was annotated using the IOB2 tagging scheme and is available in CSV format.
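For readers unfamiliar with the IOB2 scheme mentioned above, the following minimal Python sketch shows how a sequence of IOB2 tags can be decoded into entity spans. It is an illustration only: the tag abbreviations (ORG, DATE), the sample sentence and the function name are invented for this example and are not taken from the Wojood files, and the actual CSV column layout is not assumed here. Nested entities would presumably be stored as additional tag layers, each of which could be decoded in the same way.

```python
from typing import List, Optional, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def iob2_to_spans(tags: List[str]) -> List[Span]:
    """Decode one layer of IOB2 tags into entity spans.

    In IOB2, every entity starts with "B-TYPE", continues with "I-TYPE",
    and tokens outside any entity are tagged "O".
    """
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # The open entity continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # An "I-" tag with no open entity is invalid in IOB2 and is ignored.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example sentence and tags (not corpus data).
tokens = ["Example", "University", "opened", "in", "1990"]
tags = ["B-ORG", "I-ORG", "O", "O", "B-DATE"]
for label, start, end in iob2_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Each span is returned as token offsets rather than text, so the same function can be applied to any tag column regardless of how the tokens themselves are stored.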
ISLRN submissions
The International Standard Language Resource Number (ISLRN) provides Language Resources (LRs) with unique identifiers using a standardised nomenclature. This aims to ensure that LRs are correctly identified and, consequently, properly referenced when they are used in R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers.
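As an illustration, the ISLRNs quoted in this issue (for example 697-328-151-668-9) all follow the same layout: four groups of three digits and a final single digit, separated by hyphens. The short Python sketch below checks that layout only. The pattern is inferred from the identifiers listed here, the function name is hypothetical, and the check says nothing about whether a number has actually been assigned.

```python
import re

# Layout inferred from the identifiers in this issue: DDD-DDD-DDD-DDD-D.
# This is an assumption about the surface form, not an official validator.
ISLRN_LAYOUT = re.compile(r"\d{3}-\d{3}-\d{3}-\d{3}-\d")


def looks_like_islrn(value: str) -> bool:
    """Return True if the string has the 13-digit hyphenated ISLRN layout."""
    return ISLRN_LAYOUT.fullmatch(value.strip()) is not None


for candidate in ["697-328-151-668-9", "697-328-151-668", "6973281516689"]:
    print(candidate, looks_like_islrn(candidate))
```

Such a check can help catch copy-and-paste errors when ISLRNs are cited in papers or stored in metadata records.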
Latest figures
- 31 new ISLRN numbers assigned between November 2023 and May 2024.
- A total of 3534 ISLRN numbers assigned since January 2014.
- A total of 275 distinct languages.
The latest LRs for which an ISLRN number was requested and accepted are as follows:
- English-BengaliBraille Parallel Corpus
- English-GujaratiBraille Parallel Corpus
- English-HindiBraille Parallel Corpus
- English-MarathiBraille Parallel Corpus
- English-TamilBraille Parallel Corpus
- English-TeluguBraille Parallel Corpus
- LoReHLT Hausa Representative Language Pack
- Automatic Content Extraction for Portuguese
- Call My Net 1
More about ISLRN.
Legal Issues
The European Data Protection Board publishes guidelines on personal data in AI models
On December 17, 2024, the European Data Protection Board (EDPB) published Opinion 28/2024, addressing important data protection aspects related to the processing of personal data in the context of AI models. The Opinion was a chance for the EDPB to provide awaited guidance on areas of great interest to the AI industry. These include the use of legitimate interests as a legal basis for processing data in the development and deployment of AI systems. The EDPB provides clear indications that this legal basis can be used in the context of AI. It emphasizes that controllers must have a precise legitimate interest and only carry out processing activities that are in line with data subjects’ reasonable expectations. The EDPB recalls the already established three-step approach for assessing legitimate interests, in which the controller needs to (1) identify a legitimate interest pursued by the controller or a third party, (2) demonstrate that the processing is necessary to achieve that interest, and (3) ensure that the processing does not override the fundamental rights and freedoms of data subjects.
The EDPB emphasizes that the lawfulness of processing during the development phase may significantly impact the lawfulness of subsequent processing. In other words, legitimate interests can be relied upon, provided that the development of the AI system is itself lawful. This assessment should consider the risks raised in the deployment phase as well as the source of the personal data used in model development.
The EDPB also points out that the determination of anonymity in AI models requires a case-specific approach. To be classified as anonymous, an AI model must present a very low likelihood of personal data extraction, taking into account “all the means reasonably likely to be used” by the controller or any other party. This is a stringent criterion given the current state of the art in LLMs, and it suggests that LLMs will mostly fail to meet the threshold for anonymity.
The EDPB notes the necessity of evaluating anonymity on a case-by-case basis and recognizes that there is no one-size-fits-all approach to anonymity. This matter is expected to receive further clarification from the EDPB in the near future. The Board has already issued Guidelines 01/2025 on Pseudonymisation and is expected to publish additional Guidelines on anonymization later this year. These guidelines should align with and incorporate insights from anticipated Court of Justice (CJEU) rulings on the subject.
ELRA/ELDA Projects
Information on the on-going projects
Common European Language Data Space (LDS)
ELDA is currently working on the different tasks under its responsibility:
Setting up a technical and legal helpdesk to provide support and guidance to all platform users. The helpdesk service has been operational since late 2023 and can be found here.
Definition and establishment of a Multistakeholder Data and Services Governance Scheme.
For that purpose, a solid collaboration has been established between the LDS and the Data Spaces Support Centre (DSSC), and work towards a full alignment of governance issues is under way. For instance, ELDA participates in the DSSC expert groups and in their work on the DSSC blueprint (version 1.5 at present). Likewise, LDS collaborates closely with the Simpl and TEMS projects, and regular meetings are held so that the projects evolve in a synchronised manner. Furthermore, LDS has contributed to Simpl-Open’s feasibility study on data spaces as one of its use cases. This contribution aimed at facilitating Simpl’s integration and interoperability in the data space ecosystem and, in particular, in the LDS development and deployment, also considering LDS specifications.
An updated version of this governance scheme (v2) was submitted in July 2024 and a new version is due at the end of February 2025. This new version will take into consideration the architectural development of the LDS infrastructure as the version 1.0.0-beta prototype is going to be released in February 2025.
Organization of events
One further Technology Workshop, the Workshop on the Linguistic and Cultural Evaluation of Large Language Models, organized in collaboration with the EC and the ALT-EDIC, was held on December 11, 2024. Discussions revolved around the linguistic and cultural implications of LLM development, which are strong concerns requiring deeper analysis. The workshop also offered an enlightening overview of the related work taking place across Europe, with numerous experts sharing their work and approaches.
- Country Workshop in Hungary, held in Budapest on October 1, 2024.
- Country Workshop in Ireland, held in Dublin on October 10, 2024.
- Country Workshop in Slovakia, held in Bratislava on November 7, 2024.
- Country Workshop in Czechia, held in Prague on December 2, 2024.
- Country Workshop in Belgium and the Netherlands, in Breda on January 23, 2025.
The LDS Launch Conference is planned to take place in March 2025 as an official introduction of the LDS and its close collaboration with the ALT-EDIC. This event will showcase the beta release of the LDS platform, and it will be co-located with the ALT-EDIC General Assembly of Members and the kick-off meeting of the LLMs4EU project, which is coordinated by the ALT-EDIC.
Ensuring the legal compliance of the Language Data Space
ELDA is currently conducting a comprehensive legal analysis of the Common European Language Data Space and evaluating it against potential governance options. This analysis encompasses a wide range of legal frameworks, including cybersecurity legislation, competition law, data protection regulations, and various EU-level digital legislation such as the Data Governance Act and Data Act. The assessment aims to ensure that the LDS governance scheme adheres to all relevant EU regulations, particularly those related to the collection and sharing of language data and services.
Language Technology Solutions – CNECT/LUX/2022/OP/0030
This call for tenders from the European Commission was published within the Digital Europe programme (DIGITAL). It aims to achieve three specific goals:
- facilitate uptake by SMEs, NGOs, public administration, and academia of European machine translation services for websites;
- support the creation of open-source European language speech recognition solutions;
- carry out market studies on language technologies and widely disseminate their results to foster the take-up of language technologies in Europe.
ELRA, through its operational body ELDA, is involved in two of the funded projects which are described below.
LOT 1 – Solutions Supporting the Use of Automated Translations on Websites
The main goal of the European Multilingual Web (EMW) project is to provide a set of ready-to-use, open-source automated translation solutions for websites. In this framework, ELDA manages the Helpdesk team, which helps users report issues and ask questions about the solutions.
During this period, the WEB-T Helpdesk team continued to provide support to WEB-T users.
ELDA also continued to provide legal expertise to the project consortium by addressing the data protection issues relevant to the solution.
For more information on WEB-T: https://website-translation.language-tools.ec.europa.eu/
LOT 2 – Automated speech recognition prototype solutions
The LTS Lot 2 initiative aims to create an open-source Automatic Speech Recognition (ASR) solution and a speech corpus named Low-resource European Languages Datasets (LELD) for three under-resourced European languages: Czech, Estonian, and Greek. At the same time, a market study on the current state of ASR was conducted.
ELDA is specifically involved in the creation of the LELD corpus, while the rest of the consortium, BUT and TILDE, are the main actors in the development of the ASR solutions.
LELD will contain 4,500 hours of speech per language (13,500 hours in total), and one-third of these data will be transcribed, i.e., 1,500 hours per language.
We collected audio data from four public institutions within the targeted countries:
- The European Commission (Czech, Estonian and Greek data), to collect European Parliament plenary sessions and committee sessions, meetings of the Committee of the Regions, and meetings of the Council of the European Union.
- The Czech cities’ councils (Czech data), to collect the city council meetings.
- The Riigikogu (Estonian data), to collect national parliament sessions and interviews.
- The ΒΟΥΛΗ ΤΩΝ ΕΛΛΗΝΩΝ (Greek data), to collect national parliament plenary sessions and committee meetings.
LELD’s raw and transcribed data will be used to train the three ASR solutions. The transcribed part of the corpus is produced by native speakers of each of the languages present in LELD.
The European Commission Directorate-General for Interpretation (DG SCIC) has shown interest in the ASR solution presented in the project. One of DG SCIC’s missions is to provide interpretation services for the Commission, European Council, Council of the EU, Committee of the Regions, European Economic and Social Committee, European Investment Bank as well as agencies and offices in EU countries. It is therefore worthwhile to train the ASR model on data produced by political bodies, both European and national, as these data share elements such as register, vocabulary, and intonation with the data that DG SCIC works on.
As in Lot 1, ELDA also offered legal expertise to the project consortium, mainly concerning data protection and intellectual property issues.
Consortium and Tasks
The consortium operating in this project is coordinated by Brno University of Technology (BUT) with the participation of TILDE and ELDA. Three main tasks are being performed with the participation of all members of the consortium, which are:
- Task 1: A comprehensive market study of Automatic Speech Recognition (ASR) solutions.
- Task 2: Creation of an open-source speech recognition prototype solution for three under-represented European languages (Czech, Estonian, and Greek).
- Task 3: Collection and partial transcription (one third) of speech data for the three above-mentioned European under-resourced languages.
Dissemination
News from ELRA
Advancing Humanism through Language Technologies
24-26 February 2025
UNESCO Headquarters in Paris, France
The 2nd International Conference on Language Technologies for All (LT4All 2025) has as its theme Advancing Humanism through Language Technologies and as its aim furthering the agenda of language technologies with a focus on community empowerment. It will explore the relationships among technologies, languages, and their related communities from scientific, technical, cultural, linguistic, economic, political, and ethical perspectives. Its goal is to harness technology not only to advance itself but also to support and enhance individuals’ capabilities.
LT4All 2025 is organized by ELRA and SIGUL, the ELRA/ISCA Special Interest Group on Under-resourced Languages, in partnership with UNESCO. It will be held at UNESCO Headquarters in Paris from 24 to 26 February 2025 as part of the International Decade of Indigenous Languages (IDIL 2022-2032) and will commemorate the Silver Jubilee of International Mother Language Day. It will try to bring together scientific and technological solution providers and representatives of linguistic communities.
Anyone with a relevant proposal to share is invited to submit it to the LT4All 2025 Secretariat for selection and inclusion in one of the sessions of the Conference Program Tracks. The conference’s expected audience includes representatives from UNESCO, governments, academia, language technology researchers, linguists, industrialists, indigenous communities, and language policymakers worldwide. Attendance is by invitation, with a maximum of 400 participants. Anyone wishing to attend the Conference can submit an expression of interest through the Web site contact form.
From LT4All 2025 presentations and discussions the Conference Organizers hope to draw conclusions, recommendations, and – possibly – an action plan to be submitted to UNESCO.
Background
The 1st International Conference on Language Technology for All (LT4All 2019), held at UNESCO Headquarters in Paris in 2019 and themed Multilingualism for Building Knowledge Societies, highlighted the critical role of language and cutting-edge technology, including artificial intelligence, in shaping inclusive and cross-cultural dialogues. It addressed various aspects of multilingualism, from language preservation to the impact of digital technologies on communication, exploring the relationship between technologies and languages from scientific, technical, cultural, linguistic, economic and political perspectives.
Reportedly, LT4All 2019 spurred significant initiatives by various research institutions and major technological companies towards developing language technologies for a wider range of languages. These initiatives included creating language resources and advancing language technologies. Despite significant progress, however, many communities are still being left behind.
Language Resources and Evaluation Journal
Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. The Journal is edited by ELRA and published by Springer.
Since January 2024, the following issues have been published.
Volume 58
Volume 58, issue 4, December 2024, Regular Issue
Volume 58, issue 3, December 2024, Special Issue: Language Resources for Clinical Linguistics
Each of these issues includes a number of papers in Open Access.
