LRs in the ELRA Catalogue this month
Four new written corpora and fourteen new bilingual lexicons are now available in our catalogue.
English-Punjabi Code-Mixed Social Media Content
The English-Punjabi Code-Mixed Social Media Content corpus is composed of 893,615 English-Punjabi parallel sentences in the following domains: Agriculture, Culture, Entertainment, Health, Religion, Sports, Technology, Tourism, and Education.
Parallel Corpora for 6 Indian Languages
The Parallel Corpora for 6 Indian Languages contains data sets for Bengali (540,000 words – 20,000 parallel sentences), Hindi (1,200,000 words – 37,000 parallel sentences), Malayalam (660,000 words – 29,000 parallel sentences), Tamil (747,000 words – 35,000 parallel sentences), Telugu (951,000 words – 43,000 parallel sentences), and Urdu (1,200,000 words – 33,000 parallel sentences), translated into English.
Tham Khasi annotated corpus
This is a corpus in Khasi, an Austro-Asiatic language, comprising Khasi sentences extracted from textbooks prescribed for students at the secondary, higher secondary, graduation, and post-graduation levels in the year 2015-2016. The corpus contains 83,312 words, 4,386 sentences, and 5,465 word types, amounting to 94,651 tokens (including punctuation). The sentences are manually tagged for parts of speech.
"La Dépêche de Kabylie" Corpus
"La Dépêche de Kabylie" Corpus consists of about 1,570,000 words in the Amazigh language collected from the Algerian newspaper “La Dépêche de Kabylie”. It was collected using HTTrack Website Copier. All articles are gathered in one plain text file.
English-Vietnamese Special Dictionaries series
A series of specialised bilingual dictionaries is now available for the following domains:
English-Vietnamese Special Dictionary: Aesthetic - 836 entries provided in XML format (ISLRN:792-807-299-844-6)
English-Vietnamese Special Dictionary: Architecture - 18,213 entries provided in XML format (ISLRN:090-342-038-261-9)
English-Vietnamese Special Dictionary: Finance - 9,039 entries provided in XML format (ISLRN:557-620-378-687-8)
English-Vietnamese Special Dictionary: Economics - 16,255 entries provided in XML format (ISLRN:292-335-361-128-1)
English-Vietnamese Special Dictionary: Informatics - 3,835 entries provided in XML format (ISLRN:664-600-467-613-7)
English-Vietnamese Special Dictionary: Law - 3,011 entries provided in XML format (ISLRN:675-423-495-453-3)
English-Vietnamese Special Dictionary: Math - 15,004 entries provided in XML format (ISLRN:673-080-199-543-0)
English-Vietnamese Special Dictionary: Mechanical - 3,482 entries provided in XML format (ISLRN:464-013-248-151-6)
English-Vietnamese Special Dictionary: Medical - 8,073 entries provided in XML format (ISLRN:264-005-069-750-7)
English-Vietnamese Special Dictionary: Navigation - 19,393 entries provided in XML format (ISLRN:147-831-511-571-0)
English-Vietnamese Special Dictionary: Physics - 23,584 entries provided in XML format (ISLRN:288-262-689-669-3)
English-Vietnamese Special Dictionary: Real Estate - 2,585 entries provided in XML format (ISLRN:438-043-926-686-6)
English-Vietnamese Special Dictionary: Stocks - 1,094 entries provided in XML format (ISLRN:479-017-757-739-6)
English-Vietnamese Special Dictionary: Tourism - 2,235 entries provided in XML format (ISLRN:923-733-433-674-1)
The International Standard Language Resource Number (ISLRN) provides Language Resources (LRs) with unique identifiers using a standardised nomenclature. This aims to ensure that LRs are correctly identified and, consequently, recognised with proper references for their usage in applications in R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers.
The new ISLRN Web Portal is now open
A new version of the ISLRN web portal has recently been released. To improve the portal's performance with an up-to-date technical infrastructure, ISLRN was ported to the latest versions of Django and Python. Consequently, the design and functionalities, including the search tool and enhanced submission pages, have been reworked to improve the user experience.
Figures of the month
- 6 new ISLRN numbers assigned in February 2022
- 1 new ISLRN provider in February 2022
- A total of 3,205 ISLRN numbers assigned since January 2014
- A total of 249 distinct languages
The latest LRs for which an ISLRN number was requested and accepted in February are as follows:
- Manual Arabic spelling-errors correction for collected documents – ISLRN: 922-673-450-479-2
- Spoken Digits in Hindi and Indian English – ISLRN: 452-404-795-171-3
- The Child Subglottal Resonances Database – ISLRN: 550-643-277-274-6
- Implemented spelling rules – ISLRN: 075-613-662-068-2
Legal and Ethical Issues Workshop @ LREC2022 - Call for Papers
ELRA is co-organising the Legal and Ethical Issues Workshop that will take place at LREC 2022 on 24 June. Since the Call for Papers was circulated, the workshop website and the submission page have been set up and disseminated to the community.
This year, to better respond to the needs of the international language resources community, a single workshop will aim at tackling Legal and Ethical Issues in Language Resources with a specific focus on trying to build bridges between legal issues and technology.
Focus - Follow up of CNIL’s decision on facial recognition service
Following the decision n° MED 2021-134 of 26 November 2021 by the French CNIL, which compelled Clearview AI to stop its processing operations on French territory in order to comply with its GDPR obligations, other European countries have also stepped in to provide further clarification on the use of this facial recognition service.
In a report published on 9 March 2022, the Belgian Supervisory Body for Police Information stated that the experimentation led by the Belgian Federal Police using Clearview AI constituted a severe infringement of the rights to privacy and protection of personal data. Moreover, it stated that Belgian law did not offer a sufficient legal basis to apply the facial recognition technology developed by Clearview AI. Finally, it found that personal data and police data had been transferred to a third country without an assessment of the adequacy of the protection offered by that country. The Supervisory Body concluded that this transfer of police and personal data was illegal.
In addition, the Italian DPA (Garante per la protezione dei dati personali) sanctioned the company with a €20 million fine for infringing the GDPR by processing the personal data of Italian residents without an appropriate legal basis and by failing to appropriately inform users of the processing and its purposes. The Italian DPA also ordered Clearview AI to delete the data of Italian data subjects and to designate a representative in the EU to allow the data subjects to exercise their rights.
- The report produced by the Belgian Supervisory Body for Police Information is available here (in French).
Political Agreement for a successor to the Privacy Shield
After the invalidation of the Privacy Shield framework by the European Court of Justice in the Schrems II case, a new political agreement has been announced by European Commission President Ursula von der Leyen and US President Joe Biden. This agreement would allow the free flow of data between the European Union and the United States while taking into account the European Court of Justice's decision.
The full text of the agreement has yet to be released. However, it has raised concerns from privacy activist Max Schrems, who declared the following: “It is regrettable that the EU and U.S. have not used this situation to come to a ‘no spy’ agreement, with baseline guarantees among like-minded democracies. Customers and businesses face more years of legal uncertainty.”
A full report on the agreement from TechCrunch is available here.
Workshop Review - GDPR Guidelines for the Translation & Interpretation Professions
On 4 March 2022, FIT Europe organised a workshop on the GDPR guidelines that apply to translators and interpreters.
The workshop focused on several topics:
- GDPR Application to translated content
- Identification of the roles within the translation supply chain
- Mitigation, through anonymisation, of the risks that GDPR implementation entails for translation
- Curation of translation after performance of the service
The full replay of the workshop is available here.
Workshop Review - Data Spaces and New Regulations - The Data Act and European Standardisation Strategy
On 16 March 2022, the Data Spaces Business Alliance organised a workshop centred on developments regarding the Data Act and the European Standardisation Strategy.
Regarding legislative developments, some key issues were pointed out:
- Create a unified Data & Cloud Market
- Give the users a maximum of control over their data and how it is shared
Five main building blocks were established regarding the advent of this Data Market:
- The access to data generated by the Internet of Things (connected objects)
- The tackling of unfair contractual practices for SMEs
- The availability of business data for the common good
- The preservation of rights of the data subjects
- Competition in the services markets based on the available data
Information on the on-going projects
European Language Equality
The European Language Equality (ELE) project is developing a strategic research, innovation and implementation agenda as well as a roadmap to achieve full digital language equality in Europe by 2030. ELE counts on 52 partners representing all European countries, research and industry, together with major pan-European initiatives. The project builds on wide-ranging contributions from individuals to define its strategic agenda.
ELE has performed a thorough identification task resulting in the listing and description of more than 6,000 datasets and tools for all European countries. ELDA has collaborated with LISN in the identification for the French language. These metadata descriptions have been recently imported into the European Language Grid catalogue (see below).
Reports on the definition of digital language equality and the state of the art in language technology and language-centric AI are already available here.
The consortium is currently working on 31 language reports that analyse the digital status of European languages and their risk of digital extinction. ELDA has taken care of the revision of five of them. These reports will help identify the needs and the remaining efforts required to achieve digital language equality by 2030.
European Language Grid
The European Language Grid (ELG) project has entered its final six months of project runtime (a 36-month project extended by a further six months for pandemic reasons) and is working towards the third and final release of its platform, due at the end of March 2022. The final report on the status of datasets, models, identified gaps, produced resources and their exploitation within ELG (D5.3) has just been submitted by ELDA (as WP leader) to the EC. At present, the ELG catalogue hosts 8,873 datasets and 2,734 tools and services. This content includes the 4,127 metadata records for language resources and the 2,215 for tools that were identified under ELG’s sister project ELE. Both projects work in very close collaboration, and ELE feeds ELG with the findings from its desk research for strategy and catalogue population.
For procedural reasons, ELG deliverables are not publicly available yet, but they will be in the coming months.
Multilingual Anonymisation for Public Administrations (MAPA)
The MAPA project ended in December 2021. A de-identification toolkit has been built focusing on the health and legal domains and covering the 24 CEF languages. The system makes use of Named Entity Recognition (NER) techniques for sensitive information detection. The NER mechanism has been trained and tested on data produced within the project. ELDA has led the data production work within the initiative and the following language resources have been built for all 24 languages:
- A 1-million-sentence monolingual raw corpus.
- An annotated corpus with EUR-LEX data: 2,000 sentences of parallel data for all 24 languages, annotated with named entities.
- A list of person names: 10,000 combined names (given name + surname), except for those languages requiring specific morphological changes and where given names and surnames are provided separately.
Furthermore, additional annotated data have been produced for further system training for English, French, Greek, Italian, Maltese and Spanish, comprising both clinical cases and legal text.
All datasets are available through the ELRC-SHARE repository and, shortly, through the EL catalogue. The INCEpTION platform has been used for data annotation.
The MAPA toolkit can be easily deployed as its engines are fully dockerised. Use cases such as the Spanish Ministry of Justice and Complaints Watch by DG-Justice have tested MAPA and shown interest in using it. The EC’s eTranslation platform has also tested the toolkit, providing very positive feedback.
ELDA is currently assembling all trained models and preparing the deployment of the de-identification system for public use. More details on how to access it will follow in the next Newsletter.
ELDA will be organising a Workshop on the topic of de-identification of sensitive language resources at LREC 2022.
- SemEval-2023 - The 17th International Workshop on Semantic Evaluation
- JOKER - Automatic Wordplay and Humour Translation
- iDPP - Intelligent Disease Progression Prediction
CLPsych 2022 Shared Task - The Workshop on Computational Linguistics and Clinical Psychology
CRAC 2022 Shared Task on Multilingual Coreference Resolution
LSCDiscovery Shared Task - Lexical Semantic Change Discovery in Spanish
SIGMORPHON 2022 Shared Tasks:
2022 National NLP Clinical Challenges - 3 Tracks
- Track 1: Contextualized Medication Event Extraction CMED
- Track 2: Extracting Social Determinants of Health
- Track 3: Progress Note Understanding: Assessment and Plan Reasoning
VoicePrivacy 2022 Challenge
DEFT 2022 - Défi Fouille de Texte
FNP 2022 - The 4th Financial Narrative Processing workshop:
The 4th Financial Narrative Processing workshop (FNP 2022) will be supported by ELRA. The winning team of each shared task competition will win one free registration, generously provided by ELRA, to attend LREC 2022 and present their work at the workshop in Marseille, France.
Three shared tasks:
- Financial Narrative Summarisation (FNS 2022): summarise financial data in three languages: English, Spanish and Greek.
- Financial Table of Content Extraction (FinTOC 2022): detect structure of financial documents in three languages: English, French and Spanish
- Financial Causality Detection (FinCausal 2022): detect causal effects in financial disclosures in English.
News from ELRA
The LREC Conference is ELRA’s biennial flagship event, organized since 1998 with the support of institutions and organizations involved in HLT.
LREC 2022, the 13th edition, will take place from June 20 to 25, 2022 at the Palais du Pharo, in Marseille (France).
On 23-24 March 2022, the members of the LREC 2022 Programme and Organizing Committees met in Marseille.
The main objective of this meeting was to work on the Conference programme (structure of the conference, special sessions, etc.). Authors of Main conference papers will be notified shortly.
Some time was also dedicated to practical aspects including the hybrid format of the conference. The rooms and spaces of the Conference Centre were again reviewed along with the offers of the local providers (hotels, catering companies, etc.).
Hotel rooms can be booked from our partner Mathez Travel at www.lrecmarseille.com.
Submission to most of the workshops remains open until the beginning of April. Information on all workshops, including the submission information, and all tutorials is available at https://lrec2022.lrec-conf.org...
As of now, LREC 2022 has received the support of:
Antonio Zampolli Prize: Call for nomination for LREC 2022
In 2004, the ELRA Board created a prize to honour the memory of its first President, Professor Antonio Zampolli, a pioneer and visionary scientist who was internationally recognized in the field of Computational Linguistics and Human Language Technologies (HLT). He also contributed much through the establishment of ELRA and the LREC conference.
To reflect Professor Zampolli's specific interest in our field, the ELRA Antonio Zampolli Prize is awarded to individuals and small groups whose work lies within the areas of Language Resources and Language Technology Evaluation with acknowledged contributions to their advancements.
The Prize will be awarded for the ninth time in June 2022 at the LREC 2022 conference in Marseille (20-25 June 2022).
Nominations should be sent to the ELRA President Antonio Branco at AntonioZampolli-Prize@elra.info no later than April 25, 2022.
Please visit the ELRA website for the ELRA Antonio Zampolli Prize Statutes, the nomination procedure and the previous winners: http://www.elra.info/en/lrec/elra-antonio-zampolli-prize/
Language Resources and Evaluation Journal
Volume 56, issue 1, was published in March 2022 and contains 12 articles.
ELRA Press Releases
On March 23, 2022, the PR announcing the opening of the new version of the ISLRN web portal was circulated.
The full PR can be found here.
Job @ ELDA
Technical Engineer position open at ELDA
ELDA is currently seeking to fill an immediate vacancy for a Technical Engineer (m/f).
Under the supervision of the CEO, the Technical Engineer in Speech and Multimodal Technologies will be in charge of conducting the activities related to the production of language resources. (S)he will manage language resources production projects and co-ordinate R&D projects while also being hands-on whenever required by the development team. Responsibilities include designing/specifying language resources, setting up production frameworks and platforms, carrying out quality control and assessment, and recruiting and managing project-dedicated human resources. (S)he will also contribute to improving or updating the current language resources production workflows.
The position is based in Paris.
Please check the profile details here: http://www.elra.info/en/opportunities/
News from the community
European Commission’s call for support: Crisis response without language borders
EU grants for news media - Webinar on 6 April 2022 at 11:00 CET
On behalf of the European Commission, Unit I3 Audiovisual Industry and Media Support Programmes, we would like to invite you and your contacts to a webinar on EU funding opportunities for news media.
It will be a virtual event on Wednesday 6 April 2022, from 11:00 to 13:00 CET, and more details are available @ https://digital-strategy.ec.europa.eu/en/events.
Questions about Creative Europe should be addressed to Creative Europe Desks in each country or the relevant functional mailboxes. For questions about other calls (pilots, media innovation), the relevant contact details during and after the webinar will be provided via a dedicated event page.
Report: Digital Decade targets in jeopardy without scale-up of efforts
A substantial acceleration of digital development is needed if the EU’s Digital Decade targets are to be met, according to a new report that sheds light on the disparity between member states.
Read the full article on Euractiv @ https://bit.ly/3DpVtPQ