ELDA is involved in a number of Language Resources production projects, including the production of annotated data and of corpora for Machine Translation (MT) evaluation.
- Reference Corpora for MT Evaluation
- Tweet Data Annotation in French and German
- Named Entity Annotation in Arabic, English and French
ELDA is working on a new production of reference corpora for the evaluation of Machine Translation. This project looks into spontaneous-writing data such as comments posted on the internet and web articles, respecting their actual format, segmentation and data noise. This is quite challenging both for the professionals producing the reference translations and for later technological applications. The source languages are Arabic, Chinese, Persian and Russian, and two reference translations are being produced per language direction, targeting both English and French. This production follows the strict guidelines and protocols of translation and quality control that ELDA has defined throughout its years of experience in parallel corpora production.
This project aims at producing two Tweet-data corpora, annotated with what are referred to as "opinions/feelings/emotions". The annotation specifications, created for this project, define annotation at different levels, also covering relations between elements inside the tweets. The languages targeted are French and German. Each corpus comprises about 15,000 tweets, and 10% of these data are also selected for double annotation.
The single annotation part of the French corpus has been concluded; only the double annotation remains, and it is planned for the beginning of February. The German data is being prepared so that its annotation can start after the French part of the project.
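As an illustration of the double-annotation share mentioned above, the following minimal sketch draws a reproducible 10% random sample from a corpus of 15,000 tweet identifiers. It is only an assumption about how such a subset could be selected: the function name `select_double_annotation`, the integer identifiers and the fixed seed are hypothetical and do not describe ELDA's actual procedure.

```python
import random


def select_double_annotation(tweet_ids, fraction=0.10, seed=0):
    """Return a reproducible random subset of IDs for double annotation.

    Hypothetical helper: the sampling strategy and the seed are assumptions.
    """
    rng = random.Random(seed)  # local generator, so the draw is repeatable
    k = round(len(tweet_ids) * fraction)
    return sorted(rng.sample(tweet_ids, k))


# Placeholder identifiers standing in for a corpus of about 15,000 tweets.
corpus_ids = list(range(15000))
subset = select_double_annotation(corpus_ids)
print(len(subset))  # 1500
```

Fixing the seed means the same tweets would be handed to the second annotator each time the selection is rerun, which keeps agreement figures comparable.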
ELDA is carrying out the production of three Named-Entity annotated corpora, addressing different domains (data from comments posted on the internet, web articles and tweets) for the following languages: Arabic, English and French. These corpora are made up of about 200,000 words per language, out of which a part is selected for double annotation. Annotation guidelines have been created, based on those from earlier annotation projects but refined and given a richer granularity, with the aim of covering the needs of the domains treated.
Following up on the production of these three Named-Entity annotated corpora, ELDA has concluded the French data and is currently finishing both the Arabic and English ones. The Arabic data production has proven particularly challenging, as its sources sometimes mix Arabic and Latin scripts within the same text. This has had an impact both on the display of certain items in the data and on their annotation. The English data is currently undergoing what is expected to be its final validation.
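To make the mixed-script issue concrete, the sketch below flags texts that contain both Arabic and Latin letters, using Unicode character names from Python's standard `unicodedata` module. It is an illustrative assumption, not the tooling used in the project; the helper names `scripts_in` and `is_mixed_script` are hypothetical, and the name-prefix test is only a rough approximation of script detection.

```python
import unicodedata


def scripts_in(text):
    """Return which of 'ARABIC' and 'LATIN' occur among the letters of text."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        name = unicodedata.name(ch, "")  # e.g. 'ARABIC LETTER BEH'
        for script in ("ARABIC", "LATIN"):
            if name.startswith(script):
                found.add(script)
    return found


def is_mixed_script(text):
    """True when the text contains both Arabic and Latin letters."""
    return scripts_in(text) == {"ARABIC", "LATIN"}


print(is_mixed_script("مرحبا ELDA"))  # True
print(is_mixed_script("hello"))       # False
```

Texts flagged this way could be routed to an annotator for a closer look, since right-to-left and left-to-right runs in one segment are where display problems tend to appear.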
As a reminder, the objective of this project is to create three Named-Entity corpora for evaluation. The sources comprise highly complex, spontaneous-writing data from comments posted on the internet, web articles and tweets.