ELRA Commissioning the Production of Language Resources
In response to the need for more language resources, ELRA issued a series of calls for tenders and proposals in 1998 and 1999 (December 1998, February 1999, March 1999) to help sponsor the production of new resources and/or the packaging or customisation of existing resources, as indicated by current needs in the language engineering community.
The intended purpose of these calls is to ensure that necessary resources are developed in an acceptable framework (in terms of time and legal conditions) by the language engineering players. These calls target projects with short time scales (projects lasting up to one year), and the size of the funding is modest. ELRA funding is intended to be effective and useful for producers who are both tactical in their aims for the targeted market and strategic with regard to the content and annotation techniques needed to fulfil these needs.
ELRA is not taking over the role of the European Commission nor that of national agencies involved in the strategic, long-term creation of resources and infrastructures. Fitting within the framework of the European Commission DG XIII actions for revitalising the language engineering field, ELRA’s contribution is to support the packaging and customisation of small sets of key resources that might not be supported through the larger European programmes (e.g. Language Engineering or Human Language Technologies, HLT).
The Language Resources - Packaging & Production (LRsP&P) (European Commission LE4-8335) project
Within the Language Resources - Packaging & Production (LRsP&P) project (LE4-8335), ELRA has been assigned the task of pursuing several activities related specifically to language resources (LRs), including conducting language resources survey work, commissioning the production of new LRs, and validating the resulting LRs.
A few reports about the language resource survey work conducted during the period of the LRsP&P project are now available. See below:
Available to everyone:
Click here to download the language resources user needs survey that was presented at LREC 2000.
For members only (you need the password):
Click here to download the extended report that discusses all of the language resources surveys conducted within the LRsP&P project.
Click here to download the language resources segmentation report.
The results of some of the early language resources surveys led ELRA to launch a call for the production/packaging of LRs. The 1999 ELRA Call for LR Production and Packaging proposals led to 8 projects that were partially or fully funded:
1. Corpus of Written Business English (Ruslan Mitkov; University of Wolverhampton): This is LRsP&P subproject 1. This project has resulted in a new corpus of written Business English consisting of an ASCII text corpus of 10 million words, with SGML mark-up, part-of-speech tags, and sentence and paragraph boundary markers.
2. Sets of Bilingual Language Resources Dictionaries for English and Russian (Vera Semenova-Fluhr; SCIPER): This is LRsP&P subproject 2. This project has produced an English-Russian and a Russian-English LR-dictionary by reformatting an existing source dictionary. Automatic inversion of the English-Russian LR-dictionary was used to create the Russian-English LR-dictionary.
3. Crater 2 - Expanding Resources for Terminology Extraction (Tony McEnery; Lancaster University): This is LRsP&P subproject 3. This project has enhanced the CRATER resource, which is already available via ELRA, by significantly expanding the French/English component of the parallel corpus by around 50% to 1,500,000 tokens each. The Spanish corpus has also been extended in a monolingual fashion by an additional 500,000 Spanish tokens in order to make the Spanish corpus 1,500,000 tokens in size.
4. Italian Broadcast News Corpus (Marcello Federico; ITC-IRST): This is LRsP&P subproject 4. This project has resulted in the completion of a multimedia corpus of 30 hours of annotated radio broadcast news for Italian. The corpus includes the audio signal, transcriptions, and documentation for users. The broadcast news was acquired from the digital archive of the major Italian broadcaster Radio RAI and then processed and annotated by ITC-IRST.
5. Pronunciation Lexicon of British English Place Names, Surnames and First Names (Marc Fryd; Université de Poitiers): This is LRsP&P subproject 5. The final size of this database is 165,000 main entries. Each word is encoded with relevant information such as: thematic status (place-name, surname, first name), phonetic transcriptions ("main" and "secondary" phonetic variants), number of letters, and number of syllables.
6. Scientific Corpus of Modern French (Béatrice Daille and Geoffrey Williams; Université de Nantes): This is LRsP&P subproject 6. This project is a corpus of contemporary written scientific French. It is a monolingual, mono-source corpus developed from the journal "La Recherche" and is a multidisciplinary overview of scientific usage. The finished language resource consists of articles over a period of one full year from February 1998 to January 1999 covering 30 large themes with an approximate total of 800,000 words.
7. German-French Parallel Corpus of 30 Million Words (Wolfgang Teubert; Institut für deutsche Sprache, University of Mannheim): This is LRsP&P subproject 7. This German-French parallel corpus is a 30 million word corpus (15 million for each language) for the purpose of developing, enhancing and improving translation aids (dictionaries, lexicons, platforms) for French-German and German-French translation.
8. Colombian Spanish SpeechDat-like (Siemens Colombia & Department of Signal Theory and Communications of the Universitat Politècnica de Catalunya): This is LRsP&P subproject 8. This database comprises telephone recordings from 1,065 speakers (563 male speakers and 502 female speakers) recorded directly over the fixed telephone network using an E-1 interface. Speech files are stored as sequences of 8-bit 8 kHz A-law uncompressed speech samples (CCITT G.711 recommendation). Each prompted utterance is stored in a separate file. Each speech file has an accompanying ASCII SAM label file. The speech file format and SAM label files follow the specifications given by the SpeechDat project.
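For readers who want to work with speech files in the format described for subproject 8 (8-bit, 8 kHz A-law samples, G.711), the following is a minimal sketch of how such samples could be expanded to 16-bit linear PCM. It assumes the speech files are headerless streams of A-law bytes, which is not confirmed here and should be checked against the database documentation. The function names (alaw_to_linear, decode_alaw, duration_seconds) are illustrative and are not part of any ELRA or SpeechDat tool.

```python
def alaw_to_linear(a_val: int) -> int:
    """Expand one 8-bit A-law byte to a signed 16-bit linear PCM value."""
    a_val ^= 0x55                  # undo the alternate-bit inversion used by G.711 A-law
    t = (a_val & 0x0F) << 4        # 4-bit mantissa
    seg = (a_val & 0x70) >> 4      # 3-bit segment number
    if seg == 0:
        t += 8
    else:
        t += 0x108
        if seg > 1:
            t <<= seg - 1
    # After the inversion, a set top bit marks a positive sample.
    return t if a_val & 0x80 else -t


def decode_alaw(raw: bytes) -> list[int]:
    """Decode a sequence of A-law bytes (assumed headerless) to PCM values."""
    return [alaw_to_linear(b) for b in raw]


def duration_seconds(raw: bytes, sample_rate: int = 8000) -> float:
    """Duration of an utterance: one byte per sample at the given rate."""
    return len(raw) / sample_rate
```

Under these assumptions, the bytes of one utterance file can be read in binary mode and passed to decode_alaw, and duration_seconds gives the utterance length, since each sample occupies exactly one byte at 8,000 samples per second.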
The LRsP&P project also promotes the validation of language resources as a new recognised activity since all language resources funded by ELDA within the LRsP&P project have included validation criteria to be applied during the internal and external validation phases.
For members only (you need the password):
Click here to download the Final LRsP&P project report.
Language Resources Projects Funded by the Délégation Générale à la Langue Française (DGLF)
Within its activities in conjunction with the French government, ELRA launched in 1998-1999 another call for tenders regarding the production of modern French corpora, which is now starting with 3 different projects. The 3 projects are based on 4 different corpora: Le Monde, Official Journal of the European Communities Written Questions and Parliamentary Debates, French Sénat reports.
Syntsem: Syntactic and Semantic Tagging of French (Jean Véronis, CILSH Lab at the Université de Provence and TALANA lab at the Université Paris VII). ½ million word corpus with complete morpho-syntactic tagging, complete lemmatisation, grammatical compound words, complete shallow parsing and semantic tagging of 10% of the content words.
Annotating Grammatical Anaphora in French Electronic Corpora (Xerox Research Centre Europe, CRISTAL-GRESEC - Université Stendhal - Grenoble 3). In particular, partners will work on grammatical anaphora (pronouns, adjectives, adverbs and verbs) and will make proposals for the coding of related anaphora.
Tagging Texts to Constitute Representative Corpora (B. Habert, LIMSI-CNRS and UMR 8503 - ENS Fontenay/Saint-Cloud). Morpho-syntactic tagging and lemmatisation work, followed by complementary tagging, which consists of spotting some quantitative features as well as linguistic features that provide textual organisation (e.g. conjunctions, transitions).
New projects should be added regularly to the current ones in order to build a collection of modern French corpora.