LRs Announcements

YEAR 2009 | 2008 | 2007 |

Check out the Language Resources that have been announced in 2007, 2008 and 2009.


2009
Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec


September 09
ELRA-S0301 Norwegian EUROM1 (EUROM1_N)
EUROM1 is the first truly multilingual speech database produced in Europe. Over 60 speakers per language pronounced numbers, sentences and isolated words using a close-talking microphone.
ELRA-S0302 TC-STAR female baseline voice: Laura
Laura contains the recordings of one female English (British) speaker recorded in a noise-reduced room through a headset microphone. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems).
ELRA-S0303 TC-STAR male baseline voice: Ian
Ian contains the recordings of one male English (British) speaker recorded in a noise-reduced room through a headset microphone. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems).
ELRA-S0304 SpeechDat(M) Italian Mobile Network Speech Database
This speech database contains the recordings of 342 Italian speakers recorded over the Italian mobile telephone network. Each speaker uttered around 40 read and spontaneous items.
August 09
ELRA-T0373 BioLexicon
BioLexicon is a large-scale English terminological resource which has been developed to address the needs emerging in text mining efforts in the biomedical domain. It contains over 2.2M lexical entries (over 3.3M semantic relations), and information on over 1.8M variants and on over 2M synonymy relations. BioLexicon is available in a relational database format (MySQL dump format) and it adheres to the EAGLES/ISO standards for lexical resources.
July 09
ELRA-T0372 Multilingual Dictionary of Sports
This dictionary was produced within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. The results are presented in the form of MS ACCESS databases. The EuRADic sport dictionary is provided under the following different subsets:
ELRA-T0372-01 English-French-Greek-Arabic-German-Spanish-Portuguese multilingual database
ELRA-T0372-02 English-French bilingual database
ELRA-T0372-03 English-French-Greek trilingual database
ELRA-T0372-04 English-French-Arabic trilingual database
ELRA-T0372-05 English-French-German trilingual database
ELRA-T0372-06 English-French-Spanish trilingual database
ELRA-T0372-07 English-French-Portuguese trilingual database

ELRA-M0042 ItalWordNet (Italian WordNet)
ItalWordNet (Italian WordNet) is an updated version of the EuroWordNet Italian database.
ELRA-W0051 Persian-English parallel Corpus
The corpus consists of about 3,500,000 English and Persian words aligned at sentence level (about 100,000 sentences). The format of the files is Unicode.
ELRA-E0034 EASy Evaluation Package
The EASy Evaluation Package was produced within the French national project EASy (Evaluation of syntactic parsers of French), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT).
June 09
ELRA-W0050 The CINTIL Corpus – International Corpus of Portuguese
CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portuguese. It is composed of one million annotated tokens, each of which was verified by human expert annotators.
May 09
ELRA-M0048 LatinWordNet
LatinWordNet contains information about the following aspects of the Latin and English lexicon: lexical relations between words, semantic relations between lexical concepts, correspondences between Latin and English lexical concepts.
ELRA-M0049 Basque WordNet
The Basque WordNet models nouns, verbs and adjectives. Each sense is linked to a so-called synset (for a total of 30,281 Synsets). Every synset encodes the synonymy relation between (possibly) several words (synonyms), having a unique meaning, belonging to one and the same part of speech (specified in the POS tag value), and expressing the same lexical meaning.
ELRA-M0050 The MWN.PT - MultiWordnet of Portuguese
MWN.PT - MultiWordnet of Portuguese (version 1) spans over 17,200 manually validated concepts/synsets, linked under the semantic relations of hyponymy and hypernymy. These concepts are made of over 21,000 word senses/word forms and 16,000 lemmas from both European and American variants of Portuguese.
ELRA-S0300 SIGNUM Database
The SIGNUM Database contains both isolated and continuous utterances of various signers. The corpus was recorded on video. For quick random access to individual frames, each video clip is stored as a sequence of images.
April 09
ELRA-S0299 Alcohol Language Corpus (BAS ALC)
ALC contains recordings of 88 German speakers that are either intoxicated or sober. The type of speech ranges from read single digits to full conversation style.
March 09
ELRA-W0049 “Le Monde Diplomatique” Arabic tagged corpus
This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04). Three files are associated with each text: the raw text in Arabic, the vowelised text in Arabic, and one XML file containing the morphological annotation of the text.
ELRA-E0033 CHIL 2007 Evaluation Package
The CHIL 2007 Evaluation Package consists of the following contents:
  1. A set of audiovisual recordings of interactive seminars. The number of people present in the recording was fixed to be between 3 and 7. The recordings were done between June and September 2006 according to the “CHIL Room Setup” specification.
  2. Video annotations.
  3. Orthographic transcriptions.
ELRA-S0297 Hungarian Speecon database
The Hungarian Speecon database comprises the recordings of 555 adult Hungarian speakers and 50 child Hungarian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0298 Czech Speecon database
The Czech Speecon database comprises the recordings of 550 adult Czech speakers and 50 child Czech speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
February 09
ELRA-S0296 FBK-Irst database of isolated meeting-room acoustic events
This database has been produced within the CHIL Project (Computers in the Human Interaction Loop), in the framework of an Integrated Project (IP 506909) under the European Commission’s Sixth Framework Programme.
January 09

No LR announced. 


2008
Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec


December 08
ELRA-M0047 Czech WordNet
The Czech WordNet captures nouns, verbs, adjectives, and partly adverbs, and contains 28,201 word senses (synsets).
ELRA-S0294 CHIEDE Corpus: a spontaneous child language corpus of Spanish
The spontaneous child language corpus, CHIEDE, consists of 58,163 words, in 30 texts, with 7 hours and 53 minutes of recordings and 59 child participants.
ELRA-S0295 LILA Korean database
The LILA Korean database comprises 1,000 Korean speakers (500 males and 500 females) recorded over the Korean mobile telephone network.
November 08
ELRA-S0283 Laboratory Conditions Czech Audio-Visual Speech Corpus (UWB-05-LCAVC)
This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems. The corpus consists of about 25 hours of audio-visual records of 65 speakers in laboratory conditions.
ELRA-S0284 Czech Audio-Visual Speech Corpus for Recognition with Impaired Conditions (UWB-07-ICAVR I)
This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems collected with impaired illumination conditions. The corpus consists of about 20 hours of audio-visual records of 50 speakers in laboratory conditions.
ELRA-S0285 Czech Sign Language Corpus for Recognition - Amateur Signer (UWB-06-SLR-A)
This is an amateur sign-language database comprising 25 signs from Czech sign language. 15 signers (4 women and 11 men) carried out 5 repetitions of each sign and were recorded from 3 different views.
ELRA-S0286 Czech Sign Language Corpus for Recognition - Professional Signer (UWB-07-SLR-P)
This database comprises 378 signs from Czech sign language as performed by 4 everyday sign-language users (4 women, 2 of them deaf).
ELRA-E0017 CHIL 2006 Evaluation Package
The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close ups of the speaker, close talking and far-field microphone data of the speaker’s voice and background sounds.
ELRA-S0292 Danish EUROM1 (EUROM1_D)
EUROM1 is the first truly multilingual speech database produced in Europe. Over 60 speakers per language pronounced numbers, sentences and isolated words using a close-talking microphone.
ELRA-S0293 The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication
The database contains 8,099 English utterances pronounced by non-native speakers (31 French, 20 Greek, 20 Italian, and 10 Spanish speakers). The collected utterances correspond to human input in a command and control aeronautics application. The data was recorded in a studio with a close-talking microphone, and real noise recorded in an airplane cockpit was artificially added to the data. The signals are provided in clean (studio recordings with close-talking microphone), low, mid and high noise conditions. The three noise levels correspond approximately to signal-to-noise ratios of 10 dB, 5 dB and -5 dB respectively.
October 08
ELRA-S0287 Cantonese Speecon Database
The Cantonese Speecon database comprises the recordings of 550 adult Cantonese speakers and 50 child Cantonese speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0288 Thai Speecon Database
The Thai Speecon database comprises the recordings of 552 adult Thai speakers and 50 child Thai speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0289 OrienTel Jordan MCA (Modern Colloquial Arabic) database
This speech database contains the recordings of 757 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0290 OrienTel Jordan MSA (Modern Standard Arabic) database
This speech database contains the recordings of 556 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0291 OrienTel English as spoken in Jordan database
This speech database contains the recordings of 578 Jordanian speakers of English recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.
ELRA-W0048 TUNA Corpus
The TUNA Corpus of Referring Expressions was built with contributions from 50 native or fluent speakers of English and contains about 2,000 descriptions (referring expressions). Participants described objects (targets) in visual domains by typing and submitting referring expressions that distinguished them from other objects that were shown simultaneously (distractors). Each description is annotated with semantic information.
September 08
ELRA-S0281 LILA Hindi-L1 database
The LILA Hindi-L1 database comprises 2,030 Hindi speakers (1,012 males and 1,018 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered around 60 read and spontaneous items.
ELRA-S0282-01 BAS PHATT 1.0.X (sub-set)
The Ph@ttSessionz speech database contains recordings of 864 adolescent speakers of German (age range 12-20). The recordings were performed via the WWW in public schools (Gymnasium) in 41 locations in Germany. Recordings were done with SpeechRecorder in selected schools in the years 2005-2007. Both channels, the headset and the desktop microphone, were recorded in high quality. The BAS PHATT corpus is available in two versions: BAS PHATT 1.0.X (sub-set, ELRA-S0282-01) and BAS PHATT 1.1.X (complete corpus, ELRA-S0282-02). BAS PHATT 1.0.X contains 41 items.
ELRA-S0282-02 BAS PHATT 1.1.X (complete corpus)
August 08

No LR announced. 
July 08
ELRA-S0276 Swedish EUROM1 (EUROM1_S)
EUROM1 is the first truly multilingual speech database produced in Europe. Over 60 speakers per language pronounced numbers, sentences and isolated words using a close-talking microphone.
ELRA-S0277 SpeechDat Galician Database for the Fixed Telephone Network
The SpeechDat Galician Database for the Fixed Telephone Network contains the recordings of 653 speakers of Galician recorded over the fixed telephone network. Each speaker uttered around 44 read and spontaneous items.
ELRA-S0278 SmartWeb Handheld Corpus (SHC)
This corpus contains recordings spoken by 156 speakers in a human-machine query situation. Users were asked to solve several tasks with a spoken query system to the WWW using a smart phone as portable device in natural environments (office, hall, restaurant, street). Recorded channels are the Bluetooth headset over UMTS (telephone quality), the Bluetooth headset and an additional collar microphone in high quality.
See also ELRA-S0279 and ELRA-S0280.
ELRA-S0279 SmartWeb Motorbike Corpus (SMC)
This corpus contains recordings spoken by 36 speakers in a human-machine query situation on a running motorcycle (BMW). Bikers were asked to solve several tasks with a spoken query system to the WWW using an integrated system connected to a speech server via a UMTS connection. Recorded channels are the Bluetooth helmet microphone over UMTS (telephone quality) and, in part, the Bluetooth helmet microphone and an additional neck microphone in high quality. See also ELRA-S0278 and ELRA-S0280.
ELRA-S0280 SmartWeb Video Corpus (SVC)
This multimodal corpus contains 99 recordings, each containing a human-human-machine dialogue: one speaker (who is being recorded) interacts with a human partner as well as with a dialogue system via a smart phone (SmartWeb system). See also ELRA-S0278 and ELRA-S0279.
June 08
Update - ELRA-S0242 SALA II US English database
The SALA II US English database comprises 4,090 US English speakers (2,017 males, 2,073 females, including some speakers with Hispanic accents) recorded over the United States mobile telephone network.
ELRA-L0085 euLEX (Lexical Database for Basque)
euLEX is a general lexicon which contains 115,000 entries, divided into 94,000 dictionary entries or lemmas, 12,000 allomorphs, 7,500 verb forms and about 1,200 dependent morphemes. All entries include linguistic information such as morphology and usage. The lexicon is in XML.
May 08

No LR announced. 
April 08

ELRA-S0273 LC-STAR Slovenian Phonetic lexicon 
The LC-STAR Slovenian Phonetic lexicon comprises 110,900 entries, including a set of 64,521 common words, a set of 45,012 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 5,491 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0274 LC-STAR English-Slovenian Bilingual Aligned Phrasal lexicon 
The LC-STAR English-Slovenian Bilingual Aligned Phrasal lexicon comprises 12,722 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English 10,522 phrase corpus. The lexicon is provided in XML format.
March 08

ELRA-S0272 MEDIA speech database for French 
The MEDIA speech database for French was produced by ELDA within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It contains 1,258 transcribed dialogues from 250 adult speakers. The method chosen for the corpus construction process is that of a ‘Wizard of Oz’ (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of tourism and hotel reservation. The semantic annotation of the corpus is available in this catalogue and referenced ELRA-E0024 (MEDIA Evaluation Package).  
 
February 08

ELRA-S0269 LC-STAR Greek Phonetic lexicon 
The LC-STAR Greek Phonetic lexicon comprises 110,708 entries, including a set of 57,519 common words, a set of 45,162 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,027 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0270 LC-STAR Italian Phonetic lexicon
The LC-STAR Italian Phonetic lexicon comprises 109,712 entries, including a set of 56,420 common words, a set of 45,253 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,039 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0271 LC-STAR English-Italian Bilingual Aligned Phrasal lexicon
The LC-STAR English-Italian Bilingual Aligned Phrasal lexicon comprises 10,466 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English corpus of 10,524 phrases. The lexicon is provided in XML format.
 
January 08

ELRA-S0268 UPC-TALP database of isolated meeting-room acoustic events
This database has been produced within the CHIL Project (Computers in the Human Interaction Loop), in the framework of an Integrated Project (IP 506909) under the European Commission’s Sixth Framework Programme. It contains a set of isolated acoustic events that occur in a meeting room environment and that were recorded for the CHIL Acoustic Event Detection (AED) task. The database can be used as training material for AED technologies as well as for testing AED algorithms in quiet environments without temporal sound overlapping. Approximately 60 sounds per sound class were recorded. Ten people (5 men and 5 women) participated in three sessions. During each session, a person had to produce a complete set of sounds twice.


2007
Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec


December 07

ELRA-S0244 Japanese Speecon database
The Japanese Speecon database comprises the recordings of 556 adult Japanese speakers and 51 child Japanese speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0265 Dutch from Belgium Speecon Database
The Dutch from Belgium Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0266 Dutch from the Netherlands Speecon Database
The Dutch from the Netherlands Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0267 Danish Speecon Database
The Danish Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0258 Orientel United Arab Emirates MCA (Modern Colloquial Arabic) 
This speech database contains the recordings of 750 Arabic speakers recorded over the United Arab Emirates’ fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0259 Orientel United Arab Emirates MSA (Modern Standard Arabic)
This speech database contains the recordings of 500 Arabic speakers recorded over the United Arab Emirates’ fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0260 Orientel English as spoken in the United Arab Emirates
This speech database contains the recordings of 500 speakers of English recorded over the United Arab Emirates’ fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.
ELRA-S0261 Hungarian SpeechDat(E) Database
This speech database contains the recordings of 1,000 Hungarian speakers recorded over the Hungarian fixed telephone network. Each speaker uttered around 50 read and spontaneous items.
ELRA-S0262 SALA II Portuguese from Brazil database
The SALA II Portuguese from Brazil database comprises 1,000 Brazilian speakers recorded over the Brazilian mobile telephone network.
ELRA-S0263 SALA II Spanish from Colombia Database
The SALA II Spanish from Colombia database comprises 1,000 Colombian speakers recorded over the Colombian mobile telephone network.
ELRA-S0264 SALA II US Spanish West
The SALA II US Spanish West database comprises 1,000 Spanish speakers recorded over the American mobile telephone network.
ELRA-S0255 LC-STAR Finnish Phonetic lexicon 
The LC-STAR Finnish Phonetic lexicon comprises 189,409 entries, including a set of 144,233 common words, a set of 45,176 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 13,068 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0256 LC-STAR Mandarin Chinese Phonetic lexicon
The LC-STAR Mandarin Chinese Phonetic lexicon comprises 104,368 entries, including a set of 38,098 common words, a set of 57,528 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,522 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0257 LC-STAR English-Finnish Bilingual Aligned Phrasal lexicon 
The LC-STAR English-Finnish Bilingual Aligned Phrasal lexicon comprises 10,520 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English corpus of 10,518 phrases. The lexicon is provided in XML format.
 
November 07

ELRA-S0249 TC-STAR English Training Corpora for ASR: Transcriptions of EPPS Speech 
This corpus consists of transcriptions from 92 hours of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English (a mixture of native and non-native English). The transcription files are stored in Transcriber XML file format. For corresponding recordings, see ELRA-S0251.
ELRA-S0251 TC-STAR English Training Corpora for ASR: Recordings of EPPS Speech 
This corpus consists of the recordings of around 290 hours from EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English, 92 hours of which were annotated (transcribed); the transcriptions are not provided in the present package. Each file contains a single channel with 16-bit resolution at a sample rate of 16 kHz. For corresponding transcriptions, see ELRA-S0249.
ELRA-S0252 TC-STAR Spanish Training Corpora for ASR: Recordings of EPPS Speech 
This corpus consists of the recordings of around 283 hours from EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European Spanish (a mixture of native and non-native Spanish). Each file contains a single channel with 16-bit resolution at a sample rate of 16kHz.
ELRA-S0253 TC-STAR English Test Corpora for ASR 
This corpus consists of 70 hours of recordings of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English and other European languages. From this corpus, 16 hours of English speeches (native or non native) were annotated (transcribed). Each speech file contains a single channel with 16-bit resolution at a sample rate of 16kHz. The transcription files are stored in Transcriber XML file format.
ELRA-S0254 TC-STAR Spanish Test Corpora for ASR
This corpus consists of 174 hours of recordings of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European Spanish and other European languages. From this corpus, 16 hours of Spanish speeches were annotated (transcribed). Each audio file contains a single channel with 16-bit resolution at a sample rate of 16kHz. The transcription files are stored in Transcriber XML file format.
ELRA-S0250 TC-STAR English-Spanish Training Corpora for Machine Translation: Aligned Final Text Editions of EPPS 
This corpus consists of respectively 34 million (English) and 38 million (Spanish) running words of bilingual sentence-segmented and aligned texts in English and Spanish obtained from the Final Text Editions provided by the European Parliament (from April 1996 to Sept. 2004, Dec. 2004 to May 2005, and Dec. 2005 to May 2006). The data is accompanied by tools for further preprocessing.
ELRA-S0245 LC-STAR German Phonetic lexicon 
The LC-STAR German Phonetic lexicon comprises 102,169 entries, including a set of 55,507 common words, a set of 46,662 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 6,763 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0246 LC-STAR German Phonetic lexicon in the Touristic Domain 
The LC-STAR German Phonetic lexicon in the Touristic Domain comprises 8,782 entries from the following categories: nouns, adjectives and verbs. For each entry the following information is provided: orthographic form, part-of-speech (POS), phonemic transcription. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0247 LC-STAR Standard Arabic Phonetic lexicon 
The LC-STAR Standard Arabic Phonetic lexicon comprises 110,271 entries, including a set of 52,981 common words, a set of 50,135 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,155 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0248 LC-STAR English-German Bilingual Aligned Phrasal lexicon 
The LC-STAR English-German Bilingual Aligned Phrasal lexicon comprises 10,733 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English corpus of 10,518 phrases. The lexicon is provided in XML format.
 
October 07

ELRA-L0084 Macedonian Morphological Lexicon (MACPLEX) 
MACPLEX comprises two dictionaries: a dictionary of lemmas (over 80,000 entries) and a dictionary of word forms (over 1,300,000 entries). Morphological information (PoS, gender, case, definiteness, number for nouns, tense, person, etc. for verbs) is available for each entry. Out of the more than 1,300,000 word forms, there are 345,350 nouns, 467,744 adjectives, 500,220 verbs and 19,472 adverbs. The remaining entries correspond to pronouns, adpositions, conjunctions and numerals. The lexicon is available in Unicode.
ELRA-S0242 SALA II US English database
The SALA II US English database comprises 3,065 US English speakers (1,515 males, 1,550 females, including some speakers with Hispanic accents) recorded over the United States mobile telephone network.
ELRA-S0243 SpeechDat Catalan FDB database
The SpeechDat Catalan FDB database contains the recordings of 1,005 Catalan speakers (474 males, 531 females) recorded over the Spanish fixed telephone network.
AURORA-CD0005 AURORA-5
The AURORA-5 database has been developed mainly to investigate how hands-free speech input in noisy room environments influences the performance of automatic speech recognition. Two further test conditions are included to study the influence of transmitting the speech over a mobile communication system.
It contains artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8000 Hz, as well as a set of scripts for running recognition experiments on those speech data. The experiments rely on the freely available HTK software package, which is not part of this resource.
TC-STAR Evaluation Packages
The Evaluation Packages below include the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) and Spoken Language Translation (SLT) third evaluation campaign, as well as the material used for the TC-STAR 2006 and 2007 End-to-End task. They include resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
ELRA-E0025 TC-STAR 2007 Evaluation Package - ASR English 
ELRA-E0026-01 TC-STAR 2007 Evaluation Package - ASR Spanish - CORTES
ELRA-E0026-02 TC-STAR 2007 Evaluation Package - ASR Spanish - EPPS
ELRA-E0027 TC-STAR 2007 Evaluation Package - ASR Mandarin Chinese
ELRA-E0028 TC-STAR 2007 Evaluation Package - SLT English-to-Spanish
ELRA-E0029-01 TC-STAR 2007 Evaluation Package - SLT Spanish-to-English - CORTES
ELRA-E0029-02 TC-STAR 2007 Evaluation Package - SLT Spanish-to-English - EPPS
ELRA-E0030 TC-STAR 2007 Evaluation Package - SLT Chinese-to-English
ELRA-E0031 TC-STAR 2006 Evaluation Package – End-to-End
ELRA-E0032 TC-STAR 2007 Evaluation Package – End-to-End
September 07

No LR announced. 
August 07

Update - ELRA-W0036-02 "Le Monde Diplomatique" Text corpus in French - archives from 1999
Electronic archiving of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each HTML file contains one article.
ELRA-L0076 Polderland Dutch Lexicon of Abbreviations and Acronyms
The lexicon contains 2,180 Dutch abbreviations and acronyms. It complies with the official Dutch Spelling (2005/6). Each entry consists of an ID, word form, lemma and part of speech.
ELRA-L0077 Polderland Dutch General Lexicon
The lexicon contains 400,463 Dutch words, comprising 236,369 nouns, 90,882 adjectives, 69,744 verbs, 2,120 adverbs, and 1,348 items from other categories (pronouns, determiners, articles, adpositions, conjunctions, numerals, etc.). It complies with the official Dutch Spelling (2005/6). The lexicon contains an ID, word form, lemma and part of speech.
ELRA-L0078 Polderland Dutch Lexicon of Names
The lexicon contains 24,247 Dutch proper names. Various sorts of proper names are included, such as first names, last names, geographical names etc. Each entry contains an ID, word form, lemma, part of speech and proper name type.
ELRA-L0079 Polderland Dutch Lexicon of Business Terminology
The lexicon contains 15,987 Dutch words from the business domain, comprising 13,774 nouns, 1,267 adjectives, 895 verbs, 9 adverbs, and 42 items from other categories. It complies with the official Dutch Spelling (2005). Each entry contains an ID, word form and part of speech.
ELRA-L0080 Polderland Dutch Lexicon of Legal Terminology
The lexicon contains 6,207 Dutch words from the legal domain, comprising 4,781 nouns, 810 adjectives, 573 verbs, 12 adverbs and 31 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.
ELRA-L0081 Polderland Dutch Lexicon of Medical Terminology
The lexicon contains 17,115 Dutch words from the medical domain, comprising 12,638 nouns, 3,107 adjectives, 1,273 verbs, 11 adverbs and 86 items from other categories. The lexicon complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.
ELRA-L0082 Polderland Dutch Lexicon of Social Terminology
The lexicon contains 12,551 Dutch words from the social domain, comprising 9,984 nouns, 1,306 adjectives, 1,161 verbs, 56 adverbs and 44 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.
ELRA-L0083 Polderland Dutch Lexicon of Technical Terminology
The lexicon contains 9,940 Dutch words from the technical/scientific domain, comprising 8,832 nouns, 950 adjectives, 111 verbs, 2 adverbs and 45 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.
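The category breakdowns quoted for the Polderland lexicons above can be checked against their stated totals. The following is a minimal sketch of that arithmetic only: the figures are copied from the entries above, while the dictionary keys and the function name `check_totals` are illustrative and not part of any ELRA distribution.

```python
# Figures as quoted in the catalogue entries above:
# stated total, then [nouns, adjectives, verbs, adverbs, other categories].
LEXICONS = {
    "General lexicon (400,463 words)": (400_463, [236_369, 90_882, 69_744, 2_120, 1_348]),
    "ELRA-L0079 Business": (15_987, [13_774, 1_267, 895, 9, 42]),
    "ELRA-L0080 Legal": (6_207, [4_781, 810, 573, 12, 31]),
    "ELRA-L0081 Medical": (17_115, [12_638, 3_107, 1_273, 11, 86]),
    "ELRA-L0082 Social": (12_551, [9_984, 1_306, 1_161, 56, 44]),
    "ELRA-L0083 Technical": (9_940, [8_832, 950, 111, 2, 45]),
}


def check_totals(lexicons):
    """Return, per lexicon, the stated total minus the sum of its categories."""
    return {name: total - sum(parts) for name, (total, parts) in lexicons.items()}


if __name__ == "__main__":
    for name, diff in check_totals(LEXICONS).items():
        status = "consistent" if diff == 0 else f"off by {diff}"
        print(f"{name}: {status}")
```

A difference of zero means the per-category counts add up exactly to the announced size of the lexicon.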
 
July 07

ELRA-E0018 ARCADE II Evaluation Package
The ARCADE II Evaluation Package was produced within the French national project ARCADE II (Evaluation of parallel text alignment systems), as part of the Technolangue programme. The ARCADE II project made it possible to carry out an evaluation campaign in the field of multilingual alignment.
The campaign is divided into two actions: sentence alignment and translation of named entities.
ELRA-E0019 CESART Evaluation Package
The CESART Evaluation Package was produced within the French national project CESART (Evaluation of terminology extraction tools), as part of the Technolangue programme. The CESART project made it possible to carry out a campaign for the evaluation of tools that acquire terminological resources.
The campaign is divided into two actions: term extraction and relation extraction.
ELRA-E0020 CESTA Evaluation Package 
The CESTA Evaluation Package was produced within the French national project CESTA (Evaluation of MT systems), as part of the Technolangue programme. The CESTA project made it possible to carry out a campaign for the evaluation of machine translation technologies.
The campaign is divided into two actions: evaluation on an unrestricted vocabulary and evaluation on a specialised domain (evaluation after terminology enrichment).
ELRA-E0021 ESTER Evaluation Package 
The ESTER Evaluation Package was produced within the French national project ESTER (Evaluation of Broadcast News enriched transcription systems), as part of the Technolangue programme. The ESTER project made it possible to carry out a campaign for the evaluation of Broadcast News enriched transcription systems for French.
The campaign is divided into three actions: orthographic transcription, segmentation and information extraction (named entity tracking).
For research or commercial use of this database, please refer to ELRA-S0241 ESTER Corpus.
ELRA-E0022 EQueR Evaluation Package 
The EQueR Evaluation Package was produced within the French national project EQueR (Evaluation campaign for Question-Answering systems), as part of the Technolangue programme. The EQueR project made it possible to carry out a campaign for the evaluation of Question-Answering systems in French.
The campaign is divided into two actions: one generic task and one specialised task (medical domain).
ELRA-E0023 EvaSy Evaluation Package 
The EvaSy Evaluation Package was produced within the French national project EvaSy (Evaluation of speech synthesis systems), as part of the Technolangue programme. The EvaSy project made it possible to carry out a campaign for the evaluation of speech synthesis systems using French text data.
The campaign is divided into three actions: evaluation of grapheme-to-phoneme conversion, evaluation of prosody, and global evaluation of the quality of speech synthesis systems.
ELRA-E0024 MEDIA Evaluation Package 
The MEDIA Evaluation Package was produced within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme. The MEDIA project made it possible to carry out a campaign for the evaluation of man-machine dialogue systems for French.
The campaign is divided into two actions: an evaluation that takes the dialogue context into account and an evaluation that does not.
 
June 07

ELRA-M0038 SCI-ANAL English-German Bilingual Dictionary 
This bilingual dictionary contains 59,758 pairs of English-German terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by ";".
See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0037.
Update - ELRA-M0037 SCI-ANES English-Spanish Bilingual Dictionary
This bilingual dictionary contains around 60,000 pairs of English-Spanish terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by ";".
See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0038.
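Both bilingual dictionaries above are described as tables in which the information for each entry is separated by ";". A minimal sketch of reading such a file is given below. The column order (source term, target term, part of speech), the field names and the sample rows are assumptions made for illustration; the actual layout and file encoding should be checked against the documentation delivered with the resource.

```python
import csv
import io

# Hypothetical column layout; the real files may order or name fields differently.
FIELDS = ("source", "target", "pos")


def read_pairs(text):
    """Parse ';'-separated lines into one dict per dictionary entry."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [
        dict(zip(FIELDS, (cell.strip() for cell in row)))
        for row in reader
        if row  # skip blank lines
    ]


# Invented sample rows, not taken from the actual dictionaries.
sample = "house;Haus;noun\nto run;laufen;verb\n"
pairs = read_pairs(sample)
for pair in pairs:
    print(pair["source"], "->", pair["target"], f"({pair['pos']})")
```

Using the `csv` module rather than a plain `split(";")` keeps the sketch tolerant of quoted fields, should the data contain any.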
ELRA-S0240 French-Canadian Speecon database
The French-Canadian Speecon database comprises the recordings of 550 adult French-Canadian speakers and 50 child French-Canadian speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).
 
May 07

ELRA-W0047 Catalan Corpus of News Articles
The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. The articles are grouped by quarter, with no chronological order within each group.
ELRA-L0075 Bulgarian Linguistic Database
This database contains 81,647 entries in Bulgarian with a linguistic environment tool (for Windows XP). The data may be used for morphological analysis and synthesis, syntactic agreement checking, and phonetic stress determination.
ELRA-S0238 MIST Multi-lingual Interoperability in Speech Technology database
The MIST Multi-lingual Interoperability in Speech Technology database comprises the recordings of 74 native Dutch speakers (52 males, 22 females) who uttered 10 sentences in each of Dutch, English, French and German: 5 sentences per language that are identical for all speakers and 5 sentences per language that are unique to each speaker. Dutch sentences are orthographically annotated.
ELRA-S0239 N4 (NATO Native and Non Native) database
The N4 (NATO Native and Non Native) database comprises speech data recorded in the naval transmission training centers of four countries (Germany, The Netherlands, United Kingdom, and Canada) during naval communication training sessions in 2000-2002. The material consists of native and non-native speakers using NATO Naval English procedure between ships, and reading from a text, "The North Wind and the Sun," in both English and the speaker’s native language. The audio material was recorded on DAT and downsampled to 16 kHz, 16 bit, and all the audio files have been manually transcribed and annotated with speaker identities using the Transcriber tool.

April 07

ELRA-M0043 Russian => English MT optimized lexicon in OLIF XML
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 99,211 entries in its source language (Russian) and 134,828 entries in its target language (English). The source entries are distributed as follows: 64,487 nouns, 11,470 adjectives, 19,724 verbs, 1,762 adverbs, and 1,768 closed-class elements (interjections, special prefixes, suffixes, etc.). Nouns contain gender and number information and verbs provide details on aspect and reflexivity. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Moreover, definitions are available for 59,775 entries, as well as collocational information for 39,148 entries.
ELRA-M0044 English => Swahili Bilingual Lexicon
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 58,247 entries in English and 58,300 in Swahili. The source entries are distributed as follows: 36,046 nouns, 3,013 adjectives, 18,308 verbs and 880 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 17,570 entries.
ELRA-M0045 Cebuano => English Bilingual Lexicon
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 1,988 entries in Cebuano and 1,990 in English. The source entries are distributed as follows: 1,052 nouns, 462 adjectives, 405 verbs and 69 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 500 entries.
ELRA-M0046 English => Czech Bilingual Lexicon
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 31,718 entries in English and 32,125 in Czech. The source entries are distributed as follows: 17,797 nouns, 7,748 adjectives, 6,039 verbs and 134 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 3,065 entries.
Update - ELRA-S0226-01 IDIOLOGOS 1 “Bootstrap” (NEOLOGOS Project)
It contains the recordings of 1,000 French adult speakers (470 males and 530 females) recorded over the French fixed telephone network. The speakers uttered 45 phonetically rich sentences. The 45 sentences were the same for all speakers.
Update - ELRA-S0226-02 IDIOLOGOS 2 “Eigenspeakers” (NEOLOGOS Project)
It contains the recordings of 200 French adult speakers (97 males and 103 females) recorded over the French fixed telephone network. The speakers uttered 45 sentences per call with 10 calls per speaker. The 450 sentences per speaker are common to all speakers. Speakers were selected from the IDIOLOGOS 1 “Bootstrap” database.
ELRA-S0275 Slovenian BNSI Broadcast News Speech Corpus
This speech database consists of TV news shows (both the evening news, “TV Dnevnik”, and the late night news, “Odmevi”) from the archive of the Slovenian national broadcaster, RTV Slovenia. The recordings took place between June 1999 and May 2003. The database comprises a total of 36 hours of recordings, transcribed and manually checked using the Transcriber tool. 1,565 speakers were recorded (1,069 males, 477 females, 19 unspecified).
 
March 07

Update - ELRA-W0015 Text corpus of "Le Monde"
Corpus from "Le Monde" newspaper. Years 1987 to 2002 are available in ASCII text format. Years 2003 to 2006 are available in XML format. Each month consists of some 10 MB of data (circa 120 MB per year).
ELRA-S0235 LC-STAR Hebrew (Israel) phonetic lexicon 
The LC-STAR Hebrew (Israel) phonetic lexicon comprises 109,580 words, including a set of 62,431 common words, a set of 47,149 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,677 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0236 LC-STAR English-Hebrew (Israel) Bilingual Aligned Phrasal lexicon 
The LC-STAR English-Hebrew (Israel) Bilingual Aligned Phrasal lexicon comprises 10,520 phrases from the tourist domain. It is based on a list of short sentences obtained by translating a US English corpus of 10,449 phrases. The lexicon is provided in XML format.
ELRA-S0237 LC-STAR US English phonetic lexicon 
The LC-STAR US English phonetic lexicon comprises 102,310 words, including a set of 51,119 common words, a set of 51,111 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 6,807 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
 
February 07

ELRA-S0234 SALA Spanish Chilean Database
The SALA Spanish Chilean Database comprises 1,024 Chilean speakers (477 males, 547 females) recorded over the Chilean fixed telephone network.
ELRA-S0232 Swiss-German Speecon database
The Swiss-German Speecon database comprises the recordings of 550 adult Swiss-German speakers and 50 child Swiss-German speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).
ELRA-S0233 US English Speecon database
The US English Speecon database comprises the recordings of 550 adult US English speakers and 50 child US English speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).
ELRA-S0157 NetDC Arabic BNSC (Broadcast News Speech Corpus)
The NetDC Arabic BNSC (Broadcast News Speech Corpus) is a corpus developed by ELDA in the framework of the European-funded project Network of Data Centres (NetDC). The project was carried out in collaboration with the LDC (Linguistic Data Consortium), which has produced a similar corpus from the news broadcast by Voice of America Arabic in the United States. The database contains ca. 22.5 hours of broadcast news speech recorded from Radio Orient (France) during a 3-month period.
ELRA-S0229 LC-STAR Turkish lexicon
The LC-STAR Turkish lexicon comprises 104,513 words, including a set of 59,213 common words and a set of 45,300 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
ELRA-S0230 LC-STAR Russian lexicon
The LC-STAR Russian lexicon comprises about 128,000 words, including a set of 77,154 common words and a set of 51,074 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
ELRA-S0231 LC-STAR English-Russian Bilingual Aligned Phrasal lexicon
The LC-STAR English-Russian Bilingual Aligned Phrasal lexicon comprises 10,519 phrases from the tourist domain. It is based on a list of short sentences obtained by translating a US English corpus of 10,000 phrases. The lexicon is provided in XML format.
Update – ELRA-S0207 LC-STAR Catalan phonetic lexicon
The LC-STAR Catalan phonetic lexicon comprises more than 100,000 words, including a set of more than 45,000 common words and a set of more than 45,000 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
Update – ELRA-S0208 LC-STAR Spanish phonetic lexicon
The LC-STAR Spanish phonetic lexicon comprises more than 100,000 words, including a set of more than 45,000 common words and a set of more than 45,000 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
January 07

ELRA and Beijing Haitian Ruisheng Science Technology Ltd today signed a major Language Resources distribution agreement. On behalf of ELRA, ELDA will act as the distribution agency for Beijing Haitian Ruisheng Science Technology Ltd and will add to the ELRA Language Resources catalogue a large number of speech resources designed and collected to boost speech synthesis and speech recognition. The resources mainly cover Mandarin Chinese, with some coverage of Korean and Japanese.
With over 60 new resources, ELDA is strengthening its position as the leading worldwide distribution centre. With this agreement, Beijing Haitian Ruisheng Science Technology Ltd will gain visibility, in particular on the European market.
List of available Speech Resources
List of available Written Corpora
ELRA-L0074 POLEX Polish Lexicon
The POLEX Polish Lexicon is a morphological dictionary of the Polish language. It comprises about 100,000 entries. The POLEX dictionary includes the core Polish vocabulary of general interest. It is based on a precise machine-interpretable formalism (coding system), the same for all categories (classes of speech). The dictionary entries are of the following form: BASIC_FORM+LIST_OF_STEMS+PARADIGMATIC_CODE+DISTRIBUTION_OF_STEMS
It contains more than 42,000 nouns, 12,000 verbs, 15,000 adjectives, 25,000 participles, and about 200 pronouns. A simple lemmatiser (in the form of a Prolog prototype) is also included.