LRs Announcements

YEAR 2014 | 2013 | 2012 | 2011 | 2010 | 2009 | 2008 | 2007

Check out the Language Resources that have been announced in 2007, 2008, 2009, 2010, 2011, 2012, 2013 and 2014.


2014
Jan | Mar


March 14
ELRA-E0042 CLEFeHealth 2013 Evaluation Package
The CLEFeHealth 2013 Task 3 Evaluation Package contains data used for the User-centred health information retrieval Shared task at the CLEFeHealth Lab conducted in 2013. Task 3 aimed at evaluating information retrieval to address questions patients may have when reading clinical reports.

January 14
ELRA-W0076 Nepali Monolingual written corpus
The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample (CS) represents the collection of Nepali written texts from 15 different genres with 2000 words each published between 1990 and 1992. It is based on FLOB/FROWN corpora and contains 802,000 words. The general corpus (GC) consists of written texts collected opportunistically from a wide range of sources such as websites, newspapers, books, publishers and authors. It contains 1,400,000 words.
ELRA-W0077 English-Nepali Parallel Corpus
This corpus consists of a collection of national development texts in English and Nepali. A small set of data is aligned at the sentence level (27,060 English words; 21,756 Nepali words), and a larger set of texts at the document level (617,340 English words; 596,571 Nepali words). An additional set of monolingual data in Nepali is also provided (386,879 words in Nepali).


2013
Feb | Jun | Sept | Nov | Dec


December 13
ELRA-S0365 aGender
aGender contains speech sample recordings over public telephone lines with read and (semi-)spontaneous speech. Native German speakers called a voice portal from their private phone, read text and answered some open questions. The corpus contains the voices of 945 German speakers (approx. minimum of 100 speakers per class), each delivering 18 speech items in up to six different sessions.
ELRA-W0074 Amharic-English bilingual corpus
The Amharic-English bilingual corpus contains parallel text from legal and news domains in Amharic script, in transliterated form and in English. The corpus contains 232,653 words in Amharic and 291,701 in English.

November 13
The GlobalPhone Pronunciation Dictionaries: GlobalPhone is a multilingual speech and text database collected at Karlsruhe University, Germany. The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 17 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swedish (about 25000 entries), Turkish (31330 entries/31087 words), Vietnamese (38504 entries/29974 words), Chinese-Mandarin (73388 pronunciations), and Korean (3500 syllables).
*** NEW ***
ELRA-S0363 GlobalPhone Chinese-Mandarin Pronunciation Dictionary
ELRA-S0364 GlobalPhone Korean Pronunciation Dictionary

Special prices are offered for a combined purchase of several GlobalPhone languages.
Available GlobalPhone Pronunciation Dictionaries are listed below (click on the links for further details):
ELRA-S0340 GlobalPhone French Pronunciation Dictionary
ELRA-S0341 GlobalPhone German Pronunciation Dictionary
ELRA-S0348 GlobalPhone Japanese Pronunciation Dictionary
ELRA-S0350 GlobalPhone Arabic Pronunciation Dictionary
ELRA-S0351 GlobalPhone Bulgarian Pronunciation Dictionary
ELRA-S0352 GlobalPhone Czech Pronunciation Dictionary
ELRA-S0353 GlobalPhone Hausa Pronunciation Dictionary
ELRA-S0354 GlobalPhone Polish Pronunciation Dictionary
ELRA-S0355 GlobalPhone Portuguese (Brazilian) Pronunciation Dictionary
ELRA-S0356 GlobalPhone Swedish Pronunciation Dictionary
ELRA-S0358 GlobalPhone Croatian Pronunciation Dictionary
ELRA-S0359 GlobalPhone Russian Pronunciation Dictionary
ELRA-S0360 GlobalPhone Spanish (Latin American) Pronunciation Dictionary
ELRA-S0361 GlobalPhone Turkish Pronunciation Dictionary
ELRA-S0362 GlobalPhone Vietnamese Pronunciation Dictionary

September 13
The GlobalPhone Pronunciation Dictionaries: GlobalPhone is a multilingual speech and text database collected at Karlsruhe University, Germany. The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 15 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swedish (about 25000 entries), Turkish (31330 entries/31087 words), and Vietnamese (38504 entries/29974 words). Three other languages will also be released: Chinese-Mandarin, Korean and Thai.
Available GlobalPhone Pronunciation Dictionaries are listed below (click on the links for further details):
ELRA-S0358 GlobalPhone Croatian Pronunciation Dictionary
ELRA-S0359 GlobalPhone Russian Pronunciation Dictionary
ELRA-S0360 GlobalPhone Spanish (Latin American) Pronunciation Dictionary
ELRA-S0361 GlobalPhone Turkish Pronunciation Dictionary
ELRA-S0362 GlobalPhone Vietnamese Pronunciation Dictionary


June 13
The GlobalPhone Pronunciation Dictionaries: GlobalPhone is a multilingual speech and text database collected at Karlsruhe University, Germany. The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 10 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words) and Swedish (about 25000 entries). Eight other languages will also be released: Chinese-Mandarin, Croatian, Korean, Russian, Spanish (Latin American), Thai, Turkish, and Vietnamese.
Available GlobalPhone Pronunciation Dictionaries are listed below (click on the links for further details):
ELRA-S0340 GlobalPhone French Pronunciation Dictionary
ELRA-S0341 GlobalPhone German Pronunciation Dictionary
ELRA-S0348 GlobalPhone Japanese Pronunciation Dictionary
ELRA-S0350 GlobalPhone Arabic Pronunciation Dictionary
ELRA-S0351 GlobalPhone Bulgarian Pronunciation Dictionary
ELRA-S0352 GlobalPhone Czech Pronunciation Dictionary
ELRA-S0353 GlobalPhone Hausa Pronunciation Dictionary
ELRA-S0354 GlobalPhone Polish Pronunciation Dictionary
ELRA-S0355 GlobalPhone Portuguese (Brazilian) Pronunciation Dictionary
ELRA-S0356 GlobalPhone Swedish Pronunciation Dictionary

ELRA-E0041 CHIL 2007+ Evaluation Package
The CHIL Seminars are scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. The language is European English spoken by non-native speakers. The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, close-talking and far-field microphone data of the speaker’s voice and background sounds. The CHIL 2007+ Evaluation Package includes: 1) CHIL 2007 Evaluation Package (see ELRA-E0033) and 2) additional annotations which have been created within the scope of the Metanet4u Project (ICT PSP No 270893), sponsored by the European Commission.

February 13
ELRA-W0073 Quaero Old Press Extended Named Entity corpus
This corpus consists of the manual annotation of 76 newspaper issues published in 1890-1891 and provided by the French National Library (Bibliothèque Nationale de France). Three different titles are used (Le Temps, La Croix and Le Figaro) for a total of 295 pages. The corpus is fully manually annotated according to the Quaero extended and structured named entity definition.
ELRA-S0349 Quaero Broadcast News Extended Named Entity corpus
This corpus consists of the manual annotation of (i) the ESTER 2 (see also ELRA-S0338) manual transcription corpus and (ii) the Quaero Speech Recognition Evaluation corpus (manual and automatic transcriptions coming from 3 different ASR systems). The corpus is fully manually annotated according to the Quaero extended and structured named entity definition.
ELRA-W0057 PANACEA English-French and English-Greek parallel corpus acquired for Environment domain
This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Environment domain automatically acquired from the web during 2010 and 2011. It was acquired in the framework of the PANACEA project. Data and language pairs are split into training, test and development test sets.
ELRA-W0058 PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain
This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Labour Legislation domain automatically acquired from the web during 2010 and 2011. It was acquired in the framework of the PANACEA project. Data and language pairs are split into training, test and development test sets.
ELRA-W0063 PANACEA Environment English monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the English language and were automatically classified as relevant to the "Environment" domain. It was constructed in the summer of 2011. It contains 50,541,538 tokens, divided into a total of 28,071 documents that were crawled from 3,121 web sites.
ELRA-W0064 PANACEA Labour English monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the English language and were automatically classified as relevant to the "Labour Legislation" domain. It was constructed in the summer of 2011. It contains 46,431,351 tokens, divided into a total of 15,197 documents that were crawled from 1,558 web sites.
ELRA-W0065 PANACEA Environment French monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the French language and were automatically classified as relevant to the "Environment" domain. It was constructed in the summer of 2011. It contains 47,364,125 tokens, divided into a total of 23,514 documents that were crawled from 1,969 web sites.
ELRA-W0066 PANACEA Labour French monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the French language and were automatically classified as relevant to the "Labour Legislation" domain. It was constructed in the summer of 2011. It contains 56,440,425 tokens, divided into a total of 26,675 documents that were crawled from 1,391 web sites.
ELRA-W0067 PANACEA Environment Greek monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek language and were automatically classified as relevant to the "Environment" domain. It was constructed in the summer of 2011. It contains 27,958,530 tokens, divided into a total of 16,073 documents that were crawled from 1,063 web sites.
ELRA-W0068 PANACEA Labour Greek monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek language and were automatically classified as relevant to the "Labour Legislation" domain. It was constructed in the summer of 2011. It contains 21,077,196 tokens, divided into a total of 7,124 documents that were crawled from 598 web sites.
ELRA-W0069 PANACEA Environment Italian monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian language and were automatically classified as relevant to the "Environment" domain. It was constructed in the summer of 2011. It contains 40,044,852 tokens, divided into a total of 16,159 documents that were crawled from 1,211 web sites.
ELRA-W0070 PANACEA Labour Italian monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian language and were automatically classified as relevant to the "Labour Legislation" domain. It was constructed in the summer of 2011. It contains 70,563,320 tokens, divided into a total of 12,706 documents that were crawled from 864 web sites.
ELRA-W0071 PANACEA Environment Spanish monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish language and were automatically classified as relevant to the "Environment" domain. It was constructed in the summer of 2011. It contains 46,225,624 tokens, divided into a total of 26,009 documents that were crawled from 2,053 web sites.
ELRA-W0072 PANACEA Labour Spanish monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish language and were automatically classified as relevant to the "Labour Legislation" domain. It was constructed in the summer of 2011. It contains 53,922,118 tokens, divided into a total of 13,188 documents that were crawled from 1,015 web sites.


2012
Jan | Jul | Sep | Nov | Dec


December 12
ELRA-W0059 LT Corpus
The LT Corpus is composed of 70 fiction texts from renowned Portuguese authors. The corpus contains 1,781,083 tokens. The texts date from before 1940. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file has one token per line, followed by POS tag and lemma, and is annotated for NP chunks.
ELRA-W0060 PTPARL Corpus
The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. The corpus contains 1,000,441 tokens. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file has one token per line, followed by POS tag and lemma, and is annotated for NP chunks.
ELRA-W0061 CINTIL-DependencyBank
The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency graphs and grammatical function tags composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) that are used for regression testing of the computational grammar that supported the annotation of the corpus.
ELRA-W0062 CINTIL-DeepBank
The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical representations, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus.

November 12
ELRA-S0347 GlobalPhone Hausa
The GlobalPhone Hausa corpus contains 7,895 utterances spoken by 33 male and 69 female speakers in the age range of 16 to 60 years. Native speakers of Hausa were asked to read prompted sentences of newspaper articles. The entire collection took place in 5 different locations in Cameroon. The speech data contains a variety of accents: Maroua, Douala, Yaoundé, Bafoussam, Ngaoundéré, and Nigeria.

September 12
ELRA-S0345 Spoken Portuguese Corpus
The Spoken Portuguese corpus consists of a total of 86 recordings (8h44m), collected among sociolinguistically diverse speakers having Portuguese as mother tongue or as second language.
ELRA-S0346 Fundamental Portuguese Corpus
The Fundamental Portuguese Corpus is a corpus of spoken language, collected between 1970 and 1974, composed of 1800 recordings (500 hours) made in Continental Portugal and the Islands. Of these 1800 conversations, a sample was selected and transcribed.
ELRA-W0055 CINTIL-TreeBank
The CINTIL-TreeBank is a corpus of syntactic constituency trees of Portuguese texts composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), novels (399 sentences; 3,082 tokens).
ELRA-W0056 CINTIL-PropBank
The CINTIL-PropBank is a corpus of sentences annotated with their constituency structure and semantic role tags, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens).

July 12
ELRA-S0343 VERIF1DE
The speech corpus VERIF1DE contains 20 recordings (sessions) for each of 150 German speakers over the telephone network (10 sessions over fixed network and 10 sessions over GSM). Each session contains 40 single recordings, mainly speech read from a prompt sheet.
ELRA-S0344 LILA Hindi Belt database
The LILA Hindi Belt database comprises 2,023 Hindi speakers (1,011 males and 1,012 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered 83 read and spontaneous items.
ELRA-M0013 Bilingual Collocational Dictionary (Horst Bogatz)
This new release contains 69,000 English headwords (instead of 40,000 for the previous release). The bilingual English-German collocational dictionary consists of around 69,000 English headwords, including concepts expressed with more than one word (e.g. "the awareness of the environment" or "lame duck") and hyphenated compounds. It contains verbs, adjectives, synonyms and phrases that collocate with the headword. It provides the German equivalents for the headwords as well as their English synonyms.

January 12
ELRA-S0324 Catalan-SpeechDat For the Fixed Telephone Network Database
This speech database contains the recordings of 2000 Catalan speakers who called from fixed telephones and who were recorded over the fixed PSTN using an ISDN-BRI interface. Each speaker uttered around 50 read and spontaneous items. The speech database follows the specifications made within the SpeechDat (II) project. The database was validated by UVIGO. The Catalan-SpeechDat for the Fixed Telephone Network Database was funded by the Catalan Government.
ELRA-S0325 Catalan-SpeechDat for the Mobile Telephone Network Database
This speech database contains the recordings of 2000 Catalan speakers who called from GSM telephones and who were recorded over the fixed PSTN using an ISDN-BRI interface. Each speaker uttered around 50 read and spontaneous items. The speech database follows the specifications made within the SpeechDat (II) project. The database was validated by UVIGO. The Catalan-SpeechDat for the Mobile Telephone Network Database was funded by the Catalan Government.
ELRA-S0326 Catalan SpeechDat-Car database
The Catalan SpeechDat-Car database contains the in-car recordings of 300 speakers who uttered around 120 read and spontaneous items. Each speaker recorded two sessions. Recordings have been made through 4 different channels, via in-car microphones (1 close-talk microphone, 3 far-talk microphones). The 300 Catalan speakers were selected from 5 different dialectal regions and are balanced in gender and age groups. The database was validated by UVIGO. The Catalan-SpeechDat-Car Database was funded by the Catalan Government.
ELRA-S0327 Catalan Speecon database
The Catalan Speecon database comprises the recordings of 550 adult Catalan speakers who uttered over 290 items (read and spontaneous). The data were recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place). The speech database follows the specifications made within the EU-funded Speecon project. The database was validated by UVIGO. The Catalan-Speecon Database was funded by the Catalan Government.
ELRA-S0328 Spanish EUROM.1
EUROM1 is a multilingual European speech database. It contains over 60 speakers per language who pronounced numbers, sentences, isolated words … using a close-talking microphone in an anechoic room. Equivalent corpora for each of the European languages exist already, with the same number of speakers selected in the same way, and recorded in the same conditions with common file formats.
ELRA-S0329 Emotional speech synthesis database
This database contains the recordings of one male and one female professional Spanish speaker recorded in a noise-reduced room. It consists of recordings and annotations of read text material in neutral style plus six MPEG expressions, all in fast, slow, soft and loud speech styles. The text material is composed of 184 items including phonetically balanced sentences, digits and isolated words. The text material was the same for all the modes and styles, giving a total of 3h 59min of recorded speech for the male speaker and 3h 53min for the female speaker. The Emotional speech synthesis database was created within the scope of the EU-funded Interface project.
ELRA-S0330 FESTCAT Catalan TTS baseline male speech database
This database contains the recordings of one male Catalan professional speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. This database consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). The FESTCAT Catalan TTS Baseline Male Speech Database was created within the scope of the FESTCAT project, funded by the Catalan Government.
ELRA-S0331 FESTCAT Catalan TTS baseline female speech database
This database contains the recordings of one female Catalan professional speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). The FESTCAT Catalan TTS Baseline Female Speech Database was created within the scope of the FESTCAT project funded by the Catalan Government.
ELRA-S0332 FESTCAT Catalan TTS baseline speech database - 8 speakers
This database contains the recordings of four female and four male Catalan professional speakers recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It consists of the recordings and annotations of read text material of approximately 1 hour of speech per speaker for baseline applications (Text-to-Speech systems). The FESTCAT Catalan TTS baseline speech database - 8 speakers was created within the scope of the FESTCAT project funded by the Catalan Government.
ELRA-S0333 Spanish Festival HTS models - male speech
This database contains the Festival HTS models trained with 10h of speech from the TC-STAR Spanish Baseline Male Speech Database (ELRA-S0310).
ELRA-S0334 Spanish Festival HTS models - female speech
This database contains the Festival HTS models trained with 10h of speech from the TC-STAR Spanish Baseline Female Speech Database (ELRA-S0309).
ELRA-S0335 Bilingual (Spanish-English) Speech synthesis HTS models
This database contains Bilingual (English and Spanish) Festival HTS models. Models were trained with 9h of speech from 2 female bilingual speakers and 2 male bilingual speakers. Each speaker recorded 2h 15 min per language. The speech data can be found in the TC-STAR Bilingual Voice-Conversion Spanish Speech Database (ELRA-S0311) and in the TC-STAR Bilingual Expressive Spanish Speech Database (ELRA-S0313).
ELRA-S0336 Spanish Festival voice male
This database contains the recordings of one male Spanish speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. This comprises read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). The database includes Festival-compatible annotations. The recordings can also be found under TC-STAR Spanish Baseline Male Speech Database (ELRA-S0310).
ELRA-S0337 Spanish Festival voice female
This database contains the recordings of one female Spanish speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. This comprises read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). The database includes Festival-compatible annotations. The recordings can also be found under TC-STAR Spanish Baseline Female Speech Database (ELRA-S0309).


2011
Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec


November 11
ELRA-S0323 European Parliament Interpretation Corpus (EPIC)
The EPIC corpus is a parallel corpus of European Parliament speeches and their corresponding simultaneous interpretations. This corpus includes source speeches in Italian, English and Spanish and interpreted speeches in all possible combinations and directions. It contains a total of 357 speeches (177,295 words). The corpus has been orthographically transcribed. Non-tagged transcripts in text format are also available.

September 11
ELRA-S0319 GlobalPhone Bulgarian
ELRA-S0320 GlobalPhone Polish
ELRA-S0321 GlobalPhone Thai
ELRA-S0322 GlobalPhone Vietnamese

The GlobalPhone Corpus: The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of 19 spoken languages.
Update of ELRA-W0040 Venice Italian Treebank (VIT)
The new version of VIT has a totally revised constituent-based representation and a completely new dependency-based representation which has been achieved by semi-automatic procedures.
The VIT, Venice Italian Treebank contains about 272,000 words distributed over six different domains: bureaucratic, political, economic and financial, literary, scientific, and news. In addition, some 60,000 tokens of spoken dialogues in different Italian varieties were annotated.

April 11
ELRA-S0314 LILA Marathi database
The LILA Marathi database comprises 2,002 Marathi speakers (992 males and 1,010 females) recorded over the Indian mobile telephone network. Each speaker uttered around 46 read and spontaneous items.
ELRA-S0315 A-SpeechDB
A-SpeechDB© is an Arabic speech database that contains about 20 hours of continuous speech recorded through one desktop omni microphone by 205 native speakers (about 30% of females and 70% of males), aged between 20 and 45. Automatically generated transcriptions are provided with a manually revised version for each sentence.
ELRA-S0316 SmartKom Home (SKH)
Release SKH 1.0 contains 130 recordings in the technical setup ("scenario") SmartKom Home which should be an intelligent communication assistant for the private environment. Naive users were asked to test a "prototype" for a market study not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks in a period of 4.5 minutes while they were left alone with the system.
ELRA-S0317 SmartKom Mobil (SKM)
Release SKM 1.0 contains 146 recordings in the technical setup ("scenario") SmartKom Mobil which is a portable PDA equipped with a net link and additional intelligent communication devices. Naive users were asked to test a "prototype" for a market study not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks in a period of 4.5 minutes while they were left alone with the system.
ELRA-S0318 SmartKom Audio (SKAUDIO)
Release SKAUDIO 1.0 contains all audio channel recordings of the SmartKom corpora SmartKom Public (cf. ELRA-S0136), SmartKom Home (cf. ELRA-S0316) and SmartKom Mobil (cf. ELRA-S0317).


2010
Mar | Apr | Jun | Sep | Nov


November 10
ELRA-E0036 CLEF AdHoc-News Test Suites (2004-2008) - Evaluation Package
The CLEF AdHoc-News Test Suites (2004-2008) contain the data used for the main AdHoc track of the CLEF campaigns carried out from 2004 to 2008. This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual news collections.
ELRA-E0037 CLEF Domain Specific Test Suites (2004-2008) - Evaluation Package
The CLEF Domain Specific Test Suites (2004-2008) contain the data used for the Domain Specific track of the CLEF campaigns carried out from 2004 to 2008. This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual collections of scientific articles.
ELRA-E0038 CLEF Question Answering Test Suites (2003-2008) - Evaluation Package
The CLEF Question Answering Suites (2003-2008) contain the data used for the Question Answering (QA) track of the CLEF campaigns carried out from 2003 to 2008. This track tested the performance of monolingual, bilingual and multilingual Question Answering systems on multilingual collections of news documents.

September 10
ELRA-S0308 Egyptian Arabic Speecon database
The Egyptian Arabic Speecon database comprises the recordings of 550 adult Egyptian speakers and 50 child Egyptian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-W0054 Persian 1984 corpus (Multext-East framework)
This corpus contains the Persian (Farsi) translation of a part of the novel “1984” (G. Orwell) annotated in the Multext-East framework (Multilingual Text Tools and Corpora for Eastern and Central European Languages). The corpus contains approximately 100,000 words (6,604 sentences, 13,247 lemmas), with extensive headers and markup for document structure, sentences, and various sub-sentence annotations in the XML-format following the TEI guidelines. Annotation includes POS (part-of-speech) and lemmas.
ELRA-L0086 Persian Multext-East framework lexicon
This is a Persian (Farsi) morphosyntactic lexicon derived from the Persian 1984 corpus (Multext-East framework) (see ELRA-W0054). It contains the full inflectional paradigms of a superset of lemmas that appear in the Persian 1984 corpus. Each entry gives the word-form, its lemma and morphosyntactic description. The lexicon contains 13,247 entries.
ELRA-L0087 Persian lexicon
This is a Persian (Farsi) lexicon of more than 40,000 entries of non-inflected forms of words. Each word is transliterated based on the proposed framework from MBROLA (Text-To-Speech synthesizer). The database includes a large variety of descriptors for each entry (plural, homograph, …). The lexicon is provided in a MS Access database.

June 10
ELRA-T0374 Terminology database of natural sciences
This dictionary covers the three kingdoms: Animal, Vegetable, Mineral. It contains 50,000 species with numerous synonyms in French, English and Latin and many breeds and varieties. Minerals are given with their chemical formula. About 7,900 definitions in French are included. It also includes synonyms and linguistic variants.
ELRA-W0053 Catalan-Spanish Parallel Corpus
This corpus contains more than 100 million words drawn from 10 years of bilingual articles from “El Periódico de Catalunya”. The data are aligned at sentence level and stored in text files, on a one-sentence-per-line basis. The data are provided in plain text, with no encoding whatsoever.
Please note that the content and price of the following LRs have been updated:
ELRA-T0102 Terminology database of expressions
ELRA-T0103 Terminology database of finance
ELRA-T0367 Terminology database of telecommunication
April 10
ELRA-S0307 BABEL Polish database
The BABEL Polish Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304). It consists of the basic "common" set, which contains the Many Talker Set (30 males, 30 females), the Few Talker Set (5 males, 5 females) and the Very Few Talker Set (1 male, 1 female).
March 10
ELRA-S0305 EPAC Corpus: orthographic transcriptions
This corpus consists of approx. 100 hours of manual orthographic transcriptions, which were produced from 1,677 hours of non-transcribed recordings from the ESTER Evaluation Campaign (Technolangue programme). The corpus also includes automatic transcriptions of the full 1,677 hours.


2009
Feb | Mar | Apr | May | Jun | Jul | Aug | Sep


September 09
ELRA-S0301 Norwegian EUROM1 (EUROM1_N)
EUROM1 is the first really multilingual speech database produced in Europe. Over 60 speakers per language pronounced numbers, sentences and isolated words using a close-talking microphone.
ELRA-S0302 TC-STAR female baseline voice: Laura
Laura contains the recordings of one female English (British) speaker recorded in a noise-reduced room through a headset microphone. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems).
ELRA-S0303 TC-STAR male baseline voice: Ian
Ian contains the recordings of one male English (British) speaker recorded in a noise-reduced room through a headset microphone. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems).
ELRA-S0304 SpeechDat(M) Italian Mobile Network Speech Database
This speech database contains the recordings of 342 Italian speakers recorded over the Italian mobile telephone network. Each speaker uttered around 40 read and spontaneous items.
August 09
ELRA-T0373 BioLexicon
BioLexicon is a large-scale English terminological resource which has been developed to address the needs emerging in text mining efforts in the biomedical domain. It contains over 2.2M lexical entries (over 3.3M semantic relations), and information on over 1.8M variants and on over 2M synonymy relations. BioLexicon is available in a relational database format (MySQL dump format) and it adheres to the EAGLES/ISO standards for lexical resources.
July 09
ELRA-T0372 Multilingual Dictionary of Sports
This dictionary was produced within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. The results are presented in the form of MS ACCESS databases. The EuRADic sport dictionary is provided under the following different subsets:
ELRA-T0372-01 English-French-Greek-Arabic-German-Spanish-Portuguese multilingual database
ELRA-T0372-02 English-French bilingual database
ELRA-T0372-03 English-French-Greek trilingual database
ELRA-T0372-04 English-French-Arabic trilingual database
ELRA-T0372-05 English-French-German trilingual database
ELRA-T0372-06 English-French-Spanish trilingual database
ELRA-T0372-07 English-French-Portuguese trilingual database

ELRA-M0042 ItalWordNet (Italian WordNet)
ItalWordNet (Italian WordNet) is an updated version of the EuroWordNet Italian database.
ELRA-W0051 Persian-English parallel Corpus
The corpus consists of about 3,500,000 English and Persian words aligned at sentence level (about 100,000 sentences). The format of the files is Unicode.
ELRA-E0034 EASy Evaluation Package
The EASy Evaluation Package was produced within the French national project EASy (Evaluation of syntactic parsers of French), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT).
June 09
ELRA-W0050 The CINTIL Corpus – International Corpus of Portuguese
CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portuguese. It is composed of one million annotated tokens, each of which was verified by human expert annotators.
May 09
ELRA-M0048 LatinWordNet
LatinWordNet contains information about the following aspects of the Latin and English lexicon: lexical relations between words, semantic relations between lexical concepts, correspondences between Latin and English lexical concepts.
ELRA-M0049 Basque WordNet
The Basque WordNet models nouns, verbs and adjectives. Each sense is linked to a so-called synset (for a total of 30,281 Synsets). Every synset encodes the synonymy relation between (possibly) several words (synonyms), having a unique meaning, belonging to one and the same part of speech (specified in the POS tag value), and expressing the same lexical meaning.
ELRA-M0050 The MWN.PT - MultiWordnet of Portuguese
MWN.PT - MultiWordnet of Portuguese (version 1) spans over 17,200 manually validated concepts/synsets, linked under the semantic relations of hyponymy and hypernymy. These concepts are made of over 21,000 word senses/word forms and 16,000 lemmas from both European and American variants of Portuguese.
ELRA-S0300 SIGNUM Database
The SIGNUM Database contains both isolated and continuous utterances of various signers. The corpus was recorded on video. For quick random access to individual frames, each video clip is stored as a sequence of images.
April 09
ELRA-S0299 Alcohol Language Corpus (BAS ALC)
ALC contains recordings of 88 German speakers who are either intoxicated or sober. The type of speech ranges from read single digits to full conversation style.
March 09
ELRA-W0049 “Le Monde Diplomatique” Arabic tagged corpus
This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04). Three files are associated with each text: the raw text in Arabic, the vowelised text in Arabic, and one XML file containing the morphological annotation of the text.
ELRA-E0033 CHIL 2007 Evaluation Package
The CHIL 2007 Evaluation Package consists of the following contents:
  1. A set of audiovisual recordings of interactive seminars. The number of people present in the recording was fixed to be between 3 and 7. The recordings were done between June and September 2006 according to the “CHIL Room Setup” specification.
  2. Video annotations.
  3. Orthographic transcriptions.
ELRA-S0297 Hungarian Speecon database
The Hungarian Speecon database comprises the recordings of 555 adult Hungarian speakers and 50 child Hungarian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0298 Czech Speecon database
The Czech Speecon database comprises the recordings of 550 adult Czech speakers and 50 child Czech speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
February 09
ELRA-S0296 FBK-Irst database of isolated meeting-room acoustic events
This database has been produced within the CHIL Project (Computers in the Human Interaction Loop), in the framework of an Integrated Project (IP 506909) under the European Commission’s Sixth Framework Programme.
January 09

No LR announced. 


2008
Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec


December 08
ELRA-M0047 Czech WordNet
The Czech WordNet captures nouns, verbs, adjectives, and partly adverbs, and contains 28,201 word senses (synsets).
ELRA-S0294 CHIEDE Corpus: a spontaneous child language corpus of Spanish
The spontaneous child language corpus, CHIEDE, consists of 58,163 words, in 30 texts, with 7 hours and 53 minutes of recordings and 59 child participants.
ELRA-S0295 LILA Korean database
The LILA Korean database comprises 1,000 Korean speakers (500 males and 500 females) recorded over the Korean mobile telephone network.
November 08
ELRA-S0283 Laboratory Conditions Czech Audio-Visual Speech Corpus (UWB-05-LCAVC)
This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems. The corpus consists of about 25 hours of audio-visual records of 65 speakers in laboratory conditions.
ELRA-S0284 Czech Audio-Visual Speech Corpus for Recognition with Impaired Conditions (UWB-07-ICAVR I)
This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems collected with impaired illumination conditions. The corpus consists of about 20 hours of audio-visual records of 50 speakers in laboratory conditions.
ELRA-S0285 Czech Sign Language Corpus for Recognition - Amateur Signer (UWB-06-SLR-A)
This is an amateur sign-language database comprising 25 signs from Czech sign language. 15 signers (4 women and 11 men) carried out 5 repetitions of each sign and were recorded from 3 different views.
ELRA-S0286 Czech Sign Language Corpus for Recognition - Professional Signer (UWB-07-SLR-P)
This database comprises 378 signs from Czech sign language as performed by 4 everyday sign-language users (4 women, 2 of them deaf).
ELRA-E0017 CHIL 2006 Evaluation Package
The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker’s voice and background sounds.
ELRA-S0292 Danish EUROM1 (EUROM1_D)
EUROM1 is the first really multilingual speech database produced in Europe. Over 60 speakers per language pronounced numbers, sentences and isolated words using a close-talking microphone.
ELRA-S0293 The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication
The database contains 8,099 English utterances pronounced by non-native speakers (31 French, 20 Greek, 20 Italian, and 10 Spanish speakers). The collected utterances correspond to human input in a command and control aeronautics application. The data was recorded in a studio with a close-talking microphone, and real noise recorded in an airplane cockpit was artificially added to the data. The signals are provided in clean (studio recordings with close-talking microphone), low, mid and high noise conditions. The three noise levels correspond approximately to signal-to-noise ratios of 10 dB, 5 dB and -5 dB respectively.
October 08
ELRA-S0287 Cantonese Speecon Database
The Cantonese Speecon database comprises the recordings of 550 adult Cantonese speakers and 50 child Cantonese speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0288 Thai Speecon Database
The Thai Speecon database comprises the recordings of 552 adult Thai speakers and 50 child Thai speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0289 OrienTel Jordan MCA (Modern Colloquial Arabic) database
This speech database contains the recordings of 757 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0290 OrienTel Jordan MSA (Modern Standard Arabic) database
This speech database contains the recordings of 556 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0291 OrienTel English as spoken in Jordan database
This speech database contains the recordings of 578 Jordanian speakers of English recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.
ELRA-W0048 TUNA Corpus
The TUNA Corpus of Referring Expressions was built with contributions from 50 native or fluent speakers of English and contains about 2000 descriptions (referring expressions). Participants described objects (targets) in visual domains by typing and submitting referring expressions that distinguished them from other objects that were shown simultaneously (distractors). Each description is annotated with semantic information.
September 08
ELRA-S0281 LILA Hindi-L1 database
The LILA Hindi-L1 database comprises 2,030 Hindi speakers (1,012 males and 1,018 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered around 60 read and spontaneous items.
ELRA-S0282-01 BAS PHATT 1.0.X (sub-set)
The Ph@ttSessionz speech database contains recordings of 864 adolescent speakers of German (age range 12-20). The recordings were performed via the WWW in public schools (Gymnasium) in 41 locations in Germany. Recordings were done with SpeechRecorder in selected schools in the years 2005-2007. Both channels, the headset and the desktop microphone, were recorded in high quality. The BAS PHATT corpus is available in two versions: BAS PHATT 1.0.X (sub-set, ELRA-S0282-01) and BAS PHATT 1.1.X (complete corpus, ELRA-S0282-02). BAS PHATT 1.0.X contains 41 items.
ELRA-S0282-02 BAS PHATT 1.1.X (complete corpus)
August 08

No LR announced. 
July 08
ELRA-S0276 Swedish EUROM1 (EUROM1_S)
EUROM1 is the first really multilingual speech database produced in Europe. Over 60 speakers per language pronounced numbers, sentences and isolated words using a close-talking microphone.
ELRA-S0277 SpeechDat Galician Database for the Fixed Telephone Network
The SpeechDat Galician Database for the Fixed Telephone Network contains the recordings of 653 speakers of Galician recorded over the fixed telephone network. Each speaker uttered around 44 read and spontaneous items.
ELRA-S0278 SmartWeb Handheld Corpus (SHC)
This corpus contains recordings spoken by 156 speakers in a human-machine query situation. Users were asked to solve several tasks with a spoken query system to the WWW using a smart phone as portable device in natural environments (office, hall, restaurant, street). Recorded channels are the Bluetooth headset over UMTS (telephone quality), the Bluetooth headset and an additional collar microphone in high quality.
See also ELRA-S0279 and ELRA-S0280.
ELRA-S0279 SmartWeb Motorbike Corpus (SMC)
This corpus contains recordings spoken by 36 speakers in a human-machine query situation on a running motorcycle (BMW). Bikers were asked to solve several tasks with a spoken query system to the WWW using an integrated system connected to a speech server via a UMTS connection. Recorded channels are the Bluetooth helmet microphone over UMTS (telephone quality) and, in part, the Bluetooth helmet microphone and an additional neck microphone in high quality. See also ELRA-S0278 and ELRA-S0280.
ELRA-S0280 SmartWeb Video Corpus (SVC)
This multimodal corpus contains 99 recordings, each containing a human-human-machine dialogue: one speaker (who is being recorded) interacts with a human partner as well as with a dialogue system via a smart phone (SmartWeb system). See also ELRA-S0278 and ELRA-S0279.
June 08
Update - ELRA-S0242 SALA II US English database
The SALA II US English database comprises 4,090 US English speakers (2,017 males, 2,073 females, including some speakers with Hispanic accents) recorded over the United States mobile telephone network.
ELRA-L0085 euLEX (Lexical Database for Basque)
euLEX is a general lexicon which contains 115,000 entries, divided into 94,000 dictionary entries or lemmas, 12,000 allomorphs, 7,500 verb forms and about 1,200 dependent morphemes. All entries include linguistic information such as morphology and usage. The lexicon is in XML.
May 08

No LR announced. 
April 08

ELRA-S0273 LC-STAR Slovenian Phonetic lexicon 
The LC-STAR Slovenian Phonetic lexicon comprises 110,900 entries, including a set of 64,521 common words, a set of 45,012 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 5,491 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0274 LC-STAR English-Slovenian Bilingual Aligned Phrasal lexicon 
The LC-STAR English-Slovenian Bilingual Aligned Phrasal lexicon comprises 12,722 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English corpus of 10,522 phrases. The lexicon is provided in XML format.
March 08

ELRA-S0272 MEDIA speech database for French 
The MEDIA speech database for French was produced by ELDA within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It contains 1,258 transcribed dialogues from 250 adult speakers. The method chosen for the corpus construction process is that of a ‘Wizard of Oz’ (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of tourism and hotel reservation. The semantic annotation of the corpus is available in this catalogue and referenced ELRA-E0024 (MEDIA Evaluation Package).  
 
February 08

ELRA-S0269 LC-STAR Greek Phonetic lexicon 
The LC-STAR Greek Phonetic lexicon comprises 110,708 entries, including a set of 57,519 common words, a set of 45,162 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,027 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0270 LC-STAR Italian Phonetic lexicon
The LC-STAR Italian Phonetic lexicon comprises 109,712 entries, including a set of 56,420 common words, a set of 45,253 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,039 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0271 LC-STAR English-Italian Bilingual Aligned Phrasal lexicon
The LC-STAR English-Italian Bilingual Aligned Phrasal lexicon comprises 10,466 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English corpus of 10,524 phrases. The lexicon is provided in XML format.
 
January 08

ELRA-S0268 UPC-TALP database of isolated meeting-room acoustic events
This database has been produced within the CHIL Project (Computers in the Human Interaction Loop), in the framework of an Integrated Project (IP 506909) under the European Commission’s Sixth Framework Programme. It contains a set of isolated acoustic events that occur in a meeting room environment and that were recorded for the CHIL Acoustic Event Detection (AED) task. The database can be used as training material for AED technologies as well as for testing AED algorithms in quiet environments without temporal sound overlapping. Approximately 60 sounds per sound class were recorded. Ten people (5 men and 5 women) participated in three sessions. During each session a person had to produce a complete set of sounds twice.


2007
Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec


December 07

ELRA-S0244 Japanese Speecon database
The Japanese Speecon database comprises the recordings of 556 adult Japanese speakers and 51 child Japanese speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0265 Dutch from Belgium Speecon Database
The Dutch from Belgium Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0266 Dutch from the Netherlands Speecon Database
The Dutch from the Netherlands Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0267 Danish Speecon Database
The Danish Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0258 Orientel United Arab Emirates MCA (Modern Colloquial Arabic) 
This speech database contains the recordings of 750 Arabic speakers recorded over the United Arab Emirates’ fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0259 Orientel United Arab Emirates MSA (Modern Standard Arabic)
This speech database contains the recordings of 500 Arabic speakers recorded over the United Arab Emirates’ fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0260 Orientel English as spoken in the United Arab Emirates
This speech database contains the recordings of 500 speakers of English recorded over the United Arab Emirates’ fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.
ELRA-S0261 Hungarian SpeechDat(E) Database
This speech database contains the recordings of 1,000 Hungarian speakers recorded over the Hungarian fixed telephone network. Each speaker uttered around 50 read and spontaneous items.
ELRA-S0262 SALA II Portuguese from Brazil database
The SALA II Portuguese from Brazil database comprises 1,000 Brazilian speakers recorded over the Brazilian mobile telephone network.
ELRA-S0263 SALA II Spanish from Colombia Database
The SALA II Spanish from Colombia database comprises 1,000 Colombian speakers recorded over the Colombian mobile telephone network.
ELRA-S0264 SALA II US Spanish West
The SALA II US Spanish West database comprises 1,000 Spanish speakers recorded over the American mobile telephone network.
ELRA-S0255 LC-STAR Finnish Phonetic lexicon 
The LC-STAR Finnish Phonetic lexicon comprises 189,409 entries, including a set of 144,233 common words, a set of 45,176 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 13,068 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0256 LC-STAR Mandarin Chinese Phonetic lexicon
The LC-STAR Mandarin Chinese Phonetic lexicon comprises 104,368 entries, including a set of 38,098 common words, a set of 57,528 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,522 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0257 LC-STAR English-Finnish Bilingual Aligned Phrasal lexicon 
The LC-STAR English-Finnish Bilingual Aligned Phrasal lexicon comprises 10,520 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English corpus of 10,518 phrases. The lexicon is provided in XML format.
 
November 07

ELRA-S0249 TC-STAR English Training Corpora for ASR: Transcriptions of EPPS Speech 
This corpus consists of transcriptions from 92 hours of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English (a mixture of native and non-native English). The transcription files are stored in Transcriber XML file format. For corresponding recordings, see ELRA-S0251.
ELRA-S0251 TC-STAR English Training Corpora for ASR: Recordings of EPPS Speech 
This corpus consists of the recordings of around 290 hours from EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English, 92 hours of which were annotated (transcribed) (the transcriptions are not provided in the present package). Each file contains a single channel with 16-bit resolution at a sample rate of 16 kHz. For corresponding transcriptions, see ELRA-S0249.
ELRA-S0252 TC-STAR Spanish Training Corpora for ASR: Recordings of EPPS Speech 
This corpus consists of the recordings of around 283 hours from EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European Spanish (a mixture of native and non-native Spanish). Each file contains a single channel with 16-bit resolution at a sample rate of 16kHz.
ELRA-S0253 TC-STAR English Test Corpora for ASR 
This corpus consists of 70 hours of recordings of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English and other European languages. From this corpus, 16 hours of English speeches (native or non-native) were annotated (transcribed). Each speech file contains a single channel with 16-bit resolution at a sample rate of 16kHz. The transcription files are stored in Transcriber XML file format.
ELRA-S0254 TC-STAR Spanish Test Corpora for ASR
This corpus consists of 174 hours of recordings of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European Spanish and other European languages. From this corpus, 16 hours of Spanish speeches were annotated (transcribed). Each audio file contains a single channel with 16-bit resolution at a sample rate of 16kHz. The transcription files are stored in Transcriber XML file format.
ELRA-S0250 TC-STAR English-Spanish Training Corpora for Machine Translation: Aligned Final Text Editions of EPPS 
This corpus consists of respectively 34 million (English) and 38 million (Spanish) running words of bilingual, sentence-segmented and aligned texts in English and Spanish obtained from the Final Text Editions provided by the European Parliament (from April 1996 to Sept. 2004, Dec. 2004 to May 2005, and Dec. 2005 to May 2006). The data is accompanied by tools for further preprocessing.
ELRA-S0245 LC-STAR German Phonetic lexicon 
The LC-STAR German Phonetic lexicon comprises 102,169 entries, including a set of 55,507 common words, a set of 46,662 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 6,763 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0246 LC-STAR German Phonetic lexicon in the Touristic Domain 
The LC-STAR German Phonetic lexicon in the Touristic Domain comprises 8,782 entries from the following categories: nouns, adjectives and verbs. For each entry the following information is provided: orthographic form, part-of-speech (POS), phonemic transcription. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0247 LC-STAR Standard Arabic Phonetic lexicon 
The LC-STAR Standard Arabic Phonetic lexicon comprises 110,271 entries, including a set of 52,981 common words, a set of 50,135 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,155 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0248 LC-STAR English-German Bilingual Aligned Phrasal lexicon 
The LC-STAR English-German Bilingual Aligned Phrasal lexicon comprises 10,733 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English corpus of 10,518 phrases. The lexicon is provided in XML format.
 
October 07

ELRA-L0084 Macedonian Morphological Lexicon (MACPLEX) 
MACPLEX comprises two dictionaries: a dictionary of lemmas (over 80,000 entries) and a dictionary of word forms (over 1,300,000 entries). Morphological information (PoS, gender, case, definiteness, number for nouns, tense, person, etc. for verbs) is available for each entry. Out of the more than 1,300,000 word forms, there are 345,350 nouns, 467,744 adjectives, 500,220 verbs and 19,472 adverbs. The remaining entries correspond to pronouns, adpositions, conjunctions and numerals. The lexicon is available in Unicode.
ELRA-S0242 SALA II US English database
The SALA II US English database comprises 3,065 US English speakers (1,515 males, 1,550 females, including some speakers with Hispanic accents) recorded over the United States mobile telephone network.
ELRA-S0243 SpeechDat Catalan FDB database
The SpeechDat Catalan FDB database contains the recordings of 1,005 Catalan speakers (474 males, 531 females) recorded over the Spanish fixed telephone network.
AURORA-CD0005 AURORA-5
The AURORA-5 database has been developed mainly to investigate how hands-free speech input in noisy room environments influences the performance of automatic speech recognition. Furthermore, two test conditions are included to study the influence of transmitting the speech in a mobile communication system.
It contains artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8000 Hz, as well as a set of scripts for running recognition experiments on those speech data. The experiments rely on the freely available software package HTK, which is not part of this resource.
TC-STAR Evaluation Packages
The Evaluation Packages below include the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) and Spoken Language Translation (SLT) third evaluation campaign, as well as the material used for the TC-STAR 2006 and 2007 End-to-End task. They include resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
ELRA-E0025 TC-STAR 2007 Evaluation Package - ASR English 
ELRA-E0026-01 TC-STAR 2007 Evaluation Package - ASR Spanish - CORTES
ELRA-E0026-02 TC-STAR 2007 Evaluation Package - ASR Spanish - EPPS
ELRA-E0027 TC-STAR 2007 Evaluation Package - ASR Mandarin Chinese
ELRA-E0028 TC-STAR 2007 Evaluation Package - SLT English-to-Spanish
ELRA-E0029-01 TC-STAR 2007 Evaluation Package - SLT Spanish-to-English - CORTES
ELRA-E0029-02 TC-STAR 2007 Evaluation Package - SLT Spanish-to-English - EPPS
ELRA-E0030 TC-STAR 2007 Evaluation Package - SLT Chinese-to-English
ELRA-E0031 TC-STAR 2006 Evaluation Package – End-to-End
ELRA-E0032 TC-STAR 2007 Evaluation Package – End-to-End
September 07

No LR announced. 
August 07

Update - ELRA-W0036-02 "Le Monde Diplomatique" Text corpus in French - archives from 1999
Electronic archiving of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each HTML file contains one article.
ELRA-L0076 Polderland Dutch Lexicon of Abbreviations and Acronyms
The lexicon contains 2,180 Dutch abbreviations and acronyms. It complies with the official Dutch Spelling (2005/6). Each entry consists of an ID, word form, lemma and part of speech.
ELRA-L0077 Polderland Dutch General Lexicon
The lexicon contains 400,463 Dutch words, comprising 236,369 nouns, 90,882 adjectives, 69,744 verbs, 2,120 adverbs, and 1,348 items from other categories (pronouns, determiners, articles, adpositions, conjunctions, numerals, etc.). It complies with the official Dutch Spelling (2005/6). The lexicon contains an ID, word form, lemma and part of speech.
ELRA-L0078 Polderland Dutch Lexicon of Names
The lexicon contains 24,247 Dutch proper names. Various sorts of proper names are included, such as first names, last names, geographical names etc. Each entry contains an ID, word form, lemma, part of speech and proper name type.
ELRA-L0079 Polderland Dutch Lexicon of Business Terminology
The lexicon contains 15,987 Dutch words from the business domain, comprising 13,774 nouns, 1,267 adjectives, 895 verbs, 9 adverbs, and 42 items from other categories. It complies with the official Dutch Spelling (2005). Each entry contains an ID, word form and part of speech.
ELRA-L0080 Polderland Dutch Lexicon of Legal Terminology
The lexicon contains 6,207 Dutch words from the legal domain, comprising 4,781 nouns, 810 adjectives, 573 verbs, 12 adverbs and 31 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.
ELRA-L0081 Polderland Dutch Lexicon of Medical Terminology
The lexicon contains 17,115 Dutch words from the medical domain, comprising 12,638 nouns, 3,107 adjectives, 1,273 verbs, 11 adverbs and 86 items from other categories. The lexicon complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.
ELRA-L0082 Polderland Dutch Lexicon of Social Terminology
The lexicon contains 12,551 Dutch words from the social domain, comprising 9,984 nouns, 1,306 adjectives, 1,161 verbs, 56 adverbs and 44 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.
ELRA-L0083 Polderland Dutch Lexicon of Technical Terminology
The lexicon contains 9,940 Dutch words from the technical/scientific domain, comprising 8,832 nouns, 950 adjectives, 111 verbs, 2 adverbs and 45 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.
 
July 07

ELRA-E0018 ARCADE II Evaluation Package
The ARCADE II Evaluation Package was produced within the French national project ARCADE II (Evaluation of parallel text alignment systems), as part of the Technolangue programme. The ARCADE II project made it possible to carry out an evaluation campaign in the field of multilingual alignment.
The campaign is distributed over two actions: sentence alignment and translation of named entities.
ELRA-E0019 CESART Evaluation Package
The CESART Evaluation Package was produced within the French national project CESART (Evaluation of terminology extraction tools), as part of the Technolangue programme. The CESART project made it possible to carry out a campaign for the evaluation of terminological resource acquisition tools.
The campaign is distributed over two actions: term extraction and relation extraction.
ELRA-E0020 CESTA Evaluation Package 
The CESTA Evaluation Package was produced within the French national project CESTA (Evaluation of MT systems), as part of the Technolangue programme. The CESTA project made it possible to carry out a campaign for the evaluation of machine translation technologies.
The campaign is distributed over two actions: evaluation on a non-restrictive vocabulary and evaluation on a specialised domain (evaluation after terminology enrichment).
ELRA-E0021 ESTER Evaluation Package 
The ESTER Evaluation Package was produced within the French national project ESTER (Evaluation of Broadcast News enriched transcription systems), as part of the Technolangue programme. The ESTER project made it possible to carry out a campaign for the evaluation of Broadcast News enriched transcription systems for French.
The campaign is distributed over three actions: orthographic transcription, segmentation and information extraction (named entity tracking).
For research or commercial use of this database, please refer to ELRA-S0241 ESTER Corpus.
ELRA-E0022 EQueR Evaluation Package 
The EQueR Evaluation Package was produced within the French national project EQueR (Evaluation campaign for Question-Answering systems), as part of the Technolangue programme. The EQueR project made it possible to carry out a campaign for the evaluation of Question-Answering systems in French.
The campaign is distributed over two actions: one generic task and one specialised task (medical domain).
ELRA-E0023 EvaSy Evaluation Package 
The EvaSy Evaluation Package was produced within the French national project EvaSy (Evaluation of speech synthesis systems), as part of the Technolangue programme. The EvaSy project made it possible to carry out a campaign for the evaluation of speech synthesis systems using French text data.
The campaign is distributed over three actions: evaluation of grapheme-to-phoneme conversion, evaluation of prosody, and global evaluation of the quality of speech synthesis systems.
ELRA-E0024 MEDIA Evaluation Package 
The MEDIA Evaluation Package was produced within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme. The MEDIA project made it possible to carry out a campaign for the evaluation of man-machine dialogue systems for French.
The campaign is distributed over two actions: an evaluation taking into account the dialogue context and an evaluation not taking into account the dialogue context.
 
June 07

ELRA-M0038 SCI-ANAL English-German Bilingual Dictionary 
This bilingual dictionary contains 59,758 pairs of English-German terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by ";".
See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0037.
Update - ELRA-M0037 SCI-ANES English-Spanish Bilingual Dictionary
This bilingual dictionary contains around 60,000 pairs of English-Spanish terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by ";".
See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0038.
ELRA-S0240 French-Canadian Speecon database
The French-Canadian Speecon database comprises the recordings of 550 adult French-Canadian speakers and 50 child French-Canadian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
 
May 07

ELRA-W0047 Catalan Corpus of News Articles
The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. These articles are grouped by quarter, with no chronological order within each quarter.
ELRA-L0075 Bulgarian Linguistic Database
This database contains 81,647 entries in Bulgarian with a linguistic environment tool (for Windows XP). The data may be used for morphological analysis and synthesis, syntactic agreement checking, and determination of phonetic stress.
ELRA-S0238 MIST Multi-lingual Interoperability in Speech Technology database
The MIST Multi-lingual Interoperability in Speech Technology database comprises the recordings of 74 native Dutch speakers (52 males, 22 females) who uttered 10 sentences in each of Dutch, English, French and German: 5 sentences per language identical for all speakers and 5 sentences per language unique to each speaker. Dutch sentences are orthographically annotated.
ELRA-S0239 N4 (NATO Native and Non Native) database
The N4 (NATO Native and Non Native) database comprises speech data recorded in the naval transmission training centers of four countries (Germany, The Netherlands, United Kingdom, and Canada) during naval communication training sessions in 2000-2002. The material consists of native and non-native speakers using NATO Naval English procedure between ships, and reading from a text, "The North Wind and the Sun," in both English and the speaker’s native language. The audio material was recorded on DAT and downsampled to 16 kHz, 16 bit, and all the audio files have been manually transcribed and annotated with speaker identities using the Transcriber tool.

April 07

ELRA-M0043 Russian => English MT optimized lexicon in OLIF XML
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 99,211 entries in its source language (Russian) and 134,828 entries in its target language (English). The source entries are distributed as follows: 64,487 nouns, 11,470 adjectives, 19,724 verbs, 1,762 adverbs, and 1,768 closed-class elements (interjections, special prefixes, suffixes, etc.). Nouns contain gender and number information and verbs provide details on aspect and reflexivity. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Moreover, definitions are available for 59,775 entries, as well as collocational information for 39,148 entries.
ELRA-M0044 English => Swahili Bilingual Lexicon
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 58,247 entries in English and 58,300 in Swahili. The source entries are distributed as follows: 36,046 nouns, 3,013 adjectives, 18,308 verbs and 880 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 17,570 entries.
ELRA-M0045 Cebuano => English Bilingual Lexicon
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 1,988 entries in Cebuano and 1,990 in English. The source entries are distributed as follows: 1,052 nouns, 462 adjectives, 405 verbs and 69 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 500 entries.
ELRA-M0046 English => Czech Bilingual Lexicon
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 31,718 entries in English and 32,125 in Czech. The source entries are distributed as follows: 17,797 nouns, 7,748 adjectives, 6,039 verbs and 134 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 3,065 entries.
Update - ELRA-S0226-01 IDIOLOGOS 1 “Bootstrap” (NEOLOGOS Project)
It contains the recordings of 1,000 French adult speakers (470 males and 530 females) recorded over the French fixed telephone network. The speakers uttered 45 phonetically rich sentences. The 45 sentences were the same for all speakers.
Update - ELRA-S0226-02 IDIOLOGOS 2 “Eigenspeakers” (NEOLOGOS Project)
It contains the recordings of 200 French adult speakers (97 males and 103 females) recorded over the French fixed telephone network. The speakers uttered 45 sentences per call with 10 calls per speaker. The 450 sentences per speaker are common to all speakers. Speakers were selected from the IDIOLOGOS 1 “Bootstrap” database.
ELRA-S0275 Slovenian BNSI Broadcast News Speech Corpus
This speech database consists of TV news shows (both the evening news, “TV Dnevnik”, and the late night news, “Odmevi”) from the archive of the Slovenian national broadcaster RTV Slovenia. The recordings took place between June 1999 and May 2003. The database comprises a total of 36 hours of recordings, transcribed and manually checked using the Transcriber tool. 1,565 speakers were recorded (1,069 males, 477 females, 19 unspecified).
 
March 07

Update - ELRA-W0015 Text corpus of "Le Monde"
Corpus from "Le Monde" newspaper. Years 1987 to 2002 are available in ASCII text format. Years 2003 to 2006 are available in XML format. Each month consists of some 10 MB of data (circa 120 MB per year).
ELRA-S0235 LC-STAR Hebrew (Israel) phonetic lexicon 
The LC-STAR Hebrew (Israel) phonetic lexicon comprises 109,580 words, including a set of 62,431 common words, a set of 47,149 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,677 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0236 LC-STAR English-Hebrew (Israel) Bilingual Aligned Phrasal lexicon 
The LC-STAR English-Hebrew (Israel) Bilingual Aligned Phrasal lexicon comprises 10,520 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US English corpus of 10,449 phrases. The lexicon is provided in XML format.
ELRA-S0237 LC-STAR US English phonetic lexicon 
The LC-STAR US English phonetic lexicon comprises 102,310 words, including a set of 51,119 common words, a set of 51,111 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 6,807 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
 
February 07

ELRA-S0234 SALA Spanish Chilean Database
The SALA Spanish Chilean Database comprises 1,024 Chilean speakers (477 males, 547 females) recorded over the Chilean fixed telephone network.
ELRA-S0232 Swiss-German Speecon database
The Swiss-German Speecon database comprises the recordings of 550 adult Swiss-German speakers and 50 child Swiss-German speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0233 US English Speecon database
The US English Speecon database comprises the recordings of 550 adult US English speakers and 50 child US English speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0157 NetDC Arabic BNSC (Broadcast News Speech Corpus)
The NetDC Arabic BNSC (Broadcast News Speech Corpus) is a corpus developed by ELDA in the framework of the European-funded project Network of Data Centres (NetDC). The project was carried out in collaboration with the LDC (Linguistic Data Consortium), which has produced a similar corpus from the news broadcast by Voice of America Arabic in the United States. The database contains ca. 22.5 hours of broadcast news speech recorded from Radio Orient (France) during a 3-month period.
ELRA-S0229 LC-STAR Turkish lexicon
The LC-STAR Turkish lexicon comprises 104,513 words, including a set of 59,213 common words and a set of 45,300 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
ELRA-S0230 LC-STAR Russian lexicon
The LC-STAR Russian lexicon comprises about 128,000 words, including a set of 77,154 common words and a set of 51,074 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
ELRA-S0231 LC-STAR English-Russian Bilingual Aligned Phrasal lexicon
The LC-STAR English-Russian Bilingual Aligned Phrasal lexicon comprises 10,519 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US English corpus of 10,000 phrases. The lexicon is provided in XML format.
Update – ELRA-S0207 LC-STAR Catalan phonetic lexicon
The LC-STAR Catalan phonetic lexicon comprises more than 100,000 words, including a set of more than 45,000 common words and a set of more than 45,000 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
Update – ELRA-S0208 LC-STAR Spanish phonetic lexicon
The LC-STAR Spanish phonetic lexicon comprises more than 100,000 words, including a set of more than 45,000 common words and a set of more than 45,000 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
January 07

ELRA and Beijing Haitian Ruisheng Science Technology Ltd today signed a major Language Resources distribution agreement. On behalf of ELRA, ELDA will act as the distribution agency for Beijing Haitian Ruisheng Science Technology Ltd and will incorporate into the ELRA Language Resources catalogue a large number of Speech resources designed and collected to boost Speech Synthesis and Speech Recognition. The resources mainly cover Mandarin Chinese, with some coverage of Korean and Japanese.
With over 60 new resources, ELDA is strengthening its position as the leading worldwide distribution centre. With this agreement, Beijing Haitian Ruisheng Science Technology Ltd will gain more visibility, in particular on the European market.
List of available Speech Resources
List of available Written Corpora
ELRA-L0074 POLEX Polish Lexicon
The POLEX Polish Lexicon is a morphological dictionary of the Polish language. It comprises about 100,000 entries. The POLEX dictionary includes the core Polish vocabulary of general interest. It is based on a precise machine-interpretable formalism (coding system), the same for all categories (parts of speech). The dictionary entries are of the following form: BASIC_FORM+LIST_OF_STEMS+PARADIGMATIC_CODE+DISTRIBUTION_OF_STEMS
It contains more than 42,000 nouns, 12,000 verbs, 15,000 adjectives, 25,000 participles, and about 200 pronouns. A simple lemmatiser (in the form of a Prolog prototype) is also included.