List of Releases
Dec 23
DiaLEX – Egyptian (DiaLEX-EA) – ISLRN: 697-328-151-668-9
A comprehensive full-form lexicon of Egyptian Arabic general vocabulary (DiaLEX-EA) including 78 million entries for 31,000 lemmas with all inflected forms, enclitics, proclitics, case endings, declensions, and conjugated forms. Each entry is accompanied by a full and accurate diacriticization (vocalization) as well as an extensive coverage of variants. The lexicon is ideally suited to support natural language processing applications for Egyptian Arabic, especially morphological analysis and speech technology.
Quantity and size: 75,204,644 lines / 11,217 MB (11.0 GB)
DiaLEX – Emirati (DiaLEX-UA) – ISLRN: 836-793-503-213-8
A comprehensive full-form lexicon of Emirati Arabic general vocabulary (DiaLEX-UA) including 28 million entries for 29,000 lemmas with all inflected forms, enclitics, proclitics, case endings, declensions, and conjugated forms. Each entry is accompanied by a full and accurate diacriticization (vocalization) as well as an extensive coverage of variants. The lexicon is ideally suited to support natural language processing applications for Emirati Arabic, especially morphological analysis and speech technology.
Quantity and size: 24,976,871 lines / 3,841 MB (3.8 GB)
DiaLEX – Saudi Arabian Hijazi (DiaLEX-HA) – ISLRN: 849-157-479-216-3
A comprehensive full-form lexicon of Hijazi Arabic general vocabulary (DiaLEX-HA) including 21 million entries for 30,000 lemmas with all inflected forms, enclitics, proclitics, case endings, declensions, and conjugated forms. Each entry is accompanied by a full and accurate diacriticization (vocalization) as well as an extensive coverage of variants. The lexicon is ideally suited to support natural language processing applications for Hijazi Arabic, especially morphological analysis and speech technology.
Quantity and size: 20,247,655 lines / 2,835 MB (2.8 GB)
Nov 23
Corpus for fine-grained analysis and automatic detection of irony on Twitter – ISLRN: 478-366-550-085-8
This corpus was annotated by trained annotators (Master’s students in Linguistics) using a detailed annotation scheme for irony categorization, which describes four labels: “ironic by means of a polarity contrast”, “situational irony”, “other verbal irony” and “not ironic”. It consists of 4791 instances with an irony label and a tweet ID.
Bitext Synonym Data – General Language – ISLRN: 470-885-612-363-1
The Bitext Synonym Data – General Language includes 31,723 entries and more than 100,000 synonyms for the English language. This dataset is a set of synonyms developed to augment the English version of WordNet, a powerful open-source lexical database, released in 2005. All synonyms can be linked to Bitext Lexical Data – English (see ELRA-L0140) for lemmatization, POS and morphological information.
Corpus of Spontaneous Japanese (CSJ) – ISLRN: 280-594-494-328-0
The “Corpus of Spontaneous Japanese” (or CSJ) contains about 650 hours of spontaneous speech, corresponding to about 7 million words. All speech materials were recorded using head-worn close-talking microphones and DAT, and down-sampled to 16 kHz with 16-bit accuracy. The speech material is transcribed at both the orthographic and phonetic levels. In addition, segment labels, intonation labels, and other miscellaneous annotations are provided for a subset of CSJ, called the Core, which contains about 500k words or 45 hours of speech.
EWA-DB – Early Warning of Alzheimer speech database – ISLRN: 730-022-142-264-9
EWA-DB is a speech database that contains data from 3 clinical groups: Alzheimer’s disease, Parkinson’s disease, mild cognitive impairment, and a control group of healthy subjects. Speech samples of each clinical group were obtained using the EWA smartphone application, which contains 4 different language tasks: sustained vowel phonation, diadochokinesis, object and action naming (30 objects and 30 actions), picture description (two single pictures and three complex pictures). The total number of speakers in the database is 1649. Of these, there are 87 people with Alzheimer’s disease, 175 people with Parkinson’s disease, 62 people with mild cognitive impairment, 2 people with a mixed diagnosis of Alzheimer’s + Parkinson’s disease and 1323 healthy controls.
July 23
The series of Bitext Lexical Datasets for the generic vocabulary includes Lemmas, POS tagging, Frequency, Named Entities and Offensive features. Depending on the dataset and language, other syntactic and morphological features are also provided. The following 15 languages are available. As a complement to the datasets mentioned above, 11 datasets of Language Variants can also be obtained:
- Arabic (MSA) dataset and Arabic Language Variants dataset consisting of Arabic Gulf, Arabic Najdi, Arabic Egypt and Arabic MSA variants
- Chinese (Simplified) dataset, Chinese (Traditional) dataset, and Chinese Language Variants dataset (Simplified + Traditional)
- Dutch dataset and Dutch Language Variants dataset consisting of Netherlands and Belgium variants
- English dataset and English Language Variants dataset consisting of United States, United Kingdom and India variants
- Finnish dataset and Finnish Language Variants dataset consisting of Standard and Colloquial Finnish variants
- French dataset and French Language Variants dataset consisting of France, Canada and Switzerland variants
- German dataset and German Language Variants dataset consisting of Germany and Switzerland variants
- Indonesian dataset
- Italian dataset and Italian Language Variants dataset consisting of Italy and Switzerland variants
- Malay dataset
- Norwegian (Bokmal) dataset and Norwegian Language Variants dataset consisting of Bokmal and Nynorsk variants
- Portuguese dataset and Portuguese Language Variants dataset consisting of Portugal and Brazil variants
- Spanish dataset and Spanish Language Variants dataset consisting of Spain, North America, Central America, Andes and Southern Cone variants
- Ukrainian dataset
The Bitext Synthetic Data consist of pre-built training data for intent detection and are provided for 20 verticals for English and Spanish languages. They cover the most common intents for each vertical and include a large number of example utterances for each intent, with optional entity/slot annotations for each utterance. Data is distributed as models or open text files. For each language, the following verticals are available:
- Automotive: 52 intents (English, Spanish)
- Retail banking: 26 intents (English, Spanish)
- Education: 37 intents (English, Spanish)
- Event and ticketing: 25 intents (English, Spanish)
- Field Service: 27 intents (English, Spanish)
- Healthcare: 40 intents (English, Spanish)
- Hospitality: 24 intents (English, Spanish)
- Insurance: 38 intents (English, Spanish)
- Legal: 29 intents (English, Spanish)
- Manufacturing: 34 intents (English, Spanish)
- Media Streaming: 24 intents (English, Spanish)
- Mortgage and loans: 39 intents (English, Spanish)
- Moving and storage: 29 intents (English, Spanish)
- Real estate and construction: 28 intents (English, Spanish)
- Restaurant/bar chains: 30 intents (English, Spanish)
- Retail Ecomm: 34 intents (English, Spanish)
- Telecommunication: 26 intents (English, Spanish)
- Travel: 33 intents (English, Spanish)
- Utilities: 21 intents (English, Spanish)
- Wealth management: 24 intents (English, Spanish)
The Persian Kids’ Speech Corpus consists of speech signals recorded by 286 children (141 girls, 145 boys), from 6 to 9 years old, through an Andreas Mic Anti-Noise microphone and a Premium Speechmike headphone. This recorded data was manually checked and labeled. Finally, a corpus containing 162,395 samples with a duration of 33 hours and 44 minutes was created. The samples are distributed as follows:
- 29,057 Words (478 minutes)
- 17,429 SubWords (260 minutes)
- 43,838 Syllables (485 minutes)
- 70,078 Phonemes (765 minutes)
- 1,993 Extra Vocabulary (36 minutes)
The prepared speech corpus comprehensively contains all the 29 Persian phonemes, 118 syllables, 56 sub-words, and 711 words and is particularly applicable to speech recognition and linguistics studies.
Reduced fees for the following Speech Resources
Chinese Mandarin (South) database: This database contains the recordings of 1000 Chinese Mandarin speakers from Southern China (500 males and 500 females), from 18 to 60 years old, recorded in quiet studios located in Shenzhen and in Hong Kong Special Administrative Region, People’s Republic of China.
Chinese Mandarin (North) database: This database contains the recordings of 500 Chinese Mandarin speakers from Northern China (250 males and 250 females), from 18 to 60 years old, recorded in quiet studios located in Shenzhen and in Hong Kong Special Administrative Region, People’s Republic of China.
Japanese Kids Speech database (Lower Grade): This database contains the total recordings of 179 Japanese Kids speakers (71 males and 108 females), from 6 to 9 years old (first, second and third graders in elementary school), recorded in quiet rooms using smartphones.
Japanese Kids Speech database (Upper Grade): This database contains the total recordings of 232 Japanese Kids speakers (104 males and 128 females), from 9 to 13 years old (fourth, fifth and sixth graders in elementary school), recorded in quiet rooms using smartphones.
Jun 23
Archives of “El Mundo” Newspaper – Years 2020-2022 – ISLRN: 124-545-396-179-3
This corpus consists of 45,658 articles in Spanish from electronic archives of “El Mundo” Newspaper between 2020 and 2022. A few articles also come from publications from other related media: El Mundo Alicante, El Mundo Andalucía, El Mundo Baleares, El Mundo Catalunya, El Mundo Valéncia and Expansión. The number of articles available per year is as follows:
- 2020: 15,073 articles
- 2021: 14,461 articles
- 2022: 16,124 articles
TOTAL: 45,658 articles
All articles are provided in text format, including HTML tags. This data is released thanks to Unidad Editorial Información General, S.L.U., Spain. This corpus may also be obtained as separate years as follows:
- Archives of “El Mundo” Newspaper – Year 2020
- Archives of “El Mundo” Newspaper – Year 2021
- Archives of “El Mundo” Newspaper – Year 2022
Apr 23
MGB-5 Moroccan Dialect – ISLRN: 938-639-614-524-5
The MGB-5 Moroccan Dialect comprises 14 hours of Moroccan Arabic speech extracted from 93 YouTube videos distributed across seven genres: comedy, cooking, family/children, fashion, drama, sports, and science clips. The 93 YouTube clips have been manually labelled for speech and non-speech segments. About 12 minutes from each program were selected for transcription. In addition to the transcribed 14 hours, the full programs are also provided, which amount to 48 hours for the 93 programs.
Chinese-Vietnamese – PhraseBank with audio files – ISLRN: 428-557-564-826-7
Chinese-Vietnamese – PhraseBank with audio files of daily conversations spoken by native speakers containing 4002 sentence pairs. Scripts with Pinyin, Topic, Cat, Vietnamese translation with corresponding audio in Chinese and Vietnamese. Corpus in XML and WAV formats.
Vietnamese WordNet – ISLRN: 166-795-507-589-2
Manual translation of the 2.1 version of the English WordNet into Vietnamese containing 211000 entries, in Excel format.
Idioms French-Vietnamese Dictionary – ISLRN: 167-512-984-991-8
Idioms French-Vietnamese Dictionary of 448 entries, with French terms translated into Vietnamese and one idiomatic sentence per Vietnamese word, provided in XML format.
Vietnamese Etymology Dictionary – ISLRN: 627-237-063-692-6
Vietnamese Etymology Dictionary containing Vietnamese terms with correspondence in Kanji + Exp with meaning and examples of 3100 entries, provided in XML format.
Feb 23
Learner Corpus of Portuguese L2 – COPLE2 – ISLRN: 936-320-703-366-7
The Learner Corpus of Portuguese as Second/Foreign Language (COPLE2) is a corpus of written and oral texts produced by students of Portuguese as Foreign/Second Language courses in the Instituto de Cultura e Língua Portuguesa (the Institute of Portuguese Language and Culture) (ICLP – FLUL) and by applicants for examinations in the Centro de Avaliação de Português Língua Estrangeira (Center for Evaluation of Portuguese as a Foreign Language) (CAPLE – FLUL). The corpus contains texts from learners with 15 different native languages (L1s) and proficiencies from A1 to C1, and covers different topics and types of tasks. It is encoded in TEI format through the TEITOK environment. The corpus includes at the moment a total of 182,474 tokens and 978 texts, classified according to the CEFR scales. The corpus contains annotations for part of speech, lemma and learner errors. All the information encoded is searchable through the CQP query language.
CALEM (Comprehensive Arabic LEMmas) – ISLRN: 462-532-124-988-8
Comprehensive Arabic LEMmas is a lexicon covering a large list of Arabic inflected word forms (stems) and their corresponding lemmas. It is composed of 164,272 lemmas representing 7,151,106 stems, detailed as follows: 720 Arabic particles, 15,291 broken plurals, 2,464,239 verbs, 4,675,856 nouns. The lexicon is provided as plain text in UTF-8 encoding and represents about 284 Mb of data.
MADED (Moroccan Arabic Dialect Electronic Dictionary) – ISLRN: 977-057-254-691-5
Moroccan Arabic Dialect Electronic Dictionary (MADED) is an electronic lexicon containing almost 13,000 entries. They are written in Arabic script wherein each Modern Standard Arabic (MSA) lemma is provided with its corresponding Moroccan Arabic equivalent. In addition, MADED entries are annotated with useful metadata such as part-of-speech (POS), origin and root. MADED is designed for the practical use of the NLP community. This dictionary is provided as a csv file and represents about 2 Mb of data.
MORV (Moroccan Morphological vocabulary) – ISLRN: 064-194-729-767-0
The Moroccan Morphological vocabulary is a lexicon containing more than 4.6 M entries describing a given Moroccan Arabic word with fourteen (14) morphological and semantic features: the word orthographic form, the segmentation (prefix and suffix), part-of-speech (POS), gender, number, tense and transitivity (for verbs), its origin, dialectal lemma, Arabic lemma, the root, voice, state, and affirmative/negative form. This vocabulary is provided as a csv file and represents about 350 Mb of data.
CroaTPAS – ISLRN: 649-554-159-147-9
CroaTPAS is a bilingual lexicon in Croatian and English. It was created by manual annotation from the Croatian Web as Corpus and pattern creation using the Skema editor on the Sketch Engine platform. CroaTPAS is tailor-made to represent verb polysemy and currently contains a total of 683 patterns (belonging to 180 Croatian verbs) expressing different verb senses and 22,677 annotated corpus lines. Moreover, the resource includes 109 metonymic subpatterns linked to 1112 corpus lines featuring 62 different metonymic shifts.
T-PAS – ISLRN: 432-666-503-743-8
Dec 22
This corpus consists of a collection of political speeches in German crawled from the online archive of the German presidency (Bundespräsident) and the Chancellery (Bundesregierung). For the German Presidency, the speeches are available from July 1, 1984 to February 17, 2012, and the corpus contains a total of 1,442 texts comprising 2,392,074 tokens. For the German Chancellery, the corpus contains a total of 1,831 texts comprising 3,891,588 tokens covering a period from December 11, 1998 to December 6, 2011. This corpus contains speeches from the Chancellor but also from other politicians.
Reduced fees for the following speech resources:
- Chinese Mandarin (South) database
- Chinese Mandarin (North) database
- Japanese Kids Speech database (Lower Grade)
- Japanese Kids Speech database (Upper Grade)
Oct 22
67 resources were designed and collected to boost Speech Recognition in particular. They cover the following languages:
AnCora Spanish 2.0.0
ISLRN: 252-495-813-736-1
The AnCora Spanish Corpus 2.0.0 is a corpus of 500,000 words annotated at different levels: Lemma and Part of Speech, Syntactic constituents and functions, Argument structure and thematic roles, Semantic classes of the verb, Denotative type of deverbal nouns, Nouns related to WordNet synsets, Named Entities, Coreference relation.
ISLRN: 186-654-762-852-8
The AnCora Catalan Corpus 2.0.0 is a corpus of 500,000 words annotated at different levels: Lemma and Part of Speech, Syntactic constituents and functions, Argument structure and thematic roles, Semantic classes of the verb, Denotative type of deverbal nouns, Nouns related to WordNet synsets, Named Entities, Coreference relation.
ISLRN: 761-430-854-533-2
The Bulgarian Treebank Corpus is composed of 156,149 tokens (11,138 sentences) coming from three main sources in the domain of Grammar Notebooks (1,391 sentences), News (6,698 sentences), Other (3,049 sentences). It is available with syntactical and morphological annotation on a sentence basis in Universal Dependencies format.
ISLRN: 188-702-981-369-5
The Bulgarian Event Corpus is composed of 324,905 tokens appropriate for training Named Entity Recognition (NER), Named Entity Linking (NEL) and Event Recognition models for Bulgarian in a multidomain context within Humanities. The texts are domain related. They include documents from the area of Social Sciences and Humanities – scientific papers, archive documents, popular documents, and Wikipedia articles in the relevant areas.
ISLRN: 188-702-981-369-5
The Bulgarian Valency Frame Lexicon is composed of 9547 lexical entries organized by frames with 960 mappings to Princeton WordNet available in XML format. It is a treebank-driven resource of extracted valency frames from BulTreeBank. The frames were manually curated. The structure of the frames follows the BulTreeBank syntactic structure.
ISLRN: 583-408-694-292-6
The How2Sign dataset consists of a parallel corpus of speech and transcriptions of instructional videos and their corresponding American Sign Language (ASL) translation videos and annotations. It has been produced by recording 11 persons (6 males and 5 females) with various hearing status (5 self-identified as hearing, 4 as deaf, 2 as hard of hearing). The video has been recorded at 30 fps in MPEG format. A total of 80 hours of Multiview American Sign Language videos were collected, as well as gloss annotations and a coarse video categorization.
ISLRN: 058-406-130-314-1
This dataset contains more than 31 hours and 30 minutes of Persian scripted monologue and dialogue data, recorded from 89 Persian speakers (39 males and 50 females) between 17 and 80 years old in Iran (Tehrani dialect). Data consists of read and spontaneous speech recordings: books read by a person, recorded podcasts, articles in the newspapers, radio conversations, phone dialogues. Domains are labelled and include Accounting, Banking, Economics, Finance, Insurance, Literature, Marketing, Medicine, Psychology, Science, Technology, Telecommunication, and Law.
ISLRN: 942-234-530-020-7
This is a new release of the Venice Italian Treebank (VIT). It consists of the Written and Spoken VIT subsets. The PennTreebank version of the treebank is also made available on both subsets using parentheses and also a slightly modified version using brackets that allows web-based visualization tools to build a tree of the structure. The Written VIT consists of 223,292 tokens excluding punctuation, but 280,641 single tokens including enclitics and punctuation. It contains a totally revised constituency-based representation of the corpus as well as three new files. As for the Spoken VIT, 425 new fully parsed turns were added for a total of 3973. The total count of sentences is now 5851.
ISLRN: 688-718-284-176-0
Wojood consists of about 550,000 tokens (Modern Standard Arabic and dialect) that are manually annotated with 21 entity types (person, group of people, occupation, organization, geopolitical entity, location, facility, event, date, time, language, website, law, product, cardinal number, ordinal number, percent, quantity, unit, money, currency). It covers multiple domains (Media, History, Culture, Health, Finance, ICT, Law, Elections, Politics, Migration, Terrorism, Social Media) and was annotated with nested entities. The corpus contains about 75K entities, 22.5% of which are nested. The corpus was annotated using the IOB2 tagging scheme and is available in CSV format.
May 22
ISLRN: 482-848-308-105-6
The annotated tweet corpus in Arabizi, French and English, completed in 2020, was built by collecting and annotating tweets in 3 languages (Arabizi, French and English) for 3 predefined themes (Hooliganism, Racism, Terrorism). It consists of 17,103 sequences annotated from 585,163 tweets (196,374 in English, 254,748 in French and 134,041 in Arabizi), including the themes “Others” and “Incomprehensible”. Among these sequences, 4,578 sequences having at least 20 tweets annotated with the 3 predefined themes (Hooliganism, Racism and Terrorism) were obtained, including 1,866 sequences with an opinion change. They are distributed as follows: 2,141 sequences in English (57,655 tweets), 1,942 sequences in French (48,854 tweets) and 495 sequences in Arabizi (21,216 tweets). A sub-corpus of 8,733 tweets (1,209 in English, 3,938 in French and 3,585 in Arabizi) annotated as “hateful”, according to topic/opinion annotations and by selecting tweets that contained insults, is also provided.
A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from Wikipedia
ISLRN: 110-617-195-245-4
The bilingual English-Ukrainian lexicon of named entities uses Wikipedia metadata as a source. The extracted named entity pairs are classified into five classes: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC (miscellaneous). The lexicon consists of 624,168 pairs and comes in two formats: csv and xml.
ArabLEX set of data
The ArabLEX set of data consists of 4 databases dedicated to the Arabic language:
ArabLEX: Database of Arabic General Vocabulary (DAG)
ISLRN: 879-334-992-724-8
A comprehensive full-form lexicon of Arabic general vocabulary including all inflected, conjugated and cliticized forms. Each entry is accompanied by a rich set of morphological, grammatical, and phonological attributes. Ideally suited for NLP applications, DAG provides precise phonemic transcriptions and full vowel diacritics designed to enhance Arabic speech technology. Quantity and size: 87,930,738 lines / 24,399 MB (23.8 GB)
ArabLEX: Database of Arabic Place Names (DAP)
ISLRN: 161-842-321-771-2
This full-form Arabic-English place name database provides worldwide coverage of common place names, given in standard MSA orthography, and includes all inflected and cliticized forms for each place name. In addition, precise phonemic transcriptions and full vowel diacritics are designed to enhance Arabic speech technology. Quantity and size: 6,455,201 lines / 812 MB
ArabLEX: Database of Foreign Names in Arabic (DAF)
ISLRN: 943-592-129-040-2
This full-form database covers non-Arab personal names in both Arabic and English, some Arabic script variants, vocalized or unvocalized formats, as well as inflected and cliticized forms. The precise phonemic transcriptions and full vowel diacritics are designed to enhance Arabic speech technology. Quantity and size: 226,784,907 lines / 32,181 MB (31.4 GB)
ArabLEX: Database of Arab Names (DAN)
name variants for each name with a variety of supplementary information such as gender, name type and frequency statistics. This comprehensive lexicon (over 6.4 million variants) contains precise phonemic transcriptions and vocalized Arabic for all inflected and cliticized forms for each name. Quantity and size: 218,215,875 lines / 32,659 MB (31.9 GB)
Mar 22
ISLRN: 926-738-235-188-8
This is a corpus in Khasi, an Austro-Asiatic language, comprising Khasi sentences extracted from textbooks prescribed for students in secondary, higher secondary, graduation, and post-graduation in the year 2015-2016. The corpus contains 83,312 words, 4,386 sentences and 5,465 word types, amounting to 94,651 tokens (including punctuation). The sentences are manually tagged for parts of speech.
“La Dépêche de Kabylie” Corpus
ISLRN: 176-700-464-150-5
“La Dépêche de Kabylie” Corpus consists of about 1,570,000 words in the Amazigh language collected from the Algerian newspaper entitled “La Dépêche de Kabylie”. It was collected using HTTrack Website Copier. All articles are gathered in one plain text file.
English-Vietnamese Special Dictionaries series
A series of specialised bilingual dictionaries is now available for the following domains:
English-Vietnamese Special Dictionary: Aesthetic – 836 entries provided in XML format (ISLRN: 792-807-299-844-6)
English-Vietnamese Special Dictionary: Architecture – 18,213 entries provided in XML format (ISLRN: 090-342-038-261-9)
English-Vietnamese Special Dictionary: Finance – 9,039 entries provided in XML format (ISLRN: 557-620-378-687-8)
English-Vietnamese Special Dictionary: Economics – 16,255 entries provided in XML format (ISLRN: 292-335-361-128-1)
English-Vietnamese Special Dictionary: Informatics – 3,835 entries provided in XML format (ISLRN: 664-600-467-613-7)
English-Vietnamese Special Dictionary: Law – 3,011 entries provided in XML format (ISLRN: 675-423-495-453-3)
English-Vietnamese Special Dictionary: Math – 15,004 entries provided in XML format (ISLRN: 673-080-199-543-0)
English-Vietnamese Special Dictionary: Mechanical – 3,482 entries provided in XML format (ISLRN: 464-013-248-151-6)
English-Vietnamese Special Dictionary: Medical – 8,073 entries provided in XML format (ISLRN: 264-005-069-750-7)
English-Vietnamese Special Dictionary: Navigation – 19,393 entries provided in XML format (ISLRN: 147-831-511-571-0)
English-Vietnamese Special Dictionary: Physics – 23,584 entries provided in XML format (ISLRN: 288-262-689-669-3)
English-Vietnamese Special Dictionary: Real Estate – 2,585 entries provided in XML format (ISLRN: 438-043-926-686-6)
English-Vietnamese Special Dictionary: Stocks – 1,094 entries provided in XML format (ISLRN: 479-017-757-739-6)
English-Vietnamese Special Dictionary: Tourism – 2,235 entries provided in XML format (ISLRN: 923-733-433-674-1)
Feb 22
ELRA-W0318 Danish Gigaword Corpus
ISLRN: 024-504-318-388-3
This corpus consists of over a billion words for Danish collected from various websites. Domains are distributed as follows: Legal (308.8 million words), Social Media (261.4 million words), Subtitles (130.1 million words), Debates (108.4 million words), Conversations (0.7 million words), Web (101.02 million words), Encyclopedia (55.6 million words), Literature (31.3 million words), Manuals (2.6 million words), Books (2.1 million words), Religion (600k words), News (40 million words), Other (1.2 million words).
English-Punjabi Code-Mixed Social Media Content
ISLRN: 695-759-706-170-8
The English-Punjabi Code-Mixed Social Media Content corpus is composed of 893,615 parallel sentences of English-Punjabi in the following domains: Agriculture, Culture, Entertainment, Health, Religion, Sports, Technology, Tourism, and Education.
ISLRN: 657-350-757-058-6
The Parallel Corpora for 6 Indian Languages contains data sets for Bengali (540,000 words – 20,000 parallel sentences), Hindi (1,200,000 words – 37,000 parallel sentences), Malayalam (660,000 words – 29,000 parallel sentences), Tamil (747,000 words – 35,000 parallel sentences), Telugu (951,000 words – 43,000 parallel sentences), and Urdu (1,200,000 words – 33,000 parallel sentences), translated into English. Each data set was created by taking around 100 Indian-language Wikipedia pages and obtaining four independent translations in English of each of the sentences in those documents via non-professional translators hired by crowdsourcing on Amazon Mechanical Turk.
Oct 21
ISLRN: 588-170-827-016-7
The Ema-lon Manipuri Corpus consists of a set of resources for the Manipuri language (locally known as Meiteilon) for the purpose of machine translation. The main source for these resources is the Sangai Express news website. The resources that constitute the present corpus comprise monolingual and parallel data in Manipuri and English (EM Corpus), as well as a FastText word embedding (EM-FT) and an ALBERT model (EM-ALBERT) available for the Manipuri language.
ISLRN: 007-544-786-822-8
The NRC Emotion Lexicon was originally built by Saif M. Mohammad and Peter D. Turney through crowdsourcing. The NRC was created in order to assist with emotion analysis as other emotion lexicons were smaller at the time. After close inspection of the NRC emotion lexicon, a large number of troubling entries were identified, where words that should in most contexts be emotionally neutral, with no affect (e.g., lesbian, stone, mountain), are associated with emotional labels that are inaccurate, nonsensical, pejorative, or, at best, highly contingent and context-dependent (e.g. lesbian labeled as DISGUST and SADNESS, stone as ANGER, or mountain as ANTICIPATION). The revised NRC consists of 5,916 entries that result from the works referenced in Zad et al. (2021) “Hell Hath No Fury? Correcting Bias in the NRC Emotion Lexicon”, published at the WOAH, the 5th Workshop on Online Abuse and Harms.
ISLRN: 143-538-116-557-6
The French-Vietnamese Dictionary consists of 82,768 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples. All headwords are pronounced with true voice by native speakers. The dictionary is provided in XML format.
ISLRN: 652-215-232-618-2
The Vietnamese-French Dictionary consists of 43,296 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for source language only. The dictionary is provided in XML format.
ISLRN: 750-377-806-677-8
The German-Vietnamese Dictionary consists of 32,511 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples available only for the source language. Headword (in Vietnamese) has true voice by native speakers.
ISLRN: 993-568-466-563-7
The Vietnamese-German Dictionary consists of 42,793 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples available only for the source language.
ELRA announces that the CINTIL Corpus – International Corpus of Portuguese is now available for free for academic research
CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portuguese. It is composed of one million annotated tokens, each of which was verified by human expert annotators. The annotation comprises information on part-of-speech, open class lemma and inflection, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition). The corpus is developed over raw textual materials of several types, of which 30% are spoken materials.
The CINTIL Corpus is now available for free for academic research and can be found in the ELRA Catalogue under the following reference: ELRA-W0050 The CINTIL Corpus – International Corpus of Portuguese (ISLRN: 176-775-844-396-0)
June 21
Language Resources entrusted to ELRA for distribution and sharing by UPC, Universitat Politecnica de Catalunya, in Spain, are now available for free for academic research purposes (for ELRA institutional members) and at substantially decreased costs for commercial purposes.
They consist of data developed to enhance Speech technologies in Catalan, Spanish and Arabic.
The Language Resources can be found in the ELRA Catalogue under the following references:
ELRA-S0101 Spanish SpeechDat(II) FDB-1000 (ISLRN: 415-072-153-167-5)
ELRA-S0102 Spanish SpeechDat(II) FDB-4000 (ISLRN: 295-399-069-106-4)
ELRA-S0140 Spanish SpeechDat-Car database (ISLRN: 937-459-364-430-3)
ELRA-S0141 SALA Spanish Venezuelan Database (ISLRN: 894-744-522-508-8)
ELRA-S0173 SALA Spanish Mexican Database (ISLRN: 077-043-759-782-3)
ELRA-S0183 OrienTel Morocco MCA (Modern Colloquial Arabic) database (ISLRN: 613-578-868-832-2)
ELRA-S0184 OrienTel Morocco MSA (Modern Standard Arabic) database (ISLRN: 978-839-138-181-8)
ELRA-S0185 OrienTel French as spoken in Morocco database (ISLRN: 299-422-451-969-8)
ELRA-S0186 OrienTel Tunisia MCA (Modern Colloquial Arabic) database (ISLRN: 297-705-745-294-4)
ELRA-S0187 OrienTel Tunisia MSA (Modern Standard Arabic) database (ISLRN: 926-401-827-806-5)
ELRA-S0188 OrienTel French as spoken in Tunisia database (ISLRN: 085-972-271-578-3)
ELRA-S0207 LC-STAR Catalan phonetic lexicon (ISLRN: 102-856-174-704-7)
ELRA-S0208 LC-STAR Spanish phonetic lexicon (ISLRN: 826-939-678-247-5)
ELRA-S0243 SpeechDat Catalan FDB database (ISLRN: 373-541-490-506-3)
ELRA-S0306 TC-STAR Transcriptions of Spanish Parliamentary Speech (ISLRN: 972-398-693-247-4)
ELRA-S0309 TC-STAR Spanish Baseline Female Speech database (ISLRN: 682-113-241-701-0)
ELRA-S0310 TC-STAR Spanish Baseline Male Speech Database (ISLRN: 736-021-086-598-0)
ELRA-S0311 TC-STAR Bilingual Voice-Conversion Spanish Speech Database (ISLRN: 254-311-004-570-0)
ELRA-S0312 TC-STAR Bilingual Voice-Conversion English Speech Database (ISLRN: 522-613-023-181-1)
ELRA-S0313 TC-STAR Bilingual Expressive Speech Database (ISLRN: 088-656-828-489-3)
ELRA-S0336 Spanish Festival voice male (ISLRN: 868-352-143-949-9)
ISLRN: 893-470-491-825-6
The English-Vietnamese Parallel Corpus consists of 1,000,000 sentence pairs, with an average length of 20 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
ISLRN: 128-772-037-486-0
The Chinese-Vietnamese Parallel Corpus consists of 200,000 sentence pairs, with an average length of 15 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
ISLRN: 365-128-449-700-7
The Korean-Vietnamese Parallel Corpus consists of 200,000 sentence pairs, with an average length of 15 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
ISLRN: 637-630-726-817-9
The English-Chinese-Vietnamese Trilingual Parallel Corpus consists of 20,046 trilingual sets of sentence pairs. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
ISLRN: 663-014-610-121-2
This database includes gold Ezafe tags in almost 30 thousand Persian sentences. The sentences were manually annotated by six annotators who were all native Persian speakers and linguists.
ISLRN: 853-782-057-600-0
The English-Vietnamese Dictionary consists of 125,000 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for the source language only. The dictionary is provided in XML format.
ISLRN: 747-175-261-587-4
The Vietnamese-English Dictionary consists of 156,000 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for source language only. The dictionary is provided in XML format.
ISLRN: 120-577-487-890-2
The Chinese-Vietnamese Dictionary consists of 52,470 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples. The dictionary is provided in XML format.
ISLRN: 481-792-486-258-2
The Vietnamese-Chinese Dictionary consists of 50,911 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for the source language only. The dictionary is provided in XML format.
ISLRN: 056-033-674-079-4
The Japanese-Vietnamese Dictionary consists of 59,369 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for source language only. The dictionary is provided in XML format.
ISLRN: 719-247-130-680-9
The Vietnamese-Japanese Dictionary consists of 65,000 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples available for source language only. The dictionary is provided in XML format.
ISLRN: 409-454-902-511-3
The Korean-Vietnamese Dictionary consists of 37,678 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples available only for source language. The dictionary is provided in XML format.
ISLRN: 349-337-956-980-9
The Vietnamese-Korean Dictionary consists of 27,449 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples available only for the source language. The dictionary is provided in XML format.
Dec 20
ELRA-S0413 Ahoslabi – esophageal speech database
ISLRN: 425-664-403-057-4
Ahoslabi was built within the framework of the RESTORE project (“Restauración, almacenamiento y rehabilitación de la voz”) (restrictions apply). The database primarily consists of recordings of 31 laryngectomees (27 males and 4 females) pronouncing 100 phonetically balanced sentences. The recordings total 10h48min, for 1.16 Gb. Esophageal voices were recorded in a soundproof recording cubicle with a Neumann microphone. Additionally, it includes parallel recordings of the sentences by 9 healthy speakers (6 males and 3 females) to facilitate speech processing tasks that require small parallel corpora, such as voice conversion or synthetic speech adaptation. A pronunciation lexicon in SAMPA is also provided.
Oct 20
ISLRN: 579-088-185-591-2
The Japanese Kids Speech database (Lower Grade) contains the recordings of 179 Japanese child speakers (71 males and 108 females), from 6 to 9 years old (first, second and third graders in elementary school), recorded in quiet rooms using smartphones. 1,019 sentences were used. Audio data are stored in .wav files as sequences of 16 kHz Mono, 16 bits, Linear PCM. This database may be combined with the Japanese Kids Speech database (Upper Grade), also available in the ELRA Catalogue under reference ELRA-S0412.
ISLRN: 846-295-092-462-7
The Japanese Kids Speech database (Upper Grade) contains the recordings of 232 Japanese child speakers (104 males and 128 females), from 9 to 13 years old (fourth, fifth and sixth graders in elementary school), recorded in quiet rooms using smartphones. 1,018 sentences were used. Audio data are stored in .wav files as sequences of 16 kHz Mono, 16 bits, Linear PCM. This database may be combined with the Japanese Kids Speech database (Lower Grade), also available in the ELRA Catalogue under reference ELRA-S0411.
Sep 20
ELRA-S0410 CAREGIVER Corpus
ISLRN: 072-357-063-759-1
CAREGIVER, a multilingual speech corpus used for modeling language acquisition, was designed and recorded within the framework of the EU-funded Acquisition of Communication and Recognition Skills (ACORNS) project. The corpus contains nearly 66,000 utterance-based audio files spoken over a two-year period by 16 male and 14 female native speakers of Dutch, UK English, and Finnish. An orthographic transcription is available for every utterance. Time-aligned word and phone annotations also exist for some of the sub-corpora.
Jun 20
The MEDIA data can be found in the ELRA Catalogue under the following references:
ELRA-S0272 MEDIA speech database for French
ISLRN: 195-971-767-455-9
ELRA-S0371 PortMedia French and Italian corpus
ISLRN: 135-793-959-390-8
May 20
This dataset consists of 4.98 hours of transcribed conversational speech in Mandarin Chinese, where 30 conversations are uttered by 32 speakers (16 males and 16 females). The audio is sampled at 16 kHz and quantized at 16 bits. For each conversation, there are two close-talking channels recorded via microphones, one for each speaker, as well as three far-field channels recorded by an iPhone, an Android phone, and a recorder respectively.
This corpus may be obtained as a complete set or by selecting specific channels (two close-talking channels shall be understood as 1 single channel):
ISLRN: 559-956-475-937-1
ISLRN: 234-140-315-272-4
ISLRN: 383-054-806-637-3
ISLRN: 235-882-638-211-2
March 20
February 20
ISLRN: 645-563-102-594-8
The SpeechTera Pronunciation Dictionary is a machine-readable pronunciation dictionary for Brazilian Portuguese and comprises 737,347 entries. Its phonetic transcription is based on 13 linguistic varieties spoken in Brazil and contains the pronunciation of the frequent word forms found in the transcription data of SpeechTera’s speech and text database (literary, newspaper, movies, miscellaneous). Each of the thirteen dialects comprises 56,719 entries.
Dec 19
ELRA-W0129 Arbobanko (Esperanto Treebank)
ISLRN: 185-602-618-699-2
The Esperanto Arbobanko Treebank is a 52,000 token dependency treebank of Esperanto with texts from the MONATO news magazine, consisting of random excerpts from the period 2000-2010. All words were annotated for lemma, part-of-speech, inflection, compounding and affixing, syntactic function, dependency links, NER types, semantic types of nouns and adjectives, and verb frame categories.
Oct 19
Sep 19
ELRA-M0052 EnToFrNE – a Parallel English-French Lexicon of Named Entities
ISLRN: 233-270-965-120-8
This lexicon consists of 1,167,263 parallel named entities in English and French. The tags used are: PERSON, ORGANIZATION, LOCATION, PRODUCT and MISC. The lexicon comes in two formats: csv and xml.
July 19
ISLRN: 024-286-962-247-6
Glissando-sp includes more than 12 hours of speech in Spanish, recorded under optimal acoustic conditions, orthographically transcribed, phonetically aligned and annotated with prosodic information (location of the stressed syllables and prosodic phrasing). The corpus was recorded by 8 professional speakers and 20 non-professional speakers: 4 “news broadcaster” professional speakers (2 male and 2 female), 4 “advertising” professional speakers (2 male and 2 female), and 20 non-professional speakers (10 male and 10 female). Glissando-sp is made of three subcorpora: readings of real news texts (provided by “Cadena Ser” radio station), interactions between two speakers oriented to a specific goal in the domain of information requests, and conversations between people who have some degree of familiarity with each other.
ISLRN: 780-617-066-913-1
Glissando-ca includes more than 12 hours of speech in Catalan, recorded under optimal acoustic conditions, orthographically transcribed, phonetically aligned and annotated with prosodic information (location of the stressed syllables and prosodic phrasing). The corpus was recorded by 8 professional speakers and 20 non-professional speakers: 4 “news broadcaster” professional speakers (2 male and 2 female), 4 “advertising” professional speakers (2 male and 2 female), and 20 non-professional speakers (10 male and 10 female). Glissando-ca is made of three subcorpora: readings of real news texts (provided by “Cadena Ser” radio station), interactions between two speakers oriented to a specific goal in the domain of information requests, and conversations between people who have some degree of familiarity with each other.
ISLRN: 387-435-142-983-6
This database consists of about 30,000 bilingual parallel sentences and phrases in English and Persian (15,000 in each language). It comes with software through which users can search for a word, phrase or chunk and receive all idioms and expressions related to the query. The database is presented in Access format and the software is executable on Windows systems.
ISLRN: 760-940-374-770-6
This bilingual terminology consists of around 25,000 terms in the fields of computer engineering, computer science and information technology. It comes with software through which users can search for a word, phrase or chunk and receive all entries related to the query. The database is presented in Access format and the software is executable on Windows systems.
ISLRN: 188-448-142-468-5
This bilingual terminology consists of around 15,000 terms in the fields of management and economic sciences. It comes with software through which users can search for a word, phrase or chunk and receive all entries related to the query. The main database of the software is presented in Access format and the software itself is executable on Windows systems.
May 19
ISLRN: 204-945-263-927-6
The GlobalPhone Multilingual Model Package contains about 22 hours of transcribed read speech spoken by native speakers in 22 languages (Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, and Vietnamese). The GlobalPhone Multilingual Model Package covers about 1 hour of transcribed speech from 10 speakers (5 male, 5 female) from each of the above listed 22 languages.
ISLRN: 331-592-378-424-7
The GlobalPhone 2000 Speaker Package contains transcribed read speech spoken by 2000 native speakers in 22 languages (Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, and Vietnamese). The GlobalPhone 2000 Speaker Package covers about 9,000 randomly selected utterances read by 2000 native speakers in 22 languages, i.e. on average 4.5 utterances corresponding to 40 seconds of speech per speaker amounting to a total of 22 hours of speech.
ISLRN: 112-393-061-014-3
The Speaking atlas of the regional languages of France offers the same Aesop’s fable read in French and in a number of varieties of languages of France. This work, which has a scientific and heritage dimension, consists in highlighting the linguistic diversity of Metropolitan France and Overseas Territories, through recordings collected in the field and presented via an interactive map, with their orthographic transcription. As far as Occitan is concerned, about sixty varieties were collected in Gascony, Languedoc, Provence, northern Occitania and the Linguistic Crescent. Varieties of Basque, Breton, Franconian, West Flemish, Alsatian, Corsican, Catalan, Francoprovençal and Oïl language(s) are also provided, as well as about fifty languages in the French Overseas and non-territorial languages such as Rromani and French Sign Language.
ISLRN: 572-070-066-634-8
This corpus consists of phonetically rich Urdu sentences and additional sentences covering telephone numbers, addresses and personal names. This speech corpus is recorded with a variety of microphone types. Sampling rate of speech files is 16 kHz. Each utterance is stored in a separate file and is accompanied by its orthographic transcription file in Unicode.
ISLRN: 036-939-425-010-1
This corpus is a collection of XML metatextually tagged corpora containing speeches from European chambers. It is a bilingual, bidirectional written corpus in English and Spanish. This first set (ECPC_EP-05) consists of (1) a “clean” version in XML of the European Parliament’s 2005 daily sessions; (2) a POS-tagged version of the 2005 daily sessions; and (3) a sentence-based aligned version of the 2005 daily sessions. In its raw format, ECPC_EP-05 contains 3,668,476 tokens/words (excluding tagging) in English distributed over 60 UTF-8 files and 3,993,867 tokens/words (excluding tagging) in Spanish distributed over 60 UTF-8 files.
ISLRN: 690-348-503-270-1
This lexicon consists of 26,155 parallel named entities in seven languages: English and six South Slavic ones (Bosnian, Bulgarian, Croatian, Macedonian, Serbian and Slovenian). The lexicon contains multiword entries which are not strictly named entities, but contain a word which is. Slovenian, Croatian and Bosnian are written in Latin script, Macedonian and Bulgarian in Cyrillic. Serbian is a special case since it may come in two scripts (Cyrillic and Latin) and two dialects (ekavica and ijekavica). This lexicon uses the ekavica variant of Serbian in Cyrillic script. The lexicon comes in two formats: csv and xml.
Oct 18
ISLRN: 986-364-744-303-9
The dataset is composed of: a collection of mixed English and Arabizi text intended to train and test a system for the automatic detection of code-switching in mixed English and Arabizi texts; and a set of 3,452 Arabizi tokens manually transliterated into Arabic, intended to train and test a system that performs Arabizi to Arabic transliteration.
ISLRN: 305-450-745-774-1
This is an Arabic stemming gold standard corpus composed of a collection of 37 sentences, selected to be representative of Arabic stemming tasks and manually annotated. The compiled sentences come from various sources (poems, the Holy Quran, books, and periodicals) of diverse kinds (proverb and dictum, article commentary, religious text, literature, historical fiction). NAFIS is represented according to the TEI standard.
ISLRN: 747-055-093-447-8
This corpus consists of 5,131 sentences recorded in Mbochi, together with their transcription and French translation, as well as the results of the work done during the JSALT workshop: alignments at the phonetic level and various results of unsupervised word segmentation from audio. The audio corpus is made up of 4.5 hours, downsampled to 16 kHz, 16 bits, with Linear PCM encoding. Data is distributed into 2 parts, one for training consisting of 4,617 sentences, and one for development consisting of 514 sentences.
ISLRN: 503-886-852-083-2
This database contains the recordings of 1000 Chinese Mandarin speakers from Southern China (500 males and 500 females), from 18 to 60 years old, recorded in quiet studios. Recordings were made through microphone headsets and consist of 341 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz Mono, 16 bits, Linear PCM.
ISLRN: 353-548-770-894-7
This database contains the recordings of 500 Chinese Mandarin speakers from Northern China (250 males and 250 females), from 18 to 60 years old, recorded in quiet studios. Recordings were made through microphone headsets and consist of 172 hours of audio data (about 30 minutes per speaker), stored in .WAV files as sequences of 48 kHz Mono, 16 bits, Linear PCM.
ISLRN: 133-181-128-420-9
This dictionary consists of more than 50,000 entries (along with almost all wordforms and proper names) with corresponding audio files in MP3 and English transliterations. The words have been recorded with standard Persian (Farsi) pronunciation (all by a single speaker). This dictionary is provided with its software.
January 18
ISLRN: 357-949-964-163-0
The French dictionary of definitions (SYNAPSE) consists of 216,835 entries (147,378 nouns, 80,552 adjectives, 24,001 verbs, 4,677 adverbs, 1,560 prefixes, 107 prepositions, 614 interjections, 147 pronouns, 42 conjunctions, 27 articles), 309,078 definitions and 7,395 phraseological units (phrases). Grammatical information for each entry consists of: grammatical category, gender, number, inflected forms. This dictionary is provided in XML format together with its DTD.
ISLRN: 838-483-738-912-8
This is a corpus of 500,000 English-Vietnamese sentence pairs. The parallel corpus contains English documents translated by professional translators into Vietnamese. The source texts include books, dictionaries, newspapers, online news. The texts are provided in TEI format.
ISLRN: 217-906-813-531-9
This corpus consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts. Six unique subjects (undergraduates between 19 and 25 years of age) participated in the collection. The dialogue speech was captured with two headset microphones and saved in 16kHz, 16-bit mono linear PCM FLAC format. Transcripts were produced semi-automatically, using an automatic speech recognizer followed by manual correction. All text is presented in UTF-8 as either plain text or XML.
ISLRN: 157-037-166-491-1
This corpus comprises clean microphone recordings of conversational speech from 300 German speakers (126 males and 174 females) aged 18 to 35 years, with no marked dialect/accent. The recordings were performed in an acoustically-isolated room in 2016/2017. Four scripted and four semi-spontaneous dialogs were elicited from the speakers, simulating telephone call inquiries. Additionally, spontaneous neutral and emotional speech utterances and questions were produced. All labels are provided, together with the speech recordings and the speakers’ metadata.
Sep 17
ISLRN: 049-623-948-389-2
This dictionary consists of a list of 6 million inflected forms, fully vowelized, and tagged with grammatical information which includes POS and grammatical features, including number, gender, case, definiteness, tense, mood and compatibility with clitic agglutination. The data is formatted in conformity with the data formats of Unitex/GramLab. This dictionary is also available together with recognition of agglutinated clitics and inflection system in the ELRA Catalogue under reference ELRA-L0099.
ISLRN: 963-860-792-289-9
This dictionary consists of 6 million inflected forms, fully vowelized, generated in compliance with the grammatical rules of Arabic and tagged with grammatical information which includes POS and grammatical features, including number, gender, case, definiteness, tense, mood and compatibility with clitic agglutination. It is accompanied by a grammatical resource that recognizes hundreds of millions of valid agglutinated words. In order to be able to update the full-form dictionary, a dictionary of 65,000 lemmas and the data required to inflect them and regenerate the full-form dictionary are also provided. The data is formatted in conformity with the data formats of Unitex/GramLab. This dictionary is also available without recognition of agglutinated clitics and without inflection system in the ELRA Catalogue under reference ELRA-L0098.
July 17
ISLRN: 074-825-114-781-7
The English-Persian parallel corpus contains more than 200,000 aligned sentences across a variety of text types from the domains of art, law, culture, science, religion, literature, medicine, idioms, politics and others. It is an extension of the English-Persian parallel corpus already distributed by ELRA (Catalogue Reference: ELRA-W0051). This new version of the corpus is distributed with a concordance program.
ISLRN: 941-187-059-145-7
This is a 25-million-word text corpus of Swahili, annotated for part-of-speech, morphology and syntax. The corpus contains prose text from domains such as fiction, news media and government documents, from the period between 1953 and 2016.
ISLRN: 492-817-146-504-9
This is a corpus of Mongolian text mostly from domains like online or printed daily newspapers, literature, and laws. Part of this corpus, about 2,800 sentences with 100,000 words, has been POS-tagged manually and stored in TEI format.
ISLRN: 068-845-898-304-0
This speech corpus was recorded through a “Blubbery” model microphone by one male speaker in Persian (Tehrani accent) in a professional studio. Speech synthesized using this corpus has produced a high-quality, natural voice. It consists of 399 utterances for a total of about 2.5 hours, with orthographic and phonetic transcriptions.
May 17
ISLRN: 186-827-325-462-6
This is a phonetic lexicon of 21,560 words in Pashto transcribed manually by a native Pashto speaker (Yusufzai dialect) using the IPA Pashto phoneme set.
Apr 17
ISLRN: 425-777-374-455-4
The ETAPE Evaluation Package consists of ca. 30 hours of radio and TV data, selected to include mostly non-planned speech and a reasonable proportion of multiple-speaker data. All data were carefully transcribed, including named entity annotation. This package includes the material that was used for the ETAPE evaluation campaign: resources, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of this evaluation package is to enable external players to evaluate their own systems and compare their results with those obtained during the campaign itself.
ISLRN: 213-212-351-142-5
The Danish Propbank (DPB) is an 87,000-token treebank from a variety of genres, annotated with morphosyntactic and semantic information, namely propositions/frames with VerbNet classes and semantic roles for both arguments and satellites. There are over 12,000 frames with 32,000 role instances. The corpus has also been annotated with 20 Named Entity classes and a 200-category semantic ontology for nouns.
ISLRN: 799-402-906-876-5
This extended version of the Bulgarian Pronunciation Dictionary called Bulgarian-Dict260k contains pronunciations of more than 260,000 word forms.
ISLRN: 574-579-221-841-3
The Accented English part of the GlobalPhone resources contains 63 recording sessions of Bulgarian, Chinese, German, and Indian native speakers reading 37 English sentences each, produced in GlobalPhone-style, i.e. 16kHz PCM encoded audio recordings of utterance-segmented read speech from the newspaper domain.
ISLRN: 910-309-096-523-6
The parallel EMG-Acoustic English GlobalPhone language resource contains 63 recording sessions from 8 speakers articulating speech in three speaking modes, audible, whispered, and silent by reading three times 50 English sentences in GlobalPhone-style, i.e. 16kHz PCM encoded audio recordings of utterance-segmented read speech from the newspaper domain. Speech is recorded in a parallel fashion, i.e. synchronously by a standard close-talking microphone and by surface electrodes capturing the muscle activities of the articulatory muscles in the face (ElectroMyoGraphy, EMG).
ISLRN: 340-994-352-616-4
This Frisian corpus consists of 203 audio segments, each approximately 5 minutes long, extracted from various radio programs covering a time span of almost 50 years (1966-2015), adding a longitudinal dimension to the database. The content of the recordings is very diverse, including radio programs about culture, history, literature, sports, nature, agriculture, politics, society and languages. There are 309 identified speakers in the FAME! Speech Corpus, 21 of whom appear at least 3 times in the database. The total duration of the manually annotated radio broadcasts sums up to 18 hours, 33 minutes and 57 seconds.
Nov 16
ISLRN: 447-281-370-489-0
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and a reference translation in English. The source texts are a selection of private emails collected from the daily life and business domains.
ISLRN: 255-358-917-604-3
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and a reference translation in French. The source texts are a selection of private emails collected from the daily life and business domains.
ISLRN: 985-956-234-357-3
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in English. The source texts are a selection of private emails collected from the daily life and business domains.
ISLRN: 239-027-077-538-0
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in French. The source texts are a selection of private emails collected from the daily life and business domains.
ISLRN: 583-080-936-563-9
SecuVoice consists of single-channel utterances in Spanish containing sequences of isolated digits from zero to nine. SecuVoice contains a total of 7,098 utterances (169 speakers x 42 utt./speaker) with 34,476 digits (204 digits/speaker). Along with the WAV files containing the speech utterances, XML annotation files containing detailed information about the speakers and the recorded sequences of digits are provided.
LRs now available for commercial purposes:
Oct 16
ISLRN: 922-732-502-473-8
This is a parallel corpus of 10,000 words in Arabic and 4 reference translations in French. The source texts are articles collected in 2012 from the Arabic version of Le Monde Diplomatique.
ELRA-W0099 TRAD Arabic-English Newspaper Parallel corpus – Test set 1
ISLRN: 764-187-795-074-0
This is a parallel corpus of 10,000 words in Arabic and 2 reference translations in English. The source texts are articles collected in 2012 from the Arabic version of Le Monde Diplomatique.
ELRA-W0100 TRAD Arabic-French Newspaper Parallel corpus – Test set 2
ISLRN: 722-323-886-920-3
This is a parallel corpus of 10,000 words in Arabic and 2 reference translations in French. The source texts are articles collected in May 2013 from the Arabic version of Le Monde Diplomatique.
ELRA-W0101 TRAD Arabic-French Parallel corpus of transcribed Broadcast News Speech
ISLRN: 862-201-329-808-4
This is a parallel corpus of 10,000 words in Arabic and 4 reference translations in French. The source texts are transcriptions of broadcast news in Arabic recorded on France 24.
ELRA-W0102 TRAD Arabic-English Parallel corpus of transcribed Broadcast News Speech
ISLRN: 812-050-111-234-9
This is a parallel corpus of 10,000 words in Arabic and 2 reference translations in English. The source texts are transcriptions of broadcast news in Arabic recorded on France 24.
ELRA-W0103 TRAD Arabic-French Web domain (blogs) Parallel corpus
ISLRN: 138-395-895-757-7
This is a parallel corpus of 10,000 words in Arabic and 4 reference translations in French. The source texts are blog articles from 2008 to 2013.
ELRA-W0104 TRAD Arabic-English Web domain (blogs) Parallel corpus
ISLRN: 762-161-069-435-5
This is a parallel corpus of 10,000 words in Arabic and 2 reference translations in English. The source texts are blog articles from 2008 to 2013.
ELRA-W0105 TRAD Arabic-French Mailing lists Parallel corpus – Test set
ISLRN: 895-850-015-188-4
This is a parallel corpus of 10,000 words in Arabic and 4 reference translations in French. The source texts are emails collected from Wikiar-I, a mailing list for discussions about the Arabic Wikipedia. Emails are dated from 2010 to 2012.
ELRA-W0106 TRAD Arabic-English Mailing lists Parallel corpus – Test set
ISLRN: 858-529-510-480-2
This is a parallel corpus of 10,000 words in Arabic and 2 reference translations in English. The source texts are emails collected from Wikiar-I, a mailing list for discussions about the Arabic Wikipedia. Emails are dated from 2010 to 2012.
ELRA-W0107 TRAD Arabic-French Mailing lists Parallel corpus – Development set
ISLRN: 333-026-450-858-0
This is a parallel corpus of 10,000 words in Arabic and a reference translation in French. The source texts are emails collected from Wikiar-I, a mailing list for discussions about the Arabic Wikipedia. The collected emails are dated from 2004 to 2007.
ELRA-W0108 TRAD Arabic-English Mailing lists Parallel corpus – Development set
ISLRN: 213-044-240-074-6
This is a parallel corpus of 10,000 words in Arabic and a reference translation in English. The source texts are emails collected from Wikiar-I, a mailing list for discussions about the Arabic Wikipedia. The collected emails are dated from 2004 to 2007.
ELRA-W0109 TRAD Chinese-French Web domain (blogs) Parallel corpus
ISLRN: 464-017-697-777-3
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in French. The source texts are blog articles dealing with various subjects such as economy, environment, society, technologies, etc. Articles are dated from June 2013.
ELRA-W0110 TRAD Chinese-English Web domain (blogs) Parallel corpus
ISLRN:982-341-079-331-4
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in English. The source texts are blog articles dealing with various subjects such as economy, environment, society, technologies, etc. Articles are dated from June 2013.
ELRA-W0111 TRAD Chinese-French News Articles Parallel corpus
ISLRN:153-566-144-442-2
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in French. The source texts are newspaper articles from the Chinese version of Voice of America. Articles are dated from 2011 and 2012.
ELRA-W0112 TRAD Chinese-English News Articles Parallel corpus
ISLRN:626-096-751-907-7
This is a parallel corpus of 15,000 characters in Chinese (equivalent to 10,000 words) and 2 reference translations in English. The source texts are newspaper articles from the Chinese version of Voice of America. Articles are dated from 2011 and 2012.
Sep 16
ISLRN: 866-568-447-697-8
This speech corpus was recorded through a Neumann TLM 103 Studio Microphone by one male speaker in South Levantine Arabic (Damascene accent) in a professional studio. Speech synthesized from this corpus has produced a high-quality, natural voice. It consists of 1813 utterances for a total of 3.7 hours, with orthographic and phonetic transcriptions.
ELRA-S0385 Serbian Emotional Speech database (GEES)
ISLRN: 462-780-920-598-3
The database contains recordings from six actors, three of each gender. The following emotions have been recorded: neutral, anger, happiness, sadness and fear. The overall size of the database is 2790 recordings or approximately 3 hours of speech.
July 16
ELRA-T0376 Collins Multilingual database (MLD) – WordBank
ISLRN: 990-814-402-335-7
This multilingual lexicon covers Real Life Daily vocabulary in 32 languages. It contains 10,000 words for each language, XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs, and 10,000 additional headwords for 12 languages.
ELRA-T0377 Collins Multilingual database (MLD) – PhraseBank
ISLRN: 452-383-219-228-0
This multilingual dataset covers Real Life Daily vocabulary in 28 languages. It contains 2,000 phrases for each language, organised under 12 topics and 67 subtopics. Romanization is provided for Arabic, Farsi and Hindi.
ELRA-S0382 Collins Multilingual database (MLD) – WordBank with audio files
ISLRN: 309-438-781-042-2
This multilingual lexicon covers Real Life Daily vocabulary in 26 languages. It contains 10,000 words for each language, XML-annotated for part-of-speech, gender, irregular forms and disambiguating information for homographs, with the corresponding audio files recorded by a native speaker and 10,000 additional headwords with audio for 12 languages.
ELRA-S0383 Collins Multilingual database (MLD) – PhraseBank with audio files
ISLRN: 398-655-047-044-5
This multilingual dataset covers Real Life Daily vocabulary in 28 languages. It contains 2,000 phrases for each language, organised under 12 topics and 67 subtopics, and the corresponding audio files recorded by a native speaker.
Apr 16
ELRA-S0381 TRAD Pashto Broadcast News Speech Corpus
ISLRN: 918-508-885-913-7
This corpus contains 108 hours of transcribed broadcast news recordings, covering more than 1,000 speakers. Transcriptions are provided together with the audio files and include about 46,000 segments and 1.1M words.
ELRA-W0092 TRAD Pashto Monolingual text Corpus
ISLRN: 394-903-293-388-0
This is a monolingual text corpus in Pashto. The corpus contains about 112,000,000 tokens collected from 46 different blogs and websites.
ELRA-W0093 TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech – Training data
ISLRN: 802-643-297-429-4
This corpus consists of the transcription of 106 hours of recordings in Pashto from the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381) translated into French. It contains about 832,000 source words and 747,000 target words.
ELRA-W0094 TRAD Pashto-French Parallel corpus of transcribed Broadcast News Speech – Test data
ISLRN: 547-897-479-723-3
This is a parallel corpus, which contains 10,000 Pashto words translated into French. The source texts come from 3 broadcast news transcriptions of the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381).
ELRA-W0095 TRAD Pashto-English Parallel corpus of transcribed Broadcast News Speech – Test data
ISLRN: 006-102-605-738-4
This is a parallel corpus, which contains 10,000 Pashto words translated into English. The source texts come from 3 broadcast news transcriptions of the TRAD Pashto Broadcast News Speech Corpus (ELRA-S0381).
ELRA-W0096 TRAD Pashto-French News Articles Parallel corpus
ISLRN: 649-628-149-051-7
This is a parallel corpus, which contains 10,000 Pashto words translated into French by two different translators. The source texts have been collected from the following news websites: Azadiradio, Mashaal and Voice of America Pashto.
ELRA-W0097 TRAD Pashto-English News Articles Parallel corpus
ISLRN: 612-936-517-010-2
This is a parallel corpus, which contains 10,000 Pashto words translated into English by two different translators. The source texts have been collected from the following news websites: Azadiradio, Mashaal and Voice of America Pashto.
ISLRN: 168-132-570-218-1
FoxPersonTracks is a person track dataset dedicated to person re-identification. The dataset is built from a set of real-life TV shows broadcast on the French TV channels BFMTV and LCP, provided during the REPERE challenge. It contains a total of 4,604 person tracks (short video sequences featuring an individual with no background) from 266 persons. The dataset also provides re-identification results using space-time histograms as a baseline, together with an evaluation tool to ease comparison with other re-identification methods.
March 16
ISLRN: 800-190-274-236-9
The corpus consists of 10 million German-English parallel sentences that were crawled from the internet between 10/2013 and 04/2015. Web pages have been automatically categorized for subject area. The corpus is available in TMX and Moses format (encoding UTF-8).
ISLRN: 067-486-870-902-0
Large Farsdat (L-FARSDAT) is a Persian (Farsi) Speech Database containing about 73 hours of read speech from formal Farsi texts (newspapers) recorded by 100 speakers. The sampling rate is 22050 Hz for the whole corpus and the average SNR is about 28 dB. The corpus has been segmented and labelled at word and sentence levels and each word has been annotated according to the 29 standard Persian phonemes.
January 16
ISLRN: 200-331-212-512-8
The GlobalPhone Swahili corpus contains 7,728 utterances spoken by 70 speakers. Native speakers of Swahili were asked to read prompted sentences of newspaper articles. The entire collection took place in Nairobi, Kenya.
ELRA-S0376 GlobalPhone Swahili Pronunciation Dictionary
ISLRN: 010-360-238-702-2
The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Swahili dictionary contains 10664 entries.
ELRA-S0377 GlobalPhone Ukrainian
ISLRN: 456-398-378-806-1
The GlobalPhone Ukrainian corpus contains 12,814 utterances spoken by 119 speakers. Native speakers of Ukrainian were asked to read prompted sentences of newspaper articles. The entire collection took place in Donetsk, Ukraine.
ELRA-S0378 GlobalPhone Ukrainian Pronunciation Dictionary
ISLRN: 022-652-862-222-7
The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Ukrainian dictionary contains 7748 entries/7740 words.
ELRA-S0379 JV_TDM Corpus
ISLRN: 371-240-320-910-4
This corpus provides a phonetic annotation of 37 chapters of the original French version of “Around the World in 80 Days” by Jules Verne read by a single speaker. Each chapter has been annotated in a separate .TextGrid file. The total audio size is 6h 41mn 36s with 5h 2mn 41s of speech. The .TextGrid files contain several annotation tiers: phoneme, number of characters, syllable, transcription, PoS, paragraph break, sentence break, prosodic annotations, breathing pauses.
ELRA-W0088 ROMBAC – Romanian balanced corpus
ISLRN: 162-192-982-061-0
ROMBAC is a Romanian corpus containing equal shares of texts from 5 different genres: journalism, legalese, fiction, medicine and biographical data for Romanian literary personalities. The entire corpus counts around 41,000,000 words, including punctuation. The corpus is annotated at paragraph, sentence, constituent group and word levels, and it provides morpho-syntactic information (MSD). It is xml encoded.
ELRA-W0089 NPChunks
ISLRN: 412-883-442-173-8
NPChunks is a training corpus containing approximately 1,000 sentences, with a total of 24,243 tokens, selected randomly from the written part of the CINTIL corpus. The corpus is PoS-annotated at token level, including punctuation. Noun Phrases were annotated with specific tags. It was automatically PoS-tagged with MBT tagger, and lemmatized with MBLEM, following the annotation scheme of the Corpus of Reference of Contemporary Portuguese.
ELRA-W0090 EUROPARL Corpus Parallel Corpora: Portuguese-English
ISLRN: 435-502-922-727-2
The Portuguese-English subpart of the EUROPARL Corpus was extracted from the proceedings of the European Parliament. It contains approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896 tokens of English (translation). It is composed of one text file for the English corpus and two files for the Portuguese version: a text file and an annotated file, containing a PoS tag and a lemma for each token.
ELRA-L0096 MCL – Multifunctional Computational Lexicon of Contemporary Portuguese
ISLRN: 489-956-642-755-8
MCL is a 26,443 lemma Frequency Lexicon with 140,315 tokens extracted from CORLEX, a contemporary Portuguese corpus (16,210,438 words). In order to extract the lexicon, all the different lexical forms occurring in the corpus were indexed and subsequently tagged morphosyntactically and lemmatised by PALAVROSO. Each lemma in MCL is followed by morphosyntactic and quantitative information.
ELRA-L0097 LEX-MWE-PT – Word Combination in Portuguese
ISLRN: 353-430-176-260-6
LEX-MWE-PT is a lexicon of European Portuguese containing multiword expressions (MWE) extracted from a balanced 50.8M-word written corpus. The lexicon covers 1,198 lemmas (composed of single words from different PoS categories: nouns, adjectives, verbs and adverbs); 12,753 MWE lemmas (which include inflectional variants of the MWE lemmas); and 242,233 concordances of those MWE manually verified.
Dec 15
ELRA-W0084 Arboretum treebank
ISLRN: 025-729-182-451-2
The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences. It consists of about 425,000 tokens and there are ca. 22,260 sentences/utterances containing 3 or more tokens. Arboretum provides named entity categories for all proper nouns. It also contains subclass categorisation for the pronoun and adverb word classes. The final version of the treebank consists of two independent versions, constituent trees and dependency trees, and is distributed in the following formats:
1. Native dependency format (Constraint Grammar format)
2. Dependency annotation converted to MALT xml format
3. Native constituent tree format (Cross-language VISL standard)
4. Constituent format converted to TIGER xml
ELRA-W0085 ROCO Romanian journalistic corpus
ISLRN: 312-617-089-348-7
ROCO is a Romanian journalistic corpus containing approximately 7.1 million tokens, the number of types being 231,626. It is rich in proper names, numerals and named entities. The corpus has been lemmatized and PoS annotated following the Multext-East morphosyntactic specifications, and it is XML encoded.
Nov 15
ELRA-L0095-01 GLiCom Spanish Wordform list – Regular word-forms + verb-clitic combinations
ISLRN: 529-126-116-826-1
GLiCom Spanish Wordform List v.1 is a computational lexicon of inflected wordforms in Spanish. Each entry has the following information: (i) lemma, (ii) morphosyntactic tag, and (iii) word type. The lexicon is distributed in two sublexicons: a list of wordforms which contains 1,152,242 entries, and a list of verb-clitic combinations which contains 4,283,637 entries.
ELRA-L0095-02 GLiCom Spanish Wordform list – Regular word-forms
ISLRN: 282-492-887-361-0
GLiCom Spanish Wordform List v.1 is a computational lexicon of inflected wordforms in Spanish. Each entry has the following information: (i) lemma, (ii) morphosyntactic tag, and (iii) word type. This set is a subdivision of the full lexicon: the list of wordforms, which contains 1,152,242 entries. For the full lexicon, see ELRA-L0095-01.
ELRA-L0095-03 GLiCom Spanish Wordform list – Verb-clitic combinations
ISLRN: 064-262-883-254-3
GLiCom Spanish Wordform List v.1 is a computational lexicon of inflected wordforms in Spanish. Each entry has the following information: (i) lemma, (ii) morphosyntactic tag, and (iii) word type. This set is a subdivision of the full lexicon: the list of verb-clitic combinations, which contains 4,283,637 entries. For the full lexicon, see ELRA-L0095-01.
June 15
ELRA-S0373 GVLEX tales corpus
ISLRN: 433-270-888-230-5
GVLEX tales corpus consists of 89 written tales, manually annotated in structures, speech turns, speakers, phrases, 7 of which were annotated by 2 human annotators (96 annotated texts in total); 12 tales read by a professional, transcribed and manually annotated, including audio files; and annotation and viewing software developed within the GV-LEX project.
Apr 15
ELRA-L0094 CEPLEXicon
ISLRN: 408-817-203-152-3
CEPLEXicon results from the automatic tagging of two corpora, using a tagger and the POS tag set. The automatic tagging was followed by a partial manual revision. This lexicon covers all the speech produced by seven monolingual Portuguese children aged 1;02.00 to 3;11.12, in a total of 114 files, each corresponding to 40-50 minutes of child-adult interaction in a naturalistic setting. The lexicon is presented in .xls format and includes 2201 lemmas, the number of occurrences of each lemma in three different age periods, frequency of the lemma in each period and age of first occurrence for each child.
March 15
ELRA-W0083 deL1L2IM corpus
ISLRN: 339-799-085-669-8
The deL1L2IM corpus is composed of 72 dialogues, each of them having a duration of 20 to 45 minutes. The whole corpus contains ca. 52,000 words and 4,800 messages and has a file size of 0.5 Mb. Nine pairs of participants – i.e. nine learners and four native speakers – were required, with 8 dialogues per pair. The interactions have undergone linguistic analysis whereby the annotation will be performed only on repair/correction sequences (incomplete learner error annotation). The corpus is delivered in one written text file (in XML format, customized under TEI P5).
Feb 15
ELRA-W0082 88milSMS. A corpus of authentic text messages in French
ISLRN: 024-713-187-947-8
A multidisciplinary team of linguists and computer scientists collected more than 88,000 authentic French text messages in Montpellier (2011), as part of the sud4science LR project. The text messages were semi-automatically anonymised, before being partially transcoded (into standardised French) and annotated.
ELRA-E0043 CLEFeHealth 2014 Task 3 Evaluation Package
ISLRN: 725-020-897-275-7
The CLEFeHealth 2014 Task 3 Evaluation Package contains data used for the User-centred health information retrieval Shared task at the CLEFeHealth Lab conducted in 2014. Task 3 aimed at evaluating information retrieval to address questions patients may have when reading clinical reports.
ELRA-E0044 REPERE Evaluation Package
ISLRN: 360-758-359-485-0
The REPERE Evaluation Package contains the visual annotation of 60 hours of French news TV shows, for the purpose of person recognition within TV programs. This annotation concerns both persons and written information appearing on screen. Provided data consists of:
- video files with indexes and with manual transcriptions in XGTF format (Viper),
- audio files compressed in WAV format with transcriptions in TRS format (Transcriber)
ELRA-E0045 MAURDOR Evaluation Package
ISLRN: 364-018-517-901-2
The MAURDOR project consists in evaluating systems for automatic processing of written documents. Collected written documents are scanned documents (printed, typewritten or manuscripts). This package contains 8,129 documents. Once collected, those documents were submitted to a manual annotation. This package contains the material provided to the evaluation campaign participants:
- Consistent development and test data corresponding to the application concerned;
- Tools for the automatic measurement of system performances;
- A common assessment protocol applicable to each processing stage, along with a complete automatic processing chain for written documents.
The documents are provided in TIFF format and the annotations are provided in XML format.
Jan 15
ELRA-W0081 Khresmoi manually annotated reference corpus
ISLRN: 764-036-829-417-7
This corpus is a collection of Khresmoi English web documents annotated with key entities (such as disease, drug). The corpus is divided into two parts: 1. The initial corpus: 625 documents from the Genetics Home Reference data set, automatically annotated with anatomical locations and diseases, and manually corrected by 3-4 annotators. Size of documents: between 26 and 8,306 tokens each. 2. The main corpus: 6,950 English documents from the Khresmoi crawl and 5,518 English Wikipedia pages, automatically annotated through the GATE Platform for Anatomy, Disease, Drug and Investigation. Size of documents: between 200 and 2,000 tokens each. The corpus is using the GATE XML format.
ELRA-T0375 ACL RD-TEC: A Reference Dataset for Terminology Extraction and Classification Research in Computational Linguistics
ISLRN: 699-305-362-089-6
Sep 14
ELRA-S0371 PortMedia French and Italian corpus
This corpus contains 700 transcribed dialogues from about 140 French speakers and 604 transcribed dialogues from about 150 Italian speakers (several dialogues per speaker). The method chosen for the corpus construction process is that of a ‘Wizard of Oz’ (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of touristic information and reservation. A manual transcription and semantic annotation of the corpus are provided with corresponding wave files.
ELRA-W0078 NE3L named entities Arabic corpus
The Arabic corpus contains 103,363 words coming from articles extracted from “Le Monde Diplomatique” newspaper, and published in 2004. 2 named entity categories were taken into account: Time and Amount.
ELRA-W0079 NE3L named entities Chinese corpus
The Chinese corpus contains 79,302 words coming from articles extracted from “Le Monde Diplomatique” newspaper, and published in 2001. 3 named entity categories were taken into account: Person, Place and Organisation.
ELRA-W0080 NE3L named entities Russian corpus
The Russian corpus contains 75,784 words coming from articles extracted from “Izvestia” newspaper, and published in 1995. 2 named entity categories were taken into account: Time and Amount.
July 14
ELRA-S0366 LECTRA (LECture TRAnscriptions in European Portuguese)
This corpus is composed of the audio and the manual transcriptions from seven 1-semester University courses in Portuguese. The corpus contains a total of 28 hours of audio speech that were manually transcribed by several trained annotators. The corpus is comprised of technical University lectures.
ELRA-S0367 CORAL Corpus
The CORAL Corpus is a collection of spoken dialogues in European Portuguese. It consists of 56 dialogues about a predetermined subject: maps. One of the participants (giver) has a map with some landmarks and a route drawn between them; the other (follower) also has landmarks, but no route, and consequently must reconstruct it. Only orthographic transcription was done for the whole corpus. A pilot recording was annotated at several levels.
ELRA-S0370 MoveOn Speech and Noise Corpus
The MoveOn Speech and Noise Corpus is a corpus recorded under the extreme conditions of the motorcycle environment within the MoveOn project. The speech utterances are in British English approaching the issue of command and control and template driven dialog systems with a focus on – but not limited to – the police domain. The major part of the corpus comprises noisy speech and environmental noise recorded on a motorcycle. Several clean speech recording sessions with the same recording setup (including the motorcycle helmet) in an office environment complete the corpus.
May 14
ELRA-S0368 Nepali Spoken Corpus
The Nepali Spoken Corpus contains audio recordings from different social activities within their natural settings as much as possible, with phonologically transcribed and annotated texts, and information about the participants. A total of 17 types of activity were recorded. The total temporal duration of the recorded material is 31 hours and 26 minutes.
ELRA-S0369 CLIPS_MT_MANUAL
CLIPS_MT_MANUAL is a sub-corpus of the original Italian CLIPS corpus (Corpora e Lessici dell’Italiano Parlato e Scritto). This corpus contains 3228 inspected and partially repaired WAV signal files, each containing one dialogue turn (*.wav), 3228 corrected original CLIPS annotation files (*.acs, *.phn, *.std, *.wrd), 3228 BAS Partitur files containing the annotation tiers ORT, KAN and SAP (*.par), 3228 EMU database annotation files (*.vot, *.hlb) covering 30 maptask dialogues performed by 30 speakers (each speaker pair performing two different map tasks) recorded in 15 different locations in Italy in 2000-2004.
March 14
ELRA-E0042 CLEFeHealth 2013 Evaluation Package
The CLEFeHealth 2013 Task 3 Evaluation Package contains data used for the User-centred health information retrieval Shared task at the CLEFeHealth Lab conducted in 2013. Task 3 aimed at evaluating information retrieval to address questions patients may have when reading clinical reports.
Jan 14
ELRA-W0076 Nepali Monolingual written corpus
The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample (CS) represents the collection of Nepali written texts from 15 different genres with 2000 words each published between 1990 and 1992. It is based on FLOB/FROWN corpora and contains 802,000 words. The general corpus (GC) consists of written texts collected opportunistically from a wide range of sources such as websites, newspapers, books, publishers and authors. It contains 1,400,000 words.
ELRA-W0077 English-Nepali Parallel Corpus
This corpus consists of a collection of national development texts in English and Nepali. A small set of data is aligned at the sentence level (27,060 English words; 21,756 Nepali words), and a larger set of texts at the document level (617,340 English words; 596,571 Nepali words). An additional set of monolingual data in Nepali is also provided (386,879 words in Nepali).
Dec 13
ELRA-S0365 aGender
aGender contains speech sample recordings over public telephone lines with read and (semi-)spontaneous speech. Native German speakers called a voice portal from their private phone, read text and answered some open questions. The corpus contains the voices of 945 German speakers (approx. minimum of 100 speakers per class), each delivering 18 speech items in up to six different sessions.
ELRA-W0074 Amharic-English bilingual corpus
The Amharic-English bilingual corpus contains parallel text from legal and news domains in Amharic script, in transliterated form and in English. The corpus comprises 232,653 words in Amharic and 291,701 in English.
Nov 13
The GlobalPhone Pronunciation Dictionaries: GlobalPhone is a multilingual speech and text database collected at Karlsruhe University, Germany. The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 17 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swedish (about 25000 entries), Turkish (31330 entries/31087 words), Vietnamese (38504 entries/29974 words), Chinese-Mandarin (73388 pronunciations), and Korean (3500 syllables).
*** NEW ***
ELRA-S0363 GlobalPhone Chinese-Mandarin Pronunciation Dictionary
ELRA-S0364 GlobalPhone Korean Pronunciation Dictionary
Special prices are offered for a combined purchase of several GlobalPhone languages.
Available GlobalPhone Pronunciation Dictionaries are listed below (click on the links for further details):
ELRA-S0340 GlobalPhone French Pronunciation Dictionary
ELRA-S0341 GlobalPhone German Pronunciation Dictionary
ELRA-S0348 GlobalPhone Japanese Pronunciation Dictionary
ELRA-S0350 GlobalPhone Arabic Pronunciation Dictionary
ELRA-S0351 GlobalPhone Bulgarian Pronunciation Dictionary
ELRA-S0352 GlobalPhone Czech Pronunciation Dictionary
ELRA-S0353 GlobalPhone Hausa Pronunciation Dictionary
ELRA-S0354 GlobalPhone Polish Pronunciation Dictionary
ELRA-S0355 GlobalPhone Portuguese (Brazilian) Pronunciation Dictionary
ELRA-S0356 GlobalPhone Swedish Pronunciation Dictionary
ELRA-S0358 GlobalPhone Croatian Pronunciation Dictionary
ELRA-S0359 GlobalPhone Russian Pronunciation Dictionary
ELRA-S0360 GlobalPhone Spanish (Latin American) Pronunciation Dictionary
ELRA-S0361 GlobalPhone Turkish Pronunciation Dictionary
ELRA-S0362 GlobalPhone Vietnamese Pronunciation Dictionary
Sep 13
The GlobalPhone Pronunciation Dictionaries: GlobalPhone is a multilingual speech and text database collected at Karlsruhe University, Germany. The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 15 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swedish (about 25000 entries), Turkish (31330 entries/31087 words), and Vietnamese (38504 entries/29974 words). Three other languages will also be released: Chinese-Mandarin, Korean and Thai.
Available GlobalPhone Pronunciation Dictionaries are listed below (click on the links for further details):
ELRA-S0358 GlobalPhone Croatian Pronunciation Dictionary
ELRA-S0359 GlobalPhone Russian Pronunciation Dictionary
ELRA-S0360 GlobalPhone Spanish (Latin American) Pronunciation Dictionary
ELRA-S0361 GlobalPhone Turkish Pronunciation Dictionary
ELRA-S0362 GlobalPhone Vietnamese Pronunciation Dictionary
Jun 13
The GlobalPhone Pronunciation Dictionaries: GlobalPhone is a multilingual speech and text database collected at Karlsruhe University, Germany. The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 10 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Polish (36484 entries), Portuguese (Brazilian) (54146 entries/54130 words) and Swedish (about 25000 entries). Eight other languages will also be released: Chinese-Mandarin, Croatian, Korean, Russian, Spanish (Latin American), Thai, Turkish, and Vietnamese.
Available GlobalPhone Pronunciation Dictionaries are listed below (click on the links for further details):
ELRA-S0340 GlobalPhone French Pronunciation Dictionary
ELRA-S0341 GlobalPhone German Pronunciation Dictionary
ELRA-S0348 GlobalPhone Japanese Pronunciation Dictionary
ELRA-S0350 GlobalPhone Arabic Pronunciation Dictionary
ELRA-S0351 GlobalPhone Bulgarian Pronunciation Dictionary
ELRA-S0352 GlobalPhone Czech Pronunciation Dictionary
ELRA-S0353 GlobalPhone Hausa Pronunciation Dictionary
ELRA-S0354 GlobalPhone Polish Pronunciation Dictionary
ELRA-S0355 GlobalPhone Portuguese (Brazilian) Pronunciation Dictionary
ELRA-S0356 GlobalPhone Swedish Pronunciation Dictionary
ELRA-E0041 CHIL 2007+ Evaluation Package
The CHIL Seminars are scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. The language is European English spoken by non-native speakers. The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, close-talking and far-field microphone data of the speaker’s voice and background sounds. The CHIL 2007+ Evaluation Package includes: 1) CHIL 2007 Evaluation Package (see ELRA-E0033) and 2) additional annotations which have been created within the scope of the Metanet4u Project (ICT PSP No 270893), sponsored by the European Commission.
Feb 13
ELRA-W0073 Quaero Old Press Extended Named Entity corpus
This corpus consists of the manual annotation of 76 newspaper issues published in 1890-1891 and provided by the French National Library (Bibliothèque Nationale de France). Three different titles are used (Le Temps, La Croix and Le Figaro) for a total of 295 pages. The corpus is fully manually annotated according to the Quaero extended and structured named entity definition.
ELRA-S0349 Quaero Broadcast News Extended Named Entity corpus
This corpus consists of the manual annotation of (i) the ESTER 2 (see also ELRA-S0338) manual transcription corpus and (ii) the Quaero Speech Recognition Evaluation corpus (manual and automatic transcriptions coming from 3 different ASR systems). The corpus is fully manually annotated according to the Quaero extended and structured named entity definition.
ELRA-W0057 PANACEA English-French and English-Greek parallel corpus acquired for Environment domain
This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Environment domain automatically acquired from the web during 2010 and 2011. It was acquired in the framework of the PANACEA project. Data and language pairs are split into training, test and development test sets.
ELRA-W0058 PANACEA English-French and English-Greek parallel corpus acquired for Labour Legislation domain
This package consists of an English-French and English-Greek sentence-aligned parallel corpus from the Labour Legislation domain automatically acquired from the web during 2010 and 2011. It was acquired in the framework of the PANACEA project. Data and language pairs are split into training, test and development test sets.
ELRA-W0063 PANACEA Environment English monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the English language and were automatically classified as relevant to the “Environment” domain. It was constructed in the summer of 2011. It contains 50,541,538 tokens, divided into a total of 28,071 documents that were crawled from 3,121 web sites.
ELRA-W0064 PANACEA Labour English monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the English language and were automatically classified as relevant to the “Labour Legislation” domain. It was constructed in the summer of 2011. It contains 46,431,351 tokens, divided into a total of 15,197 documents that were crawled from 1,558 web sites.
ELRA-W0065 PANACEA Environment French monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the French language and were automatically classified as relevant to the “Environment” domain. It was constructed in the summer of 2011. It contains 47,364,125 tokens, divided into a total of 23,514 documents that were crawled from 1,969 web sites.
ELRA-W0066 PANACEA Labour French monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the French language and were automatically classified as relevant to the “Labour Legislation” domain. It was constructed in the summer of 2011. It contains 56,440,425 tokens, divided into a total of 26,675 documents that were crawled from 1,391 web sites.
ELRA-W0067 PANACEA Environment Greek monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek language and were automatically classified as relevant to the “Environment” domain. It was constructed in the summer of 2011. It contains 27,958,530 tokens, divided into a total of 16,073 documents that were crawled from 1,063 web sites.
ELRA-W0068 PANACEA Labour Greek monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Greek language and were automatically classified as relevant to the “Labour Legislation” domain. It was constructed in the summer of 2011. It contains 21,077,196 tokens, divided into a total of 7,124 documents that were crawled from 598 web sites.
ELRA-W0069 PANACEA Environment Italian monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian language and were automatically classified as relevant to the “Environment” domain. It was constructed in the summer of 2011. It contains 40,044,852 tokens, divided into a total of 16,159 documents that were crawled from 1,211 web sites.
ELRA-W0070 PANACEA Labour Italian monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Italian language and were automatically classified as relevant to the “Labour Legislation” domain. It was constructed in the summer of 2011. It contains 70,563,320 tokens, divided into a total of 12,706 documents that were crawled from 864 web sites.
ELRA-W0071 PANACEA Environment Spanish monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish language and were automatically classified as relevant to the “Environment” domain. It was constructed in the summer of 2011. It contains 46,225,624 tokens, divided into a total of 26,009 documents that were crawled from 2,053 web sites.
ELRA-W0072 PANACEA Labour Spanish monolingual corpus
This corpus consists of documents that were acquired from the web, were automatically detected to be in the Spanish language and were automatically classified as relevant to the “Labour Legislation” domain. It was constructed in the summer of 2011. It contains 53,922,118 tokens, divided into a total of 13,188 documents that were crawled from 1,015 web sites.
Dec 12
ELRA-W0059 LT Corpus
The LT Corpus is composed of 70 fiction texts from renowned Portuguese authors. The corpus contains 1,781,083 tokens. The texts date from before 1940. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file has one token per line, followed by POS tag and lemma, and is annotated for NP chunks.
ELRA-W0060 PTPARL Corpus
The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. The corpus contains 1,000,441 tokens. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb file has one token per line, followed by POS tag and lemma, and is annotated for NP chunks.
ELRA-W0061 CINTIL-DependencyBank
The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency graphs and grammatical function tags composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) that are used for regression testing of the computational grammar that supported the annotation of the corpus.
ELRA-W0062 CINTIL-DeepBank
The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical representations, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus.
Nov 12
ELRA-S0347 GlobalPhone Hausa
The GlobalPhone Hausa corpus contains 7,895 utterances spoken by 33 male and 69 female speakers in the age range of 16 to 60 years. Native speakers of Hausa were asked to read prompted sentences of newspaper articles. The entire collection took place in 5 different locations in Cameroon. The speech data contains a variety of accents: Maroua, Douala, Yaoundé, Bafoussam, Ngaoundéré, and Nigeria.
Sep 12
ELRA-S0345 Spoken Portuguese Corpus
The Spoken Portuguese corpus consists of a total of 86 recordings (8h44m), collected among sociolinguistically diverse speakers having Portuguese as mother tongue or as second language.
ELRA-S0346 Fundamental Portuguese Corpus
The Fundamental Portuguese Corpus is a corpus of spoken language, collected between 1970 and 1974, composed of 1800 recordings (500 hours) made in Continental Portugal and the Islands. Of these 1800 conversations, a sample was selected and transcribed.
ELRA-W0055 CINTIL-TreeBank
The CINTIL-TreeBank is a corpus of syntactic constituency trees of Portuguese texts composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), novels (399 sentences; 3,082 tokens).
ELRA-W0056 CINTIL-PropBank
The CINTIL-PropBank is a corpus of sentences annotated with their constituency structure and semantic role tags, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens).
July 12
ELRA-S0343 VERIF1DE
The speech corpus VERIF1DE contains 20 recordings (sessions) of 150 German speakers each over the telephone network (10 sessions over fixed network and 10 sessions over GSM). Each session contains 40 single recordings, mainly speech read from a prompt sheet.
ELRA-S0344 LILA Hindi Belt database
The LILA Hindi Belt database comprises 2,023 Hindi speakers (1,011 males and 1,012 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered 83 read and spontaneous items.
ELRA-M0013 Bilingual Collocational Dictionary (Horst Bogatz)
This new release contains 69,000 English headwords (instead of 40,000 for the previous release). The bilingual English-German collocational dictionary consists of around 69,000 English headwords, including concepts expressed with more than one word (e.g. “the awareness of the environment” or “lame duck”) and hyphenated compounds. It contains verbs, adjectives, synonyms and phrases that collocate with the headword. It provides the German equivalents for the headwords as well as their English synonyms.
Jan 12
ELRA-S0324 Catalan-SpeechDat For the Fixed Telephone Network Database
This speech database contains the recordings of 2000 Catalan speakers who called from fixed telephones and who were recorded over the fixed PSTN using an ISDN-BRI interface. Each speaker uttered around 50 read and spontaneous items. The speech database follows the specifications made within the SpeechDat (II) project. The database was validated by UVIGO. The Catalan-SpeechDat for the Fixed Telephone Network Database was funded by the Catalan Government.
ELRA-S0325 Catalan-SpeechDat for the Mobile Telephone Network Database
This speech database contains the recordings of 2000 Catalan speakers who called from GSM telephones and who were recorded over the fixed PSTN using an ISDN-BRI interface. Each speaker uttered around 50 read and spontaneous items. The speech database follows the specifications made within the SpeechDat (II) project. The database was validated by UVIGO. The Catalan-SpeechDat for the Mobile Telephone Network Database was funded by the Catalan Government.
ELRA-S0326 Catalan SpeechDat-Car database
The Catalan SpeechDat-Car database contains the in-car recordings of 300 speakers who uttered around 120 read and spontaneous items. Each speaker recorded two sessions. Recordings have been made through 4 different channels, via in-car microphones (1 close-talk microphone, 3 far-talk microphones). The 300 Catalan speakers were selected from 5 different dialectal regions and are balanced in gender and age groups. The database was validated by UVIGO. The Catalan-SpeechDat-Car Database was funded by the Catalan Government.
ELRA-S0327 Catalan Speecon database
The Catalan Speecon database comprises the recordings of 550 adult Catalan speakers who uttered over 290 items (read and spontaneous). The data were recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place). The speech database follows the specifications made within the EU-funded Speecon project. The database was validated by UVIGO. The Catalan-Speecon Database was funded by the Catalan Government.
ELRA-S0328 Spanish EUROM.1
EUROM1 is a multilingual European speech database. It contains over 60 speakers per language who pronounced numbers, sentences, isolated words … using a close-talking microphone in an anechoic room. Equivalent corpora for each of the European languages exist already, with the same number of speakers selected in the same way, and recorded in the same conditions with common file formats.
ELRA-S0329 Emotional speech synthesis database
This database contains the recordings of one male and one female professional Spanish speaker recorded in a noise-reduced room. It consists of recordings and annotations of read text material in neutral style plus six MPEG expressions, all in fast, slow, soft and loud speech styles. The text material is composed of 184 items including phonetically balanced sentences, digits and isolated words. The text material was the same for all the modes and styles, giving a total of 3h 59min of recorded speech for the male speaker and 3h 53min for the female speaker. The Emotional speech synthesis database was created within the scope of the EU-funded Interface project.
ELRA-S0330 FESTCAT Catalan TTS baseline male speech database
This database contains the recordings of one male Catalan professional speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). The FESTCAT Catalan TTS Baseline Male Speech Database was created within the scope of the FESTCAT project, funded by the Catalan Government.
ELRA-S0331 FESTCAT Catalan TTS baseline female speech database
This database contains the recordings of one female Catalan professional speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). The FESTCAT Catalan TTS Baseline Female Speech Database was created within the scope of the FESTCAT project, funded by the Catalan Government.
ELRA-S0332 FESTCAT Catalan TTS baseline speech database – 8 speakers
This database contains the recordings of four female and four male Catalan professional speakers recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It consists of the recordings and annotations of read text material of approximately 1 hour of speech per speaker for baseline applications (Text-to-Speech systems). The FESTCAT Catalan TTS baseline speech database – 8 speakers was created within the scope of the FESTCAT project funded by the Catalan Government.
ELRA-S0333 Spanish Festival HTS models – male speech
This database contains the Festival HTS models trained with 10h of speech from the TC-STAR Spanish Baseline Male Speech Database (ELRA-S0310).
ELRA-S0334 Spanish Festival HTS models – female speech
This database contains the Festival HTS models trained with 10h of speech from the TC-STAR Spanish Baseline Female Speech Database (ELRA-S0309).
ELRA-S0335 Bilingual (Spanish-English) Speech synthesis HTS models
This database contains Bilingual (English and Spanish) Festival HTS models. Models were trained with 9h of speech from 2 female bilingual speakers and 2 male bilingual speakers. Each speaker recorded 2h 15 min per language. The speech data can be found in the TC-STAR Bilingual Voice-Conversion Spanish Speech Database (ELRA-S0311) and in the TC-STAR Bilingual Expressive Spanish Speech Database (ELRA-S0313).
ELRA-S0336 Spanish Festival voice male
This database contains the recordings of one male Spanish speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It comprises read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). The database includes Festival-compatible annotations. The recordings can also be found under TC-STAR Spanish Baseline Male Speech Database (ELRA-S0310).
ELRA-S0337 Spanish Festival voice female
This database contains the recordings of one female Spanish speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It comprises read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). The database includes Festival-compatible annotations. The recordings can also be found under TC-STAR Spanish Baseline Female Speech Database (ELRA-S0309).
Nov 11
ELRA-S0323 European Parliament Interpretation Corpus (EPIC)
The EPIC corpus is a parallel corpus of European Parliament speeches and their corresponding simultaneous interpretations. This corpus includes source speeches in Italian, English and Spanish and interpreted speeches in all possible combinations and directions. It contains a total of 357 speeches (177,295 words). The corpus has been orthographically transcribed. Non-tagged transcripts in text format are also available.
Sep 11
ELRA-S0319 GlobalPhone Bulgarian
ELRA-S0320 GlobalPhone Polish
ELRA-S0321 GlobalPhone Thai
ELRA-S0322 GlobalPhone Vietnamese
The GlobalPhone Corpus: The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of 19 spoken languages.
Update of ELRA-W0040 Venice Italian Treebank (VIT)
The new version of VIT has a totally revised constituent-based representation and a completely new dependency-based representation which has been achieved by semi-automatic procedures.
The VIT (Venice Italian Treebank) contains about 272,000 words distributed over six different domains: bureaucratic, political, economic and financial, literary, scientific, and news. In addition, some 60,000 tokens of spoken dialogues in different Italian varieties were annotated.
Apr 11
ELRA-S0314 LILA Marathi database
The LILA Marathi database comprises 2,002 Marathi speakers (992 males and 1,010 females) recorded over the Indian mobile telephone network. Each speaker uttered around 46 read and spontaneous items.
ELRA-S0315 A-SpeechDB
A-SpeechDB© is an Arabic speech database that contains about 20 hours of continuous speech recorded through one desktop omni microphone by 205 native speakers (about 30% of females and 70% of males), aged between 20 and 45. Automatically generated transcriptions are provided with a manually revised version for each sentence.
ELRA-S0316 SmartKom Home (SKH)
Release SKH 1.0 contains 130 recordings in the technical setup (“scenario”) SmartKom Home which should be an intelligent communication assistant for the private environment. Naive users were asked to test a “prototype” for a market study not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks in a period of 4.5 minutes while they were left alone with the system.
ELRA-S0317 SmartKom Mobil (SKM)
Release SKM 1.0 contains 146 recordings in the technical setup (“scenario”) SmartKom Mobil which is a portable PDA equipped with a net link and additional intelligent communication devices. Naive users were asked to test a “prototype” for a market study not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks in a period of 4.5 minutes while they were left alone with the system.
ELRA-S0318 SmartKom Audio (SKAUDIO)
Release SKAUDIO 1.0 contains all audio channel recordings of the SmartKom corpora SmartKom Public (cf. ELRA-S0136), SmartKom Home (cf. ELRA-S0316) and SmartKom Mobil (cf. ELRA-S0317).
Nov 10
ELRA-E0036 CLEF AdHoc-News Test Suites (2004-2008) – Evaluation Package
The CLEF AdHoc-News Test Suites (2004-2008) contain the data used for the main AdHoc track of the CLEF campaigns carried out from 2004 to 2008. This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual news collections.
ELRA-E0037 CLEF Domain Specific Test Suites (2004-2008) – Evaluation Package
The CLEF Domain Specific Test Suites (2004-2008) contain the data used for the Domain Specific track of the CLEF campaigns carried out from 2004 to 2008. This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual collections of scientific articles.
ELRA-E0038 CLEF Question Answering Test Suites (2003-2008) – Evaluation Package
The CLEF Question Answering Test Suites (2003-2008) contain the data used for the Question Answering (QA) track of the CLEF campaigns carried out from 2003 to 2008. This track tested the performance of monolingual, bilingual and multilingual Question Answering systems on multilingual collections of news documents.
Sep 10
ELRA-S0308 Egyptian Arabic Speecon database
The Egyptian Arabic Speecon database comprises the recordings of 550 adult Egyptian speakers and 50 child Egyptian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-W0054 Persian 1984 corpus (Multext-East framework)
This corpus contains the Persian (Farsi) translation of a part of the novel “1984” (G. Orwell) annotated in the Multext-East framework (Multilingual Text Tools and Corpora for Eastern and Central European Languages). The corpus contains approximately 100,000 words (6,604 sentences, 13,247 lemmas), with extensive headers and markup for document structure, sentences, and various sub-sentence annotations in the XML-format following the TEI guidelines. Annotation includes POS (part-of-speech) and lemmas.
ELRA-L0086 Persian Multext-East framework lexicon
This is a Persian (Farsi) morphosyntactic lexicon derived from the Persian 1984 corpus (Multext-East framework) (see ELRA-W0054). It contains the full inflectional paradigms of a superset of lemmas that appear in the Persian 1984 corpus. Each entry gives the word-form, its lemma and morphosyntactic description. The lexicon contains 13,247 entries.
ELRA-L0087 Persian lexicon
This is a Persian (Farsi) lexicon of more than 40,000 entries of non-inflected forms of words. Each word is transliterated based on the proposed framework from MBROLA (Text-To-Speech synthesizer). The database includes a large variety of descriptors for each entry (plural, homograph, …). The lexicon is provided in a MS Access database.
June 10
ELRA-T0374 Terminology database of natural sciences
This dictionary covers the three kingdoms: Animal, Vegetal, Mineral. It contains 50,000 species with numerous synonyms in French, English and Latin and many breeds and varieties. Minerals are given with their chemical formula. About 7,900 definitions in French are included. It also includes synonyms and linguistic variants.
ELRA-W0053 Catalan-Spanish Parallel Corpus
This corpus contains more than 100 million words drawn from 10 years of bilingual articles from “El Periódico de Catalunya”. The data are aligned at sentence level and stored in text files, one sentence per line. The data are provided in plain text, with no encoding whatsoever.
Please note that the content and price of the following LRs have been updated:
ELRA-T0102 Terminology database of expressions
ELRA-T0103 Terminology database of finance
ELRA-T0367 Terminology database of telecommunication
Apr 10
ELRA-S0307 BABEL Polish database
The BABEL Polish Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304). It consists of the basic “common” set which contains the Many Talker Set (30 males, 30 females), the Few Talker Set (5 males, 5 females), and the Very Few Talker Set (1 male, 1 female).
March 10
ELRA-S0305 EPAC Corpus: orthographic transcriptions
This corpus consists of approx. 100 hours of manual orthographic transcriptions, which were produced from 1,677 hours of non-transcribed recordings from the ESTER Evaluation Campaign (Technolangue programme). The corpus also includes automatic transcriptions of the full 1,677 hours.
Sep 09
ELRA-S0301 Norwegian EUROM1 (EUROM1_N)
EUROM1 is the first truly multilingual speech database produced in Europe. Over 60 speakers per language pronounced numbers, sentences and isolated words using a close-talking microphone.
ELRA-S0302 TC-STAR female baseline voice: Laura
Laura contains the recordings of one female English (British) speaker recorded in a noise-reduced room through a headset microphone. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems).
ELRA-S0303 TC-STAR male baseline voice: Ian
Ian contains the recordings of one male English (British) speaker recorded in a noise-reduced room through a headset microphone. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems).
ELRA-S0304 SpeechDat(M) Italian Mobile Network Speech Database
This speech database contains the recordings of 342 Italian speakers recorded over the Italian mobile telephone network. Each speaker uttered around 40 read and spontaneous items.
Aug 09
ELRA-T0373 BioLexicon
BioLexicon is a large-scale English terminological resource which has been developed to address the needs emerging in text mining efforts in the biomedical domain. It contains over 2.2M lexical entries (over 3.3M semantic relations), and information on over 1.8M variants and on over 2M synonymy relations. BioLexicon is available in a relational database format (MySQL dump format) and it adheres to the EAGLES/ISO standards for lexical resources.
July 09
ELRA-T0372 Multilingual Dictionary of Sports
This dictionary was produced within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. The results are presented in the form of MS ACCESS databases. The EuRADic sport dictionary is provided under the following different subsets:
ELRA-T0372-01 English-French-Greek-Arabic-German-Spanish-Portuguese multilingual database
ELRA-T0372-02 English-French bilingual database
ELRA-T0372-03 English-French-Greek trilingual database
ELRA-T0372-04 English-French-Arabic trilingual database
ELRA-T0372-05 English-French-German trilingual database
ELRA-T0372-06 English-French-Spanish trilingual database
ELRA-T0372-07 English-French-Portuguese trilingual database
ELRA-M0042 ItalWordNet (Italian WordNet)
ItalWordNet (Italian WordNet) is an updated version of the EuroWordNet Italian database.
ELRA-W0051 Persian-English parallel Corpus
The corpus consists of about 3,500,000 English and Persian words aligned at sentence level (about 100,000 sentences). The format of the files is Unicode.
ELRA-E0034 EASy Evaluation Package
The EASy Evaluation Package was produced within the French national project EASy (Evaluation of syntactic parsers of French), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT).
Jun 09
ELRA-W0050 The CINTIL Corpus – International Corpus of Portuguese
CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portuguese. It is composed of one million annotated tokens, each of which was verified by human expert annotators.
May 09
ELRA-M0048 LatinWordNet
LatinWordNet contains information about the following aspects of the Latin and English lexicon: lexical relations between words, semantic relations between lexical concepts, correspondences between Latin and English lexical concepts.
ELRA-M0049 Basque WordNet
The Basque WordNet models nouns, verbs and adjectives. Each sense is linked to a so-called synset (for a total of 30,281 Synsets). Every synset encodes the synonymy relation between (possibly) several words (synonyms), having a unique meaning, belonging to one and the same part of speech (specified in the POS tag value), and expressing the same lexical meaning.
ELRA-M0050 The MWN.PT – MultiWordnet of Portuguese
MWN.PT – MultiWordnet of Portuguese (version 1) spans over 17,200 manually validated concepts/synsets, linked under the semantic relations of hyponymy and hypernymy. These concepts are made of over 21,000 word senses/word forms and 16,000 lemmas from both European and American variants of Portuguese.
ELRA-S0300 SIGNUM Database
The SIGNUM Database contains both isolated and continuous utterances of various signers. The corpus was recorded on video. For quick random access to individual frames, each video clip is stored as a sequence of images.
Apr 09
ELRA-S0299 Alcohol Language Corpus (BAS ALC)
ALC contains recordings of 88 German speakers that are either intoxicated or sober. The type of speech ranges from read single digits to full conversation style.
March 09
ELRA-W0049 “Le Monde Diplomatique” Arabic tagged corpus
This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04). Each text is associated with 3 files: raw text in Arabic, vowelised text in Arabic, and one XML file containing the morphological annotation of the text.
ELRA-E0033 CHIL 2007 Evaluation Package
The CHIL 2007 Evaluation Package consists of the following contents:
- A set of audiovisual recordings of interactive seminars. The number of people present in the recording was fixed to be between 3 and 7. The recordings were done between June and September 2006 according to the “CHIL Room Setup” specification.
- Video annotations.
- Orthographic transcriptions.
ELRA-S0297 Hungarian Speecon database
The Hungarian Speecon database comprises the recordings of 555 adult Hungarian speakers and 50 child Hungarian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0298 Czech Speecon database
The Czech Speecon database comprises the recordings of 550 adult Czech speakers and 50 child Czech speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
Feb 09
ELRA-S0296 FBK-Irst database of isolated meeting-room acoustic events
This database has been produced within the CHIL Project (Computers in the Human Interaction Loop), in the framework of an Integrated Project (IP 506909) under the European Commission’s Sixth Framework Programme.
Dec 08
ELRA-M0047 Czech WordNet
The Czech WordNet captures nouns, verbs, adjectives, and partly adverbs, and contains 28,201 word senses (synsets).
ELRA-S0294 CHIEDE Corpus: a spontaneous child language corpus of Spanish
The spontaneous child language corpus, CHIEDE, consists of 58,163 words, in 30 texts, with 7 hours and 53 minutes of recordings and 59 child participants.
ELRA-S0295 LILA Korean database
The LILA Korean database comprises 1,000 Korean speakers (500 males and 500 females) recorded over the Korean mobile telephone network.
Nov 08
ELRA-S0283 Laboratory Conditions Czech Audio-Visual Speech Corpus (UWB-05-LCAVC)
This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems. The corpus consists of about 25 hours of audio-visual records of 65 speakers in laboratory conditions.
ELRA-S0284 Czech Audio-Visual Speech Corpus for Recognition with Impaired Conditions (UWB-07-ICAVR I)
This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems collected with impaired illumination conditions. The corpus consists of about 20 hours of audio-visual records of 50 speakers in laboratory conditions.
ELRA-S0285 Czech Sign Language Corpus for Recognition – Amateur Signer (UWB-06-SLR-A)
This is an amateur sign-language database comprising 25 signs from Czech sign language. 15 signers (4 women and 11 men) carried out 5 repetitions of each sign and were recorded from 3 different views.
ELRA-S0286 Czech Sign Language Corpus for Recognition – Professional Signer (UWB-07-SLR-P)
This database comprises 378 signs from Czech sign language as performed by 4 everyday sign-language users (4 women, 2 of them deaf).
ELRA-E0017 CHIL 2006 Evaluation Package
The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close ups of the speaker, close talking and far-field microphone data of the speaker’s voice and background sounds.
ELRA-S0292 Danish EUROM1 (EUROM1_D)
EUROM1 is the first truly multilingual speech database produced in Europe. Over 60 speakers per language pronounced numbers, sentences and isolated words using a close-talking microphone.
ELRA-S0293 The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication
The database contains 8,099 English utterances pronounced by non-native speakers (31 French, 20 Greek, 20 Italian, and 10 Spanish speakers). The collected utterances correspond to human input in a command and control aeronautics application. The data was recorded in a studio with a close-talking microphone, and real noise recorded in an airplane cockpit was artificially added to the data. The signals are provided in clean (studio recordings with close-talking microphone), low, mid and high noise conditions. The three noise levels correspond approximately to signal-to-noise ratios of 10 dB, 5 dB and -5 dB respectively.
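As an illustration of what adding noise at a given signal-to-noise ratio involves, here is a minimal Python sketch. It is not the procedure used to build the HIWIRE data: the function names, the toy sine-wave "speech" and the synthetic noise are all assumptions made for the example. The noise is scaled so that the ratio of clean power to noise power matches the requested value in dB, then added sample by sample.

```python
import math

def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then add it."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    # SNR(dB) = 10 * log10(P_clean / P_noise)  =>  P_noise = P_clean / 10^(SNR/10)
    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [c + gain * n for c, n in zip(clean, noise)]

def measured_snr_db(clean, noisy):
    """SNR in dB of `noisy` relative to the known clean signal."""
    residual = [y - c for c, y in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(residual))

# Toy stand-ins (hypothetical): a sine wave as "speech", a deterministic
# pseudo-random sequence as "cockpit noise".
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [((t * 7919) % 101 - 50) / 50 for t in range(1600)]

for snr in (10, 5, -5):  # the three approximate levels quoted above
    noisy = mix_at_snr(clean, noise, snr)
    print(snr, round(measured_snr_db(clean, noisy), 2))
```

A negative value such as -5 dB simply means the scaled noise carries more power than the speech.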
Oct 08
ELRA-S0287 Cantonese Speecon Database
The Cantonese Speecon database comprises the recordings of 550 adult Cantonese speakers and 50 child Cantonese speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0288 Thai Speecon Database
The Thai Speecon database comprises the recordings of 552 adult Thai speakers and 50 child Thai speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0289 OrienTel Jordan MCA (Modern Colloquial Arabic) database
This speech database contains the recordings of 757 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0290 OrienTel Jordan MSA (Modern Standard Arabic) database
This speech database contains the recordings of 556 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0291 OrienTel English as spoken in Jordan database
This speech database contains the recordings of 578 Jordanian speakers of English recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.
ELRA-W0048 TUNA Corpus
The TUNA Corpus of Referring Expressions is built with contributions from 50 native or fluent speakers of English and contains about 2,000 descriptions (referring expressions). Participants described objects (targets) in visual domains by typing and submitting referring expressions that distinguished them from other objects that were shown simultaneously (distractors). Each description is annotated with semantic information.
Sep 08
ELRA-S0281 LILA Hindi-L1 database
The LILA Hindi-L1 database comprises 2,030 Hindi speakers (1,012 males and 1,018 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered around 60 read and spontaneous items.
ELRA-S0282-01 BAS PHATT 1.0.X (sub-set)
The Ph@ttSessionz speech database contains recordings of 864 adolescent speakers of German (age range 12-20). The recordings were performed via the WWW in public schools (Gymnasium) in 41 locations in Germany. Recordings were done with SpeechRecorder in selected schools in the years 2005-2007. Both channels, the headset and the desktop microphone, were recorded in high quality. The BAS PHATT corpus is available in two versions: BAS PHATT 1.0.X (sub-set, ELRA-S0282-01) and BAS PHATT 1.1.X (complete corpus, ELRA-S0282-02). BAS PHATT 1.0.X contains 41 items.
ELRA-S0282-02 BAS PHATT 1.1.X (complete corpus)
The complete version of the Ph@ttSessionz speech database described under ELRA-S0282-01.
July 08
ELRA-S0276 Swedish EUROM1 (EUROM1_S)
EUROM1 is the first truly multilingual speech database produced in Europe. Over 60 speakers per language pronounced numbers, sentences and isolated words using a close-talking microphone.
ELRA-S0277 SpeechDat Galician Database for the Fixed Telephone Network
The SpeechDat Galician Database for the Fixed Telephone Network contains the recordings of 653 speakers of Galician recorded over the fixed telephone network. Each speaker uttered around 44 read and spontaneous items.
ELRA-S0278 SmartWeb Handheld Corpus (SHC)
This corpus contains recordings spoken by 156 speakers in a human-machine query situation. Users were asked to solve several tasks with a spoken query system to the WWW using a smart phone as portable device in natural environments (office, hall, restaurant, street). Recorded channels are the Bluetooth headset over UMTS (telephone quality), the Bluetooth headset and an additional collar microphone in high quality.
See also ELRA-S0279 and ELRA-S0280.
ELRA-S0279 SmartWeb Motorbike Corpus (SMC)
This corpus contains recordings spoken by 36 speakers in a human-machine query situation on a running motorcycle (BMW). Bikers were asked to solve several tasks with a spoken query system to the WWW using an integrated system connected to a speech server via a UMTS connection. Recorded channels are the Bluetooth helmet microphone over UMTS (telephone quality), and – partly – the Bluetooth helmet microphone and an additional neck microphone in high quality. See also ELRA-S0278 and ELRA-S0280.
ELRA-S0280 SmartWeb Video Corpus (SVC)
This multimodal corpus contains 99 recordings, each containing a human-human-machine dialogue: one speaker (who is being recorded) interacts with a human partner as well as with a dialogue system via a smart phone (SmartWeb system). See also ELRA-S0278 and ELRA-S0279.
Jun 08
Update – ELRA-S0242 SALA II US English database
The SALA II US English database comprises 4,090 US English speakers (2,017 males, 2,073 females, including some speakers with Hispanic accents) recorded over the United States mobile telephone network.
ELRA-L0085 euLEX (Lexical Database for Basque)
euLEX is a general lexicon which contains 115,000 entries, divided into 94,000 dictionary entries or lemmas, 12,000 allomorphs, 7,500 verb forms and about 1,200 dependent morphemes. All entries include linguistic information such as morphology and usage. The lexicon is in XML.
Apr 08
ELRA-S0273 LC-STAR Slovenian Phonetic lexicon
The LC-STAR Slovenian Phonetic lexicon comprises 110,900 entries, including a set of 64,521 common words, a set of 45,012 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 5,491 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0274 LC-STAR English-Slovenian Bilingual Aligned Phrasal lexicon
The LC-STAR English-Slovenian Bilingual Aligned Phrasal lexicon comprises 12,722 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English 10,522 phrase corpus. The lexicon is provided in XML format.
March 08
ELRA-S0272 MEDIA speech database for French
The MEDIA speech database for French was produced by ELDA within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It contains 1,258 transcribed dialogues from 250 adult speakers. The method chosen for the corpus construction process is that of a ‘Wizard of Oz’ (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of tourism and hotel reservation. The semantic annotation of the corpus is available in this catalogue and referenced ELRA-E0024 (MEDIA Evaluation Package).
Feb 08
ELRA-S0269 LC-STAR Greek Phonetic lexicon
The LC-STAR Greek Phonetic lexicon comprises 110,708 entries, including a set of 57,519 common words, a set of 45,162 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,027 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0270 LC-STAR Italian Phonetic lexicon
The LC-STAR Italian Phonetic lexicon comprises 109,712 entries, including a set of 56,420 common words, a set of 45,253 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,039 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0271 LC-STAR English-Italian Bilingual Aligned Phrasal lexicon
The LC-STAR English-Italian Bilingual Aligned Phrasal lexicon comprises 10,466 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English corpus of 10,524 phrases. The lexicon is provided in XML format.
Jan 08
ELRA-S0268 UPC-TALP database of isolated meeting-room acoustic events
This database has been produced within the CHIL Project (Computers in the Human Interaction Loop), in the framework of an Integrated Project (IP 506909) under the European Commission’s Sixth Framework Programme. It contains a set of isolated acoustic events that occur in a meeting room environment and that were recorded for the CHIL Acoustic Event Detection (AED) task. The database can be used as training material for AED technologies as well as for testing AED algorithms in quiet environments without temporal sound overlapping. Approximately 60 sounds per sound class were recorded. Ten people (5 men and 5 women) participated in three sessions. During each session, each person had to produce a complete set of sounds twice.
Dec 07
ELRA-S0244 Japanese Speecon database
The Japanese Speecon database comprises the recordings of 556 adult Japanese speakers and 51 child Japanese speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0265 Dutch from Belgium Speecon Database
The Dutch from Belgium Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0266 Dutch from the Netherlands Speecon Database
The Dutch from the Netherlands Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0267 Danish Speecon Database
The Danish Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0258 Orientel United Arab Emirates MCA (Modern Colloquial Arabic)
This speech database contains the recordings of 750 Arabic speakers recorded over the United Arab Emirates’ fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0259 Orientel United Arab Emirates MSA (Modern Standard Arabic)
This speech database contains the recordings of 500 Arabic speakers recorded over the United Arab Emirates’ fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
ELRA-S0260 Orientel English as spoken in the United Arab Emirates
This speech database contains the recordings of 500 speakers of English recorded over the United Arab Emirates’ fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.
ELRA-S0261 Hungarian SpeechDat(E) Database
This speech database contains the recordings of 1,000 Hungarian speakers recorded over the Hungarian fixed telephone network. Each speaker uttered around 50 read and spontaneous items.
ELRA-S0262 SALA II Portuguese from Brazil database
The SALA II Portuguese from Brazil database comprises 1,000 Brazilian speakers recorded over the Brazilian mobile telephone network.
ELRA-S0263 SALA II Spanish from Colombia Database
The SALA II Spanish from Colombia database comprises 1,000 Colombian speakers recorded over the Colombian mobile telephone network.
ELRA-S0264 SALA II US Spanish West
The SALA II US Spanish West database comprises 1,000 Spanish speakers recorded over the American mobile telephone network.
ELRA-S0255 LC-STAR Finnish Phonetic lexicon
The LC-STAR Finnish Phonetic lexicon comprises 189,409 entries, including a set of 144,233 common words, a set of 45,176 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 13,068 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0256 LC-STAR Mandarin Chinese Phonetic lexicon
The LC-STAR Mandarin Chinese Phonetic lexicon comprises 104,368 entries, including a set of 38,098 common words, a set of 57,528 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,522 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0257 LC-STAR English-Finnish Bilingual Aligned Phrasal lexicon
The LC-STAR English-Finnish Bilingual Aligned Phrasal lexicon comprises 10,520 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English corpus of 10,518 phrases. The lexicon is provided in XML format.
Nov 07
ELRA-S0249 TC-STAR English Training Corpora for ASR: Transcriptions of EPPS Speech
This corpus consists of transcriptions from 92 hours of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English (a mixture of native and non-native English). The transcription files are stored in Transcriber XML file format. For corresponding recordings, see ELRA-S0251.
ELRA-S0251 TC-STAR English Training Corpora for ASR: Recordings of EPPS Speech
This corpus consists of around 290 hours of recordings of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English, 92 hours of which were annotated (transcribed); the transcriptions are not provided in the present package. Each file contains a single channel with 16-bit resolution at a sample rate of 16 kHz. For corresponding transcriptions, see ELRA-S0249.
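To give a rough sense of scale, the audio parameters quoted above (single channel, 16-bit, 16 kHz) allow a back-of-the-envelope size estimate. The sketch below assumes uncompressed PCM with no file headers, which is an assumption for illustration only; the actual file format and packaging of the corpus are not stated here, and the function name is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, as quoted in the description
BYTES_PER_SAMPLE = 2      # 16-bit resolution
CHANNELS = 1              # single channel

def raw_pcm_bytes(hours):
    """Size in bytes of `hours` of uncompressed, headerless PCM audio."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)

size = raw_pcm_bytes(290)  # "around 290 hours"
print(f"{size / 1e9:.1f} GB")
```

Under these assumptions one hour of audio takes 115.2 MB, so around 290 hours comes to roughly 33 GB.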
ELRA-S0252 TC-STAR Spanish Training Corpora for ASR: Recordings of EPPS Speech
This corpus consists of around 283 hours of recordings of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European Spanish (a mixture of native and non-native Spanish). Each file contains a single channel with 16-bit resolution at a sample rate of 16 kHz.
ELRA-S0253 TC-STAR English Test Corpora for ASR
This corpus consists of 70 hours of recordings of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English and other European languages. From this corpus, 16 hours of English speeches (native or non-native) were annotated (transcribed). Each speech file contains a single channel with 16-bit resolution at a sample rate of 16 kHz. The transcription files are stored in Transcriber XML file format.
ELRA-S0254 TC-STAR Spanish Test Corpora for ASR
This corpus consists of 174 hours of recordings of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European Spanish and other European languages. From this corpus, 16 hours of Spanish speeches were annotated (transcribed). Each audio file contains a single channel with 16-bit resolution at a sample rate of 16 kHz. The transcription files are stored in Transcriber XML file format.
ELRA-S0250 TC-STAR English-Spanish Training Corpora for Machine Translation: Aligned Final Text Editions of EPPS
This corpus consists of respectively 34 million (English) and 38 million (Spanish) running words of bilingual sentence-segmented and aligned texts in English and Spanish obtained from the Final Text Editions provided by the European Parliament (from April 1996 to Sept. 2004, Dec. 2004 to May 2005, and Dec. 2005 to May 2006). The data is accompanied by tools for further preprocessing.
ELRA-S0245 LC-STAR German Phonetic lexicon
The LC-STAR German Phonetic lexicon comprises 102,169 entries, including a set of 55,507 common words, a set of 46,662 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 6,763 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0246 LC-STAR German Phonetic lexicon in the Touristic Domain
The LC-STAR German Phonetic lexicon in the Touristic Domain comprises 8,782 entries from the following categories: nouns, adjectives and verbs. For each entry the following information is provided: orthographic form, part-of-speech (POS), phonemic transcription. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0247 LC-STAR Standard Arabic Phonetic lexicon
The LC-STAR Standard Arabic Phonetic lexicon comprises 110,271 entries, including a set of 52,981 common words, a set of 50,135 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,155 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0248 LC-STAR English-German Bilingual Aligned Phrasal lexicon
The LC-STAR English-German Bilingual Aligned Phrasal lexicon comprises 10,733 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English corpus of 10,518 phrases. The lexicon is provided in XML format.
Oct 07
ELRA-L0084 Macedonian Morphological Lexicon (MACPLEX)
MACPLEX comprises two dictionaries: a dictionary of lemmas (over 80,000 entries) and a dictionary of word forms (over 1,300,000 entries). Morphological information (PoS, gender, case, definiteness, number for nouns, tense, person, etc. for verbs) is available for each entry. Out of the more than 1,300,000 word forms, there are 345,350 nouns, 467,744 adjectives, 500,220 verbs and 19,472 adverbs. The remaining entries correspond to pronouns, adpositions, conjunctions and numerals. The lexicon is available in Unicode.
ELRA-S0242 SALA II US English database
The SALA II US English database comprises 3,065 US English speakers (1,515 males, 1,550 females, including some speakers with Hispanic accents) recorded over the United States mobile telephone network.
ELRA-S0243 SpeechDat Catalan FDB database
The SpeechDat Catalan FDB database contains the recordings of 1,005 Catalan speakers (474 males, 531 females) recorded over the Spanish fixed telephone network.
AURORA-CD0005 AURORA-5
The AURORA-5 database has been developed mainly to investigate how hands-free speech input in noisy room environments affects the performance of automatic speech recognition. Furthermore, two test conditions are included to study the influence of transmitting the speech in a mobile communication system.
It contains artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8,000 Hz, as well as a set of scripts for running recognition experiments on those speech data. The experiments are based on the freely available software package HTK, which is not part of this resource.
TC-STAR Evaluation Packages
The Evaluation Packages below include the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) and Spoken Language Translation (SLT) third evaluation campaign, as well as the material used for the TC-STAR 2006 and 2007 End-to-End task. They include resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
ELRA-E0025 TC-STAR 2007 Evaluation Package – ASR English
ELRA-E0026-01 TC-STAR 2007 Evaluation Package – ASR Spanish – CORTES
ELRA-E0026-02 TC-STAR 2007 Evaluation Package – ASR Spanish – EPPS
ELRA-E0027 TC-STAR 2007 Evaluation Package – ASR Mandarin Chinese
ELRA-E0028 TC-STAR 2007 Evaluation Package – SLT English-to-Spanish
ELRA-E0029-01 TC-STAR 2007 Evaluation Package – SLT Spanish-to-English – CORTES
ELRA-E0029-02 TC-STAR 2007 Evaluation Package – SLT Spanish-to-English – EPPS
ELRA-E0030 TC-STAR 2007 Evaluation Package – SLT Chinese-to-English
ELRA-E0031 TC-STAR 2006 Evaluation Package – End-to-End
ELRA-E0032 TC-STAR 2007 Evaluation Package – End-to-End
August 07
Update – ELRA-W0036-02 “Le Monde Diplomatique” Text corpus in French – archives from 1999
Electronic archiving of “Le Monde Diplomatique” articles in French from 1999. The corpus is available in HTML. Each HTML file contains one article.
ELRA-L0076 Polderland Dutch Lexicon of Abbreviations and Acronyms
The lexicon contains 2,180 Dutch abbreviations and acronyms. It complies with the official Dutch Spelling (2005/6). Each entry consists of an ID, word form, lemma and part of speech.
ELRA-L0077 Polderland Dutch General Lexicon
The lexicon contains 400,463 Dutch words, comprising 236,369 nouns, 90,882 adjectives, 69,744 verbs, 2,120 adverbs, and 1,348 items from other categories (pronouns, determiners, articles, adpositions, conjunctions, numerals, etc.). It complies with the official Dutch Spelling (2005/6). The lexicon contains an ID, word form, lemma and part of speech.
ELRA-L0078 Polderland Dutch Lexicon of Names
The lexicon contains 24,247 Dutch proper names. Various sorts of proper names are included, such as first names, last names, geographical names etc. Each entry contains an ID, word form, lemma, part of speech and proper name type.
ELRA-L0079 Polderland Dutch Lexicon of Business Terminology
The lexicon contains 15,987 Dutch words from the business domain, comprising 13,774 nouns, 1,267 adjectives, 895 verbs, 9 adverbs, and 42 items from other categories. It complies with the official Dutch Spelling (2005). Each entry contains an ID, word form and part of speech.
ELRA-L0080 Polderland Dutch Lexicon of Legal Terminology
The lexicon contains 6,207 Dutch words from the legal domain, comprising 4,781 nouns, 810 adjectives, 573 verbs, 12 adverbs and 31 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.
ELRA-L0081 Polderland Dutch Lexicon of Medical Terminology
The lexicon contains 17,115 Dutch words from the medical domain, comprising 12,638 nouns, 3,107 adjectives, 1,273 verbs, 11 adverbs and 86 items from other categories. The lexicon complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.
ELRA-L0082 Polderland Dutch Lexicon of Social Terminology
The lexicon contains 12,551 Dutch words from the social domain, comprising 9,984 nouns, 1,306 adjectives, 1,161 verbs, 56 adverbs and 44 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.
ELRA-L0083 Polderland Dutch Lexicon of Technical Terminology
The lexicon contains 9,940 Dutch words from the technical/scientific domain, comprising 8,832 nouns, 950 adjectives, 111 verbs, 2 adverbs and 45 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.
July 07
ELRA-E0018 ARCADE II Evaluation Package
The ARCADE II Evaluation Package was produced within the French national project ARCADE II (Evaluation of parallel text alignment systems), as part of the Technolangue programme. The ARCADE II project made it possible to carry out an evaluation campaign in the field of multilingual alignment.
The campaign is divided into two actions: sentence alignment and translation of named entities.
ELRA-E0019 CESART Evaluation Package
The CESART Evaluation Package was produced within the French national project CESART (Evaluation of terminology extraction tools), as part of the Technolangue programme. The CESART project made it possible to carry out a campaign for the evaluation of tools for acquiring terminological resources.
The campaign is divided into two actions: term extraction and relation extraction.
ELRA-E0020 CESTA Evaluation Package
The CESTA Evaluation Package was produced within the French national project CESTA (Evaluation of MT systems), as part of the Technolangue programme. The CESTA project made it possible to carry out a campaign for the evaluation of machine translation technologies.
The campaign is divided into two actions: evaluation on a non-restrictive vocabulary and evaluation on a specialised domain (evaluation after terminology enrichment).
ELRA-E0021 ESTER Evaluation Package
The ESTER Evaluation Package was produced within the French national project ESTER (Evaluation of Broadcast News enriched transcription systems), as part of the Technolangue programme. The ESTER project made it possible to carry out a campaign for the evaluation of Broadcast News enriched transcription systems for French.
The campaign is divided into three actions: orthographic transcription, segmentation and information extraction (named entity tracking).
For research or commercial use of this database, please refer to ELRA-S0241 ESTER Corpus.
ELRA-E0022 EQueR Evaluation Package
The EQueR Evaluation Package was produced within the French national project EQueR (Evaluation campaign for Question-Answering systems), as part of the Technolangue programme. The EQueR project made it possible to carry out a campaign for the evaluation of Question-Answering systems in French.
The campaign is divided into two actions: one generic task and one specialised task (medical domain).
ELRA-E0023 EvaSy Evaluation Package
The EvaSy Evaluation Package was produced within the French national project EvaSy (Evaluation of speech synthesis systems), as part of the Technolangue programme. The EvaSy project made it possible to carry out a campaign for the evaluation of speech synthesis systems using French text data.
The campaign is divided into three actions: evaluation of grapheme-to-phoneme conversion, evaluation of prosody, and global evaluation of the quality of speech synthesis systems.
ELRA-E0024 MEDIA Evaluation Package
The MEDIA Evaluation Package was produced within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme. The MEDIA project made it possible to carry out a campaign for the evaluation of man-machine dialogue systems for French.
The campaign is divided into two actions: an evaluation taking into account the dialogue context and an evaluation not taking into account the dialogue context.
June 07
ELRA-M0038 SCI-ANAL English-German Bilingual Dictionary
This bilingual dictionary contains 59,758 pairs of English-German terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by “;”.
See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0037.
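For readers who want to load a “;”-separated table of this kind, here is a minimal Python sketch using the standard csv module. The column order (English term, German term, part of speech), the sample rows and the function name are assumptions made for illustration; the actual layout of the dictionary file should be checked against its documentation.

```python
import csv
import io

# Hypothetical sample rows; the real column order and labels are not
# specified in the description above.
SAMPLE = "analysis;Analyse;noun\nto measure;messen;verb\n"

def read_entries(text):
    """Parse ';'-separated lines into dictionaries, skipping short rows."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [
        {"english": row[0].strip(), "german": row[1].strip(), "pos": row[2].strip()}
        for row in reader
        if len(row) >= 3
    ]

entries = read_entries(SAMPLE)
for entry in entries:
    print(entry["english"], "->", entry["german"], f"({entry['pos']})")
```

For a file on disk, the same reader can be given a file object opened with the encoding the resource actually uses.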
Update – ELRA-M0037 SCI-ANES English-Spanish Bilingual Dictionary
This bilingual dictionary contains around 60,000 pairs of English-Spanish terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by “;”.
See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0038.
ELRA-S0240 French-Canadian Speecon database
The French-Canadian Speecon database comprises the recordings of 550 adult French-Canadian speakers and 50 child French-Canadian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
May 07
ELRA-W0047 Catalan Corpus of News Articles
The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. The articles are grouped by three-month period, with no chronological order within each group.
ELRA-L0075 Bulgarian Linguistic Database
This database contains 81,647 entries in Bulgarian with a linguistic environment tool (for Windows XP). The data may be used for morphological analysis and synthesis, syntactic agreement checking, and phonetic stress determination.
ELRA-S0238 MIST Multi-lingual Interoperability in Speech Technology database
The MIST Multi-lingual Interoperability in Speech Technology database comprises the recordings of 74 native Dutch speakers (52 males, 22 females) who uttered 10 sentences each in Dutch, English, French and German: per language, 5 sentences identical for all speakers and 5 sentences unique to each speaker. Dutch sentences are orthographically annotated.
ELRA-S0239 N4 (NATO Native and Non Native) database
The N4 (NATO Native and Non Native) database comprises speech data recorded in the naval transmission training centers of four countries (Germany, the Netherlands, the United Kingdom, and Canada) during naval communication training sessions in 2000-2002. The material consists of native and non-native speakers using NATO Naval English procedure between ships, and reading from a text, “The North Wind and the Sun,” in both English and the speaker’s native language. The audio material was recorded on DAT and downsampled to 16 kHz, 16-bit, and all the audio files have been manually transcribed and annotated with speaker identities using the Transcriber tool.
Apr 07
ELRA-M0043 Russian => English MT optimized lexicon in OLIF XML
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 99,211 entries in its source language (Russian) and 134,828 entries in its target language (English). The source entries are distributed as follows: 64,487 nouns, 11,470 adjectives, 19,724 verbs, 1,762 adverbs, and 1,768 closed-class elements (interjections, special prefixes, suffixes, etc.). Nouns contain gender and number information and verbs provide details on aspect and reflexivity. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Moreover, definitions are available for 59,775 entries, as well as collocational information for 39,148 entries.
ELRA-M0044 English => Swahili Bilingual Lexicon
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 58,247 entries in English and 58,300 in Swahili. The source entries are distributed as follows: 36,046 nouns, 3,013 adjectives, 18,308 verbs and 880 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 17,570 entries.
ELRA-M0045 Cebuano => English Bilingual Lexicon
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 1,988 entries in Cebuano and 1,990 in English. The source entries are distributed as follows: 1,052 nouns, 462 adjectives, 405 verbs and 69 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 500 entries.
ELRA-M0046 English => Czech Bilingual Lexicon
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 31,718 entries in English and 32,125 in Czech. The source entries are distributed as follows: 17,797 nouns, 7,748 adjectives, 6,039 verbs and 134 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 3,065 entries.
Update – ELRA-S0226-01 IDIOLOGOS 1 “Bootstrap” (NEOLOGOS Project)
It contains the recordings of 1,000 French adult speakers (470 males and 530 females) recorded over the French fixed telephone network. The speakers uttered 45 phonetically rich sentences. The 45 sentences were the same for all speakers.
Update – ELRA-S0226-02 IDIOLOGOS 2 “Eigenspeakers” (NEOLOGOS Project)
It contains the recordings of 200 French adult speakers (97 males and 103 females) recorded over the French fixed telephone network. The speakers uttered 45 sentences per call with 10 calls per speaker. The 450 sentences per speaker are common to all speakers. Speakers were selected from the IDIOLOGOS 1 “Bootstrap” database.
ELRA-S0275 Slovenian BNSI Broadcast News Speech Corpus
This speech database consists of TV news shows (both the evening news, “TV Dnevnik”, and the late night news, “Odmevi”) from the archive of the Slovenian national broadcaster RTV Slovenia. The recordings took place between June 1999 and May 2003. The database comprises a total of 36 hours of recordings, transcribed and manually checked using the Transcriber tool. 1,565 speakers were recorded (1,069 males, 477 females, 19 unspecified).
March 07
Update – ELRA-W0015 Text corpus of “Le Monde”
Corpus from the “Le Monde” newspaper. Years 1987 to 2002 are available in ASCII text format. Years 2003 to 2006 are available in XML format. Each month consists of some 10 MB of data (circa 120 MB per year).
ELRA-S0235 LC-STAR Hebrew (Israel) phonetic lexicon
The LC-STAR Hebrew (Israel) phonetic lexicon comprises 109,580 words, including a set of 62,431 common words, a set of 47,149 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,677 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
ELRA-S0236 LC-STAR English-Hebrew (Israel) Bilingual Aligned Phrasal lexicon
The LC-STAR English-Hebrew (Israel) Bilingual Aligned Phrasal lexicon comprises 10,520 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English phrasal corpus of 10,449 phrases. The lexicon is provided in XML format.
ELRA-S0237 LC-STAR US English phonetic lexicon
The LC-STAR US English phonetic lexicon comprises 102,310 words, including a set of 51,119 common words, a set of 51,111 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 6,807 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
Feb 07
ELRA-S0234 SALA Spanish Chilean Database
The SALA Spanish Chilean Database comprises 1,024 Chilean speakers (477 males, 547 females) recorded over the Chilean fixed telephone network.
ELRA-S0232 Swiss-German Speecon database
The Swiss-German Speecon database comprises the recordings of 550 adult Swiss-German speakers and 50 child Swiss-German speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0233 US English Speecon database
The US English Speecon database comprises the recordings of 550 adult US English speakers and 50 child US English speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
ELRA-S0157 NetDC Arabic BNSC (Broadcast News Speech Corpus)
The NetDC Arabic BNSC (Broadcast News Speech Corpus) is a corpus developed by ELDA in the framework of the European-funded project Network of Data Centres (NetDC). The project was carried out in collaboration with the LDC (Linguistic Data Consortium), which has produced a similar corpus from the news broadcast by Voice of America Arabic in the United States. The database contains ca. 22.5 hours of broadcast news speech recorded from Radio Orient (France) during a 3-month period.
ELRA-S0229 LC-STAR Turkish lexicon
The LC-STAR Turkish lexicon comprises 104,513 words, including a set of 59,213 common words and a set of 45,300 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
ELRA-S0230 LC-STAR Russian lexicon
The LC-STAR Russian lexicon comprises about 128,000 words, including a set of 77,154 common words and a set of 51,074 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
ELRA-S0231 LC-STAR English-Russian Bilingual Aligned Phrasal lexicon
The LC-STAR English-Russian Bilingual Aligned Phrasal lexicon comprises 10,519 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English phrasal corpus of 10,000 phrases. The lexicon is provided in XML format.
Update – ELRA-S0207 LC-STAR Catalan phonetic lexicon
The LC-STAR Catalan phonetic lexicon comprises more than 100,000 words, including a set of more than 45,000 common words and a set of more than 45,000 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
Update – ELRA-S0208 LC-STAR Spanish phonetic lexicon
The LC-STAR Spanish phonetic lexicon comprises more than 100,000 words, including a set of more than 45,000 common words and a set of more than 45,000 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
Jan 07
ELRA and Beijing Haitian Ruisheng Science Technology Ltd today signed a major Language Resources distribution agreement. On behalf of ELRA, ELDA will act as the distribution agency for Beijing Haitian Ruisheng Science Technology Ltd and will incorporate into the ELRA Language Resources catalogue a large number of Speech resources designed and collected to boost Speech Synthesis and Speech Recognition. The resources mainly cover Mandarin Chinese, with some coverage of Korean and Japanese.
With over 60 new resources, ELDA is strengthening its position as the leading worldwide distribution centre. With this agreement, Beijing Haitian Ruisheng Science Technology Ltd will gain more visibility, in particular on the European market.
List of available Speech Resources
List of available Written Corpora
ELRA-L0074 POLEX Polish Lexicon
The POLEX Polish Lexicon is a morphological dictionary of the Polish language. It comprises about 100,000 entries. The POLEX dictionary includes the core Polish vocabulary of general interest. It is based on a precise machine-interpretable formalism (coding system), the same for all categories (classes of speech). The dictionary entries are of the following form: BASIC_FORM+LIST_OF_STEMS+PARADIGMATIC_CODE+DISTRIBUTION_OF_STEMS
It contains more than 42,000 nouns, 12,000 verbs, 15,000 adjectives, 25,000 participles, and about 200 pronouns. A simple lemmatiser (in the form of a PROLOG prototype) is also included.
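The four-part entry form given for POLEX can be read into a small record type. The sketch below is a hypothetical illustration only: it assumes the four fields are separated by a literal "+" character, which the description suggests but does not confirm, and the sample string is a made-up placeholder rather than a real POLEX entry. The names `PolexEntry` and `parse_entry` are chosen here and are not part of the resource.

```python
from typing import NamedTuple


class PolexEntry(NamedTuple):
    """One dictionary entry, with fields named after the published entry form."""
    basic_form: str
    list_of_stems: str
    paradigmatic_code: str
    distribution_of_stems: str


def parse_entry(line: str, sep: str = "+") -> PolexEntry:
    """Split one entry line into its four fields.

    Assumption: fields are joined by `sep` ("+" by default). The actual
    delimiter and the internal layout of each field must be checked
    against the documentation shipped with the lexicon.
    """
    parts = [part.strip() for part in line.strip().split(sep)]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    return PolexEntry(*parts)


if __name__ == "__main__":
    # Placeholder values, not taken from the real dictionary.
    entry = parse_entry("BASIC+STEMS+CODE+DIST")
    print(entry.basic_form, entry.paradigmatic_code)
```

Keeping the stem list and stem distribution as raw strings avoids guessing at their internal structure, which the description does not spell out.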