December 07 |
ELRA-S0244 Japanese Speecon database The Japanese Speecon database comprises the recordings of 556 adult Japanese speakers and 51 child Japanese speakers who uttered respectively over 290 items and 210 items (read and spontaneous). ELRA-S0265 Dutch from Belgium Speecon Database The Dutch from Belgium Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous). ELRA-S0266 Dutch from the Netherlands Speecon Database The Dutch from the Netherlands Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous). ELRA-S0267 Danish Speecon Database The Danish Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous). ELRA-S0258 Orientel United Arab Emirates MCA (Modern Colloquial Arabic) This speech database contains the recordings of 750 Arabic speakers recorded over the United Arab Emirates’ fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items. ELRA-S0259 Orientel United Arab Emirates MSA (Modern Standard Arabic) This speech database contains the recordings of 500 Arabic speakers recorded over the United Arab Emirates’ fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items. ELRA-S0260 Orientel English as spoken in the United Arab Emirates This speech database contains the recordings of 500 speakers of English recorded over the United Arab Emirates’ fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items. ELRA-S0261 Hungarian SpeechDat(E) Database This speech database contains the recordings of 1,000 Hungarian speakers recorded over the Hungarian fixed telephone network. Each speaker uttered around 50 read and spontaneous items. 
ELRA-S0262 SALA II Portuguese from Brazil database The SALA II Portuguese from Brazil database comprises 1000 Brazilian speakers recorded over the Brazilian mobile telephone network. ELRA-S0263 SALA II Spanish from Colombia Database The SALA II Spanish from Colombia database comprises 1000 Colombian speakers recorded over the Colombian mobile telephone network. ELRA-S0264 SALA II US Spanish West The SALA II US Spanish West database comprises 1000 Spanish speakers recorded over the American mobile telephone network. ELRA-S0255 LC-STAR Finnish Phonetic lexicon The LC-STAR Finnish Phonetic lexicon comprises 189,409 entries, including a set of 144,233 common words, a set of 45,176 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 13,068 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA. ELRA-S0256 LC-STAR Mandarin Chinese Phonetic lexicon The LC-STAR Mandarin Chinese Phonetic lexicon comprises 104,368 entries, including a set of 38,098 common words, a set of 57,528 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,522 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA. ELRA-S0257 LC-STAR English-Finnish Bilingual Aligned Phrasal lexicon The LC-STAR English-Finnish Bilingual Aligned Phrasal lexicon comprises 10,520 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from US-English 10,518 phrasal corpus. The lexicon is provided in XML format. |
November 07 |
ELRA-S0249 TC-STAR English Training Corpora for ASR: Transcriptions of EPPS Speech This corpus consists of transcriptions from 92 hours of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English (a mixture of native and non-native English). The transcription files are stored in Transcriber XML file format. For corresponding recordings, see ELRA-S0251. ELRA-S0251 TC-STAR English Training Corpora for ASR: Recordings of EPPS Speech This corpus consists of the recordings of around 290 hours from EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English, 92 hours of which were annotated (transcribed) (the transcriptions are not provided in the present package). Each file contains a single channel with 16-bit resolution at a sample rate of 16kHz. For corresponding transcriptions, see ELRA-S0249. ELRA-S0252 TC-STAR Spanish Training Corpora for ASR: Recordings of EPPS Speech This corpus consists of the recordings of around 283 hours from EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European Spanish (a mixture of native and non-native Spanish). Each file contains a single channel with 16-bit resolution at a sample rate of 16kHz. ELRA-S0253 TC-STAR English Test Corpora for ASR This corpus consists of 70 hours of recordings of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English and other European languages. From this corpus, 16 hours of English speeches (native or non-native) were annotated (transcribed). Each speech file contains a single channel with 16-bit resolution at a sample rate of 16kHz. The transcription files are stored in Transcriber XML file format. ELRA-S0254 TC-STAR Spanish Test Corpora for ASR This corpus consists of 174 hours of recordings of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European Spanish and other European languages. 
From this corpus, 16 hours of Spanish speeches were annotated (transcribed). Each audio file contains a single channel with 16-bit resolution at a sample rate of 16kHz. The transcription files are stored in Transcriber XML file format. ELRA-S0250 TC-STAR English-Spanish Training Corpora for Machine Translation: Aligned Final Text Editions of EPPS This corpus consists of respectively 34 million (English) and 38 million (Spanish) running words of bilingual sentence-segmented and aligned texts in English and Spanish obtained from the Final Text Editions provided by the European Parliament (from April 1996 to Sept. 2004, Dec. 2004 to May 2005, and Dec. 2005 to May 2006). The data is accompanied by tools for further preprocessing. ELRA-S0245 LC-STAR German Phonetic lexicon The LC-STAR German Phonetic lexicon comprises 102,169 entries, including a set of 55,507 common words, a set of 46,662 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 6,763 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA. ELRA-S0246 LC-STAR German Phonetic lexicon in the Touristic Domain The LC-STAR German Phonetic lexicon in the Touristic Domain comprises 8,782 entries from the following categories: nouns, adjectives and verbs. For each entry the following information is provided: orthographic form, part-of-speech (POS), phonemic transcription. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA. ELRA-S0247 LC-STAR Standard Arabic Phonetic lexicon The LC-STAR Standard Arabic Phonetic lexicon comprises 110,271 entries, including a set of 52,981 common words, a set of 50,135 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,155 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA. 
ELRA-S0248 LC-STAR English-German Bilingual Aligned Phrasal lexicon The LC-STAR English-German Bilingual Aligned Phrasal lexicon comprises 10,733 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from US-English 10,518 phrasal corpus. The lexicon is provided in XML format. |
October 07 |
ELRA-L0084 Macedonian Morphological Lexicon (MACPLEX) MACPLEX comprises two dictionaries: a dictionary of lemmas (over 80,000 entries) and a dictionary of word forms (over 1,300,000 entries). Morphological information (PoS, gender, case, definiteness, number for nouns, tense, person, etc. for verbs) is available for each entry. Out of the more than 1,300,000 word forms, there are 345,350 nouns, 467,744 adjectives, 500,220 verbs and 19,472 adverbs. The remaining entries correspond to pronouns, adpositions, conjunctions and numerals. The lexicon is available in Unicode. ELRA-S0242 SALA II US English database The SALA II US English database comprises 3,065 US English speakers (1,515 males, 1,550 females, including some speakers with Hispanic accents) recorded over the United States mobile telephone network. ELRA-S0243 SpeechDat Catalan FDB database The SpeechDat Catalan FDB database contains the recordings of 1,005 Catalan speakers (474 males, 531 females) recorded over the Spanish fixed telephone network. AURORA-CD0005 AURORA-5 The AURORA-5 database has been developed mainly to investigate how hands-free speech input in noisy room environments influences the performance of automatic speech recognition. Furthermore, two test conditions are included to study the influence of transmitting the speech in a mobile communication system. It contains artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8000 Hz, as well as a set of scripts for running recognition experiments on those speech data. The experiments are based on the freely available software package HTK; HTK itself is not part of this resource. 
TC-STAR Evaluation Packages The Evaluation Packages below include the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) and Spoken Language Translation (SLT) third evaluation campaign, as well as the material used for the TC-STAR 2006 and 2007 End-to-End task. They include resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself. ELRA-E0025 TC-STAR 2007 Evaluation Package - ASR English ELRA-E0026-01 TC-STAR 2007 Evaluation Package - ASR Spanish - CORTES ELRA-E0026-02 TC-STAR 2007 Evaluation Package - ASR Spanish - EPPS ELRA-E0027 TC-STAR 2007 Evaluation Package - ASR Mandarin Chinese ELRA-E0028 TC-STAR 2007 Evaluation Package - SLT English-to-Spanish ELRA-E0029-01 TC-STAR 2007 Evaluation Package - SLT Spanish-to-English - CORTES ELRA-E0029-02 TC-STAR 2007 Evaluation Package - SLT Spanish-to-English - EPPS ELRA-E0030 TC-STAR 2007 Evaluation Package - SLT Chinese-to-English ELRA-E0031 TC-STAR 2006 Evaluation Package – End-to-End ELRA-E0032 TC-STAR 2007 Evaluation Package – End-to-End |
September 07 |
No LR announced. |
August 07 |
Update - ELRA-W0036-02 "Le Monde Diplomatique" Text corpus in French - archives from 1999 Electronic archiving of "Le Monde Diplomatique" articles in French from 1999. The corpus is available in HTML. Each HTML file contains one article. ELRA-L0076 Polderland Dutch Lexicon of Abbreviations and Acronyms The lexicon contains 2,180 Dutch abbreviations and acronyms. It complies with the official Dutch Spelling (2005/6). Each entry consists of an ID, word form, lemma and part of speech. ELRA-L0077 Polderland Dutch General Lexicon The lexicon contains 400,463 Dutch words, comprising 236,369 nouns, 90,882 adjectives, 69,744 verbs, 2,120 adverbs, and 1,348 items from other categories (pronouns, determiners, articles, adpositions, conjunctions, numerals, etc.). It complies with the official Dutch Spelling (2005/6). The lexicon contains an ID, word form, lemma and part of speech. ELRA-L0078 Polderland Dutch Lexicon of Names The lexicon contains 24,247 Dutch proper names. Various sorts of proper names are included, such as first names, last names, geographical names etc. Each entry contains an ID, word form, lemma, part of speech and proper name type. ELRA-L0079 Polderland Dutch Lexicon of Business Terminology The lexicon contains 15,987 Dutch words from the business domain, comprising 13,774 nouns, 1,267 adjectives, 895 verbs, 9 adverbs, and 42 items from other categories. It complies with the official Dutch Spelling (2005). Each entry contains an ID, word form and part of speech. ELRA-L0080 Polderland Dutch Lexicon of Legal Terminology The lexicon contains 6,207 Dutch words from the legal domain, comprising 4,781 nouns, 810 adjectives, 573 verbs, 12 adverbs and 31 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech. 
ELRA-L0081 Polderland Dutch Lexicon of Medical Terminology The lexicon contains 17,115 Dutch words from the medical domain, comprising 12,638 nouns, 3,107 adjectives, 1,273 verbs, 11 adverbs and 86 items from other categories. The lexicon complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech. ELRA-L0082 Polderland Dutch Lexicon of Social Terminology The lexicon contains 12,551 Dutch words from the social domain, comprising 9,984 nouns, 1,306 adjectives, 1,161 verbs, 56 adverbs and 44 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech. ELRA-L0083 Polderland Dutch Lexicon of Technical Terminology The lexicon contains 9,940 Dutch words from the technical/scientific domain, comprising 8,832 nouns, 950 adjectives, 111 verbs, 2 adverbs and 45 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech. |
July 07 |
ELRA-E0018 ARCADE II Evaluation Package The ARCADE II Evaluation Package was produced within the French national project ARCADE II (Evaluation of parallel text alignment systems), as part of the Technolangue programme. The ARCADE II project made it possible to carry out an evaluation campaign in the field of multilingual alignment. The campaign is distributed over two actions: sentence alignment and translation of named entities. ELRA-E0019 CESART Evaluation Package The CESART Evaluation Package was produced within the French national project CESART (Evaluation of terminology extraction tools), as part of the Technolangue programme. The CESART project made it possible to carry out a campaign for the evaluation of terminological resources acquisition tools. The campaign is distributed over two actions: term extraction and relation extraction. ELRA-E0020 CESTA Evaluation Package The CESTA Evaluation Package was produced within the French national project CESTA (Evaluation of MT systems), as part of the Technolangue programme. The CESTA project made it possible to carry out a campaign for the evaluation of machine translation technologies. The campaign is distributed over two actions: evaluation on a non-restrictive vocabulary and evaluation on a specialised domain (evaluation after terminology enrichment). ELRA-E0021 ESTER Evaluation Package The ESTER Evaluation Package was produced within the French national project ESTER (Evaluation of Broadcast News enriched transcription systems), as part of the Technolangue programme. The ESTER project made it possible to carry out a campaign for the evaluation of Broadcast News enriched transcription systems for French. The campaign is distributed over three actions: orthographic transcription, segmentation and information extraction (named entity tracking). 
For research or commercial use of this database, please refer to ELRA-S0241 ESTER Corpus. ELRA-E0022 EQueR Evaluation Package The EQueR Evaluation Package was produced within the French national project EQueR (Evaluation campaign for Question-Answering systems), as part of the Technolangue programme. The EQueR project made it possible to carry out a campaign for the evaluation of Question-Answering systems in French. The campaign is distributed over two actions: one generic task and one specialised task (medical domain). ELRA-E0023 EvaSy Evaluation Package The EvaSy Evaluation Package was produced within the French national project EvaSy (Evaluation of speech synthesis systems), as part of the Technolangue programme. The EvaSy project made it possible to carry out a campaign for the evaluation of speech synthesis systems using French text data. The campaign is distributed over three actions: evaluation of grapheme-to-phoneme conversion, evaluation of prosody, and global evaluation of the quality of speech synthesis systems. ELRA-E0024 MEDIA Evaluation Package The MEDIA Evaluation Package was produced within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme. The MEDIA project made it possible to carry out a campaign for the evaluation of man-machine dialogue systems for French. The campaign is distributed over two actions: an evaluation taking into account the dialogue context and an evaluation not taking into account the dialogue context. |
June 07 |
ELRA-M0038 SCI-ANAL English-German Bilingual Dictionary This bilingual dictionary contains 59,758 pairs of English-German terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by ";". See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0037. Update - ELRA-M0037 SCI-ANES English-Spanish Bilingual Dictionary This bilingual dictionary contains around 60,000 pairs of English-Spanish terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by ";". See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0038. ELRA-S0240 French-Canadian Speecon database The French-Canadian Speecon database comprises the recordings of 550 adult French-Canadian speakers and 50 child French-Canadian speakers who uttered respectively over 290 items and 210 items (read and spontaneous). |
May 07 |
ELRA-W0047 Catalan Corpus of News Articles The Catalan Corpus of News Articles comprises articles in Catalan from 1 January 1999 to 31 March 2007. These articles are grouped by trimester, with no chronological order within each trimester. ELRA-L0075 Bulgarian Linguistic Database This database contains 81,647 entries in Bulgarian with a linguistic environment tool (for Windows XP). The data may be used for morphological analysis and synthesis, syntactic agreement checking, and phonetic stress determination. ELRA-S0238 MIST Multi-lingual Interoperability in Speech Technology database The MIST Multi-lingual Interoperability in Speech Technology database comprises the recordings of 74 native Dutch speakers (52 males, 22 females) who uttered 10 sentences in Dutch, English, French and German, including 5 sentences per language identical for all speakers and 5 sentences per language unique to each speaker. Dutch sentences are orthographically annotated. ELRA-S0239 N4 (NATO Native and Non Native) database The N4 (NATO Native and Non Native) database comprises speech data recorded in the naval transmission training centers of four countries (Germany, The Netherlands, United Kingdom, and Canada) during naval communication training sessions in 2000-2002. The material consists of native and non-native speakers using NATO Naval English procedure between ships, and reading from a text, "The North Wind and the Sun," in both English and the speaker’s native language. The audio material was recorded on DAT and downsampled to 16kHz-16bit, and all the audio files have been manually transcribed and annotated with speaker identities using the Transcriber tool.
|
April 07 |
ELRA-M0043 Russian => English MT optimized lexicon in OLIF XML This lexicon is provided in structured XML of OLIF (Open Lexicon Interchange Format) format. It comprises 99,211 entries in its source language (Russian) and 134,828 entries in its target language (English). The source entries are distributed as follows: 64,487 nouns, 11,470 adjectives, 19,724 verbs, 1,762 adverbs, and 1,768 closed-class elements (interjections, special prefixes, suffixes, etc.). Nouns contain gender and number information and verbs provide details on aspect and reflexivity. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Moreover, definitions are available for 59,775 entries, as well as collocational information for 39,148 entries. ELRA-M0044 English => Swahili Bilingual Lexicon This lexicon is provided in structured XML of OLIF (Open Lexicon Interchange Format) format. It comprises 58,247 entries in English and 58,300 in Swahili. The source entries are distributed as follows: 36,046 nouns, 3,013 adjectives, 18,308 verbs and 880 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 17,570 entries. ELRA-M0045 Cebuano => English Bilingual Lexicon This lexicon is provided in structured XML of OLIF (Open Lexicon Interchange Format) format. It comprises 1,988 entries in Cebuano and 1,990 in English. The source entries are distributed as follows: 1,052 nouns, 462 adjectives, 405 verbs and 69 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 500 entries. ELRA-M0046 English => Czech Bilingual Lexicon This lexicon is provided in structured XML of OLIF (Open Lexicon Interchange Format) format. 
It comprises 31,718 entries in English and 32,125 in Czech. The source entries are distributed as follows: 17,797 nouns, 7,748 adjectives, 6,039 verbs and 134 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 3,065 entries. Update - ELRA-S0226-01 IDIOLOGOS 1 “Bootstrap” (NEOLOGOS Project) It contains the recordings of 1,000 French adult speakers (470 males and 530 females) recorded over the French fixed telephone network. The speakers uttered 45 phonetically rich sentences. The 45 sentences were the same for all speakers. Update - ELRA-S0226-02 IDIOLOGOS 2 “Eigenspeakers” (NEOLOGOS Project) It contains the recordings of 200 French adult speakers (97 males and 103 females) recorded over the French fixed telephone network. The speakers uttered 45 sentences per call with 10 calls per speaker. The 450 sentences per speaker are common to all speakers. Speakers were selected from the IDIOLOGOS 1 “Bootstrap” database. ELRA-S0275 Slovenian BNSI Broadcast News Speech Corpus This speech database consists of TV news shows (both evening news, “TV Dnevnik” and late night news, “Odmevi”), from the archive of the Slovenian national broadcaster RTV Slovenia. The recordings took place between June 1999 and May 2003. The database comprises a total of 36 hours of recordings, transcribed and manually checked using the Transcriber tool. 1,565 speakers were recorded (1,069 males, 477 females, 19 unspecified). |
March 07 |
Update - ELRA-W0015 Text corpus of "Le Monde" Corpus from "Le Monde" newspaper. Years 1987 to 2002 are available in an ASCII text format. Years 2003 to 2006 are available in .XML format. Each month consists of some 10 MB of data (circa 120 MB per year). ELRA-S0235 LC-STAR Hebrew (Israel) phonetic lexicon The LC-STAR Hebrew (Israel) phonetic lexicon comprises 109,580 words, including a set of 62,431 common words, a set of 47,149 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,677 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA. ELRA-S0236 LC-STAR English-Hebrew (Israel) Bilingual Aligned Phrasal lexicon The LC-STAR English-Hebrew (Israel) Bilingual Aligned Phrasal lexicon comprises 10,520 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from US-English 10,449 phrasal corpus. The lexicon is provided in XML format. ELRA-S0237 LC-STAR US English phonetic lexicon The LC-STAR US English phonetic lexicon comprises 102,310 words, including a set of 51,119 common words, a set of 51,111 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 6,807 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA. |
February 07 |
ELRA-S0234 SALA Spanish Chilean Database The SALA Spanish Chilean Database comprises 1,024 Chilean speakers (477 males, 547 females) recorded over the Chilean fixed telephone network. ELRA-S0232 Swiss-German Speecon database The Swiss-German Speecon database comprises the recordings of 550 adult Swiss-German speakers and 50 child Swiss-German speakers who uttered respectively over 290 items and 210 items (read and spontaneous). ELRA-S0233 US English Speecon database The US English Speecon database comprises the recordings of 550 adult US English speakers and 50 child US English speakers who uttered respectively over 290 items and 210 items (read and spontaneous). ELRA-S0157 NetDC Arabic BNSC (Broadcast News Speech Corpus) The NetDC Arabic BNSC (Broadcast News Speech Corpus) is a corpus developed by ELDA in the framework of the European-funded project Network of Data Centres (NetDC). The project was done in collaboration with the LDC (Linguistic Data Consortium), which has produced a similar corpus from the news broadcast by Voice of America Arabic in the United States. The database contains ca. 22.5 hours of broadcast news speech recorded from Radio Orient (France) during a 3-month period. ELRA-S0229 LC-STAR Turkish lexicon The LC-STAR Turkish lexicon comprises 104,513 words, including a set of 59,213 common words and a set of 45,300 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format. ELRA-S0230 LC-STAR Russian lexicon The LC-STAR Russian lexicon comprises about 128,000 words, including a set of 77,154 common words and a set of 51,074 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format. 
ELRA-S0231 LC-STAR English-Russian Bilingual Aligned Phrasal lexicon The LC-STAR English-Russian Bilingual Aligned Phrasal lexicon comprises 10,519 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from US-English 10,000 phrasal corpus. The lexicon is provided in XML format.
Update – ELRA-S0207 LC-STAR Catalan phonetic lexicon The LC-STAR Catalan phonetic lexicon comprises more than 100,000 words, including a set of more than 45,000 common words and a set of more than 45,000 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
Update – ELRA-S0208 LC-STAR Spanish phonetic lexicon The LC-STAR Spanish phonetic lexicon comprises more than 100,000 words, including a set of more than 45,000 common words and a set of more than 45,000 proper names (including person names, family names, cities, streets, companies and brand names) with phonetic transcriptions in SAMPA. The lexicon is provided in XML format.
|
January 07 |
ELRA and Beijing Haitian Ruisheng Science Technology Ltd today signed a major Language Resources distribution agreement. On behalf of ELRA, ELDA will act as the distribution agency for Beijing Haitian Ruisheng Science Technology Ltd and will incorporate into the ELRA Language Resources catalogue a large number of Speech resources designed and collected to boost Speech Synthesis and Speech Recognition. The resources cover mainly Mandarin Chinese with some coverage of Korean and Japanese languages. With over 60 new resources, ELDA is strengthening its position as the leading worldwide distribution centre. With this agreement Beijing Haitian Ruisheng Science Technology Ltd will get more visibility, in particular on the European market. List of available Speech Resources List of available Written Corpora ELRA-L0074 POLEX Polish Lexicon The POLEX Polish Lexicon is a morphological dictionary of the Polish language. It comprises about 100,000 entries. The POLEX dictionary includes the core Polish vocabulary of general interest. It is based on a precise machine-interpretable formalism (coding system), the same for all categories (classes of speech). The dictionary entries are of the following form: BASIC_FORM+LIST_OF_STEMS+PARADIGMATIC_CODE+DISTRIBUTION_OF_STEMS. It contains more than 42,000 nouns, 12,000 verbs, 15,000 adjectives, 25,000 participles, and about 200 pronouns. A simple lemmatiser (in the form of a PROLOG prototype) is also included. |