Details on the Databases

Details on AURORA Databases

The sections below give the detailed description, downloadable licences and link to the ELRA catalogue (where relevant) for each of the 9 AURORA databases.

AURORA Project Database 2.0 (AURORA/CD0002)

The Aurora project is releasing a revised version of the Noisy TI digits database to follow on the work of ETSI. This CD set replaces the previous set (version 1.0 consisted of 2 CDs, while version 2.0 consists of 4 CDs).

This database is intended for the evaluation of front-end feature extraction algorithms in background noise, but may also be used more widely by speech researchers to evaluate and compare the performance of noise-robust speech recognition algorithms.

Compared to version 1.0 the changes are as follows:

The files are restored to the energy level of the original speech in the TI digits database.
One of the noise types added to the speech (babble) has been changed.
There is an additional test set in which the noises are mismatched to those used in the training set.
There is a convolutional distortion test.
There is a clean training set.
The CD-ROM will be used for the next round of ETSI Aurora standards evaluation.

Two original copies of the contract (word | pdf) must be sent to ELDA. To be valid, these contracts must be initialled and signed. The user should annex to the contract proof of having obtained the right to use the TI digits from LDC (ref. LDC93S10). This may be a signed licence agreement or proof of membership payment for 1993.

Price for research use by academic organisations: Free
Price for research use by commercial organisations: EUR 250

AURORA Project Database - Subset of SpeechDat-Car Spanish database (AURORA/CD0003-02)

ETSI DES/STQ WI007: Distributed Speech Recognition - Front-End Feature Extraction Algorithm & Compression Algorithm
ETSI DES/STQ WI008: Distributed Speech Recognition - Advanced Feature Extraction Algorithm.

This database is a subset of the Spanish SpeechDat-Car database, which was collected as part of the European Union-funded SpeechDat-Car project. It contains isolated and connected Spanish digits spoken in the following noise and driving conditions inside a car:

Quiet environment: stopped with motor running.
Town traffic and low speed on a rough road.
High noise: high speed on a good road.

Two original copies of the contract (word | pdf) must be sent to ELDA.

Price for research use by academic organisations: EUR 200
Price for research use by commercial organisations: EUR 1000

Aurora 4a

The Aurora project is now releasing a number of list files for performing training and testing on the Wall Street Journal (WSJ0) data at two sampling rates, 8 kHz and 16 kHz. The Aurora 4a database is based on WSJ0 with noise added artificially over a range of signal-to-noise ratios. It contains both clean and multicondition training sets, and 14 evaluation sets with different noise types and microphones.
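As an illustration of what adding noise at a given signal-to-noise ratio means, here is a minimal sketch in Python. It is not the tool used to build Aurora 4a: it assumes a plain mean-power definition of SNR, and the function names (`mean_power`, `mix_at_snr`) and the toy sine signals are hypothetical stand-ins for a WSJ0 utterance and a noise recording.

```python
import math

def mean_power(samples):
    """Average power (mean of squared samples) of a signal."""
    return sum(x * x for x in samples) / len(samples)

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

# Toy signals sampled at 8 kHz (assumed stand-ins, not real corpus data).
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [math.sin(2 * math.pi * 1234.5 * t / 8000 + 0.3) for t in range(8000)]
mixed = mix_at_snr(speech, noise, snr_db=10)
```

Calling `mix_at_snr` once per target SNR would give a range of noisy versions of the same utterance. A real corpus pipeline would typically also filter the signals and measure speech level with a standardised method before mixing.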

Two original copies of the contract (word | pdf) must be sent to ELDA.

Price for research use by academic organisations: Free
Price for research use by commercial organisations: EUR 1000

AURORA Project Database - Subset of SpeechDat-Car German database (AURORA/CD0003-03)

ETSI DES/STQ WI007: Distributed Speech Recognition - Front-End Feature Extraction Algorithm & Compression Algorithm
ETSI DES/STQ WI008: Distributed Speech Recognition - Advanced Feature Extraction Algorithm.

This database is a subset of the German SpeechDat-Car database, which was collected as part of the European Union-funded SpeechDat-Car project. It contains isolated and connected German digits spoken in the following noise and driving conditions inside a car:

High speed good road
Low speed rough road
Stopped with motor running
Town traffic

Two original copies of the contract (word | pdf) must be sent to ELDA.

Price for research use by academic organisations: EUR 200
Price for research use by commercial organisations: EUR 1000

Aurora 4b

An additional database has been released. It contains noisy versions of the Nov'92 WSJ0 development set.

Two original copies of the contract (word | pdf) must be sent to ELDA.

Price for research use by academic organisations: Free
Price for research use by commercial organisations: EUR 1000

AURORA Project Database - Subset of SpeechDat-Car Danish database (AURORA/CD0003-04)

ETSI DES/STQ WI007: Distributed Speech Recognition - Front-End Feature Extraction Algorithm & Compression Algorithm
ETSI DES/STQ WI008: Distributed Speech Recognition - Advanced Feature Extraction Algorithm.

This database is a subset of the Danish SpeechDat-Car database, which was collected as part of the European Union-funded SpeechDat-Car project. It contains isolated and connected Danish digits spoken in the following noise and driving conditions inside a car:

High speed good road
Low speed rough road
Stopped with motor running
Town traffic

Two original copies of the contract (word | pdf) must be sent to ELDA.

Price for research use by academic organisations: EUR 200
Price for research use by commercial organisations: EUR 1000

AURORA Project Database - Subset of SpeechDat-Car Italian database (AURORA/CD0003-05)

ETSI DES/STQ WI007: Distributed Speech Recognition - Front-End Feature Extraction Algorithm & Compression Algorithm
ETSI DES/STQ WI008: Distributed Speech Recognition - Advanced Feature Extraction Algorithm.

This subset covers the following noise and driving conditions inside a car:

High speed good road
Low speed rough road
Stopped with motor running
Town traffic

Two original copies of the contract (word | pdf) must be sent to ELDA.

Price for research use only: EUR 1000

Aurora 5

The Aurora project was originally set up to establish a worldwide standard for the feature extraction software which forms the core of the front-end of a DSR (Distributed Speech Recognition) system.

The AURORA-5 database was developed mainly to investigate how hands-free speech input in noisy room environments affects the performance of automatic speech recognition. In addition, two test conditions are included to study the effect of transmitting the speech over a mobile communication system.

The three earlier Aurora experiments focused on additive noise and the influence of some telephone frequency characteristics. Aurora-5 tries to cover all effects as they occur in realistic application scenarios, with a focus on two of them. The first is hands-free speech input in a noisy car environment, intended either to control devices in the car itself or to retrieve information from a remote speech server over the telephone. The second is hands-free speech input in an office or living room, to control, for example, a telephone device or audio/video equipment.

The AURORA-5 database contains the following data:

Artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8000 Hz. The distortions consist of:

additive background noise,
the simulation of a hands-free speech input in rooms,
the simulation of transmitting speech over cellular telephone networks.

A subset of recordings from the meeting recorder project at the International Computer Science Institute. The recordings contain sequences of digits uttered by different speakers in hands-free mode in a meeting room.
A set of scripts for running recognition experiments on the above-mentioned speech data. The experiments rely on the freely available HTK software package, which is not part of this resource.
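The simulation of hands-free speech input in rooms, listed among the distortions above, is commonly modelled as a convolution of the clean signal with a room impulse response. The sketch below illustrates that general idea only. It is an assumption that this matches the Aurora-5 processing, and the `convolve` helper and the impulse response values are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

# Hypothetical impulse response: direct path plus two weaker, delayed reflections.
room_ir = [1.0, 0.0, 0.5, 0.0, 0.25]
dry = [1.0, 2.0, 3.0]          # toy "clean" signal
wet = convolve(dry, room_ir)   # reverberated signal, len(dry) + len(room_ir) - 1 samples
```

The output is longer than the input because the reflections continue after the clean signal ends. A measured or simulated impulse response for a car, office or living room would take the place of `room_ir`.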

Further information is also available at the following address: http://aurora.hsnr.de

Two original copies of the contract (word | pdf) must be sent to ELDA. To be valid, these contracts must be initialled and signed. The user should annex to the contract proof of having obtained the right to use the TI digits from LDC (ref. LDC93S10). This may be a signed licence agreement or proof of membership payment for 1993.

Price for research use by academic organisations: Free
Price for research use by commercial organisations: EUR 250

AURORA Project Database - Subset of SpeechDat-Car Finnish database (AURORA/CD0003-01)

This database is a subset of the Finnish SpeechDat-Car database, which was collected as part of the European Union-funded SpeechDat-Car project. It contains isolated and connected Finnish digits spoken in the following driving conditions inside a car:

0 km/hr with the car engine on
40-60 km/hr with the car windows closed
40-60 km/hr with the car windows open
100-120 km/hr with no music in the background
100-120 km/hr with music in the background

The database also contains the software needed to run simulations using Entropic's HTK, which has been adopted as the "standard" HMM recogniser for the Aurora standard evaluation.

Two original copies of the contract (word | pdf) must be sent to ELDA.

Price for research use by academic organisations: EUR 200
Price for research use by commercial organisations: EUR 1000
