PEA-TRAD, a French DGA-financed project, aimed to develop speech-to-speech translation technology for several language directions, covering Arabic, Chinese, English, French and Pashto. PEA-TRAD also targeted an ambitious variety of domains, including web-based text data, blogs, mail, and broadcast news.
Both technology development and evaluation involved a large data production effort, in terms of the number of parallel corpora produced as well as the size of the data produced for some of the languages. For instance, for Pashto, a very low-resourced language with no associated HLT, several large corpora (soon available from ELRA) have been produced to support both training and evaluation:
• A 100M-word monolingual text corpus was created, which involved identifying the sources, clearing IPR issues, collecting the data, validating quality and content, and formatting (Mostefa et al., 2012);
• A broadcast-news corpus of about 100 hours was also produced for Pashto, with recordings drawn from different sources so as to cover different dialects. These data have been transcribed orthographically;
• A Pashto-French parallel corpus of 2M words per language was produced: to this end, a further 100 hours were recorded and transcribed, yielding a Pashto corpus of about 2M words of transcribed speech, which was subsequently translated into French.
The full data production effort was carried out by ELDA, which put in place a very strict protocol for data collection, translation, revision and quality control, in particular for all evaluation sets.