Nepali Corpora to Help Reconstruction in Nepal

In response to the devastating April 2015 earthquake in Nepal, ELRA would like to make Nepali Corpora available for free.

Originally available for research purposes only in the ELRA Catalogue, these Language Resources (2 Nepali Written Corpora and 1 Speech Corpus) will be provided at no cost, for not-for-profit purposes, to those working on the development of systems and applications to be used during the reconstruction phase in Nepal.

If you feel that ELRA can help in other ways, please let us know.


ELRA-W0076 Nepali Monolingual written corpus
ISLRN: 325-796-965-405-9
The Nepali Monolingual written corpus comprises the core corpus (core sample) and the general corpus. The core sample (CS) is a collection of Nepali written texts of 2,000 words each, drawn from 15 different genres and published between 1990 and 1992. It is based on the FLOB/FROWN corpora and contains 802,000 words. The general corpus (GC) consists of written texts collected opportunistically from a wide range of sources such as websites, newspapers, books, publishers and authors. It contains 1,400,000 words.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1216


ELRA-W0077 English-Nepali Parallel Corpus
ISLRN: 853-487-663-161-6
This corpus consists of a collection of national development texts in English and Nepali. A small set of data is aligned at the sentence level (27,060 English words; 21,756 Nepali words), and a larger set of texts at the document level (617,340 English words; 596,571 Nepali words). An additional set of monolingual data in Nepali is also provided (386,879 words in Nepali).
For more information, see: http://catalog.elra.info/product_info.php?products_id=1217


ELRA-S0368 Nepali Spoken Corpus
ISLRN: 688-800-566-571-0
The Nepali Spoken Corpus contains audio recordings of different social activities, made as far as possible in their natural settings, together with phonologically transcribed and annotated texts and information about the participants. A total of 17 types of activity were recorded. The total duration of the recorded material is 31 hours and 26 minutes.
For more information, see: http://catalog.elra.info/product_info.php?products_id=1219