HindEnCorp

Tags: Corpora, Data, Machine Translation, Monolingual, Multilingual

HindEnCorp is a parallel Hindi-English corpus, freely available for non-commercial research purposes. Version 0.5, released in 2014, contains 273 thousand sentences (about 3.8 million tokens in each language).

We also release HindMonoCorp, a Hindi-only corpus of 44 million sentences (787 million tokens).

Both HindEnCorp and HindMonoCorp are equipped with automatic morphological tags.
Download HindEnCorp 0.5 from the Lindat repository.
Download HindMonoCorp 0.5 from the Lindat repository.
Read all the details in the corresponding LREC paper: PDF.

The previous, preliminary version, HindEnCorp 0.1, was used in the shared translation task at WMT 2014. You can still obtain HindEnCorp 0.1 here.

If you use HindEnCorp or HindMonoCorp in your work, please cite the following paper:

@InProceedings{hindencorp05:lrec:2014,
  author = {Ond{\v{r}}ej Bojar and Vojt{\v{e}}ch Diatka
            and Pavel Rychl{\'{y}} and Pavel Stra{\v{n}}{\'{a}}k
            and V{\'{\i}}t Suchomel and Ale{\v{s}} Tamchyna and Daniel Zeman},
  title = "{HindEnCorp - Hindi-English and Hindi-only Corpus for Machine
            Translation}",
  booktitle = {Proceedings of the Ninth International Conference on Language
               Resources and Evaluation (LREC'14)},
  year = {2014},
  month = {may},
  date = {26-31},
  address = {Reykjavik, Iceland},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and
     Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani
     and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-8-4},
  language = {english}
}

