HindEnCorp is a parallel Hindi-English corpus, freely available for non-commercial research purposes. Version 0.5, released in 2014, contains 273 thousand sentences (about 3.8 million tokens in each language).

We also release HindMonoCorp, a Hindi-only corpus of 44 million sentences (787 million tokens).

The previous, preliminary version, HindEnCorp 0.1, was used in the shared translation task at WMT 2014. You can still obtain HindEnCorp 0.1 here.

If you use HindEnCorp or HindMonoCorp in your work, please cite the following paper:

@InProceedings{bojar2014hindencorp,
  author = {Ond{\v{r}}ej Bojar and Vojt{\v{e}}ch Diatka
            and Pavel Rychl{\'{y}} and Pavel Stra{\v{n}}{\'{a}}k
            and V{\'{\i}}t Suchomel and Ale{\v{s}} Tamchyna and Daniel Zeman},
  title = "{HindEnCorp - Hindi-English and Hindi-only Corpus for Machine
            Translation}",
  booktitle = {Proceedings of the Ninth International Conference on Language
               Resources and Evaluation (LREC'14)},
  year = {2014},
  month = {may},
  date = {26-31},
  address = {Reykjavik, Iceland},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and
     Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani
     and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-8-4},
  language = {english}
}