HindEnCorp is a parallel Hindi-English corpus freely available for non-commercial research purposes. Version 0.5, released in 2014, contains 273 thousand sentences (about 3.8 million tokens in each language).

We also release HindMonoCorp, a Hindi-only corpus of 44 million sentences (787 million tokens).

The previous, preliminary version, HindEnCorp 0.1, was used in the shared translation task at WMT 2014. You can still obtain HindEnCorp 0.1 here.

If you use HindEnCorp or HindMonoCorp in your work, please cite the following paper:

