HindEnCorp is a parallel Hindi-English corpus freely available for non-commercial research purposes. Version 0.5, released in 2014, contains 273 thousand sentences (about 3.8 million tokens in each language).
We also release HindMonoCorp, a Hindi-only corpus of 44 million sentences (787 million tokens).
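As a rough sanity check on the sizes quoted above, the average sentence length of each corpus can be derived from the reported counts. The sketch below uses only the rounded figures given on this page, so the results are approximate; the helper name `avg_tokens_per_sentence` is illustrative and not part of any released tooling.

```python
def avg_tokens_per_sentence(tokens: float, sentences: float) -> float:
    """Return the mean sentence length in tokens."""
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    return tokens / sentences


# Rounded figures quoted in the corpus description (assumed, not exact counts).
hindencorp = avg_tokens_per_sentence(3.8e6, 273e3)    # per language side
hindmonocorp = avg_tokens_per_sentence(787e6, 44e6)

print(f"HindEnCorp 0.5: ~{hindencorp:.1f} tokens per sentence (each language)")
print(f"HindMonoCorp:   ~{hindmonocorp:.1f} tokens per sentence")
```

With these rounded inputs, the parallel corpus averages roughly 14 tokens per sentence and the monolingual corpus roughly 18.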
The preliminary version, HindEnCorp 0.1, was used in the shared translation task at WMT 2014. You can still obtain HindEnCorp 0.1 here.
If you use HindEnCorp or HindMonoCorp in your work, please cite the following paper:
@InProceedings{hindencorp05:lrec:2014,
  author    = {Ond{\v{r}}ej Bojar and Vojt{\v{e}}ch Diatka and Pavel Rychl{\'{y}} and Pavel Stra{\v{n}}{\'{a}}k and V{\'{\i}}t Suchomel and Ale{\v{s}} Tamchyna and Daniel Zeman},
  title     = "{HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation}",
  booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year      = {2014},
  month     = {may},
  date      = {26-31},
  address   = {Reykjavik, Iceland},
  editor    = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {978-2-9517408-8-4},
  language  = {english}
}