EnTam: An English-Tamil Parallel Corpus

Data

We have collected English-Tamil bilingual data from publicly available websites for NLP research involving Tamil. Standard processing was applied to the raw web data to produce a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus covers texts from the Bible, cinema, and news domains. The statistics of the current release (EnTam v2.0) are given below.

Parallel Corpus Statistics

Dataset        Sentences   #English tokens   #Tamil tokens
train             166871           3913541         2727174
test                2000             47144           32847
development         1000             23353           16376
total             169871           3984038         2776397

Domain Level Statistics

Domain         Sentences          #English tokens   #Tamil tokens
bible           26792 (15.77%)             703838          373082
cinema          30242 (17.80%)             445230          298419
news           112837 (66.43%)            2834970         2104896
total          169871                     3984038         2776397
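
The percentages in the domain table are each domain's share of the total sentence count. As a minimal sketch, the arithmetic can be reproduced as follows; the counts are copied from the table above, while the names `domain_sentences` and `domain_shares` are illustrative and not part of the EnTam release.

```python
# Sentence counts per domain, copied from the domain table above.
domain_sentences = {"bible": 26792, "cinema": 30242, "news": 112837}


def domain_shares(counts):
    """Return each domain's share of the total sentence count, in percent."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}


total = sum(domain_sentences.values())
shares = domain_shares(domain_sentences)

for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
print("total sentences:", total)
```

The total computed here matches the total of the train, test, and development rows in the first table.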

Citation

If you use the data in your research, please cite it as follows:

@inproceedings{ biblio:RaBoMorphologicalProcessing2012,
title = {Morphological Processing for English-Tamil Statistical Machine Translation},
author = {Loganathan Ramasamy and Ond{\v{r}}ej Bojar and 
Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}}},
year = {2012},
pages = {113--122},
booktitle = {Proceedings of the Workshop on Machine Translation and Parsing 
in Indian Languages ({MTPIL}-2012)},
}

Register

We would appreciate it if you registered when downloading the data. However, registration is not mandatory.


Download

Download EnTam
Parallel corpus   Release date   Description
EnTam v2.0        01-05-2013     Improves the quality of the parallel corpus (EnTam v1.0) by automatically removing unlikely translations, removing one-to-many sentences, and applying some normalization to the data.
EnTam v1.0*       08-12-2012     The data is described in the paper.

* To download EnTam v1.0, please make a request by email.

Contact

For data-related issues or comments, contact:
Loganathan Ramasamy : ramasamy@ufal.mff.cuni.cz
For other comments, please contact any of the authors listed in the paper.

Credits

This project has been supported by the European Commission's 7th Framework Program (FP7) under grant agreement n° 238405 (CLARA).

FP7 Marie Curie Actions