Data

We collected English-Tamil bilingual data from publicly available websites for NLP research involving Tamil. Standard preprocessing was applied to the raw web data to produce a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpus covers texts from the Bible, cinema, and news domains. The statistics of the current release (EnTam v2.0) are given below.

Parallel Corpus Statistics
Dataset	Sentences	#English tokens	#Tamil tokens
train	166871	3913541	2727174
test	2000	47144	32847
development	1000	23353	16376
total	169871	3984038	2776397
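
Counts of this kind can be recomputed from any sentence-aligned corpus. The following is a minimal sketch, assuming one sentence per line on each side and whitespace-separated tokens; the function name `corpus_stats` and the toy sentences are illustrative and are not taken from the EnTam release, whose file layout and tokenization are not described here.

```python
from typing import Dict, Iterable


def corpus_stats(en_lines: Iterable[str], ta_lines: Iterable[str]) -> Dict[str, int]:
    """Count sentence pairs and whitespace-separated tokens on each side.

    Assumes line i of the English side is aligned with line i of the
    Tamil side (hypothetical layout, not confirmed by the release).
    """
    en = [line.strip() for line in en_lines]
    ta = [line.strip() for line in ta_lines]
    if len(en) != len(ta):
        raise ValueError("the two sides must have the same number of lines")
    return {
        "sentences": len(en),
        "english_tokens": sum(len(line.split()) for line in en),
        "tamil_tokens": sum(len(line.split()) for line in ta),
    }


# Toy sentence pairs written for illustration only (not corpus data).
toy_english = ["He is a good actor .", "I read the news ."]
toy_tamil = ["அவன் ஒரு நல்ல நடிகர் .", "நான் செய்தி படித்தேன் ."]
print(corpus_stats(toy_english, toy_tamil))
```

With real data, the two arguments could be open file handles for the English and Tamil sides of a split. The counts may differ from the table if the tokenization differs.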

Domain Level Statistics
Domain	Sentences	#English tokens	#Tamil tokens
bible	26792 (15.77%)	703838	373082
cinema	30242 (17.80%)	445230	298419
news	112837 (66.43%)	2834970	2104896
total	169871	3984038	2776397
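
The percentages in the Sentences column are each domain's share of the total sentence count. As a small check, the sketch below recomputes them from the figures in the table; the variable names are illustrative.

```python
# Sentence counts per domain, copied from the table above.
domain_sentences = {"bible": 26792, "cinema": 30242, "news": 112837}

total_sentences = sum(domain_sentences.values())

# Share of each domain as a percentage of all sentences, two decimals.
domain_shares = {
    domain: f"{100 * count / total_sentences:.2f}%"
    for domain, count in domain_sentences.items()
}

print(total_sentences)
for domain, share in domain_shares.items():
    print(domain, share)
```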

Citation

If you use the data in your research, please cite it as follows:

@inproceedings{ biblio:RaBoMorphologicalProcessing2012,
title = {Morphological Processing for English-Tamil Statistical Machine Translation},
author = {Loganathan Ramasamy and Ond{\v{r}}ej Bojar and 
Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}}},
year = {2012},
pages = {113--122},
booktitle = {Proceedings of the Workshop on Machine Translation and Parsing 
in Indian Languages ({MTPIL}-2012)},
}

Register
We would appreciate it if you could register when you download the data. Registration is not mandatory for downloading.

Registration fields: Name, Email, Organization, Purpose

Download

Download EnTam
Parallel corpus	Release date	Description
EnTam v2.0	01-05-2013	Improved the quality of the parallel corpus (EnTam v1.0) by automatically removing unlikely translations, removing one-to-many sentences, and applying some normalization to the data.
EnTam v1.0*	08-12-2012	The data is described in the paper

* To download EnTam v1.0, please make a request by email.

Contact

For data-related issues or comments:
Loganathan Ramasamy : ramasamy@ufal.mff.cuni.cz
For other comments, please contact any of the authors listed in the paper.

License

EnTam by Institute of Formal and Applied Linguistics is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Credits

This project has been supported by (i) The European Commission's 7th Framework Program (FP7) under grant agreement n° 238405 (CLARA)