Data
We have collected English-Tamil bilingual data from some of the publicly available websites for NLP research involving Tamil. Standard preprocessing was applied to the raw web data to produce a sentence-aligned English-Tamil parallel corpus suitable for various NLP tasks. The parallel corpora cover texts from the bible, cinema, and news domains. The statistics of the current release (EnTam v2.0) are given below.
Dataset | Sentences | #English tokens | #Tamil tokens |
---|---|---|---|
train | 166871 | 3913541 | 2727174 |
test | 2000 | 47144 | 32847 |
development | 1000 | 23353 | 16376 |
total | 169871 | 3984038 | 2776397 |
Domain | Sentences | #English tokens | #Tamil tokens |
---|---|---|---|
bible | 26792 (15.77%) | 703838 | 373082 |
cinema | 30242 (17.80%) | 445230 | 298419 |
news | 112837 (66.43%) | 2834970 | 2104896 |
total | 169871 | 3984038 | 2776397 |
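The figures in the tables above can be recomputed from a pair of sentence-aligned files. The following is a minimal sketch, not the authors' procedure: it assumes one sentence per line, UTF-8 encoding, and whitespace-separated tokens, and the function names and file paths are hypothetical. If the release was tokenized differently, the token counts will differ from the table.

```python
def corpus_stats(en_path, ta_path):
    """Count sentence pairs and whitespace-separated tokens in two aligned files.

    Assumption: each file holds one sentence per line, and line i of the
    English file is the translation of line i of the Tamil file.
    """
    with open(en_path, encoding="utf-8") as f:
        en_lines = [line.rstrip("\n") for line in f]
    with open(ta_path, encoding="utf-8") as f:
        ta_lines = [line.rstrip("\n") for line in f]

    if len(en_lines) != len(ta_lines):
        raise ValueError(
            f"files are not aligned: {len(en_lines)} vs {len(ta_lines)} lines"
        )

    return {
        "sentences": len(en_lines),
        "english_tokens": sum(len(line.split()) for line in en_lines),
        "tamil_tokens": sum(len(line.split()) for line in ta_lines),
    }


def domain_shares(sentence_counts):
    """Return each domain's share of the total sentence count, in percent."""
    total = sum(sentence_counts.values())
    return {
        name: round(100 * count / total, 2)
        for name, count in sentence_counts.items()
    }


# Sentence counts taken from the domain table above.
shares = domain_shares({"bible": 26792, "cinema": 30242, "news": 112837})
print(shares)  # {'bible': 15.77, 'cinema': 17.8, 'news': 66.43}

# Hypothetical usage on a downloaded split (paths are placeholders):
# stats = corpus_stats("train.en", "train.ta")
```

The alignment check matters because a single missing or extra line shifts every later sentence pair; failing early is safer than silently reporting counts for a misaligned corpus.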
Citation
If you use the data in your research, please cite it as follows:
@inproceedings{ biblio:RaBoMorphologicalProcessing2012,
title = {Morphological Processing for English-Tamil Statistical Machine Translation},
author = {Loganathan Ramasamy and Ond{\v{r}}ej Bojar and
Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}}},
year = {2012},
pages = {113--122},
booktitle = {Proceedings of the Workshop on Machine Translation and Parsing
in Indian Languages ({MTPIL}-2012)},
}
Register
We would appreciate it if you registered when you download the data. However, registration is not mandatory for downloading the data.
Download
Parallel corpus | Release date | Description |
---|---|---|
EnTam v2.0 | 01-05-2013 | Improved the quality of the parallel corpus (EnTam v1.0) by automatically removing unlikely translations, removing one-to-many sentences, and applying some normalization to the data. |
EnTam v1.0* | 08-12-2012 | The data is described in the paper |
* To download EnTam v1.0, please make a request by email.
Contact
For data-related issues or comments:
Loganathan Ramasamy : ramasamy@ufal.mff.cuni.cz
For other comments, please contact any of the authors listed in the paper.
License
EnTam by Institute of Formal and Applied Linguistics is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Credits
This project has been supported by (i) The European Commission's 7th Framework Program (FP7) under grant agreement n° 238405 (CLARA)