The Tamil Dependency Treebank (TamilTB) is an attempt to develop a syntactically annotated corpus for Tamil. TamilTB 1.0 contains 600 sentences enriched with manual annotation of morphology and dependency syntax in the style of the Prague Dependency Treebank. TamilTB 1.0 was created at the Institute of Formal and Applied Linguistics, Charles University in Prague.
Dataset | Sentences | #Tamil tokens |
---|---|---|
train | 480 | 7246 |
test | 120 | 1892 |
total | 600 | 9138 |
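As a quick sanity check on the split sizes, the figures in the table can be recomputed in a few lines of Python. This is only an illustrative sketch: the numbers are copied from the table, and the names `SPLITS` and `split_summary` are hypothetical helpers, not part of the TamilTB distribution.

```python
# Split sizes copied from the table above: (sentences, Tamil tokens).
SPLITS = {"train": (480, 7246), "test": (120, 1892)}


def split_summary(splits):
    """Return total sentences, total tokens, and each split's share of both."""
    total_sents = sum(sents for sents, _ in splits.values())
    total_toks = sum(toks for _, toks in splits.values())
    shares = {
        name: (sents / total_sents, toks / total_toks)
        for name, (sents, toks) in splits.items()
    }
    return total_sents, total_toks, shares


total_sents, total_toks, shares = split_summary(SPLITS)
print(total_sents, total_toks)  # 600 9138, matching the "total" row
for name, (sent_share, tok_share) in shares.items():
    print(f"{name}: {sent_share:.1%} of sentences, {tok_share:.1%} of tokens")
print(f"average sentence length: {total_toks / total_sents:.2f} tokens")
```

The sentence split is exactly 80/20, while the token split differs slightly because sentence lengths vary between the two sets.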
If you use the data in your research, please cite it as follows:
@inproceedings{Ramasamy:2011:TDP:1964799.1964808,
author = {Ramasamy, Loganathan and \v{Z}abokrtsk\'{y}, Zden\v{e}k},
title = {Tamil dependency parsing: results using rule based and corpus based approaches},
booktitle = {Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I},
series = {CICLing'11},
year = {2011},
isbn = {978-3-642-19399-6},
location = {Tokyo, Japan},
pages = {82--95},
numpages = {14},
url = {http://portal.acm.org/citation.cfm?id=1964799.1964808},
acmid = {1964808},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
keywords = {clause boundaries, dependency parsing, syntax, tamil},
}
We would appreciate it if you could register when you download the data. Registration is not mandatory, however.
TamilTB Version | Download Link | Last Updated | Comments |
---|---|---|---|
1.0 | TamilTB-1.0.tar.gz | Nov 25, 2013 | Uses UTF-8 as the main character encoding; also includes revisions to the morphological and dependency tagsets. |
0.1 | TamilTB.v0.1 | May 2011 | |
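Once the 1.0 archive has been downloaded, it can be unpacked with Python's standard `tarfile` module. The following is a minimal sketch that assumes `TamilTB-1.0.tar.gz` has been saved in the current directory; the helper name `list_and_extract` and the output directory `TamilTB-1.0` are illustrative choices, and the internal layout of the archive is not described here, so the code simply lists whatever members it finds.

```python
import tarfile
from pathlib import Path


def list_and_extract(archive_path, dest_dir):
    """List the members of a .tar.gz archive, then extract them into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


if __name__ == "__main__":
    # Assumed local download location; adjust to wherever the file was saved.
    archive = Path("TamilTB-1.0.tar.gz")
    if archive.exists():
        for name in list_and_extract(archive, "TamilTB-1.0"):
            print(name)
    else:
        print(f"{archive} not found; download it first")
```

Since version 1.0 uses UTF-8, any text files found inside should be opened with `encoding="utf-8"` when read from Python.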
For data-related issues or comments, contact:
Loganathan Ramasamy: ramasamy@ufal.mff.cuni.cz
Zdeněk Žabokrtský: zabokrtsky@ufal.mff.cuni.cz