TamilTB: Tamil Dependency Treebank

Data

The Tamil Dependency Treebank (TamilTB) is an attempt to develop a syntactically annotated corpus for Tamil. TamilTB 1.0 contains 600 sentences enriched with manual annotation of morphology and dependency syntax in the style of the Prague Dependency Treebank. TamilTB 1.0 was created at the Institute of Formal and Applied Linguistics, Charles University in Prague.

Dataset	Sentences	Tamil tokens
train	480	7246
test	120	1892
total	600	9138
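
As a quick sanity check on the split, the counts in the table above can be recomputed in a few lines. This is only an illustrative sketch: the dictionary layout and the function name `summarize` are assumptions made for this example, not part of the TamilTB distribution.

```python
# Sentence and token counts copied from the table above.
SPLITS = {
    "train": {"sentences": 480, "tokens": 7246},
    "test": {"sentences": 120, "tokens": 1892},
}


def summarize(splits):
    """Return per-split shares and average sentence lengths, plus totals."""
    total_sentences = sum(s["sentences"] for s in splits.values())
    total_tokens = sum(s["tokens"] for s in splits.values())
    summary = {}
    for name, s in splits.items():
        summary[name] = {
            # Fraction of all sentences that fall in this split.
            "sentence_share": s["sentences"] / total_sentences,
            # Average number of tokens per sentence in this split.
            "tokens_per_sentence": s["tokens"] / s["sentences"],
        }
    summary["total"] = {
        "sentences": total_sentences,
        "tokens": total_tokens,
        "tokens_per_sentence": total_tokens / total_sentences,
    }
    return summary


if __name__ == "__main__":
    for name, stats in summarize(SPLITS).items():
        print(name, stats)
```

The totals computed this way agree with the "total" row (600 sentences, 9138 tokens), and the train portion accounts for 80% of the sentences.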

Citation

If you use the data in your research, please cite it as follows:

@inproceedings{Ramasamy:2011:TDP:1964799.1964808,
  author = {Ramasamy, Loganathan and \v{Z}abokrtsk\'{y}, Zden\v{e}k},
  title = {Tamil dependency parsing: results using rule based and corpus based approaches},
  booktitle = {Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I},
  series = {CICLing'11},
  year = {2011},
  isbn = {978-3-642-19399-6},
  location = {Tokyo, Japan},
  pages = {82--95},
  numpages = {14},
  url = {http://portal.acm.org/citation.cfm?id=1964799.1964808},
  acmid = {1964808},
  publisher = {Springer-Verlag},
  address = {Berlin, Heidelberg},
  keywords = {clause boundaries, dependency parsing, syntax, tamil},
}

Register

We would appreciate it if you could register when you download the data. Registration is not mandatory, however.

Download

TamilTB Version	Download Link	Last Updated	Comments
1.0	TamilTB-1.0.tar.gz	Nov 25, 2013	Uses UTF-8 as the main character set; other changes include revisions to the morphological and dependency tagsets.
0.1	TamilTB.v0.1	May 2011
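
Once the 1.0 archive has been downloaded, its contents can be inspected before unpacking. The sketch below uses Python's standard `tarfile` module and assumes the file was saved in the current directory under the name shown in the table; the helper name `list_members` is hypothetical, and nothing is assumed here about the directory layout inside the archive.

```python
import os
import tarfile


def list_members(archive_path):
    """Return the names of all entries in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()


if __name__ == "__main__":
    # Assumed local filename, taken from the download table; adjust as needed.
    archive_path = "TamilTB-1.0.tar.gz"
    if os.path.exists(archive_path):
        for name in list_members(archive_path):
            print(name)
    else:
        print(f"{archive_path} not found; download it first.")
```

Listing the entries first makes it easy to see which files the release contains before extracting anything.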

Contact

For data-related issues or comments, contact:
Loganathan Ramasamy: ramasamy@ufal.mff.cuni.cz
Zdeněk Žabokrtský: zabokrtsky@ufal.mff.cuni.cz

License

TamilTB 1.0 by the Institute of Formal and Applied Linguistics is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.