TamilTB: Tamil Dependency Treebank

Data

The Tamil Dependency Treebank (TamilTB) is an attempt to develop a syntactically annotated corpus for Tamil. TamilTB 1.0 contains 600 sentences enriched with manual annotation of morphology and dependency syntax in the style of the Prague Dependency Treebank. TamilTB 1.0 was created at the Institute of Formal and Applied Linguistics, Charles University in Prague.

Dataset   Sentences   Tamil tokens
train     480         7246
test      120         1892
total     600         9138
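
As a quick sanity check on the split sizes above, the following minimal sketch recomputes the totals and derives each split's share of sentences and its average sentence length. The counts are copied from the table; the names `SPLITS` and `summarize` are hypothetical and not part of the TamilTB distribution.

```python
# Split sizes copied from the table above: name -> (sentences, tokens).
# SPLITS and summarize are illustrative names, not part of TamilTB itself.
SPLITS = {"train": (480, 7246), "test": (120, 1892)}


def summarize(splits):
    """Return total sentences, total tokens, and per-split statistics."""
    total_sents = sum(sents for sents, _ in splits.values())
    total_toks = sum(toks for _, toks in splits.values())
    rows = {}
    for name, (sents, toks) in splits.items():
        rows[name] = {
            "sentence_share": sents / total_sents,  # fraction of all sentences
            "tokens_per_sentence": toks / sents,    # average sentence length
        }
    rows["total"] = {
        "sentence_share": 1.0,
        "tokens_per_sentence": total_toks / total_sents,
    }
    return total_sents, total_toks, rows


if __name__ == "__main__":
    total_sents, total_toks, rows = summarize(SPLITS)
    print(f"sentences: {total_sents}, tokens: {total_toks}")
    for name, stats in rows.items():
        print(f"{name}: {stats['sentence_share']:.0%} of sentences, "
              f"{stats['tokens_per_sentence']:.2f} tokens per sentence")
```

The recomputed totals should match the "total" row of the table, and the split works out to 80% train and 20% test by sentence count.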

Citation

If you use the data in your research, please cite it as follows:

@inproceedings{Ramasamy:2011:TDP:1964799.1964808,
  author = {Ramasamy, Loganathan and \v{Z}abokrtsk\'{y}, Zden\v{e}k},
  title = {Tamil dependency parsing: results using rule based and corpus based approaches},
  booktitle = {Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I},
  series = {CICLing'11},
  year = {2011},
  isbn = {978-3-642-19399-6},
  location = {Tokyo, Japan},
  pages = {82--95},
  numpages = {14},
  url = {http://portal.acm.org/citation.cfm?id=1964799.1964808},
  acmid = {1964808},
  publisher = {Springer-Verlag},
  address = {Berlin, Heidelberg},
  keywords = {clause boundaries, dependency parsing, syntax, tamil},
}

Register

We would appreciate it if you registered when you download the data. However, registration is not mandatory for downloading the data.


Download

TamilTB Version   Download Link        Last Updated   Comments
1.0               TamilTB-1.0.tar.gz   Nov 25, 2013   Uses UTF-8 as the main character set; also includes revisions to the morphological and dependency tagsets.
0.1               TamilTB.v0.1         May 2011
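
After downloading, it can help to inspect the archive before unpacking it. The sketch below lists the members of a gzip-compressed tar file using Python's standard `tarfile` module. It assumes the 1.0 release was saved locally under the file name shown in the table; the function name `list_archive_members` is hypothetical, and no assumption is made about the archive's internal layout.

```python
import tarfile
from pathlib import Path


def list_archive_members(archive_path):
    """Return the member names of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


if __name__ == "__main__":
    # Assumed local file name, taken from the download table above.
    archive_path = Path("TamilTB-1.0.tar.gz")
    if archive_path.exists():
        for name in list_archive_members(archive_path):
            print(name)
    else:
        print(f"{archive_path} not found; download it first.")
```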

Contact

For data-related issues or comments, contact:
Loganathan Ramasamy: ramasamy@ufal.mff.cuni.cz
Zdeněk Žabokrtský: zabokrtsky@ufal.mff.cuni.cz