Introduction

The abbreviation TamilTB.v0.1 is used throughout this document in place of Tamil Dependency Treebank v0.1. This page summarises the different annotation layers.

Background & Objectives

A treebank is an important resource for building parsers and analysing a language. To the best of our knowledge, this is the first attempt to build a treebank for the Tamil language. Tamil belongs to the Dravidian family of languages and is mainly spoken in the southern part of India. Its main features include Subject-Object-Verb (SOV) word order, rich morphology and agglutination.

The main objectives of this project are:

Annotate data at the word level and the syntactic level
Aim for the maximum level of linguistic representation at each level of annotation
Build large annotated corpora using an automatic annotation process

Data

The data used for the TamilTB.v0.1 annotation comes from the news domain. We decided to use news data for two reasons: (i) a huge amount of data is available in digital format and can be easily downloaded, and (ii) news data can be considered representative of written Tamil. At present, the data for the annotation comes from www.dinamani.com, from which we downloaded pages at random, covering various news topics. Table 2.1 below summarises the data used for annotation.

Table 2.1: Corpus Information
No	Description	Value
1	Source	www.dinamani.com
2	Source Transliterated	Yes
3	Number of Words	9581
4	Number of Sentences	600
5	Morphological Layer Annotation (sen)	600
6	Syntactic Annotation (sen)	600
7	Tectogrammatical Layer Annotation	-

Text preprocessing

Before annotation, the text was preprocessed in three steps:

Transliteration
Sentence segmentation
Tokenization

Transliteration

The UTF-8 encoded Tamil raw text was transliterated to Latin script for ease of processing at all levels of annotation. The raw UTF-8 encoded text can be recovered by applying reverse transliteration to the Latin-transcribed text. The transliteration of the basic vowels and consonants is given below.

Table 2.2: Transliteration
Tamil vowels:	அ	ஆ	இ	ஈ	உ	ஊ	எ	ஏ	ஐ	ஒ	ஓ	ஔ	ஃ
Transl. vowels:	a	A	i	I	u	U	e	E	ai	o	O	au	q
Consonants:	க்	ங்	ச்	ஞ்	ட்	ண்	த்	ந்	ப்	ம்	ய்	ர்	ல்	வ்	ழ்	ள்	ற்	ன்
Transl. consonants:	k	ng	c	nj	t	N	T	w	p	m	y	r	l	v	z	L	R	n
Sanskrit sounds:	ஷ்	ஹ்	ஜ்	ஸ்ரீ	ஸ்
Transl. Sanskrit sounds:	sh	h	j	sri	S

The table above shows only the transliteration of the basic Tamil letters. Vowel+consonant combinations are separate characters in Tamil; the mappings for those combinations are given in a separate mapping file. The transliteration for the entire Tamil character set can be found here: utf8_to_latin_map.txt (set the browser encoding to UTF-8 to view the contents properly).
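
As an illustration of how the basic mapping in Table 2.2 could be applied in both directions, here is a minimal Python sketch. It is not the project's transliteration tool: the function name `convert` and the greedy longest-match strategy are assumptions made for this example, and only the characters listed in Table 2.2 are covered (vowel+consonant combinations would need the separate mapping file).

```python
# Basic characters copied from Table 2.2. Vowel+consonant combinations are
# NOT covered; the source keeps those in a separate mapping file.
TAMIL = (
    "அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ ஃ "
    "க் ங் ச் ஞ் ட் ண் த் ந் ப் ம் ய் ர் ல் வ் ழ் ள் ற் ன் "
    "ஷ் ஹ் ஜ் ஸ்ரீ ஸ்"
).split()
LATIN = (
    "a A i I u U e E ai o O au q "
    "k ng c nj t N T w p m y r l v z L R n "
    "sh h j sri S"
).split()

TO_LATIN = dict(zip(TAMIL, LATIN))
TO_TAMIL = dict(zip(LATIN, TAMIL))


def convert(text, table):
    """Greedy longest-match replacement (an assumed strategy).

    Characters that are not in the table are copied through unchanged.
    """
    longest = max(len(key) for key in table)
    out, i = [], 0
    while i < len(text):
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


latin = convert("ஆம்", TO_LATIN)
tamil = convert(latin, TO_TAMIL)
print(latin, tamil)
```

Greedy matching resolves multi-letter Latin codes such as ai, au, ng and sri in favour of the longest entry; whether that is always the right choice for real text is not something the table alone settles.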

Sentence segmentation

Before annotation, the raw corpus downloaded from the source is segmented into sentences automatically. As in English, Tamil text contains many places that look like sentence boundaries but are not. Such ambiguous boundaries (for example, dots in decimal numbers, initials in names, and dates) are detected through heuristics, and the text is segmented only at the appropriate places.
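
The exact heuristics used for TamilTB.v0.1 are not listed here, so the following Python sketch only illustrates the general idea for the three cases mentioned above (decimal numbers, dates and initials). The function name `split_sentences`, the two rules and the English sample text are all assumptions chosen for readability.

```python
import re


def split_sentences(text):
    """Split at '.', '?' or '!' unless a heuristic says the mark is not final."""
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]", text):
        i = match.start()
        next_char = text[i + 1:i + 2]
        # Heuristic 1: a mark glued to the next character (3.5, 12.05.2010)
        # sits inside a token, so it is not a sentence boundary.
        if next_char and not next_char.isspace():
            continue
        # Heuristic 2: a dot after a single letter is treated as an initial.
        words = text[start:i].split()
        if text[i] == "." and words and len(words[-1]) == 1 and words[-1].isalpha():
            continue
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "The price rose 3.5 percent on 12.05.2010. A. B. Kumar spoke. It ended."
for sentence in split_sentences(sample):
    print(sentence)
```

Rules of this kind trade recall for precision: a sentence that really ends in a single letter would be missed by the second rule, which is one reason heuristic output usually still needs checking.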

Tokenization

Tokenization is an important module that supports the annotation task. It splits a sentence into words. Tamil uses spaces to mark word boundaries. However, many Tamil wordforms are agglutinative, meaning that they glue together at least two words (in the majority of cases). These cases include determiners+nouns, nouns+postpositions, verbs+particles, nouns+particles, etc. Except for the first pattern (determiners+nouns), the second part of the wordform comes from a restricted set that can be listed, so it is possible to split certain agglutinative wordforms into separate tokens. Certain particles (also called clitics), such as {உம்/um/also} and {ஓ/O/or}, are not normally treated as separate tokens in Tamil, but for the purpose of annotation we treat them as separate tokens. The same module will be used for tokenization when parsing raw Tamil text.

Example 2.1: Splitting Agglutinative Combinations in Tokenization
Before Splitting	puTiya cattaTTinpati , pATukAkkappatta winaivuc cinnaTTiliruwTu 1000 ati varai ewTa kattumAnamum katta anumaTi illai .
After Splitting	puTiya cattaTTin pati , pATukAkkap patta winaivuc cinnaTT iliruwTu 1000 ati varai ewTa kattumAnam um katta anumaTi illai .

In Example 2.1 above, pati and iliruwTu are postpositions, patta is an auxiliary verb and um is a clitic. This kind of agglutination is very prevalent in Tamil, and splitting the known combinations into separate tokens reduces the vocabulary size, which is useful for the tagging process.

Table 2.3: Words and Suffixes for Tokenization
Clitics	um, E, EyE, AvaTu
Postpositions	kUta, utan, pati, kuRiTTu, iliruwTu, anRu, uL, ARu, Tavira, pOTu, pOla, pinnar, pin, arukE, aRRa, inRi, illATa, mITu, kIz, mEl, munpE, otti, paRRi, paRRiya, pOnRa, mUlam, vaziyAka etc.
Auxiliary verbs	patta, pattu, uLLa, pata, mAttATu, patuvArkaL, uLLAr, uLLanar, illai, iruwTAr, iruwTaTu, pattaTu, pattana, mutiyum, kUtATu, vENtum, kUtum, iruppin, uLLana, mutiyATu, patATu, koNtu, ceyTu etc.
Particles	Aka, Ana and their spelling variants such as Akak, Akac, AkaT
Demonstrative pronouns (as prefixes)	ap, ac, ic, iw, aw

Some of the words and suffixes that most commonly take part in agglutination in the corpus are given in Table 2.3 above. Except for demonstrative pronouns, all of these words and suffixes are added after the stem. Among the categories in Table 2.3, clitics and particles take part in agglutination most often. The tokenizer uses this list and tries to separate these words from the original wordform. Even after tokenization, the original sentence can be reconstructed using the attribute 'no_space_after', which is set to 1 if the following token is part of the same original wordform as the current token. Whenever a split takes place, this attribute is set to 1 for the first token. For example, in Example 2.1 the 'no_space_after' attribute is 1 for pATukAkkap, whereas it is 0 for um. The splitting of the corpus was done semi-automatically, using some of the most commonly occurring combinations from the list above, and was edited manually during the annotation process. At present, the tokenizer includes only a few commonly occurring combinations from Table 2.3: clitics, particles and a very few postpositions.
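
The interplay between splitting and the 'no_space_after' attribute can be sketched in Python as follows. This is not the project's tokenizer: the names `Token`, `split_wordform`, `tokenize` and `detokenize`, the minimum stem length, and the restriction to the clitics of Table 2.3 are assumptions made to keep the example short.

```python
from dataclasses import dataclass

# Clitics from Table 2.3, tried longest first so that EyE wins over E.
CLITICS = sorted(["um", "E", "EyE", "AvaTu"], key=len, reverse=True)


@dataclass
class Token:
    form: str
    no_space_after: int = 0  # 1 if the next token belongs to the same wordform


def split_wordform(wordform, suffixes=CLITICS, min_stem=2):
    """Split off one known suffix; min_stem is an assumed safeguard."""
    for suffix in suffixes:
        stem = wordform[:-len(suffix)]
        if wordform.endswith(suffix) and len(stem) >= min_stem:
            return [Token(stem, 1), Token(suffix, 0)]
    return [Token(wordform, 0)]


def tokenize(sentence):
    return [tok for word in sentence.split() for tok in split_wordform(word)]


def detokenize(tokens):
    """Rebuild the original sentence from forms and no_space_after flags."""
    parts = []
    for tok in tokens:
        parts.append(tok.form)
        if not tok.no_space_after:
            parts.append(" ")
    return "".join(parts).strip()


tokens = tokenize("ewTa kattumAnamum katta")
print([(t.form, t.no_space_after) for t in tokens])
print(detokenize(tokens))
```

Pure suffix matching over-splits: it would also cut um off mutiyum, which Table 2.3 lists as an auxiliary verb in its own right. That limitation is consistent with the splits having been edited manually during annotation.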

We evaluated how many such combinations were split in the original corpus by counting how many 'no_space_after' attributes are set to 1. We found that 953 splits took place out of 9581 words. In other words, almost 10% of the corpus size is due to splitting wordforms into separate tokens.
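
The arithmetic behind the figure of almost 10% can be checked with a few lines of Python. The text does not say whether 9581 counts tokens before or after splitting, so the sketch below computes the percentage under both readings; the function names are hypothetical.

```python
def count_splits(no_space_after_flags):
    """Number of splits = number of tokens whose no_space_after is 1."""
    return sum(1 for flag in no_space_after_flags if flag == 1)


def percentage(part, whole):
    return 100.0 * part / whole


SPLITS = 953   # reported number of splits
WORDS = 9581   # reported number of words

# Reading 1: splits as a share of the reported 9581 words.
share_of_reported = percentage(SPLITS, WORDS)

# Reading 2 (assumption): 9581 already includes the extra tokens, so the
# unsplit corpus had 9581 - 953 wordforms and grew by this percentage.
growth_over_unsplit = percentage(SPLITS, WORDS - SPLITS)

print(f"{share_of_reported:.2f}% {growth_over_unsplit:.2f}%")
```

Under the first reading the share is about 9.95%, which matches the "almost 10%" stated above; under the second reading the growth relative to the unsplit corpus is about 11%.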

Layers of Annotation

The annotation scheme followed for TamilTB.v0.1 is similar to that of the Prague Dependency Treebank 2.0 (PDT 2.0). PDT 2.0 uses the notion of 'layers' to distinguish annotation at various linguistic levels, such as the word level and the structural level. Specifically, PDT 2.0 is annotated on three layers: (i) the morphological layer (m-layer), (ii) the analytical layer (a-layer) and (iii) the tectogrammatical layer (t-layer). At present, TamilTB.v0.1 is annotated on only two layers: the m-layer and the a-layer.

Example 2.2: A Tamil Sentence
Tamil:	பண்பாட்டு	அடையாளங்களைப்	பாதுகாக்க	தொல்பொருள்	ஆய்வுத்	துறை	உருவாக்கப்	பட்டு	,	தனிச்	சட்டங்கள்	இயற்றப்	பட்டு	உள்ளன	.
Tr:	paNpAttu	ataiyALangkaLaip	pATukAkka	TolporuL	AyvuT	TuRai	uruvAkkap	pattu	,	Tanic	cattangkaL	iyaRRap	pattu	uLLana	.
Gloss:	Cultural	symbols	to protect	Archaeology		department	to create	AUX	,	separate	laws	to enact	AUX	AUX
English:	Having created Archaeological Department, separate laws have been enacted to protect cultural symbols.

In the example above, the sentence is given in Tamil script (Tamil:) in the 1st row, its transliteration (Tr:) in the 2nd row, the gloss in the 3rd row, and the English translation in the 4th row. The same format is used for sentence examples elsewhere in the document. There are 15 words in the Tamil sentence (including punctuation), and each word is treated as a node in each annotation layer. Each node has general attributes and attributes specific to a particular annotation layer. For example, a node in the morphological layer has attributes such as 'lemma', 'form', 'tag' and 'no_space_after', corresponding to the lemma, the wordform, the POS tag of the wordform, and whether the following word is part of the current node (wordform). A node in the analytical layer has attributes such as the dependency label of the current node ('afun'), whether the current node is an element of a coordination ('is_member'), etc. These attributes are set automatically during annotation when using TrEd. Also, the attributes of the lower layer (m-layer) are visible to the upper layers (a-layer or t-layer).

Only the transliterated version of the text is used in all layers of annotation, for ease of processing. Examples in Tamil script are shown for display purposes only.

The following subsections briefly describe the annotation layers of TamilTB.v0.1 with an example.

Morphological Layer

The purpose of the m-layer is to assign a part-of-speech (POS) tag or a more refined morphological tag to each word in the sentence. This is accomplished by setting the 'tag' attribute of the node (which corresponds to a word) to the POS or morphological tag. The 'lemma' attribute stores the lemma of the wordform, that is, its conceptual root or the word as listed in a dictionary. Figure 2.1 illustrates m-layer annotation.

Figure 2.1: An Example for Morphological Layer Annotation

As Figure 2.1 shows, three text values are displayed at each node. The text at the top of the node is the 'form', the exact word as it appeared in the text. The text in the middle (e.g. paNpAttu) is the 'lemma', and the text at the bottom (e.g. NO--3SN--) is the morphological tag of the wordform. Each morphological tag is a string of length 9, and each character position corresponds to some feature of the wordform. The first 2 positions of the morphological tag correspond to the main POS and the refined POS; together they represent the fine details of a wordform. It is thus possible to train a POS tagger on either a fine-grained or a coarse-grained tagset. This kind of tagging is known as positional tagging. Positional tagging is suitable for morphologically rich languages and has been successfully applied to languages such as Czech. The Section: Morphological Annotation gives a detailed description of positional tagging and the tagset used for annotating TamilTB.v0.1.
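
To make the positional idea concrete, the following Python sketch splits a 9-character tag into its first two positions and the remainder, and derives a coarse-grained and a fine-grained tag from it. Only the tag length and the meaning of the first two positions are taken from the text above; the names `PositionalTag`, `parse_tag`, `coarse_tag` and `fine_tag` are hypothetical, and the remaining seven positions are deliberately left uninterpreted.

```python
from typing import NamedTuple

TAG_LENGTH = 9  # every morphological tag has exactly 9 positions


class PositionalTag(NamedTuple):
    pos: str     # position 1: main POS
    subpos: str  # position 2: refined POS
    rest: str    # positions 3-9: further features, not interpreted here


def parse_tag(tag):
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}: {tag!r}")
    return PositionalTag(tag[0], tag[1], tag[2:])


def coarse_tag(tag):
    """Coarse-grained tag: main POS only."""
    return parse_tag(tag).pos


def fine_tag(tag):
    """Finer-grained tag: main POS plus refined POS."""
    parsed = parse_tag(tag)
    return parsed.pos + parsed.subpos


print(parse_tag("NO--3SN--"), coarse_tag("NO--3SN--"), fine_tag("NO--3SN--"))
```

Because every feature has a fixed position, a tagger can be trained on any prefix or projection of the tag without re-annotating the data.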

Analytical Layer

The analytical layer (a-layer) is used to annotate the sentence at the syntactic level. There are two phases in a-layer annotation: (i) capturing the dependency structure of the sentence in the form of a tree and (ii) identifying the relationships between words (nodes) in the tree. From the m-layer, we know that each word corresponds to a node in the tree, but these nodes do not yet have parents assigned. The dependency structure is captured by hanging dependent nodes (words) under their governing nodes (words); visually, dependent nodes hang as children of their governing nodes. There is one extra node, called the technical root, to which the predicate node and the terminal node (end of the sentence) are attached. Figure 2.2 illustrates the a-layer annotation of the sentence shown in Example 2.2.

Figure 2.2: An Example for Analytical Layer (a-layer) Annotation

Edges between nodes indicate the relationship by which they are connected; in linguistic terms, this is the syntactic relation between governor and dependent. The relationship between two nodes is stored in the attribute called afun. Instead of being stored along the edge, the afun is treated as another attribute of the dependent node, so the afun value of a node indicates the relation by which it connects to its governing node. For example, the afun value of the word 'cattangkaL' (laws) is 'Sb', meaning Subject, and the word is connected to the verb 'iyaRRap' (to enact).
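
The convention of storing the relation on the dependent rather than on the edge can be pictured with a small Python sketch. The class `ANode` and the helper `children` are hypothetical and are not the TrEd or TectoMT data model; only the 'afun' and 'is_member' attribute names and the cattangkaL/iyaRRap example come from the text. The afun and the parent of iyaRRap are left unset because they are not stated here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # nodes are compared by identity, not by field values
class ANode:
    form: str
    afun: Optional[str] = None        # relation to the governing node
    parent: Optional["ANode"] = None  # governing node, None if unassigned
    is_member: bool = False           # element of a coordination


def children(node: ANode, nodes: List[ANode]) -> List[ANode]:
    """All nodes that hang directly under the given node."""
    return [n for n in nodes if n.parent is node]


# Fragment of Example 2.2: only the relation given in the text is filled in.
verb = ANode("iyaRRap")
subject = ANode("cattangkaL", afun="Sb", parent=verb)
nodes = [verb, subject]

for child in children(verb, nodes):
    print(f"{child.form} --{child.afun}--> {child.parent.form}")
```

Since each node has exactly one governor, keeping the label on the dependent loses no information and avoids a separate edge object.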

A more detailed treatment of the various syntactic relationships and the a-layer annotation scheme is given in Section: Syntactic Annotation.

Obtaining Data

The annotated data is available in three formats:

TMT format - XML-based format used in the TectoMT system
CoNLL format - tab-separated format in the CoNLL shared task style
TnT style POS tagged format - tab-separated columns with word forms, POS tags, and lemmas
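
For the CoNLL-style files, a reader can be sketched in a few lines of Python. The sketch assumes the ten-column layout of the CoNLL-X shared task (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines between sentences; whether the TamilTB.v0.1 files use exactly these columns should be checked against the downloaded data. The function name `read_conll` and the two sample rows (with unknown fields left as "_" and an illustrative HEAD of 0 for the verb) are made up for this example.

```python
CONLL_COLUMNS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)  # assumed CoNLL-X column order


def read_conll(lines):
    """Yield one sentence at a time as a list of {column: value} dicts."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():       # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(CONLL_COLUMNS, line.split("\t"))))
    if sentence:                   # last sentence without trailing blank line
        yield sentence


# Illustrative rows only; "_" marks fields not filled in for this example.
sample = [
    "1\tcattangkaL\t_\t_\t_\t_\t2\tSb\t_\t_\n",
    "2\tiyaRRap\t_\t_\t_\t_\t0\t_\t_\t_\n",
    "\n",
]
for sent in read_conll(sample):
    for row in sent:
        print(row["id"], row["form"], row["head"], row["deprel"])
```

With a real file, the same function can be fed an open file handle, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to the downloaded CoNLL file.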

The syntactic trees can be comfortably browsed in the TrEd tree editor, after installing the TMT file support extension into it (Menu: Setup->Manage extensions... install TMT files support extension).

Go to the download section to obtain the data.