Introduction
The abbreviation TamilTB.v0.1 is used throughout this document for Tamil Dependency Treebank v0.1. This page serves as a summary of the different annotation layers.
Background & Objectives
A treebank is an important resource for building parsers and for analysing a
language. To the best of our knowledge, this is the first
attempt to build a treebank for the Tamil language. Tamil belongs to the
Dravidian family of languages and is mainly spoken in the southern part of India. The main features of
Tamil include Subject-Object-Verb (SOV) word order,
rich morphology and agglutination.
The main objectives of this project are:
- Annotate data at the word level and the syntactic level
- Aim for the maximum level of linguistic representation at each level of annotation
- Build large annotated corpora using an automatic annotation process.
The data used for the TamilTB.v0.1 annotation comes from the news domain. We decided to use news data
for two reasons: (i) a large amount of data is available in digital
form and is easy to download, and (ii) news data can be
considered representative of written Tamil. At present, the data for the
annotation comes from www.dinamani.com; we downloaded pages at random, covering various news topics.
Table 2.1 below summarises the data used for annotation.
Table 2.1: Corpus Information
No | Description | Value
1 | Source | www.dinamani.com
2 | Source Transliterated | Yes
3 | Number of Words | 9581
4 | Number of Sentences | 600
5 | Morphological Layer Annotation (sen) | 600
6 | Syntactic Annotation (sen) | 600
7 | Tectogrammatical Layer Annotation | -
Before annotation started, the text was preprocessed in three steps:
- Transliteration
- Sentence segmentation
- Tokenization
Transliteration
The UTF-8 encoded raw Tamil text was transliterated to Latin script for ease
of processing at all levels of annotation. The
original UTF-8 encoded text can be recovered by applying reverse
transliteration to the Latin-transcribed text. The transliteration of the basic vowels and
consonants is given below.
Table 2.2: Transliteration
Tamil vowels: | அ | ஆ | இ | ஈ | உ | ஊ | எ | ஏ | ஐ | ஒ | ஓ | ஔ | ஃ
Transl. vowels: | a | A | i | I | u | U | e | E | ai | o | O | au | q
Consonants: | க் | ங் | ச் | ஞ் | ட் | ண் | த் | ந் | ப் | ம் | ய் | ர் | ல் | வ் | ழ் | ள் | ற் | ன்
Transl. consonants: | k | ng | c | nj | t | N | T | w | p | m | y | r | l | v | z | L | R | n
Sanskrit sounds: | ஷ் | ஹ் | ஜ் | ஸ்ரீ | ஸ்
Transl. Sanskrit sounds: | sh | h | j | sri | S
The table above shows only the transliteration of the basic
Tamil letters. Vowel+consonant combinations
are separate characters in Tamil; the mappings for those combinations
are given in a separate mapping file. The transliteration for the
entire Tamil character set can be found here:
utf8_to_latin_map.txt (set the browser encoding to UTF-8 to view the contents properly).
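As a rough illustration of how the basic mapping in Table 2.2 could be applied, the Python sketch below transliterates text by greedy longest match. It is not the project's transliteration tool: the names BASIC_MAP and transliterate are made up for this example, only a handful of entries from Table 2.2 are included, and the complete mapping (including vowel+consonant combinations) is the one in utf8_to_latin_map.txt.

```python
# Illustrative subset of Table 2.2 (assumption: the real tool uses the
# full mapping file, which is not reproduced here).
BASIC_MAP = {
    "அ": "a", "ஆ": "A", "இ": "i", "ஈ": "I", "உ": "u", "ஊ": "U",
    "க்": "k", "ம்": "m", "ன்": "n", "ல்": "l", "ஸ்ரீ": "sri",
}

# Longest key length in code points, so multi-code-point keys are tried first.
MAX_KEY_LEN = max(len(key) for key in BASIC_MAP)


def transliterate(text):
    """Greedy longest-match transliteration; unmapped characters pass through."""
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_KEY_LEN, 0, -1):
            chunk = text[i:i + size]
            if chunk in BASIC_MAP:
                out.append(BASIC_MAP[chunk])
                i += len(chunk)
                break
        else:
            # No key matched at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate("அம்"))
```

With this subset, the two-letter string அம் is rendered as am. Longest match matters because a consonant with its dot (for example ம்) is stored as two code points and must be looked up as a unit.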
Sentence segmentation
Before annotation takes place, the raw
corpus downloaded from the
source is segmented into sentences automatically. As in English, Tamil text
contains many places that look like sentence boundaries
but are not. Such
ambiguous boundaries (for example dots
in decimal numbers,
in initials
in names and in dates) are detected through heuristics, and the text
is segmented only at the appropriate
places.
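The heuristics themselves are not spelled out here, so the following Python sketch only shows the general idea under simple assumptions: a dot between two digits (decimal numbers, dates) and a dot after a single-letter token (an initial) are not treated as sentence ends. The function name segment_sentences and the sample string are invented for illustration.

```python
def segment_sentences(text):
    """Split text at dots, skipping dots in numbers/dates and after initials."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch != ".":
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        # Dot between digits: decimal number (12.5) or date (12.05.2010).
        if prev_ch.isdigit() and next_ch.isdigit():
            continue
        # Dot after a single-letter token: treated as an initial in a name.
        token_start = text.rfind(" ", 0, i) + 1
        token = text[token_start:i]
        if len(token) == 1 and token.isalpha():
            continue
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Invented sample in the transliterated alphabet, for illustration only.
sample = "K. rAman 12.5 ati varai . anumaTi 12.05.2010 illai ."
for sentence in segment_sentences(sample):
    print(sentence)
```

A real segmenter would need more rules (abbreviations, quotation marks, other end-of-sentence punctuation); the point is only that each ambiguous dot is checked against its neighbours before a boundary is accepted.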
Tokenization
Tokenization is an important
module that supports the annotation
task. This module splits a sentence into words. Tamil uses spaces to
mark word boundaries. However, many Tamil wordforms are agglutinative
in nature, meaning that they glue together at least two words (in the majority of
cases). These cases can be identified as determiners+nouns,
nouns+postpositions, verbs+particles, nouns+particles and so on. Except for
the first pattern (determiners+nouns), the second
part of the wordform comes from a restricted set that can be listed. It is therefore
possible to split certain Tamil agglutinative wordforms into separate
tokens. Certain particles (also called clitics) such as
{உம்/um/also} and {ஓ/O/or} are not written as separate words
in Tamil, but
for the purpose of annotation we treat them as separate tokens. The
same module will be used for tokenization when parsing raw Tamil
text.
Example 2.1: Splitting Agglutinative Combinations in Tokenization
Before Splitting | puTiya cattaTTinpati , pATukAkkappatta winaivuc cinnaTTiliruwTu 1000 ati varai ewTa kattumAnamum katta anumaTi illai .
After Splitting | puTiya cattaTTin pati , pATukAkkap patta winaivuc cinnaTT iliruwTu 1000 ati varai ewTa kattumAnam um katta anumaTi illai .
In Example 2.1 above, pati and iliruwTu are postpositions, patta is an auxiliary verb and um is a clitic.
This kind of agglutination is very prevalent in Tamil. Splitting the known combinations
into separate tokens reduces the vocabulary size, which helps the tagging process.
Table 2.3: Words and Suffixes for Tokenization
Clitics | um, E, EyE, AvaTu
Postpositions | kUta, utan, pati, kuRiTTu, iliruwTu, anRu, uL, ARu, Tavira, pOTu, pOla, pinnar, pin, arukE, aRRa, inRi, illATa, mITu, kIz, mEl, munpE, otti, paRRi, paRRiya, pOnRa, mUlam, vaziyAka etc.
Auxiliary verbs | patta, pattu, uLLa, pata, mAttATu, patuvArkaL, uLLAr, uLLanar, illai, iruwTAr, iruwTaTu, pattaTu, pattana, mutiyum, kUtATu, vENtum, kUtum, iruppin, uLLana, mutiyATu, patATu, koNtu, ceyTu etc.
Particles | Aka, Ana and their spelling variants such as Akak, Akac, AkaT
Demonstrative pronouns (as prefixes) | ap, ac, ic, iw, aw
Some of the most frequent words and suffixes (from
the corpus) that participate in agglutination are given in
Table 2.3 above. Except for demonstrative pronouns, all of these words and suffixes
are attached after the stem. Among the categories in
Table 2.3, clitics and particles take part in agglutination most often.
The tokenizer uses this list and tries to separate these
words from the original wordform. Even after tokenization, the
original sentence can be reconstructed using the
attribute 'no_space_after'.
The 'no_space_after' attribute
is set to 1 if the following token belongs to the same original wordform as the current
token. Whenever a split takes place, this attribute is set to
1 for the first token. For example, the
'no_space_after' attribute for pATukAkkap is 1,
whereas the 'no_space_after'
attribute for um is 0.
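The splitting and the role of 'no_space_after' can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the actual tokenizer: the suffix list is a four-item subset of Table 2.3, splitting is plain string-suffix matching (which would over-split real text, for instance any word that merely ends in um), and the function names are invented.

```python
# Tiny subset of Table 2.3, chosen to cover Example 2.1 (assumption).
SPLITTABLE = ["iliruwTu", "patta", "pati", "um"]


def split_wordform(word):
    """Split off one known trailing word/suffix, trying longer ones first."""
    for suffix in sorted(SPLITTABLE, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[:-len(suffix)], suffix]
    return [word]


def tokenize(sentence):
    """Return tokens as dicts with 'form' and 'no_space_after'."""
    tokens = []
    for word in sentence.split():
        parts = split_wordform(word)
        for k, part in enumerate(parts):
            # 1 means the next token was glued to this one in the source text.
            glued = 1 if k < len(parts) - 1 else 0
            tokens.append({"form": part, "no_space_after": glued})
    return tokens


def detokenize(tokens):
    """Rebuild the original sentence from the tokens."""
    pieces = []
    for token in tokens:
        pieces.append(token["form"])
        if token["no_space_after"] == 0:
            pieces.append(" ")
    return "".join(pieces).strip()


before = ("puTiya cattaTTinpati , pATukAkkappatta winaivuc cinnaTTiliruwTu "
          "1000 ati varai ewTa kattumAnamum katta anumaTi illai .")
tokens = tokenize(before)
print(" ".join(t["form"] for t in tokens))
print(detokenize(tokens) == before)
```

On the sentence from Example 2.1 this subset reproduces the "After Splitting" row, and detokenize restores the "Before Splitting" row, which is the round trip that 'no_space_after' is meant to guarantee.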
The splitting of the corpus was done
semi-automatically, using some of the most frequent
combinations from the list above, and was edited manually during the
annotation process. At present, the tokenizer covers only a few
frequent combinations from
Table 2.3: clitics,
particles and a very few postpositions.
We measured how many such combinations were split in the
original corpus by counting how many
'no_space_after' attributes
are set to 1. We found that
953
splits took place out of
9581
words. In other words, almost
10% of the corpus
size is due to splitting wordforms into separate tokens.
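The count and the percentage can be reproduced with a few lines of Python. The helper split_share is a hypothetical name, and the calculation takes 9581 as the denominator, as the text does.

```python
def split_share(tokens, word_count):
    """Count tokens with no_space_after == 1 and express them as a percentage."""
    splits = sum(1 for token in tokens if token["no_space_after"] == 1)
    return splits, 100.0 * splits / word_count


# Figures reported in the text: 953 splits out of 9581 words.
print(round(100.0 * 953 / 9581, 2))  # 9.95
```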
The annotation scheme followed for TamilTB.v0.1 is similar to that of the
Prague Dependency Treebank 2.0 (PDT 2.0).
PDT 2.0 uses the notion of 'layers' to distinguish annotation at different
linguistic levels, such as the word level and the structural level. Specifically,
PDT 2.0 is annotated on three layers: (i) the
morphological layer (m-layer), (ii) the
analytical layer (a-layer) and
(iii) the tectogrammatical layer
(t-layer). At present, TamilTB.v0.1
is annotated on only two layers: the m-layer and the a-layer.
Example 2.2: A Tamil Sentence
Tamil: | பண்பாட்டு | அடையாளங்களைப் | பாதுகாக்க | தொல்பொருள் | ஆய்வுத் | துறை | உருவாக்கப் | பட்டு | , | தனிச் | சட்டங்கள் | இயற்றப் | பட்டு | உள்ளன | .
Tr: | paNpAttu | ataiyALangkaLaip | pATukAkka | TolporuL | AyvuT | TuRai | uruvAkkap | pattu | , | Tanic | cattangkaL | iyaRRap | pattu | uLLana | .
Gloss: | Cultural | symbols | to protect | Archaeology | | department | to create | AUX | , | separate | laws | to enact | AUX | AUX |
English: | Having created the Archaeological Department, separate laws have been enacted to protect cultural symbols.
In the above example, the actual sentence is given in Tamil
script (marked Tamil:) in the
first row, the transliterated version (marked Tr:) in
the second row, the gloss in the third row, and the English translation
in the fourth row. The same format is used for sentence examples
elsewhere in the document. There are 15 words in the Tamil sentence
(including punctuation), and each word is treated as a node in each
annotation layer. Each node has general attributes and attributes
specific to a particular annotation layer. For example, a node in the
morphological layer has attributes such as
'lemma', 'form', 'tag' and 'no_space_after',
which hold the lemma, the wordform, the POS tag of the wordform,
and whether the following word is part of the current node (wordform).
A node in the analytical layer has attributes such as the dependency
label ('afun') of the current node and whether the current node is a member of a coordination
('is_member'). These attributes are set automatically during annotation when using
TrEd. The attributes of the lower layer (m-layer) are also visible to the upper layers (a-layer or t-layer).
Only the transliterated version of the text is used in all layers of
annotation, for ease of processing. Examples in Tamil script are
shown for display purposes only.

The following subsections briefly describe the annotation layers of
TamilTB.v0.1 with an example.
Morphological Layer
The purpose of the m-layer is to assign a part-of-speech (POS) tag, or a more
refined morphological tag, to each word in the sentence. This is
accomplished by setting
the 'tag'
attribute of the node (which corresponds to a word) to the POS or morphological
tag. The 'lemma' attribute stores the lemma of the wordform, that is, its conceptual root or the form listed in a dictionary.
Figure 2.1 illustrates m-layer
annotation.
As Figure 2.1 shows, three text
values are displayed at each node. The text at the top of the node
is the 'form',
the exact word as it appeared in the text. The text in the middle
(for example paNpAttu) is the 'lemma', and
the text at the bottom (for example NO--3SN--) is the
morphological tag of the wordform. Each
morphological tag is a string of length 9,
and each character position corresponds to
some feature of the wordform. The first two positions of the morphological
tag correspond to the main POS and the refined POS; together they
represent the fine-grained category of a wordform. It is thus possible to train a
POS tagger on either a fine-grained or a coarse-grained tagset. This
kind of tagging is known as positional tagging.
Positional tagging is suitable for morphologically rich languages and
has been applied successfully to languages such as Czech.
Section: Morphological Annotation gives a detailed
description of positional tagging and the tagset used to
annotate TamilTB.v0.1.
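To make the idea of a positional tag concrete, here is a minimal Python sketch that splits a 9-character tag into its main POS, refined POS and remaining positions. The function names and the dictionary keys are assumptions made for this example, and the meaning of positions 3 to 9 is deliberately left uninterpreted because it is defined in Section: Morphological Annotation, not here.

```python
TAG_LENGTH = 9  # every morphological tag has exactly nine positions


def parse_tag(tag):
    """Split a positional tag into main POS, refined POS and the rest."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    return {
        "pos": tag[0],        # position 1: main POS
        "subpos": tag[1],     # position 2: refined POS
        "features": tag[2:],  # positions 3-9: left uninterpreted here
    }


def coarse_tag(tag):
    """Coarse-grained tag: main POS only."""
    return parse_tag(tag)["pos"]


def fine_tag(tag):
    """Finer tag: main POS plus refined POS."""
    parsed = parse_tag(tag)
    return parsed["pos"] + parsed["subpos"]


print(coarse_tag("NO--3SN--"), fine_tag("NO--3SN--"))
```

Because every feature sits at a fixed position, the same annotated data can be projected onto a coarser tagset simply by truncating the tag, which is what makes it possible to train taggers at different granularities.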
Analytical Layer
The analytical layer (a-layer) is used to annotate the sentence at the syntactic level. There are two phases in a-layer annotation: (i)
capturing the dependency structure of the sentence in the form of a tree
and (ii) identifying the relationship between words (nodes) in the
tree. From the m-layer, each word already
corresponds to a node in the tree, but the nodes have no parents
assigned. The dependency structure is captured by hanging the
dependent nodes (words) under their
governing nodes (words).
Visually, dependent nodes hang as children of their
governing nodes. There is one extra node, called the
technical root, to which the
predicate node and the terminal node (end of the sentence) are
attached.
Figure 2.2 illustrates the a-layer annotation of the sentence shown in
Example 2.2.
Edges between nodes indicate the relationship by which they are
connected. In linguistic terms, this is the
syntactic relation between governor
and dependent. The relationship between two nodes is stored in the
attribute afun.
Instead of being stored on the edge, the afun is
treated as an attribute of the dependent node, so the
afun value of a node indicates the
relation by which it connects to its governing node. For example, the
afun value of the word
'cattangkaL' (laws) is
'Sb', meaning Subject: it is connected as the subject
to the verb
'iyaRRap' (to enact).
A more detailed treatment of the various syntactic relations and the
a-layer annotation scheme is given in
Section: Syntactic Annotation.
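The convention of storing afun on the dependent rather than on the edge can be sketched as a small Python data structure. This is an illustrative model only, not the TMT or TrEd representation: the class name ANode and the method hang_under are invented, and only the one relation stated in the text (cattangkaL as Sb of iyaRRap) is encoded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity; the tree is self-referential
class ANode:
    form: str
    afun: Optional[str] = None   # relation to the governing node
    is_member: bool = False      # member of a coordination
    parent: Optional["ANode"] = field(default=None, repr=False)
    children: List["ANode"] = field(default_factory=list, repr=False)

    def hang_under(self, governor, afun):
        """Attach this node as a dependent of `governor` with label `afun`."""
        self.parent = governor
        self.afun = afun
        governor.children.append(self)


verb = ANode("iyaRRap")       # 'to enact'; its own afun is not given in the text
subject = ANode("cattangkaL")  # 'laws'
subject.hang_under(verb, "Sb")

print(subject.form, subject.afun, "->", subject.parent.form)
```

Keeping the label on the dependent means every node except the root carries exactly one afun, so a tree with n nodes needs no separate edge table.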
The annotated data is available in three formats:
- TMT format - XML-based format used in the TectoMT system
- CoNLL format - tab-separated format in the CoNLL shared task style
- TnT style POS tagged format - tab-separated columns with word forms, POS tags, and lemmas.
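For readers who want to load the CoNLL files, a reader might look like the Python sketch below. It assumes the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines between sentences; whether the released files follow exactly this variant is an assumption to check against the data. The two sample rows are hand-made for illustration: only the Sb relation comes from the text, the head value 0 for iyaRRap is assumed, and all other fields are left as "_".

```python
# Column names of the CoNLL-X shared task format (assumed to apply here).
CONLL_COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel"]


def read_conll(lines):
    """Group tab-separated token lines into sentences (lists of dicts)."""
    sentences = []
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(CONLL_COLUMNS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# Hand-made rows, not taken from the released data.
sample = [
    "1\tcattangkaL\t_\t_\t_\t_\t2\tSb\t_\t_",
    "2\tiyaRRap\t_\t_\t_\t_\t0\t_\t_\t_",
    "",
]
for token in read_conll(sample)[0]:
    print(token["form"], token["head"], token["deprel"])
```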
The syntactic trees can be browsed conveniently in the TrEd
tree editor after installing the TMT file support extension
(Menu: Setup->Manage extensions... install TMT files support
extension).
Go to the download section to obtain the data.