Characteristics of PDT 1.0

PDT 1.0 characteristics - TREEBANK DATA

We look at the PDT 1.0 data from three points of view:

1. a complete set of PDT 1.0 files
2. layers of annotation
3. training/testing of automatic taggers and parsers

The files are organized into subdirectories that correspond to this division. For the counts, a token means a text occurrence of a word form or punctuation mark, and a sentence is a text unit that can be roughly defined as "from the sentence-initial capital letter to the final punctuation (if any)"; the notion also includes verbless headings, various captions, titles and subtitles, table cells etc. Please note that there is a certain number of empty sentences (with no tokens inside them), which have been left in place for various technical reasons. They make up less than 10% of the total number of sentences, but please take them into consideration when computing corpus statistics related to the sentence count.
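As an illustration of why the empty sentences matter, the following minimal sketch computes the mean sentence length both with and without them. It assumes the sentences have already been loaded as lists of tokens (an empty list standing for an empty sentence); the function name and the input representation are hypothetical and not part of the PDT 1.0 tools.

```python
def sentence_stats(sentences):
    """Summarize a corpus given as a list of token lists.

    An empty list stands for an empty sentence (no tokens inside it).
    """
    total = len(sentences)
    non_empty = [s for s in sentences if s]
    tokens = sum(len(s) for s in non_empty)
    return {
        "sentences": total,
        "empty_sentences": total - len(non_empty),
        "tokens": tokens,
        # Dividing by all sentences underestimates the typical length
        # whenever empty sentences are present.
        "mean_length_all": tokens / total if total else 0.0,
        "mean_length_non_empty": tokens / len(non_empty) if non_empty else 0.0,
    }


if __name__ == "__main__":
    toy = [["Praha", "."], [], ["Titulek"]]  # hypothetical toy data
    print(sentence_stats(toy))
```

Reporting both figures side by side makes it clear which sentence count a given statistic is based on.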

1. PDT 1.0 files (naming conventions)

The file-id names follow a "name.extension" convention. As the PDT 1.0 files are provided in two different formats (SGML, fs), the file-id name also uniquely specifies the inner format of an individual file.

The SGML-format filenames

The "SGML" name is a 4-character string. Its first two characters xy (letters/digits) identify the source from which the texts originally come; we refer to this as the xy data subset: c[12abcde] (Ceskomoravsky Profit), l[12a-z] and n[1-9] (Lidove noviny newspapers), m[12abcdefghil] (Mlada Fronta Dnes newspapers) and v[1abc] (Vesmir). The next two characters (digits) give the unique number of the file within the complete xy data subset of annotated files.

The "SGML" extensions are determined by the groups (according to point 2, layers of annotation) to which the files belong:

.suffix	Description
a	files annotated only on the syntactic-analytic layer; the morphological information (lemmas and tags) included in these files was generated by the enclosed Czech taggers, namely the elements (see CSTS documentation) <MDl src="a">, <MDt src="a"> generated by the feature-based tagger and the elements <MDl src="b">, <MDt src="b"> generated by the hidden Markov model tagger
am	files annotated on both the morphological and syntactic-analytic layers; together with the manually assigned morphological information, we included the morphological information (lemmas and tags) generated by the Czech taggers, as in the case of the *.a files (see above)
amt	files annotated on the morphological, syntactic-analytic and tectogrammatical layers
m	files annotated only on the morphological layer

For example, lb25.am is file number 25 of the annotated lb subset; it contains articles from the Lidove noviny newspapers, annotated on the morphological and syntactic-analytic layers, in the SGML format.
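The naming convention can be checked mechanically. The following is a minimal sketch, assuming the subset patterns and suffixes listed above are exhaustive; the function name, the returned dictionary layout and the reading that both the l and n subsets come from Lidove noviny are our assumptions, not part of the PDT 1.0 distribution.

```python
import re

# Subset patterns and suffixes copied from the description above.
SGML_NAME_RE = re.compile(
    r"(c[12abcde]|l[12a-z]|n[1-9]|m[12abcdefghil]|v[1abc])"  # xy data subset
    r"(\d{2})"                                               # file number
    r"\.(amt|am|a|m)"                                        # annotation layers
)

# Assumption: both the l and n subsets are Lidove noviny texts.
SOURCES = {
    "c": "Ceskomoravsky Profit",
    "l": "Lidove noviny",
    "n": "Lidove noviny",
    "m": "Mlada Fronta Dnes",
    "v": "Vesmir",
}

LAYERS = {
    "a": ("syntactic-analytic",),
    "am": ("morphological", "syntactic-analytic"),
    "amt": ("morphological", "syntactic-analytic", "tectogrammatical"),
    "m": ("morphological",),
}


def parse_sgml_name(name):
    """Split an SGML file-id name into subset, number, source and layers."""
    match = SGML_NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a PDT 1.0 SGML file name: {name!r}")
    subset, number, suffix = match.groups()
    return {
        "subset": subset,
        "number": int(number),
        "source": SOURCES[subset[0]],
        "layers": LAYERS[suffix],
    }


if __name__ == "__main__":
    print(parse_sgml_name("lb25.am"))
```

For lb25.am the sketch reports the lb subset, file number 25, Lidove noviny as the source, and the morphological and syntactic-analytic layers, in line with the example above.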

The "fs"-format filenames
The first four characters of "fs" names have the same interpretation as the characters in the "SGML" names. The next 1, 2 or 3 characters specify the layers of annotation (a, am, amt, m - like "SGML" extensions). The "fs" extension is simply the string fs. The "fs" name lb25am.fs is the "fs" counterpart of the "SGML" name lb25.am.
For completeness, the files annotated only on the morphological layer are also provided in the "fs" format, although their "structure" is trivial (a list of nodes).
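The correspondence between the two naming schemes can be sketched as a pair of small conversion functions. This is only an illustration under the conventions described above; the function names are hypothetical, and the first two characters are matched loosely as letters/digits rather than against the list of valid subsets.

```python
import re

# Four-character stem (two letters/digits plus two digits) and layer code.
FS_NAME_RE = re.compile(r"([a-z0-9]{2}\d{2})(amt|am|a|m)\.fs")
SGML_NAME_RE = re.compile(r"([a-z0-9]{2}\d{2})\.(amt|am|a|m)")


def fs_to_sgml(name):
    """Return the SGML counterpart of an "fs" file name."""
    match = FS_NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an fs file name: {name!r}")
    stem, layers = match.groups()
    return f"{stem}.{layers}"


def sgml_to_fs(name):
    """Return the "fs" counterpart of an SGML file name."""
    match = SGML_NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an SGML file name: {name!r}")
    stem, layers = match.groups()
    return f"{stem}{layers}.fs"


if __name__ == "__main__":
    print(fs_to_sgml("lb25am.fs"))
    print(sgml_to_fs("lb25.am"))
```

Because the stem always has exactly four characters, the layer code in an "fs" name can be separated from it without ambiguity.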

DATA LOCATION TABLE - rows No. 1 and 2

2. Layers of Annotation

On each of the three layers of annotation, the following data volumes have been annotated:

	# of tokens	# of sentences
morphological (total)	1,725,242	111,175
syntactic-analytic (total)	1,507,333	98,263
tectogrammatical (so far; sample only)	3,490	203
morphological and syntactic-analytic	1,255,590	81,614

In the overview table, the rows with the 'M' mask (the second column) provide detailed information on all files annotated on the morphological layer. Similarly, the rows with the 'A' mask provide detailed information on all files annotated on the syntactic-analytic layer, and the rows with the 'MA' mask provide detailed information on files annotated on both the morphological and syntactic-analytic layers.
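The volumes annotated on one layer only can be derived from the figures above by simple subtraction. The sketch below assumes that the "morphological and syntactic-analytic" figures are exactly the overlap of the two totals; the variable and function names are ours.

```python
# (tokens, sentences) as given in the table above.
MORPHOLOGICAL = (1_725_242, 111_175)
ANALYTICAL = (1_507_333, 98_263)
BOTH = (1_255_590, 81_614)


def minus(a, b):
    """Component-wise difference of two (tokens, sentences) pairs."""
    return tuple(x - y for x, y in zip(a, b))


# Data annotated on exactly one of the two layers.
morphological_only = minus(MORPHOLOGICAL, BOTH)
analytical_only = minus(ANALYTICAL, BOTH)

# Data annotated on at least one layer (inclusion-exclusion).
either = tuple(m + a - b for m, a, b in zip(MORPHOLOGICAL, ANALYTICAL, BOTH))

if __name__ == "__main__":
    print("morphological only:", morphological_only)
    print("analytical only:   ", analytical_only)
    print("either layer:      ", either)
```

Under that assumption, the morphological-only volume comes out at 469,652 tokens in 29,561 sentences, which coincides with the size of the tagger training set listed in section 3.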

Data annotated on the tectogrammatical layer should be understood as a PREVIEW; the complete annotation of PDT 1.0 on the tectogrammatical layer is the subject of the next five-year project (CKL).

DATA LOCATION TABLE - rows No. 15, 16, 17

3. Training/testing of automatic taggers/parsers

All annotated data are split into three parts (training data, development test data and evaluation test data) to enable a fair comparison of machine-learning experiments and their results. Other testing methods, such as n-fold cross-validation, remain possible, but they require a different partitioning of the data. We simply provide the following partitioning to allow direct comparison among different systems.

Morphologically annotated data

For general use

	xy DATA SUBSETS	# of tokens	# of sentences
training data (data location: row No. 14)	ca, cb, cc, cd, ce, l1, l2, la, lb, lc, ld, le, lf, li, lj, lk, ll, lm, ln, lo, lp, lq, lr, ls, lt, m1, m2, ma, md, me, mf, mg, mh, ml, n1, n2, n3, n4, n5, n6, n7, n8, n9, v1, va	1,470,711	94,885
development test data (data location: row No. 12)	c1, lg, mb, vb	129,574	8,244
evaluation test data (data location: row No. 13)	c2, lh, mc, vc	124,957	8,046
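For experiments that read the files directly rather than through the prepared data locations, the general-use morphological partitioning above can be expressed as a lookup from the xy data subset to its part. This is a minimal sketch; the function name and the part labels are hypothetical.

```python
# xy data subsets copied from the table above (general-use morphological split).
MORPH_PARTS = {
    "training": (
        "ca cb cc cd ce l1 l2 la lb lc ld le lf li lj lk ll lm ln lo "
        "lp lq lr ls lt m1 m2 ma md me mf mg mh ml "
        "n1 n2 n3 n4 n5 n6 n7 n8 n9 v1 va"
    ).split(),
    "development": "c1 lg mb vb".split(),
    "evaluation": "c2 lh mc vc".split(),
}

# Invert the table: subset -> part.
SUBSET_TO_PART = {
    subset: part for part, subsets in MORPH_PARTS.items() for subset in subsets
}


def morph_part(file_name):
    """Return the part a file belongs to, based on its first two characters."""
    subset = file_name[:2]
    if subset not in SUBSET_TO_PART:
        raise ValueError(f"subset {subset!r} is not in the morphological split")
    return SUBSET_TO_PART[subset]


if __name__ == "__main__":
    print(morph_part("lb25.am"))
```

Keying the lookup on the subset rather than on individual files keeps all files of one subset in the same part, as the table prescribes.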

For training a tagger, to be used as a preprocessor of training data for a(n analytical) parser

	xy DATA SUBSETS	# of tokens	# of sentences
training data (data location: row No. 11)	lp, me, mf, mg, mh, n1, n2, n3, n4, n5, n6, n7, n8, n9	469,652	29,561
development test data (data location: row No. 9)	c1, lg, mb, vb	129,574	8,244
evaluation test data (data location: row No. 10)	c2, lh, mc, vc	124,957	8,046

Syntactically (analytically) annotated data

For general use

	xy DATA SUBSETS	# of tokens	# of sentences
training data (data location: row No. 8)	c1, c2, ca, cb, cc, cd, ce, l1, l2, la, lb, lc, ld, le, lf, lg, lh, li, lj, lk, ll, lm, ln, lo, lq, lr, ls, lt, m1, m2, ma, mb, mc, md, ml, v1, va, vb, vc	1,255,590	81,614
development test data (data location: row No. 6)	lu, lv, lw	126,030	8,159
evaluation test data (data location: row No. 7)	mi, lx, ly, lz	125,713	8,490
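As a consistency check, the three general-use analytical parts should add up to the syntactic-analytic totals given in section 2. The sketch below verifies this and reports each part's share of the tokens; the function name and the dictionary layout are our own.

```python
# (tokens, sentences) per part, copied from the table above.
ANALYTICAL_PARTS = {
    "training": (1_255_590, 81_614),
    "development test": (126_030, 8_159),
    "evaluation test": (125_713, 8_490),
}

# Syntactic-analytic totals from section 2.
ANALYTICAL_TOTAL = (1_507_333, 98_263)


def token_shares(parts, total):
    """Check that the parts sum to the total and return token shares."""
    tokens = sum(t for t, _ in parts.values())
    sentences = sum(s for _, s in parts.values())
    if (tokens, sentences) != total:
        raise ValueError(f"parts sum to {(tokens, sentences)}, expected {total}")
    return {name: t / tokens for name, (t, _) in parts.items()}


if __name__ == "__main__":
    for name, share in token_shares(ANALYTICAL_PARTS, ANALYTICAL_TOTAL).items():
        print(f"{name}: {share:.1%} of tokens")
```

The same check can be applied to the general-use morphological split by substituting its figures and the morphological totals.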

For training a(n analytical) parser, to be used as a preprocessor of training data for (tectogrammatical) parsers

	xy DATA SUBSETS	# of tokens	# of sentences
training data (data location: row No. 5)	cd, ce, l1, l2, la, lc, le, lf, li, m1, m2, v1, va	403,830	25,358
development test data (data location: row No. 3)	lu, lv, lw	126,030	8,159
evaluation test data (data location: row No. 4)	mi, lx, ly, lz	125,713	8,490

Tectogrammatically annotated data (for future reference; currently, only samples are provided in PDT 1.0.)

For general use

	xy DATA SUBSETS	# of tokens	# of sentences
training data	ca, cb, cc, lb, lj, lk, ll, lm, ln, lo, lq, lr, ls, lt, ma, md, ml	582,957	38,495
development test data	lu, lv, lw	126,030	8,159
evaluation test data	mi, lx, ly, lz	125,713	8,490
