PDT 1.0 characteristics - TREEBANK DATA

We look at the PDT 1.0 data from three points of view:

1. a complete set of PDT 1.0 files
2. layers of annotation
3. training/testing of automatic taggers and parsers

The files are organized into subdirectories that correspond to this division. For the counts, a token is a text occurrence of a word form or punctuation mark, and a sentence is a text unit that can be roughly defined as running "from the sentence-initial capital letter to the final punctuation (if any)"; the notion also includes verbless headings, various captions, titles and subtitles, table cells, etc. Please also note that a certain number of empty sentences (containing no tokens) have been left in the data for various technical reasons. They make up less than 10% of the total number of sentences, but please take them into account when computing corpus statistics related to the sentence count.
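Because empty sentences inflate the sentence count, any per-sentence statistic shifts once they are excluded. The following minimal Python sketch illustrates the effect on the average sentence length. The function name and the number of empty sentences are assumptions made for illustration only; the exact count of empty sentences is not given here, and the totals are those reported for the morphological layer in section 2.

```python
def mean_tokens_per_sentence(tokens: int, sentences: int,
                             empty_sentences: int = 0) -> float:
    """Average number of tokens per non-empty sentence."""
    if not 0 <= empty_sentences < sentences:
        raise ValueError("empty_sentences must be in [0, sentences)")
    return tokens / (sentences - empty_sentences)


# Totals for the morphological layer, as reported in section 2.
TOKENS = 1_725_242
SENTENCES = 111_175

# Naive average: empty sentences are counted in the denominator.
naive = mean_tokens_per_sentence(TOKENS, SENTENCES)

# HYPOTHETICAL number of empty sentences, chosen only for illustration.
# The real value has to be counted from the data itself.
ASSUMED_EMPTY = 5_000
corrected = mean_tokens_per_sentence(TOKENS, SENTENCES, ASSUMED_EMPTY)

print(f"naive: {naive:.2f} tokens/sentence")
print(f"corrected (assumed {ASSUMED_EMPTY} empty): {corrected:.2f} tokens/sentence")
```

The corrected value is always at least as large as the naive one, since removing empty sentences only shrinks the denominator.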



1. PDT 1.0 files (naming conventions)


File-id names follow a "name.extension" convention. Because the PDT 1.0 files are provided in two different formats (SGML, fs), the file-id name also uniquely identifies the internal format of each file.

DATA LOCATION TABLE - rows No. 1 and 2
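
As an illustration of the "name.extension" convention, the sketch below splits a file-id into its two parts and looks up the format by extension. The function names, the sample file-id and the extension strings in the mapping are assumptions, not taken from this document; the actual extensions should be read from the data location table.

```python
from pathlib import PurePosixPath
from typing import Dict, Tuple


def split_file_id(file_id: str) -> Tuple[str, str]:
    """Split a 'name.extension' file-id into (name, extension)."""
    path = PurePosixPath(file_id)
    if not path.suffix:
        raise ValueError(f"file-id has no extension: {file_id!r}")
    return path.stem, path.suffix[1:]  # drop the leading dot


def format_of(file_id: str, formats: Dict[str, str]) -> str:
    """Return the inner format of a file, given an extension-to-format map."""
    _, extension = split_file_id(file_id)
    return formats[extension]


# HYPOTHETICAL mapping: the extension strings are placeholders for the
# two formats (SGML, fs) and must be checked against the real data.
ASSUMED_FORMATS = {"csts": "SGML", "fs": "fs"}

# "sample.fs" is a made-up file-id used only to exercise the functions.
print(split_file_id("sample.fs"))
print(format_of("sample.fs", ASSUMED_FORMATS))
```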


2. Layers of Annotation

On each of the three layers of annotation, the following data volumes have been annotated:

                                        # of tokens  # of sentences
morphological (total)                     1,725,242         111,175
syntactic-analytic (total)                1,507,333          98,263
tectogrammatical (so far; sample only)        3,490             203
morphological and syntactic-analytic      1,255,590          81,614

In the overview table, the rows with the 'M' mask (second column) give detailed information on all files annotated on the morphological layer. Similarly, the rows with the 'A' mask (in the same table) give detailed information on all files annotated on the syntactic-analytic layer. Rows with the 'MA' mask give detailed information on files annotated on both the morphological and the syntactic-analytic layers.

Data annotated on the tectogrammatical layer should be understood as a PREVIEW; the complete tectogrammatical annotation of PDT 1.0 is the subject of the next five-year project (CKL).

DATA LOCATION TABLE - rows No. 15, 16, 17


3. Training/testing of automatic taggers/parsers


All annotated data are split into three parts (training data, development test data, and evaluation test data) to enable a fair comparison of machine-learning experiments and their results. Other testing methods, such as n-fold cross-validation, remain possible, but they require a different partitioning of the data. We provide the following partitioning simply as a basis for direct comparison among different systems.
  1. Morphologically annotated data

    1. For general use
                                  xy DATA SUBSETS                 # of tokens  # of sentences
      training data               ca,cb,cc,cd,ce,                   1,470,711          94,885
      data location (row No. 14)  l1,l2,la,lb,lc,ld,le,lf,li,lj,
                                  lk,ll,lm,ln,lo,lp,lq,lr,ls,lt,
                                  m1,m2,ma,md,me,mf,mg,mh,ml,
                                  n1,n2,n3,n4,n5,n6,n7,n8,n9,
                                  v1,va
      development test data       c1,lg,mb,vb                         129,574           8,244
      data location (row No. 12)
      evaluation test data        c2,lh,mc,vc                         124,957           8,046
      data location (row No. 13)
    2. For training a tagger, to be used as a preprocessor of training data for a(n analytical) parser
                                  xy DATA SUBSETS                 # of tokens  # of sentences
      training data               lp,                                 469,652          29,561
      data location (row No. 11)  me,mf,mg,mh,
                                  n1,n2,n3,n4,n5,n6,n7,n8,n9
      development test data       c1,lg,mb,vb                         129,574           8,244
      data location (row No. 9)
      evaluation test data        c2,lh,mc,vc                         124,957           8,046
      data location (row No. 10)
  2. Syntactically (analytically) annotated data
    1. For general use
                                  xy DATA SUBSETS                 # of tokens  # of sentences
      training data               c1,c2,ca,cb,cc,cd,ce,             1,255,590          81,614
      data location (row No. 8)   l1,l2,la,lb,lc,ld,le,
                                  lf,lg,lh,li,lj,lk,ll,
                                  lm,ln,lo,lq,lr,ls,lt,
                                  m1,m2,ma,mb,mc,md,ml,
                                  v1,va,vb,vc
      development test data       lu,lv,lw                            126,030           8,159
      data location (row No. 6)
      evaluation test data        mi,lx,ly,lz                         125,713           8,490
      data location (row No. 7)
    2. For training a(n analytical) parser, to be used as a preprocessor of training data for (tectogrammatical) parsers
                                  xy DATA SUBSETS                 # of tokens  # of sentences
      training data               cd,ce,                              403,830          25,358
      data location (row No. 5)   l1,l2,la,lc,le,lf,li,
                                  m1,m2,
                                  v1,va
      development test data       lu,lv,lw                            126,030           8,159
      data location (row No. 3)
      evaluation test data        mi,lx,ly,lz                         125,713           8,490
      data location (row No. 4)
  3. Tectogrammatically annotated data (for future reference; currently, only samples are provided in PDT 1.0.)
    1. For general use
                                  xy DATA SUBSETS                 # of tokens  # of sentences
      training data               ca,cb,cc,                           582,957          38,495
                                  lb,lj,lk,ll,lm,ln,lo,
                                  lq,lr,ls,lt,
                                  ma,md,ml
      development test data       lu,lv,lw                            126,030           8,159
      evaluation test data        mi,lx,ly,lz                         125,713           8,490
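
The "For general use" partitions of the morphological and the syntactic-analytic data can be checked against the layer totals given in section 2. The following Python sketch does this with the figures copied from this document; the variable and function names are illustrative assumptions, and the script is a sanity check on the published counts rather than a tool distributed with PDT 1.0.

```python
# (tokens, sentences) per part, copied from the "For general use" tables.
GENERAL_USE = {
    "morphological": {
        "training": (1_470_711, 94_885),
        "development test": (129_574, 8_244),
        "evaluation test": (124_957, 8_046),
    },
    "syntactic-analytic": {
        "training": (1_255_590, 81_614),
        "development test": (126_030, 8_159),
        "evaluation test": (125_713, 8_490),
    },
}

# (tokens, sentences) layer totals, copied from section 2.
LAYER_TOTALS = {
    "morphological": (1_725_242, 111_175),
    "syntactic-analytic": (1_507_333, 98_263),
}


def partition_total(parts):
    """Sum tokens and sentences over the three parts of one partition."""
    tokens = sum(t for t, _ in parts.values())
    sentences = sum(s for _, s in parts.values())
    return tokens, sentences


for layer, parts in GENERAL_USE.items():
    # The three parts should add up exactly to the layer total.
    assert partition_total(parts) == LAYER_TOTALS[layer], layer
    total_tokens, _ = LAYER_TOTALS[layer]
    for name, (tokens, _) in parts.items():
        share = 100 * tokens / total_tokens
        print(f"{layer:20s} {name:18s} {share:5.1f}% of tokens")
```

On both layers the three parts add up exactly to the layer total, so the general-use partition covers all data annotated on that layer.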