The files are organized into subdirectories that correspond to the aforementioned division. For the counts, a token means a text occurrence of a word form or punctuation mark, and a sentence is a text unit that can be roughly defined as "from the sentence-initial capital letter to the final punctuation (if any)"; the notion also includes verbless headings, various captions, titles and subtitles, table cells, etc.
Please note also that there is a certain number of empty sentences (with no tokens inside them), which have been left in the data for various technical reasons. They make up less than 10% of the total number of sentences, but please take them into consideration when computing corpus statistics related to the sentence count.
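The effect of the empty sentences on such statistics can be illustrated with a minimal sketch. It assumes the sentences have already been read into Python lists of tokens; the reading step and the name `count_sentences` are hypothetical and not part of the distribution.

```python
from typing import Iterable, Sequence, Tuple


def count_sentences(sentences: Iterable[Sequence[str]]) -> Tuple[int, int, int]:
    """Return (all sentences, non-empty sentences, tokens)."""
    total = non_empty = tokens = 0
    for sentence in sentences:
        total += 1
        if len(sentence) > 0:
            # Only sentences with at least one token contribute tokens.
            non_empty += 1
            tokens += len(sentence)
    return total, non_empty, tokens


# Toy data: three sentences, one of them empty.
sample = [["Ahoj", "."], [], ["Titulek"]]
total, non_empty, tokens = count_sentences(sample)
print(total, non_empty, tokens)  # 3 2 3
print(tokens / non_empty)        # 1.5 (average length over non-empty sentences)
```

Dividing by `non_empty` rather than `total` keeps the empty sentences from lowering the average sentence length.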
The "SGML" name is a 4-character string. The first two characters xy (letters/digits) identify the source from which the texts originally come; we then speak of the xy data subset: c[12abcde] (Ceskomoravsky Profit), l[12a-z], n[1-9] (Lidove noviny newspapers), m[12abcdefghil] (Mlada Fronta Dnes newspapers) and v[1abc] (Vesmir). The next two characters (digits) give the unique number of the file within the complete xy data subset of annotated files.
The "SGML" extensions are determined by the groups (according to point 2) to which the files belong:
.suffix | Description |
---|---|
a | files annotated only on the syntactic-analytic layer; the morphological information (lemmas and tags) included in these files was generated by the enclosed Czech taggers: the elements (see CSTS documentation) <MDl src="a">, <MDt src="a"> were generated by the feature-based tagger and the elements <MDl src="b">, <MDt src="b"> by the hidden Markov model tagger |
am | files annotated on both the morphological and syntactic-analytic layers; together with the manually assigned morphological information, we included the morphological information (lemmas and tags) generated by the Czech taggers, as in the case of the *.a files (see above) |
amt | files annotated on the morphological layer, syntactic-analytic layer and tectogrammatical layer |
m | files annotated only on the morphological layer |
For example, lb25.am is the 25th file of the annotated lb subset; it contains articles from the Lidove noviny newspapers and is annotated on the morphological and syntactic-analytic layers in the SGML format.
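The naming scheme can be decoded mechanically. The following is a minimal sketch, not a tool shipped with the data: the names `SgmlName` and `parse_sgml_name` are hypothetical, and the pattern is deliberately more permissive than the subset lists above (it accepts any lowercase letter or digit as the second character).

```python
import re
from typing import NamedTuple

# Extension -> annotation layers, following the suffix table above.
LAYERS = {
    "a": ("syntactic-analytic",),
    "am": ("morphological", "syntactic-analytic"),
    "amt": ("morphological", "syntactic-analytic", "tectogrammatical"),
    "m": ("morphological",),
}

# Two-character subset, two-digit file number, dot, extension.
# Assumption: the first character is one of the source letters c, l, n, m, v.
_NAME_RE = re.compile(r"([clnmv][0-9a-z])(\d{2})\.(amt|am|a|m)")


class SgmlName(NamedTuple):
    subset: str      # the xy data subset, e.g. "lb"
    number: int      # file number within the subset
    extension: str   # one of the keys of LAYERS


def parse_sgml_name(name: str) -> SgmlName:
    """Split an "SGML" file name into subset, file number and extension."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an SGML file name: {name!r}")
    subset, number, extension = match.groups()
    return SgmlName(subset, int(number), extension)


parsed = parse_sgml_name("lb25.am")
print(parsed)                    # SgmlName(subset='lb', number=25, extension='am')
print(LAYERS[parsed.extension])  # ('morphological', 'syntactic-analytic')
```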
The first four characters of "fs" names have the same interpretation as the characters in the "SGML" names. The next 1, 2 or 3 characters specify the layers of annotation (a, am, amt, m - like "SGML" extensions). The "fs" extension is simply the string fs. The "fs" name lb25am.fs is the "fs" counterpart of the "SGML" name lb25.am.
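Since the two naming schemes carry the same information, one name can be derived from the other. A minimal sketch follows; the function names `sgml_to_fs` and `fs_to_sgml` are hypothetical helpers, and the patterns assume lowercase names exactly as described above.

```python
import re

# Four-character stem (subset + two-digit number), then the layer string.
_SGML_RE = re.compile(r"([a-z][0-9a-z]\d{2})\.(amt|am|a|m)")
_FS_RE = re.compile(r"([a-z][0-9a-z]\d{2})(amt|am|a|m)\.fs")


def sgml_to_fs(name: str) -> str:
    """Map an "SGML" name such as lb25.am to its "fs" counterpart."""
    match = _SGML_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an SGML file name: {name!r}")
    stem, layers = match.groups()
    return f"{stem}{layers}.fs"


def fs_to_sgml(name: str) -> str:
    """Map an "fs" name such as lb25am.fs back to the "SGML" name."""
    match = _FS_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an fs file name: {name!r}")
    stem, layers = match.groups()
    return f"{stem}.{layers}"


print(sgml_to_fs("lb25.am"))    # lb25am.fs
print(fs_to_sgml("lb25am.fs"))  # lb25.am
```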
The files annotated only on the morphological layer are also provided in the "fs" format for completeness, although their "structure" is only trivial (a list of nodes).
DATA LOCATION TABLE - rows No. 1 and 2
On each of the three layers of annotation, the following data volumes have been annotated:
Layer | # of tokens | # of sentences |
---|---|---|
morphological (total) | 1,725,242 | 111,175 |
syntactic-analytic (total) | 1,507,333 | 98,263 |
tectogrammatical (so far; sample only) | 3,490 | 203 |
morphological and syntactic-analytic | 1,255,590 | 81,614 |
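The figures above also determine how much data carries only one of the two main layers. The sketch below works under the assumption that the "morphological and syntactic-analytic" row is the overlap of the two totals; the variable names are illustrative.

```python
# (tokens, sentences) copied from the table above.
morphological = (1_725_242, 111_175)
analytic = (1_507_333, 98_263)
both = (1_255_590, 81_614)


def minus(a, b):
    """Component-wise difference of two (tokens, sentences) pairs."""
    return tuple(x - y for x, y in zip(a, b))


# Data annotated on exactly one of the two layers.
morph_only = minus(morphological, both)
analytic_only = minus(analytic, both)
print(morph_only)     # (469652, 29561)
print(analytic_only)  # (251743, 16649)

# Data annotated on at least one of the two layers (inclusion-exclusion).
either = tuple(m + a - b for m, a, b in zip(morphological, analytic, both))
print(either)         # (1976985, 127824)
```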
In the overview table, the rows with the 'M' mask (the second column) provide detailed information on all files annotated on the morphological layer. Similarly, the rows with the 'A' mask (in the same table) provide detailed information on all files annotated on the syntactic-analytic layer. Rows with the 'MA' mask provide detailed information on files annotated on both the morphological and syntactic-analytic layers.
Data annotated on the tectogrammatical layer should be understood as a PREVIEW; the complete annotation of PDT 1.0 on the tectogrammatical layer is the subject of the next five-year project (CKL).
DATA LOCATION TABLE - rows No. 15, 16, 17
Morphologically annotated data
Data set | xy DATA SUBSETS | # of tokens | # of sentences |
---|---|---|---|
training data; data location (row No. 14) | ca, cb, cc, cd, ce, l1, l2, la, lb, lc, ld, le, lf, li, lj, lk, ll, lm, ln, lo, lp, lq, lr, ls, lt, m1, m2, ma, md, me, mf, mg, mh, ml, n1, n2, n3, n4, n5, n6, n7, n8, n9, v1, va | 1,470,711 | 94,885 |
development test data; data location (row No. 12) | c1, lg, mb, vb | 129,574 | 8,244 |
evaluation test data; data location (row No. 13) | c2, lh, mc, vc | 124,957 | 8,046 |
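Because the split is defined per xy data subset, the part of the morphologically annotated data that a file belongs to follows from the first two characters of its name. A minimal lookup sketch follows, with the subset lists copied from the table above; `MORPH_SPLITS` and `split_of` are hypothetical names, and `mc01.m` is a made-up file name used only for illustration.

```python
# Subsets per split for the morphologically annotated data (table above).
MORPH_SPLITS = {
    "training": (
        "ca cb cc cd ce "
        "l1 l2 la lb lc ld le lf li lj lk ll lm ln lo lp lq lr ls lt "
        "m1 m2 ma md me mf mg mh ml "
        "n1 n2 n3 n4 n5 n6 n7 n8 n9 "
        "v1 va"
    ).split(),
    "development test": ["c1", "lg", "mb", "vb"],
    "evaluation test": ["c2", "lh", "mc", "vc"],
}

# Inverted index: subset -> split.
SUBSET_TO_SPLIT = {
    subset: split
    for split, subsets in MORPH_SPLITS.items()
    for subset in subsets
}


def split_of(file_name: str) -> str:
    """Return the split of a file, based on its two-character subset."""
    subset = file_name[:2]
    if subset not in SUBSET_TO_SPLIT:
        raise ValueError(f"subset {subset!r} is not in the morphological data")
    return SUBSET_TO_SPLIT[subset]


print(split_of("lb25.am"))  # training
print(split_of("mc01.m"))   # evaluation test
```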

Data set | xy DATA SUBSETS | # of tokens | # of sentences |
---|---|---|---|
training data; data location (row No. 11) | lp, me, mf, mg, mh, n1, n2, n3, n4, n5, n6, n7, n8, n9 | 469,652 | 29,561 |
development test data; data location (row No. 9) | c1, lg, mb, vb | 129,574 | 8,244 |
evaluation test data; data location (row No. 10) | c2, lh, mc, vc | 124,957 | 8,046 |

Data set | xy DATA SUBSETS | # of tokens | # of sentences |
---|---|---|---|
training data; data location (row No. 8) | c1, c2, ca, cb, cc, cd, ce, l1, l2, la, lb, lc, ld, le, lf, lg, lh, li, lj, lk, ll, lm, ln, lo, lq, lr, ls, lt, m1, m2, ma, mb, mc, md, ml, v1, va, vb, vc | 1,255,590 | 81,614 |
development test data; data location (row No. 6) | lu, lv, lw | 126,030 | 8,159 |
evaluation test data; data location (row No. 7) | mi, lx, ly, lz | 125,713 | 8,490 |
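The three rows above add up to the syntactic-analytic totals given in the data volume overview (1,507,333 tokens and 98,263 sentences). A small consistency check, with the figures copied from the tables and illustrative variable names:

```python
# (tokens, sentences) per row of the table above.
rows = {
    "training": (1_255_590, 81_614),
    "development test": (126_030, 8_159),
    "evaluation test": (125_713, 8_490),
}

total_tokens = sum(tokens for tokens, _ in rows.values())
total_sentences = sum(sentences for _, sentences in rows.values())
print(total_tokens, total_sentences)  # 1507333 98263

# Share of tokens in each part of the split.
for name, (tokens, _) in rows.items():
    print(f"{name}: {tokens / total_tokens:.1%} of tokens")
```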

Data set | xy DATA SUBSETS | # of tokens | # of sentences |
---|---|---|---|
training data; data location (row No. 5) | cd, ce, l1, l2, la, lc, le, lf, li, m1, m2, v1, va | 403,830 | 25,358 |
development test data; data location (row No. 3) | lu, lv, lw | 126,030 | 8,159 |
evaluation test data; data location (row No. 4) | mi, lx, ly, lz | 125,713 | 8,490 |

Data set | xy DATA SUBSETS | # of tokens | # of sentences |
---|---|---|---|
training data | ca, cb, cc, lb, lj, lk, ll, lm, ln, lo, lq, lr, ls, lt, ma, md, ml | 582,957 | 38,495 |
development test data | lu, lv, lw | 126,030 | 8,159 |
evaluation test data | mi, lx, ly, lz | 125,713 | 8,490 |