A richly annotated and genre-diversified language resource, the Prague Dependency Treebank – Consolidated 1.0 (hereafter PDT-C) is a consolidated release of the existing PDT corpora of Czech data, uniformly annotated using the standard PDT scheme (although not all of the annotation was done manually, as described in detail here).
PDT-corpora included in PDT-C:
The difference from the separately published original treebanks can be briefly described as follows:
Layers of annotations. The PDT-annotation scheme has a multi-layer architecture:
In addition to the three main annotation layers mentioned above, the PDT scenario also includes the raw text layer (w-layer), where the text is segmented into documents and paragraphs and individual tokens are assigned unique identifiers. The spoken data have an additional audio and speech recognition layer (z-layer). In the spoken data (as opposed to the written corpora), the w-layer is in fact also an “annotated” layer, namely the manually provided transcription of the audio signal.
In order not to lose any piece of the original information, tokens (nodes) at a lower layer are explicitly referenced from the corresponding closest (immediately higher) layer. These links allow for tracing every unit of annotation all the way down to the original text, or to the transcript and audio (in the spoken data).
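The linking principle described above can be illustrated with a minimal sketch. It assumes a simplified in-memory model in which each unit stores the identifiers of the units it references on the layer directly below; the `Unit` class, the `lower_refs` field, and the identifiers such as `t1` or `w1` are hypothetical and do not reflect the actual PDT-C file format or attribute names.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """One annotation unit (token or node) on a single layer (hypothetical model)."""
    uid: str
    layer: str  # "w", "m", "a" or "t"
    lower_refs: list = field(default_factory=list)  # ids on the layer directly below


def trace_to_text(uid, units):
    """Follow lower-layer references from `uid` down to the w-layer tokens."""
    unit = units[uid]
    if unit.layer == "w":
        return [uid]
    found = []
    for ref in unit.lower_refs:
        found.extend(trace_to_text(ref, units))
    return found


# Toy data: one t-node built on two a-nodes, each linked to one m-form and one w-token.
units = {u.uid: u for u in [
    Unit("w1", "w"), Unit("w2", "w"),
    Unit("m1", "m", ["w1"]), Unit("m2", "m", ["w2"]),
    Unit("a1", "a", ["m1"]), Unit("a2", "a", ["m2"]),
    Unit("t1", "t", ["a1", "a2"]),
]}

print(trace_to_text("t1", units))
```

Because every layer refers only to the layer immediately below it, tracing a deep syntactic node to the original text is a chain of simple lookups.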
Sarančata jsou doposud ve stadiu larev a pohybují se pouze lezením. V tomto období je účinné bojovat proti nim chemickými postřiky, ale dožívající družstva ani soukromí rolníci nemají na jejich nákup potřebné prostředky.
Example sentences from PDT-C 1.0, with tectogrammatical annotation including coreference links (blue and brown arrows), MWEs (red stripes) and discourse annotation (orange arrows and attributes/labels). Lit.: Grasshoppers are still in the larval stage, crawling only. At this time of the year, it is efficient to fight them using chemicals, but neither the ailing cooperatives nor private farmers can afford them.
In the current PDT-C 1.0 release, the lowest, morphological layer has been fully annotated manually; the basic phenomena of the highest, deep syntactic layer (structure, functions, verbal valency) have also been annotated manually in all four datasets. Manual annotation of the surface syntactic layer is available only for the PDT dataset of written texts. Additional semantic features in the PDT dataset have also been annotated manually. Table 1 gives an overview of the types of annotation at the three annotation layers in each dataset, together with the manner in which each annotation was carried out.
| Dataset / Type of annotation | PDT Written | PCEDT (Czech) Translated | PDTSC Spoken | PDT-Faust User-generated |
|---|---|---|---|---|
| Audio | non-applicable | non-applicable | provided | non-applicable |
| ASR Transcription | non-applicable | non-applicable | provided | non-applicable |
| Transcript | non-applicable | non-applicable | manually | non-applicable |
| Translation | non-applicable | manually | non-applicable | manually |
| Morphological layer | | | | |
| Speech reconstruction | non-applicable | non-applicable | manually | non-applicable |
| Lemmatization | manually | manually | manually | manually |
| Tagging | manually | manually | manually | manually |
| Surface syntactic layer | | | | |
| Dependency structure | manually | automatically | automatically | automatically |
| Syntactic function | manually | automatically | automatically | automatically |
| Clause segmentation | automatically | not annotated | not annotated | not annotated |
| Deep syntactic layer | | | | |
| Deep syntactic structure | manually | manually | manually | manually |
| Deep syntactic function | manually | manually | manually | manually |
| Verbal valency | manually | manually | manually | manually |
| Nominal valency | manually | not annotated | not annotated | not annotated |
| Grammatemes | manually | not annotated | not annotated | not annotated |
| Coreference grammatical | manually | manually | manually | not annotated |
| Coreference textual | manually | manually | manually | not annotated |
| Bridging relation | manually | not annotated | not annotated | not annotated |
| Topic-focus articulation | manually | not annotated | not annotated | not annotated |
| Discourse | manually | not annotated | not annotated | not annotated |
| Genre specification | manually | not annotated | not annotated | not annotated |
| Quotation | manually | not annotated | not annotated | not annotated |
| Multiword expressions | manually | not annotated | not annotated | not annotated |
Table 1: Overview of various types of annotation and their realization in the datasets
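As an illustration of how Table 1 can be queried, the following sketch encodes its rows as short code strings and lists the annotation types that are manual in all four datasets. The one-letter codes, the shortened dataset names, and the helper functions are assumptions made for this example only; they are not part of the PDT-C distribution.

```python
DATASETS = ("PDT", "PCEDT", "PDTSC", "PDT-Faust")

# One row per annotation type from Table 1, one code per dataset in DATASETS order.
# m = manually, a = automatically, p = provided, x = not annotated, n = non-applicable
TABLE_1 = {
    "Audio": "nnpn",
    "ASR Transcription": "nnpn",
    "Transcript": "nnmn",
    "Translation": "nmnm",
    "Speech reconstruction": "nnmn",
    "Lemmatization": "mmmm",
    "Tagging": "mmmm",
    "Dependency structure": "maaa",
    "Syntactic function": "maaa",
    "Clause segmentation": "axxx",
    "Deep syntactic structure": "mmmm",
    "Deep syntactic function": "mmmm",
    "Verbal valency": "mmmm",
    "Nominal valency": "mxxx",
    "Grammatemes": "mxxx",
    "Coreference grammatical": "mmmx",
    "Coreference textual": "mmmx",
    "Bridging relation": "mxxx",
    "Topic-focus articulation": "mxxx",
    "Discourse": "mxxx",
    "Genre specification": "mxxx",
    "Quotation": "mxxx",
    "Multiword expressions": "mxxx",
}


def manual_everywhere(table):
    """Annotation types done manually in every dataset."""
    return [name for name, codes in table.items() if set(codes) == {"m"}]


def manual_only_in(table, dataset):
    """Annotation types done manually in `dataset` and in no other dataset."""
    i = DATASETS.index(dataset)
    return [name for name, codes in table.items()
            if codes[i] == "m" and codes.count("m") == 1]


print(manual_everywhere(TABLE_1))
print(manual_only_in(TABLE_1, "PDT"))
```

The first query returns the morphological annotation together with the basic deep syntactic phenomena (structure, functions, verbal valency); the second returns the types that only the written PDT dataset carries manually.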
The data volume is given in Table 2. Altogether, the consolidated treebank contains 3,885,591 tokens with manual morphological annotation and 2,245,945 t-nodes with manual deep syntactic annotation (manual annotation of the surface syntactic layer is contained only in the dataset of written texts and it consists of 1,503,741 a-nodes).
| Annotation layer | PDT Written | PCEDT (Czech) Translated | PDTSC Spoken | PDT-Faust User-generated | Total |
|---|---|---|---|---|---|
| Morphological layer (number of m-forms) | 1,957,150 | 1,152,289 | 742,316 | 33,836 | 3,885,591 |
| Surface syntactic layer (number of a-nodes) | 1,503,741 | 1,152,289 | 742,316 | 33,837 | 3,432,183 |
| Deep syntactic layer (number of t-nodes) | 675,034 | 932,334 | 608,472 | 30,105 | 2,245,945 |
Table 2: Volume of the datasets (number of tokens on the respective layers)
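The totals in Table 2 can be re-derived from the per-dataset counts. The short sketch below copies the figures from the table and checks each row sum; the dictionary layout and the shortened dataset names are chosen only for this illustration.

```python
# Per-dataset counts copied from Table 2.
volumes = {
    "m-forms": {"PDT": 1_957_150, "PCEDT": 1_152_289, "PDTSC": 742_316, "PDT-Faust": 33_836},
    "a-nodes": {"PDT": 1_503_741, "PCEDT": 1_152_289, "PDTSC": 742_316, "PDT-Faust": 33_837},
    "t-nodes": {"PDT": 675_034, "PCEDT": 932_334, "PDTSC": 608_472, "PDT-Faust": 30_105},
}

# Totals as reported in the last column of Table 2.
reported_totals = {"m-forms": 3_885_591, "a-nodes": 3_432_183, "t-nodes": 2_245_945}

# Sum each row and compare it with the reported total.
computed_totals = {layer: sum(counts.values()) for layer, counts in volumes.items()}

for layer, total in computed_totals.items():
    status = "ok" if total == reported_totals[layer] else "mismatch"
    print(f"{layer}: {total:,} ({status})")

# Only the written PDT part of the surface syntactic layer is annotated manually.
manual_a_nodes = volumes["a-nodes"]["PDT"]
```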