Introduction
Texts
The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a major update of the Prague Czech-English Dependency Treebank 1.0 (LDC2004T25). It is a manually parsed Czech-English parallel corpus sized over 1.2 million running words in almost 50,000 sentences for each part.
Data
The English part contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 sentence-aligned. An additional automatic alignment on the node level (different for each annotation layer) is part of this release, too. The original Penn Treebank-like file structure (25 sections, each containing up to one hundred files) has been preserved. Only those PTB documents which have both POS and structural annotation (total of 2312 documents) have been translated to Czech and made part of this release.
Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:
- dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)
- semantic labeling of content words and types of coordinating structures
- argument structure, including an argument structure ("valency") lexicon for both languages
- ellipsis and anaphora resolution.
This annotation style is called tectogrammatical annotation and it constitutes the tectogrammatical layer in the corpus. For more details see below and documentation.
Annotation of the Czech part
Sentences of the Czech translation were automatically morphologically annotated and parsed into surface-syntax dependency trees in the PDT 2.0 annotation style. This annotation style is sometimes called analytical annotation; it constitutes the analytical layer of the corpus. The manual tectogrammatical (deep-syntax) annotation was built as a separate layer above the automatic analytical (surface-syntax) parse. A sample of 2,000 sentences was manually annotated on the analytical layer.
Annotation of the English part
The resulting manual tectogrammatical annotation was built above an automatic transformation of the original phrase-structure annotation of the Penn Treebank into surface dependency (analytical) representations, using the following additional linguistic information from other sources:
- PropBank (LDC2004T14)
- VerbNet
- NomBank (LDC2008T23)
- flat noun phrase structures (by courtesy of D. Vadas and J.R. Curran)
For each sentence, the original Penn Treebank phrase structure trees are preserved in this corpus together with their links to the analytical and tectogrammatical annotation.
Layers of Annotation
The PDT 2.0-style annotation contains multiple layers:
- w-layer
- m-layer
- a-layer
- t-layer
Figure1
The English part contains the original Penn Treebank annotation, which we call p-layer (phrase-structure layer). Figure1 shows the visualization of a phrase-structure tree in TrEd, the browser and editor of PCEDT 2.0.
w-layer
The bottom-most layer ("word" layer) is the tokenized plain text, where each token has been assigned a unique ID. This layer is fully integrated in the next upper layer, the m-layer.
m-layer
This is the morphological layer. The tokens with their IDs are automatically part-of-speech tagged and lemmatized. From this point on, we can regard the tokens as linearly ordered nodes with their respective IDs, POS-tags and lemmas. In the corpus, the m-layer is not visualized separately but rather as a part of the analytical layer (which brings the representation of syntactic dependencies along with their surface-syntactic labeling).
a-layer
Figure2
The a-layer (analytical layer) represents the surface syntax (a parse). The syntactic dependencies are provided with labels that carry the usual syntactic information; e.g. "subject", "attribute" or "predicate complement". For more details see a comprehensive specification of the Czech manual analytical annotation and a brief specification of the English analytical annotation. Figure2 presents the visualization of an analytical sentence representation in TrEd.
t-layer
The topmost - tectogrammatical - layer is a linguistic representation that combines syntax and, to a certain extent, semantics, in the form of semantic labeling, anaphora resolution and argument structure description based on a valency lexicon. This representation draws on the framework of the Functional Generative Description (Sgall, Hajičová, Panevová, 1986). The original tectogrammatical language representation in the theoretical works of the 1960s was developed mainly with rule-based text generation in mind. This annotated corpus follows the essential ideas of this formal language description, but, at the same time, it is designed to serve as training data in statistical machine learning and helps both text generation and text analysis. Compared to the complex Prague Dependency Treebank 2.0, the tectogrammatical annotation in PCEDT 2.0 is slightly simplified and it, e.g., does not yet contain the topic-focus (information structure) annotation for either language.
Figure3
For more details see a comprehensive specification of the Czech tectogrammatical annotation and a brief specification of the English tectogrammatical annotation. Figure3 shows a tectogrammatical sentence representation visualized in TrEd.
Figure4
Alignment
PCEDT 2.0 is an automatically word-aligned parallel corpus. The alignment is directed from the English part to the Czech part, for each layer separately. The English annotation layers contain the alignment information in the form of references from English nodes to their corresponding Czech nodes.
Figure4 and Figure5 present the alignment at the respective layers of annotation.
Figure5