Chapter 1. Introduction

The present manual describes how sentences are represented at the tectogrammatical level in the Prague Dependency Treebank (PDT). It is intended for PDT users, both those who are interested in the linguistic side of the representation and those who work on further processing of the data, e.g. using statistical or other methods for automatic syntactic analysis or synthesis.

The preceding (lower) levels of PDT are concerned with morphology (the morphological level) and surface syntax (the analytical level).

The tectogrammatical annotation is structural and dependency-based; it captures the so-called deep, semantic structure of the sentence. At the tectogrammatical level, each (well-formed) sentence has at least one representation that unambiguously characterizes the meaning of the sentence (or one of its meanings if the sentence is ambiguous). The tectogrammatical representation contains all the information encoded in the structure of the sentence and its lexical items, that is, all the information necessary for translating the tectogrammatical representation into the lower levels, as well as for its interpretation in the sense of intensional semantics.

The tectogrammatical representation of a sentence contains several kinds of information: apart from the actual deep structure of the sentence and the functions of its parts, it also contains various kinds of grammatemes, information on grammatical and textual coreference, and the topic-focus articulation of the sentence (including the deep word order, i.e. the information about communicative dynamism).
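The kinds of information listed above can be pictured as fields of a single node record. The following is only a minimal sketch under stated assumptions: the class TNode, its field names, the helper deep_word_order and the toy English sentence are all hypothetical illustrations, not the PDT data format (which is described in Chapter 11, Data format), and the functor, grammateme and topic-focus values are toy values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TNode:
    """Hypothetical tectogrammatical node; all field names are illustrative."""
    t_lemma: str                      # tectogrammatical lemma of the lexical unit
    functor: str                      # function of the node in the deep structure
    deep_order: int                   # position in the deep word order
    tfa: Optional[str] = None         # topic-focus articulation value (toy values)
    grammatemes: Dict[str, str] = field(default_factory=dict)
    coref_targets: List["TNode"] = field(default_factory=list)  # coreference links
    children: List["TNode"] = field(default_factory=list)       # dependent nodes

    def add_child(self, child: "TNode") -> "TNode":
        """Attach a dependent node and return it."""
        self.children.append(child)
        return child


def deep_word_order(root: TNode) -> List[str]:
    """Collect all nodes of the tree and list their lemmas by deep order."""
    nodes: List[TNode] = []
    stack = [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(node.children)
    return [n.t_lemma for n in sorted(nodes, key=lambda n: n.deep_order)]


# Toy sentence "The girl read a book." (assumed analysis, for illustration only)
root = TNode("read", "PRED", deep_order=2, tfa="f", grammatemes={"tense": "ant"})
root.add_child(TNode("girl", "ACT", deep_order=1, tfa="t", grammatemes={"number": "sg"}))
root.add_child(TNode("book", "PAT", deep_order=3, tfa="f", grammatemes={"number": "sg"}))

print(deep_word_order(root))  # ['girl', 'read', 'book']
```

The sketch keeps the dependency structure (children) separate from the deep word order (deep_order), which mirrors the fact that the two are distinct pieces of information in the representation.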

The tectogrammatical level builds to a large extent on the analytical level. Since the same data were analyzed at both levels, it was not necessary to start from scratch when representing the data at the tectogrammatical level. Because the analytical and tectogrammatical levels are based on the same conception of dependency, it was possible to take over essentially the whole analytical structure (at least as far as the autosemantic lexical items are concerned). Certain parts of the data were processed automatically before the actual manual annotation; other parts were processed automatically after the annotators had finished their work. Various procedures were designed and implemented for this purpose. Although these procedures are able to translate certain constructions into the tectogrammatical representation very precisely, they are not sufficient for determining the definitive representation. The authoritative definition of the tectogrammatical level is given in this manual; the output of the automatic procedures is further processed (modified) by the annotators. (The present manual does not describe these automatic procedures.)

The data in PDT 2.0 do not necessarily reflect the most recent version of the tectogrammatical annotation rules. Therefore, the purpose of this manual is twofold: first, it summarizes our up-to-date ideas about the rules for the annotation of Czech sentences at the tectogrammatical level (i.e. how Czech texts should be analyzed); second, it attempts to describe as precisely as possible the data as annotated in PDT 2.0. The discrepancy between the described annotation rules and the actual state of the annotation arises because it became clear only in the process of annotation whether the rules (as formulated at the beginning) were adequate or whether they needed to be made more precise or replaced by other rules. In the annotation process, certain problematic constructions (not described until then) also emerged, for which it was necessary to introduce new rules. New rules were established throughout the whole process of annotation, and new modifications of the rules were introduced even at the very end of the annotation. For reasons of time, it was not possible to run a subsequent check on whether the data correspond to the latest version of the rules in all areas. Only certain selected phenomena were checked (and corrected if necessary), mostly the important and frequent ones. In the manual, the reader is always informed about such a discrepancy between the rules and the actual state of the data.

The chapters of the manual are organized in a way that reflects the sentence representation at the tectogrammatical level. The basic principles of the sentence representation at the tectogrammatical level are described in Chapter 2, Basic principles of sentence representation at the tectogrammatical level; this chapter also introduces the most important notions used further in the manual. The next chapter (Chapter 3, Node types) classifies the tectogrammatical tree nodes into different types. The following two chapters (Chapter 4, Tectogrammatical lemma (t-lemma) and Chapter 5, Complex nodes and grammatemes) are devoted to the description of the attributes further specifying individual lexical units (represented by nodes). This is followed by the description of the sentence structure, with special emphasis on the dependency relations between lexical units (Chapter 6, Sentence representation structure). A separate chapter is devoted to functors and subfunctors (Chapter 7, Functors and subfunctors). The annotation of some special kinds of syntactic structures is described in Chapter 8, Specific syntactic constructions. Coreference (Chapter 9, Coreference) and topic-focus articulation (Chapter 10, Topic-focus articulation) are each dealt with in a separate chapter as well. The last chapter (Chapter 11, Data format) contains the information about the format of the annotated data that is relevant to the manual annotation.