Natural language is an extraordinarily complex system; therefore, it is useful to decompose its description into several layers. The highest level in the framework of the Functional Generative Description (FGD), which serves as the theoretical basis for PDT, is called the tectogrammatical level and is supposed to represent the semantic structure of the sentence. The tectogrammatical level in PDT is based on the ideas developed in FGD; in a number of details, though, it is modified or supplemented.
The tectogrammatical level in PDT is governed by the following principles:
the basic unit of annotation at the tectogrammatical level is a sentence as a basic means of conveying meaning.
for every well-formed (Czech) sentence, it is possible to provide its tectogrammatical representation: a tectogrammatical tree structure (henceforth: tectogrammatical tree).
in case of ambiguity, it is in theory possible to assign more than one tectogrammatical tree to a single sentence. However, in PDT only one tree is assigned to each sentence, namely the tree that corresponds to the given reading of the sentence.
in case of synonymy, on the other hand, different sentences can be assigned a single tectogrammatical tree (it has to be a case of strict synonymy, though, i.e. the truth conditions have to be absolutely identical). Examples of synonymous expressions with an identical tectogrammatical representation are expressions like otcův klobouk (=Father's hat) and klobouk otce (=lit. hat Father.GEN). Synonymy is in fact very rare in PDT (less frequent than originally thought in FGD).
Tectogrammatical trees have these basic properties:
tectogrammatical trees are data structures the basis of which is formed by a rooted tree (in the sense of the theory of graphs): it consists of a set of nodes and a set of edges and one of the nodes is marked as the root of the tree.
tectogrammatical tree nodes either represent expressions present at the surface level or they are "artificial", newly established nodes that have no counterparts in the surface structure. Function words (like subordinating conjunctions and auxiliary verbs) are not assigned separate nodes in the trees (see Section 1, "Relation between the tectogrammatical level and the lower levels").
Each node is itself a complex unit with a certain inner structure. It can be conceived of as a set of attributes, more precisely as a set of ordered attribute-value pairs. Whether a given attribute is or is not present in a given node follows from its nodetype (see Chapter 3, Node types).
Fig. 2.1: Examples of nodes representing expressions present at the surface structure of the sentence are: starý (=old), sultán (=sultan), nový (=new), sultán (=sultan), vystřídali se (=changed places). The prepositional phrase na trůnu (=on the throne) is represented by a single node (the preposition na is not assigned a separate node). In order to represent the coordination starý sultán a nový sultán (=the old and new sultan), the conjunction a (=and) is assigned a separate node. An example of a newly established node is the node representing the Patient (functor=PAT) of the verb vystřídat se (=exchange, replace).
Node attributes can be divided into several groups. The basic attributes of a tectogrammatical tree node are the tectogrammatical lemma, grammatemes and the functor. The tectogrammatical lemma expresses the lexical meaning of the node (see Chapter 4, Tectogrammatical lemma (t-lemma)). The grammatemes correspond to (the meanings of) certain lexical and morphological categories (see Chapter 5, Complex nodes and grammatemes). The functors capture the kind of syntactic dependency between autosemantic expressions, i.e. they correspond to syntactic functions (see Chapter 7, Functors and subfunctors). There are also attributes providing information on coreference (see Chapter 9, Coreference) and on the topic-focus articulation and deep word order of the sentence (see Chapter 10, Topic-focus articulation). The remaining attributes concern special properties of the structure and certain syntactic and semantic properties that cannot be captured in any other way.
The attribute values are of different types (see Section 2, "A node and types of attribute values"). Attribute values are mostly sequences of symbols; the set of sequences admissible for a given attribute is usually fixed. Attributes of the type reference are a special case. These attributes are used for representing relations (most often coreference relations) that go "across" the tree or even cross tree boundaries.
Fig. 2.1: In the example tree, there is one attribute of the type reference, representing reciprocity (i.e. a grammatical coreference relation) between the Patient and Actor of the predicate vystřídat se. It is depicted as a dashed red arrow.
For the list of all attributes, see Section 4, "Attributes of nodes in a tectogrammatical tree".
tectogrammatical tree edges capture the dependency relations between the nodes (more precisely between the autosemantic expressions) of tectogrammatical trees. Not every edge, though, represents a linguistic dependency (see Section 1, "Dependency"). Edges have no attributes of their own; attributes that actually belong to edges (e.g. the type of dependency) are presented as attributes of the corresponding nodes.
Fig. 2.1: The edges are represented as straight connecting lines between the nodes. The edges representing dependency are marked by a thick grey line. For more details see Section 1, "Dependency".
tectogrammatical tree nodes are in a linear order; this linear order represents the deep word order of the sentence (see Section 3, "Deep structure word order").
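The properties listed above (a rooted tree, nodes as sets of attribute-value pairs, attributes of the type reference, no attributes on edges, and a linear order on nodes) can be illustrated with a minimal Python sketch. It is not the PDT data format. The attribute names nodetype and functor and the values root and PAT come from the text; all other field names (t_lemma, ord, is_new, refs), the default nodetype value, the remaining functor values, the lemma of the newly established node and the concrete order numbers are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class TNode:
    """A node seen as a bundle of attribute-value pairs (field names are illustrative)."""
    t_lemma: str                    # lexical meaning of the node
    functor: str                    # edge information is stored on the dependent node
    ord: int                        # position in the (assumed) deep word order
    nodetype: str = "complex"       # assumed default value
    is_new: bool = False            # True for newly established nodes
    refs: dict = field(default_factory=dict, repr=False)   # reference-type attributes
    mother: Optional["TNode"] = field(default=None, repr=False)
    daughters: list = field(default_factory=list, repr=False)

    def add(self, child: "TNode") -> "TNode":
        """Attach child below this node and return the child."""
        child.mother = self
        self.daughters.append(child)
        return child


def subtree(node):
    """Yield node and everything below it."""
    yield node
    for daughter in node.daughters:
        yield from subtree(daughter)


def deep_order(root):
    """Return all nodes of the tree in their linear order."""
    return sorted(subtree(root), key=lambda n: n.ord)


def build_example():
    """Rebuild the tree described for Fig. 2.1 (order numbers are assumed)."""
    tech = TNode("", "", 0, nodetype="root")            # technical root
    verb = tech.add(TNode("vystřídat_se", "PRED", 8))
    conj = verb.add(TNode("a", "CONJ", 3))
    old = conj.add(TNode("sultán", "ACT", 2))
    old.add(TNode("starý", "RSTR", 1))
    new = conj.add(TNode("sultán", "ACT", 5))
    new.add(TNode("nový", "RSTR", 4))
    pat = verb.add(TNode("#Rcp", "PAT", 6, is_new=True))  # placeholder lemma
    verb.add(TNode("trůn", "LOC", 7))                   # na trůnu: one node
    pat.refs["coref"] = conj    # assumed target of the reciprocity reference
    return tech
```

Calling deep_order(build_example()) lists the nodes from left to right. Storing the functor on the daughter rather than on the edge mirrors the statement that edges carry no attributes of their own.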
The following terms are also used when talking about tectogrammatical trees (here explained only informally):
Technical root node of a tectogrammatical tree. The root node of a tectogrammatical tree is a node with no linguistic interpretation; it only serves technical purposes (e.g. it bears the sentence identifier). It always has exactly one daughter node. This node is called the technical root node of a tectogrammatical tree. When talking about tectogrammatical tree nodes (further in the text), the technical root node is not taken into account (unless stated otherwise).
Fig. 2.1: The technical root node of the tectogrammatical tree is the highest node; its only daughter node is connected to it by a thin dotted line (the value of the nodetype attribute of the technical root node is root; the technical root node also has the id attribute, which serves for identifying the sentence in the corpus).
Mother node. Node X is the mother of node Y, if there is an edge between X and Y and if X is closer to the technical root node of the tree (i.e. if it is higher in the tree).
Fig. 2.1: The mother of the node representing the expression (starý) sultán is the node for a.
Immediate daughter node. Node X is an immediate daughter of node Y, if Y is the mother of X.
Since tectogrammatical trees make use of linear ordering, there are right and left daughter nodes. A right (left) immediate daughter of node M is such an immediate daughter which occurs to the right (left) of node M.
Fig. 2.1: The immediate daughter nodes of the node representing the verb vystřídat se are these three nodes: the node for the conjunction a, the newly established node for the Patient and the node for the prepositional phrase na trůnu. All immediate daughter nodes of vystřídat se are left daughters.
Governing/dependent node. If nodes X and Y (or: the expressions represented by them) are in a dependency relation in which Y depends on X, X is the governing node of node Y and Y is a dependent node of node X. The governing node does not have to be the mother node of the dependent node (a single node can even have more than one governing node) and the dependent node does not have to be an immediate daughter of its governing node (see also Section 1, "Dependency"). (In the technical documentation for PDT, the terms "effective mother node" and "effective daughter node" are used for this type of relation.)
Fig. 2.1: The governing node of the node for starý is the node for sultán (which is also its mother node). The governing node of the node for sultán is the node representing the verb vystřídat se (which is not its mother node).
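The difference between the mother node and the governing node can be sketched in Python as follows. This is a simplified illustration, not the procedure used in the PDT tools: it assumes that each node carries two hypothetical flags, coord (the node is the root of a coordination) and member (the node is a member of the coordination rooted in its mother), and it ignores all other constructions in which an edge does not represent a dependency.

```python
class Node:
    def __init__(self, label, mother=None, coord=False, member=False,
                 technical=False):
        self.label = label
        self.mother = mother
        self.coord = coord          # hypothetical flag: coordination root
        self.member = member        # hypothetical flag: coordination member
        self.technical = technical  # technical root of the tree
        self.daughters = []
        if mother is not None:
            mother.daughters.append(self)


def members(coord_node):
    """Members of a coordination, looking through nested coordinations."""
    result = []
    for daughter in coord_node.daughters:
        if daughter.member:
            result.extend(members(daughter) if daughter.coord else [daughter])
    return result


def governing_nodes(node):
    """Governing nodes of node under the simplifying assumptions above."""
    current = node
    # A coordination member depends on whatever the coordination depends on.
    while current.member and current.mother is not None:
        current = current.mother
    mother = current.mother
    if mother is None or mother.technical:
        return []
    # A non-member daughter of a coordination depends on all its members.
    return members(mother) if mother.coord else [mother]


# The tree described for Fig. 2.1
tech = Node("technical root", technical=True)
verb = Node("vystřídat se", tech)
conj = Node("a", verb, coord=True)
old_sultan = Node("sultán", conj, member=True)
old = Node("starý", old_sultan)
new_sultan = Node("sultán", conj, member=True)
new = Node("nový", new_sultan)

print([g.label for g in governing_nodes(old)])         # ['sultán']
print([g.label for g in governing_nodes(old_sultan)])  # ['vystřídat se']
```

Under this sketch, a non-member daughter of the node for a would receive both nodes for sultán as governing nodes, which is one way a single node can have more than one governing node.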
Sister node. Node X is a sister node of node Y if they have the same mother.
Since tectogrammatical trees make use of linear ordering, there are right and left sisters. A right (left) sister node of node M is such a sister that occurs to the right (left) of node M.
Fig. 2.1: The sister nodes of the node for a are the newly established node for the Patient of vystřídat se and the node representing the prepositional phrase na trůnu. All the sisters of the node representing the conjunction a are its right sisters.
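The terms mother, immediate daughter and sister, together with their right and left variants, can be illustrated with a short Python sketch. The Node class and the order numbers are assumptions chosen so that the result agrees with the description of Fig. 2.1; the relative order of the Patient node and the node for na trůnu in particular is not stated in the text.

```python
class Node:
    def __init__(self, label, ord, mother=None):
        self.label = label
        self.ord = ord              # assumed position in the linear order
        self.mother = mother
        self.daughters = []
        if mother is not None:
            mother.daughters.append(self)


def by_order(nodes):
    return sorted(nodes, key=lambda n: n.ord)


def left_daughters(m):
    """Immediate daughters that occur to the left of m."""
    return by_order(d for d in m.daughters if d.ord < m.ord)


def right_daughters(m):
    """Immediate daughters that occur to the right of m."""
    return by_order(d for d in m.daughters if d.ord > m.ord)


def sisters(n):
    """Nodes that share their mother with n."""
    if n.mother is None:
        return []
    return by_order(s for s in n.mother.daughters if s is not n)


def left_sisters(n):
    return [s for s in sisters(n) if s.ord < n.ord]


def right_sisters(n):
    return [s for s in sisters(n) if s.ord > n.ord]


def labels(nodes):
    return [n.label for n in nodes]


# The tree described for Fig. 2.1
tech = Node("technical root", 0)
verb = Node("vystřídat se", 8, tech)
conj = Node("a", 3, verb)
old_sultan = Node("sultán", 2, conj)
old = Node("starý", 1, old_sultan)
new_sultan = Node("sultán", 5, conj)
new = Node("nový", 4, new_sultan)
pat = Node("PAT", 6, verb)
throne = Node("na trůnu", 7, verb)

print(labels(left_daughters(verb)))   # ['a', 'PAT', 'na trůnu']
print(labels(right_daughters(verb)))  # []
print(labels(right_sisters(conj)))    # ['PAT', 'na trůnu']
```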
Path from node M. For purposes of topic-focus articulation annotation, we also define the term right (left) path from node M and the rightmost (leftmost) path from node M.
A right (left) path from node M is such a path in the tree that starts at node M, goes downwards (towards the leaves) and ends in a node that has no right (left) immediate daughters. Node M is not part of the path.
The rightmost (leftmost) path from node M is such a right (left) path in the tree for which it holds that no node on the path has a right (left) sister.
Fig. 2.1: There is no right path leading from the node for vystřídat se. As for the leftmost path from the node representing vystřídat se, it consists of the nodes for a, sultán and starý.
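The rightmost and leftmost paths can be computed with a short Python sketch. It reads the definition as follows, which is an interpretation supported by the Fig. 2.1 example rather than something stated explicitly: every step of a right (left) path leads to a right (left) immediate daughter, and on the rightmost (leftmost) path each step takes the daughter that has no right (left) sister. The Node class, the function name extreme_path and the order numbers are assumptions; an empty result stands for "there is no such path".

```python
class Node:
    def __init__(self, label, ord, mother=None):
        self.label = label
        self.ord = ord              # assumed position in the linear order
        self.mother = mother
        self.daughters = []
        if mother is not None:
            mother.daughters.append(self)


def extreme_path(m, side):
    """Leftmost (side=-1) or rightmost (side=+1) path from m; m is excluded."""
    path = []
    current = m
    while True:
        # daughters lying on the requested side of the current node
        candidates = [d for d in current.daughters
                      if (d.ord - current.ord) * side > 0]
        if not candidates:
            return path
        # the outermost candidate has no sister further out on that side
        current = max(candidates, key=lambda d: d.ord * side)
        path.append(current)


def leftmost_path(m):
    return extreme_path(m, -1)


def rightmost_path(m):
    return extreme_path(m, +1)


# The tree described for Fig. 2.1
tech = Node("technical root", 0)
verb = Node("vystřídat se", 8, tech)
conj = Node("a", 3, verb)
old_sultan = Node("sultán", 2, conj)
old = Node("starý", 1, old_sultan)
new_sultan = Node("sultán", 5, conj)
new = Node("nový", 4, new_sultan)
pat = Node("PAT", 6, verb)
throne = Node("na trůnu", 7, verb)

print([n.label for n in leftmost_path(verb)])   # ['a', 'sultán', 'starý']
print([n.label for n in rightmost_path(verb)])  # []
```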
Subtrees. A subtree of a tectogrammatical tree is a continuous subgraph of a tectogrammatical tree (a subset of its nodes and edges with a marked root node).
Root of a subtree. The root of a subtree is the node of the subtree the mother node of which (if existent) is not part of the subtree.
Expression. Linguistically relevant parts of a sentence are called expressions. (Whole sentences are also expressions.)
Root of an expression. The root of an expression is short for the root of the subtree representing a given expression.
The root of a sentence is the root of the subtree corresponding to a whole sentence; i.e. it is the (only) direct daughter of the technical root node of the tectogrammatical tree.
Effective root of an expression. The effective root of an expression is the node that either has no governing node in the given tectogrammatical tree or the governing node of which is not part of the subtree representing the expression. The effective root of an expression can be identical to the root of the expression; however, sometimes it is not, e.g. in the case of paratactic structures: the root node (there is only one root) is not identical to the effective root nodes (of which there is usually more than one).
Fig. 2.1: The root of the example sentence is the node for vystřídat se. This node is also the effective root of the sentence. The coordination starý sultán a nový sultán is represented by a subtree of the tectogrammatical tree; the root of the subtree (the root of the coordination) is the node representing the conjunction a, the effective root nodes are the two nodes representing the noun sultán.
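The distinction between the root and the effective roots of an expression can be sketched in Python for the simplest paratactic case, a coordination. The sketch assumes two hypothetical flags on each node, coord (root of a coordination) and member (member of the coordination rooted in the mother); it does not cover other paratactic structures.

```python
class Node:
    def __init__(self, label, mother=None, coord=False, member=False):
        self.label = label
        self.mother = mother
        self.coord = coord      # hypothetical flag: coordination root
        self.member = member    # hypothetical flag: coordination member
        self.daughters = []
        if mother is not None:
            mother.daughters.append(self)


def subtree(node):
    """All nodes of the subtree rooted in node."""
    nodes = [node]
    for daughter in node.daughters:
        nodes.extend(subtree(daughter))
    return nodes


def subtree_root(nodes):
    """The node whose mother (if any) lies outside the given subtree."""
    roots = [n for n in nodes if n.mother not in nodes]
    if len(roots) != 1:
        raise ValueError("the nodes do not form a continuous subtree")
    return roots[0]


def effective_roots(root):
    """Effective roots of the expression whose subtree is rooted in root."""
    if not root.coord:
        return [root]
    result = []
    for daughter in root.daughters:
        if daughter.member:
            result.extend(effective_roots(daughter))
    return result


# The tree described for Fig. 2.1
tech = Node("technical root")
verb = Node("vystřídat se", tech)
conj = Node("a", verb, coord=True)
old_sultan = Node("sultán", conj, member=True)
old = Node("starý", old_sultan)
new_sultan = Node("sultán", conj, member=True)
new = Node("nový", new_sultan)

coordination = subtree(conj)
print(subtree_root(coordination).label)               # a
print([n.label for n in effective_roots(conj)])       # ['sultán', 'sultán']
print([n.label for n in effective_roots(verb)])       # ['vystřídat se']
```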