Content

Introduction

Node types

Types of edges

Node structure

Functors

Formemes

Grammatemes

Valency

Additional specifications

In this text we present the main principles of the tectogrammatical representation as applied to English, using English examples. Features that are not specific to English apply to the Czech tectogrammatical representation as well.

What is the tectogrammatical layer for?

The tectogrammatical annotation is built on top of the analytical layer. Like the analytical layer, it captures syntactic dependencies, but it is more semantically oriented and provides additional linguistic information. The basic idea behind the tectogrammatical representation is that it emphasizes the similarities between languages and downplays their differences. The tectogrammatical representations of a sentence in a source language and of its translation into a target language are more similar to each other than the corresponding analytical representations, since many language-specific features are moved out of the tree structure and into the inner structure of the nodes.

For illustration, the Penn Treebank sentence 1600/32 has a very precise Czech translation, so the source sentence and the target sentence ought to be represented in a similar way as far as semantics is concerned.

  • [en] Mr. Carder also goes through periods when he buys stocks in conjunction with options to boost returns and protect against declines.
  • [cs] Pan Carder má rovněž období, kdy kupuje akcie ve spojení s opcemi, aby zlepšil výnosy a zabránil poklesům.

Figure1

The translation is precise, but not literal. The Czech sentence does not use the most straightforward translation of the main predicate (má means has, while the English predicate says goes through), and the English infinitive clause to boost returns and protect against declines, which serves as an adverbial purpose clause, corresponds to a finite past-tense subordinate clause introduced by the subordinator aby. This is a systematic difference between Czech and English.

The tectogrammatical representations smooth out these lexical and structural differences, which are anchored in the text itself, as Figure1 demonstrates. For instance, the arguments and adjuncts of the main predicates get identical semantic labels, and it does not even matter that one predicate takes a direct object while the other takes a prepositional object. The structural difference between the English infinitive clause and the Czech finite subordinate clause with a subordinator is also "hidden"; that is, it has moved from the tree structure into the inner structure of the nodes, partly as a number of different attribute values, partly as references to the analytical layer. The analytical representations, on the other hand, preserve these differences in the tree structure (Figure2).

The most essential differences between the analytical layer and the tectogrammatical layer are:

  • Of the tokens realized in the text, only content words and coordinating conjunctions are represented as nodes in the tree. The linguistic information contributed by function words is stored in the inner structure of the nodes (see Section Node structure).
  • Instead of what is usually understood as a lemma, the tectogrammatical representation introduces the t-lemma. Especially in the English part, it is still mostly identical with the base form of the word, but some parts of speech are rendered by a string beginning with #. This applies e.g. to personal pronouns and negation particles. The original word is normally present on the lower layers, but in the tectogrammatical tree it is encoded by the given t-lemma and a combination of grammatemes (a set of cognitive and grammatical categories; for more detail see Section Grammatemes). For instance, the pronoun he would get the t-lemma #PersPron and grammatemes for definiteness, gender and number. A few t-lemmas with # (we will call them t-lemma substitutes) do not represent any node present on the lower layers; they occur only on the tectogrammatical layer. Nodes that do not correspond to any surface nodes are called generated nodes. They either get a t-lemma substitute, or they are copies of nodes located elsewhere in the text. All generated nodes have the attribute value is_generated="1".
  • Generated nodes are used, for instance, to restore ellipsis. Whether a generated node is a copy carrying the t-lemma of an ordinary word or a purely artificial node with a t-lemma substitute depends on its position in the tree. With a few negligible exceptions, generated non-terminal nodes are copies of existing nodes with regular t-lemmas, whereas generated terminal nodes get t-lemma substitutes.
  • Every occurrence of a verb is assigned a frame from the valency lexicon Engvallex. When a verb is used in a context where not all of its obligatory argument slots are occupied, the empty slots are filled with generated nodes with t-lemma substitutes.
  • Not only verb arguments, but all tectogrammatical nodes get semantic labels (functors). These semantic labels describe the syntactico-semantic relation of the given node to its parent.
  • Anaphora and coreference are resolved, even among the generated nodes.
  • In general, the tectogrammatical representation also captures information structure (topic-focus articulation). NB: PCEDT 2.0 does not yet contain this annotation.
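
To make the list above more concrete, the following is a minimal Python sketch of how a tectogrammatical node and the filling of empty valency slots could be modelled. It is an illustration only, not the PCEDT data format or tooling: the class TNode, the function fill_missing_arguments, the functor labels ACT and PAT, the substitute #Gen, the grammateme values and the toy frame are all assumptions introduced for this example. Only the t-lemma #PersPron and the is_generated attribute are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TNode:
    """Hypothetical, simplified model of a tectogrammatical node."""
    t_lemma: str                       # base form, or a '#'-prefixed string
    functor: Optional[str] = None      # relation to the parent; None for the root
    grammatemes: Dict[str, str] = field(default_factory=dict)
    is_generated: bool = False         # True for nodes with no surface counterpart
    children: List["TNode"] = field(default_factory=list)

    def add_child(self, child: "TNode") -> "TNode":
        self.children.append(child)
        return child


def fill_missing_arguments(verb: TNode, obligatory: List[str],
                           substitute: str = "#Gen") -> List[TNode]:
    """Add a generated node for every obligatory functor the verb lacks.

    'obligatory' stands in for a valency frame; the default substitute
    t-lemma is an assumption made for this sketch.
    """
    present = {child.functor for child in verb.children}
    added = []
    for functor in obligatory:
        if functor not in present:
            node = TNode(substitute, functor, is_generated=True)
            added.append(verb.add_child(node))
    return added


# Toy example: 'he buys' with the object left unexpressed.
buy = TNode("buy")
# The pronoun is a surface word, so it is not a generated node;
# the grammateme values here are illustrative only.
buy.add_child(TNode("#PersPron", "ACT", {"gender": "anim", "number": "sg"}))

added = fill_missing_arguments(buy, ["ACT", "PAT"])
print([(n.t_lemma, n.functor, n.is_generated) for n in added])
# prints [('#Gen', 'PAT', True)]
```

In this sketch the pronoun keeps is_generated at its default, because it corresponds to a word in the text, while the node added for the empty slot is marked as generated and is a terminal node carrying a t-lemma substitute.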

Figure2