The tectogrammatical annotation is built on top of the analytical layer. Like the analytical layer, it captures syntactic dependencies, but it is more semantically oriented and provides additional linguistic information. The basic idea of the tectogrammatical representation is that it emphasizes the similarities between languages and moderates their differences. The tectogrammatical representations of a sentence in a source language and of its translation into a target language are more similar than their analytical representations, since many language-specific features are moved out of the tree structure and into the inner structure of the nodes.
For illustration, the Penn Treebank sentence 1600/32 was translated into Czech very precisely, and thus the source sentence and the target sentence ought to be represented in a similar way as far as semantics is concerned.
[en] Mr. Carder also goes through periods when he buys stocks in conjunction with options to boost returns and protect against declines.
[cs] Pan Carder má rovněž období, kdy kupuje akcie ve spojení s opcemi, aby zlepšil výnosy a zabránil poklesům.
The translation is precise, but not literal. The Czech sentence does not use the most straightforward translation of the main predicate ("má" means "has", while the English predicate says "goes through"), and the English infinitive clause "to boost returns and protect against declines", which serves as an adverbial purpose clause, corresponds to a finite past-tense subordinate clause introduced by the subordinator "aby". This is a systematic difference between Czech and English. The tectogrammatical representations smooth out these lexical and structural differences, which are anchored in the text itself, as Figure 1 demonstrates. For instance, the arguments and adjuncts of the main predicates get identical semantic labels, and it does not even matter that one predicate takes a direct object while the other takes a prepositional object. The structural difference between the English infinitive clause and the Czech finite subordinate clause with a subordinator is also "hidden"; i.e. it has moved from the tree structure into the inner structure of the nodes, partly as a number of different attribute values, partly as references to the analytical layer. The analytical representations, on the other hand, preserve these differences in the tree structure (Figure 2).
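The idea that differences move from the tree structure into the inner structure of the nodes can be sketched in a few lines of Python. This is only an illustrative toy model, not the actual data format: the class TNode, its fields and the function shape are hypothetical names, the references to the analytical layer are reduced to plain word forms, and the semantic labels (RSTR, PAT, AIM) are plausible examples rather than the values shown in Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TNode:
    """Hypothetical, simplified tectogrammatical node (not the real data format)."""
    t_lemma: str                      # language-specific lemma
    functor: str                      # semantic label of the dependency
    # Function words that exist only on the analytical layer (assumed encoding).
    aux_refs: List[str] = field(default_factory=list)
    children: List["TNode"] = field(default_factory=list)


def shape(node: TNode) -> Tuple:
    """Tree structure reduced to semantic labels, ignoring node-internal data."""
    return (node.functor, tuple(sorted(shape(c) for c in node.children)))


# Fragment "buys stocks ... to boost returns" (labels are illustrative).
en = TNode("buy", "RSTR", children=[
    TNode("stock", "PAT"),
    TNode("boost", "AIM", aux_refs=["to"], children=[TNode("return", "PAT")]),
])

# Fragment "kupuje akcie ..., aby zlepšil výnosy" (labels are illustrative).
cs = TNode("kupovat", "RSTR", children=[
    TNode("akcie", "PAT"),
    TNode("zlepšit", "AIM", aux_refs=["aby"], children=[TNode("výnos", "PAT")]),
])

# The labelled tree structures coincide ...
print(shape(en) == shape(cs))
# ... while the language-specific difference survives inside the nodes.
print(en.children[1].aux_refs, cs.children[1].aux_refs)
```

In this toy model the infinitive marker and the subordinator are not nodes of their own; they are only recorded inside the node of the purpose clause, which is why the two labelled structures compare as equal.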
The most essential differences between the analytical layer and the tectogrammatical layer are:
The general principles of the tectogrammatical representation have been most comprehensively described in the specification of the Czech tectogrammatical annotation. It has appeared in two versions: a comprehensive volume and an abbreviated version. Both contain a complete technical description of the data. The comprehensive specification gives the reader a detailed insight into the annotation of a number of linguistic phenomena. Based on these specifications, similar documentation was prepared for the English tectogrammatical representation in 2006. This documentation also contains most of the technical information present in the Czech specifications (e.g. lists of attribute values), and it describes the annotation of selected linguistic phenomena, some of them specific to English. The English annotation manual, however, suffers from two facts: it was conceived too strongly as a derivative of the Czech annotation manual and, no less importantly, at the time of writing there was no convenient tool that non-programming linguists could use to query the English data. The linguistic phenomena were thus selected and described on the basis of grammar textbooks and searches in the British National Corpus (BNC) rather than on the basis of the actual PTB-WSJ data. Later, during the large-scale annotation, we were confronted with the real PTB-WSJ data, and it turned out that some phenomena frequent in PTB-WSJ had been neglected, while others, extensively presented in the textbooks, were only marginal issues in American financial press texts. Particularly once the PML Tree Query engine was launched and querying the corpus became easy, it was plain to see that many linguistic instructions in the English manual had proved untenable in practice, while other instructions observed throughout the corpus had not found their way into the manual.
This brief description of the English tectogrammatical representation is meant to supplement the obsolete 2006 English annotation manual. We still consult a balanced corpus whenever the PTB-WSJ data do not seem to tell the whole story of a linguistic phenomenon, but instead of the BNC we now use the half-billion-word Corpus of Contemporary American English (COCA), which became freely available in 2008.