In this text we present the main principles of the tectogrammatical representation as applied to English, using English examples. Features that are not specific to English apply to the Czech tectogrammatical representation as well.
The general principles of the tectogrammatical representation have been most comprehensively described in the specification of the Czech tectogrammatical annotation, which has appeared in two versions: a comprehensive volume and an abbreviated version. Both contain a complete technical description of the data. The comprehensive specification also gives the reader detailed insight into the annotation of a number of linguistic phenomena. Based on these specifications, similar documentation was drawn up for the English tectogrammatical representation in 2006. It contains most of the technical information present in the Czech specifications (e.g. lists of attribute values) and describes the annotation of selected linguistic phenomena, some of them specific to English.
The English annotation manual, however, suffers from two problems. First, it was conceived too strongly as a derivative of the Czech annotation manual. Second, at the time of writing, no convenient tool for querying the English data was available to linguists without programming skills. The linguistic phenomena were therefore selected and described on the basis of grammar textbooks and searches in the British National Corpus (BNC) rather than on the actual PTB-WSJ data. Later, during the large-scale annotation, we were confronted with the real PTB-WSJ data, and it turned out that some phenomena frequent in PTB-WSJ had been neglected, while others, extensively presented in the textbooks, were only marginal in American financial press texts. Once the PML Tree Query engine was launched and querying the corpus became easy, it was plain to see that many instructions in the English manual had proved untenable in practice, while other instructions that were followed throughout the corpus had never found their way into the manual.
This brief description of the English tectogrammatical representation is meant to support the obsolete 2006 English annotation manual. We still consult a balanced corpus whenever the PTB-WSJ data do not seem to tell the whole story of a linguistic phenomenon, but instead of the BNC we now use the half-billion-word Corpus of Contemporary American English (COCA), which became freely available in 2008.