2. Building the Prague English Dependency Treebank and the Prague Czech-English Dependency Treebank

This manual describes the tectogrammatical annotation of English data. The data used is the original Penn Treebank (PTB) data, which comprises about 50,000 sentences. About one half (approx. 22,000 sentences) was translated into Czech. In 2004, the English data and the Czech translations were converted into the PDT 1.0 format and released by the LDC as the Prague Czech-English Dependency Treebank. The tectogrammatical annotation level was generated for both data sets, and a tiny fraction of the English tectogrammatical data was annotated manually. In 2006, the data was converted into the PDT 2.0 format, and the second half of the PTB texts was translated into Czech as well. The original manual tectogrammatical annotation of the English data was lost in the conversion. The current annotation of the English data was launched in the fall of 2006. As a first step, we converted the English valency lexicon PropBank into EngValLex to obtain a reference and supporting tool for the annotation of verbal frames. For more information about EngValLex, see Section 2.4.3, “Valency lexicon”.