The Prague Czech-English Dependency Treebank (PCEDT) is a manually annotated, aligned parallel treebank built on the Penn Treebank - Wall Street Journal text collection. It comes in two versions. The current version, the Prague Czech-English Dependency Treebank 2.0, is a major update of the Prague Czech-English Dependency Treebank 1.0. It contains over 1.2 million running words in almost 50,000 sentences for each language part.

The English part of PCEDT 2.0 contains the entire Penn Treebank - Wall Street Journal Section (LDC99T42). The Czech part consists of Czech translations of all of the Penn Treebank-WSJ texts. The corpus is 1:1 sentence-aligned. This release also includes an additional automatic alignment on the node level (different for each annotation layer). The original Penn Treebank-like file structure (25 sections, each containing up to one hundred files) has been preserved. Only those PTB documents that have both POS and structural annotation (2312 documents in total) have been translated into Czech and included in this release.
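
As a rough illustration of the preserved file structure, the sketch below counts documents per section directory. The root path, the two-digit section directory names, and the `wsj_*` file name pattern are assumptions made for this example only; check them against the actual distribution before relying on them.

```python
from collections import Counter
from pathlib import Path


def count_documents(root, pattern="wsj_*"):
    """Count document files per section directory under `root`.

    Assumes a PTB-like layout: root/<section>/<document file>, where each
    section directory has a purely numeric name (an assumption, not taken
    from the corpus documentation).
    """
    counts = Counter()
    for section_dir in sorted(Path(root).iterdir()):
        # Skip anything that is not a numerically named directory.
        if section_dir.is_dir() and section_dir.name.isdigit():
            counts[section_dir.name] = sum(1 for _ in section_dir.glob(pattern))
    return counts


if __name__ == "__main__":
    # "pcedt-2.0/data" is a hypothetical location of the unpacked corpus.
    root = Path("pcedt-2.0/data")
    if root.is_dir():
        per_section = count_documents(root)
        print(len(per_section), "sections,", sum(per_section.values()), "documents")
```

If the layout assumptions hold, the totals can be compared with the figures stated above (25 sections, 2312 documents) as a quick sanity check of a local copy.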

Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (LDC2006T01, Prague Dependency Treebank 2.0). The main features of this annotation style are:

  • dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)
  • semantic labeling of content words and types of coordinating structures
  • argument structure, including an argument structure ("valency") lexicon for both languages
  • ellipsis and anaphora resolution.

 

How to cite

If you make use of the Prague Czech-English Dependency Treebank, please cite:

Hajič Jan, Hajičová Eva, Panevová Jarmila, Sgall Petr, Bojar Ondřej, Cinková Silvie, Fučíková Eva, Mikulová Marie, Pajas Petr, Popelka Jan, Semecký Jiří, Šindlerová Jana, Štěpánek Jan, Toman Josef, Urešová Zdeňka, Žabokrtský Zdeněk: Announcing Prague Czech-English Dependency Treebank 2.0. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Copyright © European Language Resources Association, İstanbul, Turkey, ISBN 978-2-9517408-7-7, pp. 3153-3160, 2012

@inproceedings{ biblio:HaHaAnnouncingPrague2012,
booktitle = {Proceedings of the 8th International Conference on Language Resources and Evaluation ({LREC} 2012)},
title = {Announcing Prague Czech-English Dependency Treebank 2.0},
author = {Jan Haji{\v{c}} and Eva Haji{\v{c}}ov{\'{a}} and Jarmila Panevov{\'{a}} and Petr Sgall and Ond{\v{r}}ej Bojar and Silvie Cinkov{\'{a}} and Eva Fu{\v{c}}{\'{i}}kov{\'{a}} and Marie Mikulov{\'{a}} and Petr Pajas and Jan Popelka and Ji{\v{r}}{\'{i}} Semeck{\'{y}} and Jana {\v{S}}indlerov{\'{a}} and Jan {\v{S}}t{\v{e}}p{\'{a}}nek and Josef Toman and Zde{\v{n}}ka Ure{\v{s}}ov{\'{a}} and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}}},
year = {2012},
publisher = {European Language Resources Association},
organization = {{ELRA}},
address = {{\.{I}}stanbul, Turkey},
venue = {L{\"{u}}tfi Kırdar Convention {{\&}} Exhibition Centre},
pages = {3153--3160},
isbn = {978-2-9517408-7-7},
}