This CD presents part of the Prague English Dependency Treebank (PEDT). PEDT is the manual tectogrammatical (syntactico-semantic) annotation of texts from the Wall Street Journal - Penn Treebank III. The present CD (PEDT 1.0) comprises 12,440 annotated and checked trees, which is about 25% of the original WSJ-PTB. The following components are included:
- manually annotated data, integrated valency lexicon Engvallex
- the valency lexicon Engvallex in printable form (latest revision: January 2009)
- the ready-to-install package of the tree editor/viewer TREd
- specification of the annotation format (Prague Markup Language)
Why Revisit The Wall Street Journal?
The Wall Street Journal section of the Penn Treebank was one of the first large manually annotated treebanks. It has become established as a standard reference corpus for statistical machine learning experiments. The PTB bracketing style was adopted by corpora of other languages, which strengthened the prominence of the original WSJ-PTB corpus. Although the WSJ is in practice a restricted-domain corpus, which may limit its usability for general NLP tasks, we believe that building an additional syntactico-semantic annotation on it is sensible. Having built and refined the Prague Dependency Treebank 2.0 (PDT 2.0), a one-million corpus of Czech 1990s newspaper texts with manual syntactico-semantic annotation, we have adapted the PDT-like annotation scheme to English.
We decided to draw on a corpus manually annotated in a widely known format, since the option of comparing the two annotation schemes can be particularly useful for some users. In addition, familiar text examples make the new annotation scheme easier for users to understand, and we, in turn, benefit from the constant confrontation with the PTB bracketing style while creating the annotation guidelines. Most importantly, the original manual annotation provided excellent input for the conversion.
- Jan Hajič
- Silvie Cinková
- Lucie Mladová
- Jana Šindlerová
- Kristýna Čermáková
- Matěj Korvas
- Jan Mašek
- Magdaléna Rysová
- Josef Toman
- Kristýna Tomšů
- Kateřina Veselá
- Kateřina Veselovská
Anja Nedolužko, Jiří Semecký
This work was carried out at the Institute of Formal and Applied Linguistics and supported by the Czech Science Foundation (GA-ČR 405/06/0589) and by several other grants whose projects have used, and continue to use, the corpus for developing project-specific tools, namely the EU projects Companions (6th FP, Project No. FP6-IST-5-034434-IP) and EuromatrixPlus (7th FP, Project No. FP7-ICT-3-231720-STP).
Photographs by courtesy of Pavel Schlesinger and the Nedolužko family archive
This data can be viewed and used only by holders of the LDC licence for the Penn Treebank III. (Contact: firstname.lastname@example.org).
When quoting the annotated data, please use the following reference:
Silvie Cinková, Josef Toman, Jan Hajič, Kristýna Čermáková, Václav Klimeš, Lucie Mladová, Jana Šindlerová, Kristýna Tomšů, and Zdeněk Žabokrtský: Tectogrammatical Annotation of the Wall Street Journal. Prague Bulletin of Mathematical Linguistics, 2009, 92. [Draft]
You can visit our web page or write an email to our address: pedt AT mff DOT cuni DOT cz