Contents

This CD presents part of the Prague English Dependency Treebank (PEDT). PEDT provides a manual tectogrammatical (syntactico-semantic) annotation of texts from the Wall Street Journal section of the Penn Treebank III (WSJ-PTB). The present CD (PEDT 1.0) comprises 12,440 annotated and checked trees, about 25% of the original WSJ-PTB. The following components are included:

Why Revisit The Wall Street Journal?

The Wall Street Journal section of the Penn Treebank was one of the first large manually annotated treebanks. It has become established as a standard reference corpus for statistical machine learning experiments. The PTB bracketing style has been adopted by corpora of other languages, which has strengthened the prominence of the original WSJ-PTB corpus. Although the WSJ is in practice a restricted-domain corpus, which may limit its usability for general NLP tasks, we believe that building an additional syntactico-semantic annotation on top of it is sensible. Having built and refined the Prague Dependency Treebank 2.0 (PDT 2.0), a one-million-word corpus of Czech 1990s newspaper texts with manual syntactico-semantic annotation, we have adapted the PDT-like annotation scheme to English.

We decided to draw on a corpus manually annotated in a widely known format, since the option of comparing the two annotation schemes can be particularly useful for some users. In addition, familiar text examples make the new annotation scheme easier for users to understand, and we, in turn, benefit from constantly comparing our work with the PTB bracketing style while creating the annotation guidelines. Most importantly, the original manual annotation provided excellent input for the conversion.

PEDT Group 2009

People

Former colleagues

Anja Nedolužko, Jiří Semecký

Software support

Petr Pajas - TREd
Zdeněk Žabokrtský - TectoMT

Acknowledgements

This work has been performed at the Institute of Formal and Applied Linguistics and supported by the Czech Science Foundation (GA-ČR 405/06/0589) and by several other projects that have used, and continue to use, the corpus for developing project-specific tools, namely the EU projects Companions (6th FP, Project No. FP6-IST-5-034434-IP) and EuromatrixPlus (7th FP, Project No. FP7-ICT-3-231720-STP).

Photographs by courtesy of Pavel Schlesinger and the Nedolužko family archive

Licence

This data can be viewed and used only by holders of the LDC licence for the Penn Treebank III (contact: ldc@ldc.upenn.edu).

Reference

When quoting the annotated data, please use the following reference:

Silvie Cinková, Josef Toman, Jan Hajič, Kristýna Čermáková, Václav Klimeš, Lucie Mladová, Jana Šindlerová, Kristýna Tomšů, and Zdeněk Žabokrtský: Tectogrammatical Annotation of the Wall Street Journal. Prague Bulletin of Mathematical Linguistics, 2009, 92. [Draft]

Contact us

You can visit our web page or email us at: pedt AT mff DOT cuni DOT cz


Figure: Tectogrammatical tree of the sentence "Whether desirable or not, this is a child-care program, not an educational program." (WSJ 1286/49)