
Prague Czech-English Dependency Treebank 1.0


Introduction

Prague Czech-English Dependency Treebank version 1.0 (PCEDT 1.0) was developed at the Center for Computational Linguistics in cooperation with the Institute of Formal and Applied Linguistics and published by the Linguistic Data Consortium (LDC), catalog number LDC2004T25, ISBN 1-58563-321-6.

PCEDT 1.0 is a corpus of Czech-English parallel resources suitable for experiments in machine translation, with a special emphasis on dependency-based (structural) translation. Evaluation data is provided for Czech-to-English systems.

Data

The core part of PCEDT 1.0 is a Czech translation of 21,600 English sentences from the Wall Street Journal part of the Penn Treebank 3 corpus (PTB, LDC99T42; PTB is also included on the CD). The sentences of the Czech translation were automatically morphologically annotated and parsed into two levels of dependency structures (analytical and tectogrammatical), which were introduced in the theory of Functional Generative Description and are closely related to the Prague Dependency Treebank project (PDT, LDC2001T10). The original English sentences were transformed from the Penn Treebank phrase-structure trees into dependency representations. A heldout (development and evaluation) set of 515 sentence pairs was selected and manually annotated on the tectogrammatical level in both Czech and English; for the purposes of quantitative evaluation, this set was retranslated from Czech to English by four different translation companies.
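The page does not say how the Penn Treebank phrase-structure trees were converted into dependency representations. A common approach to this kind of conversion is to pick a head child for every phrase using head rules and attach the heads of the remaining children to it. The following is a minimal sketch of that general idea only, not the PCEDT procedure: the `HEAD_RULES` table, the nested-tuple tree encoding, and the function name `to_dependencies` are all assumptions made for illustration.

```python
# Hypothetical head rules (an assumption for illustration, not the PCEDT rules):
# for each phrase label, child labels to prefer as the head, in priority order.
HEAD_RULES = {
    "S": ["VP", "NP"],
    "VP": ["VBD", "VBZ", "VB", "VP"],
    "NP": ["NN", "NNS", "NNP", "NP"],
    "PP": ["IN"],
}


def to_dependencies(tree):
    """Convert a phrase-structure tree into word-to-word dependency edges.

    A tree is a nested tuple: (label, child, child, ...) for phrases and
    (pos_tag, word) for preterminals.
    Returns (words, root_index, edges), where each edge is
    (dependent_index, head_index).
    """
    words = []
    edges = []

    def visit(node):
        label = node[0]
        if isinstance(node[1], str):  # preterminal: (pos_tag, word)
            words.append(node[1])
            return len(words) - 1
        children = node[1:]
        heads = [visit(child) for child in children]
        head_pos = len(children) - 1  # fallback: rightmost child is the head
        for wanted in HEAD_RULES.get(label, []):
            matches = [i for i, c in enumerate(children) if c[0] == wanted]
            if matches:
                head_pos = matches[0]
                break
        # Attach the lexical head of every non-head child to the phrase head.
        for i, h in enumerate(heads):
            if i != head_pos:
                edges.append((h, heads[head_pos]))
        return heads[head_pos]

    root = visit(tree)
    return words, root, sorted(edges)


tree = (
    "S",
    ("NP", ("DT", "The"), ("NN", "company")),
    ("VP", ("VBD", "reported"), ("NP", ("NNS", "profits"))),
)
words, root, edges = to_dependencies(tree)
for dep, head in edges:
    print(f"{words[dep]} -> {words[head]}")
print("root:", words[root])
```

In this toy example the verb becomes the root of the sentence and the subject and object nouns attach to it. Real conversions need a much larger rule table and special handling of traces, coordination, and function tags.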

PCEDT 1.0 also comprises a parallel Czech-English corpus of plain text from Reader's Digest 1993-1996 consisting of 53,000 parallel sentences, and a large monolingual corpus of Czech (2.4 M sentences).

The included Czech-English translation dictionary consists of 46,150 entry-translation pairs in its lemmatized version and 496,673 pairs of word forms in the version where, for each entry-translation pair, all the corresponding word form pairs have been generated. Also included is an English-Czech dictionary provided by Milan Svoboda under the GNU/FDL license; this dictionary contains multi-word translations in 115,929 translation pairs.
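To illustrate the relationship between the lemmatized and the word-form versions of the dictionary, here is a minimal sketch of expanding lemma pairs into word form pairs. It assumes a plain Cartesian product of the forms of each lemma; the page does not state whether the actual generation also matched morphological categories such as number or case. The function name `expand_to_word_forms` and the tiny form tables are hypothetical.

```python
from itertools import product


def expand_to_word_forms(lemma_pairs, czech_forms, english_forms):
    """Expand lemma-level translation pairs into word-form pairs.

    Assumption: every Czech form of the entry is paired with every
    English form of its translation (Cartesian product). A lemma with
    no listed forms is kept as its own single form.
    """
    form_pairs = set()
    for cs_lemma, en_lemma in lemma_pairs:
        cs = czech_forms.get(cs_lemma, [cs_lemma])
        en = english_forms.get(en_lemma, [en_lemma])
        form_pairs.update(product(cs, en))
    return sorted(form_pairs)


# Toy data for illustration only, not taken from the PCEDT dictionary files.
lemma_pairs = [("kniha", "book")]
czech_forms = {"kniha": ["kniha", "knihy", "knize", "knihu"]}
english_forms = {"book": ["book", "books"]}

pairs = expand_to_word_forms(lemma_pairs, czech_forms, english_forms)
for cs_form, en_form in pairs:
    print(cs_form, en_form)
```

Because Czech is highly inflected, a single lemma pair expands into several form pairs, which is consistent with the word-form version being roughly an order of magnitude larger than the lemmatized one.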

Start browsing the PCEDT.

Future

The next version of PCEDT aims at translating the whole Wall Street Journal part of the Penn Treebank; we also plan to include reference retranslations for Czech. As a manual for tectogrammatical annotation of English is being created, the proportion of manually annotated data will increase.

Support

PCEDT 1.0 has been supported by the following grants and projects

Updates

Updates or bug fixes may be available in the LDC catalog entry for this corpus, or at the PCEDT homepage.

Your questions and suggestions are welcome at pcedt (at) ufal (dot) mff (dot) cuni (dot) cz.

References

Content Copyright

Portions © 1999 Trustees of the University of Pennsylvania, © 1988-1989 Wall Street Journal, © 1993-1996 Reader's Digest, © 1991-1995 Lidové noviny, © 2004 Milan Svoboda, © 2002-2004 Center for Computational Linguistics, Charles University in Prague

Please proceed to the Research-Usage License Agreement for the Prague Czech-English Dependency Treebank 1.0, or to its on-line version.


Contact ldc@ldc.upenn.edu.
© 2004 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.