Please note that this page is no longer maintained and that PDT 0.5 (used only at the JHU Summer Workshop in 1998) is no longer a supported release. You may want to consider using the current (as of March 2010) release, PDT 2.0, instead.

The Prague Dependency Treebank 0.5

The Prague Dependency Treebank (PDT) is a morphologically and syntactically annotated corpus of Czech as a representative of inflectionally rich languages with free word order. (For example, the Slavic languages, such as Russian, Polish, Serbo-Croatian and many others, which together are spoken by more than 350 million people, have typological properties similar to those of Czech in both morphology and syntax.)

The Prague Dependency Treebank is - to a certain extent - modeled after the Penn Treebank but it uses the dependency syntax representation of sentences. It has three layers:

morphological (uses word forms, tags, lemmas)
analytical, or surface syntax (uses dependencies and analytical functions of dependencies)
tectogrammatical, which captures linguistic meaning (contains tectogrammatical functions such as Actor, Patient, Addressee, etc.)
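
To make the three layers concrete, the following is a minimal sketch of how one annotated token might be held in memory. It is not the PDT file format: the class `Token`, its field names, the helper `dependents` and the tag strings are all hypothetical, and putting all three layers into one flat record is a simplification made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One token carrying annotation from the three layers (hypothetical field names)."""
    form: str                      # morphological layer: word form
    lemma: str                     # morphological layer: lemma
    tag: str                       # morphological layer: tag (placeholder values below)
    head: int                      # analytical layer: 1-based index of the governing token, 0 = root
    afun: str                      # analytical layer: analytical function of the dependency
    functor: Optional[str] = None  # tectogrammatical layer: e.g. Actor, Patient, Addressee


def dependents(sentence: List[Token], head_index: int) -> List[Token]:
    """Return the tokens whose governing node is the token at head_index (1-based)."""
    return [tok for tok in sentence if tok.head == head_index]


# Toy sentence "Petr spí" ("Petr sleeps"). The tag strings are placeholders,
# not real PDT tags, and the function labels are illustrative.
sentence = [
    Token(form="Petr", lemma="Petr", tag="NOUN-PLACEHOLDER", head=2, afun="Sb", functor="Actor"),
    Token(form="spí", lemma="spát", tag="VERB-PLACEHOLDER", head=0, afun="Pred"),
]

for tok in dependents(sentence, 2):
    print(tok.form, tok.afun, tok.functor)
```

The `head` index is what makes this a dependency representation: each token points to its governing token rather than being grouped into phrases.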

The Prague Dependency Treebank is a long-term project which should end in the year 2000. The current version is thus preliminary and identified as "PDT version 0.5" (reflecting mostly the amount of material currently available).

The text material contains samples from the following sources:

Lidové noviny (daily newspapers), 1991, 1994, 1995
Mladá fronta Dnes (daily newspapers), 1992
Českomoravský Profit (business weekly), 1994
Vesmír (scientific magazine), Academia Publishers, 1992, 1993

The electronic source has been provided by the Institute of the Czech National Corpus, in a format jointly developed by the ICNK and IFAL.

PDT version 0.5

The current version of PDT (0.5) contains 456705 tokens (words and punctuation) in 26610 sentences and 576 files, annotated on the morphological and analytical levels. To keep the results of NLP applications comparable, the data has been divided into a training set (19126 sentences), a development test set (3697 sentences) and a (cross-)evaluation test set (3787 sentences).
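
As a quick consistency check on these figures, the short sketch below confirms that the three sets add up to the stated sentence total and prints each set's share. The variable names are illustrative only. The numbers are copied from the paragraph above.

```python
# Sentence counts as stated in the paragraph above.
splits = {
    "training": 19126,
    "development test": 3697,
    "evaluation test": 3787,
}
total_sentences = 26610
total_tokens = 456705

# The three sets should account for every sentence in the release.
assert sum(splits.values()) == total_sentences

for name, count in splits.items():
    print(f"{name}: {count} sentences ({count / total_sentences:.1%})")

print(f"average sentence length: {total_tokens / total_sentences:.1f} tokens")
```

The split works out to roughly 72% training, 14% development test and 14% evaluation test, with an average of about 17 tokens per sentence.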

The division of the data into files is described in

The Workshop 98 data description, division and placement.

The internal format of the files is based on SGML.

The SGML document type definition is here.

PDT version 0.5 is freely available for research purposes, provided that you fill in and submit the Licence Agreement.

Documentation

General information is given in

J. Hajic: Building a Syntactically Annotated Corpus: The Prague Dependency Treebank. In: Issues of Valency and Meaning, pp. 106-132, Karolinum, Praha 1998 (PS)

A rough description of the level 1 annotation and a more detailed account of the level 2 annotation rules are available in Czech; see the Manual for the Annotators.

The inner structure of the level 1 morphological tags can be better understood from

J. Hajic, B. Hladká: Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In: Proceedings of the 36th Annual Meeting of the ACL and the 17th ICCL, pp. 483-490, Université de Montréal, Montréal 1998 (PS)

The information on the transition from level 2 to level 3 can be found in:

A. Bohmova and E. Hajicova. How Much of the Underlying Syntactic Structure Can Be Tagged Automatically. In Proceedings of ATALA Workshop, pp. 31-39, Paris, France, 1999. (DOC)

A. Bohmova, P. Sgall: Automatic procedures in tectogrammatical tagging (DOC)

A. Bohmova, J. Panevova and P. Sgall. Syntactic Tagging Procedure for the Transition from the Analytic to the Tectogrammatical Tree Structure. In Proceedings of the Second Workshop on Text, Speech, Dialogue, pp. , Marianske Lazne, Czech Republic, 1999. (PS)

A. Bemova, J. Hajic, B. Hladka and J. Panevova. Morphological and Syntactic Tagging of the Prague Dependency Treebank. In Proceedings of ATALA Workshop, pp. 21-29, Paris, France, 1999. (PS)

J. Hajic, E. Hajicova, J. Panevova and P. Sgall. Syntax v Ceskem narodnim korpusu [Syntax in the Czech National Corpus]. In Slovo a slovesnost, 3, LIX, pp. 168-177, 1998.

E. Hajicova. Prague Dependency Treebank: From Analytic to Tectogrammatical Annotation. In Proceedings of the First Workshop on Text, Speech, Dialogue, pp. 45-50, Brno, Czech Republic, 1998.

E. Hajicova, J. Panevova and P. Sgall. Language Resources Need Annotations to Make Them Really Reusable: The Prague Dependency Treebank. In Proceedings of the First International Conference on Language Resources, pp. 713-718, Granada, Spain, 1998.

E. Hajicova. The Prague Dependency Treebank: Crossing Sentence Boundary. In Proceedings of the Second Workshop on Text, Speech, Dialogue, pp. , Marianske Lazne, Czech Republic, 1999. (PS)

Supported by

The Treebank has been supported by the following grants and projects:

Grant Agency of the Czech Republic No. 405/96/0198 (Treebank Definition and Procedures Specification)
Grant Agency of the Czech Republic No. 405/96/K214 (Tools and Level 1 Annotation)
Ministry of Education of the Czech Republic Project No. VS96151 (Tools and Structural Annotation on the Level 2)
National Science Foundation grant No. IIS-9732388 (Version 0.5 Preparation for the Workshop 98)

This page is maintained by Daniel Zeman. Last change: 27.6.2000 by Webar