The Prague Dependency Treebank 0.5

The Prague Dependency Treebank (PDT) is a morphologically and syntactically annotated corpus of Czech, chosen as a representative of inflectionally rich languages with free word order. (For example, all the Slavic languages, including Russian, Polish, Serbo-Croatian and many others, spoken together by more than 350 million people, have typological properties similar to those of Czech in both morphology and syntax.)

The Prague Dependency Treebank is, to a certain extent, modeled after the Penn Treebank, but it represents sentences using dependency syntax. It has three layers:

  1. morphological (uses word forms, tags, lemmas)
  2. analytical, or surface syntax (uses dependencies and analytical functions of dependencies)
  3. tectogrammatical, which captures linguistic meaning (contains tectogrammatical functions such as Actor, Patient, Addressee, etc.)
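
As a rough illustration of how the first two layers fit together, the sketch below models one token as a record carrying morphological fields (form, lemma, tag) and analytical fields (governing token and analytical function). The class, field and function names are assumptions made for this example and do not reflect the PDT file format; the tag and function strings are illustrative stand-ins rather than values taken from the corpus.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    # Morphological layer
    form: str   # word form as it appears in the text
    lemma: str  # dictionary form
    tag: str    # morphological tag (placeholder values, not the PDT tagset)
    # Analytical layer
    head: int   # 1-based position of the governing token, 0 for the root
    afun: str   # analytical function of the dependency (illustrative labels)


def children(sentence: List[Token], head: int) -> List[int]:
    """Return the 1-based positions of tokens governed by `head`."""
    return [i for i, tok in enumerate(sentence, start=1) if tok.head == head]


def is_tree(sentence: List[Token]) -> bool:
    """Check that every token reaches the root (0) without a cycle."""
    n = len(sentence)
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen or not (1 <= node <= n):
                return False
            seen.add(node)
            node = sentence[node - 1].head
    return True


# Hypothetical three-token sentence: "Petr spí ." ("Petr sleeps.")
sentence = [
    Token("Petr", "Petr", "NOUN", head=2, afun="Sb"),
    Token("spí", "spát", "VERB", head=0, afun="Pred"),
    Token(".", ".", "PUNCT", head=0, afun="AuxK"),
]
```

With this representation, `children(sentence, 2)` lists the dependents of the verb, and `is_tree(sentence)` guards against malformed head indices. The tectogrammatical layer is left out of the sketch because it is not part of the annotation described for this version.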

The Prague Dependency Treebank is a long-term project that is expected to end in the year 2000. The current version is therefore preliminary and is identified as "PDT version 0.5" (the number mostly reflects the amount of material currently available).

The text material contains samples from the following sources:

  1. Lidové noviny (daily newspapers), 1991, 1994, 1995
  2. Mladá fronta Dnes (daily newspapers), 1992
  3. Českomoravský Profit (business weekly), 1994
  4. Vesmír (scientific magazine), Academia Publishers, 1992, 1993

The electronic source has been provided by the Institute of the Czech National Corpus, in a format jointly developed by the ICNK and IFAL.

PDT version 0.5

The current version of PDT (0.5) contains 456705 tokens (words and punctuation) in 26610 sentences and 576 files, annotated on the morphological and analytical levels. To keep the results of NLP applications comparable, the data has been divided into a training set (19126 sentences), a development test set (3697 sentences) and a (cross-)evaluation test set (3787 sentences).
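
The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts quoted in this section; the variable names are chosen for the example.

```python
# Sentence counts per data set, as quoted in the text above.
splits = {
    "training": 19126,
    "development test": 3697,
    "evaluation test": 3787,
}
total_sentences = 26610
total_tokens = 456705

# The three sets should account for every sentence in the corpus.
split_sum = sum(splits.values())
assert split_sum == total_sentences

# Share of the corpus in each set, and average sentence length in tokens.
shares = {name: count / total_sentences for name, count in splits.items()}
avg_tokens_per_sentence = total_tokens / total_sentences

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"average tokens per sentence: {avg_tokens_per_sentence:.2f}")
```

The three sets add up to the stated total, with roughly 72% of the sentences in the training set and about 14% in each test set; the average sentence is about 17 tokens long.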

The division into files is described in "The Workshop 98 data description, division and placement".
The internal format of the files is based on SGML; an SGML document type definition (DTD) is available for it.

PDT version 0.5 is freely available for research purposes, provided that you fill in and submit the Licence Agreement.


