Please note that this page is no longer maintained
and that PDT 0.5 (used only at the JHU Summer Workshop in 1998)
is no longer a supported release. You may want to consider using
the current (as of March 2010) release,
PDT 2.0, instead.
The Prague Dependency Treebank 0.5
The Prague Dependency Treebank (PDT) is a morphologically and
syntactically annotated corpus of Czech as a representative of
inflectionally rich free-word-order languages. (For example, the
Slavic languages, such as Russian, Polish, Serbo-Croatian and many
others, spoken together by more than 350 million people, have
typological properties similar to those of Czech in both morphology
and syntax.)
The Prague Dependency Treebank is, to a certain extent, modeled
after the Penn Treebank, but it uses the dependency syntax
representation of sentences. It has three layers:
- morphological (uses word forms, tags, lemmas)
- analytical, or surface syntax (uses dependencies and analytical
functions of dependencies)
- tectogrammatical, which captures linguistic meaning
(contains tectogrammatical functions such as Actor, Patient,
Addressee, etc.)
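The three layers listed above can be pictured as attributes attached to each token of a sentence. The following is only a minimal sketch, not the PDT file format: the class name `Token`, its field names, the coarse tag values and the helper `children` are all hypothetical, and the analytical labels are illustrative placeholders rather than a definitive inventory.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    # Morphological layer: word form, lemma, tag.
    form: str
    lemma: str
    tag: str  # placeholder coarse tag, NOT the real PDT tagset
    # Analytical layer: dependency (index of the governing token,
    # 0 = artificial root) and the analytical function of that dependency.
    head: int
    afun: str  # illustrative label
    # Tectogrammatical layer: function such as Actor or Patient, if annotated.
    functor: Optional[str] = None


def children(sentence: List[Token], head_index: int) -> List[int]:
    """Return 1-based indices of the tokens governed by head_index."""
    return [i for i, tok in enumerate(sentence, start=1) if tok.head == head_index]


# "Petr čte knihu" (Petr reads a book): the verb governs both nouns.
sentence = [
    Token("Petr", "Petr", "NOUN", head=2, afun="Sb", functor="Actor"),
    Token("čte", "číst", "VERB", head=0, afun="Pred"),
    Token("knihu", "kniha", "NOUN", head=2, afun="Obj", functor="Patient"),
]

root = children(sentence, 0)        # tokens attached to the artificial root
dependents = children(sentence, 2)  # tokens governed by the verb
```

Storing the governor as an index keeps the tree flat and easy to serialize; the actual treebank stores this information in its own SGML-based format.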
The Prague Dependency Treebank is a long-term project which should
end in the year 2000.
The current version is thus preliminary and identified as "PDT version
0.5" (reflecting mostly the amount of material currently
available).
The text material contains samples from the following
sources:
- Lidové noviny (daily newspapers), 1991, 1994, 1995
- Mladá fronta Dnes (daily newspapers), 1992
- Českomoravský Profit (business weekly), 1994
- Vesmír (scientific magazine), Academia Publishers, 1992, 1993
The electronic source has been provided by the Institute of the Czech National
Corpus, in a format jointly developed by the ICNK and IFAL.
PDT version 0.5
The current version of PDT (0.5) contains 456705 tokens (words and
punctuation) in 26610 sentences and 576 files annotated on the morphological
and analytical levels. In order to keep
results of NLP applications comparable, the data has been divided into
a training set (19126 sentences), a development test set (3697
sentences) and a (cross-)evaluation test data set (3787
sentences).
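As a quick consistency check on the figures above, the following sketch sums the three sets and derives each set's share of the sentences. The variable names are illustrative; only the counts come from this page.

```python
# Sentence counts per data set, as stated for PDT 0.5.
splits = {
    "training": 19126,
    "development test": 3697,
    "evaluation test": 3787,
}
total_sentences = sum(splits.values())
total_tokens = 456705

# Share of each set in percent, rounded to one decimal place.
shares = {name: round(100 * n / total_sentences, 1) for name, n in splits.items()}

# Average sentence length in tokens (words and punctuation).
avg_tokens_per_sentence = total_tokens / total_sentences

for name, share in shares.items():
    print(f"{name}: {splits[name]} sentences ({share}%)")
print(f"total: {total_sentences} sentences, "
      f"{avg_tokens_per_sentence:.2f} tokens per sentence on average")
```

The three sets add up to the 26610 sentences reported for the whole release, so the division covers all of the data.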
The division into files is described in
The Workshop 98 data description, division and placement.
The internal format of the files is based on SGML.
The SGML document
type definition is here.
PDT version 0.5 is freely available for research purposes,
provided that you fill in and submit the Licence
Agreement.
Documentation
General information is given in
J. Hajic: Building a Syntactically Annotated Corpus: The Prague
Dependency Treebank. In: Issues of Valency and Meaning,
pp. 106-132, Karolinum, Praha 1998 (PS)
A rough description of the level 1 annotation and a deeper insight
into the level 2 annotation rules are available in Czech in the
Manual for the Annotators.
The inner structure of the level 1 morphological tags can
be better understood from
J. Hajic, B. Hladká: Tagging Inflective Languages: Prediction of
Morphological Categories for a Rich, Structured Tagset.
In: Proceedings of the 36th Annual Meeting of the ACL and the 17th ICCL,
pp. 483-490, Université de Montréal, Montréal 1998 (PS)
The information on the transition from level 2 to level 3 can be
found in:
A. Bohmova and E. Hajicova.
How Much of the Underlying Syntactic Structure
Can Be Tagged Automatically. In Proceedings of ATALA
Workshop, pp. 31-39, Paris, France, 1999. (DOC)
A. Bohmova, P. Sgall: Automatic procedures in tectogrammatical tagging (DOC)
A. Bohmova, J. Panevova and P. Sgall.
Syntactic Tagging Procedure for the Transition from the Analytic
to the Tectogrammatical Tree Structure.
In Proceedings of the Second Workshop on Text, Speech,
Dialogue, pp. , Marianske lazne, Czech
Republic, 1999.
(PS)
A. Bemova, J. Hajic, B. Hladka and J. Panevova.
Morphological and Syntactic Tagging of the Prague
Dependency Treebank. In Proceedings of ATALA Workshop, pp. 21-29,
Paris, France, 1999.
(PS)
J. Hajic, E. Hajicova, J. Panevova and P. Sgall.
Syntax v Ceskem narodnim korpusu
[Syntax in the Czech National Corpus]. In Slovo a
slovesnost, 3, LIX, pp. 168-177, 1998.
E. Hajicova. Prague Dependency Treebank:
From Analytic to Tectogrammatical Annotation. In Proceedings
of the First Workshop on Text, Speech, Dialogue, pp. 45-50, Brno, Czech
Republic, 1998.
E. Hajicova, J. Panevova and P. Sgall.
Language Resources Need Annotations
to Make Them Really Reusable: The Prague Dependency Treebank.
In Proceedings of the First
International Conference on Language Resources, pp. 713-718,
Granada, Spain, 1998.
E. Hajicova. The Prague Dependency Treebank: Crossing Sentence
Boundary. In Proceedings of the Second Workshop on Text, Speech,
Dialogue, pp. , Marianske lazne, Czech Republic,
1999.
(PS)
Supported by
The Treebank has been supported by the following grants and
projects:
This page is maintained by Daniel Zeman. Last change:
27.6.2000 by Webar