What can be found on the PDT 1.0 CD-ROM

See also the CD sitemap.


Corpora

Prague Dependency Treebank (PDT) 1.0
1507333 tokens (98263 sentences, 1897 files) of Czech text, drawn from Lidové noviny (news daily), Mladá fronta Dnes (news daily), Českomoravský profit (business weekly), and Vesmír (scientific magazine), annotated on the morphological and analytical levels.
Raw Czech Texts
Over 39 million tokens (2385000 sentences) of Czech text, drawn from Lidové noviny (news daily) 1994-1995, tokenized but not annotated morphologically or syntactically.
Czech-English Parallel Corpus
877658 Czech tokens and 1010346 English tokens (53117 sentences, 450 articles) of parallel texts, drawn from the Reader's Digest Výběr, automatically sentence-aligned, morphologically/POS analyzed and tagged. The Czech part has also been automatically parsed on the analytical level.
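
As a rough cross-check, the counts quoted above imply an average sentence length for each corpus. The following Python sketch only does the arithmetic on the published figures. The dictionary layout and function names are illustrative, "over 39 million" is treated as exactly 39,000,000 (so the raw-text average is a lower bound), and the 53117 sentences of the parallel corpus are assumed to apply to each language side.

```python
# Token and sentence counts copied from the corpus descriptions above.
# Assumption: "over 39 million" is taken as exactly 39,000,000 tokens,
# so the average computed for the raw texts is only a lower bound.
RAW_TOKENS_LOWER_BOUND = 39_000_000

# Assumption: the 53117 sentences of the parallel corpus are counted
# per language side (the source does not say this explicitly).
CORPORA = {
    "PDT 1.0": (1_507_333, 98_263),
    "Raw Czech Texts (lower bound)": (RAW_TOKENS_LOWER_BOUND, 2_385_000),
    "Parallel corpus, Czech side": (877_658, 53_117),
    "Parallel corpus, English side": (1_010_346, 53_117),
}


def tokens_per_sentence(tokens, sentences):
    """Return the average number of tokens per sentence."""
    return tokens / sentences


def summarize(corpora):
    """Map each corpus name to its average sentence length, rounded to 2 places."""
    return {
        name: round(tokens_per_sentence(tokens, sentences), 2)
        for name, (tokens, sentences) in corpora.items()
    }


if __name__ == "__main__":
    for name, average in summarize(CORPORA).items():
        print(f"{name}: {average} tokens per sentence")
```

On these figures, PDT 1.0 sentences average roughly 15 tokens, and the English side of the parallel corpus has somewhat longer sentences than the Czech side.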

Software

Netgraph (on-line tree search-and-view tool)
Searches a treebank and displays the trees that match the given criteria. Works on-line and provides limited access to PDT data even to those who have not purchased the PDT 1.0 CD-ROM. Requirements: a web browser with the Java Runtime Environment plugin installed.
Tred
A highly customizable tree editor, written in Perl, that supports Perl-based macros. Requirements: Perl with the Perl-tk library.
Free Morphology
Analyzes a word form into all plausible lemma-tag combinations. Generates a word form from a lemma-tag pair. Limited Czech dictionary included. Requirements: Perl.
Czech Taggers
Two approaches to Czech tagging. Both tools aim to select automatically, based on context, the correct lemma-tag pair from the candidates provided by the morphological analyzer (see Free Morphology above).
File format conversion tools
Two formats have been used to encode PDT trees: the older FS format and the newer CSTS format. Perl-based tools are provided to convert FS to CSTS and vice versa. A Penn Treebank-like format can also be used, as far as the theory permits. Other scripts convert generic dependency and phrase structures. Finally, in the early years of the PDT project, so-called compact tags were used instead of today's positional tags. All data on the CD-ROM use the new positional tags, but for users' convenience a tool for converting between compact and positional tags is included.
Useful free software from third parties
Czech fonts, Gzip for Windows, Acrobat Reader for Windows, Linux and Sun Sparc, Ghostscript and Ghostview for Windows, SGML parser.
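
To give a feel for the positional tags mentioned under the file format conversion tools, here is a minimal Python sketch that splits such a tag into named fields. It assumes the 15-position layout described in the Czech tagset documentation (part of speech, detailed part of speech, gender, number, case, and so on), with `-` marking a position that does not apply. The field names, the `decode_tag` helper and the example tag are illustrative and are not taken from the tools on the CD-ROM.

```python
# Assumed names for the 15 positions of a positional tag; consult the
# Czech tagset description for the authoritative list and value sets.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Split a positional tag into a dict of the positions that are set.

    Positions holding "-" are treated as not applicable and left out.
    """
    if len(tag) != len(POSITIONS):
        raise ValueError(
            f"expected {len(POSITIONS)} characters, got {len(tag)}: {tag!r}"
        )
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


if __name__ == "__main__":
    # Hypothetical example tag: a feminine singular noun in the nominative.
    print(decode_tag("NNFS1-----A----"))
```

Because every category has a fixed position, a tag can be inspected or filtered by simple string indexing, which is one practical difference from the older compact tags.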

Documentation

Data
What is PDT
Data technical description
Data location table
Morphological layer
Czech Tagset Description and Quick reference
Manual for the annotators
Analytical layer
Manual for the annotators (also available in Czech)
Tectogrammatical layer
Manual for the annotators (also available in Czech)
Software
Czech Morphology and Tagging
Free Morphology
Czech Taggers
Hidden Markov Model
Feature-based
Tred
Netgraph
Graph

Publications

PDT References (including selected morphological, analytical and tectogrammatical references)
Additional Morphology and Tagging References
Additional Analytical Level References
Additional Tectogrammatical Level References
Czech-English Parallel Corpus References
A compilation of the above