What can be found on the PDT 1.0 CD-ROM
See also the CD sitemap.
Jump to: Corpora | Software | Documentation | Publications
- Prague Dependency Treebank (PDT) 1.0
- 1507333 tokens (98263 sentences, 1897 files) of Czech text, drawn from Lidové noviny (news daily), Mladá fronta Dnes (news daily), Českomoravský profit (business weekly), and Vesmír (scientific magazine), annotated on the morphological and analytical levels.
- Raw Czech Texts
- Over 39 million tokens (2385000 sentences) of Czech text, drawn from Lidové noviny (news daily) 1994-1995, tokenized but annotated neither morphologically nor syntactically.
- Czech-English Parallel Corpus
- 877658 Czech tokens and 1010346 English tokens (53117 sentences, 450 articles) of parallel texts, drawn from the Reader's Digest Výběr, automatically sentence-aligned, morphologically/POS analyzed and tagged. The Czech part has also been automatically parsed on the analytical level.
- Netgraph (on-line tree search-and-view tool)
- Searches a treebank and displays the trees that match the given criteria. Works on-line and provides limited access to PDT data even for those who have not purchased the PDT 1.0 CD-ROM. Requirements: a web browser with the Java Runtime Environment plugin installed.
- Tred
- Highly customizable tree editor, written in Perl, understanding Perl-based macros. Requirements: Perl language with the Perl-tk library.
- Free Morphology
- Analyzes a word form into all plausible lemma-tag combinations. Generates a word form from a lemma-tag pair. Limited Czech dictionary included. Requirements: Perl.
- Czech Taggers
- Two approaches to Czech tagging. Both tools aim to automatically select, based on context, the correct lemma-tag pair from the candidates provided by the morphological analyzer (see Free Morphology above).
- File format conversion tools
- Two different formats have been used to encode PDT trees: the older FS format and the newer CSTS format. Perl-based tools are provided to convert FS to CSTS and vice versa. A Penn Treebank-like format can also be used, as far as the theory permits. Other scripts convert generic dependency and phrase structures. Finally, in the early years of the PDT project, so-called compact tags (as opposed to today's positional tags) were used. All data on the CD-ROM use the new positional tags, but for the users' convenience there is a tool that converts between compact and positional tags.
- Useful free software from third parties
- Czech fonts, Gzip for Windows, Acrobat Reader for Windows, Linux and Sun Sparc, Ghostscript and Ghostview for Windows, SGML parser.
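The positional tags mentioned under the file format conversion tools encode one morphological category per character position. The following is a minimal sketch of reading such a tag, not one of the tools on the CD-ROM. It assumes the 15-position layout of the Czech positional tagset (POS, SubPOS, Gender, Number, Case, PossGender, PossNumber, Person, Tense, Grade, Negation, Voice, two reserved positions, Var) and that "-" marks a position that does not apply; the function name `decode_positional_tag` is hypothetical, and the Czech Tagset Description remains the authoritative reference.

```python
# Assumed layout of a 15-character positional tag; check against the
# Czech Tagset Description before relying on it.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_positional_tag(tag):
    """Map each filled position of a positional tag to its category name.

    Positions holding "-" (assumed to mean "not applicable") are skipped.
    """
    if len(tag) != len(POSITIONS):
        raise ValueError(
            f"expected {len(POSITIONS)} characters, got {len(tag)}: {tag!r}"
        )
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


if __name__ == "__main__":
    # Illustrative tag: a masculine animate noun, singular, nominative, affirmative.
    print(decode_positional_tag("NNMS1-----A----"))
```

The sketch only splits the tag into named positions; interpreting the individual value codes requires the tagset documentation.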
- Data
- What is PDT
- Data technical description
- Data location table
- Morphological layer
- Czech Tagset Description and Quick reference
- Manual for the annotators
- Analytical layer
- Manual for the annotators (a Czech version is also available)
- Tectogrammatical layer
- Manual for the annotators (a Czech version is also available)
- Software
- Czech Morphology and Tagging
- Free Morphology
- Czech Taggers
- Hidden Markov Model
- Feature-based
- Tred
- Netgraph
- Graph
- PDT References (including selected morphological, analytical, and tectogrammatical references)
- Additional Morphology and Tagging References
- Additional Analytical Level References
- Additional Tectogrammatical Level References
- Czech-English Parallel Corpus References
- A compilation of the above