PDT 2.0 - Guide

Jan Hajič

Eva Hajičová

Jaroslava Hlaváčová

Václav Klimeš

Jiří Mírovský

Petr Pajas

Jan Štěpánek

Barbora Vidová Hladká

Zdeněk Žabokrtský


Table of Contents

1. Introduction
1.1. What is PDT 2.0
1.2. Historical background of the project
1.3. Development of the project
1.4. About Czech
1.5. Directory structure
2. Layers of annotation
2.1. Morphological layer
2.1.1. Logical structure
2.1.2. Physical realization
2.1.3. Annotation process
2.2. Analytical layer
2.2.1. Logical structure
2.2.2. Physical realization
2.2.3. Annotation process
2.3. Tectogrammatical layer
2.3.1. Logical structure
2.3.2. Physical realization
2.3.3. Annotation process
2.4. Sample preview of annotation on the three layers
3. Data
3.1. Sources of text
3.2. Division of the data according to the layer of annotation
3.3. Division of the data into training and test sets
3.4. Data formats
3.4.1. PML
3.4.2. Perl Storable Format
3.4.3. FS
3.4.4. CSTS
3.5. Conventions of file naming
3.6. Full data
3.7. Sample data
3.8. PDT-VALLEX
3.9. PDT 1.0 update
4. Tools
4.1. Searching trees: Netgraph
4.2. Viewing (browsing) trees: TrEd
4.3. Automatic tree processing: btred/ntred
4.4. Converting data between formats
4.4.1. Conversion between the PDT formats
4.4.2. Conversion from formats of other treebanks
4.5. Parsing Czech: from plain text to PDT-formatted dependency trees
4.6. Creating data for parser development
4.7. Macros for error detection
5. Documentation
6. Publications
6.1. Theoretical background of PDT
6.2. PDT 2.0
6.2.1. General information
6.2.2. Morphological layer
6.2.3. Analytical layer
6.2.4. Tectogrammatical layer
6.3. Tools
6.3.1. Netgraph
6.3.2. Morphological analysis and tagging
6.3.3. Parsing
6.3.4. Automatic functor assignment
7. Distribution and license
7.1. License agreement
8. Installation
9. Credits
10. Acknowledgments

List of Figures

2.1. Linking the layers
2.2. Data and annotation workflow diagram
2.3. The analytical tree of the example sentence
2.4. The tectogrammatical tree of the example sentence (a detailed view)
3.1. Number of tokens from the particular sources
3.2. Division of the data to layers
3.3. Division of the data into training and test sets
3.4. PDT-VALLEX sample entry in the presentation format
3.5. PDT-VALLEX in the TrEd editor
4.1. Creating a query in Netgraph
4.2. A result tree in Netgraph
4.3. Tectogrammatical tree in TrEd

List of Tables

2.1. An example sentence
2.2. Morphological analysis of the example sentence
3.1. Data annotated on all three layers (tamw)
3.2. Data annotated only on m-layer and a-layer (amw)
3.3. Data annotated only on m-layer (mw)
3.4. Alternative grouping: All data annotated on m-layer (union of tamw, amw, and mw).
3.5. Alternative grouping: All data annotated on a-layer (union of tamw and amw).