DZ Parser

© 1995-2004 Daniel Zeman

The DZ parser is a program that reads a preprocessed Czech sentence and returns a dependency tree describing the syntax of the input sentence. It assumes its input has been tokenized, morphologically annotated and disambiguated, and saved in the CSTS format of the Prague Dependency Treebank. The output is in the same format.

There are two scripts: train.pl reads the training data and writes the trained statistical model; parse.pl reads the model and a preprocessed text and writes the parsed text.

Both scripts can be run in two different modes. In the default mode, train.pl reads the training corpus from the standard input, writes the trained model to the standard output and everything else to the standard error output. The parser takes the path to the trained model via an option, reads the preprocessed text from the standard input, writes the parsed text to the standard output and everything else to the standard error output. No secondary files are created on the disk. Additional configuration options can be supplied on the command line or in the parser.ini file.

In the debug mode, a working directory must be supplied where the user has write permission. The text to be parsed contains manually assigned dependencies, so the parser can compare its decisions to the gold standard and compute its accuracy. The paths to the training and test data are typically stored in the parser.ini file and shared among many runs of the parser. The parsed text, as well as any logs and diagnostics, is written to numbered files in the working directory, where the number uniquely identifies the experiment. Everything the parser says is preserved, so regular cleaning is needed: the working directory can easily consume a considerable part of the disk space.
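For illustration, the accuracy reported in the debug mode is the share of tokens whose predicted governor matches the manually assigned one. A minimal Perl sketch of such a computation follows; this is not the parser's own code, and the arrays are made-up examples.

use strict;
use warnings;

# Parallel arrays of governor indices, one element per token.
# @gold holds the manually assigned governors, @predicted those chosen by a parser.
my @gold      = (2, 0, 2, 3, 3);
my @predicted = (2, 0, 3, 3, 3);

# Count the tokens whose predicted governor equals the gold-standard governor.
my $correct = grep { $predicted[$_] == $gold[$_] } 0 .. $#gold;
printf "Accuracy: %.2f %%\n", 100 * $correct / scalar(@gold);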

The parser is not documented. Detailed comments in the source code and in the parser.ini file will help those who understand Czech. Note that the source code uses the ISO 8859-2 encoding. The diagnostic messages are in Czech as well. The encoding of the diagnostics can be specified in the parser.ini file: the default is iso-8859-2, but cp852 may be needed for the MS Windows command line, and cp1250 should be used if the logs are saved and later viewed in an MS Windows editor.
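If a log has already been written in one encoding and needs to be viewed in another, standard Perl I/O layers can recode it. The following is a generic recipe with made-up file names, not part of the parser itself.

use strict;
use warnings;

# Recode a log from ISO 8859-2 to CP1250 so it displays correctly in an MS Windows editor.
# The file names are only examples.
open my $in,  '<:encoding(iso-8859-2)', 'experiment.log'     or die "Cannot read log: $!";
open my $out, '>:encoding(cp1250)',     'experiment-win.log' or die "Cannot write log: $!";
print {$out} $_ while <$in>;
close $in;
close $out;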

Installation

The parser is implemented in Perl. Since Perl is an interpreted language, a Perl interpreter must be installed on your system before you can run the parser; it can be downloaded for free.
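Perl is available, for instance, from www.perl.org. To check whether an interpreter is already installed, run:

perl -v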

Unzip the contents of parser.zip into one folder. The .pm modules contain code loaded by the scripts. If you intend to call the scripts from a different folder, you may need to modify the PERLLIB environment variable so that the scripts can find their modules. At the moment the scripts also load some data files (2ice.txt, for instance) and expect to find them in the current folder. These files are also distributed in parser.zip.
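For example, assuming the archive was unpacked into /opt/dzparser (the path is only an example), the module search path can be extended before the scripts are called:

export PERLLIB=/opt/dzparser:$PERLLIB      (Unix shells)
set PERLLIB=C:\dzparser;%PERLLIB%          (MS Windows command line)

Because the data files are looked for in the current folder, the simplest arrangement is still to run the scripts from the folder where the archive was unpacked.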

The simplest usage

All one has to do is call parse.pl. The parser first reads the trained statistical model; currently the correct model is pracovni/374.stat. If the parser is to be retrained, train.pl is the script to call.

perl train.pl < training_corpus.csts > model.stat

perl parse.pl --stat=model.stat < preprocessed.csts > parsed.csts

Both scripts read parser.ini in the current folder by default. It is possible to drive them completely by the directives in this file, so they can be called without any command-line arguments or standard stream redirections.
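As an illustration only, such a file might look like the sketch below. The option names here are hypothetical; the authoritative, commented list of options is in the parser.ini distributed with the parser.

# Hypothetical sketch; the real option names are documented (in Czech) in the distributed parser.ini.
stat     = pracovni/374.stat   # trained model, the same value as --stat on the command line
training = data/train.csts     # training data read by train.pl
test     = data/test.csts      # annotated test data used in the debug mode
workdir  = work                # working directory for the numbered logs of the debug mode
encoding = iso-8859-2          # encoding of the diagnostic messages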

Download

My thesis

The theoretical aspects of the parser and its detailed evaluation on PDT are given in my PhD thesis. It is available in various formats here.