DZ Parser 2.0

DZ Parser is a program that reads a pre-processed natural language sentence and returns a dependency tree describing the syntax of that sentence. It assumes that its input has been tokenized, morphologically annotated, morphologically disambiguated, and saved in the CSTS format of the Prague Dependency Treebank. The output is in the same format. (Note: Tools for conversion to and from the CoNLL shared task format are now included.)

There are two scripts: train.pl reads the training data and writes the trained statistical model; parse.pl reads the model and a preprocessed text and writes the parsed text.

Both scripts can be run in two different modes. In the default mode, train.pl reads the training corpus from the standard input, writes the trained model to the standard output and everything else to the standard error output. The parser takes the path to the trained model via an option, reads the preprocessed text from the standard input, writes the parsed text to the standard output and everything else to the standard error output. No secondary files are created on the disk. Additional configuration options can be supplied on the command line or in the parser.ini file. (This configuration file is self-documented in both Czech and English.)

In the debug mode, a working directory must be supplied in which the user has write permission. The text to be parsed contains manually assigned dependencies, so the parser can compare its decisions to the gold standard and compute accuracy. The paths to the training and test data are typically stored in the parser.ini file and shared among many runs of the parser. The parsed text, as well as any logs and diagnostics, is written to numbered files in the working directory, where the number uniquely identifies the experiment. Everything the parser says is preserved, but the working directory needs regular cleaning because it can easily consume a considerable part of the disk space.

Installation

The parser is implemented in Perl. Since Perl is an interpreted language, it must be installed on your system before you can run the parser. Perl is available for most platforms and can be downloaded for free.

Unzip the contents of parser.zip into one folder. The .pm modules contain code to be loaded by the scripts. You may need to modify the PERLLIB environment variable if you intend to call the scripts from elsewhere so that the scripts can find their modules.
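The following is a minimal sketch for a POSIX shell, assuming the archive was unzipped into the hypothetical folder /opt/dzparser (the DZPARSER_HOME variable is likewise only an illustration, not something the parser itself reads). Note that Perl ignores PERLLIB when PERL5LIB is defined, so if PERL5LIB is already set on your system, extend that variable instead.

```shell
# Hypothetical install location; replace with the folder holding the .pm modules.
DZPARSER_HOME="${DZPARSER_HOME:-/opt/dzparser}"

# Prepend the folder to PERLLIB, keeping any value that is already set.
# ${PERLLIB:+:$PERLLIB} expands to ":<old value>" only when PERLLIB is non-empty.
PERLLIB="$DZPARSER_HOME${PERLLIB:+:$PERLLIB}"
export PERLLIB
```

With the variable exported, a call such as perl "$DZPARSER_HOME/parse.pl" --stat=model.stat < preprocessed.csts > parsed.csts should find the modules from any folder. Keep in mind that parser.ini is looked up in the current folder.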

The simplest usage

All you have to do is call parse.pl. The parser first reads the trained statistical model. You may download pre-trained models from this site or train the parser yourself. If the parser is to be retrained, train.pl is the script to call.

perl train.pl < training_corpus.csts > model.stat

perl parse.pl --stat=model.stat < preprocessed.csts > parsed.csts
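Because both scripts in the default mode write only their result to the standard output and everything else to the standard error output, the diagnostics can be captured separately with an ordinary 2> redirection. The sketch below wraps the two commands above in shell functions; the function names and the log file names (train.log, parse.log) are hypothetical and only serve the illustration.

```shell
# Sketch only: train_model, parse_text, and the log file names are hypothetical.
# Standard output carries the model or the parsed text; standard error
# carries everything else, so 2> keeps the two apart.

train_model() {
    # $1 = training corpus (CSTS), $2 = model file to write, $3 = log file
    perl train.pl < "$1" > "$2" 2> "$3"
}

parse_text() {
    # $1 = model file, $2 = preprocessed text (CSTS),
    # $3 = parsed output, $4 = log file
    perl parse.pl --stat="$1" < "$2" > "$3" 2> "$4"
}

# Example calls (commented out; they need the parser and the data files):
# train_model training_corpus.csts model.stat train.log
# parse_text model.stat preprocessed.csts parsed.csts parse.log
```

Keeping the log in its own file means the model and the parsed text are never mixed with progress messages, which matches the stream layout described for the default mode.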

Both scripts read parser.ini in the current folder by default. They can be controlled entirely by the settings in this file, so they can be called without any arguments or standard stream redirections.

What's new in version 2.0

UTF-8 is now the default encoding.
Most text files with additional model data are now part of the trained model.
New pretrained models pdt2.stat (Czech, trained on PDT 2.0) and padt.stat (Arabic).
New scripts atrain.pl and aclass.pl for learning and assigning the analytical functions (DEPREL labels).
A voting superparser, vote.pl, can read the outputs of multiple parsers in CoNLL format and let them vote on the final parse.
Lots of accompanying Perl tools, mostly for treebank manipulation, data format conversion etc.

License

DZ Parser is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Acknowledgements

This research has been supported by the grant MSM 0021620838 of the Ministry of Education of the Czech Republic.

Download

dzparser-2.0.zip (326 kB - without the trained model files)
dzparser-2.0-models.zip (33,961 kB - the trained model files)

Version 1.0

Version 1.0 is identical to the one I based my PhD thesis on; you can see the historical page (including download) here.

My thesis

The theoretical aspects of the parser and its detailed evaluation on PDT are given in my PhD thesis. It is available in various formats here.