A Complete Guide to Czech Language Parsing

If an application is supposed to understand natural language to some extent, it usually has to parse the input utterances syntactically. That is, it attempts to discover the relations between the words of a sentence and the way their meanings combine to form the overall meaning of the sentence. The application module responsible for this is called a parser.

A whole range of syntactic formalisms has been proposed to model the syntax of natural language. Most parsers rely heavily on treebanks: corpora of written or spoken utterances in which the word-to-word relations have been annotated manually by linguistically trained annotators. As most treebanks are bound to a particular syntactic formalism, the formalism used by a parser is usually determined by the data available for its training.

Two formalisms deserve special attention: dependency syntax, as defined in the Prague Dependency Treebank (PDT), and constituent syntax, as defined in the Penn TreeBank. The former is the most important formalism for Czech (and the only one for which a treebank is available); the latter applies to English and is probably the most popular formalism in the world.

PDT has been annotated on three layers: morphological, analytical and tectogrammatical. The analytical layer corresponds to surface syntax, the tectogrammatical layer to deep syntax. Thus the analytical representation (AR) is closer to the form of the sentence as it appears in the text, while the tectogrammatical representation (TR) is closer to its meaning. PDT 1.0 (issued 2001) contained ARs and just a tiny sample of TRs. A considerable number of TRs first appeared in PDT 2.0 (issued 2005). That is why there are no parsers producing TRs yet.

The annotation in ARs consists of two parts: the dependency structure (tree) and the analytical functions (also called syntactic tags, or s-tags). Most parsers concentrate only on the tree structure and do not assign s-tags. Nevertheless, s-tag assignment is a rather easy task once the structure has been built. The linguistic description of what the ARs of particular language constructions should look like is given in the Manual for the annotators (a Czech version is also available). The list and description of possible s-tags is given there as well.

An analytical dependency structure is a rooted tree in which each node (except the root) corresponds to one word of the underlying sentence, and each word has a corresponding node. The simplest representation of such a tree is a sequence of integers: the i-th position in the sequence corresponds to the i-th word of the underlying sentence, and the number in that position is the index of the word on which the i-th word depends. We use the terms dependent, depending node or child for the i-th word, and governor, governing node or parent for the other word.
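The parent-index representation described above can be sketched in a few lines of Python. This is only an illustration, not code from any of the parsers discussed here. It assumes 1-based word indices and uses 0 for the root node; the function names `is_valid_tree` and `children_of`, as well as the three-word example sentence, are made up for the sketch.

```python
from typing import List


def is_valid_tree(heads: List[int]) -> bool:
    """Check that a parent-index sequence encodes a rooted tree.

    heads[i - 1] is the parent index of the i-th word (1-based).
    Index 0 stands for the root node (an assumption of this sketch).
    """
    n = len(heads)
    # Every parent index must point to the root or to another word.
    for i, head in enumerate(heads, start=1):
        if not 0 <= head <= n or head == i:
            return False
    # Following parent links from any word must end at the root (no cycles).
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def children_of(heads: List[int], parent: int) -> List[int]:
    """Return the 1-based indices of all words that depend on `parent`."""
    return [i for i, head in enumerate(heads, start=1) if head == parent]


# Illustrative sentence: "Petr čte knihu" (Petr reads a book).
# Both "Petr" (1) and "knihu" (3) depend on "čte" (2), which hangs on the root.
example_heads = [2, 0, 2]
```

With this encoding, `children_of(example_heads, 2)` lists the dependents of the verb, and a sequence such as `[2, 1, 2]` is rejected because words 1 and 2 govern each other.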

The standard method of evaluating parser accuracy is computing the percentage of children that got the correct parent index, among all words in a test data set.
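As a minimal sketch of this measure, assuming each sentence is given as a list of parent indices (one list for the gold annotation, one for the parser output), the accuracy can be computed as follows. The function name `attachment_accuracy` is hypothetical; the count is pooled over all words of the test set rather than averaged per sentence, as the definition above implies.

```python
from typing import List, Sequence


def attachment_accuracy(
    gold: Sequence[List[int]], predicted: Sequence[List[int]]
) -> float:
    """Percentage of words whose predicted parent index equals the gold one."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted differ in number of sentences")
    correct = 0
    total = 0
    for gold_heads, predicted_heads in zip(gold, predicted):
        if len(gold_heads) != len(predicted_heads):
            raise ValueError("sentence lengths differ")
        # Count the words that received the correct parent index.
        correct += sum(g == p for g, p in zip(gold_heads, predicted_heads))
        total += len(gold_heads)
    if total == 0:
        raise ValueError("empty test set")
    return 100.0 * correct / total
```

For example, with gold trees `[[2, 0, 2], [0, 1]]` and parser output `[[2, 0, 1], [0, 1]]`, four of the five words have the correct parent.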

Tests on the Prague Dependency Treebank 1.0

PDT 1.0 provides two data sets intended for evaluating analytical parsers: the d-test (development) and the e-test (cross-evaluation). See also the PDT 1.0 Data Layout Table. The d-test consists of 153 files, 7,319 non-empty sentences, and 126,030 words. Evaluation on the d-test data is available for most parsers, so for the sake of comparability we stick with that data here.

The following table gives the accuracy figures for various parsers on the PDT 1.0 d-test data. (Note: some of the parsers are still under development. We try to maintain here either their published results or, where we have the parser or its output on the d-test data available, the results we measured ourselves.)

| Author (parser) | Accuracy | Notes |
|---|---|---|
| Combination ec+mc+zž+dz | 86.3 | Zeman & Žabokrtský (2005) |
| Hall/Novák/Charniak | 85.0 | Hall & Novák (2005) |
| Ryan McDonald | 84.4 | McDonald et al. (2005) |
| Eugene Charniak | 84.3 | Charniak (2000) describes the original parser for English. Czech results measured by Zeman on the output provided by Charniak in 2003. |
| Michael Collins | 82.5 | Collins et al. (1999) gives results on PDT 0.5. Re-run and re-measured on PDT 1.0 by Zeman. |
| Joakim Nivre | 80.1 | Nivre & Nilsson (2005) |
| Zdeněk Žabokrtský | 75.2 | Parser run and accuracy measured by Zeman in 2004. |
| Daniel Zeman | 74.7 | Zeman (2004a) |
| Václav Klimeš | 74.7 | Accuracy reported by Klimeš in 2006; to be published. |
| Tomáš Holan (r2l) | 71.7 | Measured by Zeman on parser output provided by Holan in early 2004. |
| Tomáš Holan (l2r) | 69.9 | Measured by Zeman on parser output provided by Holan in early 2004. |
| Tomáš Holan (pshrt) | 62.8 | Measured by Zeman on parser output provided by Holan in early 2004. |

Note that, due to version incompatibility, Charniak's parser cannot be retrained and is gradually being deprecated. Collins' parser will be included on the PDT 2.0 CD-ROM. We plan to make other parsers (dz, zž) available as well. Let us know if you are interested.

Tests on the analytical layer of the Prague Dependency Treebank 2.0

PDT 2.0 provides two data sets intended for evaluating analytical parsers: the d-test (development) and the e-test (cross-evaluation). Each of those sets is split into two parts, one that has tectogrammatical annotation as well (tamw/[de]test/*.a) and one that does not (amw/[de]test/*.a). For analytical parsing, both parts have to be combined. See also the PDT 2.0 Data Description. The d-test data consists of 9,270 sentences and 158,962 tokens. The e-test data consists of 10,148 sentences and 173,586 tokens. Do not use this data to test parsers that have been trained on PDT 1.0! Some of the current test data were declared as training data in PDT 1.0!
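Combining the two parts of a test set amounts to collecting the analytical files from both directories. The following sketch assumes that the patterns quoted above are relative to some data root directory, and that the bracketed pattern `[de]test` stands for subdirectories named `dtest` and `etest`. The `data_root` parameter and the function name `collect_test_files` are hypothetical, and the actual layout on the distribution medium may differ.

```python
from pathlib import Path
from typing import List


def collect_test_files(data_root: Path, split: str) -> List[Path]:
    """Collect analytical files (*.a) of one test split from both parts.

    `split` is "dtest" or "etest". Files are taken from the part with
    tectogrammatical annotation (tamw) and the part without it (amw).
    """
    if split not in ("dtest", "etest"):
        raise ValueError("split must be 'dtest' or 'etest'")
    files: List[Path] = []
    for part in ("tamw", "amw"):
        # Path.glob yields nothing if the directory does not exist.
        files.extend((data_root / part / split).glob("*.a"))
    return sorted(files)
```

Evaluating on the list returned for `"dtest"` or `"etest"` then covers the whole test set rather than only one of its two parts.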

When running parsing experiments or reporting results on PDT 2.0, extra care must be taken as to which source of morphological information the parser used: undisambiguated, automatically disambiguated (by which tagger?) or manually disambiguated. In fact, the same had to be taken into account when working with PDT 1.0. However, with version 2.0 it is easier to overlook that one is actually working with the wrong source of morphology, because:

It is strongly recommended that everyone report results of experiments in which the parser had no access to any human annotation in the test data, including morphology (of course, use everything you find useful in the training data). The obvious reason is that your parser is unlikely to have such information available in a real-world application.

The following table gives the accuracy figures for various parsers on the PDT 2.0 test data. (Note: some of the parsers are still under development. We try to maintain here either their published results or, where we have the parser or its output on the d-test data available, the results we measured ourselves.)

| Author (parser) | D-test accuracy | E-test accuracy | Notes |
|---|---|---|---|
| Combination rmd+mc+zž+5×th* | 86.2 | 85.8 | Holan & Žabokrtský (2006), Simply Weighted Parsers (SWP) |
| McDonald/Novák/Žabokrtský | 84.7 (test set not indicated) | | Feature engineering over McDonald's MST parser. See Novák & Žabokrtský (2007). |
| Ryan McDonald | 84.2 | 84.0 | Same parser as in McDonald et al. (2005), run by Václav Novák in 2006. PDT 2.0 results published in Holan & Žabokrtský (2006). Automatically disambiguated tags used during both training and parsing. |
| Michael Collins | 81.6 | 80.9 | Same parser as in Collins et al. (1999), PDT 2.0 results published in Holan & Žabokrtský (2006). Automatically disambiguated tags used during both training and parsing. |
| Zdeněk Žabokrtský | 76.1 | 75.9 | A rule-based parser, described in Holan & Žabokrtský (2006). Automatically disambiguated tags used. |
| Daniel Zeman | 75.0 | 74.8 | Same parser and settings as in Zeman (2004a), run by Zeman in 2006. Automatically disambiguated tags used during both training and parsing. |
| Václav Klimeš | 74.8 | 74.6 | Accuracy reported by Klimeš in 2006; to be published. Automatically disambiguated tags used during both training and parsing. |
| Tomáš Holan (r2l) | 74.0 | 73.9 | Pushdown automaton parser (Holan & Žabokrtský, 2006). Automatically disambiguated tags used. |
| Tomáš Holan (l2r) | 71.4 | 71.3 | Pushdown automaton parser (Holan & Žabokrtský, 2006). Automatically disambiguated tags used. |
| Tomáš Holan (analog) | 71.5 | 71.1 | A parser that "searches for the local tree configuration most similar to the training data" (Holan & Žabokrtský, 2006) (after all, which parser does not?) The parser itself shall be described in Holan (2005). Automatically disambiguated tags used. |
| Tomáš Holan (r23) | 61.1 | 61.7 | Pushdown automaton parser (Holan & Žabokrtský, 2006). 3-letter word endings used, instead of tags. |
| Tomáš Holan (l23) | 54.9 | 53.3 | Pushdown automaton parser (Holan & Žabokrtský, 2006). 3-letter word endings used, instead of tags. |

CoNLL Shared Task 2006

The CoNLL-X (2006) shared task involved dependency parsing of 13 languages, including Czech. Training and test data were taken from PDT 1.0. However, the published results are not directly comparable to the results presented above, for the following reasons:

For an overview of the results by the various teams, see Buchholz & Marsi (2006).

| Authors | Labeled accuracy | Notes |
|---|---|---|
| Joakim Nivre | 82.4 | Run later on the CoNLL-X data, see Nivre (2009). |
| Ryan McDonald, Kevin Lerman, Fernando Pereira | 80.2 | |
| Joakim Nivre, Johan Hall, Jens Nilsson, Gülşen Eryiğit, Svetoslav Marinov | 78.4 | |
| John O'Neil | 76.6 | |
| Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto | 76.2 | |
| Kenji Sagae | 75.2 | |
| Simon Corston-Oliver, Anthony Aue | 74.5 | |
| Ming-Wei Chang, Quang Do, Dan Roth | 72.9 | |
| Richard Johansson, Pierre Nugues | 71.5 | |
| Xavier Carreras, Mihai Surdeanu, Lluís Màrquez | 68.8 | |
| Sebastian Riedel, Ruket Çakıcı, Ivan Meza-Ruiz | 67.4 | |
| Eckhard Bick | 63.0 | |
| Sander Canisius, Toine Bogers, Antal van den Bosch, Jeroen Geertzen, Erik Tjong Kim Sang | 60.9 | |
| Markus Dreyer, David A. Smith, Noah A. Smith | 60.5 | |
| Giuseppe Attardi | 59.8 | |
| Yu-Chieh Wu, Yue-Shi Lee, Jie-Chi Yang | 59.4 | |
| Ting Liu, Jinshan Ma, Huijia Zhu, Sheng Li | 58.5 | |
| Michael Schiehlen, Kristina Spranger | 53.3 | |
| Deniz Yuret | 51.9 | |

CoNLL Shared Task 2007

The CoNLL 2007 shared task involved dependency parsing of 10 languages including Czech. Training and test data were taken from PDT 2.0.

For an overview of the results by the various teams, see Nivre et al. (2007).

| Authors | Labeled | Unlabeled |
|---|---|---|
| Tetsuji Nakagawa | 80.19 | 86.28 |
| Xavier Carreras | 78.60 | 85.16 |
| Jens Nilsson, Johan Hall, Joakim Nivre, Gülşen Eryiğit, Beáta Megyesi, Mattias Nilsson, Markus Saers | 77.98 | 83.59 |
| Ivan Titov, James Henderson | 77.94 | 84.19 |
| Giuseppe Attardi, Felice Dell'Orletta, Maria Simi, Atanas Chanev, Massimiliano Ciaramita | 77.37 | 83.40 |
| Johan Hall, Jens Nilsson, Joakim Nivre, Gülşen Eryiğit, Beáta Megyesi, Mattias Nilsson, Markus Saers | 77.22 | 82.35 |
| Xiangyu Duan, Jun Zhao, Bo Xu | 75.34 | 80.82 |
| Kenji Sagae, Jun'ichi Tsujii | 74.83 | 81.27 |
| Michael Schiehlen, Kristina Spranger | 73.86 | 81.73 |
| Wenliang Chen, Yujie Chang, Hitoshi Isahara | 73.69 | 80.14 |
| Le-Minh Nguyen, Akira Shimazu, Phuong-Thai Nguyen, Xuan-Hieu Phan | 72.54 | 80.91 |
| Keith Hall, Jiří Havelka, David A. Smith | 72.27 | 78.47 |
| Richard Johansson, Pierre Nugues | 70.98 | 77.39 |
| Prashanth Reddy Mannem | 70.68 | 77.20 |
| Maes | 67.38 | 74.03 |
| Yu-Chieh Wu, Jie-Chi Yang, Yue-Shi Lee | 66.72 | 73.07 |
| Sander Canisius, Erik Tjong Kim Sang | 56.14 | 72.12 |
| Jia | 54.95 | 70.41 |
| Svetoslav Marinov | 53.47 | 59.57 |
| Daniel Zeman | 50.21 | 59.19 |
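The two columns of the table above differ in whether the dependency label (the s-tag) must match in addition to the parent index. A minimal sketch of both scores follows, assuming that each token is given as a (parent index, label) pair. The function name `labeled_unlabeled_accuracy` is hypothetical, and the sketch scores every token; any special treatment of punctuation in the official evaluation scripts is not modeled here.

```python
from typing import List, Sequence, Tuple

Token = Tuple[int, str]  # (parent index, dependency label)


def labeled_unlabeled_accuracy(
    gold: Sequence[List[Token]], predicted: Sequence[List[Token]]
) -> Tuple[float, float]:
    """Return (labeled, unlabeled) accuracy as percentages over all tokens."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted differ in number of sentences")
    head_ok = 0
    head_and_label_ok = 0
    total = 0
    for gold_sentence, predicted_sentence in zip(gold, predicted):
        if len(gold_sentence) != len(predicted_sentence):
            raise ValueError("sentence lengths differ")
        for (g_head, g_label), (p_head, p_label) in zip(
            gold_sentence, predicted_sentence
        ):
            total += 1
            if g_head == p_head:
                head_ok += 1  # counts towards the unlabeled score
                if g_label == p_label:
                    head_and_label_ok += 1  # counts towards the labeled score
    if total == 0:
        raise ValueError("empty test set")
    return 100.0 * head_and_label_ok / total, 100.0 * head_ok / total
```

Since a labeled match requires an unlabeled one, the labeled score can never exceed the unlabeled score, which is consistent with every row of the table.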

References

The following list of publications gives a picture of the parsing results achieved within the ÚFAL research projects, together with some relevant publications by authors at other sites.


Maintained by Daniel Zeman
Updated on 5 October 2009