A Complete Guide to Czech Language Parsing

If an application is to understand natural language to some extent, it usually has to parse the input utterances syntactically: that is, it attempts to discover the relations between the words of a sentence and the way their meanings combine to form the overall meaning of the sentence. The application module responsible for that is called a parser.

A whole range of syntactic formalisms has been proposed to model the syntax of a natural language. Most parsers rely heavily on treebanks: corpora of written or spoken utterances in which the word-to-word relations have been annotated manually by linguistically trained annotators. As most treebanks are bound to a particular syntactic formalism, the formalism used by a parser is usually determined by the data available for its training.

Two particular formalisms deserve our special attention: the dependency syntax, as defined in the Prague Dependency Treebank (PDT), and the constituent syntax, as defined in the Penn TreeBank. The former is the most important (and the only for which a treebank is available) formalism for Czech; the latter applies to English and was historically more popular in some parts of the world.

PDT has been annotated in three layers, called morphological, analytical and tectogrammatical. The analytical layer corresponds to the surface syntax, the tectogrammatical to the deep syntax. Thus the analytical representation (AR) is closer to the appearance of the sentence in the text, while the tectogrammatical representation (TR) is closer to the meaning. PDT 1.0 (released in 2001) contained ARs and just a tiny sample of TRs. TRs first appear in considerable amounts in PDT 2.0 (released in 2005).

The annotation in ARs consists of two parts: the dependency structure (tree), and the analytical functions (also syntactic tags, s-tags or dependency relation labels). Some parsers concentrate only on the tree structure and do not assign the s-tags. Nevertheless, s-tag assignment is a rather easy task once the structure has been built. The linguistic description of how the ARs of particular language constructions should look is given in the Manual for the annotators (Czech version here). The list and description of possible s-tags is given there as well.
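To illustrate why s-tag assignment is comparatively easy once the tree exists, here is a toy rule-based labeler. The analytical function names (Pred, Obj, Atr, Adv, AuxP) are real PDT labels, but the heuristics and the coarse part-of-speech names below are invented for this sketch and are far cruder than what the annotation manual prescribes:

```python
# Toy s-tag assigner: given the already-built tree structure, pick a label
# for the edge between a child and its governor. Heuristics are illustrative
# only; PDT uses positional morphological tags and much finer rules.
def guess_stag(child_pos, parent_pos, parent_is_root):
    if parent_is_root and child_pos == "VERB":
        return "Pred"   # the main verb hangs directly on the root
    if child_pos == "ADP":
        return "AuxP"   # prepositions receive an auxiliary tag
    if child_pos == "NOUN" and parent_pos == "NOUN":
        return "Atr"    # a noun modifying a noun is an attribute
    if child_pos == "NOUN" and parent_pos == "VERB":
        return "Obj"    # crude: distinguishing Sb/Obj would need case info
    return "Adv"        # fallback label

print(guess_stag("ADP", "VERB", False))   # -> AuxP
```

The point of the sketch is that each label depends mostly on local information (the child, its governor, and their morphology), which is why labeling is much easier than finding the structure itself.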

An analytical dependency structure is a rooted tree in which each node (except the root) corresponds to one word of the underlying sentence (and each word has a corresponding node). The simplest representation of such a tree is a sequence of integers: the i-th position in the sequence corresponds to the i-th word of the sentence, and the number at that position is the index of the word on which the i-th word depends. We use the terms dependent, depending node or child for the i-th word, and governor, governing node or parent for the other word.
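The parent-index representation can be made concrete with a short sketch. The sentence and its analysis are invented for illustration; index 0 stands for the artificial root node:

```python
# "Pes štěkal na kočku" ("The dog barked at the cat"), analyzed PDT-style:
# the preposition "na" depends on the verb and governs the noun "kočku".
words = ["Pes", "štěkal", "na", "kočku"]
heads = [2, 0, 2, 3]   # heads[i-1] is the parent index of word i; 0 = root

def children(heads, parent):
    """Return the 1-based indices of words whose governor is `parent`."""
    return [i + 1 for i, h in enumerate(heads) if h == parent]

print(children(heads, 2))   # words depending on "štěkal" -> [1, 3]
```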

The standard method of evaluating parser accuracy is computing the percentage of children that got the correct parent index, among all words in a test data set. This is also called the unlabeled attachment score (UAS) to emphasize that labels of the dependency relations are not evaluated. Alternatively we can require that both the parent is identified and the relation is labeled correctly. Then we have the labeled attachment score (LAS). Unless specifically noted otherwise, the term accuracy in this overview means UAS.
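A minimal sketch of both metrics, assuming the gold-standard and predicted analyses are given as parallel lists of parent indices and relation labels (the example values are invented):

```python
def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels):
    """Return (UAS, LAS) as fractions over all tokens."""
    n = len(gold_heads)
    # UAS: correct parent index only.
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    # LAS: correct parent index AND correct relation label.
    las = sum((g, gl) == (p, pl)
              for g, gl, p, pl in zip(gold_heads, gold_labels,
                                      pred_heads, pred_labels)) / n
    return uas, las

# 4 tokens: 3 heads are correct, 2 of those are also labeled correctly.
uas, las = attachment_scores([2, 0, 2, 3], ["Sb", "Pred", "AuxP", "Adv"],
                             [2, 0, 2, 2], ["Sb", "Pred", "Obj", "Adv"])
print(uas, las)   # -> 0.75 0.5
```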

Tests on the Prague Dependency Treebank 1.0

PDT 1.0 provides two data sets intended for the evaluation of analytical parsers: the d-test (development) and the e-test (cross-evaluation). See also the PDT 1.0 Data Layout Table. The d-test consists of 153 files, 7,319 non-empty sentences, and 126,030 words. Evaluation on the d-test data is available for most parsers, so for the sake of comparability we stick with that data here.

The following table gives the accuracy figures for various parsers on the PDT 1.0 d-test data. (Note: development of some of the parsers is ongoing. We try to maintain here either their published results, or the results we measured ourselves when we have the parser or its output on the d-test data available.)

Author (parser) Accuracy Notes
Combination ec+mc+zž+dz 86.3 Zeman & Žabokrtský (2005)
Hall/Novák/Charniak 85.0 Hall & Novák (2005)
Ryan McDonald 84.4 McDonald et al. (2005)
Eugene Charniak 84.3 Charniak (2000) describes the original parser for English. Czech results measured by Zeman on the output provided by Charniak in 2003.
Michael Collins 82.5 Collins et al. (1999) gives results on PDT 0.5. Re-run and re-measured on PDT 1.0 by Zeman.
Joakim Nivre 80.1 Nivre & Nilsson (2005)
Zdeněk Žabokrtský 75.2 Parser run and accuracy measured by Zeman in 2004.
Daniel Zeman 74.7 Zeman (2004a)
Václav Klimeš 74.7 Accuracy reported by Klimeš in 2006; to be published.
Tomáš Holan (r2l) 71.7 Measured by Zeman on parser output provided by Holan in early 2004.
Tomáš Holan (l2r) 69.9 Measured by Zeman on parser output provided by Holan in early 2004.
Tomáš Holan (pshrt) 62.8 Measured by Zeman on parser output provided by Holan in early 2004.

Note that due to version incompatibility, Charniak's parser cannot be re-trained. Collins' parser was included on the PDT 2.0 CD-ROM.

Tests on the analytical layer of the Prague Dependency Treebank 2.0

PDT 2.0 provides two data sets intended for the evaluation of analytical parsers: the d-test (development) and the e-test (cross-evaluation). Each of these sets is split into two parts, one that has tectogrammatical annotation as well (tamw/[de]test/*.a) and one that does not (amw/[de]test/*.a). For analytical parsing, both parts have to be combined. See also the PDT 2.0 Data Description. The training data consists of 68,562 sentences and 1,172,299 tokens. The d-test data consists of 9,270 sentences and 158,962 tokens. The e-test data consists of 10,148 sentences and 173,586 tokens. Do not use this data to test parsers that have been trained on PDT 1.0: some of the current test data were declared training data in PDT 1.0!
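Pooling the two parts of a test set can be sketched as a pair of globs over the directory layout quoted above (the root path in the usage line is hypothetical; adjust it to your installation):

```python
import glob
import os

def analytical_files(pdt_root, split="dtest"):
    """Pool the *.a files of one split from both the tamw and amw parts,
    following the PDT 2.0 layout (tamw/[de]test/*.a and amw/[de]test/*.a)."""
    files = []
    for part in ("tamw", "amw"):
        files += glob.glob(os.path.join(pdt_root, part, split, "*.a"))
    return sorted(files)

# Hypothetical install location; returns an empty list if the path is absent.
print(len(analytical_files("/data/PDT20")), "analytical d-test files")
```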

Extra care must be taken when running parsing experiments or reporting results on PDT 2.0 as to which source of morphological information was used by the parser: undisambiguated, automatically disambiguated (by which tagger?), or manually disambiguated. In fact, the same had to be taken into account when working with PDT 1.0. However, with version 2.0 it is easier to overlook that one is actually working with the wrong source of morphology, because:

  • The annotation is stand-alone, meaning that morphology resides in a file separate from the actual corpus texts.
  • Alternative morphological annotation from a different source would be in a file separate from the primary morphological annotation.
  • Unlike in the CSTS format used in PDT 1.0, the human/machine distinction is not made explicit by using different XML elements in the new PML format.
  • Most notably, due to space considerations, PDT 2.0 is usually distributed with manually disambiguated morphology only. This means that one would have to run a morphological analyzer and a tagger (such as those provided on the PDT 2.0 CD) to obtain machine-disambiguated morphology for the data.

It is strongly recommended to report results of experiments in which the parser had no access to any human annotation of the test data, including morphology (in the training data, of course, use everything you find useful). The obvious reason is that a parser is unlikely to have such information available in a real-world application.

The following table gives the accuracy figures for various parsers on the PDT 2.0 test data. (Note: development of some of the parsers is ongoing. We try to maintain here either their published results, or the results we measured ourselves when we have the parser or its output on the test data available.)

Author (parser) D-test accuracy E-test accuracy Notes
Combination rmd+mc+zž+5×th* 86.2 85.8 Holan & Žabokrtský (2006), Simply Weighted Parsers (SWP)
Hall/Nilsson/Nivre 86.0 85.8 Malt Parser 1.7 with stacklazy algorithm and Java implementation of LibSVM learner (see Nivre (2009)), run by Zeman in June 2013, using feature definition file provided by the Uppsala team. Automatically disambiguated tags used during both training and parsing.
McDonald/Novák/Žabokrtský   84.7 Feature engineering over McDonald's MST parser. See Novák & Žabokrtský (2007).
Ryan McDonald 84.2 84.0 Same parser as in McDonald et al. (2005), run by Václav Novák in 2006. PDT 2.0 results published in Holan & Žabokrtský (2006). Automatically disambiguated tags used during both training and parsing.
Michael Collins 81.6 80.9 Same parser as in Collins et al. (1999), PDT 2.0 results published in Holan & Žabokrtský (2006). Automatically disambiguated tags used during both training and parsing.
Zdeněk Žabokrtský 76.1 75.9 A rule-based parser, described in Holan & Žabokrtský (2006). Automatically disambiguated tags used.
Daniel Zeman 75.0 74.8 Same parser and settings as in Zeman (2004a), run by Zeman in 2006. Automatically disambiguated tags used during both training and parsing.
Václav Klimeš 74.8 74.6 Accuracy reported by Klimeš in 2006; to be published. Automatically disambiguated tags used during both training and parsing.
Tomáš Holan (r2l) 74.0 73.9 Pushdown automaton parser (Holan & Žabokrtský, 2006). Automatically disambiguated tags used.
Tomáš Holan (l2r) 71.4 71.3 Pushdown automaton parser (Holan & Žabokrtský, 2006). Automatically disambiguated tags used.
Tomáš Holan (analog) 71.5 71.1 A parser that “searches for the local tree configuration most similar to the training data” (Holan & Žabokrtský, 2006) (after all, which parser does not?). The parser itself is described in Holan (2005). Automatically disambiguated tags used.
Tomáš Holan (r23) 61.1 61.7 Pushdown automaton parser (Holan & Žabokrtský, 2006). 3-letter word endings used, instead of tags.
Tomáš Holan (l23) 54.9 53.3 Pushdown automaton parser (Holan & Žabokrtský, 2006). 3-letter word endings used, instead of tags.

A Winner?

In their EMNLP paper, Koo et al. (2010) report an unlabeled accuracy of 87.32 % on PDT. Unfortunately, they specify neither which version of PDT nor which test dataset they used, not to mention the distinction between gold and automatically disambiguated morphology. So it is difficult to tell how this result compares to the others.

CoNLL Shared Task 2006

The CoNLL-X (2006) shared task involved dependency parsing of 13 languages including Czech. Training and test data were taken from PDT 1.0. However, the published results are not directly comparable to the results presented above, for the following reasons:

  • Both the training and the test data are smaller than in the original PDT: 72,703 training sentences (1,249,408 tokens) and 365 test sentences (5,853 tokens).
  • The source of morphology is unknown; it could be the manually annotated gold standard.
  • Different attachment metric (e.g. punctuation nodes do not count).
  • The official score is labeled accuracy, i.e. attachment plus dependency label.
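One consequence of the differing conventions is that scores are not directly comparable across evaluations. The effect of excluding punctuation can be sketched as follows (the punctuation test here is deliberately crude; the actual CoNLL-X evaluation script has its own definition):

```python
def uas(gold_heads, pred_heads, forms, count_punct=True):
    """Unlabeled attachment score, optionally skipping punctuation tokens."""
    pairs = [(g, p) for g, p, f in zip(gold_heads, pred_heads, forms)
             if count_punct or any(c.isalnum() for c in f)]
    return sum(g == p for g, p in pairs) / len(pairs)

# Invented 3-token example: the final period is misattached by the parser.
forms = ["Pes", "štěkal", "."]
print(uas([2, 0, 2], [2, 0, 1], forms))          # punctuation counted -> 2/3
print(uas([2, 0, 2], [2, 0, 1], forms, False))   # punctuation skipped -> 1.0
```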

For an overview of the results by the various teams, see Buchholz & Marsi (2006).

Authors Labeled accuracy Notes
Joakim Nivre 82.4 Run later on the CoNLL-X data, see Nivre (2009).
Ryan McDonald, Kevin Lerman, Fernando Pereira 80.2  
Joakim Nivre, Johan Hall, Jens Nilsson, Gülşen Eryiğit, Svetoslav Marinov 78.4  
John O'Neil 76.6  
Yuchang Cheng, Masayuki Asahara, Yuji Matsumoto 76.2  
Kenji Sagae 75.2  
Simon Corston-Oliver, Anthony Aue 74.5  
Ming-Wei Chang, Quang Do, Dan Roth 72.9  
Richard Johansson, Pierre Nugues 71.5  
Xavier Carreras, Mihai Surdeanu, Lluís Màrquez 68.8  
Sebastian Riedel, Ruket Çakıcı, Ivan Meza-Ruiz 67.4  
Eckhard Bick 63.0  
Sander Canisius, Toine Bogers, Antal van den Bosch, Jeroen Geertzen, Erik Tjong Kim Sang 60.9  
Markus Dreyer, David A. Smith, Noah A. Smith 60.5  
Giuseppe Attardi 59.8  
Yu-Chieh Wu, Yue-Shi Lee, Jie-Chi Yang 59.4  
Ting Liu, Jinshan Ma, Huijia Zhu, Sheng Li 58.5  
Michael Schiehlen, Kristina Spranger 53.3  
Deniz Yuret 51.9  

CoNLL Shared Task 2007

The CoNLL 2007 shared task involved dependency parsing of 10 languages including Czech. Training and test data were taken from PDT 2.0.

  • The training-test data split should correspond to the “official” one published with PDT. However, only part of the data is used: 25,364 training sentences (432,296 tokens), 364 development sentences (5760 tokens) and 286 test sentences (4724 tokens).
  • The source of morphology is probably the manually annotated gold standard.
  • Unlike in 2006, accuracy of attaching punctuation nodes does count.
  • Both labeled and unlabeled accuracy have been published.

For an overview of the results by the various teams, see Nivre et al. (2007).

Authors Labeled Unlabeled
Tetsuji Nakagawa 80.19 86.28
Xavier Carreras 78.60 85.16
Jens Nilsson, Johan Hall, Joakim Nivre, Gülşen Eryiğit, Beáta Megyesi, Mattias Nilsson, Markus Saers 77.98 83.59
Ivan Titov, James Henderson 77.94 84.19
Giuseppe Attardi, Felice Dell'Orletta, Maria Simi, Atanas Chanev, Massimiliano Ciaramita 77.37 83.40
Johan Hall, Jens Nilsson, Joakim Nivre, Gülşen Eryiğit, Beáta Megyesi, Mattias Nilsson, Markus Saers 77.22 82.35
Xiangyu Duan, Jun Zhao, Bo Xu 75.34 80.82
Kenji Sagae, Jun'ichi Tsujii 74.83 81.27
Michael Schiehlen, Kristina Spranger 73.86 81.73
Wenliang Chen, Yujie Chang, Hitoshi Isahara 73.69 80.14
Le-Minh Nguyen, Akira Shimazu, Phuong-Thai Nguyen, Xuan-Hieu Phan 72.54 80.91
Keith Hall, Jiří Havelka, David A. Smith 72.27 78.47
Richard Johansson, Pierre Nugues 70.98 77.39
Prashanth Reddy Mannem 70.68 77.20
Maes 67.38 74.03
Yu-Chieh Wu, Jie-Chi Yang, Yue-Shi Lee 66.72 73.07
Sander Canisius, Erik Tjong Kim Sang 56.14 72.12
Jia 54.95 70.41
Svetoslav Marinov 53.47 59.57
Daniel Zeman 50.21 59.19

CoNLL Shared Task 2009

The CoNLL 2009 shared task focused on semantic role labeling but it also involved dependency parsing of 7 languages including Czech. Training and test data were taken from PDT 2.0.

  • The training-test data split should correspond to the “official” one published with PDT. However, only part of the data is used: 38,727 training sentences (652,544 tokens), 5228 development sentences (87,988 tokens) and 4213 test sentences (70,348 tokens).
  • Both manually and automatically disambiguated morphology was available in all three datasets.
  • Only labeled attachment scores were published for the syntactic part of the task.

For an overview of the results by the various teams, see Hajič et al. (2009) and also this site.

Authors Labeled accuracy
Andrea Gesmundo, James Henderson, Paola Merlo, Ivan Titov 80.38  
Bernd Bohnet 80.11  
Wanxiang Che, Zhenghua Li, Yongqiang Li, Yuhang Guo, Bing Qin, Ting Liu 80.01  
Hai Zhao, Wenliang Chen, Jun'ichi Kazama, Kiyotaka Uchimoto, Kentaro Torisawa 79.70  
Yotaro Watanabe, Masayuki Asahara, Yuji Matsumoto 78.17  
Yi Zhang, Rui Wang, Stephan Oepen 75.58  
Xavier Lluís, Stefan Bott, Lluís Màrquez 75.00  
Brown 73.29  
Buzhou Tang, Lu Li, Xinxin Li, Xuan Wang, Xiaolong Wang 72.60  
Qifeng Dai, Enhong Chen, Liu Shi 58.69  
Han Ren, Donhong Ji, Jing Wan, Mingyao Zhang 57.30  
Daniel Zeman 57.06  
Roser Morante, Vincent van Asch, Antal van den Bosch 49.41  

References

The following list of publications gives the picture of parsing results achieved within the ÚFAL research projects, as well as some relevant references to publications of authors at other sites.

  • Ondřej Bojar (2004a): Czech Syntactic Analysis Constraint-Based, XDG: One Possible Start. In: Prague Bulletin of Mathematical Linguistics, 81, pp. 43-54. Univerzita Karlova, Praha, Czechia.
    Available: PDF PS BibTeX
  • Ondřej Bojar (2004b): Problems of Inducing Large Coverage Constraint-Based Dependency Grammar for Czech. In: H. Christiansen, P. R. Skadhauge, J. Villadsen (eds.): Proceedings of International Workshop on Constraint Solving and Language Processing, pp. 29-42. Roskilde Universitet, Roskilde, Denmark.
    Available: PDF PS BibTeX
  • Sabine Buchholz, Erwin Marsi (2006): CoNLL-X Shared Task on Multilingual Dependency Parsing. In: Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pp. 149-164. Association for Computational Linguistics, New York City, New York, USA.
  • Eugene Charniak (2000): A Maximum-Entropy-Inspired Parser. In: Proceedings of NAACL. Association for Computational Linguistics, Seattle, Washington.
    Available: gzipped PS from Eugene Charniak's homepage
  • Michael Collins, Jan Hajič, Eric Brill, Lance Ramshaw, Christoph Tillmann (1999): A Statistical Parser for Czech. In: Proceedings of the 37th Meeting of the ACL, pp. 505-512. University of Maryland, College Park, Maryland.
    Available: PS from Michael Collins' homepage
  • Jan Hajič, Eric Brill, Michael Collins, Barbora Hladká, Douglas Jones, Cynthia Kuo, Lance Ramshaw, Oren Schwartz, Christoph Tillmann, Daniel Zeman (1998): Core Natural Language Processing Technology Applicable to Multiple Languages. Center for Language and Speech Processing, Johns Hopkins University, Baltimore, Maryland.
    Available: PDF PS BibTeX
  • Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, Yi Zhang (2009): The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL): Shared Task, pp. 1–18. Association for Computational Linguistics, Boulder, Colorado.
    Available: PDF
  • Keith Hall, Václav Novák (2005): Corrective Modeling for Non-Projective Dependency Parsing. In: Proceedings of the International Workshop on Parsing Technologies (IWPT). Association for Computational Linguistics, Vancouver, British Columbia.
    Available: PDF
  • Keith Hall, Václav Novák (2010): Corrective Dependency Parsing. In: Joakim Nivre (ed.): Trends in Parsing Technology. Springer-Verlag Berlin Heidelberg.
  • Tomáš Holan (2005): Genetické učení závislostních analyzátorů. In: P. Vojtáš (ed.): Proceedings of ITAT 2005. Univerzita Pavla Jozefa Šafárika, Košice, Slovakia.
  • Tomáš Holan, Vladislav Kuboň, Martin Plátek, Karel Oliva (2003): A Theoretical Basis of an Architecture of a Shell of a Reasonably Robust Syntactic Analyser. In: V. Matoušek, P. Mautner (eds.): Proceedings of the 7th International Conference on Text, Speech and Dialogue, pp. 58-65. Springer-Verlag, Berlin / Heidelberg / New York, České Budějovice, Czechia.
    Available: BibTeX
  • Tomáš Holan, Zdeněk Žabokrtský (2006): Combining Czech Dependency Parsers. To Appear In: Proceedings of the 9th International Conference on Text, Speech and Dialogue. Springer-Verlag, Berlin / Heidelberg / New York, Brno, Czechia.
    Available: PDF (preliminary version)
  • Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, David Sontag (2010): Dual Decomposition for Parsing with Non-Projective Head Automata. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1288-1298. MIT, Massachusetts, USA.
    Available: PDF BibTeX
  • Vladislav Kuboň (2001a): Problems of Robust Parsing of Czech. PhD thesis. Univerzita Karlova, Praha, Czechia.
    Available: PDF PS BibTeX
  • Vladislav Kuboň (2001b): A Method for Analyzing Clause Complexity. In: Prague Bulletin of Mathematical Linguistics, 75, pp. 5-28. Univerzita Karlova, Praha, Czechia.
    Available: BibTeX
  • Vladislav Kuboň, Tomáš Holan, Karel Oliva, Martin Plátek (1998a): Two Useful Measures of Word Order Complexity. In: A. Polguere, S. Kahane (eds.): Proceedings of the COLING-ACL Workshop on Dependency-Based Grammars, pp. 21-28. Université de Montréal, Montréal, Quebec.
    Available: BibTeX
  • Vladislav Kuboň, Tomáš Holan, Karel Oliva, Martin Plátek (1998b): Two Useful Measures of Word Order Complexity. In: ÚFAL Technical Report, 4. Univerzita Karlova, Praha, Czechia.
    Available: BibTeX
  • Vladislav Kuboň, Tomáš Holan, Karel Oliva, Martin Plátek (2001): Word-Order Relaxations & Restrictions within a Dependency Grammar. In: Proceedings of International Workshop on Parsing Technologies, pp. 237-240. Qīnghuá Dàxué Chūbǎnshè, Běijīng, China.
    Available: BibTeX
  • Vladislav Kuboň, Martin Plátek (2001): A Method of Accurate Robust Parsing for Czech. In: V. Matoušek, P. Mautner, R. Mouček, K. Taušer (eds.): Proceedings of the 5th International Conference on Text, Speech and Dialogue, pp. 69-92. Springer-Verlag, Berlin / Heidelberg / New York, Plzeň, Czechia.
    Available: BibTeX
  • Markéta Lopatková, Martin Plátek, Vladislav Kuboň (2005): Závislostní redukční analýza přirozených jazyků. In: P. Vojtáš (ed.): Proceedings of ITAT 2004. Univerzita Pavla Jozefa Šafárika, Košice, Slovakia.
    Available: BibTeX
  • Ryan McDonald, Fernando Pereira, Kiril Ribarov, Jan Hajič (2005): Non-projective Dependency Parsing using Spanning Tree Algorithms. In: Proceedings of the Human Language Technology / Empirical Methods in Natural Language Processing conference (HLT-EMNLP). Association for Computational Linguistics, Vancouver, British Columbia.
    Available: PDF from the ACL Anthology
  • Joakim Nivre (2009): Non-Projective Dependency Parsing in Expected Linear Time. In: Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pp. 351-359. Association for Computational Linguistics, Suntec, Singapore.
    Available: PDF from the ACL Anthology
  • Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, Deniz Yuret (2007): The CoNLL 2007 Shared Task on Dependency Parsing. In: Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pp. 915-932. Association for Computational Linguistics, Praha, Czechia.
  • Joakim Nivre, Jens Nilsson (2005): Pseudo-Projective Dependency Parsing. In: Proceedings of the 43rd Annual Meeting of the ACL. University of Michigan, Ann Arbor, Michigan.
    Available: PDF BibTeX from the ACL Anthology
  • Václav Novák, Zdeněk Žabokrtský (2007): Feature Engineering in Maximum Spanning Tree Dependency Parser. In: Proceedings of the 10th International Conference on Text, Speech and Dialogue. Západočeská univerzita, Plzeň, Czechia. Springer-Verlag Berlin Heidelberg, LNCS 4629.
    Available: PDF
  • Kiril Ribarov (2000): Rule-Based Tagging: Morphological Tagsets versus Tagset of Analytical Functions. In: M. Gavrilidou, G. Karaiannis, S. Markantonatou, S. Piperidis, G. Stainhaouer (eds.): Proceedings of the 2nd International Conference on Language Resources (LREC), pp. 1123-1125. European Language Resources Association, Athîna, Greece.
    Available: PDF PS BibTeX
  • Kiril Ribarov (2002): On the Rule-Based Parsing of Czech. In: Prague Bulletin of Mathematical Linguistics, 77, pp. 77-99. Univerzita Karlova, Praha, Czechia.
    Available: PDF PS BibTeX
  • Kiril Ribarov (2004): Automatic Building of a Dependency Tree - The Rule-Based Approach and Beyond. PhD thesis. Univerzita Karlova, Praha, Czechia.
    Available: PDF PS BibTeX
  • Anoop Sarkar, Daniel Zeman (2000): Automatic Extraction of Subcategorization Frames for Czech. In: Proceedings of the 18th International Conference on Computational Linguistics, pp. 691-697. Universität des Saarlandes, Saarbrücken, Germany.
    Available: PDF PS BibTeX
  • Daniel Zeman (1998): A Statistical Approach to Parsing of Czech. In: Prague Bulletin of Mathematical Linguistics, 69. Univerzita Karlova, Praha, Czechia.
    Available: PDF PS BibTeX
  • Daniel Zeman (2001a): How Much Will a RE-based Preprocessor Help a Statistical Parser? In: Proceedings of International Workshop on Parsing Technologies, pp. 253-256. Qīnghuá Dàxué Chūbǎnshè, Běijīng, China.
    Available: PDF PS BibTeX
  • Daniel Zeman (2001b): Parsing with Regular Expressions: A Minute to Learn, a Lifetime to Master. In: Prague Bulletin of Mathematical Linguistics, 75, pp. 29-37. Univerzita Karlova, Praha, Czechia.
    Available: PDF PS BibTeX
  • Daniel Zeman (2002a): Can Subcategorization Help a Statistical Dependency Parser? In: S.-C. Tseng (ed.): Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), pp. 1156-1162. Zhōngyāng Yánjiùyuàn, Táiběi, Taiwan.
    Available: PDF PS BibTeX
  • Daniel Zeman (2002b): How to Decrease the Performance of a Statistical Parser. In: Prague Bulletin of Mathematical Linguistics, 78, pp. 53-62. Univerzita Karlova, Praha, Czechia.
    Available: PDF PS BibTeX
  • Daniel Zeman (2004a): Parsing with a Statistical Dependency Model. PhD thesis. Univerzita Karlova, Praha, Czechia.
    Available: PDF PS BibTeX
  • Daniel Zeman (2004b): Neprojektivity v Pražském závislostním korpusu (PDT). In: ÚFAL Technical Report, 22. Univerzita Karlova, Praha, Czechia.
    Available: PDF PS BibTeX
  • Daniel Zeman, Anoop Sarkar (2000): Learning Verb Subcategorization from Corpora: Counting Frame Subsets. In: M. Gavrilidou, G. Karaiannis, S. Markantonatou, S. Piperidis, G. Stainhaouer (eds.): Proceedings of the 2nd International Conference on Language Resources (LREC), pp. 227-233. European Language Resources Association, Athîna, Greece.
    Available: PDF PS BibTeX
  • Daniel Zeman, Zdeněk Žabokrtský (2005): Improving Parsing Accuracy by Combining Diverse Dependency Parsers. In: Proceedings of the International Workshop on Parsing Technologies (IWPT 2005). Association for Computational Linguistics, Vancouver, British Columbia.
    Available: PDF

Maintained by Daniel Zeman
Updated on 11 July 2013