A Guide to Czech Language Tagging at UFAL
We present results achieved by former and current researchers at the Institute of Formal and Applied Linguistics (UFAL), Faculty of Mathematics and Physics, Charles University in Prague.
Practically every natural language processing system, and not only one for an inflective language, needs morphologically processed text: for each word, it needs the list of all combinations of morphological category values (tags) that make sense for that word. Most systems, however, need more precise information: the single combination of morphological category values that fits the particular context. The task called tagging uses the context of a word in the input text to select the correct tag from the list of all possible tags.
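The two steps described above, listing all possible tags and then picking one from context, can be sketched as follows. This is a minimal illustration only: the coarse tag names, the tiny lexicon, and the hand-set preference scores are all assumptions made for the example and are not the tools or tagset used at UFAL.

```python
# Toy morphological analysis: every tag each word form could carry.
# Entries and coarse tag names are illustrative assumptions, not a real lexicon.
ANALYSES = {
    "stát": ["NOUN", "VERB"],   # "state" or "to stand"
    "je": ["VERB", "PRON"],     # "is" or "them"
    "velký": ["ADJ"],           # "big"
}

# Hand-set preferences for (previous tag, current tag) pairs.
# Hypothetical values; a real tagger would learn such scores from data.
PREFERENCE = {
    ("<s>", "NOUN"): 2,
    ("NOUN", "VERB"): 2,
    ("VERB", "ADJ"): 2,
    ("VERB", "PRON"): 1,
}


def disambiguate(words):
    """Greedily pick one tag per word, using the previously chosen tag as context."""
    chosen = []
    previous = "<s>"  # sentence-start marker
    for word in words:
        candidates = ANALYSES[word]
        best = max(candidates, key=lambda tag: PREFERENCE.get((previous, tag), 0))
        chosen.append(best)
        previous = best
    return chosen


print(disambiguate(["stát", "je", "velký"]))
```

Both "stát" and "je" are ambiguous in isolation; only the left context lets the sketch prefer the noun reading for the first and the verb reading for the second.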
When developing morphological tools (a morphological analyzer, a tagger) for a given language, one must first define a set of possible tags that corresponds to a linguistic notion of morphology. Each tag encodes information about those grammatical categories of the word form in question that belong to the morphological level of natural language description. For Czech morphological processing, a positional tag system has been developed: the Czech Positional Tag System (quick 'html' reference).
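To make the idea of a positional tag concrete, here is a small decoding sketch. It assumes the 15-position layout described in the PDT 2.0 documentation (one character per category, with "-" meaning "not applicable"); the position names below come from that documentation as best recalled, not from this page, so check them against the quick reference before relying on them. The function name `decode_tag` is hypothetical.

```python
# Assumed layout of the Czech positional tag: 15 positions, one character each.
# Names follow the PDT 2.0 documentation; verify against the official reference.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Map a positional tag string to {category: value}, skipping '-' positions."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


# Example tag for a feminine singular noun in the nominative, affirmative form.
print(decode_tag("NNFS1-----A----"))
```

Because every category has a fixed position, a tool can read or compare a single category (for example, only the case) without parsing the rest of the tag.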
The strategies we apply to tag texts are mainly corpus-based (see Publications), i.e. they learn from annotated corpora features whose character depends on the underlying algorithm (probabilities, memory patterns, transformation rules, weights, ...). For Czech, the situation is very favorable: there are two sources of data, the Prague Dependency Treebank (PDT) and the Czech Academic Corpus (CAC). Mainly thanks to CAC (annotated during the 1960s and 1970s at the Institute of Czech Language) we were able to run the very first tagging experiment, a probabilistic one.
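As a rough illustration of what "working on annotated corpora" means for a probabilistic approach, the sketch below estimates relative frequencies from a tiny hand-made corpus. The four sentences, the coarse tags, and the function names `train` and `relative_frequency` are assumptions made for this example; they are not PDT or CAC data, and this is not the method of any tagger listed on this page.

```python
from collections import Counter, defaultdict

# Tiny hand-made annotated corpus (hypothetical data, coarse tags).
CORPUS = [
    [("stát", "NOUN"), ("je", "VERB"), ("velký", "ADJ")],
    [("musí", "VERB"), ("stát", "VERB")],
    [("vidím", "VERB"), ("je", "PRON")],
    [("stát", "NOUN"), ("platí", "VERB")],
]


def train(corpus):
    """Count how often each word carries each tag and each tag follows each tag."""
    word_tag = defaultdict(Counter)  # word -> counts of its tags
    tag_tag = defaultdict(Counter)   # previous tag -> counts of the next tag
    for sentence in corpus:
        previous = "<s>"  # sentence-start marker
        for word, tag in sentence:
            word_tag[word][tag] += 1
            tag_tag[previous][tag] += 1
            previous = tag
    return word_tag, tag_tag


def relative_frequency(counter, key):
    """Unsmoothed estimate: count of key divided by the total count."""
    total = sum(counter.values())
    return counter[key] / total if total else 0.0


word_tag, tag_tag = train(CORPUS)
print(relative_frequency(word_tag["stát"], "NOUN"))  # share of "stát" tokens tagged NOUN
print(relative_frequency(tag_tag["VERB"], "PRON"))   # share of tags after VERB that are PRON
```

Other corpus-based approaches would extract different features from the same annotated data, such as transformation rules or feature weights, instead of these counts.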
Taggers
- Perceptron-based tagger (MORCE)
- Feature-based tagger (available on the PDT 2.0 CD-ROM)
Publications
- Jan Hajič: Disambiguation of Rich Inflection (Computational Morphology of Czech). Karolinum, Charles University Press, Prague, Czech Republic, 2004. Available: BibTeX
- Jan Hajič: Morphological Tagging: Data vs. Dictionaries. In: Proceedings of the 6th Applied Natural Language Processing and the 1st NAACL Conference, Seattle, Washington, 2000, pp. 94-101. Available: PDF PS BibTeX
- Jan Hajič, Pavel Krbec, Karel Oliva, Pavel Květoň, Vladimír Petkevič: Serial Combination of Rules and Statistics: A Case Study in Czech Tagging. In: Proceedings of the 39th Association of Computational Linguistics Conference, Association for Computational Linguistics, Toulouse, France, 2001. Available: PDF PS BibTeX
- Jan Hajič, Vladislav Kuboň: Tagging as a Key to Successful MT. In: D. Obdržálek, J. Tesková (eds.): Proceedings of the MIS, MATFYZPRESS, Prague, Czech Republic, 2003, pp. 56-65. Available: PDF PS BibTeX
- Jan Hajič, Barbora Vidová-Hladká: Morfologické značkování korpusu českých textů stochastickou metodou [Morphological Tagging of a Corpus of Czech Texts by a Stochastic Method]. In: Slovo a slovesnost, 58, (4), Czech Academy of Science, Prague, 1997, pp. 288-304. Available: PDF PS BibTeX
- Jan Hajič, Barbora Vidová-Hladká: Probabilistic and Rule-Based Tagger of an Inflective Language - a Comparison. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington DC, USA, 1997, pp. 111-118. Available: PDF PS BibTeX
- Jan Hajič, Barbora Vidová-Hladká: Czech Language Processing – PoS Tagging. In: Proceedings of the 1st International Conference on Language Resources and Evaluation, Granada, Spain, 1998, pp. 931-936. Available: PDF PS BibTeX
- Jan Hajič, Barbora Vidová-Hladká: Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In: Proceedings of the COLING - ACL Conference, Montreal, Canada, 1998, pp. 483-490. Available: PDF PS BibTeX
- Jirka Hana, Daniel Zeman, Jan Hajič, Hana Hanová, Barbora Hladká, Emil Jeřábek: Manual for Morphological Annotation PDT. TR-2005-27, Institute of Formal and Applied Linguistics, MFF UK, Prague, Czech Republic, 2005. Available: HTML PDF BibTeX
- Pavel Květoň, Karel Oliva: (Semi-)Automatic Detection of Errors in PoS-Tagged Corpora. In: S.-Ch. Tseng (ed.): Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan, 2002, pp. 509-515. Available: BibTeX
- Karel Oliva, Pavel Květoň: Achieving an Almost Correct PoS-Tagged Corpus. In: P. Sojka, I. Kopeček, K. Pala (eds.): Proceedings of the 5th International Conference on Text, Speech and Dialogue, (2448), Springer-Verlag Berlin Heidelberg New York, 2002, pp. 19-26. Available: PDF PS BibTeX
- Karel Oliva, Pavel Květoň: Linguistically Motivated Bigrams in Part-of-Speech Tagging of Language Corpora. In: Prague Bulletin of Mathematical Linguistics, 78, MFF UK, Prague, Czech Republic, 2002, pp. 23-36. Available: PDF PS BibTeX
- Karel Oliva, Pavel Květoň, Roman Ondruška: The Computational Complexity of Rule-Based Part-of-Speech Tagging. In: V. Matoušek, P. Mautner (eds.): Proceedings of the 7th International Conference on Text, Speech and Dialogue, Springer-Verlag Berlin Heidelberg New York, 2003, pp. 82-89. Available: PDF PS BibTeX
- Karel Oliva, Pavel Květoň, Vladimír Petkevič, Milena Hnátková: The Linguistic Basis of a Rule-Based Tagger of Czech. In: P. Sojka, I. Kopeček, K. Pala (eds.): Proceedings of the 4th International Conference on Text, Speech and Dialogue, Springer-Verlag Berlin Heidelberg New York, 2000, pp. 3-8. Available: PDF PS BibTeX
- Jiří Mírovský: Morphological Annotation of Text: Automatic Disambiguation. MSc thesis, MFF UK, Prague, Czech Republic, 1998. Available: PDF PS BibTeX
- Drahomíra "johanka" Spoustová, Jan Hajič, Jan Votrubec, Pavel Krbec, Pavel Květoň: The Best of Two Worlds: Cooperation of Statistical and Rule-Based Taggers for Czech. In: Proceedings of the BSNLP'2007 workshop, 2007. Available: PDF BibTeX
- Drahomíra "johanka" Spoustová: Kombinované statisticko-pravidlové metody značkování češtiny [Combined Statistical and Rule-Based Methods of Czech Tagging]. PhD thesis, UK MFF, 2007. Available: PDF BibTeX
- Drahomíra "johanka" Spoustová: Combining Statistical and Rule-Based Approaches to Morphological Tagging of Czech Texts. In: Prague Bulletin of Mathematical Linguistics, 89, MFF UK, Prague, Czech Republic, 2008, pp. 23-40. Available: PDF BibTeX
- Drahomíra "johanka" Spoustová, Pavel Pecina, Jan Hajič, Miroslav Spousta: Validating the Quality of Full Morphological Annotation. In: Proceedings of LREC 2008. Available: PDF BibTeX
- Barbora Vidová-Hladká: Czech Language Tagging. PhD thesis, ÚFAL MFF UK, 2000. Available: PDF PS BibTeX
- Barbora Vidová-Hladká: The Context (not only) for Human. In: M. Gavrilidou, G. Carayannis, S. Markantonatou, S. Piperidis, G. Stainhaouer (eds.): Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece, 2000, pp. 1113-1116. Available: PDF PS BibTeX
- Barbora Vidová-Hladká: Software Tools for Large Czech Corpora Annotation. MSc thesis, MFF UK, Prague, Czech Republic, 1994. Available: BibTeX
- Barbora Vidová-Hladká, Kiril Ribarov: PoS tags for automatic tagging and syntactic structures. In: E. Hajičová (ed.): Issues of Valency and Meaning. Studies in Honour of Jarmila Panevová, Karolinum, Charles University Press, Prague, Czech Republic, 1998, pp. 226-240. Available: PDF PS BibTeX
- Jan Votrubec: Volba vhodné sady rysů pro morfologické značkování češtiny [Selecting a Suitable Feature Set for Morphological Tagging of Czech]. MSc thesis, MFF UK, Prague, Czech Republic, 2005. Available: PDF BibTeX
Talks
- Johanka Spoustová: Kombinované metody značkování [Combined Tagging Methods], 5/2007. Available: video, slides
- Johanka Spoustová: Nové pokroky ve značkování (nejen) češtiny [New Advances in Tagging (not only) Czech], 4/2008. Available: video, slides