The Prague Dependency Treebank 1.0
Czech Language Morphology and Tagging
Practically every natural language processing system (machine translation, information retrieval, parsing, etc.), and not only one for an inflective language, needs morphologically processed text: for each word, it needs the list of all possible combinations of morphological category values (tags) that make sense for that word. Most systems, however, need more precise information. They need a single combination of morphological category values, chosen from the list of all available combinations, that fits the particular context. The task called tagging uses the context of a word in the input text to select the correct tag from the list of all possible tags.
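The distinction between morphological analysis (listing all possible tags) and tagging (selecting one of them in context) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the two-word lexicon, the names `LEXICON`, `analyze` and `tag`, the assumption that the fifth character of a tag encodes case, and the single agreement rule. It is not the method used by the Czech taggers, which is not described on this page.

```python
# Toy illustration only: the lexicon, the function names and the single
# disambiguation rule are hypothetical, not part of the PDT tools.

# Morphological analysis: each word form maps to ALL tags it may carry.
# Tags are written positionally; this sketch assumes the fifth character
# (index 4) encodes the case.
LEXICON = {
    "bez": ["RR--2----------"],      # preposition governing the genitive
    "ženy": [
        "NNFP1-----A----",           # noun, feminine, plural, nominative
        "NNFS2-----A----",           # noun, feminine, singular, genitive
        "NNFP4-----A----",           # noun, feminine, plural, accusative
        "NNFP5-----A----",           # noun, feminine, plural, vocative
    ],
}


def analyze(word):
    """Return every candidate tag for a word form (empty list if unknown)."""
    return list(LEXICON.get(word.lower(), []))


def tag(words):
    """Pick one tag per word, using the previous tag as the only context."""
    result = []
    prev_tag = None
    for word in words:
        candidates = analyze(word)
        chosen = candidates[0] if candidates else None
        # Toy context rule: after a preposition, prefer the candidate
        # whose case matches the case the preposition requires.
        if chosen is not None and prev_tag is not None and prev_tag.startswith("RR"):
            required_case = prev_tag[4]
            for candidate in candidates:
                if candidate[4] == required_case:
                    chosen = candidate
                    break
        result.append((word, chosen))
        prev_tag = chosen
    return result


if __name__ == "__main__":
    print(analyze("ženy"))           # all four candidate tags
    print(tag(["bez", "ženy"]))      # one tag per word, chosen in context
```

In isolation the form "ženy" keeps its first listed candidate, while after the preposition "bez" the genitive reading is selected. A practical tagger would replace the hand-written rule with a model of the context.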
When developing a morphological analyzer for a given language, it is first necessary to define a set of possible tags that corresponds to our linguistic notion of morphology. Each tag contains information (in the general sense) about those grammatical categories of the word form in question that belong to the morphological level of natural language description.
For Czech morphological processing, two equivalent tag notation systems have been developed. Both use a string of symbols to denote a morphological tag. One is called "compact", the other "positional". The compact tag system (Czech Compact Tag System, available in: pdffile, psfile) is used in the Czech morphological dictionary, since it takes less space. The Czech positional tag system - detailed description (psfile, pdffile), quick reference (htmlfile, pdffile) - is directly usable in the taggers used to tag Czech texts. Conversion tables Compact Tags ---> Positional Tags (b2800a.f2o) and Positional Tags ---> Compact Tags (b2800a.o2f), together with mapping scripts, are also available. A more detailed description of both tag sets is given in [Jan Hajic: Disambiguation of Rich Inflection - Computational Morphology of Czech. Charles University Press - Karolinum, in press]; see References. The difference between a morphologically complex and ambiguous inflective language and a language with poor inflection is reflected in the cardinality of the respective tag sets; compare the Penn Treebank tag set (available in: pdffile, psfile).
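To give a feel for what a positional tag looks like to a program, the following Python sketch splits a tag string into named categories. It assumes the fifteen-position layout (part of speech, detailed part of speech, gender, number, case, and so on) with "-" marking a category that does not apply; the position names and the function `decode_positional` are chosen here for illustration, and the detailed description of the positional tag system remains the authoritative source for positions and values.

```python
# Hypothetical helper; position names follow the fifteen-position layout
# assumed in this sketch. Consult the tag set documentation for the
# authoritative list of positions and their admissible values.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_positional(tag):
    """Map each filled position of a positional tag to its one-letter value."""
    if len(tag) != len(POSITIONS):
        raise ValueError(
            "expected %d characters, got %d" % (len(POSITIONS), len(tag))
        )
    # "-" means the category is not applicable to this word form.
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


if __name__ == "__main__":
    # A feminine singular genitive noun in affirmative form.
    for name, value in decode_positional("NNFS2-----A----").items():
        print(name, "=", value)
```

Because every category has a fixed position, a tagger can read a single category (for example, the case) by index without parsing the whole tag, which is one reason a fixed-width notation is convenient in tagging software.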
To read more about Czech morphological processing, and possibly download software for it, visit the Czech Morphology main page.
To read more about Czech tagging, and possibly download software for it, visit the Czech Language Tagging main page.
- Jan Hajic: Disambiguation of Rich Inflection - Computational Morphology of Czech. Charles University Press - Karolinum, in press.