A Guide to Czech Language Tagging at UFAL

We present results achieved by former and current researchers at the Institute of Formal and Applied Linguistics (UFAL), Faculty of Mathematics and Physics, Charles University in Prague.

Practically every natural language processing system, and not only one for an inflective language, needs morphologically processed text: for each word, it needs the list of all combinations of morphological category values (tags) that make sense for that word. Most systems, however, need more precise information: the single combination of morphological category values that fits the particular context. The task called tagging uses the context of a word in the input text to select the correct tag from the list of all possible tags.
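The difference between morphological analysis (all possible tags) and tagging (one tag chosen in context) can be sketched in a few lines of Python. Everything here is a toy assumption: the lexicon, the simplified tag names (which are not the Czech tag set) and the hand-written context scores are hypothetical, and real taggers learn such preferences from annotated data rather than having them written by hand.

```python
# Hypothetical mini-lexicon: word form -> all morphologically possible tags.
# The tag names are simplified placeholders, not the real Czech tag set.
LEXICON = {
    "the": ["DET"],
    "can": ["NOUN", "VERB", "AUX"],
    "rusts": ["VERB", "NOUN"],
}

# Assumed toy scores for (previous tag, current tag) pairs; unknown pairs score 0.
CONTEXT_SCORE = {
    ("<s>", "DET"): 1.0,
    ("DET", "NOUN"): 1.0,
    ("NOUN", "VERB"): 1.0,
}


def analyze(word):
    """Morphological analysis: return every possible tag for the word."""
    return LEXICON.get(word.lower(), ["UNKNOWN"])


def tag(sentence):
    """Tagging: pick one tag per word, greedily, using the previous tag as context."""
    previous = "<s>"  # sentence-start marker
    result = []
    for word in sentence:
        candidates = analyze(word)
        best = max(candidates, key=lambda t: CONTEXT_SCORE.get((previous, t), 0.0))
        result.append((word, best))
        previous = best
    return result
```

Calling analyze("can") returns all three candidate tags, while tag(["The", "can", "rusts"]) keeps only one tag per word, chosen with respect to the preceding tag.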

When developing morphological tools (a morphological analyzer, a tagger) for a given language, it is first necessary to define a set of possible tags that corresponds to a linguistic notion of morphology. Each tag carries information (in the general sense) about those grammatical categories of the word form in question that belong to the morphological level of natural language description. For Czech morphological processing, a positional tag system has been developed: the Czech Positional Tag System (quick 'html' reference).
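To illustrate what "positional" means, the sketch below decodes a tag string in which each character position stands for one morphological category. The 15 position names follow the layout commonly documented for PDT-style Czech tags; that layout is an assumption here (this page does not list it), and the helper name decode_tag is hypothetical. The quick reference remains the authoritative description.

```python
# Position names as commonly documented for PDT-style Czech positional tags
# (an assumption here; this page itself does not list them).
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Map a positional tag string to {category: value}, skipping '-' (not applicable)."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}
```

For example, decode_tag("NNFS1-----A----") yields POS, SubPOS, Gender, Number, Case and Negation values and omits the positions marked with a hyphen.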

The strategies we apply to tag texts are mainly corpus-based (see Publications), i.e. they work on annotated corpora to obtain appropriate features, whose character depends on the underlying algorithm (probabilities, memory patterns, transformation rules, weights, ...). For Czech, the situation is very favorable: there are two sources of data, the Prague Dependency Treebank (PDT) and the Czech Academic Corpus (CAC). Mainly thanks to the existence of CAC (annotated during the 60s and 70s at the Institute of Czech Language), we were able to run our very first tagging experiment, a probabilistic one.
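As a minimal illustration of how a corpus-based approach obtains probabilities from annotated data, the sketch below estimates P(tag | word) by relative frequency. It is not the method used in the experiments mentioned above; the function name train_unigram, the toy corpus and its placeholder tags are all hypothetical, and no PDT or CAC material is used.

```python
from collections import Counter, defaultdict


def train_unigram(annotated_sentences):
    """Estimate P(tag | word) by relative frequency from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in annotated_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {
        word: {tag: n / sum(tag_counts.values()) for tag, n in tag_counts.items()}
        for word, tag_counts in counts.items()
    }


# Toy annotated data with placeholder words and tags (not PDT or CAC material).
corpus = [
    [("a", "X"), ("b", "Y")],
    [("a", "X"), ("b", "Z")],
    [("a", "Z"), ("b", "Y")],
]
model = train_unigram(corpus)

# The most probable tag for a word is the one with the highest estimate.
best_for_b = max(model["b"], key=model["b"].get)
```

A model of this kind ignores context entirely; the other feature types listed above (memory patterns, transformation rules, weights) would be learned from the same annotated data in different ways.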




Maintained by Johanka Spoustová and Barbora Hladká