
Czech Feature-based Tagger (and full morphology)

Author: Jan Hajic, 2001

Download

CZ010619x.tgz (Linux)
CZ010619xs.tgz (Sun/Solaris)

Description

The Feature-based (exponential model) Tagger is a fast implementation of the Czech tagger developed at UFAL and described elsewhere on these pages (Czech Language Tagging page). To get the best possible results, the tagger requires preprocessing by a Czech morphological module with very high coverage; the module supplied here covers a superset of the Czech "FM" morphology. Both the morphological module and the tagger are supplied as binary executables, together with all necessary precompiled Czech data. Input must be encoded in ISO Latin 2 (ISO 8859-2) and conform to the usual csts.dtd definition; output is produced in the same encoding and format. (Like many of the tools provided with PDT 1.0, both executables also accept, and then produce, a "simplified SGML", which is not real, valid SGML but contains at least the tags for words, punctuation, and sentence breaks, one item per line.)
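Text stored in another encoding has to be converted before it is passed to the tools. The following is a minimal sketch, assuming the source text is UTF-8 and the standard iconv utility is installed; the function name to_latin2 and the file names are hypothetical and not part of the distribution. The conversion changes only the character encoding; the markup must still follow csts.dtd (or the simplified form).

```shell
# Hypothetical helper (not part of the distribution): convert a UTF-8 file
# to ISO 8859-2 so that it meets the input encoding requirement.
to_latin2() {
    # $1: UTF-8 input file, $2: ISO 8859-2 output file
    iconv -f UTF-8 -t ISO-8859-2 "$1" > "$2"
}

# Example call (file names are placeholders):
#   to_latin2 input.utf8.csts input.csts
```

iconv normally exits with a non-zero status when the input contains a character that has no ISO 8859-2 equivalent, so it is worth checking the exit status before running the tagger on the result.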

The current on-line (client/server) version of the tagger can be found here (the same page as the "FM" on-line morphology; use the appropriate checkbox to invoke the tagger instead of the "FM" morphology).

Supported platforms

The tagger and the included morphological module are compiled for Linux (2.2.x and above, such as Red Hat 6.2 and later) and Solaris (SunOS 5.7 and later) on Sparc machines.
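A script that has to pick the right package can test the kernel name. This is a small sketch, assuming only that `uname -s` prints Linux on Linux and SunOS on Solaris; the function name archive_suffix is hypothetical, and the suffixes are those of the two archives listed under Download.

```shell
# Hypothetical helper: map a kernel name (as printed by `uname -s`) to the
# suffix of the matching archive: "x" for Linux, "xs" for Sun/Solaris.
archive_suffix() {
    kernel=${1:-$(uname -s)}    # default to the machine we are running on
    case "$kernel" in
        Linux) echo x ;;
        SunOS) echo xs ;;       # Solaris reports itself as SunOS
        *) echo "no binary package for $kernel" >&2; return 1 ;;
    esac
}

# Example: build the archive name for the current machine.
#   archive="CZ010619$(archive_suffix).tgz"
```

The check covers the operating system only; it does not verify the kernel version or the processor type.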

Installation

The names of the binary packages depend on the version (date) of the distribution: they have the form CZyymmddx.tgz (for Linux) and CZyymmddxs.tgz (for Solaris), where yymmdd is the year, month and day of the distribution. Occasionally, a new distribution can be found on UFAL's/CKL's website(s) in addition to the one on the distribution CD (CZ010619x.tgz, CZ010619xs.tgz).
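The naming scheme can be decoded mechanically, for instance to compare a downloaded archive with the one on the CD. The following is an illustrative sketch in POSIX shell; the function name describe_dist and its output format are assumptions, not part of the distribution.

```shell
# Hypothetical helper: print the platform and the yy-mm-dd date encoded in
# an archive name of the form CZyymmddx.tgz or CZyymmddxs.tgz.
describe_dist() {
    name=${1##*/}               # drop any leading directory
    case "$name" in
        CZ[0-9][0-9][0-9][0-9][0-9][0-9]x.tgz)  platform=Linux ;;
        CZ[0-9][0-9][0-9][0-9][0-9][0-9]xs.tgz) platform=Solaris ;;
        *) echo "not a CZyymmddx[s].tgz name: $name" >&2; return 1 ;;
    esac
    stamp=${name#CZ}            # e.g. 010619x.tgz
    stamp=${stamp%%x*}          # e.g. 010619
    yy=${stamp%????}            # first two digits: year
    rest=${stamp#??}            # remaining four digits: month and day
    mm=${rest%??}
    dd=${rest#??}
    echo "$platform $yy-$mm-$dd"
}

# Example:
#   describe_dist CZ010619x.tgz
```

The year is left as two digits because the file name does not encode the century.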

Unpack the CZyymmddx[s].tgz archive in the directory where you want the tagger to live, for example (assuming you first downloaded or copied it to your home directory):

cd /usr/local
mkdir CZ
cd CZ
cp -p ~/CZ010619x.tgz .
tar -xzvf CZ010619x.tgz

Then please refer to the CZyymmddx[s].README file in the directory for further installation instructions.
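The unpacking step can also be wrapped in a function, for example to install under a directory that does not require root access. This is a sketch under stated assumptions: the function name install_cz and the default target $HOME/CZ are hypothetical, and the archive is decompressed with gzip and piped to tar so that the same commands also work with a tar that lacks the -z option.

```shell
# Hypothetical helper: copy an archive into a target directory and unpack
# it there. $1: path to the .tgz archive, $2: target directory (optional).
install_cz() {
    archive=$1
    prefix=${2:-$HOME/CZ}       # assumed default; any writable directory works
    mkdir -p "$prefix" || return 1
    cp -p "$archive" "$prefix"/ || return 1
    # Decompress with gzip and let tar read the stream from standard input.
    ( cd "$prefix" && gzip -dc "${archive##*/}" | tar -xvf - )
}

# Example:
#   install_cz ~/CZ010619x.tgz /usr/local/CZ
```

As with the manual steps, the README file in the unpacked directory remains the authority for the rest of the installation.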

More on Czech tagging (papers, more detailed documentation, and so on) can be found here.
