Czech HMM-based Tagger (using full morphology)
Authors: Pavel Krbec (HMM Tagger), 2001; Jan Hajic (morphology), 2001
The HMM-based Tagger is an implementation
of the Czech tagger developed at UFAL.
In order to work, the tagger requires preprocessing by a Czech morphological
module with very high coverage. This module covers a superset of the
"HM" morphology. The morphological module and the tagger are supplied as two independent packages
of binary executables, together with all necessary precompiled Czech data.
Input must be encoded in ISO Latin 2 (iso-8859-2) and follow the usual
csts.dtd definition; output is produced in the same form (ISO Latin 2 encoding, csts.dtd).
(As is the case with many of the tools provided with PDT 1.0, both executables
also accept - and then produce - a "simplified SGML", which is not a real,
valid SGML, but simply contains at least the tags for words, punctuation,
and sentence breaks, one item per line.)
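Text stored in another encoding has to be converted before it is passed to the tools. As a minimal sketch, assuming the source file is UTF-8 and the standard iconv utility is installed, a small helper could look like this (to_latin2 is a hypothetical name, not part of the distribution; it changes only the encoding, not the markup):

```shell
# Hypothetical helper, not part of the distribution.
# Usage: to_latin2 UTF8_INPUT LATIN2_OUTPUT
# Assumes the input file is valid UTF-8.
to_latin2() {
    # iconv exits with an error if a character has no ISO Latin 2 equivalent.
    iconv -f UTF-8 -t ISO-8859-2 "$1" > "$2"
}
```

For example, to_latin2 input.utf8.txt input.iso2.txt writes the converted copy and leaves the original untouched. Because iconv stops on characters that cannot be represented, a failed conversion is reported rather than silently producing damaged input.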
The tagger and the included morphological
module are compiled for Linux (2.2.x and above, such as Red Hat 7.0 and
Unpack the HMMtgr.tgz archive in
a directory where you want the tagger to live, e.g.:
cp HMMtgr.tgz /usr/local/HMMtools
cd /usr/local/HMMtools
tar -xzvf HMMtgr.tgz
The morphological module
CZ010619x has to be installed in the same directory as the HMM tagger. Follow the installation guide attached to the morphological module.
Check the installation with the following
You should be able to see something
similar to this output:
libm.so.6 => /lib/libm.so.6 (0x40020000)
libc.so.6 => /lib/libc.so.6 (0x40040000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
The README.files file describes
the distribution files. If any of the files listed
in README.files is missing, the installation will not work properly.
Running the HMM-tagger
The main script has to be run from the
distribution directory. Follow this example:
run_all <INPUT_FILE> <OUTPUT_FILE>
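Because run_all has to be started from the distribution directory, input and output files that live elsewhere are easiest to pass as absolute paths. The following is only a sketch of one way to do that: run_tagger and the HMMTGR_DIR variable are hypothetical names, and the default directory is simply the example path from the installation step.

```shell
# Hypothetical wrapper, not part of the distribution.
# Usage: run_tagger INPUT_FILE OUTPUT_FILE
# HMMTGR_DIR is an assumed variable naming the distribution directory.
run_tagger() {
    # Resolve both paths before changing directory.
    in_dir=$(cd "$(dirname "$1")" && pwd) || return 1
    out_dir=$(cd "$(dirname "$2")" && pwd) || return 1
    in_abs=$in_dir/$(basename "$1")
    out_abs=$out_dir/$(basename "$2")
    # Subshell: the caller's working directory is left unchanged.
    ( cd "${HMMTGR_DIR:-/usr/local/HMMtools}" && ./run_all "$in_abs" "$out_abs" )
}
```

With this helper, run_tagger text.csts text.tagged.csts can be called from any working directory; the file names here are placeholders.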
The HMM tagger is a Unix application
that depends heavily on the Unix environment. It needs a fully configured
and working system:
perl v5.6.0 or above installed (check with perl -v)
bash (installed in /bin/bash) and the usual command-line utilities
such as tee, cat and others
/tmp directory with enough free space
For maximum performance, make sure
that the /tmp directory is locally mounted.
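The requirements above can be checked before the first run. The function below is a rough sketch, not a tool shipped with the tagger: check_prereqs and report are hypothetical names, and the checks cover only what this section lists (perl version, /bin/bash, tee, cat, a writable /tmp).

```shell
# Hypothetical helper, not part of the distribution: reports whether the
# prerequisites listed above appear to be present on this machine.
report() {  # $1 = description, remaining arguments = command to test
    desc=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "ok      $desc"
    else
        echo "MISSING $desc"
        status=1
    fi
}

check_prereqs() {
    status=0
    # $] holds the perl version as a number, e.g. 5.006 for v5.6.0.
    report "perl 5.6.0 or newer" perl -e 'exit($] >= 5.006 ? 0 : 1)'
    report "/bin/bash"           test -x /bin/bash
    report "tee"                 command -v tee
    report "cat"                 command -v cat
    report "writable /tmp"       test -w /tmp
    return "$status"
}
```

Calling check_prereqs prints one line per requirement and returns a non-zero status if anything is missing. Free space and the mount type of /tmp are not checked here; df -k /tmp shows both the available space and the file system that holds the directory.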
The HMM tagger consumes about 13-20 MB
of memory. The more memory it gets, the faster the I/O operations you can expect.
Tagging many small files tends to be very
CPU-expensive, because the training data is reloaded for every file. The
solution is to concatenate the files.
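The concatenation can be scripted. The sketch below assumes that the input files are in a form that can safely be joined end to end (for example the one-item-per-line simplified form) and that it is called from the distribution directory; tag_many and the RUN_ALL override are hypothetical names, not part of the distribution.

```shell
# Hypothetical wrapper, not part of the distribution.
# Usage: tag_many OUTPUT_FILE INPUT_FILE...
# RUN_ALL is an assumed override for the path of the run_all script.
tag_many() {
    out=$1; shift
    combined=$(mktemp /tmp/hmmtgr.XXXXXX) || return 1
    # Join all inputs so that the tagger loads its data only once.
    cat "$@" > "$combined" || { rm -f "$combined"; return 1; }
    "${RUN_ALL:-./run_all}" "$combined" "$out"
    rc=$?
    rm -f "$combined"
    return "$rc"
}
```

A call such as tag_many all.tagged part1 part2 part3 produces a single output file; splitting the result back into per-file pieces is left to the caller.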