Czech HMM Tagger and Morphology

Czech HMM-based Tagger (using full morphology)

Authors: Pavel Krbec (HMM Tagger) 2001, Jan Hajic (morphology) 2001

Description

The HMM-based tagger is an implementation of the Czech tagger developed at UFAL. To work, it requires preprocessing by a Czech morphological module with very high coverage. This module covers a superset of the Czech "HM" morphology. The morphological module and the tagger are supplied as two independent packages of binary executables, together with all necessary precompiled Czech data. Input must be encoded in ISO Latin 2 (iso-8859-2) and follow the usual csts.dtd definition; output is produced in the same encoding and format. (As with many of the tools provided with PDT 1.0, both executables also accept, and then produce, a "simplified SGML", which is not real, valid SGML but simply contains at least the tags for words, punctuation, and sentence breaks, one item per line.)
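
Text kept in another encoding has to be converted before tagging. The following is a minimal sketch, assuming the source text is UTF-8 and that the standard iconv utility is available; the function names and file names are hypothetical and not part of the distribution.

```shell
#!/bin/bash
# Hypothetical helpers (not shipped with the tagger).
# Assumption: the text outside the tagger is stored as UTF-8.

# Convert a UTF-8 stream on stdin to ISO-8859-2 (ISO Latin 2) on stdout.
# iconv exits with a non-zero status if a character cannot be represented.
to_latin2() {
    iconv -f UTF-8 -t ISO-8859-2
}

# Convert tagger output (ISO-8859-2) back to UTF-8.
from_latin2() {
    iconv -f ISO-8859-2 -t UTF-8
}

# Example usage (file names are placeholders):
#   to_latin2   < input.utf8.csts > input.csts
#   from_latin2 < output.csts     > output.utf8.csts
```

Characters that have no ISO Latin 2 equivalent make the conversion fail, so they should be replaced or removed beforehand.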

Supported platforms

The tagger and the included morphological module are compiled for Linux (kernel 2.2.x and above, such as Red Hat 7.0 and later).

Installation

Unpack the HMMtgr.tgz archive in a directory where you want the tagger to live, e.g.:

mkdir /usr/local/HMMtools
cp HMMtgr.tgz /usr/local/HMMtools
cd /usr/local/HMMtools
tar -xzvf HMMtgr.tgz

The morphological module CZ010619x has to be installed in the same directory as the HMM tagger. Follow the installation guide attached to the morphological module.

Check the installation with the following command:

ldd insert-hash

You should be able to see something similar to this output:
        libm.so.6 => /lib/libm.so.6 (0x40020000)
        libc.so.6 => /lib/libc.so.6 (0x40040000)
        /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
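
If a shared library cannot be resolved, ldd marks it with "not found" instead of a path. The check can be automated with a small filter; this is only a sketch, and the function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with the tagger).
# Reads ldd output on stdin, prints every unresolved library entry,
# and returns 1 if at least one was found, 0 otherwise.
report_missing_libs() {
    local missing
    # "|| true" keeps the assignment from failing when grep matches nothing.
    missing=$(grep 'not found' || true)
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
    return 0
}

# Example usage, from the distribution directory:
#   ldd insert-hash | report_missing_libs && echo "all libraries resolved"
```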

The README.files file describes the distribution files. If any of the files listed in README.files is missing, the installation will not work properly.

Running the HMM-tagger

The main script has to be run from the distribution directory. Follow this example:

cd /usr/local/HMMtools
run_all <INPUT_FILE> <OUTPUT_FILE>

The HMM tagger is a Unix application that depends heavily on the Unix environment. It needs a fully configured and working system.

Requirements:

Perl v5.6.0 or later (check with perl -v), installed as /usr/bin/perl
bash (installed as /bin/bash) and standard Unix utilities such as tee, cat and others
a /tmp directory with enough free space
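
The requirements above can be verified before the first run. The script below is a sketch under stated assumptions: the function names are hypothetical, and the free-space threshold MIN_TMP_KB is an arbitrary illustrative value, since the actual amount needed depends on the size of the input.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not shipped with the tagger).
# MIN_TMP_KB is an assumed threshold; the requirement only says "enough".
MIN_TMP_KB=${MIN_TMP_KB:-102400}

# Print the space available on the filesystem holding /tmp, in 1K blocks.
tmp_free_kb() {
    df -Pk /tmp | awk 'NR == 2 { print $4 }'
}

# Return 0 if every requirement is met, 1 otherwise; problems go to stderr.
check_requirements() {
    local status=0 tool free_kb

    # Perl must be at /usr/bin/perl and be version 5.6.0 ($] == 5.006) or later.
    if [ ! -x /usr/bin/perl ]; then
        echo "missing: /usr/bin/perl" >&2; status=1
    elif ! /usr/bin/perl -e 'exit($] >= 5.006 ? 0 : 1)'; then
        echo "perl is older than 5.6.0" >&2; status=1
    fi

    [ -x /bin/bash ] || { echo "missing: /bin/bash" >&2; status=1; }

    # Only tee and cat are named explicitly in the requirements.
    for tool in tee cat; do
        command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; status=1; }
    done

    free_kb=$(tmp_free_kb)
    if [ "${free_kb:-0}" -lt "$MIN_TMP_KB" ]; then
        echo "low free space in /tmp: ${free_kb:-0} KB" >&2; status=1
    fi

    return "$status"
}

# Example usage:
#   check_requirements && echo "environment looks usable"
```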

Performance tips:

For maximum performance, make sure that the /tmp directory is mounted locally (not on a network file system).
The HMM tagger consumes about 13-20 MB of memory. The more memory is available, the faster you can expect I/O operations to be.
Tagging many small files tends to be very CPU-expensive, because the training data is reloaded for each file. The solution is to concatenate the files into a single input.
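
Concatenation can be sketched as follows. This assumes the inputs are in the "simplified SGML" format (one item per line), where joining files end to end is expected to give a usable input; full csts.dtd documents carry their own headers and would need to be merged differently. The function name and file names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with the tagger).
# Usage: concat_inputs OUTPUT INPUT...
# Joins all INPUT files, in the order given, into OUTPUT so that the
# tagger is started, and its data loaded, only once.
# Assumption: inputs are "simplified SGML", one item per line.
concat_inputs() {
    local out=$1
    shift
    cat -- "$@" > "$out"
}

# Example usage (file names are placeholders):
#   cd /usr/local/HMMtools
#   concat_inputs /tmp/all.in part1.in part2.in part3.in
#   run_all /tmp/all.in /tmp/all.out
```

The combined output then has to be split back into per-file pieces if the original file boundaries matter.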