Czech Morphology (Limited, Perl implementation)

Czech "Free" Morphology

Author: Jan Hajic, 2000-2001

Download

Description

The Free Morphology (FM) is a pair of (almost) universal (i.e., language-independent) morphology tools (FMAnalyze.pl, FMGenerate.pl) for analysis and generation of word forms for inflective languages. A frequency-based, high coverage Czech dictionary is enclosed.

The FM works best for inflective languages which can be described using segmentation of a word form into two parts: a root and an ending. Even if linguistically not quite justified, many phenomena which would normally break this simple rule can be made to work in this framework.
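The root-plus-ending idea can be illustrated with a toy lookup. This is only a sketch in Python, not the Perl code of FMAnalyze.pl: the root `aut`, the paradigm name `toy_pdgm`, and the tiny tables are hypothetical. The three tags are the ones shown for the forms auty, autem and autama in the generator examples further down.

```python
# Toy tables (hypothetical, not taken from CZE-a.il2).
ROOTS = {"aut": [("toy_pdgm", "auto")]}  # root -> [(paradigm, lemma)]
ENDINGS = {
    "toy_pdgm": {
        "y": ["NNNP7-----A----"],
        "em": ["NNNS7-----A----"],
        "ama": ["NNNP7-----A---6"],
    }
}


def analyze(form):
    """Try every split of `form` into root + ending and collect (lemma, tag) pairs."""
    results = []
    for cut in range(len(form) + 1):
        root, ending = form[:cut], form[cut:]
        for paradigm, lemma in ROOTS.get(root, []):
            for tag in ENDINGS.get(paradigm, {}).get(ending, []):
                results.append((lemma, tag))
    return results


print(analyze("autem"))
```

A form is recognized only if some split yields a known root whose paradigm allows the remaining ending; everything else returns an empty list.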

Special provision is made in the code for up to two "inflectional" prefixes which might both be present in some word forms. Such prefixes are found in many Slavic languages, such as Czech, Slovak, Polish, etc. Adaptation to a different language only needs a simple change in the Perl source code (if anything).

Currently, the FM expects that the data being processed is 8-bit-coded, either in ISO Latin 2 or in the MS Windows CP 1250 code page. The dictionary is provided in the ISO Latin 2 (ISO-8859-2, suffix .il2) encoding, and the conversion between CP 1250 and ISO 8859-2 is done on-the-fly if needed (and requested, see below). For conversions from/to other coding schemes (such as LaTeX or SGML entities), the included HBCode.pl utility can be used, or any other available conversion tool.
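If text has to be re-encoded outside the tools, the two 8-bit encodings can be converted with standard codecs. The following is a minimal Python sketch, unrelated to HBCode.pl; the function names are made up for this example.

```python
def cp1250_to_latin2(data: bytes) -> bytes:
    """Re-encode CP 1250 bytes as ISO 8859-2 bytes."""
    return data.decode("cp1250").encode("iso8859_2")


def latin2_to_cp1250(data: bytes) -> bytes:
    """Re-encode ISO 8859-2 bytes as CP 1250 bytes."""
    return data.decode("iso8859_2").encode("cp1250")


sample = "Příliš žluťoučký kůň".encode("cp1250")
converted = cp1250_to_latin2(sample)
print(converted.decode("iso8859_2"))
```

The byte values differ between the two code pages for letters such as š, ž and ť, which is why feeding CP 1250 text to the ISO Latin 2 variant of a tool (or vice versa) produces wrong results.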

The analyzer (FMAna*.pl) can also be run in "accents-off" mode, which is invoked by adding the digit '7' just after the dictionary name, in front of the regular suffix. I.e., for the supplied dictionary, use CZE-a7.il2 instead of CZE-a.il2 as the Dictionary-file (see below for usage notes). The analyzer then uses the original file, but it recognizes input forms with all accents stripped off (many emails, SMS messages etc. still use "unaccented Czech"). It accepts word forms with no diacritics and analyzes them into all possible lemmas and tags, as if all possible combinations of diacritical marks were present. The output lemma, comments etc. still carry diacritics (which is probably what you want; for conversion to a totally unaccented output, you might use the supplied HBCode.pl program). Please note that HBCode.pl cannot be used for restoring the accents in the original text, for obvious reasons; clever readers should, however, be able to use a particular combination of FMAnalyze.pl and FMGenerate.pl with little programming effort to do exactly that (without disambiguation, of course). Please note that the generator (FMGen*.pl) does not display this behavior.
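To make "unaccented Czech" concrete, the sketch below strips diacritics from a Unicode string in Python. It only illustrates what such input looks like; it is not how HBCode.pl or the analyzer work internally, and `strip_accents` is a name invented here.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Příliš žluťoučký kůň"))  # Prilis zlutoucky kun
```

The mapping is many-to-one (for example, both "u" and "ů" end up as "u"), which is why the accents-off mode has to consider all possible combinations of diacritical marks.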

The current on-line (client/server) version of the tools, which works with the latest corrections and additions to the dictionary, can be found here (for those in the U.S., there might still be a working version here).

For more information on Czech morphology and tagging, see the Czech Language Morphology and Tagging page.

Supported platforms

Both FMAnalyze.pl and FMGenerate.pl are perl scripts which have been tested on Perl 5.005_3 and are supposed to run on later versions of perl as well. For those working in the MS Windows environment, a working perl interpreter can be downloaded from ActiveState (versions for other platforms - if necessary - can be downloaded from there as well), or directly from this CD (ActiveState Perl 5.22 for MS Windows). This is the same perl interpreter as needed for TrEd (the tree editor); thus you need to download it and install only once.

Installation

Two archives are provided with identical contents.

On Unix systems, unpack the FMorph.tgz archive:

tar -xzvf FMorph.tgz

or the FMorph.zip file using unzip (on Unix), pkunzip.exe (on DOS; the newer version handling long filenames is needed!), or WinZip or a similar tool (on any MS Windows platform).

Running `FMAnalyze.pl`

For input and output in standard 8-bit ISO Latin2 coding of accented characters:

FMAnalyze.pl Dictionary-file [Mode] [Mode] < In > Out

FMAnaiso.pl Dictionary-file [Mode] [Mode] < In > Out

For input and output in the MS-Windows CP1250 code (standard Windows 8-bit code for Central and Eastern Europe):

FMAnawin.pl Dictionary-file [Mode] [Mode] < In > Out

On some systems, you might need to call it through the perl interpreter explicitly, e.g.:

perl FMAnawin.pl Dictionary-file [Mode] [Mode] < In > Out

Possible Modes:

demo: for interactive testing

all: no output filtering (default)

news, coll, old, spell: various output filtering modes for backward compatibility with the Czech National Corpus. Avoid if possible.

Running FMAnalyze.pl without parameters prints out a list of supported options.

Input Format

The input to FMAnalyze.pl is a free-format text (in either ISO Latin 2 or MS Windows CP 1250 8-bit coding). The program also recognizes several other formats and treats them as well as it can: the csts format (used for the Prague Dependency Treebank project, as well as for the Czech National Corpus), the HTML format (i.e., web pages can be fed directly into the program) and "other" SGML format. Effectively, the program also does tokenization of the input (if the input is not in the csts format, which is supposed to be tokenized already), but not sentence detection; the output of the tokenizer is, however, only a "simplified SGML", which omits most of the otherwise required SGML elements as specified in the csts.dtd.

Output Format

The output format is a simplified csts-like SGML markup, with one token per line. This format is accepted by further tools provided with PDT 1.0, even though it does not strictly adhere to the csts.dtd.

Additional lemma information is always present (except for paradigm names, which are never present). The following one-letter special codes apply (within the <MMl> only):

^ for human-readable comments;

: for syntactic codes

; for semantic codes

, for style/variant/regional etc. codes

Every code and/or comment is separated from the lemma or a preceding code by an underscore (_). Underscores within comments denote spaces, therefore absolutely no spaces appear on the output.
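The lemma field can be taken apart mechanically. The Python sketch below assumes that a new code always starts with an underscore immediately followed by one of the four code characters, and that comments never contain such a sequence themselves; both are assumptions of this example, not guarantees made by the tools.

```python
import re

CODE_NAMES = {"^": "comment", ":": "syntactic", ";": "semantic", ",": "style"}


def split_lemma_field(field):
    """Split an <MMl> value into the bare lemma and a list of (kind, value) codes."""
    # Split only at underscores that are followed by a code character,
    # so underscores inside comments (which stand for spaces) are kept.
    parts = re.split(r"_(?=[\^:;,])", field)
    lemma, codes = parts[0], []
    for part in parts[1:]:
        kind, value = CODE_NAMES[part[0]], part[1:]
        if kind == "comment":
            value = value.replace("_", " ")
        codes.append((kind, value))
    return lemma, codes


print(split_lemma_field("rezignovat_:T"))
print(split_lemma_field("svůj-1_^(přivlast.)"))
```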

Examples:

Input:

Prezident rezignoval na svou funkci.

Output:

<csts>
<f cap>Prezident<MMl>prezident<MMt>NNMS1-----A----
<f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---
<f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------
<f>svou<MMl>svůj-1_^(přivlast.)<MMt>P8FS4---------1<MMt>P8FS7---------1
<f>funkci<MMl>funkce<MMt>NNFS3-----A----<MMt>NNFS4-----A----<MMt>NNFS6-----A----
<D>
<d>.<MMl>.<MMt>Z:-------------
</csts>
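For downstream processing, each token line of the analyzer output can be read into a form plus its lemma/tag analyses. This is a hedged Python sketch based only on the example above: it assumes one token per line, that token lines start with `<f ...>` or `<d ...>`, and that every `<MMt>` belongs to the most recent `<MMl>`.

```python
import re


def parse_token_line(line):
    """Return (form, [(lemma, [tags...]), ...]) or None for non-token lines."""
    match = re.match(r"<[fd](?:\s[^>]*)?>([^<]*)(.*)", line.strip())
    if match is None:
        return None  # e.g. <csts>, </csts>, <D>
    form, rest = match.groups()
    analyses = []
    for marker, value in re.findall(r"<(MMl|MMt)>([^<]*)", rest):
        if marker == "MMl":
            analyses.append((value, []))
        elif analyses:
            analyses[-1][1].append(value)
    return form, analyses


print(parse_token_line("<f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------"))
```

Lines such as `<csts>` or `<D>` yield `None`, so a caller can simply skip them.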

Running `FMGenerate.pl`

For input and output in standard 8-bit ISO Latin2 coding of accented characters:

FMGenerate.pl Dictionary-file [Mode...] < In > Out

FMGeniso.pl Dictionary-file [Mode...] < In > Out

For input and output in the MS-Windows CP1250 code (standard Windows 8-bit code for Central and Eastern Europe):

FMGenwin.pl Dictionary-file [Mode...] < In > Out

On some systems, you might need to call it through the perl interpreter explicitly, e.g.:

perl FMGenwin.pl Dictionary-file [Mode...] < In > Out

Possible Modes:

demo: for interactive testing

all: no output filtering (default)

positional (default) / compact: tag system on input

lemmainfo / nolemmainfo (default): output basic additional lemma information

pdgm / nopdgm (default): output paradigm name

comments / nocomments (default): output human-readable lemma comment

All Modes are optional; however, lemmainfo must be specified for the other two info Modes to be effective. Slash-separated modes in the above list are mutually exclusive.

Running FMGenerate.pl without any parameters prints out a list of supported options.

Input Format

Since the generation algorithm is a kind of reverse function to morphological analysis, its input must specify two things:

the lemma and

the tag.

The lemma must be a fully specified lemma, including the -n "suffix" (if any). No comments or other lemma information usually received as the result of morphological analysis are needed; in fact, such information can be obtained in the output if requested by the three optional Modes.

The tag can be a fully specified tag (using the chosen tag system, either compact or positional), or an underspecified tag template. Two characters have special meaning in the tag template: a dot (.) and a star (asterisk: *). The dot represents any single character (symbol) in the tag and can be used anywhere in the tag template. The star represents any sequence of symbols (including the empty string), but it can be used only at the end of the tag template. It can also be the only symbol in the tag template, denoting a request for all possible forms of the given lemma.
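The template rules can be mirrored with a regular expression. The Python sketch below is an illustration of the matching semantics as described in the text, not the code used by FMGenerate.pl; the function names are invented for this example.

```python
import re


def template_to_regex(template):
    """Compile a tag template ('.' = any one symbol, trailing '*' = any rest)."""
    if "*" in template[:-1]:
        raise ValueError("'*' is only allowed as the last character")
    has_star = template.endswith("*")
    body = template[:-1] if has_star else template
    # Keep '.' as a wildcard, escape every other character literally.
    pattern = "".join("." if ch == "." else re.escape(ch) for ch in body)
    if has_star:
        pattern += ".*"
    return re.compile(pattern)


def tag_matches(template, tag):
    """True if the fully specified `tag` fits the `template`."""
    return template_to_regex(template).fullmatch(tag) is not None


print(tag_matches("NNN.7*", "NNNS7-----A----"))  # True
```

With this reading, the template `*` matches every tag, and a fully specified tag matches only itself.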

The two input parameters (requests) have to be formatted properly using the following SGML tags:

<Gil> for the lemma and

<Git> for the tag,

in this order, on a single line.

Output Format

The output is presented again in SGML:

<Gel> for the lemma (just copied from input <Gil>) and a sequence of pairs (or triples):

<Gei> lemma information (optional)

<Get> tag (always fully specified)

<Gef> generated form.

If the lemma is not found, or if it is not possible to generate any form based on the input tag or tag template, only the lemma is output. All results are output to a single line of text. Within the lemma info field (<Gei>), the same special codes appear as in the output of the analyzer, plus a code for a paradigm name (@).

Examples:

Input:
<Gil>auto<Git>NNNP7-----A----

Output (lemmainfo off):
<Gel>auto<Get>NNNP7-----A----<Gef>auty

Output (all options on):
<Gel>auto<Gei>_:N_@mt1x<Get>NNNP7-----A----<Gef>auty

Input:
<Gil>auto<Git>NNN.7*

Output (all options on):
<Gel>auto<Gei>_:N_@mt1x<Get>NNNP7-----A----<Gef>auty<Gei>_:N_@mt1x<Get>NNNP7-----A---6<Gef>autama<Gei>_:N_@mt1x<Get>NNNS7-----A----<Gef>autem
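A small helper pair can build request lines and read the single-line results back. This Python sketch relies only on the element names shown above; the function names and the dictionary layout of the result are choices made for this example.

```python
import re


def format_request(lemma, tag_template):
    """Build one generator input line: lemma first, then tag, on a single line."""
    return f"<Gil>{lemma}<Git>{tag_template}"


def parse_generator_line(line):
    """Return (lemma, [{'info': ..., 'tag': ..., 'form': ...}, ...])."""
    fields = re.findall(r"<(Gel|Gei|Get|Gef)>([^<]*)", line.strip())
    if not fields or fields[0][0] != "Gel":
        raise ValueError("line does not start with <Gel>")
    lemma, results = fields[0][1], []
    info = tag = None
    for marker, value in fields[1:]:
        if marker == "Gei":
            info = value
        elif marker == "Get":
            tag = value
        else:  # Gef closes one (info, tag, form) group
            results.append({"info": info, "tag": tag, "form": value})
            info = tag = None
    return lemma, results


print(format_request("auto", "NNN.7*"))
print(parse_generator_line("<Gel>auto<Get>NNNP7-----A----<Gef>auty"))
```

When only the lemma comes back (lemma not found, or no form fits the template), the result list is empty.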

The Dictionary Format

The dictionary is used directly by the code in source format in order to allow for easy modification and a quick development cycle. Under this scenario, there is obviously a certain time penalty associated with loading and internally storing the dictionary (which is done using perl hashes, which are notoriously slow, especially when swapping is involved; make sure you always have enough physical memory available). However, we believe that the possibility of quickly testing any modification made to the dictionary outweighs this disadvantage. It is not terribly slow either: the initialization of the supplied Czech dictionary takes about 10 sec. of CPU time on a Pentium III 650MHz machine.

Record Types

There are three record types in the dictionary:

First char on line	Record description
`;`	Comments
`R`	Root record
`E`	Ending record

For the Root and Ending records, the records are separated by vertical bars (|) into fields; the identification character is the sole contents of the first field in a record.

The Root Record

The Root Record has 10 fields:

Field number	Field description
1	R
2	Paradigm name
3	Root string
4	Lemma
5	Tag 1 (Compact) (or 0 if not present)
6	Tag 2 (Positional) (or 0 if not present)
7	Alternate POS
8	Semantic single-letter "tag(s)" (or 0 if none)
9	Style single-letter "tag(s)" (or 0 if none)
10	Comment (or 0 if none)

Examples:

R|zn4|Pra|Praha|0|0|N|G|0|0
R|0abbr|iso|ISO-1|NFXX@-8|NNFXX-----@---8|N|KB|0|(Intl._Standards_Org.)
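Reading a Root record amounts to splitting on the vertical bar and mapping `0` to "not present" where the table allows it. The following Python sketch follows the field table above; the class and field names are invented for this example, and the Alternate POS field is kept verbatim because the table does not define a `0` convention for it.

```python
from typing import NamedTuple, Optional


class RootRecord(NamedTuple):
    paradigm: str
    root: str
    lemma: str
    compact_tag: Optional[str]
    positional_tag: Optional[str]
    alternate_pos: str
    semantic: Optional[str]
    style: Optional[str]
    comment: Optional[str]


def parse_root_record(line):
    """Parse one 'R|...' dictionary line into a RootRecord."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 10 or fields[0] != "R":
        raise ValueError("not a 10-field Root record")

    def opt(value):
        return None if value == "0" else value

    _, pdgm, root, lemma, tag1, tag2, pos, sem, style, comment = fields
    return RootRecord(pdgm, root, lemma, opt(tag1), opt(tag2), pos,
                      opt(sem), opt(style), opt(comment))


print(parse_root_record("R|zn4|Pra|Praha|0|0|N|G|0|0"))
```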

The Ending Record

The Ending Record has 7 fields:

Field number	Field description
1	E
2	Paradigm name
3	Prefix 1 (Negation) allowed (0/1)
4	Prefix 2 (Superlative) allowed (0/1)
5	Ending string (0 for empty string)
6	Tag 1 (Compact)
7	Tag 2 (Positional)

Currently the code defines Prefix 1 (the negative prefix) as ne, and Prefix 2 (the superlative prefix) as nej. The corresponding placeholders in the tags are @ for negation (replaced by A (affirmative, negation prefix not present) or N (negative prefix present)), and # for comparative/superlative (replaced by 2 (comparative only, superlative prefix not present) or 3 (superlative prefix present)). If this needs to be changed (for another language, for example), including the correct order of Prefix 1 and 2, see the function MorphAnalyzePrefixedForm in FMAnalyze.pl and GetDictionaryGen in FMGenerate.pl. It is expected that these definitions will become part of the Dictionary in some future version of the tools.

Examples:

E|adv23|1|1|ji|DG#@|Dg-------#@----
E|ccf|0|0|ama|CFFP7-6|CyFP7---------6
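The Ending record and the two tag placeholders can be handled along the same lines. This Python sketch implements only what the table and the paragraph on prefixes state; the function names are hypothetical, and it does not model how the Perl code detects the prefixes in a word form.

```python
def parse_ending_record(line):
    """Parse one 'E|...' dictionary line into a dict."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 7 or fields[0] != "E":
        raise ValueError("not a 7-field Ending record")
    _, paradigm, neg, sup, ending, compact, positional = fields
    return {
        "paradigm": paradigm,
        "negation_allowed": neg == "1",
        "superlative_allowed": sup == "1",
        "ending": "" if ending == "0" else ending,  # 0 stands for the empty string
        "compact_tag": compact,
        "positional_tag": positional,
    }


def resolve_placeholders(tag, negated, superlative):
    """Replace '@' by A/N and '#' by 2/3 according to the prefixes present."""
    tag = tag.replace("@", "N" if negated else "A")
    return tag.replace("#", "3" if superlative else "2")


record = parse_ending_record("E|adv23|1|1|ji|DG#@|Dg-------#@----")
print(resolve_placeholders(record["positional_tag"], negated=False, superlative=True))
# Dg-------3A----
```

Tags without placeholders, such as those in the second example record, pass through unchanged.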

Supplied Dictionary

One dictionary file is supplied:

Filename	Description	Code
CZE-a.il2	Czech Morphology, with full diacritics	ISO Latin 2 (iso-8859-2)

Use this dictionary as the Dictionary-file. For "unaccented Czech" processing on input to the analyzer, use CZE-a7.il2 instead (the program will in fact use the same dictionary, CZE-a.il2, but in the "unaccented" mode of operation).
