The Free Morphology (FM) is a pair of (almost) universal (i.e., language-independent) morphology tools (FMAnalyze.pl, FMGenerate.pl) for analysis and generation of word forms for inflective languages. A frequency-based, high coverage Czech dictionary is enclosed.
The FM works best for inflective languages which can be described using segmentation of a word form into two parts: a root and an ending. Even if linguistically not quite justified, many phenomena which would normally break this simple rule can be made to work in this framework.
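The root-plus-ending view can be illustrated with a minimal sketch in Python (not the Perl code of the FM itself). It only enumerates the candidate splits of a word form. The function name candidate_splits is hypothetical, and the assumption that the root is never empty is ours, not the source's. A real analyzer would presumably go on to look up each root and ending in the dictionary.

```python
def candidate_splits(word_form):
    """Yield every (root, ending) split of word_form.

    Assumption for this sketch: the root is non-empty, while the
    ending may be the empty string.
    """
    for cut in range(1, len(word_form) + 1):
        yield word_form[:cut], word_form[cut:]


# Illustrative word form only; no claim is made about the dictionary contents.
splits = list(candidate_splits("Prahy"))
for root, ending in splits:
    print(f"root={root!r} ending={ending!r}")
```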
Special provision is made in the code for up to two "inflectional" prefixes which might both be present in some word forms. Such prefixes are found in many Slavic languages, such as Czech, Slovak, Polish, etc. Adaptation to a different language only needs a simple change in the Perl source code (if anything).
Currently, the FM expects that the data being processed is 8-bit-coded, either in ISO Latin 2 or in the MS Windows CP 1250 code page. The dictionary is provided in the ISO Latin 2 (ISO-8859-2, suffix .il2) encoding, and the conversion between CP 1250 and ISO 8859-2 is done on the fly if needed (and requested, see below). For code conversions from/to different coding schemes (such as LaTeX or SGML entities), the included HBCode.pl utility can be used, or any other available conversion tool.
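For readers who prepare their data outside the FM, the following Python sketch shows one way to convert between the two 8-bit encodings using the standard codecs. It is not how the FM scripts do their on-the-fly conversion; the function names are hypothetical.

```python
def cp1250_to_latin2(data: bytes) -> bytes:
    """Re-encode CP 1250 bytes as ISO 8859-2 (ISO Latin 2) bytes."""
    return data.decode("cp1250").encode("iso8859_2")


def latin2_to_cp1250(data: bytes) -> bytes:
    """Re-encode ISO 8859-2 (ISO Latin 2) bytes as CP 1250 bytes."""
    return data.decode("iso8859_2").encode("cp1250")


# The two code pages place some letters (e.g. Ž, ť) at different byte values,
# so the raw bytes differ even though the text is the same.
sample_cp1250 = "Žluťoučký kůň".encode("cp1250")
sample_latin2 = cp1250_to_latin2(sample_cp1250)
print(sample_cp1250 != sample_latin2)
```

Characters that exist in only one of the two code pages raise a UnicodeEncodeError here; a production script would have to decide how to handle them.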
The analyzer (FMAna*.pl) can also be run in "accents-off" mode, which is invoked by adding the digit '7' to the dictionary name, just in front of the regular suffix. That is, for the supplied dictionary, use CZE-a7.il2 instead of CZE-a.il2 as the Dictionary-file (see below for usage notes). The analyzer then still uses the original file, but it recognizes input forms with all accents stripped off (many emails, SMS messages etc. still use "unaccented Czech"). It accepts word forms with no diacritics and analyzes them into all possible lemmas and tags, as if all possible combinations of diacritical marks were present. The output lemmas, comments etc. do carry diacritics, which is probably what you want; for conversion to a totally unaccented output, you might use the supplied HBCode.pl program. Please note that HBCode.pl cannot be used for restoring the accents in the original text, for obvious reasons; clever readers should, however, be able to use a particular combination of FMAnalyze.pl and FMGenerate.pl with little programming effort to do exactly that (without disambiguation, of course). Please note also that the generator (FMGen*.pl) does not display this behavior.
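The naming convention and the kind of input this mode is meant for can be sketched in Python as follows. Both helper names are hypothetical. The accent stripping uses Unicode decomposition, which is an assumption of this sketch and says nothing about how the analyzer matches unaccented forms internally.

```python
import unicodedata


def accents_off_name(dictionary_file):
    """Insert the digit '7' just in front of the regular suffix."""
    stem, dot, suffix = dictionary_file.rpartition(".")
    if not dot:  # no suffix at all: simply append the digit
        return dictionary_file + "7"
    return f"{stem}7.{suffix}"


def strip_accents(text):
    """Remove diacritical marks, producing 'unaccented Czech'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(accents_off_name("CZE-a.il2"))
print(strip_accents("kůň"))
```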
The current on-line (client/server) version of the tools, which works with the latest corrections and additions to the dictionary, can be found here (for those in the U.S., there might still be a working version here).
For more information on Czech morphology and tagging, see the Czech Language Morphology and Tagging page.
Both FMAnalyze.pl and FMGenerate.pl are perl scripts which have been tested on Perl 5.005_3 and are supposed to run on later versions of perl as well. For those working in the MS Windows environment, a working perl interpreter can be downloaded from ActiveState (versions for other platforms, if necessary, can be downloaded from there as well), or directly from this CD (ActiveState Perl 5.22 for MS Windows). This is the same perl interpreter as needed for TrEd (the tree editor), so you need to download and install it only once.
Two archives are provided with identical contents.
On Unix systems, unpack the FMorph.tgz archive:
tar -xzvf FMorph.tgz
or the FMorph.zip file using unzip (on Unix), pkunzip.exe (on DOS; the newer version handling long filenames is needed!), or WinZip or a similar tool (on any MS Windows platform).
For input and output in standard 8-bit ISO Latin2 coding of accented characters:
FMAnalyze.pl Dictionary-file [Mode] [Mode] < In > Out
FMAnaiso.pl Dictionary-file [Mode] [Mode] < In > Out
For input and output in the MS-Windows CP1250 code (standard Windows 8-bit code for Central and Eastern Europe):
FMAnawin.pl Dictionary-file [Mode] [Mode] < In > Out
On some systems, you might need to call it through the perl interpreter explicitly, e.g.:
perl FMAnawin.pl Dictionary-file [Mode] [Mode] < In > Out
Possible Modes:
Running FMAnalyze.pl without parameters prints out a list of supported options.
The input to FMAnalyze.pl is a free-format text (in either ISO Latin 2 or MS Windows CP 1250 8-bit coding). The program also recognizes several other formats and treats them as well as it can: the csts format (used for the Prague Dependency Treebank project, as well as for the Czech National Corpus), the HTML format (i.e., web pages can be fed directly into the program) and "other" SGML format. Effectively, the program also does tokenization of the input (if the input is not in the csts format, which is supposed to be tokenized already), but not sentence detection; the output of the tokenizer is, however, only a "simplified SGML", which omits most of the otherwise required SGML elements as specified in the csts.dtd.
The output format is a simplified csts-like SGML markup, with one token per line. This format is accepted by further tools provided with PDT 1.0, even though it does not strictly adhere to the csts.dtd.
Additional lemma information is always present (except for paradigm names, which are never present). The following one-letter special codes apply (within the <MMl> element only):
Examples:
For input and output in standard 8-bit ISO Latin2 coding of accented characters:
FMGenerate.pl Dictionary-file [Mode...] < In > Out
FMGeniso.pl Dictionary-file [Mode...] < In > Out
For input and output in the MS-Windows CP1250 code (standard Windows 8-bit code for Central and Eastern Europe):
FMGenwin.pl Dictionary-file [Mode...] < In > Out
On some systems, you might need to call it through the perl interpreter explicitly, e.g.:
perl FMGenwin.pl Dictionary-file [Mode...] < In > Out
Possible Modes:
Running FMGenerate.pl without any parameters prints out a list of supported options.
Since the generation algorithm is a kind of inverse of morphological analysis, its input must specify two things:
The tag can be a fully specified tag (using the chosen tag system, either compact or positional), or an underspecified tag template. Two characters have special meaning in the tag template: a dot (.) and a star (asterisk: *). The dot represents any single character (symbol) in the tag and can be used anywhere in the tag template. The star represents any sequence of symbols (including the empty string), but it can be used only at the end of the tag template. It may also be the only symbol in the template, in which case it requests all possible forms of the given lemma.
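The template rules above can be sketched in Python by translating a template into a regular expression. This is only an illustration of the matching semantics as described, not the generator's actual Perl implementation; the function names are hypothetical, and the sample tag is the positional tag from one of the Ending record examples in the dictionary description.

```python
import re


def template_to_regex(template):
    """Translate a tag template into a compiled regular expression.

    '.' matches any single symbol; '*' is accepted only as the last
    character and matches any (possibly empty) sequence of symbols.
    """
    has_star = template.endswith("*")
    body = template[:-1] if has_star else template
    if "*" in body:
        raise ValueError("'*' may be used only at the end of a tag template")
    # Every character other than '.' is matched literally.
    pattern = "".join("." if ch == "." else re.escape(ch) for ch in body)
    if has_star:
        pattern += ".*"
    return re.compile(pattern, re.DOTALL)


def tag_matches(template, tag):
    """Return True if the whole tag fits the template."""
    return template_to_regex(template).fullmatch(tag) is not None


tag = "CyFP7---------6"
print(tag_matches("CyFP7*", tag))           # star at the end
print(tag_matches("C.FP7---------6", tag))  # dot in the second position
print(tag_matches("*", tag))                # request for all forms
```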
The two input parameters (requests) have to be formatted properly using the following SGML tags:
The output is presented again in SGML:
Examples:
The dictionary is used directly by the code in source format in order to allow for easy modification and a quick development cycle. Under this scenario, there is obviously a certain time penalty associated with loading and internally storing the dictionary (which is done using perl hashes, which are notoriously slow, especially when swapping is involved; make sure you always have enough physical memory available). However, we believe that the possibility of quickly testing any modification made to the dictionary outweighs this disadvantage. It is not terribly slow either: the initialization of the supplied Czech dictionary takes about 10 seconds of CPU time on a Pentium III 650 MHz machine.
There are three record types in the dictionary:
First char on line | Record description |
; | Comments |
R | Root record |
E | Ending record |
For the Root and Ending records, each record is separated by vertical bars (|) into fields; the identification character is the sole content of the first field in a record.
The Root Record has 10 fields:
Field number | Field description |
1 | R |
2 | Paradigm name |
3 | Root string |
4 | Lemma |
5 | Tag 1 (Compact) (or 0 if not present) |
6 | Tag 2 (Positional) (or 0 if not present) |
7 | Alternate POS |
8 | Semantic single-letter "tag(s)" (or 0 if none) |
9 | Style single-letter "tag(s)" (or 0 if none) |
10 | Comment (or 0 if none) |
Examples:
R|zn4|Pra|Praha|0|0|N|G|0|0
R|0abbr|iso|ISO-1|NFXX@-8|NNFXX-----@---8|N|KB|0|(Intl._Standards_Org.)
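As an illustration of the field layout, here is a minimal Python sketch that splits a Root record into named fields. The class and function names are hypothetical, and the sketch assumes that no field (including the comment) contains a vertical bar. A "0" is mapped to None only for the fields where the table documents it as meaning "not present".

```python
from typing import NamedTuple, Optional


class RootRecord(NamedTuple):
    paradigm: str
    root: str
    lemma: str
    compact_tag: Optional[str]
    positional_tag: Optional[str]
    alternate_pos: str
    semantic: Optional[str]
    style: Optional[str]
    comment: Optional[str]


def parse_root_record(line):
    """Parse one 'R' line of the dictionary into a RootRecord."""
    fields = line.rstrip("\r\n").split("|")
    if len(fields) != 10 or fields[0] != "R":
        raise ValueError(f"not a 10-field Root record: {line!r}")

    def optional(value):
        # '0' stands for "not present" in the optional fields.
        return None if value == "0" else value

    _, paradigm, root, lemma, tag1, tag2, alt_pos, sem, style, comment = fields
    return RootRecord(paradigm, root, lemma, optional(tag1), optional(tag2),
                      alt_pos, optional(sem), optional(style), optional(comment))


praha = parse_root_record("R|zn4|Pra|Praha|0|0|N|G|0|0")
iso = parse_root_record(
    "R|0abbr|iso|ISO-1|NFXX@-8|NNFXX-----@---8|N|KB|0|(Intl._Standards_Org.)")
print(praha.lemma, praha.paradigm)
print(iso.lemma, iso.comment)
```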
The Ending Record has 7 fields:
Field number | Field description |
1 | E |
2 | Paradigm name |
3 | Prefix 1 (Negation) allowed (0/1) |
4 | Prefix 2 (Superlative) allowed (0/1) |
5 | Ending string (0 for empty string) |
6 | Tag 1 (Compact) |
7 | Tag 2 (Positional) |
Currently the code defines Prefix 1 (the negative prefix) as ne, and Prefix 2 (the superlative prefix) as nej. The corresponding placeholders in the tags are @ for negation, replaced by A (affirmative, negation prefix not present) or N (negative prefix present), and # for comparative/superlative, replaced by 2 (comparative only, superlative prefix not present) or 3 (superlative prefix present). If this needs to be changed (for another language, for example), including the correct order of Prefix 1 and 2, see the function MorphAnalyzePrefixedForm in FMAnalyze.pl and GetDictionaryGen in FMGenerate.pl. It is expected that these definitions will become part of the Dictionary in some future version of the tools.
Examples:
E|adv23|1|1|ji|DG#@|Dg-------#@----
E|ccf|0|0|ama|CFFP7-6|CyFP7---------6
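The Ending record layout and the placeholder substitution described above can be sketched in Python as follows. The function names and dictionary keys are hypothetical, and the sketch covers only the tag placeholders (@ and #), not how the prefixes are attached to the word form. The actual logic lives in the Perl functions named above.

```python
def parse_ending_record(line):
    """Parse one 'E' line of the dictionary into a plain dict."""
    fields = line.rstrip("\r\n").split("|")
    if len(fields) != 7 or fields[0] != "E":
        raise ValueError(f"not a 7-field Ending record: {line!r}")
    _, paradigm, neg, sup, ending, compact, positional = fields
    return {
        "paradigm": paradigm,
        "negation_allowed": neg == "1",
        "superlative_allowed": sup == "1",
        "ending": "" if ending == "0" else ending,  # '0' marks the empty ending
        "compact_tag": compact,
        "positional_tag": positional,
    }


def resolve_placeholders(tag, negated, superlative):
    """Replace '@' by A/N and '#' by 2/3 according to the prefixes present."""
    tag = tag.replace("@", "N" if negated else "A")
    return tag.replace("#", "3" if superlative else "2")


adv = parse_ending_record("E|adv23|1|1|ji|DG#@|Dg-------#@----")
plain = resolve_placeholders(adv["positional_tag"], negated=False, superlative=False)
both = resolve_placeholders(adv["positional_tag"], negated=True, superlative=True)
print(plain)
print(both)
```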
One dictionary file is supplied:
Filename | Description | Code |
CZE-a.il2 | Czech Morphology, with full diacritics | ISO Latin 2 (iso-8859-2) |