System MORFO for morphological analysis of Czech
Morphological analysis assigns (lemma, tag) pairs to every wordform. The lemma is the basic form of the word; the tag is a code describing the morphological properties of the wordform. In general, a single wordform may be assigned several (lemma, tag) pairs, and the morphological analyzer returns all of them. Selecting the most appropriate pair in a given context is a task for the tagger. The MORFO system consists of four units:
- the analyzer,
- the generator,
- the dictionary editor,
- the library with the shared source code for handling dictionary objects.
The morphological analyzer works with a large morphological dictionary that contains the great majority of Czech lemmas together with the morphological patterns describing all their wordforms. In addition, the analyzer can recognize new words (those that cannot be derived from the dictionary) that were created from existing ones by adding a prefix. The prefixes were acquired automatically from the Czech National Corpus. Another new feature is the treatment of foreign words starting with an upper-case letter. If they are not present in the dictionary, they always get the tag NNXXX-----A----, meaning a noun with underspecified gender, number and case. This is only a heuristic, but it works well.
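The lookup order described above can be illustrated with a minimal Python sketch. It is not the MORFO implementation (which is written in C): the toy dictionary, the placeholder tags, the prefix list, the function name `analyze`, and the order of the two fallbacks are all assumptions made for illustration. Only the tag NNXXX-----A---- is taken from the text. The sketch also assumes that a prefixed word inherits the tags of its base word and that the prefix is prepended to the lemma.

```python
# Tag quoted in the text for unknown words starting with an upper-case letter.
UNKNOWN_CAPITALIZED_TAG = "NNXXX-----A----"

# Hypothetical toy dictionary: wordform -> list of (lemma, tag) pairs.
# The tags are placeholders, not real MORFO tags.
TOY_DICTIONARY = {
    "jedu": [("jed", "NOUN-PLACEHOLDER"), ("jet", "VERB-PLACEHOLDER")],
    "hrad": [("hrad", "NOUN-PLACEHOLDER")],
}

# Hypothetical prefix list; the real one was acquired from a corpus.
TOY_PREFIXES = ("super", "pseudo")


def analyze(wordform):
    """Return all (lemma, tag) pairs for a wordform, or [] if none are found."""
    # 1. Direct dictionary lookup: return every analysis, ambiguity included.
    if wordform in TOY_DICTIONARY:
        return list(TOY_DICTIONARY[wordform])

    # 2. Prefix fallback (assumed behaviour): strip a known prefix and
    #    analyze the remainder, prepending the prefix to each lemma.
    for prefix in TOY_PREFIXES:
        if wordform.startswith(prefix):
            base = wordform[len(prefix):]
            if base in TOY_DICTIONARY:
                return [(prefix + lemma, tag)
                        for lemma, tag in TOY_DICTIONARY[base]]

    # 3. Heuristic for unknown words starting with an upper-case letter.
    if wordform[:1].isupper():
        return [(wordform, UNKNOWN_CAPITALIZED_TAG)]

    return []


if __name__ == "__main__":
    for word in ("jedu", "superhrad", "Reykjavik", "xyz"):
        print(word, analyze(word))
```

In this sketch an ambiguous wordform such as "jedu" yields two pairs, which matches the statement that the analyzer returns every analysis and leaves disambiguation to the tagger.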
The analyzer follows the older tools of Jan Hajič. It was reimplemented in order to process XML formats and to make the tools easier to maintain.
The morphological generator is the opposite of the analyzer: given a (lemma, tag) pair, it generates the appropriate wordform. However, the analyzer and the generator are not exact inverses, because the generator cannot generate words that are not covered by the morphological dictionary. The analyzer and the generator are described in the user documentation and the technical documentation.
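The asymmetry between analysis and generation can be sketched in Python as follows. This is an illustration under stated assumptions, not the MORFO generator: the toy dictionary, the placeholder tags, and the names `build_generation_index` and `generate` are hypothetical. The sketch simply inverts a wordform-to-analyses table, so any (lemma, tag) pair that does not come from the dictionary has nothing to map to.

```python
# Hypothetical toy dictionary: wordform -> list of (lemma, tag) pairs.
# The tags are placeholders, not real MORFO tags.
TOY_DICTIONARY = {
    "hrad": [("hrad", "NOUN-SG-PLACEHOLDER")],
    "hrady": [("hrad", "NOUN-PL-PLACEHOLDER")],
}


def build_generation_index(dictionary):
    """Invert the dictionary: (lemma, tag) -> list of wordforms."""
    index = {}
    for wordform, analyses in dictionary.items():
        for lemma, tag in analyses:
            index.setdefault((lemma, tag), []).append(wordform)
    return index


GENERATION_INDEX = build_generation_index(TOY_DICTIONARY)


def generate(lemma, tag):
    """Return the wordforms for a (lemma, tag) pair, or [] if not covered."""
    return GENERATION_INDEX.get((lemma, tag), [])


if __name__ == "__main__":
    # Covered by the dictionary, so a wordform is produced.
    print(generate("hrad", "NOUN-PL-PLACEHOLDER"))
    # A pair an analyzer could assign heuristically to an unknown capitalized
    # word; it is not in the dictionary, so nothing can be generated.
    print(generate("Reykjavik", "NNXXX-----A----"))
```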
The dictionary editor MorfoEd is the tool for maintaining the morphological dictionary. It can be used to add new entries and to delete or change existing ones. It is user-friendly, and using it minimizes the risk of entering data in a wrong format. It also contains a guesser of morphological patterns (trained on the existing entries) that may speed up and simplify work with the editor. The editor is described in the user documentation and the technical documentation.
All programs except the editor were written in C for operating environments that conform to the Single Unix Specification. Their interface is line-oriented. The editor was written in Perl and uses the Tk graphical library.
The project MORFO may be used without the editor.
The format of the data is XML.
Licence: GNU compatible (libPNG)
Contact: David Kolovratník, Leoš Přikryl
Download
The user must agree to the licence and fill in the registration before using the binary distribution.
An x86 binary is available for download (2010-05-06). Use tar xzf <filename.tar.gz> to unpack it.
The source code is available for download (2010-05-06). Use tar xzf <filename.tar.gz> to unpack it.
Older releases
Documentation
The documentation is included in the tar archive. Since some parts may be difficult to compile from the DocBook sources, PDF files are linked here.
- user's documentation for the command line tools
- user's documentation for the editor
- technical documentation for the command line tools
- technical documentation for the editor
Incomplete programming/reference documentation is also available. The analysis, however, is covered well.