MorphoDiTa: Morphological Dictionary and Tagger is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological generation, tagging and tokenization and is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, MorphoDiTa achieves state-of-the-art results with a throughput of around 10-200K words per second, depending on the model. MorphoDiTa is free software under the LGPL license; the linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
Copyright 2014 by Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic.
MorphoDiTa releases are available on GitHub, either as
a pre-compiled binary package, or source code only. The binary package contains Linux,
Windows and OS X binaries, C++
library binary, Java bindings binary, and source code of
all language bindings (Java, Python, Perl) and MorphoDiTa. While the binary
packages do not contain compiled Python or Perl bindings, packages for those
languages are available in standard package repositories, i.e. on PyPI and CPAN.
To use MorphoDiTa, a language model is needed. The language models are available from LINDAT/CLARIN infrastructure and described further in the User's Manual. Currently the following language models are available:
MorphoDiTa is an open-source project and is freely available for non-commercial purposes. The library is distributed under LGPL and the associated models and data under CC BY-NC-SA, although for some models the original data used to create the model may impose additional licensing conditions.
If you use this tool for scientific work, please give credit to us by referencing MorphoDiTa website and Straková et al. 2014.
MorphoDiTa is available as a standalone tool and as a library for Linux/Windows/OS X. It does not require any additional libraries. As any supervised machine learning tool, it needs trained linguistic models to perform morphological analysis. The models for the Czech language are available with the tool.
To use MorphoDiTa, a language model is needed. Here is a list of available language models.
If you want to compile MorphoDiTa manually, sources are available on GitHub, both in the pre-compiled binary package releases and in the repository itself.
G++ 4.7 or newer, alternatively clang 3.2 or newer
make
SWIG 2.0.5 or newer for language bindings other than C++
To compile MorphoDiTa on Unix-like systems, run make in the src directory.
Make targets and options:
exe: compile the binaries (default)
lib: compile the shared (dynamically loaded) library
BITS=32 or BITS=64: compile for the specified 32-bit or 64-bit architecture instead of the default one
RELEASE=1: turn off assertions and use LTO
PROFILE=1: turn on profiling
DEBUG=1: compile with debug information and C++ library debugging
Currently only G++ is supported under Windows. We use TDM-GCC for producing Windows builds, but MinGW and Cygwin are also known to work. If you are interested in adding support for other compilers (most notably, Visual Studio), let us know.
By default, a Unix-like shell is required (e.g., Cygwin or MSYS). If you use the standard Windows Cmd.exe (e.g., with TDM-GCC or plain MinGW), use
make WINCMD=1
Note that make in MinGW is usually distributed as mingw32-make.
Binary Java bindings are available in MorphoDiTa binary packages.
To compile Java bindings manually, run make in the bindings/java directory, optionally with the options described in MorphoDiTa Installation. Java 6 and newer is supported.
The Java installation specified in the environment variable JAVA_HOME is used. If the environment variable does not exist, JAVA_HOME can be specified using
make JAVA_HOME=path_to_Java_installation
The Perl bindings are available as the Ufal-MorphoDiTa package on CPAN.
To compile Perl bindings manually, run make in the bindings/perl directory, optionally with the options described in MorphoDiTa Installation. Perl 5.10 and later is supported.
The path to the include headers of the required Perl version must be specified in the PERL_INCLUDE variable using
make PERL_INCLUDE=path_to_Perl_includes
The Python bindings are available as the ufal.morphodita package on PyPI.
To compile Python bindings manually, run make in the bindings/python directory, optionally with the options described in MorphoDiTa Installation. Both Python 2.6+ and Python 3+ are supported.
The path to the include headers of the required Python version must be specified in the PYTHON_INCLUDE variable using
make PYTHON_INCLUDE=path_to_Python_includes
In a natural language text, the task of morphological analysis is to assign to each token (word) in a sentence its lemma (canonical form) and a part-of-speech tag (POS tag). This is usually achieved in two steps: a morphological dictionary looks up all possible lemmas and POS tags for each word, and subsequently, a morphological tagger picks the best lemma-POS tag candidate for each word. The second step is called disambiguation.
MorphoDiTa also performs these two steps of morphological analysis: it first outputs all possible pairs of lemma and POS tag for each token. Subsequently, the optimal combination of lemmas and POS tags is selected for the words in a sentence using an algorithm described in Spoustová et al. 2009.
Like any supervised machine learning tool, MorphoDiTa needs a trained linguistic model. This section describes the available language models and also the commandline tools and interfaces. The C++ library is described elsewhere, either in MorphoDiTa API Tutorial or in MorphoDiTa API Reference.
Czech models are distributed under the CC BY-NC-SA licence. The Czech morphology uses the MorfFlex CZ Czech morphological dictionary and the Czech tagger is trained on PDT 2.5. Czech models work in MorphoDiTa version 1.0 or later.
Apart from MorfFlex CZ dictionary, a prefix guesser and statistical guesser are implemented and can be optionally used when performing morphological analysis.
Czech models are versioned according to the version of the MorfFlex CZ morphological dictionary used. The version format is YYMMDD, where YY, MM and DD are two-digit representations of year, month and day, respectively. The latest version is 131112.
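The YYMMDD version format can be turned into a calendar date when model versions need to be compared or sorted. The following is a minimal sketch; the helper name parse_model_version is hypothetical and not part of MorphoDiTa.

```python
from datetime import date, datetime


def parse_model_version(version: str) -> date:
    """Parse a six-digit YYMMDD model version into a date."""
    if len(version) != 6 or not version.isdigit():
        raise ValueError(f"expected six digits (YYMMDD), got {version!r}")
    # %y maps two-digit years 00-68 to 2000-2068, which covers these models.
    return datetime.strptime(version, "%y%m%d").date()


print(parse_model_version("131112"))  # 2013-11-12
```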
Compared to Featurama http://sourceforge.net/projects/featurama/ (state-of-the-art Czech tagger implementation), the models are 5 times faster and 10 times smaller.
This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
The Czech morphological system was devised by Jan Hajič.
The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová.
The morphological guesser research was supported by the projects 1ET101120503 and 1ET101120413 of Academy of Sciences of the Czech Republic and 100008/2008 of Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník.
The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.
The tagger is trained on morphological layer of Prague Dependency Treebank PDT 2.5, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
In the Czech language, MorphoDiTa uses the Czech morphological system by Jan Hajič (Hajič 2004). In this system, which we call the PDT tag set, the tags are positional with 15 positions corresponding to part of speech, detailed part of speech, gender, number, case, etc. (e.g. NNFS1-----A----). Different meanings of the same lemma are distinguished and additional comments can be provided for every lemma meaning. The lemma itself without the comments and meaning specification is called a raw lemma. The following examples illustrate this:
Japonsko_;G (raw lemma: Japonsko)
se_^(zvr._zájmeno/částice) (raw lemma: se)
tvořit_:T (raw lemma: tvořit)
For a more detailed reference about the Czech morphology, please see Lemma and Tag Structure in PDT 2.0.
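The lemma and tag structure can be illustrated with a short Python sketch. It is not the MorphoDiTa implementation: the function names are hypothetical, and the rule used to find the end of the raw lemma (the first _ or `, or a - followed by a digit, never at the first character) is an assumption that agrees with the examples in this manual. Only the first five tag positions, which are named in the text above, are decoded.

```python
def raw_lemma(lemma: str) -> str:
    """Return the lemma without its meaning number and comments.

    Assumption (not taken from the MorphoDiTa sources): the raw lemma ends
    at the first '_' or '`', or at a '-' followed by a digit, and this
    boundary is never at the very first character.
    """
    for i in range(1, len(lemma)):
        ch = lemma[i]
        if ch in "_`":
            return lemma[:i]
        if ch == "-" and i + 1 < len(lemma) and lemma[i + 1].isdigit():
            return lemma[:i]
    return lemma


# Names of the first five of the 15 tag positions, as listed in the text.
LEADING_POSITIONS = ("pos", "detailed_pos", "gender", "number", "case")


def leading_tag_fields(tag: str) -> dict:
    """Decode the first five positions of a 15-character positional tag."""
    if len(tag) != 15:
        raise ValueError(f"expected a 15-character tag, got {tag!r}")
    # zip stops after the five named positions; the rest is ignored here.
    return dict(zip(LEADING_POSITIONS, tag))


print(raw_lemma("se_^(zvr._zájmeno/částice)"))  # se
print(leading_tag_fields("NNFS1-----A----"))
```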
The main Czech model contains the following files:
czech-morfflex-<version>.dict
czech-morfflex-pdt-<version>.tagger-best_accuracy: neopren feature set. Contains the czech-morfflex-<version>.dict morphological dictionary.
The latest version czech-morfflex-pdt-131112.tagger-best_accuracy reaches 95.67% tag accuracy, 97.78% lemma accuracy and 94.97% overall accuracy on PDT 2.5 etest data (whose morphological tags and lemmas were remapped using the czech-morfflex-131112.dict dictionary). Model speed: ~10k words/s, model size: 18MB. For comparison, a model trained by Featurama (state-of-the-art Czech tagger implementation) reaches 95.66%, 97.70% and 94.90% tag, lemma and overall accuracy, respectively, with speed ~2k words/s and size 210MB.
czech-morfflex-pdt-<version>.tagger-fast: neopren feature set. Contains the czech-morfflex-<version>.dict morphological dictionary.
The latest version czech-morfflex-pdt-131112.tagger-fast reaches 94.70% tag accuracy, 97.64% lemma accuracy and 93.94% overall accuracy on PDT 2.5 etest data (whose morphological tags and lemmas were remapped using the czech-morfflex-131112.dict dictionary). Model speed: ~60k words/s, model size: 11MB.
The PDT tag set used by the main Czech model is very fine-grained. In many situations, only the part-of-speech tags would be sufficient. Therefore, we provide a variant of the model, denoted as pos_only, where only the first two characters of the fifteen-letter tags are used, representing the part of speech and detailed part of speech, respectively. There are 67 such two-letter tags.
czech-morfflex-<version>-pos_only.dict
czech-morfflex-pdt-<version>-pos_only.tagger: neopren feature set. Contains the czech-morfflex-<version>-pos_only.dict morphological dictionary.
The latest version czech-morfflex-pdt-131112-pos_only.tagger reaches 99.20% tag accuracy, 97.64% lemma accuracy and 97.60% overall accuracy on PDT 2.5 etest data (whose morphological tags and lemmas were remapped using the czech-morfflex-131112-pos_only.dict dictionary). Model speed: ~200k words/s, model size: 4MB.
Deprecated: These model variants are deprecated as of MorphoDiTa 1.2, because very similar functionality can be achieved using the strip_lemma_id tag set converter. The next release of the models will not contain these variants.
The Czech morphological system distinguishes different meanings of the same lemma by numbering the lemmas with multiple meanings and supplying additional comments for every lemma meaning, as described and demonstrated in Czech Morphological System. Sometimes this may be undesirable, for example when comparing to systems which do not use the MorfFlex CZ morphological dictionary.
Therefore, all already mentioned Czech models have a variant which does not disambiguate lemma meanings and provides no additional comments. (In terms of the MorphoDiTa API, the lemmas are raw lemmas with empty lemma ids and lemma comments.) These model variants are denoted by raw_lemmas.
English models are created using the following data:
The resulting models are distributed under the CC BY-NC-SA licence. English models work in MorphoDiTa version 1.1 or later.
English models are versioned according to the release date. The version format is YYMMDD, where YY, MM and DD are two-digit representations of year, month and day, respectively. The latest version is 140407.
This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
The morphological POS analyzer development was supported by grant of the Ministry of Education, Youth and Sports of the Czech Republic No. LC536 "Center for Computational Linguistics". The morphological POS analyzer research was performed by Johanka Spoustová (Spoustová 2008; the Treex::Tool::EnglishMorpho::Analysis Perl module). The lemmatizer was implemented by Martin Popel (Popel 2009; the Treex::Tool::EnglishMorpho::Lemmatizer Perl module). The lemmatizer is based on morpha, which was released under the LGPL licence as a part of the RASP system.
The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.
The English morphology uses standard Penn Treebank POS tags. Nevertheless, the lemma structure is unique:
The negative prefix is separated from the (always nonempty) lemma using a ^ character (able^un). During morphological generation, the negative prefix is honored. Furthermore, when the lemma ends with ^ (i.e., negative prefix is empty, as in able^), forms with negative prefixes are generated. It is also possible to generate all forms without any negative prefix by appending + after the lemma (for example able+).
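The English lemma structure can be taken apart with a few lines of Python. This is only a sketch of the notation described above; the type and function names are hypothetical and the code does not call MorphoDiTa.

```python
from typing import NamedTuple, Optional


class EnglishLemma(NamedTuple):
    base: str                       # the (always nonempty) lemma
    negative_prefix: Optional[str]  # None if no '^' is present, "" for 'able^'
    without_negation: bool          # True if the lemma was written as 'able+'


def parse_english_lemma(lemma: str) -> EnglishLemma:
    """Split a lemma such as able^un, able^ or able+ into its parts."""
    if len(lemma) > 1 and lemma.endswith("+"):
        return EnglishLemma(lemma[:-1], None, True)
    base, separator, prefix = lemma.partition("^")
    if separator and base:
        return EnglishLemma(base, prefix, False)
    # No separator, or the lemma is the '^' character itself.
    return EnglishLemma(lemma, None, False)


print(parse_english_lemma("able^un"))
```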
The English model contains the following files:
english-morphium-<version>.dict
english-morphium-wsj-<version>.tagger: Contains the english-morphium-<version>.dict morphological dictionary.
The latest version english-morphium-wsj-140407.tagger reaches 97.27% tag accuracy on the Wall Street Journal test portion (Sections 22-24). Model speed: ~60k words/s, model size: 6MB.
Stripping of negative prefixes (or handling the lemmas with negative prefixes stripped) may not be desirable. Therefore, a variant of the English model denoted by no_negation is provided, which does not strip negative prefixes from lemmas.
english-morphium-<version>-no_negation.dict
english-morphium-wsj-<version>-no_negation.tagger: Contains the english-morphium-<version>-no_negation.dict morphological dictionary.
The latest version english-morphium-wsj-140407-no_negation.tagger reaches 97.25% tag accuracy on the Wall Street Journal test portion (Sections 22-24). Model speed: ~60k words/s, model size: 6MB.
english-morphium-140407 and english-morphium-wsj-140407 (require MorphoDiTa 1.1 or later)
english-morphium-140304 and english-morphium-wsj-140304 (require MorphoDiTa 1.0 or later)
Probably the most common usage of MorphoDiTa is running a tagger to tag your data using
run_tagger tagger_model
The input is assumed to be in UTF-8 encoding and can be either already tokenized and segmented, or it can be a plain text which is tokenized and segmented automatically.
Any number of files can be specified after the tagger_model. If an argument input_file:output_file is used, the given input_file is processed and the result is saved to output_file. If only input_file is used, the result is saved to standard output. If no argument is given, input is read from standard input and written to standard output.
The full command syntax of run_tagger is
run_tagger [options] tagger_file [file[:output_file]]...
Options: --input=untokenized|vertical
         --convert_tagset=pdt_to_conll2009|strip_lemma_comment|strip_lemma_id
         --output=vertical|xml
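When run_tagger is driven from a script, the argument list can be assembled and validated before the process is started. The sketch below uses only the options and values from the syntax above; the helper name run_tagger_argv and the file names in the usage line are hypothetical, and actually running the command requires the run_tagger binary and a tagger model.

```python
from typing import List, Optional, Sequence

INPUT_FORMATS = ("untokenized", "vertical")
OUTPUT_FORMATS = ("vertical", "xml")
TAGSET_CONVERSIONS = ("pdt_to_conll2009", "strip_lemma_comment", "strip_lemma_id")


def _option(name: str, value: Optional[str], allowed: Sequence[str]) -> List[str]:
    """Return ['--name=value'] after validation, or [] when value is None."""
    if value is None:
        return []
    if value not in allowed:
        raise ValueError(f"--{name} must be one of {allowed}, got {value!r}")
    return [f"--{name}={value}"]


def run_tagger_argv(
    tagger_file: str,
    files: Sequence[str] = (),
    input_format: Optional[str] = None,
    convert_tagset: Optional[str] = None,
    output_format: Optional[str] = None,
    executable: str = "run_tagger",
) -> List[str]:
    """Build the argument list for run_tagger.

    Each entry of `files` is either 'input_file' or 'input_file:output_file'.
    """
    argv = [executable]
    argv += _option("input", input_format, INPUT_FORMATS)
    argv += _option("convert_tagset", convert_tagset, TAGSET_CONVERSIONS)
    argv += _option("output", output_format, OUTPUT_FORMATS)
    argv.append(tagger_file)
    argv.extend(files)
    return argv


# The result can be passed to subprocess.run(argv, check=True).
print(run_tagger_argv("czech.tagger", ["in.txt:out.txt"], output_format="vertical"))
```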
The input format is specified using the --input option. Currently supported input formats are:
untokenized (default): the input is tokenized and segmented using a tokenizer defined by the model,
vertical: the input is in vertical format, every line is considered a word, with an empty line denoting the end of a sentence.
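Already tokenized data can be written in the vertical format with a few lines of Python. This is a minimal sketch with a hypothetical helper name; it follows only the description above (one word per line, an empty line after every sentence).

```python
from typing import Iterable, Sequence


def to_vertical(sentences: Iterable[Sequence[str]]) -> str:
    """Serialize tokenized sentences into the vertical input format."""
    lines = []
    for sentence in sentences:
        for word in sentence:
            # An empty word would be read as a sentence boundary.
            if not word or "\n" in word:
                raise ValueError(f"invalid word: {word!r}")
            lines.append(word)
        lines.append("")  # empty line ends the sentence
    return "".join(line + "\n" for line in lines)


text = to_vertical([["Děti", "pojedou", "k", "babičce", "."], ["Už", "se", "těší", "."]])
print(text, end="")
```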
Some tag sets can be converted to different ones. Currently supported tag set conversions are:
pdt_to_conll2009: convert Czech PDT tag set to CoNLL 2009 tag set,
strip_lemma_comment: strip lemma comment (see Lemma Structure in API Reference),
strip_lemma_id: strip lemma id (see Lemma Structure in API Reference).
The output format is specified using the --output option. Currently supported output formats are:
xml (default): Simple XML format without a root element, using the <sentence> element to mark sentences and the <token lemma="..." tag="...">...</token> element to encode a token and its assigned lemma and tag.
Example output for input Děti pojedou k babičce. Už se těší. (line breaks added):
<sentence><token lemma='dítě' tag='NNFP1-----A----'>Děti</token>
<token lemma='jet-1_^(pohybovat_se,_ne_však_chůzí)' tag='VB-P---3F-AA---'>pojedou</token>
<token lemma='k-1' tag='RR--3----------'>k</token>
<token lemma='babička' tag='NNFS3-----A----'>babičce</token>
<token lemma='.' tag='Z:-------------'>.</token></sentence>
<sentence><token lemma='už-1' tag='Db-------------'>Už</token>
<token lemma='se_^(zvr._zájmeno/částice)' tag='P7-X4----------'>se</token>
<token lemma='těšit_:T' tag='VB-S---3P-AA---'>těší</token>
<token lemma='.' tag='Z:-------------'>.</token></sentence>
vertical: Every output line is a tab-separated form-lemma-tag triple, with an empty line denoting the end of a sentence.
Example output for input Děti pojedou k babičce. Už se těší.:
Děti	dítě	NNFP1-----A----
pojedou	jet-1_^(pohybovat_se,_ne_však_chůzí)	VB-P---3F-AA---
k	k-1	RR--3----------
babičce	babička	NNFS3-----A----
.	.	Z:-------------

Už	už-1	Db-------------
se	se_^(zvr._zájmeno/částice)	P7-X4----------
těší	těšit_:T	VB-S---3P-AA---
.	.	Z:-------------
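Because the XML output has no root element, a standard XML parser needs a temporary wrapper around it. The following sketch reads tagger XML output with the Python standard library; the function name and the wrapper element name are hypothetical, and the code assumes that the output is otherwise well-formed XML (special characters escaped).

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def parse_tagger_xml(output: str) -> List[List[Tuple[str, str, str]]]:
    """Return sentences as lists of (form, lemma, tag) triples."""
    # Wrap the rootless output in an artificial root so it parses as one document.
    root = ET.fromstring("<root>" + output + "</root>")
    sentences = []
    for sentence in root.iter("sentence"):
        sentences.append(
            [
                (token.text or "", token.get("lemma", ""), token.get("tag", ""))
                for token in sentence.iter("token")
            ]
        )
    return sentences


sample = (
    "<sentence><token lemma='už-1' tag='Db-------------'>Už</token> "
    "<token lemma='těšit_:T' tag='VB-S---3P-AA---'>těší</token></sentence>"
)
for form, lemma, tag in parse_tagger_xml(sample)[0]:
    print(form, lemma, tag)
```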
There are multiple commands performing morphological tasks. The run_morpho_analyze executable performs morphological analysis and the run_morpho_generate executable performs morphological generation. The output of these commands is suitable for automatic processing. The run_morpho_cli executable performs both morphological analysis and generation, but is designed to be used interactively and produces more human-readable output.
The morphological analysis can be performed by running
run_morpho_analyze morphology_model use_guesser
The input is assumed to be in UTF-8 encoding and can be either already tokenized and segmented, or it can be a plain text which is tokenized and segmented automatically. The input files are specified in the same way as with the run_tagger command.
Some morphological models contain both a manually created dictionary and a guesser. Therefore, a numeric use_guesser argument is required. If non-zero, the guesser is used, otherwise not.
Because tagger models contain an embedded morphological model, a tagger model can be used instead of a morphological one if the --from_tagger option is specified.
The full command syntax of run_morpho_analyze is
run_morpho_analyze [options] morphology_model use_guesser [file[:output_file]]...
Options: --input=untokenized|vertical
         --convert_tagset=pdt_to_conll2009|strip_lemma_comment|strip_lemma_id
         --output=vertical|xml
         --from_tagger
The input format is specified using the --input option. Currently supported input formats are:
untokenized (default): the input is tokenized and segmented using a tokenizer defined by the model,
vertical: the input is in vertical format, every line is considered a word, with an empty line denoting the end of a sentence.
Note that the input data is also segmented, even if it is not strictly necessary. Therefore, the input is processed by whole paragraphs (ending by an empty line).
Some tag sets can be converted to different ones. Currently supported tag set conversions are:
pdt_to_conll2009: convert Czech PDT tag set to CoNLL 2009 tag set,
strip_lemma_comment: strip lemma comment (see Lemma Structure in API Reference),
strip_lemma_id: strip lemma id (see Lemma Structure in API Reference).
The output format is specified using the --output option. Currently supported output formats are:
xml (default): Simple XML format without a root element, using the <token><analysis lemma="..." tag="..."/><analysis...>...</token> element to encode morphological analysis.
Example output for input Děti pojedou k babičce. Už se těší. (line breaks added):
<sentence><token><analysis lemma="dítě" tag="NNFP1-----A----"/><analysis lemma="dítě" tag="NNFP4-----A----"/><analysis lemma="dítě" tag="NNFP5-----A----"/>Děti</token>
<token><analysis lemma="jet-1_^(pohybovat_se,_ne_však_chůzí)" tag="VB-P---3F-AA---"/>pojedou</token>
<token><analysis lemma="k-1" tag="RR--3----------"/><analysis lemma="k-3_^(označení_pomocí_písmene)" tag="NNNXX-----A----"/><analysis lemma="k-4`kůň_:B_^(jednotka_výkonu)" tag="NNMXX-----A---8"/><analysis lemma="k-8_:B_^(ost._zkratka)" tag="XX------------8"/><analysis lemma="komanditní_:B_^(jen_komanditní_společnost)" tag="AAXXX----1A---8"/><analysis lemma="koncernový_:B" tag="AAXXX----1A---8"/><analysis lemma="kuo-1_:B_,t_^(stará_jednotka_výkonu)" tag="NNNXX-----A---8"/>k</token>
<token><analysis lemma="babička" tag="NNFS3-----A----"/><analysis lemma="babička" tag="NNFS6-----A----"/>babičce</token>
<token><analysis lemma="." tag="Z:-------------"/>.</token></sentence>
<sentence><token><analysis lemma="už-1" tag="Db-------------"/><analysis lemma="už-2" tag="TT-------------"/>Už</token>
<token><analysis lemma="se_^(zvr._zájmeno/částice)" tag="P7-X4----------"/><analysis lemma="s-1" tag="RV--2----------"/><analysis lemma="s-1" tag="RV--7----------"/>se</token>
<token><analysis lemma="těšit_:T" tag="VB-P---3P-AA---"/><analysis lemma="těšit_:T" tag="VB-S---3P-AA---"/>těší</token>
<token><analysis lemma="." tag="Z:-------------"/>.</token></sentence>
vertical: Every output line contains a word followed by tab-separated lemma-tag pairs assigned to the input word, with an empty line denoting the end of a sentence.
Example output for input Děti pojedou k babičce. Už se těší.:
Děti	dítě	NNFP1-----A----	dítě	NNFP4-----A----	dítě	NNFP5-----A----
pojedou	jet-1_^(pohybovat_se,_ne_však_chůzí)	VB-P---3F-AA---
k	k-1	RR--3----------	k-3_^(označení_pomocí_písmene)	NNNXX-----A----	k-4`kůň_:B_^(jednotka_výkonu)	NNMXX-----A---8	k-8_:B_^(ost._zkratka)	XX------------8	komanditní_:B_^(jen_komanditní_společnost)	AAXXX----1A---8	koncernový_:B	AAXXX----1A---8	kuo-1_:B_,t_^(stará_jednotka_výkonu)	NNNXX-----A---8
babičce	babička	NNFS3-----A----	babička	NNFS6-----A----
.	.	Z:-------------

Už	už-1	Db-------------	už-2	TT-------------
se	se_^(zvr._zájmeno/částice)	P7-X4----------	s-1	RV--2----------	s-1	RV--7----------
těší	těšit_:T	VB-P---3P-AA---	těšit_:T	VB-S---3P-AA---
.	.	Z:-------------
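The vertical output of the analysis is straightforward to read back: the first field of every line is the word and the remaining fields come in lemma-tag pairs. The sketch below is an illustration with a hypothetical function name; it relies only on the format described above.

```python
from typing import List, Tuple

Analysis = Tuple[str, str]           # (lemma, tag)
Token = Tuple[str, List[Analysis]]   # (form, analyses)


def parse_analysis_vertical(text: str) -> List[List[Token]]:
    """Parse vertical run_morpho_analyze output into sentences of tokens."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if not line:
            # An empty line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        form, *rest = line.split("\t")
        if len(rest) % 2:
            raise ValueError(f"unpaired lemma/tag field in line: {line!r}")
        current.append((form, list(zip(rest[0::2], rest[1::2]))))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "Už\tuž-1\tDb-------------\tuž-2\tTT-------------\n"
    "těší\ttěšit_:T\tVB-P---3P-AA---\ttěšit_:T\tVB-S---3P-AA---\n"
    "\n"
)
for form, analyses in parse_analysis_vertical(sample)[0]:
    print(form, len(analyses))
```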
The morphological generation can be performed by running
run_morpho_generate morphology_model use_guesser
The input is assumed to be in UTF-8 encoding. The input files are specified in the same way as with the run_tagger command.
Input for morphological generation has to be in vertical format, each line containing a lemma, which can be optionally followed by a tab and a tag wildcard. The output has the same number of lines as the input; line l contains tab-separated form-lemma-tag triplets which can be generated from the lemma on input line l. If a tag wildcard was provided, only triplets with matching tags are returned.
Some morphological models contain both a manually created dictionary and a guesser. Therefore, a numeric use_guesser argument is required. If non-zero, the guesser is used, otherwise not.
Because tagger models contain an embedded morphological model, a tagger model can be used instead of a morphological one if the --from_tagger option is specified.
The full command syntax of run_morpho_generate is
run_morpho_generate [options] morphology_model use_guesser [input_file[:output_file]]...
Options: --convert_tagset=pdt_to_conll2009|strip_lemma_comment|strip_lemma_id
         --from_tagger
Example input data:
dítě
jet	?[fN]??[-1]
k-1
babička	NNFS3-----A----
Example output:
dítě	dítě	NNNS1-----A----	dítě	dítě	NNNS4-----A----	dítě	dítě	NNNS5-----A----	dítěte	dítě	NNNS2-----A----	dítěti	dítě	NNNS3-----A----	dítěti	dítě	NNNS6-----A----	dítětem	dítě	NNNS7-----A----	děti	dítě	NNFP1-----A----	děti	dítě	NNFP4-----A----	děti	dítě	NNFP5-----A----	dětma	dítě	NNFP7-----A---6	dětmi	dítě	NNFP7-----A----	dětem	dítě	NNFP3-----A----	dětí	dítě	NNFP2-----A----	dětech	dítě	NNFP6-----A----	dětima	dítě_,h	NNFP7-----A---6
ject	jet	Vf--------A---6	jet	jet-1_^(pohybovat_se,_ne_však_chůzí)	Vf--------A----	jeti	jet-1_^(pohybovat_se,_ne_však_chůzí)	Vf--------A---2	nejet	jet-1_^(pohybovat_se,_ne_však_chůzí)	Vf--------N----	nejeti	jet-1_^(pohybovat_se,_ne_však_chůzí)	Vf--------N---2	jet	jet-2_,h_^(letadlo_s_tryskovým_pohonem)	NNIS1-----A----	jety	jet-2_,h_^(letadlo_s_tryskovým_pohonem)	NNIP1-----A----
k	k-1	RR--3----------	ke	k-1	RV--3----------	ku	k-1	RV--3---------1
babičce	babička	NNFS3-----A----
Some tag sets can be converted to different ones. Currently supported tag set conversions are:
pdt_to_conll2009: convert Czech PDT tag set to CoNLL 2009 tag set,
strip_lemma_comment: strip lemma comment (see Lemma Structure in API Reference),
strip_lemma_id: strip lemma id (see Lemma Structure in API Reference).
Note that the tag set conversion is applied only to the output, not to the input lemmas and wildcards.
When only forms with a specific tag should be generated for a given lemma, a tag wildcard can be specified. The tag wildcard is a simple wildcard that filters the results of morphological generation.
Most characters of a tag wildcard match corresponding characters of a tag, with the following exceptions:
? matches any character of a tag.
[chars] matches any of the characters listed. The dash - has no special meaning and if ] is the first character in chars, it is considered as one of the characters and does not end the group.
[^chars] matches any of the characters not listed.
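The wildcard rules can be reproduced in a small matcher. The following sketch is not the MorphoDiTa implementation and its function names are hypothetical. It assumes that a wildcard shorter than the tag constrains only the leading positions; the manual does not state this rule, but it is the behaviour the five-character wildcard in the generation example suggests.

```python
from typing import FrozenSet, List, Optional, Tuple

# None matches any character; otherwise (characters, negated).
Item = Optional[Tuple[FrozenSet[str], bool]]


def parse_wildcard(wildcard: str) -> List[Item]:
    """Split a tag wildcard into one item per tag position."""
    items: List[Item] = []
    i = 0
    while i < len(wildcard):
        ch = wildcard[i]
        if ch == "?":
            items.append(None)
            i += 1
        elif ch == "[":
            start = i + 1
            negate = wildcard[start:start + 1] == "^"
            if negate:
                start += 1
            # Search from start + 1: a ']' right at the start is a literal member.
            end = wildcard.find("]", start + 1)
            if end == -1:
                raise ValueError(f"unterminated group in {wildcard!r}")
            items.append((frozenset(wildcard[start:end]), negate))
            i = end + 1
        else:
            items.append((frozenset(ch), False))
            i += 1
    return items


def tag_matches(wildcard: str, tag: str) -> bool:
    """Check a tag against a wildcard, position by position."""
    # zip stops at the shorter sequence (assumption: the rest is unconstrained).
    for item, ch in zip(parse_wildcard(wildcard), tag):
        if item is None:
            continue
        chars, negate = item
        if (ch in chars) == negate:
            return False
    return True


print(tag_matches("?[fN]??[-1]", "NNIS1-----A----"))  # True
print(tag_matches("?[fN]??[-1]", "VB-P---3F-AA---"))  # False
```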
Morphological analysis and generation which is interactive and more human readable can be run using:
run_morpho_cli morphology_model
The input is read from standard input, one command per line. If there is no tab on a line, analysis is performed on the given word. If there is a tab on a line, generation is performed on the first word, using the second word as a tag wildcard. If the second word is empty (i.e., the line is, for example, on followed by a tab), all forms are generated.
Because tagger models contain an embedded morphological model, a tagger model can be used instead of a morphological one if the --from_tagger option is specified.
The full command syntax of run_morpho_cli is
run_morpho_cli [options] morphology_model
Options: --from_tagger
Using the run_tokenizer executable it is possible to perform only tokenization and segmentation.
The input is a UTF-8 encoded plain text and the input files are specified in the same way as with the run_tagger command.
The tokenizer can be specified either by using a morphology model (--morphology option), a tagger model (--tagger option) or a tokenizer identifier (--tokenizer option). Currently supported tokenizer identifiers are:
czech
english
generic
The full command syntax of run_tokenizer is
run_tokenizer [options] [file[:output_file]]...
Options: --tokenizer=czech|english|generic
         --morphology=morphology_model_file
         --tagger=tagger_model_file
         --output=vertical|xml
The output format is specified using the --output option. Currently supported output formats are:
xml (default): Simple XML format without a root element, using the <sentence> element to mark sentences and the <token> element to mark tokens.
Example output for input Děti pojedou k babičce. Už se těší. (line breaks added):
<sentence><token>Děti</token> <token>pojedou</token> <token>k</token> <token>babičce</token><token>.</token></sentence>
<sentence><token>Už</token> <token>se</token> <token>těší</token><token>.</token></sentence>
vertical: Each token is on a separate line, every sentence is ended by a blank line.
Example output for input Děti pojedou k babičce. Už se těší.:
Děti
pojedou
k
babičce
.

Už
se
těší
.
It is possible to create custom morphological and tagging models.
Custom morphological models can be created using the encode_dictionary binary. encode_dictionary reads from standard input and prints a MorphoDiTa morphological model on standard output. The input of encode_dictionary is a textual representation of a morphological dictionary. It should be UTF-8 encoded and every line should be a tab-separated triplet lemma \t tag \t form. All forms of one lemma must appear in a continuous region and no line should appear more than once (sort -u can be used to achieve this).
Run encode_dictionary with the following options:
encode_dictionary generic max_suffix_len unknown_tag number_tag punctuation_tag symbol_tag
generic: This parameter defines the tokenizer and other language-specific behaviour. Values other than generic take different options and are not documented.
max_suffix_len: Maximum length of suffixes in automatically inferred inflexion classes. If unsure, use 8 (we use 8 for Czech and 4 for English). Smaller values produce larger and slightly faster models.
unknown_tag: Assigned to a form during analysis if no matching tag can be found.
number_tag: Assigned to a form during analysis if the form was not found in the dictionary and it looks like a number. Can be the same as unknown_tag.
punctuation_tag: Assigned to a form during analysis if the form was not found in the dictionary and it consists of Unicode characters in the Punctuation category. Can be the same as unknown_tag.
symbol_tag: Assigned to a form during analysis if the form was not found in the dictionary and it consists of Unicode characters in the Symbol category. Can be the same as unknown_tag.
Example input data:
dog	NN	dog
dog	NNS	dogs
go	VB	go
go	VBP	go
go	VBZ	goes
go	VBG	going
go	VBD	went
Example command line:
encode_dictionary generic 8 UNK NUM PUNC SYM <input_data >output_model
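The two constraints on the input (all forms of one lemma in a continuous region, no repeated line) can be guaranteed by sorting and deduplicating the entries before they are written, which is the same effect as sort -u. The sketch below prepares such input in Python; the function name is hypothetical.

```python
from typing import Iterable, Tuple


def dictionary_input(entries: Iterable[Tuple[str, str, str]]) -> str:
    """Format (lemma, tag, form) entries as input text for encode_dictionary."""
    # Sorting whole triples groups every lemma into one continuous region;
    # the set removes duplicate lines.
    unique = sorted(set(entries))
    for entry in unique:
        for field in entry:
            if not field or "\t" in field or "\n" in field:
                raise ValueError(f"invalid field in entry {entry!r}")
    return "".join(f"{lemma}\t{tag}\t{form}\n" for lemma, tag, form in unique)


entries = [
    ("go", "VB", "go"),
    ("dog", "NN", "dog"),
    ("go", "VBD", "went"),
    ("dog", "NN", "dog"),   # duplicate, dropped
    ("dog", "NNS", "dogs"),
]
print(dictionary_input(entries), end="")
```

The resulting text, encoded as UTF-8, can then be written to a file and passed to encode_dictionary on standard input.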
Sometimes it is useful to train MorphoDiTa tagger using external morphological analysis, without having a MorphoDiTa morphological dictionary.
That is possible using a so called external morphology model. External morphology model can be created easily using
encode_dictionary external unknown_tag >output_model
No standard input is read in this case. The unknown_tag parameter is used when no tag is assigned to a word form during analysis. The resulting model is printed on standard output.
The external morphology model does not contain any morphological dictionary. Instead, it expects the user to perform morphological analysis and generation on their own. Therefore, the input form to analysis is expected to be followed by space-separated lemma-tag pairs, which are returned by the analysis. Similarly, the input lemma to generation is expected to be followed by space-separated form-tag pairs, which are again returned by the generation (possibly filtered by a tag wildcard). (To extract the length of the form or lemma itself even when followed by external analyses, the API calls raw_form_len or raw_lemma_len and lemma_id_len can be used.)
Note that the tokenizer returned by the external morphology model is the same as the tokenizer of the generic model, and splits input on spaces. Therefore, it can be used to tokenize the input; the tokens can then be passed to the external morphology, and the results, after proper formatting, can be used as input to MorphoDiTa in vertical input format.
Example input form for analysis using external morphology model:
wishes wish NNS wish VBZ
Example input lemma for generation using external morphology model:
go go VB go VBP goes VBZ going VBG went VBD
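Input lines for an external morphology model can be produced from the results of your own analyzer with a small formatter. This is a sketch with a hypothetical function name; it only joins the fields with spaces as described above and rejects fields that contain whitespace, since whitespace would break the pairing.

```python
from typing import Iterable, Tuple


def external_entry(head: str, pairs: Iterable[Tuple[str, str]]) -> str:
    """Join a form (or lemma) with its lemma-tag (or form-tag) pairs."""
    fields = [head]
    for first, tag in pairs:
        fields.extend((first, tag))
    for field in fields:
        if not field or any(ch.isspace() for ch in field):
            raise ValueError(f"field must be non-empty without whitespace: {field!r}")
    return " ".join(fields)


# Analysis input: a form followed by lemma-tag pairs.
print(external_entry("wishes", [("wish", "NNS"), ("wish", "VBZ")]))
# prints: wishes wish NNS wish VBZ
```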
Custom tagging models can be trained using the train_tagger binary, which has the following options:
train_tagger generic_234 morphology use_guesser features iterations prune_features [heldout_data [early_stopping]] <input_data >tagger_model
generic_234: This parameter defines the tagger (elementary features and algorithm) and the order of Viterbi decoding. Use either generic2, generic3 or generic4. If unsure, use generic3 (the best released Czech and English models use generic3). generic2 produces faster but less accurate models; generic4 produces larger and only marginally better models.
morphology: File with the morphological dictionary to use.
use_guesser: Use 0/1 to specify whether the morphological guesser should be used. Unless you have a good reason not to, use 1.
features: File with feature sequences for the tagger. The file format and available elementary features are described in the following section.
iterations: Number of training iterations. For English, values 5-10 are used; for Czech, values 10-15 are used. Can be affected by early_stopping.
prune_features: Use 0/1 to disable/enable pruning of feature sequences not found in training data. Use 1 for smaller and marginally less accurate models, and 0 for larger and marginally better models. If unsure, use 1 (the best released Czech and English models use 1).
heldout_data: Optional file with heldout data in the same format as input data. If supplied, accuracy is measured on the heldout data after every training iteration.
early_stopping: Optionally use 0/1 to disable/enable early stopping. If early stopping is enabled, the resulting model is not the one after the last training iteration, but the one with the best heldout data accuracy.
Example command line (use morphology from morpho.dict, features from features.ft and no heldout data):
train_tagger generic3 morpho.dict 1 features.ft 10 1 <input.data >tagger.model
Example command line (use morphology from morpho.dict, features from features.ft and use heldout data with early stopping):
train_tagger generic3 morpho.dict 1 features.ft 15 1 heldout.data 1 <input.data >tagger.model
See the next sections for examples of input data and feature files.
The input data (and the heldout data) represent a sequence of sentences. Different sentences do not interact in any way. Words of one sentence are stored on consecutive lines, each line containing a tab-separated triplet form \t lemma \t tag in UTF-8 encoding. The end of a sentence is denoted by an empty line.
Example:
Děti	dítě	NNFP1-----A----
pojedou	jet-1_^(pohybovat_se,_ne_však_chůzí)	VB-P---3F-AA---
k	k-1	RR--3----------
babičce	babička	NNFS3-----A----
.	.	Z:-------------

Už	už-1	Db-------------
se	se_^(zvr._zájmeno/částice)	P7-X4----------
těší	těšit_:T	VB-S---3P-AA---
.	.	Z:-------------
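To check training data before passing it to train_tagger, the format can be read with a small parser such as the following sketch. The names training_token and read_sentences are hypothetical and not part of MorphoDiTa; the sketch assumes exactly two tab characters separate the three fields and treats any empty line as a sentence boundary.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One line of the training data: form, lemma and tag.
struct training_token {
  std::string form;
  std::string lemma;
  std::string tag;
};

using sentence = std::vector<training_token>;

// Reads tab-separated form-lemma-tag triplets, one token per line,
// with an empty line ending each sentence.
std::vector<sentence> read_sentences(std::istream& in) {
  std::vector<sentence> sentences;
  sentence current;
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty()) {
      // An empty line closes the current sentence (if it has any tokens).
      if (!current.empty()) {
        sentences.push_back(current);
        current.clear();
      }
      continue;
    }
    const std::size_t first_tab = line.find('\t');
    const std::size_t second_tab =
        first_tab == std::string::npos ? std::string::npos : line.find('\t', first_tab + 1);
    if (second_tab == std::string::npos)
      throw std::runtime_error("expected form<TAB>lemma<TAB>tag: " + line);
    current.push_back({line.substr(0, first_tab),
                       line.substr(first_tab + 1, second_tab - first_tab - 1),
                       line.substr(second_tab + 1)});
  }
  // The last sentence may lack a trailing empty line.
  if (!current.empty()) sentences.push_back(current);
  return sentences;
}
```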
The features used in the tagger have a major influence on tagging performance.
The feature file contains several feature sequences, each sequence consisting of several elementary features. The elementary features are computed by MorphoDiTa, and different tagger models can have a different set of elementary features. Here we describe the elementary features of the generic tagger:
Form: word form
Prefix1 .. Prefix9: word form prefix of length 1..9 (measured in Unicode characters)
Suffix1 .. Suffix9: word form suffix of length 1..9 (measured in Unicode characters)
Num: whether the word form contains at least one number (Unicode category Number)
Cap: whether the word form contains at least one uppercase or titlecase letter
Dash: whether the word form contains at least one dash (Unicode category 'Punctuation, Dash')
Tag: word form PoS tag
Tag1 .. Tag5: letter 1..5 of word form PoS tag
Lemma: word form lemma
FollowingVerbTag: PoS tag of the nearest following verb, i.e., the nearest following word form with at least one of the PoS tags starting with V
FollowingVerbLemma: lemma of the nearest following verb, i.e., the nearest following word form with at least one of the PoS tags starting with V
PreviousVerbTag: PoS tag of the nearest previous verb, i.e., the nearest previous word whose PoS tag (assigned by the tagger) starts with V
PreviousVerbLemma: lemma of the nearest previous verb, i.e., the nearest previous word whose PoS tag (assigned by the tagger) starts with V
The feature file defines feature sequences which can be applied to a word form. A feature sequence consists of elementary features assigned to the given form or its neighbours.
Every line in the feature file defines one feature sequence. A feature sequence consists of comma-joined, space-separated pairs of an elementary feature and the offset to which the elementary feature applies (e.g., Form 0 or Tag 0,Lemma -1). The file format is strict and does not allow any additional spaces or commas.
Note that the offsets of some of the elementary features are restricted by the order of Viterbi decoding used. Notably, if Viterbi decoding of order N is utilized, Tag and Lemma can be used only inside the decoded window, i.e., only with offsets -N+1 .. 0.
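Because the file format is strict, it can help to validate a feature file before training. The following sketch parses one feature sequence line and checks the Viterbi window rule for Tag and Lemma. It is an illustration only, not MorphoDiTa's own parser; the names feature, parse_feature_sequence and offsets_fit_viterbi_order are hypothetical, and the check covers only the two feature names the rule mentions explicitly.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// One elementary feature with the offset it applies to, e.g. {"Tag", -1}.
using feature = std::pair<std::string, int>;

// Parses one feature file line such as "Tag 0,Lemma -1". Any extra space
// or comma is rejected with std::invalid_argument.
std::vector<feature> parse_feature_sequence(const std::string& line) {
  const std::size_t npos = std::string::npos;
  std::vector<feature> result;
  std::size_t pos = 0;
  while (true) {
    const std::size_t comma = line.find(',', pos);
    const std::string item = line.substr(pos, comma == npos ? npos : comma - pos);
    const std::size_t space = item.find(' ');
    // Require exactly one space with a non-empty name and a non-empty offset.
    if (space == npos || space == 0 || space + 1 >= item.size() ||
        item.find(' ', space + 1) != npos)
      throw std::invalid_argument("malformed feature: '" + item + "'");
    const std::string offset_text = item.substr(space + 1);
    std::size_t used = 0;
    int offset = 0;
    try {
      offset = std::stoi(offset_text, &used);
    } catch (const std::exception&) {
      throw std::invalid_argument("malformed offset: '" + offset_text + "'");
    }
    if (used != offset_text.size())
      throw std::invalid_argument("malformed offset: '" + offset_text + "'");
    result.emplace_back(item.substr(0, space), offset);
    if (comma == npos) break;
    pos = comma + 1;
  }
  return result;
}

// With Viterbi decoding of order N, Tag and Lemma offsets must lie in -N+1 .. 0.
bool offsets_fit_viterbi_order(const std::vector<feature>& sequence, int order) {
  for (const auto& f : sequence) {
    const bool windowed = f.first == "Tag" || f.first == "Lemma";
    if (windowed && (f.second > 0 || f.second < 1 - order)) return false;
  }
  return true;
}
```

For example, the sequence Tag 0,Tag -1,Tag -2 fits order 3 (generic3) but not order 2 (generic2).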
For inspiration, we present the feature files used for the released Czech and English MorphoDiTa models. Both feature files are slight modifications of the feature files described in Spoustová et al. 2009: Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab, Miroslav Spousta. 2009. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771, Athens, Greece, March. Association for Computational Linguistics.
Feature file for English:
Tag 0,Form 0
Tag 0,Prefix1 0
Tag 0,Prefix2 0
Tag 0,Prefix3 0
Tag 0,Prefix4 0
Tag 0,Prefix5 0
Tag 0,Prefix6 0
Tag 0,Prefix7 0
Tag 0,Prefix8 0
Tag 0,Prefix9 0
Tag 0,Suffix1 0
Tag 0,Suffix2 0
Tag 0,Suffix3 0
Tag 0,Suffix4 0
Tag 0,Suffix5 0
Tag 0,Suffix6 0
Tag 0,Suffix7 0
Tag 0,Suffix8 0
Tag 0,Suffix9 0
Tag 0,Num 0
Tag 0,Cap 0
Tag 0,Dash 0
Tag 0,Tag -1
Tag 0,Tag -1,Tag -2
Tag 0,Form -1
Tag 0,Form -2
Tag 0,Form -1,Form -2
Tag 0,Form 1
Tag 0,Form 1,Form 2
Tag 0,Tag1 -1
Tag 0,Lemma -1
Lemma 0,Tag -1
Feature file for Czech (note that some feature sequences predict only part of the PoS tag, trying to overcome data sparseness; Tag2 is extended PoS, Tag3 is gender, Tag5 is case):
Tag 0
Tag 0,Tag -1
Tag 0,Tag -1,Tag -2
Tag 0,Tag -2
Tag 0,Form 0
Tag 0,Form 0,Form -1
Tag 0,Form -1
Tag 0,Form -2
Tag 0,PreviousVerbTag 0
Tag 0,PreviousVerbLemma 0
Tag 0,FollowingVerbTag 0
Tag 0,FollowingVerbLemma 0
Tag 0,Lemma -1
Lemma 0,Tag -1
Tag 0,Form 1
Tag2 0,Tag5 0
Tag2 0,Tag5 0,Tag2 -1,Tag5 -1
Tag2 0,Tag5 0,Tag2 -1,Tag5 -1,Tag2 -2,Tag5 -2
Tag5 0
Tag5 0,Tag -1
Tag5 0,Tag -1,Tag -2
Tag5 0,Tag -2
Tag5 0,Form 0
Tag5 0,Form 0,Form -1
Tag5 0,Form -1
Tag5 0,Form -2
Tag5 0,PreviousVerbTag 0
Tag5 0,PreviousVerbLemma 0
Tag5 0,FollowingVerbTag 0
Tag5 0,FollowingVerbLemma 0
Tag5 0,Lemma -1
Tag5 0,Form 1
Tag3 0
Tag3 0,Tag -1
Tag3 0,Tag -1,Tag -2
Tag3 0,Tag -2
Tag3 0,Form 0
Tag3 0,Form 0,Form -1
Tag3 0,Form -1
Tag3 0,Form -2
Tag3 0,PreviousVerbTag 0
Tag3 0,PreviousVerbLemma 0
Tag3 0,FollowingVerbTag 0
Tag3 0,FollowingVerbLemma 0
Tag3 0,Lemma -1
Tag3 0,Form 1
Tag 0,Prefix1 0
Tag 0,Prefix2 0
Tag 0,Prefix3 0
Tag 0,Prefix4 0
Tag 0,Suffix1 0
Tag 0,Suffix2 0
Tag 0,Suffix3 0
Tag 0,Suffix4 0
Tag 0,Num 0
Tag 0,Cap 0
Tag 0,Dash 0
Tag5 0,Suffix1 0
Tag5 0,Suffix2 0
Tag5 0,Suffix3 0
Tag5 0,Suffix4 0
Feature file for Czech, Part of Speech only variant:
Tag 0
Tag 0,Tag -1
Tag 0,Tag -1,Tag -2
Tag 0,Tag -2
Tag 0,Form 0
Tag 0,Form 0,Form -1
Tag 0,Form -1
Tag 0,Form -2
Tag 0,PreviousVerbTag 0
Tag 0,PreviousVerbLemma 0
Tag 0,FollowingVerbTag 0
Tag 0,FollowingVerbLemma 0
Tag 0,Lemma -1
Lemma 0,Tag -1
Tag 0,Form 1
Tag 0,Prefix1 0
Tag 0,Prefix2 0
Tag 0,Prefix3 0
Tag 0,Prefix4 0
Tag 0,Suffix1 0
Tag 0,Suffix2 0
Tag 0,Suffix3 0
Tag 0,Suffix4 0
Tag 0,Num 0
Tag 0,Cap 0
Tag 0,Dash 0
Measuring custom tagger accuracy can be performed by running:
tagger_accuracy tagger_model <test_data
This binary reads input in the same format as train_tagger, i.e., tab-separated form-lemma-tag triplets, and evaluates the accuracy of the tagger model on the given testing data.
The MorphoDiTa API is defined in the header morphodita.h and resides in the ufal::morphodita namespace. The easiest way to use MorphoDiTa is therefore:
#include "morphodita.h"
using namespace ufal::morphodita;
The main access to the MorphoDiTa tagger is through the class tagger. An example of this class usage can be found in the program file run_tagger.cpp. A typical tagger usage may look like this:
#include <cstdio>
#include <string>
#include <vector>
#include "tagger/tagger.h"
using namespace std;
using namespace ufal::morphodita;
//...
// load the model into memory and construct the tagger
tagger* my_tagger = tagger::load("path_to_model");
if (!my_tagger) ...
// create sample input
vector<string> words;
words.push_back("malý");
words.push_back("pes");
vector<string_piece> forms;
for (auto& word : words) forms.emplace_back(word);
// initialize the output vector and tag
vector<tagged_lemma> tags;
my_tagger->tag(forms, tags);
// access the output
for (auto& tag : tags) printf("%s\t%s\n", tag.lemma.c_str(), tag.tag.c_str());
delete my_tagger;
The tagger is constructed by an overloaded factory method with one argument. The factory method accepts either a C file pointer (FILE*) pointing to a file with the model or a C string (const char*) with a file name of the model. It loads the linguistic model into memory and returns a tagger pointer ready for tagging, or NULL if unsuccessful. If a file pointer is used, it is not closed and is positioned right after the end of the model.
The main tagging method is tagger::tag:
void tag(const std::vector<string_piece>& forms, std::vector<tagged_lemma>& tags) const;
The input is a std::vector of string_piece, which is a structure referencing a string using const char* str and size_t len.
The tagger::tag method returns the tagged output in its second argument, std::vector<tagged_lemma>. The caller must provide a result vector and the tagger assigns the output to this vector. The indexes in the output vector correspond to the indexes in the input vector. tagged_lemma has two public members, std::string lemma and std::string tag, corresponding to the predicted lemma and tag, respectively.
The main access to the MorphoDiTa morphological dictionary is through the class morpho. An example of this interface usage can be found in the program file run_morpho.cpp.
Similarly to the tagger, the MorphoDiTa morphological dictionary is constructed by an overloaded factory method which accepts either a C file pointer (FILE*) or a C string (const char*) with the file name of the dictionary. The factory method returns a pointer to the morphological dictionary or NULL if unsuccessful.
#include "morpho/morpho.h"
using namespace ufal::morphodita;
//...
// load the dictionary into memory
morpho* my_morpho = morpho::load("path_to_dictionary");
//...
delete my_morpho;
Another way of obtaining a pointer to the morphological dictionary is through an instance of the tagger class: every tagger has a morphological dictionary, which is available through the method
virtual const morpho* get_morpho() const = 0;
Please note that you should not delete this pointer as it is owned by the tagger class instance.
The MorphoDiTa morphological dictionary offers two functionalities: it either analyzes a given word, that is, it outputs all possible lemma-tag pair candidates for the given form; or, for a given lemma-tag pair, it generates a form or a whole list of possible forms.
In the first case, one performs morphological analysis for a given word by calling the method morpho::analyze:
int analyze(string_piece form, guesser_mode guesser, std::vector<tagged_lemma>& lemmas) const;
An example (assuming that morphological dictionary is already constructed, see previous example):
vector<tagged_lemma> lemmas; // output
my_morpho->analyze("pes", morpho::GUESSER, lemmas);
for (auto& lemma : lemmas)
  printf("%s %s\n", lemma.lemma.c_str(), lemma.tag.c_str());
The input is a form to analyze, then a guesser mode (whether to use some kind of guesser or strictly the dictionary only, see question Guesser Mode in Questions and Answers) and an output std::vector<tagged_lemma>. The caller must provide the output vector std::vector<tagged_lemma> and the method morpho::analyze assigns the output to this vector.
MorphoDiTa performs morphological generation from a given lemma:
int generate(string_piece lemma, const char* tag_wildcard, guesser_mode guesser, std::vector<tagged_lemma_forms>& forms) const;
Optionally, a tag wildcard can be specified (or be NULL) and if so, results are filtered using this wildcard. This method can therefore be used in several ways. One may wish to generate all possible forms and their tags from a given lemma; then tag_wildcard is set to NULL and the method generates all possible combinations. One may also need to generate a specific form and tag from a given lemma; then tag_wildcard is set to this tag value. Or, for example, in the Czech positional morphology tagging system (Hajič 2004), one may wish to generate something like "all forms in fourth case"; then tag_wildcard should be set to ????4.
Please see Section "Czech Morphology" in User's Manual for more details about the Czech positional tagging system.
The previous example applies to morphological annotation applied to
PDT 2.5, however, the tag wildcards can be used in any
morphological tagging system.
Most characters of a tag wildcard match the corresponding characters of a tag, with the following exceptions:
? matches any character of a tag.
[chars] matches any of the characters listed. The dash - has no special meaning, and if ] is the first character in chars, it is considered one of the characters and does not end the group.
[^chars] matches any of the characters not listed.
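The wildcard rules above can be illustrated with a small stand-alone matcher. This is a sketch for understanding the rules, not MorphoDiTa's own implementation; the function name tag_matches_wildcard is hypothetical. It assumes the wildcard is anchored at the start of the tag and may be shorter than the tag, as the ????4 example suggests.

```cpp
#include <cstddef>
#include <string>

// Returns true if the tag matches the wildcard under the rules above.
// Assumption: characters of the tag beyond the wildcard are unconstrained.
bool tag_matches_wildcard(const std::string& tag, const std::string& wildcard) {
  std::size_t t = 0;  // position in the tag
  std::size_t w = 0;  // position in the wildcard
  while (w < wildcard.size()) {
    if (t >= tag.size()) return false;  // wildcard is longer than the tag
    const char c = tag[t++];
    if (wildcard[w] == '?') {
      ++w;  // '?' matches any single character
    } else if (wildcard[w] == '[') {
      ++w;
      bool negate = false;
      if (w < wildcard.size() && wildcard[w] == '^') {
        negate = true;
        ++w;
      }
      bool found = false;
      bool first = true;
      // A ']' in the first position is a literal member of the group;
      // '-' is always a literal member.
      while (w < wildcard.size() && (first || wildcard[w] != ']')) {
        if (wildcard[w] == c) found = true;
        ++w;
        first = false;
      }
      if (w < wildcard.size()) ++w;  // skip the closing ']'
      if (found == negate) return false;
    } else {
      if (wildcard[w] != c) return false;  // ordinary character: exact match
      ++w;
    }
  }
  return true;
}
```

Under these assumptions, the tag NNFS4-----A---- matches ????4, while NNFS3-----A---- does not.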
When the lemma is unknown, MorphoDiTa's generation behavior is defined by the guesser mode (see also question Guesser Mode in Questions and Answers). If at least one lemma is found in the dictionary, NO_GUESSER is returned. If guesser == GUESSER and the lemma is found by the guesser, GUESSER is returned. Otherwise, forms are cleared and -1 is returned.
Guesser Mode: guessing of unknown words is turned on by morpho::GUESSER and off by morpho::NO_GUESSER.
Why string_piece rather than const char* or std::string? Using string_piece avoids \0 padding or string conversion. Nevertheless, both const char* and std::string can be used instead of a string_piece because of existing implicit conversion rules.
The MorphoDiTa API is defined in the header morphodita.h and resides in the ufal::morphodita namespace.
The strings used in the MorphoDiTa API are always UTF-8 encoded (except for file paths, whose encoding is system dependent).
MorphoDiTa version consists of three numbers major.minor.patch with the following semantics:
Models created by MorphoDiTa have the same behaviour in all MorphoDiTa versions, apart from obvious bugfixes. On the other hand, models created from the same data by different major.minor MorphoDiTa versions may have different behaviour.
The lemmas used by MorphoDiTa consist of three parts: a raw lemma, a lemma id and lemma comments.
These parts are stored in one string and the boundaries between them can be determined by the morpho::raw_lemma_len and morpho::lemma_id_len methods.
Analyzer and tagger always return lemma in this structured form. When
performing morphological generation, either raw lemma or both raw lemma and
lemma id can be specified, any lemma comments are ignored.
struct string_piece {
  const char* str;
  size_t len;

  string_piece();
  string_piece(const char* str);
  string_piece(const char* str, size_t len);
  string_piece(const std::string& str);
};
The string_piece is used for efficient string passing. The string referenced in string_piece is not owned by it, so users have to make sure the referenced string exists as long as the string_piece.
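The ownership rule and the implicit conversions can be tried out without the MorphoDiTa headers using a local stand-in. The struct below mirrors the declaration above, but its constructor bodies are assumptions written for this sketch, and piece_length and copy_out are hypothetical helpers.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Local stand-in mirroring the string_piece declaration; the constructor
// bodies are assumptions, not MorphoDiTa's source.
struct string_piece {
  const char* str;
  std::size_t len;

  string_piece() : str(nullptr), len(0) {}
  string_piece(const char* s) : str(s), len(std::strlen(s)) {}
  string_piece(const char* s, std::size_t l) : str(s), len(l) {}
  string_piece(const std::string& s) : str(s.c_str()), len(s.size()) {}
};

// A function taking string_piece accepts C strings and std::string alike,
// thanks to the implicit conversions.
std::size_t piece_length(string_piece piece) { return piece.len; }

// Copies the referenced bytes into an owning std::string. After this call
// the result no longer depends on the lifetime of the original buffer.
std::string copy_out(string_piece piece) {
  return piece.len ? std::string(piece.str, piece.len) : std::string();
}
```

A string_piece built from a std::string stays valid only while that string exists and is not modified, so a piece should not outlive a temporary it was created from.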
struct tagged_form {
  std::string form;
  std::string tag;
};
The tagged_form is a pair of strings used when obtaining a form and tag pair.
struct tagged_lemma {
  std::string lemma;
  std::string tag;
};
The tagged_lemma is a pair of strings used when obtaining a lemma and tag pair.
struct tagged_lemma_forms {
  std::string lemma;
  std::vector<tagged_form> forms;
};
The tagged_lemma_forms represents a lemma and a list of tagged forms.
struct token_range {
  size_t start;
  size_t length;
};
The token_range represents a range of a token as returned by a tokenizer. The start and length fields specify the token position in Unicode characters, not in bytes of UTF-8 encoding.
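Since C++ strings are indexed in bytes, a token_range has to be converted before it can be used with std::string::substr. The following sketch shows one way to do that for valid UTF-8 input. The token_range struct is a local copy of the declaration above, and byte_offset_of_char and token_text are hypothetical helpers, not part of the MorphoDiTa API.

```cpp
#include <cstddef>
#include <string>

// Local copy of the token_range declaration, so the sketch is self-contained.
struct token_range {
  std::size_t start;
  std::size_t length;
};

// Returns the byte offset at which the Unicode character with the given
// index starts, or text.size() if the index is past the end.
// Assumption: text is valid UTF-8.
std::size_t byte_offset_of_char(const std::string& text, std::size_t char_index) {
  std::size_t i = 0;
  for (; i < text.size(); ++i) {
    // Bytes of the form 10xxxxxx continue a character; all others start one.
    const bool starts_char = (static_cast<unsigned char>(text[i]) & 0xC0) != 0x80;
    if (starts_char) {
      if (char_index == 0) break;
      --char_index;
    }
  }
  return i;
}

// Extracts the bytes of a token whose position is given in Unicode characters.
std::string token_text(const std::string& text, const token_range& range) {
  const std::size_t begin = byte_offset_of_char(text, range.start);
  const std::size_t end = byte_offset_of_char(text, range.start + range.length);
  return text.substr(begin, end - begin);
}
```

For the text "malý pes", the second token starts at character 5 but at byte 6, because ý occupies two bytes in UTF-8.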
class version {
 public:
  unsigned major;
  unsigned minor;
  unsigned patch;

  static version current();
};
The version class represents the MorphoDiTa version.
See MorphoDiTa Versioning for more information.
static version current();
Returns current MorphoDiTa version.
class tokenizer {
 public:
  virtual ~tokenizer() {}

  virtual void set_text(string_piece text, bool make_copy = false) = 0;
  virtual bool next_sentence(std::vector<string_piece>* forms, std::vector<token_range>* tokens) = 0;

  static tokenizer* new_vertical_tokenizer();
  static tokenizer* new_czech_tokenizer();
  static tokenizer* new_english_tokenizer();
  static tokenizer* new_generic_tokenizer();
};
The tokenizer class performs segmentation and tokenization of given text. The class is not threadsafe.
The tokenizer instances can be obtained either directly using the static methods or through instances of morpho and tagger.
virtual void set_text(string_piece text, bool make_copy = false) = 0;
Set the text which is to be tokenized.
If make_copy is false, only a reference to the given text is stored and the user has to make sure it exists until the tokenizer is released or set_text is called again. If make_copy is true, a copy of the given text is made and retained until the tokenizer is released or set_text is called again.
virtual bool next_sentence(std::vector<string_piece>* forms, std::vector<token_range>* tokens) = 0;
Locate and return the next sentence of the given text. Returns true when successful and false when there are no more sentences in the given text. The arguments are filled with the found tokens if not NULL. The forms contain token ranges in bytes of UTF-8 encoding; the tokens contain token ranges in Unicode characters.
static tokenizer* new_vertical_tokenizer();
Returns a new instance of a vertical tokenizer, which considers every line to be one token, with an empty line denoting the end of a sentence. The user should delete the instance after use.
static tokenizer* new_czech_tokenizer();
Returns a new instance of a Czech tokenizer. The user should delete it after use.
static tokenizer* new_english_tokenizer();
Returns a new instance of an English tokenizer. The user should delete it after use.
static tokenizer* new_generic_tokenizer();
Returns a new instance of a generic tokenizer. The user should delete it after use.
class morpho {
 public:
  virtual ~morpho() {}

  static morpho* load(const char* fname);
  static morpho* load(FILE* f);

  enum guesser_mode { NO_GUESSER = 0, GUESSER = 1 };

  virtual int analyze(string_piece form, guesser_mode guesser, std::vector<tagged_lemma>& lemmas) const = 0;
  virtual int generate(string_piece lemma, const char* tag_wildcard, guesser_mode guesser, std::vector<tagged_lemma_forms>& forms) const = 0;

  virtual int raw_lemma_len(string_piece lemma) const = 0;
  virtual int lemma_id_len(string_piece lemma) const = 0;
  virtual int raw_form_len(string_piece form) const = 0;

  virtual tokenizer* new_tokenizer() const = 0;
};
A morpho instance represents a morphological dictionary. Such a dictionary allows morphological analysis and morphological generation, provides information about lemma structure, and provides a suitable tokenizer. All methods are thread-safe.
static morpho* load(const char* fname);
Factory method constructor. Accepts a C string with a file name of the model. Returns a pointer to an instance of morpho which the user should delete after use.
static morpho* load(FILE* f);
Factory method constructor. Accepts a C file pointer of an opened file with the model. Returns a pointer to an instance of morpho which the user should delete after use.
enum guesser_mode { NO_GUESSER = 0, GUESSER = 1 };
Guesser mode defines the behavior in case of unknown words. When set to GUESSER, morpho tries to guess unknown words. When set to NO_GUESSER, morpho does not guess unknown words.
virtual int analyze(string_piece form, guesser_mode guesser, std::vector<tagged_lemma>& lemmas) const = 0;
Perform morphological analysis of a form. The guesser parameter specifies whether a guesser can be used if the form is not found in the dictionary. Output is assigned to the lemmas vector.
If the form is found in the dictionary, analyses are assigned to lemmas and NO_GUESSER is returned. If guesser == GUESSER and the form analyses are found using a guesser, they are assigned to lemmas and GUESSER is returned. Otherwise -1 is returned and lemmas are filled with one analysis containing the given form as lemma and a tag for an unknown word.
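A caller will usually branch on the return value of analyze. The sketch below shows the three documented outcomes; the enum is a local copy of guesser_mode and analysis_source is a hypothetical helper, so that the example compiles without the MorphoDiTa headers.

```cpp
#include <string>

// Local copy of the documented enum values.
enum guesser_mode { NO_GUESSER = 0, GUESSER = 1 };

// Maps the return value of analyze to a description of where the
// analyses came from.
std::string analysis_source(int result) {
  switch (result) {
    case NO_GUESSER:
      return "dictionary";  // the form was found in the dictionary
    case GUESSER:
      return "guesser";     // the analyses were produced by the guesser
    default:
      return "unknown";     // -1: a single analysis with the unknown-word tag
  }
}
```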
virtual int generate(string_piece lemma, const char* tag_wildcard, guesser_mode guesser, std::vector<tagged_lemma_forms>& forms) const = 0;
Perform morphological generation of a lemma. Optionally a tag_wildcard can be specified (or be NULL) and if so, results are filtered using this wildcard. The guesser parameter specifies whether a guesser can be used if the lemma is not found in the dictionary. Output is assigned to the forms vector.
The tag_wildcard can be either NULL or a wildcard applied to the results. A ? in the wildcard matches any character, [bytes] matches any of the bytes and [^bytes] matches any byte different from the specified ones. A - has no special meaning inside the bytes and if ] is first in bytes, it does not end the bytes group.
If the given lemma is only a raw lemma, all lemma ids with this raw lemma are returned. Otherwise only matching lemma ids are returned, ignoring any lemma comments. For every found lemma, matching forms are filtered using the tag_wildcard. If at least one lemma is found in the dictionary, NO_GUESSER is returned. If guesser == GUESSER and the lemma is found by the guesser, GUESSER is returned. Otherwise, forms are cleared and -1 is returned.
virtual int raw_lemma_len(string_piece lemma) const = 0;
When given a lemma returned by the dictionary, returns the length of a raw lemma (see Lemma Structure).
virtual int lemma_id_len(string_piece lemma) const = 0;
When given a lemma returned by the dictionary, returns the length of a raw lemma plus a lemma id (see Lemma Structure). Therefore, the substring of the original lemma of this length is a unique lemma identifier. The rest of the original lemma are lemma comments which do not identify the lemma.
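Given the two lengths, a structured lemma can be cut into its three parts. The sketch below takes the lengths as plain arguments so that it compiles without a dictionary; in real use they would come from raw_lemma_len and lemma_id_len. The names lemma_parts and split_lemma are hypothetical, and the lengths are assumed to be byte counts with raw_len <= id_len.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The three parts of a structured lemma.
struct lemma_parts {
  std::string raw_lemma;
  std::string lemma_id;
  std::string comments;
};

// Splits a lemma using the lengths reported for it: raw_len is the length
// of the raw lemma, id_len the length of the raw lemma plus the lemma id.
lemma_parts split_lemma(const std::string& lemma, std::size_t raw_len, std::size_t id_len) {
  assert(raw_len <= id_len && id_len <= lemma.size());
  return {lemma.substr(0, raw_len),
          lemma.substr(raw_len, id_len - raw_len),
          lemma.substr(id_len)};
}
```

For a lemma such as jet-1_^(pohybovat_se,_ne_však_chůzí), lengths of 3 and 5 (assumed values for illustration) would give the raw lemma jet, the lemma id -1 and the remaining text as comments.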
virtual int raw_form_len(string_piece form) const = 0;
When given a form, returns the length of a raw form. This is used only in external morphology model, where form contains also morphological analyses, and this call can return the length of the form without the analyses.
virtual tokenizer* new_tokenizer() const = 0;
Returns a new instance of a suitable tokenizer or NULL if no such tokenizer exists. The user should delete it after use.
class tagger {
 public:
  virtual ~tagger() {}

  static tagger* load(const char* fname);
  static tagger* load(FILE* f);

  virtual const morpho* get_morpho() const = 0;
  virtual void tag(const std::vector<string_piece>& forms, std::vector<tagged_lemma>& tags) const = 0;
  virtual tokenizer* new_tokenizer() const = 0;
};
A tagger instance represents a tagger, which performs disambiguation of morphological analyses. All methods are thread-safe.
static tagger* load(const char* fname);
Factory method constructor. Accepts a C string with a file name of the model. Returns a pointer to an instance of tagger which the user should delete after use.
static tagger* load(FILE* f);
Factory method constructor. Accepts a C file pointer of an opened file with the model. Returns a pointer to an instance of tagger which the user should delete after use.
virtual const morpho* get_morpho() const = 0;
Returns a pointer to an instance of morpho associated with the tagger. Do not delete the pointer; it is owned by the tagger instance and deleted in the tagger destructor.
virtual void tag(const std::vector<string_piece>& forms, std::vector<tagged_lemma>& tags) const = 0;
Perform morphological analysis and subsequent disambiguation. Accepts a std::vector of string_piece and fills the output vector of tagged_lemma.
virtual tokenizer* new_tokenizer() const = 0;
Returns a new instance of a suitable tokenizer or NULL if no such tokenizer exists. The user should delete it after use. The call is equal to get_morpho()->new_tokenizer().
class tagset_converter {
 public:
  virtual ~tagset_converter() {}

  virtual void convert(tagged_lemma& tagged_lemma) const = 0;
  virtual void convert_analyzed(std::vector<tagged_lemma>& tagged_lemmas) const = 0;
  virtual void convert_generated(std::vector<tagged_lemma_forms>& forms) const = 0;

  static tagset_converter* new_identity_converter();
  static tagset_converter* new_pdt_to_conll2009_converter();
  static tagset_converter* new_strip_lemma_comment_converter(const morpho& dictionary);
  static tagset_converter* new_strip_lemma_id_converter(const morpho& dictionary);
};
virtual void convert(tagged_lemma& tagged_lemma) const = 0;
Convert the given tagged lemma.
virtual void convert_analyzed(std::vector<tagged_lemma>& tagged_lemmas) const = 0;
Convert the given results of morpho::analyze. Apart from calling convert, any repeated entries are removed.
virtual void convert_generated(std::vector<tagged_lemma_forms>& forms) const = 0;
Convert the given results of morpho::generate. Apart from calling convert, any repeated entries are removed.
static tagset_converter* new_identity_converter();
Returns a new instance of an identity converter. All convert methods of an identity converter do nothing. The user should delete the instance after use.
static tagset_converter* new_pdt_to_conll2009_converter();
Returns a new instance of a Czech PDT tag set to CoNLL2009 tag set converter. The user should delete the instance after use.
The CoNLL2009 tag set uses two columns for tags: one is the POS and the other contains additional FEATs. Because we have only one tag field, we merge these fields together by using Pos=?|FEAT, i.e., the POS is stored as the first FEAT.
static tagset_converter* new_strip_lemma_comment_converter(const morpho& dictionary);
Returns a new instance of a tag set converter stripping lemma comments using the given morpho instance, which must remain valid during the existence of the tag set converter. The user should delete the tag set converter instance after use.
static tagset_converter* new_strip_lemma_id_converter(const morpho& dictionary);
Returns a new instance of a tag set converter stripping lemma ids using the given morpho instance, which must remain valid during the existence of the tag set converter. The user should delete the tag set converter instance after use.
Bindings for languages other than C++ are created using SWIG from the C++ bindings API, which is a slightly modified version of the native C++ API. The main changes are the replacement of the string_piece type by native strings and the removal of methods using FILE*. Here is the C++ bindings API declaration:
typedef vector<string> Forms;

struct TaggedForm {
  string form;
  string tag;
};
typedef vector<TaggedForm> TaggedForms;

struct TaggedLemma {
  string lemma;
  string tag;
};
typedef vector<TaggedLemma> TaggedLemmas;

struct TaggedLemmaForms {
  string lemma;
  TaggedForms forms;
};
typedef vector<TaggedLemmaForms> TaggedLemmasForms;

struct TokenRange {
  size_t start;
  size_t length;
};
typedef vector<TokenRange> TokenRanges;

class Version {
 public:
  unsigned major;
  unsigned minor;
  unsigned patch;

  static Version current();
};

class Tokenizer {
 public:
  virtual void setText(const char* text);
  virtual bool nextSentence(Forms* forms, TokenRanges* tokens);

  static Tokenizer* newVerticalTokenizer();
  static Tokenizer* newCzechTokenizer();
  static Tokenizer* newEnglishTokenizer();
  static Tokenizer* newGenericTokenizer();
};

class Morpho {
 public:
  static Morpho* load(const char* fname);

  enum { NO_GUESSER = 0, GUESSER = 1 };

  virtual int analyze(const char* form, int guesser, TaggedLemmas& lemmas) const;
  virtual int generate(const char* lemma, const char* tag_wildcard, int guesser, TaggedLemmasForms& forms) const;
  virtual string rawLemma(const char* lemma) const;
  virtual string lemmaId(const char* lemma) const;
  virtual string rawForm(const char* form) const;

  virtual Tokenizer* newTokenizer() const;
};

class Tagger {
 public:
  static Tagger* load(const char* fname);

  virtual const Morpho* getMorpho() const;
  virtual void tag(Forms& forms, TaggedLemmas& tags) const;
  Tokenizer* newTokenizer() const;
};

class TagsetConverter {
 public:
  static TagsetConverter* newIdentityConverter();
  static TagsetConverter* newPdtToConll2009Converter();
  static TagsetConverter* newStripLemmaCommentConverter(const Morpho& morpho);
  static TagsetConverter* newStripLemmaIdConverter(const Morpho& morpho);

  virtual void convert(TaggedLemma& lemma) const;
  virtual void convertAnalyzed(TaggedLemmas& lemmas) const;
  virtual void convertGenerated(TaggedLemmasForms& forms) const;
};
The MorphoDiTa library bindings for Java are available in the cz.cuni.mff.ufal.morphodita package.
The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Java interface; see the cz.cuni.mff.ufal.morphodita.Forms class for reference. Also, class members are accessible and modifiable using getField and setField wrappers.
The bindings require the native C++ library libmorphodita_java (called morphodita_java on Windows). If the library is found in the current directory, it is used; otherwise the standard library search process is used.
The MorphoDiTa library bindings for Perl are available in the Ufal::MorphoDiTa package. The classes can be imported into the current namespace using the :all export tag.
The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Perl interface; see Ufal::MorphoDiTa::Forms for reference. Static methods and enumerations are available only through the module, not through an object instance.
The MorphoDiTa library bindings for Python are available in the ufal.morphodita module.
The bindings are a straightforward conversion of the C++ bindings API. In Python 2, strings can be both unicode and UTF-8 encoded str, and the library always produces unicode. In Python 3, strings must be only str.
Authors:
MorphoDiTa LINDAT/CLARIN entry.
This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
Acknowledgements for individual language models are listed in MorphoDiTa User's Manual.
@InProceedings{strakova14,
  author    = {Strakov\'{a}, Jana and Straka, Milan and Haji\v{c}, Jan},
  title     = {Open-{S}ource {T}ools for {M}orphology, {L}emmatization, {POS} {T}agging and {N}amed {E}ntity {R}ecognition},
  booktitle = {Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland},
  publisher = {Association for Computational Linguistics},
  pages     = {13--18},
  url       = {http://www.aclweb.org/anthology/P/P14/P14-5003}
}