  1. MorphoDiTa Versioning
  2. Lemma Structure
  3. Struct string_piece
  4. Struct tagged_form
  5. Struct tagged_lemma
  6. Struct tagged_lemma_forms
  7. Struct token_range
  8. Struct derivated_lemma
  9. Class version
  10. Class tokenizer
  11. Class derivator
  12. Class derivation_formatter
  13. Class morpho
  14. Class tagger
  15. Class tagset_converter
  16. C++ Bindings API
  17. C# Bindings
  18. Java Bindings
  19. Perl Bindings
  20. Python Bindings

The MorphoDiTa API is defined in the header morphodita.h and resides in the ufal::morphodita namespace.

The strings used in the MorphoDiTa API are always UTF-8 encoded (except for file paths, whose encoding is system dependent).

1. MorphoDiTa Versioning

MorphoDiTa is versioned using Semantic Versioning. Therefore, a version consists of three numbers major.minor.patch, optionally followed by a hyphen and pre-release version info, with the following semantics:

  • Stable versions have no pre-release version info; development versions have non-empty pre-release version info.
  • Two versions with the same major.minor have the same API with the same behaviour, apart from bugs. Therefore, if only patch is increased, the new version is only a bug-fix release.
  • If two versions v and u have the same major, but minor(v) is greater than minor(u), version v contains only additions to the API. In other words, the API of u is all present in v with the same behaviour (once again apart from bugs). It is therefore safe to upgrade to a newer MorphoDiTa version with the same major.
  • If two versions differ in major, their API may differ in any way.
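
The compatibility rules above can be expressed as a small check. The following is a minimal sketch and not part of the MorphoDiTa API: the semver struct and both functions are hypothetical names, and the fields are called major_v, minor_v and patch_v only to avoid the major and minor macros that some system headers define. Pre-release info is used for the stability test only and is ignored when comparing versions.

```cpp
#include <string>
#include <tuple>

// Hypothetical stand-in mirroring the fields of the version class.
struct semver {
  unsigned major_v;
  unsigned minor_v;
  unsigned patch_v;
  std::string prerelease;  // Empty for stable versions.
};

// A version is stable when it carries no pre-release info.
bool is_stable(const semver& v) { return v.prerelease.empty(); }

// True when code written against `from` should keep working with `to`:
// the major numbers match and `to` is not older than `from`.
bool api_compatible_upgrade(const semver& from, const semver& to) {
  if (from.major_v != to.major_v) return false;
  return std::tie(to.minor_v, to.patch_v) >= std::tie(from.minor_v, from.patch_v);
}
```

With these definitions, moving from 1.9.0 to 1.11.2 passes the check, while moving to any 2.x version does not.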

Models created by MorphoDiTa have the same behaviour in all MorphoDiTa versions with the same major, apart from obvious bugfixes. On the other hand, models created from the same data by different major.minor MorphoDiTa versions may have different behaviour.

2. Lemma Structure

The lemmas used by MorphoDiTa consist of three parts:

  1. raw lemma: text form of the lemma. May not uniquely distinguish lemma meanings, lemma use cases etc.
  2. lemma id: together with the raw lemma, provides a unique identifier of the lemma, possibly including lemma meanings or use cases.
  3. lemma comments: additional comments for the given lemma.

These parts are stored in one string, and the boundaries between them can be determined by the morpho::raw_lemma_len and morpho::lemma_id_len methods. The analyzer and the tagger always return lemmas in this structured form. When performing morphological generation, either the raw lemma alone or both the raw lemma and the lemma id can be specified; any lemma comments are ignored.
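
To illustrate how the two lengths divide a lemma string, here is a minimal sketch. The lemma_parts struct and the split_lemma function are hypothetical helpers, not library calls; the lengths are passed in as plain numbers, standing for the values that morpho::raw_lemma_len and morpho::lemma_id_len would return for the same string.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical holder for the three parts of a structured lemma.
struct lemma_parts {
  std::string raw_lemma;
  std::string lemma_id;  // The part following the raw lemma; may be empty.
  std::string comments;  // Everything after the lemma id; may be empty.
};

// raw_len stands for the result of morpho::raw_lemma_len, id_len for the
// result of morpho::lemma_id_len (length of raw lemma plus lemma id).
lemma_parts split_lemma(const std::string& lemma, std::size_t raw_len, std::size_t id_len) {
  // Clamp the lengths so that inconsistent input cannot cause an exception.
  id_len = std::min(id_len, lemma.size());
  raw_len = std::min(raw_len, id_len);

  lemma_parts parts;
  parts.raw_lemma = lemma.substr(0, raw_len);
  parts.lemma_id = lemma.substr(raw_len, id_len - raw_len);
  parts.comments = lemma.substr(id_len);
  return parts;
}
```

For an invented string such as "abc-2_note" with lengths 3 and 5, the sketch yields "abc", "-2" and "_note". The string and the lengths are made up for illustration and do not reflect the format of any particular dictionary.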

3. Struct string_piece

struct string_piece {
  const char* str;
  size_t len;

  string_piece();
  string_piece(const char* str);
  string_piece(const char* str, size_t len);
  string_piece(const std::string& str);
};

The string_piece is used for efficient string passing. The string referenced in string_piece is not owned by it, so users have to make sure the referenced string exists as long as the string_piece.
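
The ownership rule can be shown without the library. The sketch below uses a local stand-in called piece with the same two fields as string_piece; the constructors' behaviour (for example computing the length with strlen) and the helpers to_owned and first_word are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Local stand-in with the same fields as string_piece; it owns nothing.
struct piece {
  const char* str;
  std::size_t len;

  piece() : str(nullptr), len(0) {}
  piece(const char* s) : str(s), len(std::strlen(s)) {}
  piece(const char* s, std::size_t l) : str(s), len(l) {}
  piece(const std::string& s) : str(s.c_str()), len(s.size()) {}
};

// Copies the referenced bytes into an owning std::string.
std::string to_owned(piece p) {
  return p.len ? std::string(p.str, p.len) : std::string();
}

// Returns a view of the text up to the first space. The view borrows from
// `text`, so `text` must outlive the returned piece.
piece first_word(const std::string& text) {
  std::size_t end = text.find(' ');
  return piece(text.data(), end == std::string::npos ? text.size() : end);
}
```

Calling first_word on a temporary std::string would leave the returned piece dangling, which is exactly the situation the paragraph above warns about. Copying with to_owned while the source string is still alive avoids it.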

4. Struct tagged_form

struct tagged_form {
  std::string form;
  std::string tag;
};

The tagged_form is a pair of strings used when obtaining a form and tag pair.

5. Struct tagged_lemma

struct tagged_lemma {
  std::string lemma;
  std::string tag;
};

The tagged_lemma is a pair of strings used when obtaining a lemma and tag pair.

6. Struct tagged_lemma_forms

struct tagged_lemma_forms {
  std::string lemma;
  std::vector<tagged_form> forms;
};

The tagged_lemma_forms represents a lemma and a list of tagged forms.

7. Struct token_range

struct token_range {
  size_t start;
  size_t length;
};

The token_range represents the range of a token as returned by a tokenizer. The start and length fields specify the token position in Unicode characters, not in bytes of the UTF-8 encoding.
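
Because the range counts Unicode characters, it has to be converted before it can index a UTF-8 std::string. The following is a minimal sketch of such a conversion, assuming the text is valid UTF-8; char_range is a local stand-in for token_range, and utf8_offset and token_text are hypothetical helpers.

```cpp
#include <cstddef>
#include <string>

// Local stand-in with the same fields as token_range.
struct char_range {
  std::size_t start;
  std::size_t length;
};

// Byte offset of the Unicode character with the given index in valid UTF-8
// text. Returns text.size() when the index is past the end.
std::size_t utf8_offset(const std::string& text, std::size_t char_index) {
  std::size_t offset = 0;
  while (offset < text.size() && char_index > 0) {
    ++offset;
    // Skip continuation bytes, which have the bit pattern 10xxxxxx.
    while (offset < text.size() &&
           (static_cast<unsigned char>(text[offset]) & 0xC0) == 0x80)
      ++offset;
    --char_index;
  }
  return offset;
}

// Extracts the bytes of a token described by a character-based range.
std::string token_text(const std::string& text, char_range range) {
  std::size_t begin = utf8_offset(text, range.start);
  std::size_t end = utf8_offset(text, range.start + range.length);
  return text.substr(begin, end - begin);
}
```

The scan is linear in the text length for every lookup, which is fine for an illustration; code that converts many ranges would typically walk the text once instead.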

8. Struct derivated_lemma

struct derivated_lemma {
  std::string lemma;
};

The derivated_lemma structure stores information about a derivation. This information currently consists of lemma only, but a type of the derivation may be added later.

9. Class version

class version {
 public:
  unsigned major;
  unsigned minor;
  unsigned patch;
  std::string prerelease;

  static version current();
};

The version class represents the MorphoDiTa version. See MorphoDiTa Versioning for more information.

9.1. version::current

static version current();

Returns the current MorphoDiTa version.

10. Class tokenizer

class tokenizer {
 public:
  virtual ~tokenizer() {}

  virtual void set_text(string_piece text, bool make_copy = false) = 0;
  virtual bool next_sentence(std::vector<string_piece>* forms, std::vector<token_range>* tokens) = 0;

  static tokenizer* new_vertical_tokenizer();
  static tokenizer* new_czech_tokenizer();
  static tokenizer* new_english_tokenizer();
  static tokenizer* new_generic_tokenizer();
};

The tokenizer class performs segmentation and tokenization of the given text. The class is not thread-safe.

The tokenizer instances can be obtained either directly using static methods or through instances of morpho and tagger.

10.1. tokenizer::set_text

virtual void set_text(string_piece text, bool make_copy = false) = 0;

Set the text which is to be tokenized.

If make_copy is false, only a reference to the given text is stored and the user has to make sure it exists until the tokenizer is released or set_text is called again. If make_copy is true, a copy of the given text is made and retained until the tokenizer is released or set_text is called again.

10.2. tokenizer::next_sentence

virtual bool next_sentence(std::vector<string_piece>* forms, std::vector<token_range>* tokens) = 0;

Locate and return next sentence of the given text. Returns true when successful and false when there are no more sentences in the given text. The arguments are filled with found tokens if not NULL. The forms contain token ranges in bytes of UTF-8 encoding, the tokens contain token ranges in Unicode characters.

10.3. tokenizer::new_vertical_tokenizer

static tokenizer* new_vertical_tokenizer();

Returns a new instance of a vertical tokenizer, which considers every line to be one token, with empty line denoting end of sentence. The user should delete the instance after use.
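
The vertical format itself is simple enough to sketch. The function below is a hypothetical illustration of the described behaviour, not the library's implementation: it treats each non-empty line as one token and each empty line as a sentence boundary. It assumes that repeated empty lines do not produce empty sentences and that lines end with a plain newline.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits vertical text into sentences of tokens: one token per line,
// an empty line ends the current sentence.
std::vector<std::vector<std::string>> split_vertical(const std::string& text) {
  std::vector<std::vector<std::string>> sentences;
  std::vector<std::string> current;
  std::istringstream input(text);
  std::string line;

  while (std::getline(input, line)) {
    if (line.empty()) {
      // Sentence boundary; skip it if no tokens were collected.
      if (!current.empty()) sentences.push_back(current);
      current.clear();
    } else {
      current.push_back(line);
    }
  }
  // The last sentence need not be followed by an empty line.
  if (!current.empty()) sentences.push_back(current);
  return sentences;
}
```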

10.4. tokenizer::new_czech_tokenizer

static tokenizer* new_czech_tokenizer();

Returns a new instance of a Czech tokenizer. The user should delete it after use.

If two MorphoDiTa versions have the same major.minor, this tokenizer should behave identically (apart from obvious bugfixes). Nevertheless, the behaviour of this tokenizer might change in a different major.minor version. If you need a tokenizer whose behaviour does not change, use the tokenizer embedded in a morphological dictionary.

10.5. tokenizer::new_english_tokenizer

static tokenizer* new_english_tokenizer();

Returns a new instance of an English tokenizer. The user should delete it after use.

If two MorphoDiTa versions have the same major.minor, this tokenizer should behave identically (apart from obvious bugfixes). Nevertheless, the behaviour of this tokenizer might change in a different major.minor version. If you need a tokenizer whose behaviour does not change, use the tokenizer embedded in a morphological dictionary.

10.6. tokenizer::new_generic_tokenizer

static tokenizer* new_generic_tokenizer();

Returns a new instance of a generic tokenizer. The user should delete it after use.

If two MorphoDiTa versions have the same major.minor, this tokenizer should behave identically (apart from obvious bugfixes). Nevertheless, the behaviour of this tokenizer might change in a different major.minor version. If you need a tokenizer whose behaviour does not change, use the tokenizer embedded in a morphological dictionary.

11. Class derivator

class derivator {
 public:
  virtual ~derivator();

  virtual bool parent(string_piece lemma, derivated_lemma& parent) const = 0;
  virtual bool children(string_piece lemma, std::vector<derivated_lemma>& children) const = 0;
};

The derivator class performs morphological derivation on given lemmas. The derivations are computed using lemma ids, see Lemma Structure.

The derivator instances can be obtained through instances of morpho (and transitively through tagger).

11.1. derivator::parent

virtual bool parent(string_piece lemma, derivated_lemma& parent) const = 0;

Return the parent of a given lemma in the morphological derivation tree. The lemma is assumed to be a lemma id (see Lemma Structure), so if it contains any lemma comments, they are ignored.

The returned lemma is a full lemma (lemma id plus appropriate lemma comments).

If no parent exists, the function empties the parent lemma and returns false.

11.2. derivator::children

virtual bool children(string_piece lemma, std::vector<derivated_lemma>& children) const = 0;

Return the children of a given lemma in the morphological derivation tree. The lemma is assumed to be a lemma id (see Lemma Structure), so if it contains any lemma comments, they are ignored.

The returned lemmas are full lemmas (lemma ids plus appropriate lemma comments).

If no children exist, the function empties the children vector and returns false.

12. Class derivation_formatter

class derivation_formatter {
 public:
  virtual ~derivation_formatter() {}

  virtual void format_derivation(std::string& lemma) const = 0;

  static derivation_formatter* new_none_derivation_formatter();
  static derivation_formatter* new_root_derivation_formatter(const derivator* derinet);
  static derivation_formatter* new_path_derivation_formatter(const derivator* derinet);
  static derivation_formatter* new_tree_derivation_formatter(const derivator* derinet);
  static derivation_formatter* new_derivation_formatter(string_piece name, const derivator* derinet);
};

The derivation_formatter class performs required morphological derivation and formats the results using a single string field (i.e., directly in the lemma).

12.1. derivation_formatter::format_derivation

virtual void format_derivation(std::string& lemma) const = 0;

Perform the required morphological derivation and format the result back directly in the lemma.

12.2. derivation_formatter::new_none_derivation_formatter

static derivation_formatter* new_none_derivation_formatter();

Return a new derivation_formatter instance which does nothing (i.e., it performs no derivation).

12.3. derivation_formatter::new_root_derivation_formatter

static derivation_formatter* new_root_derivation_formatter(const derivator* derinet);

Return a new derivation_formatter instance which replaces a lemma by the corresponding root in the derivation tree.

12.4. derivation_formatter::new_path_derivation_formatter

static derivation_formatter* new_path_derivation_formatter(const derivator* derinet);

Return a new derivation_formatter instance which replaces a lemma by a space separated path to the root in the morphological derivation tree (the original lemma is first, followed by its parent, with the root being the last one).
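
The root and path outputs can be illustrated on a toy derivation tree. The sketch below is not the library's implementation: the parent_map type is a hypothetical stand-in for what derivator::parent provides, the lemmas in the usage note are invented, and the map is assumed to contain no cycles.

```cpp
#include <map>
#include <string>

// Toy stand-in for derivator::parent: maps a lemma to its parent.
// Lemmas without an entry are roots.
typedef std::map<std::string, std::string> parent_map;

// Space-separated path from the lemma to the root: the original lemma
// first, followed by its parent, with the root last.
std::string format_path(const parent_map& parents, const std::string& lemma) {
  std::string path = lemma;
  std::string current = lemma;
  for (parent_map::const_iterator it = parents.find(current);
       it != parents.end(); it = parents.find(current)) {
    current = it->second;
    path += ' ';
    path += current;
  }
  return path;
}

// Root of the derivation tree containing the lemma.
std::string format_root(const parent_map& parents, const std::string& lemma) {
  std::string current = lemma;
  for (parent_map::const_iterator it = parents.find(current);
       it != parents.end(); it = parents.find(current))
    current = it->second;
  return current;
}
```

With a map in which c has parent b and b has parent a, format_path yields "c b a" for c, and format_root yields "a".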

12.5. derivation_formatter::new_tree_derivation_formatter

static derivation_formatter* new_tree_derivation_formatter(const derivator* derinet);

Return a new derivation_formatter instance which appends to the lemma the whole morphological derivation tree which contains it.

The tree is encoded in the following way: root node is the first, then the subtrees of the root children are encoded recursively (each after one space), followed by a final space (which denotes that the children are complete).
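
One reading of this encoding, sketched on a toy tree: a node is written as its name, then each child subtree preceded by one space, then one closing space. The children_map type and the encode_tree function are hypothetical, and the sketch covers only the tree encoding itself, not how the formatter attaches it to the lemma.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-in for derivator::children: maps a lemma to its children.
typedef std::map<std::string, std::vector<std::string>> children_map;

// Appends the encoding of the subtree rooted at `node` to `out`.
void encode_subtree(const children_map& children, const std::string& node, std::string& out) {
  out += node;
  children_map::const_iterator it = children.find(node);
  if (it != children.end()) {
    for (const std::string& child : it->second) {
      out += ' ';  // One space before each child subtree.
      encode_subtree(children, child, out);
    }
  }
  out += ' ';  // Final space: the children of this node are complete.
}

// Encodes the whole tree starting from its root.
std::string encode_tree(const children_map& children, const std::string& root) {
  std::string out;
  encode_subtree(children, root, out);
  return out;
}
```

Under this reading, a leaf x is encoded as "x " and a run of consecutive spaces marks several subtrees closing at once, which lets a reader reconstruct the nesting without brackets.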

12.6. derivation_formatter::new_derivation_formatter

static derivation_formatter* new_derivation_formatter(string_piece name, const derivator* derinet);

Return one of the available derivation_formatter instances according to the name parameter, which selects among the none, root, path and tree formatters described above.

13. Class morpho

class morpho {
 public:
  virtual ~morpho() {}

  static morpho* load(const char* fname);
  static morpho* load(std::istream& is);

  enum guesser_mode { NO_GUESSER = 0, GUESSER = 1 };

  virtual int analyze(string_piece form, guesser_mode guesser, std::vector<tagged_lemma>& lemmas) const = 0;
  virtual int generate(string_piece lemma, const char* tag_wildcard, guesser_mode guesser, std::vector<tagged_lemma_forms>& forms) const = 0;

  virtual int raw_lemma_len(string_piece lemma) const = 0;
  virtual int lemma_id_len(string_piece lemma) const = 0;
  virtual int raw_form_len(string_piece form) const = 0;

  virtual tokenizer* new_tokenizer() const = 0;

  virtual const derivator* get_derivator() const;
};

A morpho instance represents a morphological dictionary. Such a dictionary allows morphological analysis and morphological generation, provides information about lemma structure, and provides a suitable tokenizer. All methods are thread-safe.

13.1. morpho::load(const char*)

static morpho* load(const char* fname);

Factory method constructor. Accepts C string with a file name of the model. Returns a pointer to an instance of morpho which the user should delete after use.

13.2. morpho::load(istream&)

static morpho* load(std::istream& is);

Factory method constructor. Accepts an input stream with the model. Returns a pointer to an instance of morpho which the user should delete after use.

13.3. morpho::guesser_mode

enum guesser_mode { NO_GUESSER = 0, GUESSER = 1 };

Guesser mode defines behavior in case of unknown words. When set to GUESSER, morpho tries to guess unknown words. When set to NO_GUESSER, morpho does not guess unknown words.

13.4. morpho::analyze()

virtual int analyze(string_piece form, guesser_mode guesser, std::vector<tagged_lemma>& lemmas) const = 0;

Perform morphological analysis of a form. The guesser parameter specifies whether a guesser can be used if the form is not found in the dictionary. Output is assigned to the lemmas vector.

If the form is found in the dictionary, analyses are assigned to lemmas and NO_GUESSER is returned. If guesser == GUESSER and the form analyses are found using a guesser, they are assigned to lemmas and GUESSER is returned. Otherwise -1 is returned and lemmas are filled with one analysis containing the given form as lemma and a tag for an unknown word.

13.5. morpho::generate()

virtual int generate(string_piece lemma, const char* tag_wildcard, guesser_mode guesser, std::vector<tagged_lemma_forms>& forms) const = 0;

Perform morphological generation of a lemma. Optionally a tag_wildcard can be specified (or be NULL) and if so, results are filtered using this wildcard. The guesser parameter specifies whether a guesser can be used if the lemma is not found in the dictionary. Output is assigned to the forms vector.

Tag_wildcard can be either NULL or a wildcard applied to the results. A ? in the wildcard matches any character, [bytes] matches any of the bytes and [^bytes] matches any byte different from the specified ones. A - has no special meaning inside the bytes and if ] is first in bytes, it does not end the bytes group.
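
The wildcard syntax can be made concrete with a small matcher. The following sketch is not the library's implementation; tag_matches is a hypothetical function, and it makes three assumptions that the description above does not settle: characters outside a group match themselves literally, the comparison proceeds position by position, and positions beyond the end of the shorter string are left unconstrained.

```cpp
#include <cstddef>
#include <string>

// Returns true when `tag` is accepted by `wildcard` under the assumptions
// stated in the surrounding text.
bool tag_matches(const std::string& wildcard, const std::string& tag) {
  std::size_t w = 0, t = 0;
  while (w < wildcard.size() && t < tag.size()) {
    char c = wildcard[w];
    if (c == '?') {  // Matches any single character.
      ++w;
      ++t;
      continue;
    }
    if (c == '[') {  // [bytes] or [^bytes]
      ++w;
      bool negate = w < wildcard.size() && wildcard[w] == '^';
      if (negate) ++w;
      std::size_t set_begin = w;
      // The first byte of the group is always literal, even if it is ']'.
      if (w < wildcard.size()) ++w;
      while (w < wildcard.size() && wildcard[w] != ']') ++w;
      // A '-' inside the group is an ordinary byte, not a range operator.
      std::string set = wildcard.substr(set_begin, w - set_begin);
      if (w < wildcard.size()) ++w;  // Skip the closing ']'.

      bool in_set = set.find(tag[t]) != std::string::npos;
      if (in_set == negate) return false;
      ++t;
      continue;
    }
    if (c != tag[t]) return false;  // Literal character.
    ++w;
    ++t;
  }
  return true;
}
```

For example, with invented tags, "N[NA]" accepts "NA" and rejects "NV", while "[a-c]" accepts "-" but rejects "b" because the hyphen does not form a range.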

If the given lemma is only a raw lemma, all lemma ids with this raw lemma are returned. Otherwise only matching lemma ids are returned, ignoring any lemma comments. For every found lemma, matching forms are filtered using the tag_wildcard. If at least one lemma is found in the dictionary, NO_GUESSER is returned. If guesser == GUESSER and the lemma is found by the guesser, GUESSER is returned. Otherwise, forms are cleared and -1 is returned.

13.6. morpho::raw_lemma_len

virtual int raw_lemma_len(string_piece lemma) const = 0;

When given a lemma returned by the dictionary, returns the length of a raw lemma (see Lemma Structure).

13.7. morpho::lemma_id_len

virtual int lemma_id_len(string_piece lemma) const = 0;

When given a lemma returned by the dictionary, returns the length of a raw lemma plus a lemma id (see Lemma Structure). Therefore, the substring of the original lemma of this length is a unique lemma identifier. The rest of the original lemma are lemma comments which do not identify the lemma.

13.8. morpho::raw_form_len

virtual int raw_form_len(string_piece form) const = 0;

When given a form, returns the length of the raw form. This is used only in the external morphology model, where the form also contains morphological analyses, and this call returns the length of the form without the analyses.

13.9. morpho::new_tokenizer

virtual tokenizer* new_tokenizer() const = 0;

Returns a new instance of a suitable tokenizer or NULL if no such tokenizer exists. The user should delete it after use.

Note that the tokenizer might use the morpho instance, so the tokenizer must not be used after the morpho instance is destructed.

13.10. morpho::get_derivator

virtual const derivator* get_derivator() const;

Returns a derivator for the morphology, or NULL if not available.

The derivator is owned by the morphology, so the returned instance should not be freed and it cannot be used after the morpho instance is destructed.

14. Class tagger

class tagger {
 public:
  virtual ~tagger() {}

  static tagger* load(const char* fname);
  static tagger* load(std::istream& is);

  virtual const morpho* get_morpho() const = 0;

  virtual void tag(const std::vector<string_piece>& forms, std::vector<tagged_lemma>& tags, morpho::guesser_mode guesser = morpho::guesser_mode(-1)) const = 0;

  virtual void tag_analyzed(const std::vector<string_piece>& forms, std::vector<std::vector<tagged_lemma> >& analyses, std::vector<int>& tags) const = 0;

  virtual tokenizer* new_tokenizer() const = 0;
};

A tagger instance represents a tagger, which performs disambiguation of morphological analyses. All methods are thread-safe.

14.1. tagger::load(const char*)

static tagger* load(const char* fname);

Factory method constructor. Accepts C string with a file name of the model. Returns a pointer to an instance of tagger which the user should delete after use.

14.2. tagger::load(istream&)

static tagger* load(std::istream& is);

Factory method constructor. Accepts an input stream with the model. Returns a pointer to an instance of tagger which the user should delete after use.

14.3. tagger::get_morpho()

virtual const morpho* get_morpho() const = 0;

Returns a pointer to an instance of morpho associated with the tagger. Do not delete the pointer; it is owned by the tagger instance and deleted in the tagger destructor.

14.4. tagger::tag()

virtual void tag(const std::vector<string_piece>& forms, std::vector<tagged_lemma>& tags, morpho::guesser_mode guesser = morpho::guesser_mode(-1)) const = 0;

Perform morphological analysis and subsequent disambiguation. Accepts a std::vector of string_piece and fills the output vector of tagged_lemma.

The guesser parameter defines whether the morphological guesser should be used. If a negative value is specified (which is the default), the guesser setting employed when the tagger model was trained is used.

14.5. tagger::tag_analyzed()

virtual void tag_analyzed(const std::vector<string_piece>& forms, std::vector<std::vector<tagged_lemma> >& analyses, std::vector<int>& tags) const = 0;

Perform morphological disambiguation using given morphological analyses. The indices of chosen analyses are stored in the output vector tags.

None of the analyses can be empty – in that case, no operation is performed and tags is empty. On the other hand, the analyses vector can be larger than forms – additional entries are ignored in that case.

Note that the tagger was trained with a specific morphology – the more your morphological analyses differ from the original ones, the worse the results will be. One of the usages of tag_analyzed is to consider only a subset of morphological analyses.
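
Restricting the analyses to a subset while keeping every entry non-empty could look as follows. This is a minimal sketch with hypothetical names: tagged_lemma is redeclared locally with the same fields as the struct above so that the example is self-contained, and restrict_by_tag_prefix is an invented helper with an arbitrary selection rule (keep tags starting with a given prefix).

```cpp
#include <string>
#include <vector>

// Same fields as the tagged_lemma struct, redeclared for a standalone example.
struct tagged_lemma {
  std::string lemma;
  std::string tag;
};

// For every word, keeps only the analyses whose tag starts with `prefix`.
// If no analysis of a word matches, its original analyses are kept, so that
// no entry becomes empty.
void restrict_by_tag_prefix(std::vector<std::vector<tagged_lemma>>& analyses,
                            const std::string& prefix) {
  for (std::vector<tagged_lemma>& word : analyses) {
    std::vector<tagged_lemma> kept;
    for (const tagged_lemma& analysis : word)
      if (analysis.tag.compare(0, prefix.size(), prefix) == 0)
        kept.push_back(analysis);
    if (!kept.empty()) word.swap(kept);
  }
}
```

The fallback to the original analyses is a design choice of this sketch; it trades strictness of the filter for the guarantee that the input stays acceptable to tag_analyzed.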

14.6. tagger::new_tokenizer

virtual tokenizer* new_tokenizer() const = 0;

Returns a new instance of a suitable tokenizer or NULL if no such tokenizer exists. The user should delete it after use. The call is equivalent to get_morpho()->new_tokenizer().

15. Class tagset_converter

class tagset_converter {
 public:
  virtual ~tagset_converter() {}

  virtual void convert(tagged_lemma& tagged_lemma) const = 0;
  virtual void convert_analyzed(std::vector<tagged_lemma>& tagged_lemmas) const = 0;
  virtual void convert_generated(std::vector<tagged_lemma_forms>& forms) const = 0;

  static tagset_converter* new_identity_converter();
  static tagset_converter* new_pdt_to_conll2009_converter();
  static tagset_converter* new_strip_lemma_comment_converter(const morpho& dictionary);
  static tagset_converter* new_strip_lemma_id_converter(const morpho& dictionary);
};

15.1. tagset_converter::convert()

virtual void convert(tagged_lemma& tagged_lemma) const = 0;

Convert the given tagged lemma.

15.2. tagset_converter::convert_analyzed()

virtual void convert_analyzed(std::vector<tagged_lemma>& tagged_lemmas) const = 0;

Convert the given results of morpho::analyze. Apart from calling convert, any repeated entries are removed.
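
The convert-then-deduplicate behaviour can be sketched independently of any particular tag set. In the example below, convert_and_deduplicate is a hypothetical function, tagged_lemma is redeclared locally, and the per-entry conversion is passed in as a callable. Keeping the first occurrence of each repeated entry and preserving the original order are assumptions of the sketch.

```cpp
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Same fields as the tagged_lemma struct, redeclared for a standalone example.
struct tagged_lemma {
  std::string lemma;
  std::string tag;
};

// Applies `convert` to every entry and then drops entries whose
// (lemma, tag) pair has already been seen.
void convert_and_deduplicate(std::vector<tagged_lemma>& lemmas,
                             const std::function<void(tagged_lemma&)>& convert) {
  std::set<std::pair<std::string, std::string>> seen;
  std::vector<tagged_lemma> unique;
  for (tagged_lemma& entry : lemmas) {
    convert(entry);
    if (seen.insert(std::make_pair(entry.lemma, entry.tag)).second)
      unique.push_back(entry);
  }
  lemmas.swap(unique);
}
```

Deduplication matters because a conversion that discards information, such as stripping lemma comments, can make previously distinct analyses identical.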

15.3. tagset_converter::convert_generated()

virtual void convert_generated(std::vector<tagged_lemma_forms>& forms) const = 0;

Convert the given results of morpho::generate. Apart from calling convert, any repeated entries are removed.

15.4. tagset_converter::new_identity_converter()

static tagset_converter* new_identity_converter();

Returns a new instance of an identity converter. All convert methods of an identity converter do nothing. The user should delete the instance after use.

15.5. tagset_converter::new_pdt_to_conll2009_converter()

static tagset_converter* new_pdt_to_conll2009_converter();

Returns a new instance of a Czech PDT tag set to CoNLL2009 tag set converter. The user should delete the instance after use.

The CoNLL2009 tag set uses two columns for tags – one is a POS and the other contains additional FEATs. Because we have only one tag field, we merge these fields together by using Pos=?|FEAT, i.e., the POS is stored as a first FEAT.

15.6. tagset_converter::new_strip_lemma_comment_converter()

static tagset_converter* new_strip_lemma_comment_converter(const morpho& dictionary);

Returns a new instance of a tag set converter stripping lemma comments using the given morpho instance, which must remain valid for the lifetime of the tag set converter. The user should delete the tag set converter instance after use.

15.7. tagset_converter::new_strip_lemma_id_converter()

static tagset_converter* new_strip_lemma_id_converter(const morpho& dictionary);

Returns a new instance of a tag set converter stripping lemma ids using the given morpho instance, which must remain valid for the lifetime of the tag set converter. The user should delete the tag set converter instance after use.

16. C++ Bindings API

Bindings for other languages than C++ are created using SWIG from the C++ bindings API, which is a slightly modified version of the native C++ API. Main changes are replacement of string_piece type by native strings and removal of methods using istream. Here is the C++ bindings API declaration:

16.1. Helper Structures

typedef vector<int> Indices;

typedef vector<string> Forms;

struct TaggedForm {
  string form;
  string tag;
};
typedef vector<TaggedForm> TaggedForms;

struct TaggedLemma {
  string lemma;
  string tag;
};
typedef vector<TaggedLemma> TaggedLemmas;
typedef vector<TaggedLemmas> Analyses;

struct TaggedLemmaForms {
  string lemma;
  TaggedForms forms;
};
typedef vector<TaggedLemmaForms> TaggedLemmasForms;

struct TokenRange {
  size_t start;
  size_t length;
};
typedef vector<TokenRange> TokenRanges;

struct DerivatedLemma {
  string lemma;
};
typedef vector<DerivatedLemma> DerivatedLemmas;

16.2. Main Classes

class Version {
 public:
  unsigned major;
  unsigned minor;
  unsigned patch;
  string prerelease;

  static Version current();
};

class Tokenizer {
 public:
  virtual void setText(const char* text);
  virtual bool nextSentence(Forms* forms, TokenRanges* tokens);

  static Tokenizer* newVerticalTokenizer();
  static Tokenizer* newCzechTokenizer();
  static Tokenizer* newEnglishTokenizer();
  static Tokenizer* newGenericTokenizer();
};

class Derivator {
 public:
  virtual bool parent(const char* lemma, DerivatedLemma& parent) const;
  virtual bool children(const char* lemma, DerivatedLemmas& children) const;
};

class DerivationFormatter {
 public:
  virtual string formatDerivation(const char* lemma) const;

  static DerivationFormatter* newNoneDerivationFormatter();
  static DerivationFormatter* newRootDerivationFormatter(const Derivator* derivator);
  static DerivationFormatter* newPathDerivationFormatter(const Derivator* derivator);
  static DerivationFormatter* newTreeDerivationFormatter(const Derivator* derivator);
  static DerivationFormatter* newDerivationFormatter(const char* name, const Derivator* derivator);
};

class Morpho {
 public:
  static Morpho* load(const char* fname);

  enum { NO_GUESSER = 0, GUESSER = 1 };

  virtual int analyze(const char* form, int guesser, TaggedLemmas& lemmas) const;
  virtual int generate(const char* lemma, const char* tag_wildcard, int guesser, TaggedLemmasForms& forms) const;
  virtual string rawLemma(const char* lemma) const;
  virtual string lemmaId(const char* lemma) const;
  virtual string rawForm(const char* form) const;

  virtual Tokenizer* newTokenizer() const;

  virtual Derivator* getDerivator() const;
};

class Tagger {
 public:
  static Tagger* load(const char* fname);

  virtual const Morpho* getMorpho() const;

  virtual void tag(const Forms& forms, TaggedLemmas& tags, int guesser = -1) const;

  virtual void tagAnalyzed(const Forms& forms, const Analyses& analyses, Indices& tags) const;

  Tokenizer* newTokenizer() const;
};

class TagsetConverter {
 public:
  static TagsetConverter* newIdentityConverter();
  static TagsetConverter* newPdtToConll2009Converter();
  static TagsetConverter* newStripLemmaCommentConverter(const Morpho& morpho);
  static TagsetConverter* newStripLemmaIdConverter(const Morpho& morpho);

  virtual void convert(TaggedLemma& lemma) const;
  virtual void convertAnalyzed(TaggedLemmas& lemmas) const;
  virtual void convertGenerated(TaggedLemmasForms& forms) const;
};

17. C# Bindings

MorphoDiTa library bindings are available in the Ufal.MorphoDiTa namespace.

The bindings are a straightforward conversion of the C++ bindings API. The bindings require the native C++ library libmorphodita_csharp (called morphodita_csharp on Windows).

18. Java Bindings

MorphoDiTa library bindings are available in the cz.cuni.mff.ufal.morphodita package.

The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Java interface, see the cz.cuni.mff.ufal.morphodita.Forms class for reference. Also, class members are accessible and modifiable using getField and setField wrappers.

The bindings require the native C++ library libmorphodita_java (called morphodita_java on Windows). If the library is found in the current directory, it is used; otherwise the standard library search process is used. The path to the C++ library can also be specified using the static morphodita_java.setLibraryPath(String path) call (before the first call inside the C++ library, of course).

19. Perl Bindings

MorphoDiTa library bindings are available in the Ufal::MorphoDiTa package. The classes can be imported into the current namespace using the :all export tag.

The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Perl interface, see Ufal::MorphoDiTa::Forms for reference. Static methods and enumerations are available only through the module, not through object instances.

20. Python Bindings

MorphoDiTa library bindings are available in the ufal.morphodita module.

The bindings are a straightforward conversion of the C++ bindings API. In Python 2, strings can be either unicode or UTF-8 encoded str, and the library always produces unicode. In Python 3, strings must be str.