The Parsito API is defined in the header parsito.h and resides in the ufal::parsito namespace. The API only allows using existing models; to create a custom model, use the train_parser binary.

The strings used in the Parsito API are always UTF-8 encoded (except for file paths, whose encoding is system dependent).

1. Parsito Versioning

Parsito is versioned using Semantic Versioning. Therefore, a version consists of three numbers major.minor.patch, optionally followed by a hyphen and pre-release version info, with the following semantics:

  • Stable versions have no pre-release version info; development versions have non-empty pre-release version info.
  • Two versions with the same major.minor have the same API with the same behaviour, apart from bugs. Therefore, if only patch is increased, the new version is only a bug-fix release.
  • If two versions v and u have the same major, but minor(v) is greater than minor(u), version v contains only additions to the API. In other words, the entire API of u is present in v with the same behaviour (once again apart from bugs). It is therefore safe to upgrade to a newer Parsito version with the same major.
  • If two versions differ in major, their API may differ in any way.

Models created by Parsito have the same behaviour in all Parsito versions with the same major, apart from obvious bug fixes. On the other hand, models created from the same data by different major.minor Parsito versions may have different behaviour.
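The compatibility rules above can be sketched as a small check. This is only an illustration: the semver struct and the api_compatible and is_stable helpers are hypothetical names that are not part of the Parsito API, and the members are called major_num and minor_num here to avoid clashing with the major and minor macros that some system headers define.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a version number (not the library's version class).
struct semver {
  unsigned major_num;
  unsigned minor_num;
  unsigned patch_num;
  std::string prerelease;  // empty for stable versions
};

// True if code written against the API of `built` can run with `found`:
// the major numbers must match and `found` must not have a lower minor.
// The patch number is ignored, because patch releases only fix bugs.
bool api_compatible(const semver& built, const semver& found) {
  if (built.major_num != found.major_num) return false;  // APIs may differ in any way
  return found.minor_num >= built.minor_num;              // newer minor only adds to the API
}

// Stable versions carry no pre-release version info.
bool is_stable(const semver& v) { return v.prerelease.empty(); }
```

With the library, the values for found would presumably be taken from the version class described later in this document.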

2. Struct string_piece

struct string_piece {
  const char* str;
  size_t len;

  string_piece();
  string_piece(const char* str);
  string_piece(const char* str, size_t len);
  string_piece(const std::string& str);
};

The string_piece struct is used for efficient string passing. It does not own the string it references, so users have to make sure the referenced string exists for as long as the string_piece is in use.
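The ownership rule can be illustrated with a stand-in that has the same members as the declaration above. The constructor bodies, the copy_of helper and the safe_use function are assumptions made for this sketch, not the library's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in mirroring the declaration above; the constructor bodies are an
// assumption about the obvious implementation.
struct string_piece {
  const char* str;
  std::size_t len;

  string_piece() : str(nullptr), len(0) {}
  string_piece(const char* str) : str(str), len(std::strlen(str)) {}
  string_piece(const char* str, std::size_t len) : str(str), len(len) {}
  string_piece(const std::string& str) : str(str.c_str()), len(str.size()) {}
};

// Hypothetical helper: copies the referenced bytes into an owning std::string.
std::string copy_of(string_piece piece) {
  return piece.len ? std::string(piece.str, piece.len) : std::string();
}

// Safe: `owner` outlives the string_piece that refers to its characters.
std::string safe_use() {
  std::string owner = "Dependency parsing";
  string_piece view(owner.c_str(), 10);  // refers to the first 10 bytes, no copy
  return copy_of(view);
}

// Unsafe pattern (shown only as a comment): the temporary std::string is
// destroyed at the end of the statement, leaving the string_piece dangling.
//   string_piece dangling(std::string("temporary"));
```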

3. Class node

class node {
 public:
  int id;              // 0 is root, >0 is sentence node, <0 is undefined
  std::string form;    // form
  std::string lemma;   // lemma
  std::string upostag; // universal part-of-speech tag
  std::string xpostag; // language-specific part-of-speech tag
  std::string feats;   // list of morphological features
  int head;            // head, 0 is root, <0 is without parent
  std::string deprel;  // dependency relation to the head
  std::string deps;    // secondary dependencies
  std::string misc;    // miscellaneous information

  std::vector<int> children;

  node(int id = -1, const std::string& form = std::string());
};

The node class represents a word in the dependency tree. The node fields correspond to the CoNLL-U fields, which are described in the CoNLL-U format documentation, with the children field representing the opposite direction of head links.

4. Class tree

class tree {
 public:
  tree();

  std::vector<node> nodes;

  bool empty();
  void clear();
  node& add_node(const std::string& form);
  void set_head(int id, int head, const std::string& deprel);
  void unlink_all_nodes();

  static const std::string root_form;
};

The tree class represents dependency trees of word nodes. Note that the first node (with index 0) is always a technical root, whose form is root_form.

Although you can manipulate the nodes directly, the tree class offers several simple node manipulation methods.

4.1. tree::empty()

bool empty();

Returns true if the tree is empty, i.e., if it contains only the technical root node.

4.2. tree::clear()

void clear();

Removes all tree nodes but the technical root node.

4.3. tree::add_node()

node& add_node(const std::string& form);

Adds a new node to the tree. The new node has the first unused id and the specified form, and is not linked to any other node. A reference to the new node is returned so that its other fields can be filled in as well.

4.4. tree::set_head()

void set_head(int id, int head, const std::string& deprel);

Links the node id to the node head with the specified dependency relation. If head is negative, the node id is unlinked from its current head, if any.

4.5. tree::unlink_all_nodes()

void unlink_all_nodes();

Unlinks all nodes.
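To make the relation between head and children concrete, here is a simplified model of the documented behaviour. It is a sketch, not the library's source: only a few node fields are kept, the "<root>" string is a placeholder for tree::root_form, and the order of ids inside children is not meant to match the real implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-ins for the node and tree classes described above.
struct node {
  int id;
  std::string form;
  int head;
  std::string deprel;
  std::vector<int> children;

  node(int id = -1, const std::string& form = std::string())
      : id(id), form(form), head(-1) {}
};

struct tree {
  std::vector<node> nodes;

  tree() { clear(); }

  bool empty() const { return nodes.size() == 1; }  // only the technical root

  void clear() {
    nodes.clear();
    nodes.emplace_back(0, "<root>");  // placeholder for tree::root_form
  }

  // The returned reference points into `nodes`, so in this model it is
  // invalidated by the next add_node call.
  node& add_node(const std::string& form) {
    nodes.emplace_back(static_cast<int>(nodes.size()), form);
    return nodes.back();
  }

  void set_head(int id, int head, const std::string& deprel) {
    assert(id >= 0 && id < static_cast<int>(nodes.size()));
    assert(head < static_cast<int>(nodes.size()));
    if (nodes[id].head >= 0) {  // detach from the current head, if any
      std::vector<int>& old = nodes[nodes[id].head].children;
      old.erase(std::remove(old.begin(), old.end(), id), old.end());
    }
    nodes[id].head = head;
    nodes[id].deprel = deprel;
    if (head >= 0) nodes[head].children.push_back(id);  // opposite direction of the head link
  }

  void unlink_all_nodes() {
    for (node& n : nodes) {
      n.head = -1;
      n.deprel.clear();
      n.children.clear();
    }
  }
};

// Builds "Dogs bark": "bark" depends on the root, "Dogs" depends on "bark".
tree build_example() {
  tree t;
  t.add_node("Dogs");  // id 1
  t.add_node("bark");  // id 2
  t.set_head(2, 0, "root");
  t.set_head(1, 2, "nsubj");
  return t;
}
```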

5. Class tree_input_format

class tree_input_format {
 public:
  virtual ~tree_input_format() {}

  virtual bool read_block(std::istream& in, std::string& block) const = 0;
  virtual void set_text(string_piece text, bool make_copy = false) = 0;
  virtual bool next_tree(tree& t) = 0;
  const std::string& last_error() const;

  // Static factory methods
  static tree_input_format* new_input_format(const std::string& name);
  static tree_input_format* new_conllu_input_format();
};

The tree_input_format class allows loading dependency trees in various formats.

5.1. tree_input_format::read_block()

virtual bool read_block(std::istream& in, std::string& block) const = 0;

Loads a reasonably small text block from the specified input stream. The block contains only complete trees (i.e., the last tree in the block is not cut off).

Such a text block might be, for example, a paragraph terminated by an empty line.
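As an illustration of such paragraph-sized blocks, the following sketch reads text up to and including the next empty line. The read_paragraph and count_blocks helpers are hypothetical and are not the library's read_block implementation, whose exact block boundaries may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: stores in `block` all lines up to and including the
// first empty line (or up to the end of input). Returns false when nothing
// could be read.
bool read_paragraph(std::istream& in, std::string& block) {
  block.clear();
  std::string line;
  while (std::getline(in, line)) {
    block.append(line).push_back('\n');
    if (line.empty()) break;  // an empty line ends the block
  }
  return !block.empty();
}

// Counts how many blocks a text splits into.
std::size_t count_blocks(const std::string& text) {
  std::istringstream in(text);
  std::string block;
  std::size_t blocks = 0;
  while (read_paragraph(in, block)) ++blocks;
  return blocks;
}
```

Each block obtained this way could then be handed to set_text and consumed with next_tree.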

5.2. tree_input_format::set_text()

virtual void set_text(string_piece text, bool make_copy = false) = 0;

Sets the text from which the dependency trees will be read.

If make_copy is false, only a reference to the given text is stored and the user has to make sure it exists until the instance is destroyed or set_text is called again. If make_copy is true, a copy of the given text is made and retained until the instance is destroyed or set_text is called again.

5.3. tree_input_format::next_tree()

virtual bool next_tree(tree& t) = 0;

Tries to read another dependency tree from the text specified by set_text. Returns true if a tree was read and false if the text ended or there was a read error.

If the format contains information beyond the fields stored in the tree, it is stored in the tree_input_format instance and can be printed using a corresponding tree_output_format. Note that this additional information is stored only for the last tree read.

5.4. tree_input_format::last_error()

const std::string& last_error() const;

Returns an error which occurred during the last next_tree. If no error occurred, the returned string is empty.
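Because next_tree returns false both at the end of the text and on a read error, last_error is what tells the two cases apart. The sketch below shows that loop as a template, so that it compiles without the library; with Parsito the two template arguments would be tree_input_format and tree. The read_all function and the toy_tree and toy_format types are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Reads every tree from `text`. Relies only on the members documented above:
// set_text, next_tree and last_error.
template <class InputFormat, class Tree>
bool read_all(InputFormat& format, const std::string& text,
              std::vector<Tree>& trees, std::string& error) {
  format.set_text(text);  // no copy requested, so `text` must outlive the loop
  Tree t;
  while (format.next_tree(t)) trees.push_back(t);
  error = format.last_error();  // empty: the text ended; non-empty: read error
  return error.empty();
}

// Toy stand-ins (not Parsito classes): every line is one "tree" and a line
// starting with '!' counts as malformed.
struct toy_tree { std::string line; };

class toy_format {
 public:
  void set_text(const std::string& text) { text_ = &text; pos_ = 0; error_.clear(); }

  bool next_tree(toy_tree& t) {
    error_.clear();
    if (!text_ || pos_ >= text_->size()) return false;  // the text ended
    std::size_t end = text_->find('\n', pos_);
    if (end == std::string::npos) end = text_->size();
    t.line = text_->substr(pos_, end - pos_);
    pos_ = end + 1;
    if (!t.line.empty() && t.line[0] == '!') {
      error_ = "Malformed line: " + t.line;
      return false;  // read error
    }
    return true;
  }

  const std::string& last_error() const { return error_; }

 private:
  const std::string* text_ = nullptr;
  std::size_t pos_ = 0;
  std::string error_;
};
```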

5.5. tree_input_format::new_input_format()

static tree_input_format* new_input_format(const std::string& name);

Creates a new tree_input_format instance, given its name. The following input formats are currently supported:

  • conllu

The new instance must be deleted after use.
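One way to make sure the instance is deleted is to hand the raw pointer to a std::unique_ptr right away. The sketch below uses a stand-in class, since it is meant to compile without the library; the format struct, its live counter and use_format are hypothetical, and returning nullptr for an unknown name is an assumption that this document does not confirm.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in for a class with a raw-pointer factory, such as tree_input_format.
struct format {
  static int live;  // number of instances currently alive, for demonstration
  format() { ++live; }
  ~format() { --live; }

  // Assumption: an unknown name yields nullptr.
  static format* new_format(const std::string& name) {
    return name == "conllu" ? new format() : nullptr;
  }
};
int format::live = 0;

// Returns true if the named format could be created and used.
bool use_format(const std::string& name) {
  std::unique_ptr<format> f(format::new_format(name));  // takes ownership at once
  if (!f) return false;
  assert(format::live == 1);  // the instance exists while `f` is in scope
  return true;
}  // `f` goes out of scope here and deletes the instance
```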

5.6. tree_input_format::new_conllu_input_format()

static tree_input_format* new_conllu_input_format();

Creates a tree_input_format instance which loads dependency trees in the CoNLL-U format. The new instance must be deleted after use.

Note that even though sentence comments and multi-word tokens are not stored in the tree instance, they can be printed using a corresponding CoNLL-U tree_output_format instance.

6. Class tree_output_format

class tree_output_format {
 public:
  virtual ~tree_output_format() {}

  virtual void write_tree(const tree& t, std::string& output, const tree_input_format* additional_info = nullptr) const = 0;

  // Static factory methods
  static tree_output_format* new_output_format(const std::string& name);
  static tree_output_format* new_conllu_output_format();
};

The tree_output_format class allows printing dependency trees in various formats. If the format contains information beyond the fields stored in the tree, that information can be printed by passing the tree_input_format instance which read the tree as the additional_info argument of write_tree.

6.1. tree_output_format::write_tree()

virtual void write_tree(const tree& t, std::string& output, const tree_input_format* additional_info = nullptr) const = 0;

Prints a dependency tree to the specified string.

If the tree was read using a tree_input_format instance, this instance may store additional information, which may be printed by the tree_output_format instance. Note that this additional information is stored only for the tree last read with tree_input_format::next_tree.

6.2. tree_output_format::new_output_format()

static tree_output_format* new_output_format(const std::string& name);

Creates a new tree_output_format instance, given its name. The following output formats are currently supported:

  • conllu

The new instance must be deleted after use.

6.3. tree_output_format::new_conllu_output_format()

static tree_output_format* new_conllu_output_format();

Creates a tree_output_format instance which writes dependency trees in the CoNLL-U format. The new instance must be deleted after use.

Note that even though sentence comments and multi-word tokens are not stored in the tree instance, they can be printed using this instance.

7. Class parser

class parser {
 public:
  virtual ~parser() {};

  virtual void parse(tree& t, unsigned beam_size = 0) const = 0;

  enum { NO_CACHE = 0, FULL_CACHE = 2147483647};
  static parser* load(const char* file, unsigned cache = 1000);
  static parser* load(std::istream& in, unsigned cache = 1000);
};

The parser class allows parsing a given sentence using an existing parser model.

7.1. parser::parse()

virtual void parse(tree& t, unsigned beam_size = 0) const = 0;

Parses the sentence passed in the tree instance and stores the resulting dependency tree in the same instance. If there are any links in the input tree, they are discarded using tree::unlink_all_nodes first.

The beam size of the decoding can optionally be specified, with the value 0 representing parser model default. If the parser model does not support beam search, the argument is ignored.
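A typical use reads trees, parses them and writes them out one at a time. The sketch below expresses that loop as a template so that it compiles without the library; with Parsito the template arguments would be tree, tree_input_format, parser and tree_output_format. The parse_text function and the toy_* types are hypothetical stand-ins invented for this example. The call to piece.clear() is there because this document does not state whether write_tree overwrites or appends to its output string.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Generic read -> parse -> write loop using only the documented members.
template <class Tree, class Input, class Parser, class Output>
bool parse_text(Input& input, const Parser& parser, const Output& output,
                const std::string& text, std::string& result, std::string& error) {
  input.set_text(text);
  Tree t;
  std::string piece;
  while (input.next_tree(t)) {
    parser.parse(t);  // default beam size
    // Write before reading the next tree: the additional information kept by
    // the input format covers only the tree read last.
    piece.clear();
    output.write_tree(t, piece, &input);
    result += piece;
  }
  error = input.last_error();
  return error.empty();
}

// Toy stand-ins (not Parsito classes) that keep the sketch self-contained.
struct toy_tree { std::string words; int head = -1; };

struct toy_input {
  std::istringstream in;
  std::string comment;  // "additional information" about the last tree read

  void set_text(const std::string& text) { in.clear(); in.str(text); }
  bool next_tree(toy_tree& t) {
    comment.clear();
    std::string line;
    while (std::getline(in, line)) {
      if (!line.empty() && line[0] == '#') { comment = line; continue; }
      t.words = line;
      t.head = -1;
      return true;
    }
    return false;
  }
  std::string last_error() const { return std::string(); }
};

struct toy_parser {
  void parse(toy_tree& t) const { t.head = 0; }  // attach everything to the root
};

struct toy_output {
  void write_tree(const toy_tree& t, std::string& out, const toy_input* info) const {
    if (info && !info->comment.empty()) out += info->comment + "\n";
    out += t.words + " -> " + std::to_string(t.head) + "\n";
  }
};
```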

7.2. parser::load(const char*)

static parser* load(const char* file, unsigned cache = 1000);

Loads a parser model from the specified file. Returns a pointer to a new instance of parser, which must be deleted after use.

The cache argument specifies caching level, with NO_CACHE representing no caching and FULL_CACHE maximum caching. Although the interpretation of this argument depends on the parser used, you can consider it as a number of most frequent forms/lemmas/tags to cache (either during model loading or during parsing).

7.3. parser::load(istream&)

static parser* load(std::istream& in, unsigned cache = 1000);

Loads a parser model from the given input stream. The input stream is not closed after loading. Returns a pointer to a new instance of parser, which must be deleted after use.

The cache argument specifies caching level, with NO_CACHE representing no caching and FULL_CACHE maximum caching. Although the interpretation of this argument depends on the parser used, you can consider it as a number of most frequent forms/lemmas/tags to cache (either during model loading or during parsing).

8. Class version

class version {
 public:
  unsigned major;
  unsigned minor;
  unsigned patch;
  std::string prerelease;

  static version current();
};

The version class represents the Parsito version. See Parsito Versioning for more information.

8.1. version::current()

static version current();

Returns the current Parsito version.
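The four fields can be combined into the textual form described in Parsito Versioning, that is major.minor.patch optionally followed by a hyphen and the pre-release info. The version_string helper below is a hypothetical sketch and not a library function; it takes the fields as plain arguments so that it compiles without parsito.h.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: formats version fields as major.minor.patch[-prerelease].
std::string version_string(unsigned major_num, unsigned minor_num,
                           unsigned patch_num, const std::string& prerelease) {
  std::ostringstream os;
  os << major_num << '.' << minor_num << '.' << patch_num;
  if (!prerelease.empty()) os << '-' << prerelease;  // development versions only
  return os.str();
}
```

With the library, the arguments would presumably be the major, minor, patch and prerelease members of the object returned by version::current().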

9. C++ Bindings API

Bindings for languages other than C++ are created using SWIG from the C++ bindings API, which is a slightly modified version of the native C++ API. The main changes are the replacement of the string_piece type with native strings and the removal of methods using istream. Here is the C++ bindings API declaration:

9.1. Helper Structures

typedef vector<int> Children;

class Node {
 public:
  int id;         // 0 is root, >0 is sentence node, <0 is undefined
  string form;    // form
  string lemma;   // lemma
  string upostag; // universal part-of-speech tag
  string xpostag; // language-specific part-of-speech tag
  string feats;   // list of morphological features
  int head;       // head, 0 is root, <0 is without parent
  string deprel;  // dependency relation to the head
  string deps;    // secondary dependencies
  string misc;    // miscellaneous information

  Children children;

  Node(int id = -1, string form = string());
};
typedef vector<Node> Nodes;

9.2. Main Classes

class Tree {
 public:
  Tree();

  Nodes nodes;

  bool empty();
  void clear();
  Node& addNode(string form);
  void setHead(int id, int head, string deprel);
  void unlinkAllNodes();

  static const string root_form;
};

class TreeInputFormat {
 public:
  virtual void setText(string text);
  virtual bool nextTree(Tree& t) = 0;
  string lastError() const;

  // Static factory methods
  static TreeInputFormat* newInputFormat(string name);
  static TreeInputFormat* newConlluInputFormat();
};

class TreeOutputFormat {
 public:
  virtual string writeTree(const Tree& t, const TreeInputFormat* additional_info = nullptr);

  // Static factory methods
  static TreeOutputFormat* newOutputFormat(string name);
  static TreeOutputFormat* newConlluOutputFormat();
};

class Parser {
 public:
  virtual void parse(Tree& t, unsigned beam_size = 0) const;

  enum { NO_CACHE = 0, FULL_CACHE = 2147483647};
  static Parser* load(string file, unsigned cache = 1000);
};

class Version {
 public:
  unsigned major;
  unsigned minor;
  unsigned patch;
  string prerelease;

  static Version current();
};

10. C# Bindings

Parsito library bindings are available in the Ufal.Parsito namespace.

The bindings are a straightforward conversion of the C++ bindings API. They require the native C++ library libparsito_csharp (called parsito_csharp on Windows).

11. Java Bindings

Parsito library bindings are available in the cz.cuni.mff.ufal.parsito package.

The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Java interface; see the cz.cuni.mff.ufal.parsito.Children class for reference. Also, class members are accessible and modifiable using getField and setField wrappers.

The bindings require the native C++ library libparsito_java (called parsito_java on Windows). If the library is found in the current directory, it is used; otherwise the standard library search process is used. The path to the C++ library can also be specified using the static parsito_java.setLibraryPath(String path) call (before the first call inside the C++ library, of course).

12. Perl Bindings

Parsito library bindings are available in the Ufal::Parsito package. The classes can be imported into the current namespace using the :all export tag.

The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Perl interface; see Ufal::Parsito::Children for reference. Static methods and enumerations are available only through the module, not through an object instance.

13. Python Bindings

Parsito library bindings are available in the ufal.parsito module.

The bindings are a straightforward conversion of the C++ bindings API. In Python 2, strings can be either unicode or UTF-8 encoded str, and the library always produces unicode. In Python 3, strings must be str.