Parsito API Reference
The Parsito API is defined in the header parsito.h and resides in the ufal::parsito namespace. The API only allows using existing models; to create a custom model, use the train_parser binary.
The strings used in the Parsito API are always UTF-8 encoded (except for file paths, whose encoding is system dependent).
1. Parsito Versioning
Parsito is versioned using Semantic Versioning. Therefore, a version consists of three numbers major.minor.patch, optionally followed by a hyphen and pre-release version info, with the following semantics:
- Stable versions have no pre-release version info; development versions have non-empty pre-release version info.
- Two versions with the same major.minor have the same API with the same behaviour, apart from bugs. Therefore, if only patch is increased, the new version is only a bug-fix release.
- If two versions v and u have the same major, but minor(v) is greater than minor(u), version v contains only additions to the API. In other words, the API of u is all present in v with the same behaviour (once again apart from bugs). It is therefore safe to upgrade to a newer Parsito version with the same major.
- If two versions differ in major, their API may differ in any way.
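The rules above can be sketched as a small compatibility check. The struct version_info and the functions is_safe_upgrade and is_stable are hypothetical helpers written for this illustration, not part of the Parsito API. The fields are deliberately not named major and minor, because some system headers define macros with those names.

```cpp
#include <string>

// Hypothetical stand-in mirroring the fields of the documented version class.
struct version_info {
  unsigned major_number;
  unsigned minor_number;
  unsigned patch_number;
  std::string prerelease;  // empty for stable versions
};

// Returns true if code written against `from` should keep working with `to`
// under the rules listed above: same major, and minor not lower.
bool is_safe_upgrade(const version_info& from, const version_info& to) {
  if (from.major_number != to.major_number) return false;  // API may differ in any way
  return to.minor_number >= from.minor_number;  // a higher minor only adds to the API
}

// Stable versions carry no pre-release version info.
bool is_stable(const version_info& v) { return v.prerelease.empty(); }
```

This sketch ignores patch numbers, since versions with the same major.minor share the same API apart from bugs.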
Models created by Parsito have the same behaviour in all Parsito versions with the same major, apart from obvious bugfixes. On the other hand, models created from the same data by Parsito versions with different major.minor may behave differently.
2. Struct string_piece
struct string_piece {
  const char* str;
  size_t len;

  string_piece();
  string_piece(const char* str);
  string_piece(const char* str, size_t len);
  string_piece(const std::string& str);
};
The string_piece struct is used for efficient string passing. The string referenced by a string_piece is not owned by it, so users must make sure the referenced string exists for as long as the string_piece is in use.
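The ownership rule can be illustrated with a stand-in type. The sketch below defines its own sketch::string_piece with the same members as the declaration above so that it compiles without the library; the constructor bodies and the helpers copy_out and first_word are assumptions made for this illustration, not Parsito code.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

namespace sketch {

// Stand-in with the same members as the documented string_piece.
// The constructor bodies are assumptions, not the library's code.
struct string_piece {
  const char* str;
  std::size_t len;

  string_piece() : str(nullptr), len(0) {}
  string_piece(const char* str) : str(str), len(std::strlen(str)) {}
  string_piece(const char* str, std::size_t len) : str(str), len(len) {}
  string_piece(const std::string& str) : str(str.c_str()), len(str.size()) {}
};

// Copies the referenced bytes into an owning std::string.
inline std::string copy_out(string_piece piece) {
  return piece.len ? std::string(piece.str, piece.len) : std::string();
}

}  // namespace sketch

// Safe use: `text` outlives the view that points into it.
inline std::string first_word(const std::string& text) {
  std::size_t end = text.find(' ');
  if (end == std::string::npos) end = text.size();
  sketch::string_piece view(text.c_str(), end);  // points into text, no copy
  return sketch::copy_out(view);                 // copied while text is alive
}

// Unsafe use (do not do this): a view built from a temporary std::string
// dangles as soon as the temporary is destroyed.
//   sketch::string_piece bad(std::string("temporary"));
```

Because the view stores only a pointer and a length, creating it costs nothing, but every use of it must happen while the original buffer is still valid.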
3. Class node
class node {
 public:
  int id;                     // 0 is root, >0 is sentence node, <0 is undefined
  std::string form;           // form
  std::string lemma;          // lemma
  std::string upostag;        // universal part-of-speech tag
  std::string xpostag;        // language-specific part-of-speech tag
  std::string feats;          // list of morphological features
  int head;                   // head, 0 is root, <0 is without parent
  std::string deprel;         // dependency relation to the head
  std::string deps;           // secondary dependencies
  std::string misc;           // miscellaneous information

  std::vector<int> children;

  node(int id = -1, const std::string& form = std::string());
};
The node class represents a word in the dependency tree. The node fields correspond to the CoNLL-U fields, which are described in the CoNLL-U format documentation, with the children field representing the opposite direction of the head links.
4. Class tree
class tree {
 public:
  tree();

  std::vector<node> nodes;

  bool empty();
  void clear();
  node& add_node(const std::string& form);
  void set_head(int id, int head, const std::string& deprel);
  void unlink_all_nodes();

  static const std::string root_form;
};
The tree class represents dependency trees of word nodes. Note that the first node (with index 0) is always a technical root, whose form is root_form.
Although you can manipulate the nodes directly, the tree class offers several simple node manipulation methods.
4.1. tree::empty()
bool empty();
Returns true if the tree is empty, i.e., if it contains only the technical root node.
4.2. tree::clear()
void clear();
Removes all tree nodes but the technical root node.
4.3. tree::add_node()
node& add_node(const std::string& form);
Adds a new node to the tree. The new node has the first unused id and the specified form, and is not linked to any other node. A reference to the new node is returned so that its other fields can be filled in as well.
4.4. tree::set_head()
void set_head(int id, int head, const std::string& deprel);
Links the node id to the node head with the specified dependency relation. If head is negative, the node id is unlinked from its current head, if any.
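To make the relation between head links and children lists concrete, here is a minimal stand-in that follows the behaviour described for add_node and set_head. It is a sketch for illustration only, not Parsito's implementation: the types in namespace sketch, the placeholder root form, the insertion order of children, and the helper build_example are all assumptions.

```cpp
#include <algorithm>
#include <string>
#include <vector>

namespace sketch {

// Reduced stand-in for the documented node class (only the fields used here).
struct node {
  int id;
  std::string form;
  int head;                   // 0 is root, <0 is without parent
  std::string deprel;
  std::vector<int> children;  // ids of the nodes whose head is this node

  node(int id = -1, const std::string& form = std::string())
      : id(id), form(form), head(-1) {}
};

// Stand-in following the documented behaviour of the tree class.
struct tree {
  std::vector<node> nodes;

  tree() { clear(); }
  bool empty() const { return nodes.size() == 1; }
  // Keeps only the technical root; "<root>" is a placeholder for root_form.
  void clear() { nodes.clear(); nodes.emplace_back(0, "<root>"); }

  node& add_node(const std::string& form) {
    nodes.emplace_back(int(nodes.size()), form);  // first unused id
    return nodes.back();
  }

  void set_head(int id, int head, const std::string& deprel) {
    node& n = nodes.at(id);
    if (n.head >= 0) {  // unlink from the current head, if any
      std::vector<int>& siblings = nodes.at(n.head).children;
      siblings.erase(std::remove(siblings.begin(), siblings.end(), id), siblings.end());
    }
    n.head = head;
    n.deprel = deprel;
    if (head >= 0) nodes.at(head).children.push_back(id);
  }
};

}  // namespace sketch

// Builds "Dogs bark": "bark" hangs on the root, "Dogs" hangs on "bark".
inline sketch::tree build_example() {
  sketch::tree t;
  int dogs = t.add_node("Dogs").id;  // copy the id; the reference may not stay valid
  int bark = t.add_node("bark").id;
  t.set_head(bark, 0, "root");
  t.set_head(dogs, bark, "nsubj");
  return t;
}
```

In this stand-in the nodes live in a std::vector, so a reference returned by add_node can be invalidated by a later add_node; the example therefore keeps ids rather than references.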
4.5. tree::unlink_all_nodes()
void unlink_all_nodes();
Unlinks all nodes.
5. Class tree_input_format
class tree_input_format {
 public:
  virtual ~tree_input_format() {}

  virtual bool read_block(std::istream& in, std::string& block) const = 0;
  virtual void set_text(string_piece text, bool make_copy = false) = 0;
  virtual bool next_tree(tree& t) = 0;
  const std::string& last_error() const;

  // Static factory methods
  static tree_input_format* new_input_format(const std::string& name);
  static tree_input_format* new_conllu_input_format();
};
The tree_input_format class allows loading dependency trees in various formats.
5.1. tree_input_format::read_block()
virtual bool read_block(std::istream& in, std::string& block) const = 0;
Loads a reasonably small text block from the specified input stream. The block contains only complete trees (i.e., the last tree in the block is not cut off).
Such a text block might be, for example, a paragraph terminated by an empty line.
5.2. tree_input_format::set_text()
virtual void set_text(string_piece text, bool make_copy = false) = 0;
Sets the text from which the dependency trees will be read.
If make_copy is false, only a reference to the given text is stored, and the user has to make sure the text exists until the instance is destroyed or set_text is called again. If make_copy is true, a copy of the given text is made and retained until the instance is destroyed or set_text is called again.
5.3. tree_input_format::next_tree()
virtual bool next_tree(tree& t) = 0;
Tries to read another dependency tree from the text specified by set_text. Returns true if a tree was read and false if the text ended or there was a read error.
If the format contains information beyond the fields stored in the tree, it is stored in the tree_input_format instance and can be printed using a corresponding tree_output_format. Note that this additional information is stored only for the last tree read.
5.4. tree_input_format::last_error()
const std::string& last_error() const;
Returns the error which occurred during the last next_tree call. If no error occurred, the returned string is empty.
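Since next_tree returns false both at the end of the text and on a read error, a reading loop has to consult last_error afterwards to tell the two apart. The sketch below shows that pattern as a template so that it compiles without the library. The function read_all_trees and the types fake_tree and fake_format are hypothetical stand-ins; with Parsito, the template would presumably be used with tree_input_format and tree.

```cpp
#include <string>
#include <vector>

// Reads trees until next_tree returns false, then uses last_error to tell
// a clean end of input from a read error. Works with any Format offering
// next_tree(Tree&) and last_error() shaped like the documented methods.
template <class Format, class Tree>
bool read_all_trees(Format& format, std::vector<Tree>& trees, std::string& error) {
  Tree t;
  while (format.next_tree(t)) trees.push_back(t);
  error = format.last_error();
  return error.empty();  // true means the text simply ended
}

// Hypothetical stand-ins, used only to exercise the loop without the library.
struct fake_tree { int words; };

struct fake_format {
  int remaining;      // number of trees still available
  bool fail_at_end;   // simulate a read error after the last tree
  std::string error;

  bool next_tree(fake_tree& t) {
    error.clear();
    if (remaining > 0) { t.words = remaining--; return true; }
    if (fail_at_end) error = "simulated read error";
    return false;
  }
  const std::string& last_error() const { return error; }
};
```

The trees read before an error are kept in the vector, so the caller can decide whether to use the partial result.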
5.5. tree_input_format::new_input_format()
static tree_input_format* new_input_format(const std::string& name);
Creates a new tree_input_format instance, given its name.
The following input formats are currently supported:
- conllu
The new instance must be deleted after use.
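One way to make sure an instance returned by a factory method is deleted is to hand the raw pointer to std::unique_ptr right away. The sketch below demonstrates the pattern with a hypothetical class reader that has a raw-pointer factory and a virtual destructor, like the classes in this reference; reader, new_reader, live_count and make_reader are illustration-only names, and returning a null pointer for an unknown name is an assumption of the stand-in.

```cpp
#include <memory>
#include <string>

// Hypothetical class with a raw-pointer factory, standing in for the
// factory methods of this API.
class reader {
 public:
  virtual ~reader() { --live_count(); }

  // Returning nullptr for an unknown name is an assumption of this stand-in.
  static reader* new_reader(const std::string& name) {
    return name == "conllu" ? new reader() : nullptr;
  }

  // Number of reader instances currently alive (for demonstration only).
  static int& live_count() { static int count = 0; return count; }

 private:
  reader() { ++live_count(); }
};

// Wraps the raw pointer so that delete happens automatically.
inline std::unique_ptr<reader> make_reader(const std::string& name) {
  return std::unique_ptr<reader>(reader::new_reader(name));
}
```

With Parsito, the same pattern would presumably read std::unique_ptr<tree_input_format> format(tree_input_format::new_input_format("conllu")); it is prudent to check the pointer before use, since this section does not state what the factory returns for an unsupported name.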
5.6. tree_input_format::new_conllu_input_format()
static tree_input_format* new_conllu_input_format();
Creates a tree_input_format instance which loads dependency trees in the CoNLL-U format.
The new instance must be deleted after use.
Note that although sentence comments and multi-word tokens are not stored in the tree instance, they can be printed using a corresponding CoNLL-U tree_output_format instance.
6. Class tree_output_format
class tree_output_format {
 public:
  virtual ~tree_output_format() {}

  virtual void write_tree(const tree& t, std::string& output, const tree_input_format* additional_info = nullptr) const = 0;

  // Static factory methods
  static tree_output_format* new_output_format(const std::string& name);
  static tree_output_format* new_conllu_output_format();
};
The tree_output_format class allows printing dependency trees in various formats. If the format contains information beyond the fields stored in the tree, it can be printed using a corresponding tree_output_format.
6.1. tree_output_format::write_tree()
virtual void write_tree(const tree& t, std::string& output, const tree_input_format* additional_info = nullptr) const = 0;
Prints a dependency tree to the specified string.
If the tree was read using a tree_input_format instance, this instance may store additional information, which may be printed by the tree_output_format instance. Note that this additional information is stored only for the tree last read with tree_input_format::next_tree.
6.2. tree_output_format::new_output_format()
static tree_output_format* new_output_format(const std::string& name);
Creates a new tree_output_format instance, given its name.
The following output formats are currently supported:
- conllu
The new instance must be deleted after use.
6.3. tree_output_format::new_conllu_output_format()
static tree_output_format* new_conllu_output_format();
Creates a tree_output_format instance which writes dependency trees in the CoNLL-U format.
The new instance must be deleted after use.
Note that although sentence comments and multi-word tokens are not stored in the tree instance, they can be printed using this instance.
7. Class parser
class parser {
 public:
  virtual ~parser() {};

  virtual void parse(tree& t, unsigned beam_size = 0) const = 0;

  enum { NO_CACHE = 0, FULL_CACHE = 2147483647};
  static parser* load(const char* file, unsigned cache = 1000);
  static parser* load(std::istream& in, unsigned cache = 1000);
};
The parser class allows parsing a given sentence using an existing parser model.
7.1. parser::parse()
virtual void parse(tree& t, unsigned beam_size = 0) const = 0;
Parses the sentence passed in the tree instance and stores the resulting dependency tree in that same instance. If there are any links in the input tree, they are discarded using tree::unlink_all_nodes first.
The beam size of the decoding can optionally be specified, with the value 0 representing the parser model default. If the parser model does not support beam search, the argument is ignored.
7.2. parser::load(const char*)
static parser* load(const char* file, unsigned cache = 1000);
Loads a parser model from the specified file. Returns a pointer to a new instance of parser, which must be deleted after use.
The cache argument specifies the caching level, with NO_CACHE representing no caching and FULL_CACHE maximum caching. Although the interpretation of this argument depends on the parser used, you can consider it as the number of most frequent forms/lemmas/tags to cache (either during model loading or during parsing).
7.3. parser::load(istream&)
static parser* load(std::istream& in, unsigned cache = 1000);
Loads a parser model from the given input stream. The input stream is not closed after loading. Returns a pointer to a new instance of parser, which must be deleted after use.
The cache argument specifies the caching level, with NO_CACHE representing no caching and FULL_CACHE maximum caching. Although the interpretation of this argument depends on the parser used, you can consider it as the number of most frequent forms/lemmas/tags to cache (either during model loading or during parsing).
8. Class version
class version {
 public:
  unsigned major;
  unsigned minor;
  unsigned patch;
  std::string prerelease;

  static version current();
};
The version class represents a Parsito version. See Parsito Versioning for more information.
8.1. version::current()
static version current();
Returns the current Parsito version.
9. C++ Bindings API
Bindings for languages other than C++ are created using SWIG from the C++ bindings API, which is a slightly modified version of the native C++ API. The main changes are the replacement of the string_piece type by native strings and the removal of methods using istream. Here is the C++ bindings API declaration:
9.1. Helper Structures
typedef vector<int> Children;

class Node {
 public:
  int id;          // 0 is root, >0 is sentence node, <0 is undefined
  string form;     // form
  string lemma;    // lemma
  string upostag;  // universal part-of-speech tag
  string xpostag;  // language-specific part-of-speech tag
  string feats;    // list of morphological features
  int head;        // head, 0 is root, <0 is without parent
  string deprel;   // dependency relation to the head
  string deps;     // secondary dependencies
  string misc;     // miscellaneous information

  Children children;

  Node(int id = -1, string form = string());
};
typedef vector<Node> Nodes;
9.2. Main Classes
class Tree {
 public:
  Tree();

  Nodes nodes;

  bool empty();
  void clear();
  Node& addNode(string form);
  void setHead(int id, int head, string deprel);
  void unlinkAllNodes();

  static const string root_form;
};

class TreeInputFormat {
 public:
  virtual void setText(string text);
  virtual bool nextTree(Tree& t) = 0;
  string lastError() const;

  // Static factory methods
  static TreeInputFormat* newInputFormat(string name);
  static TreeInputFormat* newConlluInputFormat();
};

class TreeOutputFormat {
 public:
  virtual string writeTree(const Tree& t, const TreeInputFormat* additional_info = nullptr);

  // Static factory methods
  static TreeOutputFormat* newOutputFormat(string name);
  static TreeOutputFormat* newConlluOutputFormat();
};

class Parser {
 public:
  virtual void parse(Tree& t, unsigned beam_size = 0) const;

  enum { NO_CACHE = 0, FULL_CACHE = 2147483647};
  static Parser* load(string file, unsigned cache = 1000);
};

class Version {
 public:
  unsigned major;
  unsigned minor;
  unsigned patch;
  string prerelease;

  static Version current();
};
10. C# Bindings
Parsito library bindings are available in the Ufal.Parsito namespace.
The bindings are a straightforward conversion of the C++ bindings API.
The bindings require the native C++ library libparsito_csharp (called parsito_csharp on Windows).
11. Java Bindings
Parsito library bindings are available in the cz.cuni.mff.ufal.parsito package.
The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Java interface; see the cz.cuni.mff.ufal.parsito.Children class for reference. Also, class members are accessible and modifiable using getField and setField wrappers.
The bindings require the native C++ library libparsito_java (called parsito_java on Windows). If the library is found in the current directory, it is used; otherwise the standard library search process is used. The path to the C++ library can also be specified using the static parsito_java.setLibraryPath(String path) call (before the first call into the C++ library, of course).
12. Perl Bindings
Parsito library bindings are available in the Ufal::Parsito package. The classes can be imported into the current namespace using the :all export tag.
The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Perl interface; see Ufal::Parsito::Children for reference. Static methods and enumerations are available only through the module, not through object instances.
13. Python Bindings
Parsito library bindings are available in the ufal.parsito module.
The bindings are a straightforward conversion of the C++ bindings API. In Python 2, strings can be either unicode or UTF-8 encoded str, and the library always produces unicode. In Python 3, strings must be str.