UDPipe 1 API Reference
- UDPipe Versioning
- Struct string_piece
- Class token
- 3.1. token::get_space_after()
- 3.2. token::set_space_after()
- 3.3. token::get_spaces_before()
- 3.4. token::set_spaces_before()
- 3.5. token::get_spaces_after()
- 3.6. token::set_spaces_after()
- 3.7. token::get_spaces_in_token()
- 3.8. token::set_spaces_in_token()
- 3.9. token::get_token_range()
- 3.10. token::set_token_range()
- Class word
- Class multiword_token
- Class empty_node
- Class sentence
- 7.1. sentence::empty()
- 7.2. sentence::clear()
- 7.3. sentence::add_word()
- 7.4. sentence::set_head()
- 7.5. sentence::unlink_all_words()
- 7.6. sentence::get_new_doc()
- 7.7. sentence::set_new_doc()
- 7.8. sentence::get_new_par()
- 7.9. sentence::set_new_par()
- 7.10. sentence::get_sent_id()
- 7.11. sentence::set_sent_id()
- 7.12. sentence::get_text()
- 7.13. sentence::set_text()
- Class input_format
- 8.1. input_format::read_block()
- 8.2. input_format::reset_document()
- 8.3. input_format::set_text()
- 8.4. input_format::next_sentence()
- 8.5. input_format::new_input_format()
- 8.6. input_format::new_conllu_input_format()
- 8.7. input_format::new_generic_tokenizer_input_format()
- 8.8. input_format::new_horizontal_input_format()
- 8.9. input_format::new_vertical_input_format()
- 8.10. input_format::new_presegmented_tokenizer()
- Class output_format
- 9.1. output_format::write_sentence()
- 9.2. output_format::finish_document()
- 9.3. output_format::new_output_format()
- 9.4. output_format::new_conllu_output_format()
- 9.5. output_format::new_epe_output_format()
- 9.6. output_format::new_matxin_output_format()
- 9.7. output_format::new_plaintext_output_format()
- 9.8. output_format::new_horizontal_output_format()
- 9.9. output_format::new_vertical_output_format()
- Class model
- Class pipeline
- Class trainer
- Class evaluator
- Class version
- C++ Bindings API
- C# Bindings
- Java Bindings
- Perl Bindings
- Python Bindings
This section describes the available API. The command-line tools are described on the User's Manual page.
The UDPipe API is defined in the header udpipe.h and resides in
the ufal::udpipe namespace. The API allows only using existing models;
for custom model creation you have to use the train_parser binary.
The strings used in the UDPipe API are always UTF-8 encoded (except for file paths, whose encoding is system dependent).
1. UDPipe Versioning
UDPipe is versioned using Semantic Versioning. Therefore, a version consists of three numbers major.minor.patch, optionally followed by a hyphen and pre-release version info, with the following semantics:
- Stable versions have no pre-release version info, development versions have non-empty pre-release version info.
- Two versions with the same major.minor have the same API with the same behaviour, apart from bugs. Therefore, if only patch is increased, the new version is only a bug-fix release.
- If two versions v and u have the same major, but minor(v) is greater than minor(u), version v contains only additions to the API. In other words, the API of u is all present in v with the same behaviour (once again apart from bugs). It is therefore safe to upgrade to a newer UDPipe version with the same major.
- If two versions differ in major, their API may differ in any way.
Models created by UDPipe have the same behaviour in all UDPipe versions with the same major, apart from obvious bugfixes. On the other hand, models created from the same data by different major.minor UDPipe versions may have different behaviour.
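The API compatibility rules above can be sketched as a small check. This is only an illustration under stated assumptions: semver and is_safe_upgrade are hypothetical names that are not part of the UDPipe API, and pre-release version info is ignored.

```cpp
#include <cassert>

// Hypothetical stand-in for a parsed version; not the UDPipe `version` class.
struct semver {
  unsigned major_no, minor_no, patch_no;
  semver(unsigned major_no, unsigned minor_no, unsigned patch_no)
      : major_no(major_no), minor_no(minor_no), patch_no(patch_no) {}
};

// True if code written against `from` should keep working with `to`:
// the major numbers match and `to` is not older than `from`.
bool is_safe_upgrade(const semver& from, const semver& to) {
  if (from.major_no != to.major_no) return false;  // API may differ in any way
  if (from.minor_no != to.minor_no) return to.minor_no > from.minor_no;  // additions only
  return to.patch_no >= from.patch_no;  // bug-fix releases
}
```

For example, is_safe_upgrade(semver(1, 2, 0), semver(1, 3, 1)) holds, whereas going from 1.2.0 to 2.0.0 or back from 1.3.0 to 1.2.5 does not.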
2. Struct string_piece
struct string_piece {
const char* str;
size_t len;
string_piece();
string_piece(const char* str);
string_piece(const char* str, size_t len);
string_piece(const std::string& str);
};
The string_piece is used for efficient string passing. The string
referenced in string_piece is not owned by it, so users have to make sure
the referenced string exists as long as the string_piece.
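The ownership rule can be illustrated with a minimal sketch. To compile without udpipe.h it declares a local stand-in that mirrors the declaration above; the constructor bodies and the helpers to_owned and first_word are assumptions made for this example, not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Local stand-in mirroring the declaration above (constructor bodies assumed);
// with UDPipe available, use ufal::udpipe::string_piece instead.
struct string_piece {
  const char* str;
  std::size_t len;
  string_piece() : str(nullptr), len(0) {}
  string_piece(const char* str) : str(str), len(std::strlen(str)) {}
  string_piece(const char* str, std::size_t len) : str(str), len(len) {}
  string_piece(const std::string& str) : str(str.c_str()), len(str.size()) {}
};

// Copy the referenced bytes into an owning std::string.
std::string to_owned(string_piece piece) {
  return piece.len ? std::string(piece.str, piece.len) : std::string();
}

// Return a view of the first space-delimited word of `text`.
// The view is valid only while `text` is alive and unmodified.
string_piece first_word(const std::string& text) {
  std::size_t end = text.find(' ');
  return string_piece(text.c_str(), end == std::string::npos ? text.size() : end);
}
```

A piece returned by first_word must not outlive the std::string it points into; call to_owned when the data has to survive longer.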
3. Class token
class token {
public:
string form;
string misc;
token(string_piece form = string_piece(), string_piece misc = string_piece());
// CoNLL-U defined SpaceAfter=No feature
bool get_space_after() const;
void set_space_after(bool space_after);
// UDPipe-specific all-spaces-preserving SpacesBefore and SpacesAfter features
void get_spaces_before(string& spaces_before) const;
void set_spaces_before(string_piece spaces_before);
void get_spaces_after(string& spaces_after) const;
void set_spaces_after(string_piece spaces_after);
void get_spaces_in_token(string& spaces_in_token) const;
void set_spaces_in_token(string_piece spaces_in_token);
// UDPipe-specific TokenRange feature
bool get_token_range(size_t& start, size_t& end) const;
void set_token_range(size_t start, size_t end);
};
The token class represents a sentence token,
with form and misc fields corresponding to CoNLL-U fields.
The token class acts mostly as a parent to word
and multiword_token classes.
The class also offers several methods for manipulating features in the misc field.
Notably, UDPipe uses custom misc fields to store all spaces in the original
document. This markup is backward compatible with CoNLL-U v2 SpaceAfter=No feature.
This markup can be utilized by plaintext output format, which allows reconstructing
the original document.
The markup uses the following misc fields:
- SpacesBefore=content (by default empty): spaces/other content preceding the token
- SpacesAfter=content (by default a space if SpaceAfter=No feature is not present, empty otherwise): spaces/other content following the token
- SpacesInToken=content (by default equal to the FORM of the token): FORM of the token including original spaces (this is needed only if tokens are allowed to contain spaces and a token contains a tab or newline characters)
The content of all above three fields must be escaped to allow storing tabs and newlines.
The following C-like schema is used:
- \s: space
- \t: tab
- \r: CR character
- \n: LF character
- \p: | (pipe character)
- \\: \ (backslash character)
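The escaping schema can be sketched as a pair of helper functions. The names escape_spaces and unescape_spaces are hypothetical, and passing unknown escapes through unchanged is a choice made for this sketch; UDPipe's own handling of malformed input may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Escape a raw string for storage in a misc feature value.
std::string escape_spaces(const std::string& raw) {
  std::string out;
  for (char c : raw) {
    switch (c) {
      case ' ':  out += "\\s"; break;
      case '\t': out += "\\t"; break;
      case '\r': out += "\\r"; break;
      case '\n': out += "\\n"; break;
      case '|':  out += "\\p"; break;
      case '\\': out += "\\\\"; break;
      default:   out += c;
    }
  }
  return out;
}

// Inverse of escape_spaces. An unknown escape or a trailing backslash is
// copied through unchanged (an assumption of this sketch).
std::string unescape_spaces(const std::string& escaped) {
  std::string out;
  for (std::size_t i = 0; i < escaped.size(); i++) {
    if (escaped[i] != '\\' || i + 1 == escaped.size()) {
      out += escaped[i];
      continue;
    }
    switch (escaped[++i]) {
      case 's':  out += ' ';  break;
      case 't':  out += '\t'; break;
      case 'r':  out += '\r'; break;
      case 'n':  out += '\n'; break;
      case 'p':  out += '|';  break;
      case '\\': out += '\\'; break;
      default:   out += '\\'; out += escaped[i];
    }
  }
  return out;
}
```

Escaping and then unescaping any string yields the original string, which is what allows tabs, newlines and pipe characters to be stored in a single misc value.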
3.1. token::get_space_after()
bool get_space_after() const;
Returns true if the token should be followed by a space, false if not,
according to the absence or presence of the SpaceAfter=No feature in the misc field.
3.2. token::set_space_after()
void set_space_after(bool space_after);
Adds or removes the SpaceAfter=No feature in the misc field.
3.3. token::get_spaces_before()
void get_spaces_before(string& spaces_before) const;
Return spaces preceding the current token, stored in the SpacesBefore
feature in the misc field. If SpacesBefore is not present, an empty string
is returned.
3.4. token::set_spaces_before()
void set_spaces_before(string_piece spaces_before);
Set the SpacesBefore feature in the misc field.
3.5. token::get_spaces_after()
void get_spaces_after(string& spaces_after) const;
Return spaces after current token, stored in the SpacesAfter
feature in the misc field.
If SpacesAfter is not present and SpaceAfter=No is present,
return an empty string; if neither feature is present, one space is returned.
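The fallback order described here can be sketched on a raw misc string. The helpers find_misc_feature and spaces_after_escaped are hypothetical names, the misc field is assumed to be a '|'-separated list of Name=Value items, and the result is left in the escaped notation rather than decoded as the real method would do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Look up `name` in a '|'-separated misc string; on success store the raw
// (still escaped) value and return true.
bool find_misc_feature(const std::string& misc, const std::string& name,
                       std::string& value) {
  const std::string key = name + "=";
  std::size_t pos = 0;
  while (pos <= misc.size()) {
    std::size_t end = misc.find('|', pos);
    if (end == std::string::npos) end = misc.size();
    if (misc.compare(pos, key.size(), key) == 0) {
      value = misc.substr(pos + key.size(), end - pos - key.size());
      return true;
    }
    pos = end + 1;
  }
  return false;
}

// Spaces following a token in escaped notation, with the defaults described
// above: SpacesAfter wins, then SpaceAfter=No, otherwise a single space.
std::string spaces_after_escaped(const std::string& misc) {
  std::string value;
  if (find_misc_feature(misc, "SpacesAfter", value)) return value;
  if (find_misc_feature(misc, "SpaceAfter", value) && value == "No") return "";
  return "\\s";  // one space, escaped
}
```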
3.6. token::set_spaces_after()
void set_spaces_after(string_piece spaces_after);
Set the SpacesAfter and SpaceAfter=No features in the misc field.
3.7. token::get_spaces_in_token()
void get_spaces_in_token(string& spaces_in_token) const;
Return the value of the SpacesInToken feature, if present.
Otherwise, an empty string is returned.
3.8. token::set_spaces_in_token()
void set_spaces_in_token(string_piece spaces_in_token);
Set the SpacesInToken feature in the misc field.
3.9. token::get_token_range()
bool get_token_range(size_t& start, size_t& end) const;
If present, return the value of the TokenRange feature in the misc field.
The format of the feature (inspired by Python) is TokenRange=start:end,
where start is zero-based document-level index of the start of the token
(counted in Unicode characters) and end is zero-based document-level index
of the first character following the token (i.e., the length of the token is end-start).
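As an illustration of the TokenRange=start:end format, the following sketch extracts both indices from a raw misc string. The function parse_token_range is a hypothetical helper, not the library implementation, and it assumes the misc field is a '|'-separated list of features.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Parse "TokenRange=start:end" out of a '|'-separated misc string.
// Returns false (leaving start and end untouched) if absent or malformed.
bool parse_token_range(const std::string& misc, std::size_t& start, std::size_t& end) {
  const std::string key = "TokenRange=";
  std::size_t pos = 0;
  while (pos <= misc.size()) {
    std::size_t stop = misc.find('|', pos);
    if (stop == std::string::npos) stop = misc.size();
    if (misc.compare(pos, key.size(), key) == 0) {
      std::size_t values[2] = {0, 0};
      int index = 0;            // 0 while reading start, 1 while reading end
      bool digit_seen = false;  // each number needs at least one digit
      for (std::size_t i = pos + key.size(); i < stop; i++) {
        char c = misc[i];
        if (c >= '0' && c <= '9') {
          values[index] = values[index] * 10 + static_cast<std::size_t>(c - '0');
          digit_seen = true;
        } else if (c == ':' && index == 0 && digit_seen) {
          index = 1;
          digit_seen = false;
        } else {
          return false;
        }
      }
      if (index != 1 || !digit_seen) return false;
      start = values[0];
      end = values[1];
      return true;
    }
    pos = stop + 1;
  }
  return false;
}
```

Because end points to the first character after the token, the token length in Unicode characters is simply end - start.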
3.10. token::set_token_range()
void set_token_range(size_t start, size_t end);
Set the TokenRange feature in the misc field. If string::npos
is passed in the start argument, TokenRange feature is removed
from the misc field.
4. Class word
class word : public token {
public:
// form and misc are inherited from token
int id; // 0 is root, >0 is sentence word, <0 is undefined
string lemma; // lemma
string upostag; // universal part-of-speech tag
string xpostag; // language-specific part-of-speech tag
string feats; // list of morphological features
int head; // head, 0 is root, <0 is undefined
string deprel; // dependency relation to the head
string deps; // secondary dependencies
vector<int> children;
word(int id = -1, string_piece form = string_piece());
};
The word class represents a sentence word.
The word fields correspond to CoNLL-U fields,
with the children field representing the opposite direction of
head links (the elements of the children array are in ascending order).
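The relation between head and children can be sketched independently of the library. The function children_from_heads below is a hypothetical helper operating on a plain vector of head indices; it is not how UDPipe maintains the field internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// heads[i] is the head of word i (0 is the root, negative means unlinked);
// heads[0] belongs to the technical root and is ignored.
std::vector<std::vector<int>> children_from_heads(const std::vector<int>& heads) {
  std::vector<std::vector<int>> children(heads.size());
  for (std::size_t id = 1; id < heads.size(); id++) {
    int head = heads[id];
    if (head < 0) continue;  // word is not linked to any head
    assert(static_cast<std::size_t>(head) < heads.size());
    // Ids are visited in ascending order, so every list stays sorted.
    children[static_cast<std::size_t>(head)].push_back(static_cast<int>(id));
  }
  return children;
}
```

For heads {-1, 2, 0, 2}, word 2 hangs on the root and words 1 and 3 hang on word 2, so the children lists are {2} for the root and {1, 3} for word 2.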
5. Class multiword_token
class multiword_token : public token {
public:
// form and misc are inherited from token
int id_first, id_last;
multiword_token(int id_first = -1, int id_last = -1, string_piece form = string_piece(), string_piece misc = string_piece());
};
The multiword_token represents a multi-word token
described in CoNLL-U format.
The multi-word token has a form and a misc field; other CoNLL-U word
fields are guaranteed to be empty.
6. Class empty_node
class empty_node {
public:
int id; // 0 is root, >0 is sentence word, <0 is undefined
int index; // index for the current id, should be numbered from 1, 0=undefined
string form; // form
string lemma; // lemma
string upostag; // universal part-of-speech tag
string xpostag; // language-specific part-of-speech tag
string feats; // list of morphological features
string deps; // secondary dependencies
string misc; // miscellaneous information
empty_node(int id = -1, int index = 0) : id(id), index(index) {}
};
The empty_node class represents an empty node from CoNLL-U 2.0,
with the fields corresponding to CoNLL-U fields.
For a specified id, the indices are numbered sequentially from 1.
7. Class sentence
class sentence {
public:
sentence();
vector<word> words;
vector<multiword_token> multiword_tokens;
vector<empty_node> empty_nodes;
vector<string> comments;
static const string root_form;
// Basic sentence modifications
bool empty();
void clear();
word& add_word(string_piece form = string_piece());
void set_head(int id, int head, const string& deprel);
void unlink_all_words();
// CoNLL-U defined comments
bool get_new_doc(string* id = nullptr) const;
void set_new_doc(bool new_doc, string_piece id = string_piece());
bool get_new_par(string* id = nullptr) const;
void set_new_par(bool new_par, string_piece id = string_piece());
bool get_sent_id(string& id) const;
void set_sent_id(string_piece id);
bool get_text(string& text) const;
void set_text(string_piece text);
};
The sentence class represents a CoNLL-U sentence,
which consists of:
- sequence of words stored in ascending order, with the first word (with index 0) always being a technical root with form root_form
- sequence of multiword_tokens also stored in ascending order
- sequence of empty_nodes also stored in ascending order
- comments
Although you can manipulate the words directly, the
sentence class offers several simple node manipulation methods.
There are also several methods manipulating CoNLL-U v2 comments.
7.1. sentence::empty()
bool empty();
Returns true if the sentence is empty, i.e., if it contains only a technical root node.
7.2. sentence::clear()
void clear();
Removes all words, multi-word tokens and comments (only the technical root word is kept).
7.3. sentence::add_word()
word& add_word(string_piece form = string_piece());
Adds a new word to the sentence. The new word has the first unused id and the
specified form, and is not linked to any other node. A reference to the new
word is returned so that other fields can also be filled.
7.4. sentence::set_head()
void set_head(int id, int head, const std::string& deprel);
Link the word id to the word head, with the specified dependency relation.
If the head is negative, the word id is unlinked from its current head,
if any.
7.5. sentence::unlink_all_words()
void unlink_all_words();
Unlink all words.
7.6. sentence::get_new_doc()
bool get_new_doc(string* id = nullptr) const;
Return true if # newdoc comment is present. Optionally,
document id is also returned (in # newdoc id = ... format).
7.7. sentence::set_new_doc()
void set_new_doc(bool new_doc, string_piece id = string_piece());
Adds/removes # newdoc comment, optionally with a given
document id.
7.8. sentence::get_new_par()
bool get_new_par(string* id = nullptr) const;
Return true if # newpar comment is present. Optionally,
paragraph id is also returned (in # newpar id = ... format).
7.9. sentence::set_new_par()
void set_new_par(bool new_par, string_piece id = string_piece());
Adds/removes # newpar comment, optionally with a given
paragraph id.
7.10. sentence::get_sent_id()
bool get_sent_id(string& id) const;
Return true if # sent_id = ... comment is present,
and fill given id with sentence id. Otherwise, return false
and clear id.
7.11. sentence::set_sent_id()
void set_sent_id(string_piece id);
Set the # sent_id = ... comment using the given sentence id;
if the sentence id is empty, remove all present # sent_id comments.
7.12. sentence::get_text()
bool get_text(string& text) const;
Return true if # text = ... comment is present,
and fill given text with sentence text. Otherwise, return false
and clear text.
7.13. sentence::set_text()
void set_text(string_piece text);
Set the # text = ... comment using the given text;
if the given text is empty, remove all present # text comments.
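The lookup behind get_sent_id and get_text can be sketched on a plain list of comment lines. The helper get_comment_value is hypothetical, and it assumes the exact "# name = value" spelling with single spaces; the library may accept other spacing.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Find "# name = value" among CoNLL-U comment lines. On success store the
// value and return true; otherwise clear `value` and return false.
bool get_comment_value(const std::vector<std::string>& comments,
                       const std::string& name, std::string& value) {
  const std::string prefix = "# " + name + " = ";
  for (const std::string& comment : comments) {
    if (comment.compare(0, prefix.size(), prefix) == 0) {
      value = comment.substr(prefix.size());
      return true;
    }
  }
  value.clear();
  return false;
}
```

As with the methods above, the output argument is cleared when the comment is missing, so a stale value from an earlier call cannot leak through.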
8. Class input_format
class input_format {
public:
virtual ~input_format() {}
virtual bool read_block(istream& is, string& block) const = 0;
virtual void reset_document(string_piece id = string_piece()) = 0;
virtual void set_text(string_piece text, bool make_copy = false) = 0;
virtual bool next_sentence(sentence& s, string& error) = 0;
// Static factory methods
static input_format* new_input_format(const string& name);
static input_format* new_conllu_input_format(const string& options = std::string());
static input_format* new_generic_tokenizer_input_format(const string& options = std::string());
static input_format* new_horizontal_input_format(const string& options = std::string());
static input_format* new_vertical_input_format(const string& options = std::string());
static input_format* new_presegmented_tokenizer(input_format* tokenizer);
static const string CONLLU_V1;
static const string CONLLU_V2;
static const string GENERIC_TOKENIZER_NORMALIZED_SPACES;
static const string GENERIC_TOKENIZER_PRESEGMENTED;
static const string GENERIC_TOKENIZER_RANGES;
};
The input_format class allows loading sentences in various formats.
The class instances may store internal state and are not thread-safe.
8.1. input_format::read_block()
virtual bool read_block(istream& is, string& block) const = 0;
Read a portion of input, which is guaranteed to contain only complete sentences. Such a portion is usually a paragraph (text followed by an empty line) or a line, but it may be more complex (e.g., in an XML-like format).
8.2. input_format::reset_document()
virtual void reset_document(string_piece id = string_piece()) = 0;
Resets the input_format instance state. Such state
is needed not only for remembering unprocessed text of the last
set_text call, but also for correct inter-block
state tracking (for example to track document-level ranges or inter-sentence spaces:
if you pass only spaces to set_text, these
spaces have to accumulate and be returned as preceding spaces of the next
sentence).
If applicable, the first sentence read will have the # newdoc comment, optionally
with the given document id.
8.3. input_format::set_text()
virtual void set_text(string_piece text, bool make_copy = false) = 0;
Set the text from which the sentences will be read.
If make_copy is false, only a reference to the given text is
stored and the user has to make sure it exists until the instance
is destroyed or set_text is called again. If make_copy
is true, a copy of the given text is made and retained until the
instance is destroyed or set_text is called again.
8.4. input_format::next_sentence()
virtual bool next_sentence(sentence& s, string& error) = 0;
Try reading another sentence from the text specified by
set_text. Returns true if the sentence was
read and false if the text ended or there was a read error. The latter
two conditions can be distinguished by the error parameter – if it is
empty, the text ended, if it is nonempty, it contains a description of the
read error.
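The reading loop and the way the two end conditions are told apart can be sketched generically. The template read_all relies only on a next_sentence member with the semantics described above, so it should also accept an input_format and sentence from udpipe.h, although that combination is not verified here. The line_format struct is a hypothetical stand-in that exists only to make the sketch runnable without the library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Read every remaining sentence from `format` into `out`.
// Returns true if the text simply ended and false on a read error,
// in which case `error` holds the description.
template <class Format, class Sentence>
bool read_all(Format& format, std::vector<Sentence>& out, std::string& error) {
  error.clear();
  Sentence s;
  while (format.next_sentence(s, error)) out.push_back(s);
  return error.empty();  // empty error: text ended; nonempty: read error
}

// Hypothetical stand-in: "sentences" are plain lines, and a line starting
// with '!' is reported as a read error.
struct line_format {
  std::string text;
  std::size_t pos = 0;
  void set_text(const std::string& t) { text = t; pos = 0; }
  bool next_sentence(std::string& s, std::string& error) {
    error.clear();
    if (pos >= text.size()) return false;  // text ended, error stays empty
    std::size_t end = text.find('\n', pos);
    if (end == std::string::npos) end = text.size();
    s = text.substr(pos, end - pos);
    pos = end + 1;
    if (!s.empty() && s[0] == '!') {
      error = "bad line: " + s;
      return false;
    }
    return true;
  }
};
```

The important point is that a false return value alone is ambiguous; callers should always inspect the error string before treating the input as finished.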
8.5. input_format::new_input_format()
static input_format* new_input_format(const string& name);
Create new input_format instance, given its name.
The individual input formats can be parametrized by using format=data
syntax. The following input formats are currently supported:
- conllu: return the new_conllu_input_format
- generic_tokenizer: return the new_generic_tokenizer_input_format
- horizontal: return the new_horizontal_input_format
- vertical: return the new_vertical_input_format
The new instance must be deleted after use.
8.6. input_format::new_conllu_input_format()
static input_format* new_conllu_input_format(const string& options = std::string());
Create input_format instance which loads sentences
in the CoNLL-U format.
The new instance must be deleted after use.
Supported options:
- v2 (default): use CoNLL-U v2
- v1: allow loading only CoNLL-U v1 (i.e., no empty nodes and no spaces in forms and lemmas)
8.7. input_format::new_generic_tokenizer_input_format()
static input_format* new_generic_tokenizer_input_format(const string& options = std::string());
Create rule-based generic tokenizer for English-like languages (with spaces separating tokens and English-like punctuation). The new instance must be deleted after use.
Supported options:
- normalized_spaces: by default, UDPipe uses custom misc fields to exactly encode spaces in the original document. If normalized_spaces option is given, only standard CoNLL-U v2 markup (SpaceAfter=No and # newpar) is used.
- presegmented: input is assumed to be already segmented, with every sentence on a line, and is only tokenized (respecting sentence breaks)
- ranges: for every token, range in the original document is stored in a format described in token class
8.8. input_format::new_horizontal_input_format()
static input_format* new_horizontal_input_format(const string& options = std::string());
Create input_format instance which loads forms from a simple
horizontal format – each sentence on a line, with word forms separated by spaces.
The new instance must be deleted after use.
In order to allow spaces in tokens, Unicode character 'NO-BREAK SPACE' (U+00A0) is considered part of token and converted to a space during loading.
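The NO-BREAK SPACE convention can be sketched as a small line splitter. The function split_horizontal_line is a hypothetical illustration that treats only the ASCII space as a separator and skips empty forms; the actual reader may differ in such details.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split one horizontal-format line into forms, turning NO-BREAK SPACE
// (U+00A0, bytes 0xC2 0xA0 in UTF-8) inside a token into a plain space.
std::vector<std::string> split_horizontal_line(const std::string& line) {
  std::vector<std::string> forms;
  std::string form;
  for (std::size_t i = 0; i < line.size(); i++) {
    if (line[i] == ' ') {
      if (!form.empty()) forms.push_back(form);
      form.clear();
    } else if (line[i] == '\xC2' && i + 1 < line.size() && line[i + 1] == '\xA0') {
      form += ' ';
      i++;  // skip the second byte of U+00A0
    } else {
      form += line[i];
    }
  }
  if (!form.empty()) forms.push_back(form);
  return forms;
}
```

A line containing "New", U+00A0, "York is big" therefore yields three forms, the first of which is "New York" with an ordinary space.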
8.9. input_format::new_vertical_input_format()
static input_format* new_vertical_input_format(const string& options = std::string());
Create input_format instance which loads forms from a simple
vertical format – each word on a line, with empty line denoting end of sentence.
The new instance must be deleted after use.
8.10. input_format::new_presegmented_tokenizer()
static input_format* new_presegmented_tokenizer(input_format* tokenizer);
Create input_format instance which acts as a tokenizer
adapter – given a tokenizer which segments anywhere, it creates a tokenizer
which segments on newline characters (by calling the tokenizer on individual lines,
and if the tokenizer segments in the middle of the line, it calls it repeatedly
and merges the results).
The new instance must be deleted after use. Note that the new instance
takes ownership of the given tokenizer and deletes it during
its own deletion.
9. Class output_format
class output_format {
public:
virtual ~output_format() {}
virtual void write_sentence(const sentence& s, ostream& os) = 0;
virtual void finish_document(ostream& os) {};
// Static factory methods
static output_format* new_output_format(const string& name);
static output_format* new_conllu_output_format(const string& options = std::string());
static output_format* new_epe_output_format(const string& options = std::string());
static output_format* new_matxin_output_format(const string& options = std::string());
static output_format* new_horizontal_output_format(const string& options = std::string());
static output_format* new_plaintext_output_format(const string& options = std::string());
static output_format* new_vertical_output_format(const string& options = std::string());
static const string CONLLU_V1;
static const string CONLLU_V2;
static const string HORIZONTAL_PARAGRAPHS;
static const string PLAINTEXT_NORMALIZED_SPACES;
static const string VERTICAL_PARAGRAPHS;
};
The output_format class allows printing sentences
in various formats.
The class instances may store internal state and are not thread-safe.
9.1. output_format::write_sentence()
virtual void write_sentence(const sentence& s, ostream& os) = 0;
Write given sentence to the given output stream.
When the output format requires document-level markup, it is written
automatically when the first sentence is written using this
output_format instance (or after
finish_document call).
9.2. output_format::finish_document()
virtual void finish_document(ostream& os) {};
When the output format requires document-level markup, write
the end-of-document mark and reset the output_format
instance state (i.e., the next write_sentence
will write start-of-document mark).
9.3. output_format::new_output_format()
static output_format* new_output_format(const string& name);
Create new output_format instance, given its name.
The following output formats are currently supported:
- conllu: return the new_conllu_output_format
- epe: return the new_epe_output_format
- matxin: return the new_matxin_output_format
- horizontal: return the new_horizontal_output_format
- plaintext: return the new_plaintext_output_format
- vertical: return the new_vertical_output_format
The new instance must be deleted after use.
9.4. output_format::new_conllu_output_format()
static output_format* new_conllu_output_format(const string& options = std::string());
Creates output_format instance for writing sentences
in the CoNLL-U format.
The new instance must be deleted after use.
Supported options:
- v2 (default): use CoNLL-U v2
- v1: produce output in CoNLL-U v1 format. Note that this is a lossy process, as empty nodes are ignored and spaces in forms and lemmas are converted to underscores.
9.5. output_format::new_epe_output_format()
static output_format* new_epe_output_format(const string& options = std::string());
Creates output_format instance for writing sentences
in the EPE (Extrinsic Parser Evaluation 2017) interchange format.
The new instance must be deleted after use.
9.6. output_format::new_matxin_output_format()
static output_format* new_matxin_output_format(const string& options = std::string());
Creates output_format instance for writing sentences
in the Matxin format – UDPipe produces an XML with the following DTD:
<!ELEMENT corpus (SENTENCE*)>
<!ELEMENT SENTENCE (NODE*)>
<!ATTLIST SENTENCE ord CDATA #REQUIRED
alloc CDATA #REQUIRED>
<!ELEMENT NODE (NODE*)>
<!ATTLIST NODE ord CDATA #REQUIRED
alloc CDATA #REQUIRED
form CDATA #REQUIRED
lem CDATA #REQUIRED
mi CDATA #REQUIRED
si CDATA #REQUIRED
sub CDATA #REQUIRED>
The new instance must be deleted after use.
9.7. output_format::new_plaintext_output_format()
static output_format* new_plaintext_output_format(const string& options = std::string());
Creates output_format instance for writing sentence
tokens (in the UD sense) using original spacing.
By default, UDPipe custom misc features (see description of
token class) are used to reconstruct the exact original spaces.
However, if the document does not contain these features or if only
normalized spacing is wanted, you can use the following option:
- normalized_spaces: write one sentence on a line, and either one or no space between tokens, using the SpaceAfter=No feature
9.8. output_format::new_horizontal_output_format()
static output_format* new_horizontal_output_format(const string& options = std::string());
Creates output_format instance for writing sentences
in a simple horizontal format – each sentence on a line, with word forms separated
by spaces. The new instance must be deleted after use.
Because words can contain spaces in CoNLL-U v2, the spaces in words are converted to Unicode character 'NO-BREAK SPACE' (U+00A0).
Supported options:
- paragraphs: if given, an empty line is printed after the end of a paragraph or a document (recognized by # newpar or # newdoc comments)
9.9. output_format::new_vertical_output_format()
static output_format* new_vertical_output_format(const string& options = std::string());
Creates output_format instance for writing sentences
in a simple vertical format – each word form on a line, with empty line
denoting end of sentence. The new instance must be deleted after use.
Supported options:
- paragraphs: if given, an empty line is printed after the end of a paragraph or a document (recognized by # newpar or # newdoc comments)
10. Class model
class model {
public:
virtual ~model() {}
static model* load(const char* fname);
static model* load(istream& is);
virtual input_format* new_tokenizer(const string& options) const = 0;
virtual bool tag(sentence& s, const string& options, string& error) const = 0;
virtual bool parse(sentence& s, const string& options, string& error) const = 0;
static const string DEFAULT;
static const string TOKENIZER_NORMALIZED_SPACES;
static const string TOKENIZER_PRESEGMENTED;
static const string TOKENIZER_RANGES;
};
Class representing a UDPipe model, which can perform tokenization, tagging and parsing.
10.1. model::load(const char*)
static model* load(const char* fname);
Load a new model from a given file, returning NULL on failure.
The new instance must be deleted after use.
10.2. model::load(istream&)
static model* load(istream& is);
Load a new model from a given input stream, returning NULL on failure.
The new instance must be deleted after use.
10.3. model::new_tokenizer()
virtual input_format* new_tokenizer(const string& options) const = 0;
Construct a new tokenizer (or NULL if no tokenizer is specified by the model).
The new instance must be deleted after use.
10.4. model::tag()
virtual bool tag(sentence& s, const string& options, string& error) const = 0;
Tag the given sentence.
10.5. model::parse()
virtual bool parse(sentence& s, const string& options, string& error) const = 0;
Parse the given sentence.
11. Class pipeline
class pipeline {
public:
pipeline(const model* m, const string& input, const string& tagger, const string& parser, const string& output);
void set_model(const model* m);
void set_input(const string& input);
void set_tagger(const string& tagger);
void set_parser(const string& parser);
void set_output(const string& output);
void set_immediate(bool immediate);
void set_document_id(const string& document_id);
bool process(istream& is, ostream& os, string& error) const;
static const string DEFAULT;
static const string NONE;
};
The pipeline class allows simple file-to-file processing.
A model and input/tagger/parser/output options can be specified in the pipeline.
The input file can be processed either after being fully loaded (default),
or in immediate mode, in which case the input is processed and printed as soon
as a block of input guaranteed to contain whole sentences is loaded.
Specifically, for most input formats the input is processed after loading an
empty line (with the exception of horizontal input format and
presegmented tokenizer, where the input is processed after loading every
line).
11.1. pipeline::set_model()
void set_model(const model* m);
Use the given model.
11.2. pipeline::set_input()
void set_input(const string& input);
Use the given input format. In addition to formats described in
new_input_format, a special
tokenizer or tokenizer=options format allows using the
model tokenizer.
11.3. pipeline::set_tagger()
void set_tagger(const string& tagger);
Use the given tagger options.
11.4. pipeline::set_parser()
void set_parser(const string& parser);
Use the given parser options.
11.5. pipeline::set_output()
void set_output(const string& output);
Use the given output format (see
new_output_format for a list).
11.6. pipeline::set_immediate()
void set_immediate(bool immediate);
Set or reset the immediate mode (default is immediate=false).
11.7. pipeline::set_document_id()
void set_document_id(const string& document_id);
Set the document id, which is passed to
input_format::reset_document.
11.8. pipeline::process()
bool process(istream& is, ostream& os, string& error) const;
Process the given input stream, writing results to the given output stream.
If the processing succeeded, true is returned; otherwise, false
is returned with an error stored in the error argument.
12. Class trainer
class trainer {
public:
static bool train(const string& method, const vector<sentence>& train, const vector<sentence>& heldout,
const string& tokenizer, const string& tagger, const string& parser,
ostream& os, string& error);
static const string DEFAULT;
static const string NONE;
};
Class allowing training a UDPipe model.
12.1. trainer::train()
static bool train(const string& method, const vector<sentence>& train, const vector<sentence>& heldout, const string& tokenizer, const string& tagger, const string& parser, ostream& os, string& error);
Train a UDPipe model. The only supported method is currently morphodita_parsito.
Use the supplied train and heldout data, and given tokenizer, tagger and parser
options (see the Training UDPipe Models section in the User's Manual).
If the training succeeded, true is returned and the model is saved to the
given os stream; otherwise, false is returned with an error stored in
the error argument.
13. Class evaluator
class evaluator {
public:
evaluator(const model* m, const string& tokenizer, const string& tagger, const string& parser);
void set_model(const model* m);
void set_tokenizer(const string& tokenizer);
void set_tagger(const string& tagger);
void set_parser(const string& parser);
bool evaluate(istream& is, ostream& os, string& error) const;
static const string DEFAULT;
static const string NONE;
};
Class evaluating the performance of a given model on a CoNLL-U file.
Three different settings (depending on whether tokenizer, tagger and parser are used) can be evaluated. For details, see Measuring Model Accuracy in the User's Manual.
13.1. evaluator::set_model()
void set_model(const model* m);
Use the given model.
13.2. evaluator::set_tokenizer()
void set_tokenizer(const string& tokenizer);
Use the given tokenizer options; pass DEFAULT to use default
options or NONE not to use a tokenizer.
13.3. evaluator::set_tagger()
void set_tagger(const string& tagger);
Use the given tagger options; pass DEFAULT to use default
options or NONE not to use a tagger.
13.4. evaluator::set_parser()
void set_parser(const string& parser);
Use the given parser options; pass DEFAULT to use default
options or NONE not to use a parser.
13.5. evaluator::evaluate()
bool evaluate(istream& is, ostream& os, string& error) const;
Evaluate the specified model on the given CoNLL-U input read
from is stream.
If the evaluation succeeded, true is returned and the evaluation
results are written to the os stream in a plain text format;
otherwise, false is returned with an error stored in
the error argument.
14. Class version
class version {
public:
unsigned major;
unsigned minor;
unsigned patch;
string prerelease;
static version current();
};
The version class represents a UDPipe version.
See UDPipe Versioning for more information.
14.1. version::current()
static version current();
Returns the current UDPipe version.
15. C++ Bindings API
Bindings for languages other than C++ are created using SWIG from the C++
bindings API, which is a slightly modified version of the native C++ API.
Main changes are replacement of string_piece type by native
strings and removal of methods using istream. Here is the C++ bindings API
declaration:
15.1. Helper Structures
typedef vector<int> Children;
typedef vector<uint8_t> Bytes;
typedef vector<string> Comments;
class ProcessingError {
public:
bool occurred();
string message;
};
class Token {
public:
string form;
string misc;
Token(const string& form = string(), const string& misc = string());
// CoNLL-U defined SpaceAfter=No feature
bool getSpaceAfter() const;
void setSpaceAfter(bool space_after);
// UDPipe-specific all-spaces-preserving SpacesBefore and SpacesAfter features
string getSpacesBefore() const;
void setSpacesBefore(const string& spaces_before);
string getSpacesAfter() const;
void setSpacesAfter(const string& spaces_after);
string getSpacesInToken() const;
void setSpacesInToken(const string& spaces_in_token);
// UDPipe-specific TokenRange feature
bool getTokenRange() const;
size_t getTokenRangeStart() const;
size_t getTokenRangeEnd() const;
void setTokenRange(size_t start, size_t end);
};
class Word : public Token {
public:
// form and misc are inherited from Token
int id; // 0 is root, >0 is sentence word, <0 is undefined
string lemma; // lemma
string upostag; // universal part-of-speech tag
string xpostag; // language-specific part-of-speech tag
string feats; // list of morphological features
int head; // head, 0 is root, <0 is undefined
string deprel; // dependency relation to the head
string deps; // secondary dependencies
Children children;
Word(int id = -1, const string& form = string());
};
typedef vector<Word> Words;
class MultiwordToken : public Token {
public:
// form and misc are inherited from Token
int idFirst, idLast;
MultiwordToken(int id_first = -1, int id_last = -1, const string& form = string(), const string& misc = string());
};
typedef vector<MultiwordToken> MultiwordTokens;
class EmptyNode {
public:
int id; // 0 is root, >0 is sentence word, <0 is undefined
int index; // index for the current id, should be numbered from 1, 0=undefined
string form; // form
string lemma; // lemma
string upostag; // universal part-of-speech tag
string xpostag; // language-specific part-of-speech tag
string feats; // list of morphological features
string deps; // secondary dependencies
string misc; // miscellaneous information
EmptyNode(int id = -1, int index = 0) : id(id), index(index) {}
};
typedef vector<EmptyNode> EmptyNodes;
class Sentence {
public:
Sentence();
Words words;
MultiwordTokens multiwordTokens;
EmptyNodes emptyNodes;
Comments comments;
static const string rootForm;
// Basic sentence modifications
bool empty();
void clear();
virtual Word& addWord(const char* form);
void setHead(int id, int head, const string& deprel);
void unlinkAllWords();
// CoNLL-U defined comments
bool getNewDoc() const;
string getNewDocId() const;
void setNewDoc(bool new_doc, const string& id = string());
bool getNewPar() const;
string getNewParId() const;
void setNewPar(bool new_par, const string& id = string());
string getSentId() const;
void setSentId(const string& id);
string getText() const;
void setText(const string& text);
};
typedef vector<Sentence> Sentences;
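To make the id, head and children conventions of the helper structures concrete, here is a minimal sketch that rebuilds the children lists from the head fields. It is not UDPipe code: word_stub is a hypothetical stand-in carrying only the three relevant Word fields, and the sketch assumes that the word at index 0 is the root and that every word's id equals its index in the vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in with the id, head and children fields of Word.
struct word_stub {
  int id;                     // 0 is root, >0 is sentence word
  int head;                   // 0 is root, <0 is undefined
  std::vector<int> children;  // ids of the words headed by this word
};

// Recomputes every children list from the head fields.
// Assumes words[i].id == i and that words[0] is the root (assumption).
void rebuild_children(std::vector<word_stub>& words) {
  for (auto& w : words) w.children.clear();
  for (std::size_t i = 1; i < words.size(); ++i) {
    int head = words[i].head;
    // Words with an undefined (negative) or out-of-range head stay unlinked.
    if (head >= 0 && head < static_cast<int>(words.size()))
      words[head].children.push_back(words[i].id);
  }
}
```

Because words are visited in increasing id order, each children list ends up sorted by id.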
15.2. Main Classes
class InputFormat {
public:
virtual void resetDocument(const string& id = string());
virtual void setText(const char* text);
virtual bool nextSentence(Sentence& s, ProcessingError* error = nullptr);
static InputFormat* newInputFormat(const string& name);
static InputFormat* newConlluInputFormat(const string& id = string());
static InputFormat* newGenericTokenizerInputFormat(const string& id = string());
static InputFormat* newHorizontalInputFormat(const string& id = string());
static InputFormat* newVerticalInputFormat(const string& id = string());
static InputFormat* newPresegmentedTokenizer(InputFormat* tokenizer);
static const string CONLLU_V1;
static const string CONLLU_V2;
static const string GENERIC_TOKENIZER_NORMALIZED_SPACES;
static const string GENERIC_TOKENIZER_PRESEGMENTED;
static const string GENERIC_TOKENIZER_RANGES;
};
class OutputFormat {
public:
virtual string writeSentence(const Sentence& s);
virtual string finishDocument();
static OutputFormat* newOutputFormat(const string& name);
static OutputFormat* newConlluOutputFormat(const string& options = string());
static OutputFormat* newEpeOutputFormat(const string& options = string());
static OutputFormat* newMatxinOutputFormat(const string& options = string());
static OutputFormat* newHorizontalOutputFormat(const string& options = string());
static OutputFormat* newPlaintextOutputFormat(const string& options = string());
static OutputFormat* newVerticalOutputFormat(const string& options = string());
static const string CONLLU_V1;
static const string CONLLU_V2;
static const string HORIZONTAL_PARAGRAPHS;
static const string PLAINTEXT_NORMALIZED_SPACES;
static const string VERTICAL_PARAGRAPHS;
};
class Model {
public:
static Model* load(const char* fname);
virtual InputFormat* newTokenizer(const string& options) const;
virtual bool tag(Sentence& s, const string& options, ProcessingError* error = nullptr) const;
virtual bool parse(Sentence& s, const string& options, ProcessingError* error = nullptr) const;
static const string DEFAULT;
static const string TOKENIZER_PRESEGMENTED;
};
class Pipeline {
public:
Pipeline(const Model* m, const string& input, const string& tagger, const string& parser, const string& output);
void setModel(const Model* m);
void setInput(const string& input);
void setTagger(const string& tagger);
void setParser(const string& parser);
void setOutput(const string& output);
void setImmediate(bool immediate);
void setDocumentId(const string& document_id);
string process(const string& data, ProcessingError* error = nullptr) const;
static const string DEFAULT;
static const string NONE;
};
class Trainer {
public:
static Bytes* train(const string& method, const Sentences& train, const Sentences& heldout,
const string& tokenizer, const string& tagger, const string& parser,
ProcessingError* error = nullptr);
static const string DEFAULT;
static const string NONE;
};
class Evaluator {
public:
Evaluator(const Model* m, const string& tokenizer, const string& tagger, const string& parser);
void setModel(const Model* m);
void setTokenizer(const string& tokenizer);
void setTagger(const string& tagger);
void setParser(const string& parser);
string evaluate(const string& data, ProcessingError* error = nullptr) const;
static const string DEFAULT;
static const string NONE;
};
class Version {
public:
unsigned major;
unsigned minor;
unsigned patch;
string prerelease;
// Returns current version.
static Version current();
};
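The bindings API reports failures through an optional ProcessingError argument rather than a boolean result plus an error string. One possible way to turn that into an exception is sketched below. process_checked is a hypothetical helper, and error_stub and pipeline_stub are stand-ins that only mimic the declared shapes of ProcessingError and Pipeline::process so the sketch compiles without UDPipe; they do not reproduce the library's behavior.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: calls process() and throws if the error object
// reports a failure. Error is expected to look like ProcessingError
// (bool occurred(); string message;).
template <class Error, class Pipeline>
std::string process_checked(const Pipeline& p, const std::string& data) {
  Error error;
  std::string output = p.process(data, &error);
  if (error.occurred()) throw std::runtime_error(error.message);
  return output;
}

// Stand-in for ProcessingError (assumption: an empty message means no error).
struct error_stub {
  std::string message;
  bool occurred() { return !message.empty(); }
};

// Stand-in for Pipeline: prefixes the input with a comment line and
// fails on empty input. The real class runs the configured model.
struct pipeline_stub {
  std::string process(const std::string& data, error_stub* error = nullptr) const {
    if (data.empty()) {
      if (error) error->message = "no input";
      return std::string();
    }
    return "# processed\n" + data;
  }
};
```

The error type is the first template parameter so that it can be named explicitly while the pipeline type is deduced from the argument.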
16. C# Bindings
UDPipe library bindings are available in the Ufal.UDPipe namespace.
The bindings are a straightforward conversion of the C++ bindings API.
The bindings require the native C++ library libudpipe_csharp (called
udpipe_csharp on Windows).
17. Java Bindings
UDPipe library bindings are available in the cz.cuni.mff.ufal.udpipe
package.
The bindings are a straightforward conversion of the C++ bindings API.
Vectors do not have a native Java interface; see the
cz.cuni.mff.ufal.udpipe.Words class for reference. Also, class members
are accessible and modifiable using getField and setField
wrappers.
The bindings require the native C++ library libudpipe_java (called
udpipe_java on Windows). If the library is found in the current
directory, it is used; otherwise, the standard library search process is used.
The path to the C++ library can also be specified using the static
udpipe_java.setLibraryPath(String path) call (before the first call
inside the C++ library, of course).
18. Perl Bindings
UDPipe library bindings are available in the
Ufal::UDPipe package.
The classes can be imported into the current namespace using the :all
export tag.
The bindings are a straightforward conversion of the C++ bindings API.
Vectors do not have a native Perl interface; see Ufal::UDPipe::Words for
reference. Static methods and enumerations are available only through the
module, not through an object instance.
19. Python Bindings
UDPipe library bindings are available in the
ufal.udpipe module.
The bindings are a straightforward conversion of the C++ bindings API,
except that the native bytes type is used instead of the C++ Bytes type.
You might also be interested in the contributed package spacy-udpipe, which wraps UDPipe with the spaCy API.


