NameTag API Reference
The NameTag API is defined in the header nametag.h and resides in the ufal::nametag namespace.
The strings used in the NameTag API are always UTF-8 encoded (except for file paths, whose encoding is system dependent).
1. NameTag Versioning
NameTag is versioned using Semantic Versioning. Therefore, a version consists of three numbers major.minor.patch, optionally followed by a hyphen and pre-release version info, with the following semantics:
- Stable versions have no pre-release version info, while development versions have non-empty pre-release version info.
- Two versions with the same major.minor have the same API with the same behaviour, apart from bugs. Therefore, if only patch is increased, the new version is only a bug-fix release.
- If two versions v and u have the same major, but minor(v) is greater than minor(u), version v contains only additions to the API. In other words, the API of u is all present in v with the same behaviour (once again apart from bugs). It is therefore safe to upgrade to a newer NameTag version with the same major.
- If two versions differ in major, their API may differ in any way.
Models created by NameTag have the same behaviour in all NameTag versions with the same major, apart from obvious bugfixes. On the other hand, models created from the same data by different major.minor NameTag versions may have different behaviour.
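The compatibility rules above can be sketched as a small check. This is only an illustration and not part of the NameTag API: the semver struct and the functions is_safe_upgrade and is_stable are hypothetical names, and the field names carry a _num suffix only to avoid clashing with the major and minor macros that some system headers define.

```cpp
#include <string>

// Hypothetical stand-in for a parsed version number; this is NOT the
// NameTag version class, only a container for the fields described above.
struct semver {
  unsigned major_num;
  unsigned minor_num;
  unsigned patch_num;
  std::string prerelease;  // empty for stable versions
};

// Stable versions have no pre-release version info.
bool is_stable(const semver& v) {
  return v.prerelease.empty();
}

// Returns true if code written against version `from` is expected to keep
// working with version `to` under the rules above: the major number must
// match and the minor number must not decrease. The patch number is ignored,
// because versions with equal major.minor share the same API.
bool is_safe_upgrade(const semver& from, const semver& to) {
  if (from.major_num != to.major_num) return false;  // API may differ in any way
  return to.minor_num >= from.minor_num;             // only additions to the API
}
```

With these definitions, moving from 1.1.0 to 1.2.3 counts as a safe upgrade, while moving from 1.2.3 back to 1.1.0 or from 1.1.0 to 2.0.0 does not.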
2. Struct string_piece
struct string_piece {
  const char* str;
  size_t len;

  string_piece();
  string_piece(const char* str);
  string_piece(const char* str, size_t len);
  string_piece(const std::string& str);
};
The string_piece is used for efficient string passing. The string referenced in string_piece is not owned by it, so users have to make sure the referenced string exists for as long as the string_piece is in use.
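The ownership rule can be illustrated with a self-contained sketch. It uses a local copy of the struct so that it compiles without nametag.h; the constructor bodies are assumptions (only the declarations are given above), and the helpers prefix and to_owned are hypothetical names that are not part of the NameTag API.

```cpp
#include <stddef.h>  // size_t
#include <cstring>
#include <string>

// Local mirror of the declaration above. The constructor bodies are
// assumptions made for this example, not the NameTag implementation.
struct string_piece {
  const char* str;
  size_t len;

  string_piece() : str(nullptr), len(0) {}
  string_piece(const char* str) : str(str), len(std::strlen(str)) {}
  string_piece(const char* str, size_t len) : str(str), len(len) {}
  string_piece(const std::string& str) : str(str.c_str()), len(str.size()) {}
};

// Hypothetical helper: returns a non-owning view of the first `n` bytes of
// `text`. The view is valid only while `text` is alive and unmodified.
string_piece prefix(const std::string& text, size_t n) {
  return string_piece(text.c_str(), n < text.size() ? n : text.size());
}

// Hypothetical helper: copies the referenced bytes into an owning
// std::string, for data that must outlive the buffer the piece points into.
std::string to_owned(string_piece piece) {
  return piece.len ? std::string(piece.str, piece.len) : std::string();
}
```

A piece obtained from prefix(text, 6) shares storage with text, so it must not be used after text is destroyed or changed; calling to_owned on it first yields a copy that remains valid afterwards.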
3. Struct token_range
struct token_range {
  size_t start;
  size_t length;
};
The token_range represents the range of a token as returned by a tokenizer. The start and length fields specify the token position in Unicode characters, not in bytes of UTF-8 encoding.
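Because the offsets count Unicode characters, they cannot be used directly as indices into a UTF-8 encoded std::string. The following sketch shows one way to translate them, assuming that "Unicode characters" means code points and that the text is valid UTF-8. The struct is a local copy of the declaration above, and byte_offset and token_text are hypothetical helper names, not NameTag functions.

```cpp
#include <stddef.h>  // size_t
#include <string>

// Local mirror of the declaration above, so the example compiles on its own.
struct token_range {
  size_t start;
  size_t length;
};

// Returns the byte offset of the code point with index `char_index` in the
// UTF-8 string `text`, or text.size() if the index lies past the end.
size_t byte_offset(const std::string& text, size_t char_index) {
  size_t i = 0;
  while (i < text.size() && char_index > 0) {
    ++i;  // step over the lead byte of the current code point
    // Step over its continuation bytes, which have the form 10xxxxxx.
    while (i < text.size() &&
           (static_cast<unsigned char>(text[i]) & 0xC0) == 0x80) {
      ++i;
    }
    --char_index;
  }
  return i;
}

// Extracts the bytes covered by a range that is given in Unicode characters.
std::string token_text(const std::string& text, const token_range& range) {
  size_t begin = byte_offset(text, range.start);
  size_t end = byte_offset(text, range.start + range.length);
  return text.substr(begin, end - begin);
}
```

Each lookup scans the text from the beginning, which keeps the sketch short; code that resolves many ranges over a long text would probably want to walk the text once instead.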
4. Struct named_entity
struct named_entity {
  size_t start;
  size_t length;
  std::string type;

  named_entity();
  named_entity(size_t start, size_t length, const std::string& type);
};
The named_entity is used to represent a named entity. The start and length fields represent the entity range in either tokens, Unicode characters or bytes, depending on the usage. The type represents the entity type.
5. Class version
class version {
 public:
  unsigned major;
  unsigned minor;
  unsigned patch;

  static version current();
};
The version class represents a NameTag version.
See NameTag Versioning for more information.
5.1. version::current
static version current();
Returns current NameTag version.
6. Class tokenizer
class tokenizer {
 public:
  virtual ~tokenizer() {}

  virtual void set_text(string_piece text, bool make_copy = false) = 0;
  virtual bool next_sentence(std::vector<string_piece>* forms, std::vector<token_range>* tokens) = 0;

  static tokenizer* new_vertical_tokenizer();
};
The tokenizer class performs segmentation and tokenization of the given text. The class is not thread-safe.
The tokenizer instances can be obtained either directly using the static method new_vertical_tokenizer or through instances of ner.
6.1. tokenizer::set_text
virtual void set_text(string_piece text, bool make_copy = false) = 0;
Set the text which is to be tokenized.
If make_copy is false, only a reference to the given text is stored and the user has to make sure it exists until the tokenizer is released or set_text is called again. If make_copy is true, a copy of the given text is made and retained until the tokenizer is released or set_text is called again.
6.2. tokenizer::next_sentence
virtual bool next_sentence(std::vector<string_piece>* forms, std::vector<token_range>* tokens) = 0;
Locate and return the next sentence of the given text. Returns true when successful and false when there are no more sentences in the given text. The arguments are filled with the found tokens if not NULL.
The forms contain token ranges in bytes of UTF-8 encoding, while the tokens contain token ranges in Unicode characters.
6.3. tokenizer::new_vertical_tokenizer
static tokenizer* new_vertical_tokenizer();
Returns a new instance of a vertical tokenizer, which considers every line to be one token, with an empty line denoting the end of a sentence. The user should delete the instance after use.
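The vertical input format can be illustrated with a small stand-in that does not use NameTag at all. The function split_vertical is a hypothetical name, and its handling of edge cases is an assumption: consecutive empty lines do not produce empty sentences, and carriage returns are not stripped. The library's own tokenizer may treat such input differently.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical illustration of the vertical format: every non-empty line is
// one token and an empty line ends the current sentence. This is a sketch of
// the format, not the implementation behind new_vertical_tokenizer.
std::vector<std::vector<std::string>> split_vertical(const std::string& text) {
  std::vector<std::vector<std::string>> sentences;
  std::vector<std::string> current;
  std::istringstream input(text);
  std::string line;
  while (std::getline(input, line)) {
    if (line.empty()) {
      // An empty line closes the sentence collected so far, if any.
      if (!current.empty()) sentences.push_back(current);
      current.clear();
    } else {
      current.push_back(line);
    }
  }
  // The last sentence need not be followed by an empty line.
  if (!current.empty()) sentences.push_back(current);
  return sentences;
}
```

For the input "John\nSmith\n\nPrague\n" this returns two sentences, the first with the tokens John and Smith and the second with the single token Prague.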
7. Class ner
class ner {
 public:
  virtual ~ner() {}

  static ner* load(const char* fname);
  static ner* load(istream& is);

  virtual void recognize(const std::vector<string_piece>& forms, std::vector<named_entity>& entities) const = 0;
  virtual void entity_types(std::vector<std::string>& types) const = 0;
  virtual void gazetteers(std::vector<std::string>& gazetteers, std::vector<int>* gazetteer_types) const = 0;

  virtual tokenizer* new_tokenizer() const = 0;
};
A ner instance represents a named entity recognizer. All methods are thread-safe.
7.1. ner::load(const char*)
static ner* load(const char* fname);
Factory method constructor. Accepts a C string with the file name of the model. Returns a pointer to an instance of ner, which the user should delete after use.
7.2. ner::load(istream&)
static ner* load(istream& is);
Factory method constructor. Accepts an input stream with the model. Returns a pointer to an instance of ner, which the user should delete after use.
7.3. ner::recognize
virtual void recognize(const std::vector<string_piece>& forms, std::vector<named_entity>& entities) const = 0;
Perform named entity recognition on a tokenized sentence given in the forms argument. The found entities are returned in the entities argument. The range of each returned named_entity is represented using form indices.
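Since the ranges refer to form indices, turning an entity back into text means joining the covered forms. The sketch below shows this with a local copy of named_entity whose constructor bodies are assumptions. The forms are plain std::string values rather than string_piece to keep the example self-contained, and entity_text is a hypothetical helper, not a NameTag function.

```cpp
#include <stddef.h>  // size_t
#include <string>
#include <vector>

// Local mirror of the named_entity declaration; the constructor bodies are
// assumptions made so that the example compiles without nametag.h.
struct named_entity {
  size_t start;
  size_t length;
  std::string type;

  named_entity() : start(0), length(0) {}
  named_entity(size_t start, size_t length, const std::string& type)
      : start(start), length(length), type(type) {}
};

// Joins the forms covered by `entity` with single spaces. The range is
// interpreted in form indices; indices past the end of `forms` are ignored.
std::string entity_text(const std::vector<std::string>& forms,
                        const named_entity& entity) {
  std::string result;
  for (size_t i = entity.start;
       i < entity.start + entity.length && i < forms.size(); ++i) {
    if (!result.empty()) result += ' ';
    result += forms[i];
  }
  return result;
}
```

For the forms Vaclav, Havel, visited, Prague, an entity with start 0 and length 2 maps to "Vaclav Havel" and an entity with start 3 and length 1 maps to "Prague". The entity type labels depend on the loaded model and are not interpreted here.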
7.4. ner::entity_types
virtual void entity_types(std::vector<std::string>& types) const = 0;
Return the entity types recognizable by the recognizer.
7.5. ner::gazetteers
virtual void gazetteers(std::vector<std::string>& gazetteers, std::vector<int>* gazetteer_types) const = 0;
Return a list of gazetteers stored in the recognizer, optionally together with the corresponding named entity types. Currently only gazetteers from the GazetteersEnhanced feature template are returned.
7.6. ner::new_tokenizer
virtual tokenizer* new_tokenizer() const = 0;
Returns a new instance of a suitable tokenizer or NULL if no such tokenizer exists. The user should delete it after use.
8. C++ Bindings API
Bindings for languages other than C++ are created using SWIG from the C++ bindings API, which is a slightly modified version of the native C++ API. The main changes are the replacement of the string_piece type by native strings and the removal of methods using istream. Here is the C++ bindings API declaration:
8.1. Helper Structures
typedef vector<string> Forms;

struct TokenRange {
  size_t start;
  size_t length;
};
typedef vector<TokenRange> TokenRanges;

struct NamedEntity {
  size_t start;
  size_t length;
  string type;

  NamedEntity();
  NamedEntity(size_t start, size_t length, const string& type);
};
typedef vector<NamedEntity> NamedEntities;
8.2. Main Classes
class Version {
 public:
  unsigned major;
  unsigned minor;
  unsigned patch;
  string prerelease;

  static Version current();
};

class Tokenizer {
 public:
  virtual void setText(const char* text);
  virtual bool nextSentence(Forms* forms, TokenRanges* tokens);

  static Tokenizer* newVerticalTokenizer();
};

class Ner {
 public:
  static Ner* load(const char* fname);

  virtual void recognize(Forms& forms, NamedEntities& entities) const;
  virtual void entityTypes(Forms& types) const;
  virtual void gazetteers(Forms& gazetteers, Ints& gazetteer_types) const;

  virtual Tokenizer* newTokenizer() const;
};
9. C# Bindings
NameTag library bindings are available in the Ufal.NameTag namespace.
The bindings are a straightforward conversion of the C++ bindings API. The bindings require the native C++ library libnametag_csharp (called nametag_csharp on Windows).
See also C# binding example usage.
10. Java Bindings
NameTag library bindings are available in the cz.cuni.mff.ufal.nametag package.
The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Java interface; see the cz.cuni.mff.ufal.nametag.Forms class for reference. Also, class members are accessible and modifiable using getField and setField wrappers.
The bindings require the native C++ library libnametag_java (called nametag_java on Windows). If the library is found in the current directory, it is used; otherwise the standard library search process is used. The path to the C++ library can also be specified using the static nametag_java.setLibraryPath(String path) call (before the first call inside the C++ library, of course).
See also Java binding example usage.
11. Perl Bindings
NameTag library bindings are available in the Ufal::NameTag package. The classes can be imported into the current namespace using the :all export tag.
The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Perl interface; see Ufal::NameTag::Forms for reference. Static methods and enumerations are available only through the module, not through an object instance.
See also Perl binding example usage.
12. Python Bindings
NameTag library bindings are available in the ufal.nametag module.
The bindings are a straightforward conversion of the C++ bindings API. In Python 2, strings can be both unicode and UTF-8 encoded str, and the library always produces unicode. In Python 3, strings must be only str.
See also Python binding example usage.