NameTag 2 Formats

This page describes the NameTag 2 REST Web Service input and output formats.

1. Input Formats

The input format is specified using the input parameter. Currently supported input formats are:

  • untokenized (default): the input will be tokenized and segmented using a tokenizer defined by the model.
  • vertical: the input is in vertical format: every line is considered a word, and an empty line denotes the end of a sentence.
  • conllu-ne: the input is in the CoNLL-U format.
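
As an illustration of the vertical input format described above, the following minimal sketch serializes already tokenized sentences into one word per line, with an empty line closing each sentence. The function name to_vertical is hypothetical and not part of NameTag; the sketch also assumes that no word contains a newline.

```python
def to_vertical(sentences):
    """Serialize tokenized sentences into vertical format (hypothetical helper).

    `sentences` is a list of sentences, each a list of word strings.
    """
    parts = []
    for sentence in sentences:
        for word in sentence:
            parts.append(word + "\n")  # every line is one word
        parts.append("\n")             # an empty line ends the sentence
    return "".join(parts)


if __name__ == "__main__":
    print(to_vertical([["Václav", "Havel", "byl", "politik", "."]]), end="")
```

The resulting string would be sent as the data to process, with the input parameter set to vertical.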

2. Output Formats

The output format is specified using the output parameter. Currently supported output formats are:

  • xml (default): Simple XML format without a root element, using the <sentence> element to mark sentences and the <token> element to mark tokens. The recognized named entities are encoded using the <ne type="..."> element.

    Example input:
    Václav Havel byl český dramatik, esejista, kritik komunistického režimu a později politik.
    

    NameTag identifies a first name (pf), a surname (ps) and a person name container (P) in the input (line breaks added):
      <sentence><ne type="P"><ne type="pf"><token>Václav</token></ne> <ne type="ps"><token>Havel</token></ne></ne>
      <token>byl</token> <token>český</token> <token>dramatik</token><token>,</token> <token>esejista</token><token>,</token>
      <token>kritik</token> <token>komunistického</token> <token>režimu</token> <token>a</token> <token>později</token>
      <token>politik</token><token>.</token></sentence>
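
Because the xml output has no root element, a standard XML parser will reject it as is. One possible approach, sketched below, is to wrap the output in a synthetic root before parsing. The function name parse_xml_output and the wrapper element name root are assumptions made for this example; the sketch also assumes that special characters in the output are already XML-escaped.

```python
import xml.etree.ElementTree as ET


def parse_xml_output(output):
    """Return (entity_type, entity_text) pairs from xml output (hypothetical helper)."""
    # The output has no root element, so add a synthetic one for the parser.
    root = ET.fromstring("<root>" + output + "</root>")
    entities = []
    # iter() walks in document order, so a container precedes its embedded entities.
    for ne in root.iter("ne"):
        words = [token.text or "" for token in ne.iter("token")]
        entities.append((ne.get("type"), " ".join(words)))
    return entities


if __name__ == "__main__":
    sample = (
        '<sentence><ne type="P"><ne type="pf"><token>Václav</token></ne> '
        '<ne type="ps"><token>Havel</token></ne></ne> '
        "<token>byl</token> <token>politik</token><token>.</token></sentence>"
    )
    for entity_type, text in parse_xml_output(sample):
        print(entity_type, text, sep="\t")
```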
    

  • vertical: Every recognized named entity is on a separate line. Each line contains three tab-separated fields: entity_range, entity_type and entity_text. The entity_range lists the token identifiers of the tokens forming the named entity (counting from 1 and including end-of-sentence; if the input is also vertical, token identifiers correspond exactly to line numbers), and entity_type is the type of the entity. The entity_text is not strictly necessary and contains the space-separated words of the named entity.

    Example input:
    Václav Havel byl český dramatik, esejista, kritik komunistického režimu a později politik.
    

    Example output:
    1,2	P	Václav Havel
    1	pf	Václav
    2	ps	Havel
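
The three tab-separated fields can be read back with a few lines of code. The sketch below assumes, based on the example output above, that entity_range is a comma-separated list of token identifiers; the function name parse_vertical_output is hypothetical.

```python
def parse_vertical_output(output):
    """Return (token_ids, entity_type, entity_text) triples (hypothetical helper)."""
    entities = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split into at most three fields; entity_text itself contains spaces, not tabs.
        entity_range, entity_type, entity_text = line.split("\t", 2)
        # Assumption: the range is a comma-separated list of 1-based token identifiers.
        token_ids = [int(token_id) for token_id in entity_range.split(",")]
        entities.append((token_ids, entity_type, entity_text))
    return entities


if __name__ == "__main__":
    sample = "1,2\tP\tVáclav Havel\n1\tpf\tVáclav\n2\tps\tHavel\n"
    for token_ids, entity_type, text in parse_vertical_output(sample):
        print(token_ids, entity_type, text)
```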
    

  • conll: A CoNLL-like vertical format. Every word is on its own line, followed by a tab and the recognized entity label. An empty line denotes the end of a sentence. The entity labels are:
    • O: no entity
    • B-type: the word is the first in the entity of type type
    • I-type: the word is a non-initial word in the entity of type type
    If there are embedded entities, only the outermost entity is saved in the file; the embedded ones are ignored.

    Example input:
    Václav Havel byl český dramatik, esejista, kritik komunistického režimu a později politik.
    

    Example output:
    Václav	B-P
    Havel	I-P
    byl	O
    český	O
    ...
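
The B-type and I-type labels can be merged back into whole entities. The following is a minimal sketch of such a decoder, not NameTag's own code; the function name conll_entities is hypothetical, and an I- label that does not continue an open entity of the same type is simply ignored.

```python
def conll_entities(output):
    """Return (entity_type, entity_text) pairs from conll output (hypothetical helper)."""
    entities = []
    current = None  # [entity_type, [words]] of the entity being collected
    # The trailing empty string acts as a sentinel that closes the last entity.
    for line in output.splitlines() + [""]:
        word, _, label = line.partition("\t")
        if label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(word)  # non-initial word of the open entity
            continue
        if current:  # any other line (O, B-type, empty) closes the open entity
            entities.append((current[0], " ".join(current[1])))
            current = None
        if label.startswith("B-"):
            current = [label[2:], [word]]  # first word of a new entity
    return entities


if __name__ == "__main__":
    sample = "Václav\tB-P\nHavel\tI-P\nbyl\tO\nčeský\tO\n"
    print(conll_entities(sample))
```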
    

  • conllu-ne: the output is in the CoNLL-U format, with named entities in the MISC column. All labels corresponding to the token form one item in the MISC column, delimited from the other annotations by vertical bars |. The item key is NE=. If there are multiple labels, they are delimited by a hyphen -. Every named entity mention receives a unique numeric identifier, appended to the label with an underscore _.

    Example input:
    1	Jmenuji	jmenovat	VERB	VB-S---1P-AA--1	Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	_	TokenRange=0:7
    2	se	se	PRON	P7-X4----------	Case=Acc|PronType=Prs|Reflex=Yes|Variant=Short	1	expl:pv	_	TokenRange=8:10
    3	Jan	Jan	PROPN	NNMS1-----A----	Animacy=Anim|Case=Nom|Gender=Masc|NameType=Giv|Number=Sing|Polarity=Pos	1	nsubj	_	TokenRange=11:14
    4	Novák	Novák	PROPN	NNMS1-----A----	Animacy=Anim|Case=Nom|Gender=Masc|NameType=Sur|Number=Sing|Polarity=Pos	3	flat	_	SpaceAfter=No|TokenRange=15:20
    5	.	.	PUNCT	Z:-------------	_	1	punct	_	SpacesAfter=\n|TokenRange=20:21
    
    

    Example output:
    1	Jmenuji	jmenovat	VERB	VB-S---1P-AA--1	Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	_	TokenRange=0:7
    2	se	se	PRON	P7-X4----------	Case=Acc|PronType=Prs|Reflex=Yes|Variant=Short	1	expl:pv	_	TokenRange=8:10
    3	Jan	Jan	PROPN	NNMS1-----A----	Animacy=Anim|Case=Nom|Gender=Masc|NameType=Giv|Number=Sing|Polarity=Pos	1	nsubj	_	TokenRange=11:14|NE=P_1-pf_2
    4	Novák	Novák	PROPN	NNMS1-----A----	Animacy=Anim|Case=Nom|Gender=Masc|NameType=Sur|Number=Sing|Polarity=Pos	3	flat	_	SpaceAfter=No|TokenRange=15:20|NE=P_1-ps_3
    5	.	.	PUNCT	Z:-------------	_	1	punct	_	SpacesAfter=\n|TokenRange=20:21
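
To recover whole mentions from the NE= items, tokens sharing the same mention identifier can be grouped together. The sketch below is one possible reading of the format described above, under two assumptions that the text does not state explicitly: entity type names contain no hyphen, and mention identifiers are unique within the parsed text. The function name conllu_ne_mentions is hypothetical.

```python
def conllu_ne_mentions(conllu):
    """Return (label, mention_text) pairs from conllu-ne output (hypothetical helper)."""
    mentions = {}  # mention identifier -> (label, [word forms])
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed CoNLL-U token line
        form, misc = columns[1], columns[9]
        for item in misc.split("|"):
            if not item.startswith("NE="):
                continue
            # Assumption: labels contain no hyphen, so "-" separates them safely.
            for labelled in item[len("NE="):].split("-"):
                label, _, mention_id = labelled.rpartition("_")
                mentions.setdefault(int(mention_id), (label, []))[1].append(form)
    return [(label, " ".join(forms)) for _, (label, forms) in sorted(mentions.items())]


if __name__ == "__main__":
    rows = [
        ["3", "Jan", "Jan", "PROPN", "_", "_", "1", "nsubj", "_",
         "TokenRange=11:14|NE=P_1-pf_2"],
        ["4", "Novák", "Novák", "PROPN", "_", "_", "3", "flat", "_",
         "SpaceAfter=No|TokenRange=15:20|NE=P_1-ps_3"],
    ]
    sample = "\n".join("\t".join(row) for row in rows) + "\n"
    print(conllu_ne_mentions(sample))
```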