NameTag 2 Models

Named entity recognition (NER) is the task of identifying and classifying proper names in natural language text, such as names of persons, organizations and locations; in the nested variant of the task, entities may be embedded in one another. NameTag 2 identifies and classifies nested named entities using the algorithm described in Straková et al. (2019).

Like any supervised machine learning tool, NameTag needs a trained linguistic model. This section describes the available language models.

All models are available under the CC BY-NC-SA licence and can be downloaded from the LINDAT repository. The latest version is 210916.

The models work in NameTag version 2.

All models use UDPipe for tokenization.

The models are versioned according to their release date in the YYMMDD format, where YY, MM and DD are two-digit representations of the year, month and day, respectively.
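
The models can also be queried through the public LINDAT web service instead of running NameTag locally. The sketch below builds a request for the service's recognize endpoint; the endpoint URL and the data/model/output parameter names are assumptions based on the public LINDAT service and should be checked against the official API documentation:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the public LINDAT NameTag web service.
API = "https://lindat.mff.cuni.cz/services/nametag/api/recognize"

def build_query(text, model="english-conll-200831", output="xml"):
    """Build the request URL for a recognize call (parameter names assumed)."""
    return API + "?" + urlencode({"data": text, "model": model, "output": output})

def recognize(text, **kwargs):
    """Send the request and return the raw response body."""
    with urlopen(build_query(text, **kwargs)) as response:
        return response.read().decode("utf-8")
```

The model parameter selects one of the models described below (e.g. czech-cnec2.0-200831), and the output format can typically be switched between XML-style markup and a vertical format.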

1. Czech CNEC2.0 Model

The Czech model is trained on the training part of the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007).

The corpus uses 46 atomic named entity types, which can be embedded, e.g., a river name can be part of a city name, as in <gu Ústí nad <gh Labem>>. There are also 4 so-called NE containers, each grouping two or more NEs (e.g., a first name and a surname together form a person-name NE container, as in <P <pf Jan><ps Novák>>). The 4 NE containers are marked with a capital one-letter tag: P for (complex) person names, T for temporal expressions, A for addresses, and C for bibliographic items.
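
The bracketed annotation above can be read with a short recursive routine. A simplified sketch (not the official CNEC tooling):

```python
def parse_entity(s, i=0):
    """Parse CNEC-style bracketing such as '<gu Ústí nad <gh Labem>>'
    into a (type, parts) tuple, where parts mixes plain-text pieces and
    nested (type, parts) tuples. Returns the tuple and the index just
    past the closing '>'."""
    assert s[i] == "<"
    i += 1
    j = s.index(" ", i)              # the entity type ends at the first space
    etype, i = s[i:j], j + 1
    parts, text = [], ""
    while s[i] != ">":
        if s[i] == "<":              # a nested entity starts here
            if text.strip():
                parts.append(text.strip())
            text = ""
            child, i = parse_entity(s, i)
            parts.append(child)
        else:
            text += s[i]
            i += 1
    if text.strip():
        parts.append(text.strip())
    return (etype, parts), i + 1

print(parse_entity("<gu Ústí nad <gh Labem>>")[0])
# -> ('gu', ['Ústí nad', ('gh', ['Labem'])])
```

Applied to the container example, parse_entity("<P <pf Jan><ps Novák>>") yields a P container with two nested atomic entities, pf and ps.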

The latest version is 200831, distributed by LINDAT.

The model czech-cnec2.0-200831 reaches 83.44 F1-measure for fine-grained, two-character types and 87.04 for coarse, one-character supertypes on the CNEC2.0 test data.
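
For illustration only (the reported numbers come from the standard evaluation tooling), span-level F1 over (start, end, type) triples and the fine-to-coarse mapping, which simply keeps the first character of the type, can be sketched as:

```python
def supertype(fine):
    """Map a fine-grained CNEC type to its one-character supertype,
    e.g. 'gu' -> 'g', 'pf' -> 'p'."""
    return fine[0]

def f1(gold, predicted):
    """Span-level F1 over sets of (start, end, type) triples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)          # exact matches only
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def coarse(spans):
    """Replace each fine-grained type by its supertype."""
    return {(s, e, supertype(t)) for s, e, t in spans}

gold = {(0, 2, "gu"), (1, 2, "gh")}
pred = {(0, 2, "gu"), (1, 2, "gq")}     # one fine-grained type is wrong
print(f1(gold, pred))                   # -> 0.5 (fine-grained)
print(f1(coarse(gold), coarse(pred)))   # -> 1.0 (supertypes agree)
```

This also shows why the coarse score is never lower than the fine-grained one: merging types can only turn mismatches into matches.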

1.1. Acknowledgements

The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16 013/0001781), by the LINDAT/CLARIAH-CZ project of the same ministry (project LM2018101), by the Mellon Foundation grant No. G-1901-06505, and by the PROGRES Q18 and PROGRES Q48 projects of Charles University.

The Czech CNEC 2.0 model is trained on the Czech Named Entity Corpus 2.0, which was created by Magda Ševčíková, Zdeněk Žabokrtský, Jana Straková and Milan Straka.

The research was carried out by Jana Straková and Milan Straka.

1.1.1. Publications

Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.

Straka Milan, Straková Jana, Hajič Jan: Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In: Lecture Notes in Computer Science, Vol. 11697, Proceedings of the 22nd International Conference on Text, Speech and Dialogue - TSD 2019, Copyright © Springer International Publishing, Cham / Heidelberg / New York / Dordrecht / London, ISBN 978-3-030-27946-2, ISSN 0302-9743, pp. 137-150, 2019.

Straková Jana, Straka Milan, Hajič Jan, Popel Martin: Hluboké učení v automatické analýze českého textu [Deep Learning in Automatic Analysis of Czech Text]. In: Slovo a slovesnost, Vol. 80, No. 4, Copyright © Ústav pro jazyk český AV ČR, Prague, Czech Republic, ISSN 0037-7031, pp. 306-327, Dec 2019.

2. English CoNLL Model

The English model is trained on the training part of the CoNLL-2003 NER annotations (Sang and De Meulder, 2003) of a portion of the Reuters Corpus. The corpus uses four classes: PER, ORG, LOC and MISC.
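
CoNLL data are distributed with IOB-style tags, which are commonly converted to the BIO scheme before training; decoding such a tag sequence back into (start, end, type) spans can be sketched as follows (a minimal illustration, not NameTag's internal code):

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence (e.g. 'B-PER', 'I-PER', 'O') into a list
    of (start, end, type) spans with an exclusive end index."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last span
        # Close the open span on 'O', on a fresh 'B-', or on a type change.
        if start is not None and (tag == "O" or tag.startswith("B-")
                                  or tag[2:] != tags[start][2:]):
            spans.append((start, i, tags[start][2:]))
            start = None
        # Open a new span on 'B-', or on a stray 'I-' with no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start = i
    return spans

print(bio_to_spans(["B-PER", "I-PER", "O", "B-LOC"]))
# -> [(0, 2, 'PER'), (3, 4, 'LOC')]
```

The same decoding applies to the other flat CoNLL models below (German, Dutch, Spanish), which use the same four-class scheme.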

The latest version is 200831, distributed by LINDAT.

The model english-conll-200831 reaches 91.68 F1-measure on the CoNLL-2003 test data.

2.1. Acknowledgements

The acknowledgements, contributors and the related publication (Straková et al., 2019) are the same as for the Czech model; see Section 1.1.

3. German CoNLL Model

The German model is trained on the training part of the CoNLL-2003 NER annotations (Sang and De Meulder, 2003) of a portion of the Reuters Corpus. The corpus uses four classes: PER, ORG, LOC and MISC.

The latest version is 200831, distributed by LINDAT.

The model german-conll-200831 reaches 82.65 F1-measure on the CoNLL-2003 test data.

3.1. Acknowledgements

The acknowledgements, contributors and the related publication (Straková et al., 2019) are the same as for the Czech model; see Section 1.1.

4. German GermEval Model

The German model is trained on the training part of the GermEval 2014 NER Shared Task data (Benikova et al., 2014). The corpus annotation uses nested entities (i.e., an entity can be embedded in another entity), with nesting limited to at most two levels (an outer and an inner entity). The annotation also accounts for derivatives and for tokens that contain a named entity only partially, so it uses the following labels: PER, ORG, LOC, OTH and O for not an entity; further PERderiv, ORGderiv, LOCderiv and OTHderiv for derivatives, and PERpart, ORGpart, LOCpart and OTHpart for partial entities (e.g., Troia-Ausstellung, in which only Troia is the named entity; example from Benikova et al., 2014).
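
A small helper to split such composite labels into their base class and modifier (a sketch for illustration; not part of NameTag):

```python
def split_label(label):
    """Split a GermEval label into (base class, modifier),
    e.g. 'PERderiv' -> ('PER', 'deriv'), 'LOC' -> ('LOC', None)."""
    for suffix in ("deriv", "part"):
        if label.endswith(suffix):
            return label[: -len(suffix)], suffix
    return label, None

print(split_label("ORGpart"))   # -> ('ORG', 'part')
print(split_label("OTH"))       # -> ('OTH', None)
```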

The latest version is 210916, distributed by LINDAT.

The model german-germeval-210916 reaches 84.40 F1-measure on the GermEval 2014 test data, measured with the official shared task evaluation script "nereval.perl".

4.1. German GermEval State of the Art

4.1.1. Systems trained on GermEval 2014 training data

F1 (strict, official)  System
84.40                  NameTag 2 german-germeval-210916 model
79.10                  Modular Classifier (Hänig 2014)
78.42                  Semi-Supervised Features (Agerri 2017)
76.37                  (Riedl and Padó, 2020)
76.12                  Hybrid Neural Networks (Shao 2016)

4.1.2. Systems trained on additional data

F1 (strict, official)  System
84.73                  (Riedl and Padó, 2020), transfer from CoNLL data

4.2. Acknowledgements

The acknowledgements and contributors are the same as for the Czech model; see Section 1.1.

4.2.1. Publications

The methodology is from the following publication; the result itself was measured later (in 2021) and is as yet unpublished:

Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.

5. Dutch CoNLL Model

The Dutch model is trained on the training part of the CoNLL-2002 NER annotations (Tjong Kim Sang, 2002). The corpus uses four classes: PER, ORG, LOC and MISC.

The latest version is 200831, distributed by LINDAT.

The model dutch-conll-200831 reaches 91.17 F1-measure on the CoNLL-2002 test data.

5.1. Acknowledgements

The acknowledgements, contributors and the related publication (Straková et al., 2019) are the same as for the Czech model; see Section 1.1.

6. Spanish CoNLL Model

The Spanish model is trained on the training part of the CoNLL-2002 NER annotations (Tjong Kim Sang, 2002). The corpus uses four classes: PER, ORG, LOC and MISC.

The latest version is 200831, distributed by LINDAT.

The model spanish-conll-200831 reaches 88.55 F1-measure on the CoNLL-2002 test data.

6.1. Acknowledgements

The acknowledgements, contributors and the related publication (Straková et al., 2019) are the same as for the Czech model; see Section 1.1.