NameTag 2 Models

In natural language text, the task of nested named entity recognition (NER) is to identify and classify proper names, such as names of persons, organizations and locations, including entities embedded inside other entities. NameTag 2 recognizes nested named entities with the algorithm described in Straková et al. (2019).
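As a toy illustration of what "nested" means (the sentence and labels below are invented for illustration, not NameTag output), one entity span may lie entirely inside another:

    # Toy illustration of nested entities: spans as (start, end, label)
    # over tokens, with the end index exclusive. Invented example.
    tokens = ["Charles", "University", "in", "Prague"]

    entities = [
        (0, 4, "ORG"),  # the whole institution name
        (3, 4, "LOC"),  # "Prague", a location nested inside it
    ]

    def contains(outer, inner):
        """True if the inner span lies entirely inside the outer one."""
        return outer[0] <= inner[0] and inner[1] <= outer[1]

    print(contains(entities[0], entities[1]))  # True: the spans nest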

Like any supervised machine learning tool, NameTag needs a trained linguistic model. This section describes the available language models.

All models are available under the CC BY-NC-SA licence.

The models work in NameTag version 2.0 or later.

All models use UDPipe for tokenization.
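The models can also be tried without a local installation through the LINDAT NameTag web service. The sketch below assumes the publicly documented REST endpoint and its data/model parameters; verify both against the current service documentation:

    # Minimal sketch of querying the LINDAT NameTag web service. The
    # endpoint URL, parameter names and JSON response shape are taken
    # from the public service documentation; verify against the current
    # API before relying on them.
    import requests

    URL = "https://lindat.mff.cuni.cz/services/nametag/api/recognize"

    response = requests.get(URL, params={
        "data": "Jana Straková works at Charles University.",
        "model": "english-conll-200831",  # any model name listed below
    })
    response.raise_for_status()
    print(response.json()["result"])  # recognized entities in the output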

The models are versioned according to their release date. The version format is YYMMDD, where YY, MM and DD are two-digit representations of the year, month and day, respectively.
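For instance, the version suffix can be parsed as a date when choosing the newest of several downloaded models (a minimal sketch; the model name is one listed below):

    # Parse the YYMMDD version suffix of a model name as a date.
    from datetime import datetime

    def model_release_date(model_name: str) -> datetime:
        """Extract the YYMMDD suffix and parse it as a release date."""
        version = model_name.rsplit("-", 1)[-1]
        return datetime.strptime(version, "%y%m%d")

    print(model_release_date("czech-cnec2.0-200831"))  # 2020-08-31 00:00:00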

1. Czech CNEC2.0 Model

The Czech model is trained on the training part of the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007). The corpus distinguishes 46 named entity types, which may be nested.
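For a flavour of the annotation, the sketch below shows a small, illustrative subset of the type inventory and a schematic nested annotation; the type glosses are abridged, and the corpus documentation remains the authoritative reference for all 46 types:

    # Illustrative subset of the 46 CNEC 2.0 entity types; the glosses
    # are abridged -- see the corpus documentation for the full list.
    cnec_types = {
        "P":  "complex person name (container type)",
        "pf": "first name",
        "ps": "surname",
        "gu": "city/town",
    }

    # Schematic CNEC-style bracketing: the container entity P wraps the
    # embedded pf and ps entities, i.e. the annotation is nested.
    annotated = "<P<pf Václav> <ps Havel>>"
    print(annotated)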

The latest version is 200831.

The model czech-cnec2.0-200831 reaches an F1-measure of 83.43 on the CNEC 2.0 test data.
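The F1-measure here is the standard entity-level F1: a predicted entity counts as correct only if both its span and its type exactly match the gold annotation. A minimal sketch of the computation on invented data:

    # Entity-level precision, recall and F1: a predicted entity is
    # correct only if its span and type exactly match a gold entity.
    def f1_score(gold: set, predicted: set) -> float:
        correct = len(gold & predicted)
        if not gold or not predicted or not correct:
            return 0.0
        precision = correct / len(predicted)
        recall = correct / len(gold)
        return 2 * precision * recall / (precision + recall)

    # Invented toy example: entities as (start, end, type) triples.
    gold = {(0, 2, "PER"), (5, 6, "LOC")}
    pred = {(0, 2, "PER"), (5, 6, "ORG")}
    print(f1_score(gold, pred))  # 0.5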

1.1. Acknowledgements

The work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

The Czech CNEC 2.0 model is trained on the Czech Named Entity Corpus 2.0, which was created by Magda Ševčíková, Zdeněk Žabokrtský, Jana Straková and Milan Straka.

The research was carried out by Jana Straková and Milan Straka.

The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16_013/0001781) and has used language resources developed by the LINDAT/CLARIN project of the same ministry (project LM2015071). It has further been supported by the PROGRES Q18 and PROGRES Q48 programmes of Charles University.

1.1.1. Publications

Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5326-5331, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, 2019.

2. English CoNLL Model

The English model is trained on the training part of the English CoNLL-2003 NER annotations (Tjong Kim Sang and De Meulder, 2003), which cover a part of the Reuters Corpus. The corpus uses four classes: PER, ORG, LOC and MISC.
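Unlike CNEC, the CoNLL annotations are flat (non-nested) and are conventionally encoded as per-token BIO tags; the sketch below decodes such tags into entity spans (the sentence is the classic example from the CoNLL-2003 paper):

    # Flat CoNLL-style entities as per-token BIO tags: B-X opens an
    # entity of type X, I-X continues it, O marks tokens outside any
    # entity.
    tokens = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."]
    tags   = ["B-ORG", "O", "B-PER", "O", "O", "B-LOC", "O"]

    def bio_to_spans(tags):
        """Decode BIO tags into (start, end, type) spans, end exclusive."""
        spans, start, etype = [], None, None
        for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
            if start is not None and tag != "I-" + etype:
                spans.append((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        return spans

    print(bio_to_spans(tags))  # [(0, 1, 'ORG'), (2, 3, 'PER'), (5, 6, 'LOC')]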

The latest version is 200831.

The model english-conll-200831 reaches an F1-measure of 91.68 on the CoNLL-2003 English test data.

2.1. Acknowledgements

The work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

The research was carried out by Jana Straková and Milan Straka.

The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16_013/0001781) and has used language resources developed by the LINDAT/CLARIN project of the same ministry (project LM2015071). It has further been supported by the PROGRES Q18 and PROGRES Q48 programmes of Charles University.

2.1.1. Publications

Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5326-5331, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, 2019.

3. German CoNLL Model

The German model is trained on the training part of the German CoNLL-2003 NER annotations (Tjong Kim Sang and De Meulder, 2003), which cover a part of the Frankfurter Rundschau newspaper. The corpus uses four classes: PER, ORG, LOC and MISC.

The latest version is 200831.

The model german-conll-200831 reaches an F1-measure of 82.65 on the CoNLL-2003 German test data.

3.1. Acknowledgements

The work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

The research was carried out by Jana Straková and Milan Straka.

The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16_013/0001781) and has used language resources developed by the LINDAT/CLARIN project of the same ministry (project LM2015071). It has further been supported by the PROGRES Q18 and PROGRES Q48 programmes of Charles University.

3.1.1. Publications

Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5326-5331, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, 2019.

4. Dutch CoNLL Model

The Dutch model is trained on the training part of the Dutch CoNLL-2002 NER annotations (Tjong Kim Sang, 2002). The corpus uses four classes: PER, ORG, LOC and MISC.

The latest version is 200831.

The model dutch-conll-200831 reaches an F1-measure of 91.17 on the CoNLL-2002 Dutch test data.

4.1. Acknowledgements

The work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

The research was carried out by Jana Straková and Milan Straka.

The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16_013/0001781) and has used language resources developed by the LINDAT/CLARIN project of the same ministry (project LM2015071). It has further been supported by the PROGRES Q18 and PROGRES Q48 programmes of Charles University.

4.1.1. Publications

Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5326-5331, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, 2019.

5. Spanish CoNLL Model

The Spanish model is trained on the training part of the Spanish CoNLL-2002 NER annotations (Tjong Kim Sang, 2002). The corpus uses four classes: PER, ORG, LOC and MISC.

The latest version is 200831.

The model spanish-conll-200831 reaches an F1-measure of 88.55 on the CoNLL-2002 Spanish test data.

5.1. Acknowledgements

The work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

The research was carried out by Jana Straková and Milan Straka.

The work described herein has been supported by the OP VVV VI LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project CZ.02.1.01/0.0/0.0/16_013/0001781) and has used language resources developed by the LINDAT/CLARIN project of the same ministry (project LM2015071). It has further been supported by the PROGRES Q18 and PROGRES Q48 programmes of Charles University.

5.1.1. Publications

Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5326-5331, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, 2019.