NameTag 3 Models
In natural language text, the task of (nested) named entity recognition (NER) is to identify proper names such as names of persons, organizations and locations.
As a supervised machine learning tool, NameTag needs a trained linguistic model. This section describes the trained models available for NameTag 3.
All models are available under the CC BY-NC-SA licence and can be downloaded from the LINDAT repository.
The models are versioned by release date in the format YYMMDD, where YY, MM and DD are two-digit
representations of the year, month and day, respectively.
The latest version is 240830 for the Czech CNEC 2.0 model and 250203
for the Multilingual model.
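The YYMMDD convention can be read back from a model name mechanically. The following is a minimal sketch, not part of NameTag 3: the helper name `parse_model_version` is hypothetical, and it assumes that the version is always the trailing six-digit suffix and that two-digit years refer to the 2000s.

```python
import re
from datetime import date


def parse_model_version(model_name: str) -> date:
    """Return the release date encoded in the trailing YYMMDD suffix."""
    match = re.search(r"-(\d{2})(\d{2})(\d{2})$", model_name)
    if match is None:
        raise ValueError(f"no YYMMDD version suffix in {model_name!r}")
    yy, mm, dd = (int(group) for group in match.groups())
    # Assumption: two-digit years refer to the 2000s.
    return date(2000 + yy, mm, dd)


print(parse_model_version("nametag3-czech-cnec2.0-240830").isoformat())  # 2024-08-30
```

Because the suffix has a fixed width, comparing the six-digit strings lexicographically gives the same order as comparing the parsed dates.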
Coming soon: A new multilingual model nametag3-multilingual-260521.
1. Model vs. Software Version Compatibility
| Model | Model versions | NameTag 3.0 | NameTag 3.1 |
|---|---|---|---|
| Czech CNEC 2.0 | nametag3-czech-cnec2.0-240830 | ✔ | ✔ |
| Multilingual | nametag3-multilingual-260521, nametag3-multilingual-250203 | ✘ | ✔ |
| Multilingual CoNLL | nametag3-multilingual-conll-240830 | ✔ | ✔ |
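For scripted deployments, the compatibility table can be transcribed into a small lookup. This is only an illustrative sketch: the dictionary and the helper `is_supported` are hypothetical names, not part of NameTag 3, and the data simply mirrors the table.

```python
# Compatibility data transcribed from the table; keys are model names,
# values are the NameTag software versions that can load them.
COMPATIBILITY = {
    "nametag3-czech-cnec2.0-240830": {"3.0", "3.1"},
    "nametag3-multilingual-260521": {"3.1"},  # announced as coming soon
    "nametag3-multilingual-250203": {"3.1"},
    "nametag3-multilingual-conll-240830": {"3.0", "3.1"},
}


def is_supported(model: str, software_version: str) -> bool:
    """Return True if the table lists the model as usable with the given version."""
    return software_version in COMPATIBILITY.get(model, set())


print(is_supported("nametag3-multilingual-250203", "3.0"))  # False
print(is_supported("nametag3-multilingual-250203", "3.1"))  # True
```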
2. Results at a Glance
3. Czech CNEC 2.0 Model
The Czech CNEC 2.0 model is trained on the training part of the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007).
The corpus uses 46 atomic named entity types, which can be embedded,
e.g., the river name Labe can be part of a name of a city as in <gu Ústí nad <gh Labem>>.
In parallel, the corpus is also annotated with 7 coarser, one-character supertypes, also potentially nested. Furthermore, there are also 4 so-called NE (named entity) containers: two or more NEs are
parts of a NE container (e.g., two NEs, a first name and a surname, form
together a person name NE container such as in <P <pf Jan><ps Novák>>).
The 4 NE containers are marked with a capital one-letter tag: P for
(complex) person names, T for temporal expressions, A for addresses,
and C for bibliographic items.
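To make the nesting concrete, the bracket notation used in the examples above can be unpacked with a small stack-based reader. This is a minimal sketch, not the NameTag 3 input or output format handler: the function name `parse_nested_entities` is hypothetical, and it assumes that every entity type is followed by whitespace and that `<` and `>` occur only as markup.

```python
def parse_nested_entities(annotated: str) -> list[tuple[str, str]]:
    """Return (type, text) pairs for all entities, innermost entities first."""
    entities = []
    stack = []  # open entities as (type, start offset in the plain text)
    plain = []  # characters of the text with the markup removed
    i = 0
    while i < len(annotated):
        ch = annotated[i]
        if ch == "<":
            # Read the entity type up to the first whitespace character.
            j = i + 1
            while j < len(annotated) and not annotated[j].isspace():
                j += 1
            stack.append((annotated[i + 1:j], len(plain)))
            i = j + 1  # skip the separator after the type
        elif ch == ">":
            if not stack:
                raise ValueError("unbalanced '>'")
            entity_type, start = stack.pop()
            entities.append((entity_type, "".join(plain[start:])))
            i += 1
        else:
            plain.append(ch)
            i += 1
    if stack:
        raise ValueError("unclosed entity")
    return entities


print(parse_nested_entities("<gu Ústí nad <gh Labem>>"))
# [('gh', 'Labem'), ('gu', 'Ústí nad Labem')]
```

An entity is emitted when its closing bracket is reached, so embedded entities such as gh appear before the entity that contains them.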
The latest version is nametag3-czech-cnec2.0-240830, distributed by LINDAT.
The model nametag3-czech-cnec2.0-240830 reaches an F1 score of 86.39 for the fine-grained,
two-character types and 89.29 for
the coarse, one-character supertypes on the
CNEC 2.0 test data.
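For readers unfamiliar with the metric, the sketch below shows how an entity-level F1 score is commonly computed from exact matches of span and type. It is an illustration only and not the official CNEC 2.0 evaluation script; the span representation `(start, end, type)` and the function name `entity_f1` are assumptions made for this example.

```python
def entity_f1(gold: set, predicted: set) -> float:
    """Harmonic mean of precision and recall over exactly matching entities."""
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical token spans (start, end, type); nested entities are separate items.
gold = {(0, 3, "gu"), (2, 3, "gh"), (5, 7, "P")}
predicted = {(0, 3, "gu"), (5, 7, "P"), (8, 9, "T")}
print(round(100 * entity_f1(gold, predicted), 2))  # 66.67
```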
3.1. Acknowledgements
This work has been supported by the Grant Agency of the Czech Republic under the EXPRO program as project “LUSyD” (project No. GX20-16819X). The work described herein has also been using data provided by the LINDAT/CLARIAH-CZ Research Infrastructure, supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062).
The Czech CNEC 2.0 model is trained on the Czech Named Entity Corpus 2.0, which was created by Magda Ševčíková, Zdeněk Žabokrtský, Jana Straková and Milan Straka.
The research was carried out by Jana Straková and Milan Straka.
All models use UDPipe for tokenization.
3.1.1. Publications
Jana Straková and Milan Straka. 2025. NameTag 3: A Tool and a Service for Multilingual/Multitagset NER. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 31–39, Vienna, Austria. Association for Computational Linguistics.
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.
Straka Milan, Straková Jana, Hajič Jan: Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In: Lecture Notes in Computer Science, Vol. 11697, Proceedings of the 22nd International Conference on Text, Speech and Dialogue - TSD 2019, Copyright © Springer International Publishing, Cham / Heidelberg / New York / Dordrecht / London, ISBN 978-3-030-27946-2, ISSN 0302-9743, pp. 137-150, 2019.
Straková Jana, Straka Milan, Hajič Jan, Popel Martin: Hluboké učení v automatické analýze českého textu. In: Slovo a slovesnost, Vol. 80, No. 4, Copyright © Ústav pro jazyk český AV ČR, Prague, Czech Republic, ISSN 0037-7031, pp. 306-327, Dec 2019.
4. Multilingual Model
NameTag 3 can be trained with multiple named entity tagsets. During inference, the trained model can be asked to recognize named entities using a specific tagset; if none is requested, a default tagset is used.
The latest version is nametag3-multilingual-250203, distributed by
LINDAT. This model was trained on 21 datasets
in 17 languages, and it can be used to recognize the following
tagsets:
- conll (default): The CoNLL-2003 shared task tagset: PER, ORG, LOC, and MISC. Used when calling nametag3.py prediction with --tagsets=conll or by requesting nametag3-multilingual-conll-250203 from the NameTag 3 webservice.
- uner: The Universal NER v1 tagset: PER, ORG, LOC. Used when calling nametag3.py with --tagsets=uner or by requesting nametag3-multilingual-uner-250203 from the NameTag 3 webservice.
- onto: The OntoNotes v5 tagset: PERSON, NORP, FAC, ORG, GPE, etc. Used when calling nametag3.py with --tagsets=onto or by requesting nametag3-multilingual-onto-250203 from the NameTag 3 webservice.
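The naming pattern in the list above can be captured in a short helper that turns a tagset choice into the webservice model name and the matching command-line option. This is a hedged sketch: the helper names are hypothetical, only the `--tagsets` option and the model names quoted above come from this page, and no other arguments of nametag3.py or of the webservice are assumed.

```python
from typing import Optional

# Tagsets named on this page; the onto label list is partial ("etc." in the text).
TAGSETS = {
    "conll": ["PER", "ORG", "LOC", "MISC"],
    "uner": ["PER", "ORG", "LOC"],
    "onto": ["PERSON", "NORP", "FAC", "ORG", "GPE"],
}
DEFAULT_TAGSET = "conll"


def webservice_model_name(tagset: Optional[str] = None, version: str = "250203") -> str:
    """Build the model name to request for a given tagset (default: conll)."""
    tagset = tagset or DEFAULT_TAGSET
    if tagset not in TAGSETS:
        raise ValueError(f"unknown tagset {tagset!r}, expected one of {sorted(TAGSETS)}")
    return f"nametag3-multilingual-{tagset}-{version}"


def cli_tagsets_option(tagset: Optional[str] = None) -> str:
    """Build the --tagsets option for nametag3.py."""
    tagset = tagset or DEFAULT_TAGSET
    if tagset not in TAGSETS:
        raise ValueError(f"unknown tagset {tagset!r}")
    return f"--tagsets={tagset}"


print(webservice_model_name("uner"))  # nametag3-multilingual-uner-250203
print(cli_tagsets_option())           # --tagsets=conll
```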
This model requires at least NameTag 3.1.
Coming soon: A new multilingual model nametag3-multilingual-260521.
| Corpus | Tagset | 260521 | 250203 | Note | |
|---|---|---|---|---|---|
| Arabic CoNLL-2012 OntoNotes v5 | onto | 74.42 | 74.20 | [1] | |
| Cebuano UNER GJA (cross-lingual transfer) | uner | 95.92 | 96.97 | [2] | |
| Chinese CoNLL-2012 OntoNotes v5 | onto | 81.47 | 81.63 | [1] | |
| Chinese UNER GSD | uner | 89.50 | 91.53 | [2] | |
| Chinese UNER GSDSIMP | uner | 89.83 | 90.99 | [3] | |
| Chinese UNER PUD (out-of-domain evaluation) | uner | 88.86 | 89.35 | [3] | |
| Croatian UNER SET | uner | 95.56 | 95.55 | [2] | |
| Czech CNEC 2.0 CoNLL (4 labels, flat) | conll | 85.21 | 86.24 | [4] | |
| Czech UNER2 PUD (out-of-domain evaluation) | uner | 84.46 | - | [3] | |
| Danish UNER DDT | uner | 89.32 | 89.75 | [2] | |
| Dutch CoNLL-2002 | conll | 94.03 | 94.93 | [5] | |
| English CoNLL-2003 | conll | 94.10 | 94.09 | [6] | |
| English CoNLL-2012 OntoNotes v5 | onto | 90.13 | 90.19 | [1] | |
| English UNER EWT | uner | 88.22 | 87.03 | [2] | |
| English UNER PUD (out-of-domain evaluation) | uner | 83.76 | - | [2] | |
| German CoNLL-2003 | uner | 87.72 | 87.48 | [6] | |
| German UNER PUD (out-of-domain evaluation) | uner | 83.68 | - | [2] | |
| Greek UNER2 GDT | uner | 100.00 | - | [3] | |
| Hebrew UNER2 HTB | uner | 83.43 | - | [3] | |
| Indonesian UNER2 PUD (cross-lingual transfer) | uner | 76.59 | - | [3] | |
| Japanese UNER2 PUD (cross-lingual transfer) | uner | 81.91 | - | [3] | |
| Korean UNER2 PUD (cross-lingual transfer) | uner | 73.21 | - | [3] | |
| Maghrebi UNER Arabizi | uner | 85.33 | 84.49 | [2] | |
| Norw. Bokmål UNER2 NDT | uner | 95.59 | 95.83 | [3] | |
| Norw. Nynorsk UNER2 NDT | uner | 95.04 | 94.51 | [3] | |
| Portuguese UNER Bosque | uner | 91.53 | 90.89 | [2] | |
| Portuguese UNER PUD (out-of-domain evaluation) | uner | 92.17 | 91.77 | [2] | |
| Romanian UNER2 LegalNERo (cross-lingual transfer) | uner | 68.43 | - | [3] | |
| Russian UNER PUD (cross-lingual transfer) | uner | 75.88 | 75.51 | [2] | |
| Serbian UNER SET | uner | 97.27 | 97.10 | [2] | |
| Slovak UNER SNK | uner | 88.36 | 88.46 | [2] | |
| Slovenian UNER2 SSJ | uner | 93.15 | - | [3] | |
| Spanish CoNLL-2002 | conll | 90.29 | 90.29 | [5] | |
| Swedish UNER2 Lines | uner | 91.15 | - | [3] | |
| Swedish UNER PUD (out-of-domain evaluation) | uner | 89.74 | 91.27 | [2] | |
| Swedish UNER Talbanken | uner | 92.03 | 91.79 | [2] | |
| Tagalog UNER TRG (cross-lingual transfer) | uner | 97.78 | 97.78 | [2] | |
| Tagalog UNER Ugnayan (cross-lingual transfer) | uner | 83.08 | 75.00 | [2] | |
| Ukrainian Lang-uk | conll | 92.18 | 92.88 | [7] | |
- [1] OntoNotes v5 with the CoNLL-2012 train/dev/test split
- [2] Universal NER 1.0
- [3] Universal NER 2.0
- [4] In order to train and serve the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007) jointly within a large multilingual model, the original annotation of the CNEC 2.0 has been harmonized to the standard 4-label tagset with PER, ORG, LOC, and MISC, resulting in an extensive simplification of the original annotation and flattening of the original nested entities. The script for the automated conversion to the 4-label CoNLL-2003 tagset can be found in the NameTag 3 GitHub repository. If you are interested in the original CNEC 2.0 model with the complete 46 labels and nested entities, see the Czech CNEC 2.0 model.
- [5] CoNLL-2002 NE annotations (Tjong Kim Sang, 2002) of part of Reuters Corpus
- [6] CoNLL-2003 NE annotations (Sang and De Meulder, 2003) of part of Reuters Corpus
- [7] The Ukrainian language is trained on the Ukrainian Lang-uk NER corpus based on the Lang-uk initiative. The corpus uses four classes PER, ORG, LOC, and MISC (please note that we harmonized the original PERS to the common PER). The corpus was split randomly into train/dev/test in ratio 8:1:1.
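The two preprocessing steps mentioned for the Ukrainian data, renaming PERS to PER and a random 8:1:1 split, can be sketched as follows. This is an illustration under stated assumptions rather than the script actually used: the function names, the optional BIO prefix on labels, the fixed seed and the rounding of split sizes are all choices made for this example.

```python
import random


def harmonize_label(label: str) -> str:
    """Map the original PERS class to the common PER, keeping any BIO prefix."""
    prefix, separator, entity_type = label.rpartition("-")
    if entity_type == "PERS":
        return f"{prefix}{separator}PER"
    return label


def split_8_1_1(items, seed: int = 42):
    """Shuffle the items and split them into train/dev/test in ratio 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # the seed is an arbitrary assumption
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


print([harmonize_label(label) for label in ["B-PERS", "I-PERS", "B-ORG", "O"]])
# ['B-PER', 'I-PER', 'B-ORG', 'O']
train, dev, test = split_8_1_1(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```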
4.1. Acknowledgements
This work has been supported by the MŠMT OP JAK program, project No. CZ.02.01.01/00/22_008/0004605 and by the Grant Agency of the Czech Republic under the EXPRO program as project “LUSyD” (project No. GX20-16819X). The work described herein has also been using data provided by the LINDAT/CLARIAH-CZ Research Infrastructure, supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062).
The research was carried out by Jana Straková and Milan Straka.
All models use UDPipe for tokenization.
4.1.1. Publications
Jana Straková and Milan Straka. 2025. NameTag 3: A Tool and a Service for Multilingual/Multitagset NER. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 31–39, Vienna, Austria. Association for Computational Linguistics.
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.
5. Multilingual CoNLL Model
The multilingual model uses four classes: PER, ORG, LOC and MISC.
The latest version is nametag3-multilingual-conll-240830, distributed by LINDAT.
| Corpus | 240830 | Note | |
|---|---|---|---|
| Czech CNEC 2.0 CoNLL (4 labels, flat) | 86.35 | [1] | |
| Dutch CoNLL-2002 | 94.42 | [2] | |
| English CoNLL-2003 | 93.85 | [3] | |
| German CoNLL-2003 | 87.07 | [3] | |
| Spanish CoNLL-2002 | 89.90 | [3] | |
| Ukrainian Lang-uk | 91.73 | [4] | |
- [1] In order to train and serve the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007) jointly within a large multilingual model, the original annotation of the CNEC 2.0 has been harmonized to the standard 4-label tagset with PER, ORG, LOC, and MISC, resulting in an extensive simplification of the original annotation and flattening of the original nested entities. The script for the automated conversion to the 4-label CoNLL-2003 tagset can be found in the NameTag 3 GitHub repository. If you are interested in the original CNEC 2.0 model with the complete 46 labels and nested entities, see the Czech CNEC 2.0 model.
- [2] CoNLL-2003 NE annotations (Sang and De Meulder, 2003) of part of Reuters Corpus
- [3] CoNLL-2002 NE annotations (Tjong Kim Sang, 2002) of part of Reuters Corpus
- [4] The Ukrainian language is trained on the Ukrainian Lang-uk NER corpus based on the Lang-uk initiative. The corpus uses four classes PER, ORG, LOC, and MISC (please note that we harmonized the original PERS to the common PER). The corpus was split randomly into train/dev/test in ratio 8:1:1.
5.1. Acknowledgements
This work has been supported by the Grant Agency of the Czech Republic under the EXPRO program as project “LUSyD” (project No. GX20-16819X). The work described herein has also been using data provided by the LINDAT/CLARIAH-CZ Research Infrastructure, supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062).
The research was carried out by Jana Straková and Milan Straka.
All models use UDPipe for tokenization.
5.1.1. Publications
Jana Straková and Milan Straka. 2025. NameTag 3: A Tool and a Service for Multilingual/Multitagset NER. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 31–39, Vienna, Austria. Association for Computational Linguistics.
Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.


