NameTag 3 Models

1. Model vs. Software Version Compatibility
2. Results at a Glance
3. Czech CNEC 2.0 Model
- 3.1. Acknowledgements
  - 3.1.1. Publications
4. Multilingual Model
- 4.1. Acknowledgements
  - 4.1.1. Publications
5. Multilingual CoNLL Model
- 5.1. Acknowledgements
  - 5.1.1. Publications

In natural language text, the task of (nested) named entity recognition (NER) is to identify proper names such as names of persons, organizations and locations.

As a supervised machine learning tool, NameTag needs a trained linguistic model. This section describes the trained models available for NameTag 3.

All models are available under the CC BY-NC-SA licence and can be downloaded from the LINDAT repository.

The models are versioned by release date. The version format is YYMMDD, where YY, MM and DD are the two-digit representations of the year, month and day, respectively.
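As a minimal sketch of the YYMMDD convention, the following Python snippet reads the release date off a model name. The helper name `model_release_date` is hypothetical, and the sketch assumes the version is always the last hyphen-separated part of the name, as in the model names listed in this section.

```python
from datetime import date, datetime


def model_release_date(model_name: str) -> date:
    """Parse the trailing YYMMDD version of a model name into a date.

    Assumption: the version is the last hyphen-separated component.
    """
    version = model_name.rsplit("-", 1)[-1]
    return datetime.strptime(version, "%y%m%d").date()


print(model_release_date("nametag3-czech-cnec2.0-240830"))  # 2024-08-30
```

Because the version encodes a date, comparing the parsed dates is one way to order the versions of a model from oldest to newest.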

The latest version is 240830 for the Czech CNEC 2.0 model, and 260521 for the Multilingual model.

1. Model vs. Software Version Compatibility

Model	Model versions	NameTag 3.0	NameTag 3.1	NameTag 3.2
Czech CNEC 2.0	`nametag3-czech-cnec2.0-240830`	✔	✔	✔
Multilingual	`nametag3-multilingual-260521`, `nametag3-multilingual-250203`	✘	✔	✔
Multilingual CoNLL	`nametag3-multilingual-conll-240830`	✔	✔	✔
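The compatibility table can also be expressed as a small lookup, which may be handy when a script has to pick a model for an installed NameTag 3 release. This is only a sketch: the dictionary is transcribed from the table above, and the names `MIN_NAMETAG_VERSION` and `is_compatible` are hypothetical, not part of NameTag.

```python
# Minimum NameTag 3 release for each model, transcribed from the table above.
MIN_NAMETAG_VERSION = {
    "nametag3-czech-cnec2.0-240830": (3, 0),
    "nametag3-multilingual-260521": (3, 1),
    "nametag3-multilingual-250203": (3, 1),
    "nametag3-multilingual-conll-240830": (3, 0),
}


def is_compatible(model: str, software_version: str) -> bool:
    """Return True if the model can be used with the given NameTag 3 release."""
    major, minor = software_version.split(".")[:2]
    return (int(major), int(minor)) >= MIN_NAMETAG_VERSION[model]


print(is_compatible("nametag3-multilingual-260521", "3.0"))  # False
print(is_compatible("nametag3-multilingual-260521", "3.1"))  # True
```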

2. Results at a Glance

Model	Multi	Multi	Czech CNEC 2.0	NameTag 3
Version	`260521`	`250203`	`240830`	All
Languages trained	20	17	1	20
Languages evaluated	27	20	1	27
Languages SOTA	23	15	1	23
Datasets trained	25	21	1	26
Datasets evaluated	39	28	1	40
Datasets SOTA	31	20	1	33

3. Czech CNEC 2.0 Model

The Czech CNEC 2.0 model is trained on the training part of the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007).

The corpus uses 46 atomic named entity types, which can be embedded; for example, the river name Labe can be part of a city name, as in <gu Ústí nad <gh Labem>>. In parallel, the corpus is annotated with 7 coarser, one-character supertypes, which can also be nested. In addition, there are 4 so-called NE (named entity) containers, each composed of two or more NEs (e.g., two NEs, a first name and a surname, together form a person name NE container, as in <P <pf Jan><ps Novák>>). The 4 NE containers are marked with a capital one-letter tag: P for (complex) person names, T for temporal expressions, A for addresses, and C for bibliographic items.
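To illustrate how the embedded annotation nests, here is a minimal Python sketch that unpacks the bracket notation used in the examples above into plain text and a list of labeled entities. The function `parse_cnec` is hypothetical and is not the corpus reader used by NameTag. It assumes every label is followed by a single space and that the text itself contains no literal `<` or `>` characters.

```python
def parse_cnec(annotated: str):
    """Return (plain_text, entities) for a string in the bracket notation.

    Entities are (label, text) pairs and are emitted innermost first.
    """
    entities = []
    stack = []  # open entities as (label, start offset in the plain text)
    out = []    # characters of the plain text collected so far
    i = 0
    while i < len(annotated):
        ch = annotated[i]
        if ch == "<":
            # Read the label up to the next whitespace character.
            j = i + 1
            while j < len(annotated) and not annotated[j].isspace():
                j += 1
            stack.append((annotated[i + 1:j], len(out)))
            i = j + 1  # skip the space that follows the label
        elif ch == ">":
            # Close the most recently opened entity.
            label, start = stack.pop()
            entities.append((label, "".join(out[start:])))
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out), entities


print(parse_cnec("<gu Ústí nad <gh Labem>>"))
# ('Ústí nad Labem', [('gh', 'Labem'), ('gu', 'Ústí nad Labem')])
```

The stack mirrors the nesting: the inner `gh` entity closes before the enclosing `gu` entity, so both spans are recovered from a single pass over the string.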

The latest version is nametag3-czech-cnec2.0-240830, distributed by LINDAT.

The model nametag3-czech-cnec2.0-240830 reaches 86.39 F1-measure for the fine-grained, two-character types and 89.29 for the coarse, one-character supertypes on the CNEC2.0 test data.
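For readers unfamiliar with the metric, the following Python sketch shows one common way to compute an entity-level F1-measure over nested entities: each entity is a (start, end, label) triple, and a prediction counts as correct only if all three match a gold entity. This is an illustration under these assumptions, not the evaluation script behind the reported numbers. The function name `span_f1` and the toy spans are hypothetical.

```python
def span_f1(gold: set, predicted: set) -> float:
    """Entity-level F1 over sets of (start, end, label) triples."""
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: nested entities are simply additional triples in the set.
gold = {(0, 3, "gu"), (2, 3, "gh"), (5, 7, "P")}
predicted = {(0, 3, "gu"), (2, 3, "gh"), (5, 6, "pf")}
print(round(span_f1(gold, predicted), 4))  # 0.6667
```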

3.1. Acknowledgements

This work has been supported by the Grant Agency of the Czech Republic under the EXPRO program as project “LUSyD” (project No. GX20-16819X). The work described herein has also been using data provided by the LINDAT/CLARIAH-CZ Research Infrastructure, supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062).

The Czech CNEC 2.0 model is trained on the Czech Named Entity Corpus 2.0, which was created by Magda Ševčíková, Zdeněk Žabokrtský, Jana Straková and Milan Straka.

The research was carried out by Jana Straková and Milan Straka.

All models use UDPipe for tokenization.

3.1.1. Publications

Jana Straková and Milan Straka. 2025. NameTag 3: A Tool and a Service for Multilingual/Multitagset NER. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 31–39, Vienna, Austria. Association for Computational Linguistics.

Straková Jana, Straka Milan, Hajič Jan: Neural Architectures for Nested NER through Linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Copyright © Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-48-2, pp. 5326-5331, 2019.

Straka Milan, Straková Jana, Hajič Jan: Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER. In: Lecture Notes in Computer Science, Vol. 11697, Proceedings of the 22nd International Conference on Text, Speech and Dialogue - TSD 2019, Copyright © Springer International Publishing, Cham / Heidelberg / New York / Dordrecht / London, ISBN 978-3-030-27946-2, ISSN 0302-9743, pp. 137-150, 2019.

Straková Jana, Straka Milan, Hajič Jan, Popel Martin: Hluboké učení v automatické analýze českého textu. In: Slovo a slovesnost, Vol. 80, No. 4, Copyright © Ústav pro jazyk český AV ČR, Prague, Czech Republic, ISSN 0037-7031, pp. 306-327, Dec 2019.

4. Multilingual Model

NameTag 3 multilingual models are single models trained on multiple datasets in multiple languages.

The multilingual model has the following versions:

The latest version is nametag3-multilingual-260521, distributed via LINDAT. It was trained on 25 datasets in 20 languages and achieves state-of-the-art results on 31 datasets in 23 languages, as of May 2026.
The previous version is nametag3-multilingual-250203, distributed via LINDAT. It was trained on 21 datasets across 17 languages and achieves state-of-the-art results on 20 evaluation datasets in 15 languages, as of February 2025.

NameTag 3 multilingual models recognize the following tagsets:

conll (default): The CoNLL-2003 shared-task tagset: PER, ORG, LOC, and MISC. Use --tagsets=conll with nametag3.py, or request nametag3-multilingual-conll-260521 from the NameTag 3 web service.
uner: The Universal NER tagset: PER, ORG, and LOC. Use --tagsets=uner with nametag3.py, or request nametag3-multilingual-uner-260521 from the NameTag 3 web service.
onto: The OntoNotes v5 tagset: PERSON, NORP, FAC, ORG, GPE, and others. Use --tagsets=onto with nametag3.py, or request nametag3-multilingual-onto-260521 from the NameTag 3 web service (https://lindat.mff.cuni.cz/services/nametag/).
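The naming scheme in the list above can be sketched in Python as follows. The service model names and the service URL come from the text. The `api/recognize` path and the `model` and `data` parameter names are assumptions about the web service's REST interface and should be checked against its API documentation before use. The helper names are hypothetical, and the snippet only builds a URL without sending any request.

```python
from urllib.parse import urlencode

SERVICE_URL = "https://lindat.mff.cuni.cz/services/nametag/"
TAGSETS = ("conll", "uner", "onto")


def service_model_name(tagset: str, version: str = "260521") -> str:
    """Name of the web-service model that outputs the given tagset."""
    if tagset not in TAGSETS:
        raise ValueError(f"unknown tagset: {tagset!r}")
    return f"nametag3-multilingual-{tagset}-{version}"


def recognize_url(text: str, tagset: str = "conll") -> str:
    """Build a request URL for the web service.

    ASSUMPTION: the endpoint path "api/recognize" and the parameter names
    "model" and "data" are not confirmed by this document.
    """
    query = urlencode({"model": service_model_name(tagset), "data": text})
    return f"{SERVICE_URL}api/recognize?{query}"


print(service_model_name("uner"))  # nametag3-multilingual-uner-260521
print(recognize_url("Jan Novák", "onto"))
```

When running nametag3.py locally, the same choice is made with the `--tagsets` option described above instead of a tagset-specific model name.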

Multilingual, multitagset models such as nametag3-multilingual-250203 and nametag3-multilingual-260521 require at least NameTag 3.1.

Corpus	tagset	`260521`	`250203`	Data
Arabic CoNLL-2012 OntoNotes v5	onto	74.42	74.20	[1]
Cebuano UNER GJA	uner	95.92*	96.97*	[2]
Chinese CoNLL-2012 OntoNotes v5	onto	81.47	81.63	[1]
Chinese UNER GSD	uner	89.50	91.53	[2]
Chinese UNER GSDSIMP	uner	89.83	90.99	[3]
Chinese UNER PUD	uner	88.86^	89.35^	[3]
Croatian UNER SET	uner	95.56	95.55	[2]
Czech CNEC 2.0 CoNLL (4 labels, flat)	conll	85.21	86.24	[4]
Czech UNER2 PUD	uner	84.46^	-	[3]
Danish UNER DDT	uner	89.32	89.75	[2]
Dutch CoNLL-2002	conll	94.03	94.93	[5]
English CoNLL-2003	conll	94.10	94.09	[6]
English CoNLL-2012 OntoNotes v5	onto	90.13	90.19	[1]
English UNER EWT	uner	88.22	87.03	[2]
English UNER PUD	uner	83.76^	-	[2]
German CoNLL-2003	conll	87.72	87.48	[6]
German UNER PUD	uner	83.68^	-	[2]
Greek UNER2 GDT	uner	100.00	-	[3]
Hebrew UNER2 HTB	uner	83.43	-	[3]
Indonesian UNER2 PUD	uner	76.59*	-	[3]
Japanese UNER2 PUD	uner	81.91*	-	[3]
Korean UNER2 PUD	uner	73.21*	-	[3]
Maghrebi UNER Arabizi	uner	85.33	84.49	[2]
Norw. Bokmål UNER2 NDT	uner	95.59	95.83	[3]
Norw. Nynorsk UNER2 NDT	uner	95.04	94.51	[3]
Portuguese UNER Bosque	uner	91.53	90.89	[2]
Portuguese UNER PUD	uner	92.17^	91.77^	[2]
Romanian UNER2 LegalNERo	uner	68.43*	-	[3]
Russian UNER PUD	uner	75.88*	75.51	[2]
Serbian UNER SET	uner	97.27	97.10	[2]
Slovak UNER SNK	uner	88.36	88.46	[2]
Slovenian UNER2 SSJ	uner	93.15	-	[3]
Spanish CoNLL-2002	conll	90.29	90.29	[5]
Swedish UNER2 Lines	uner	91.15	-	[3]
Swedish UNER PUD	uner	89.74^	91.27^	[2]
Swedish UNER Talbanken	uner	92.03	91.79	[2]
Tagalog UNER TRG	uner	97.78*	97.78*	[2]
Tagalog UNER Ugnayan	uner	83.08*	75.00*	[2]
Ukrainian Lang-uk	conll	92.18	92.88	[7]

Legend:

* Cross-lingual transfer: The model was evaluated on data in a language not represented in the training data.
^ Out-of-domain evaluation: The model was evaluated on a different domain or dataset, while the language was represented in the training data.
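When the results table is processed automatically, the markers from the legend have to be separated from the scores. The Python sketch below shows one way to do this. The function `parse_score` is hypothetical, the label "standard" for unmarked cells is a name chosen here for illustration, and reading "-" as "no result reported" is an assumption, since the legend does not define it.

```python
MARKERS = {"*": "cross-lingual", "^": "out-of-domain"}


def parse_score(cell: str):
    """Split a table cell such as '95.92*' into (score, evaluation setting)."""
    cell = cell.strip()
    if cell == "-":
        # ASSUMPTION: "-" means that no result is reported for this version.
        return None, "not reported"
    setting = MARKERS.get(cell[-1], "standard")
    return float(cell.rstrip("*^")), setting


print(parse_score("95.92*"))  # (95.92, 'cross-lingual')
print(parse_score("88.86^"))  # (88.86, 'out-of-domain')
print(parse_score("94.10"))   # (94.1, 'standard')
```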

Data:

[1] OntoNotes v5 with the CoNLL-2012 train/dev/test split
[2] Universal NER 1.0
[3] Universal NER 2.0
[4] In order to train and serve the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007) jointly within a large multilingual model, the original annotation of the CNEC 2.0 has been harmonized to the standard 4-label tagset with PER, ORG, LOC, and MISC, resulting in an extensive simplification of the original annotation and flattening of the original nested entities. The script for the automated conversion to the 4-label CoNLL-2003 tagset can be found in the NameTag 3 GitHub repository. If you are interested in the original CNEC 2.0 model with the complete 46 labels and nested entities, see the Czech CNEC 2.0 model.
[5] CoNLL-2002 NE annotations (Tjong Kim Sang, 2002) of part of Reuters Corpus
[6] CoNLL-2003 NE annotations (Tjong Kim Sang and De Meulder, 2003) of part of Reuters Corpus
[7] The Ukrainian language is trained on the Ukrainian Lang-uk NER corpus based on the Lang-uk initiative. The corpus uses four classes: PER, ORG, LOC, and MISC (please note that we harmonized the original PERS to the common PER). The corpus was split randomly into train/dev/test in the ratio 8:1:1.
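The two preprocessing steps described for the Ukrainian Lang-uk corpus, renaming PERS to PER and a random 8:1:1 split, can be sketched in Python as below. This is an illustration only, not the script that produced the released data. The function names, the BIO-style label prefixes, the fixed seed, and the choice of the sentence as the unit of splitting are all assumptions.

```python
import random


def harmonize_label(label: str) -> str:
    """Map the original PERS class to the common PER, keeping any prefix.

    ASSUMPTION: labels look like "B-PERS", "I-PERS", "PERS" or "O".
    """
    prefix, sep, cls = label.rpartition("-")
    return f"{prefix}{sep}PER" if cls == "PERS" else label


def split_8_1_1(items, seed=42):
    """Shuffle the items and split them into train/dev/test in ratio 8:1:1.

    ASSUMPTION: the seed and the unit of splitting are illustrative only.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


print([harmonize_label(l) for l in ["B-PERS", "I-PERS", "O", "B-ORG"]])
# ['B-PER', 'I-PER', 'O', 'B-ORG']
train, dev, test = split_8_1_1(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```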

4.1. Acknowledgements

This work has been supported by the MŠMT OP JAK program, project No. CZ.02.01.01/00/22_008/0004605 and by the Grant Agency of the Czech Republic under the EXPRO program as project “LUSyD” (project No. GX20-16819X). The work described herein has also been using data provided by the LINDAT/CLARIAH-CZ Research Infrastructure, supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062).

The research was carried out by Jana Straková and Milan Straka.

All models use UDPipe for tokenization.

4.1.1. Publications

5. Multilingual CoNLL Model

The multilingual CoNLL model uses four classes: PER, ORG, LOC and MISC.

The latest version is nametag3-multilingual-conll-240830, distributed by LINDAT.

Corpus	`240830`	Note
Czech CNEC 2.0 CoNLL (4 labels, flat)	86.35	[1]
Dutch CoNLL-2002	94.42	[2]
English CoNLL-2003	93.85	[3]
German CoNLL-2003	87.07	[3]
Spanish CoNLL-2002	89.90	[2]
Ukrainian Lang-uk	91.73	[4]

[1] In order to train and serve the Czech Named Entity Corpus 2.0 (Ševčíková et al., 2007) jointly within a large multilingual model, the original annotation of the CNEC 2.0 has been harmonized to the standard 4-label tagset with PER, ORG, LOC, and MISC, resulting in an extensive simplification of the original annotation and flattening of the original nested entities. The script for the automated conversion to the 4-label CoNLL-2003 tagset can be found in the NameTag 3 GitHub repository. If you are interested in the original CNEC 2.0 model with the complete 46 labels and nested entities, see the Czech CNEC 2.0 model.
[2] CoNLL-2002 NE annotations (Tjong Kim Sang, 2002) of part of Reuters Corpus
[3] CoNLL-2003 NE annotations (Tjong Kim Sang and De Meulder, 2003) of part of Reuters Corpus
[4] The Ukrainian language is trained on the Ukrainian Lang-uk NER corpus based on the Lang-uk initiative. The corpus uses four classes: PER, ORG, LOC, and MISC (please note that we harmonized the original PERS to the common PER). The corpus was split randomly into train/dev/test in the ratio 8:1:1.

5.1. Acknowledgements

The research was carried out by Jana Straková and Milan Straka.

All models use UDPipe for tokenization.

Institute of Formal and Applied Linguistics

Charles University, Czech Republic
Faculty of Mathematics and Physics
