Czech Named Entity Corpus

The latest version of the Czech Named Entity Corpus (Czech Named Entity Corpus 2.0) is a corpus of 8993 Czech sentences containing 35220 manually annotated Czech named entities.

The corpus uses 46 atomic named entity (NE) types. Entities can be embedded in one another: for example, a river name can be part of a city name, as in <gu Ústí nad <gh Labem>>. There are also 4 so-called NE containers, each of which groups two or more NEs (e.g., two NEs, a first name and a surname, together form a person name NE container, as in <P <pf Jan><ps Novák>>). The 4 NE containers are marked with a capital one-letter tag: P for (complex) person names, T for temporal expressions, A for addresses, and C for bibliographic items.
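
The embedded notation in the examples above can be read with a small stack-based parser. The following is a minimal sketch, assuming the inline `<type text>` bracket notation exactly as it appears in those examples; the released corpus files may use other formats, and the names `parse_cnec_markup` and `CONTAINER_TAGS` are hypothetical, introduced here only for illustration.

```python
from typing import List, Optional, Tuple

# The four container tags named in the text above: P, T, A, C.
CONTAINER_TAGS = {"P", "T", "A", "C"}

Entity = Tuple[str, str]  # (tag, surface text)


def parse_cnec_markup(markup: str) -> Tuple[str, List[Entity]]:
    """Strip the bracket markup and return (plain text, entities).

    Entities are listed in the order their opening brackets appear,
    so an outer entity always precedes the entities embedded in it.
    """
    plain: List[str] = []                      # text with markup removed
    entities: List[Optional[Entity]] = []      # slots filled when a bracket closes
    stack: List[Tuple[str, int, int]] = []     # (tag, start offset, slot index)
    i = 0
    while i < len(markup):
        ch = markup[i]
        if ch == "<":
            # Read the tag: everything up to whitespace or another bracket.
            j = i + 1
            while j < len(markup) and not markup[j].isspace() and markup[j] not in "<>":
                j += 1
            entities.append(None)
            stack.append((markup[i + 1:j], len(plain), len(entities) - 1))
            # Skip the single space that separates the tag from its content.
            if j < len(markup) and markup[j] == " ":
                j += 1
            i = j
        elif ch == ">":
            if not stack:
                raise ValueError("closing bracket without a matching opening bracket")
            tag, start, slot = stack.pop()
            entities[slot] = (tag, "".join(plain[start:]))
            i += 1
        else:
            plain.append(ch)
            i += 1
    if stack:
        raise ValueError("opening bracket without a matching closing bracket")
    return "".join(plain), [e for e in entities if e is not None]


if __name__ == "__main__":
    text, found = parse_cnec_markup("<gu Ústí nad <gh Labem>>")
    print(text)
    for tag, surface in found:
        kind = "container" if tag in CONTAINER_TAGS else "atomic"
        print(tag, kind, surface)
```

For the city example, this yields the plain text `Ústí nad Labem` with two entities, `gu` covering the whole name and `gh` covering `Labem`. A stack is used because each closing bracket always belongs to the most recently opened entity.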

Current version download: Czech Named Entity Corpus 2.0.

Detailed descriptions of the corpus, the file formats, and the two-level named entity hierarchy, together with download links, are available for every released version.

Work Published using CNEC

CNEC Leaderboard

CNEC 1.0 and 2.0 results: Span-based micro F1 scores measured using the dataset's canonical evaluation script
CNEC 1.0 Types	CNEC 1.0 Supertypes	CNEC 2.0 Types	CNEC 2.0 Supertypes	CNEC 1.0 Extended	CNEC 2.0 Extended	System	Code	Method
–	–	86.39	89.29	–	–	NameTag 3 (Straková & Straka, 2025)	GitHub	Seq2seq+fine-tuned RobeCzech
–	–	–	–	–	86.39	Bachelor Thesis of Müller 2020, a rerun of Straková et al., 2019	GitHub	LSTM-CRF+BERT
86.88	89.91	~~86.23~~ 84.66	~~89.37~~ 88.02	–	–	Straka et al., 2019	–	Seq2seq+BERT
86.88	–	–	–	–	–	Straková et al., 2019	GitHub	Seq2seq+BERT
83.15	86.30	–	–	83.27	84.22	Jana Straková, Milan Straka, Jan Hajič, Martin Popel (2019): Hluboké učení v automatické analýze českého textu. In: Slovo a slovesnost, ISSN 0037-7031, vol. 80, no. 4, pp. 306-327	–	Deep NN
–	–	–	–	–	81.05	Güngör, 2018	–	RNN+WE+CLE
81.20	84.68	79.23	82.78	80.88	80.79	Straková et al., 2016	GitHub	RNN+WE+CLE
–	–	–	–	74.08	–	Konkol et al., 2015	–	Latent semantics
–	–	–	–	75.61	–	Demir and Özgür, 2014	–	NN+WE
–	–	–	–	74.23	74.37	Konkol and Konopík, 2014	–	CRF+stemming
79.23	82.82	–	–	–	–	Straková et al., 2013	NameTag 1	Simple NN
–	79.00	–	–	74.08	–	Konkol and Konopík, 2013	–	CRF
–	72.94	–	–	–	–	Konkol and Konopík, 2011	–	Maximum entropy
68.00	71.00	–	–	–	–	Kravalová and Žabokrtský, 2009	–	SVM
62.00	68.00	–	–	–	–	Ševčíková et al., 2007	–	Dec. trees
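
To make the metric in the table concrete, the following is a minimal sketch of span-based micro F1. It is not the dataset's canonical evaluation script, whose exact matching rules are not described here. It assumes that an entity counts as correct only when its sentence, span boundaries, and type all match exactly; the `Span` layout and the function name `span_micro_f1` are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical span layout: (sentence index, start token, end token, entity type).
Span = Tuple[int, int, int, str]


def span_micro_f1(gold: Iterable[Span], predicted: Iterable[Span]) -> float:
    """Micro-averaged F1 over exact-match spans, pooled across all sentences."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [(0, 0, 3, "gu"), (0, 2, 3, "gh"), (1, 0, 2, "P")]
    # The second prediction has the right boundaries but the wrong type,
    # so it counts as both a false positive and a false negative.
    predicted = [(0, 0, 3, "gu"), (0, 2, 3, "gs"), (1, 0, 2, "P")]
    print(f"{100 * span_micro_f1(gold, predicted):.2f}")
```

In this toy example, two of the three predicted spans match the gold spans exactly, so precision and recall are both 2/3 and the script prints `66.67`. Because nested entities are separate spans, an embedded entity and the entity that contains it are scored independently.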

Please let us know if you would like to be featured on this leaderboard. Thank you!

Tools

NameTag: Czech Named Entity Recognizer

Please let us know if you would like your tool to be added to the list.

Other

Straková Jana, Straka Milan, Ševčíková Magda, Žabokrtský Zdeněk: Czech Named Entity Corpus. In: Handbook of Linguistic Annotation, Copyright © Springer Netherlands, Netherlands, ISBN 978-94-024-0879-9, pp. 855-873, 1459 pp., 2017.

Please Cite this Corpus As:

Ševčíková, M., Žabokrtský, Z., Krůza, O.: Named Entities in Czech: Annotating Data and Developing NE Tagger. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 188–195. Springer, Heidelberg (2007).

@inproceedings{SevcikovaEtAl2007CNEC,
booktitle = {Lecture Notes in Artificial Intelligence, Proceedings of the 10th International Conference on Text, Speech and Dialogue},
series = {Lecture Notes in Computer Science},
title = {Named Entities in Czech: Annotating Data and Developing {NE} Tagger},
editor = {V{\'{a}}clav Matou{\v{s}}ek and Pavel Mautner},
author = {Magda {\v{S}}ev{\v{c}}{\'{\i}}kov{\'{a}} and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}} and Old{\v{r}}ich Kr{\r{u}}za},
year = {2007},
publisher = {Springer},
address = {Berlin / Heidelberg},
volume = {4629},
number = {{XVII}},
pages = {188--195},
isbn = {978-3-540-74627-0},
issn = {0302-9743},
}

Acknowledgements:

SVV project number 267 314 (Teoretické základy informatiky a výpočetní lingvistiky; Theoretical Foundations of Computer Science and Computational Linguistics)
LINDAT/CLARIN (Large infrastructural grant for language resources, data access and distribution and related research), project LM2010013 of the Ministry of Education of the Czech Republic
GAČR 406/12/P175 project (Vybrané derivační vztahy pro automatické zpracování češtiny; Selected Derivational Relations for Automatic Processing of Czech) of the Grant Agency of the Czech Republic
PRVOUK P46 project
