Czech Named Entity Corpus

The latest version of the Czech Named Entity Corpus (Czech Named Entity Corpus 2.0) contains 8993 Czech sentences with 35220 manually annotated named entities.

The corpus uses 46 atomic named entity types, which can be embedded: for example, a river name can be part of a city name, as in <gu Ústí nad <gh Labem>>. There are also 4 so-called NE containers, each of which groups two or more NEs (e.g., a first name and a surname together form a person name NE container, as in <P <pf Jan><ps Novák>>). The 4 NE containers are marked with a capital one-letter tag: P for (complex) person names, T for temporal expressions, A for addresses, and C for bibliographic items.
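The inline notation shown above can be read with a small recursive parser. The following is a minimal sketch that handles only the `<tag text>` notation used in these two examples; it is not the official reader for the distributed corpus files, whose formats are documented separately for each release. The names `Entity` and `parse_annotated` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    tag: str   # atomic type such as "gu", or a container tag such as "P"
    text: str  # surface text of the entity with all markup removed
    children: list = field(default_factory=list)  # directly embedded entities


def parse_annotated(s):
    """Return (plain_text, top_level_entities) for a string with <tag ...> markup."""
    pos = 0

    def parse_span(closing):
        nonlocal pos
        parts, entities = [], []
        while pos < len(s):
            ch = s[pos]
            if ch == "<":
                # Read the tag name: everything up to the next whitespace.
                pos += 1
                start = pos
                while pos < len(s) and not s[pos].isspace():
                    pos += 1
                tag = s[start:pos]
                pos += 1  # skip the single space that follows the tag
                inner_text, inner_entities = parse_span(closing=True)
                entities.append(Entity(tag, inner_text, inner_entities))
                parts.append(inner_text)
            elif ch == ">" and closing:
                pos += 1
                return "".join(parts), entities
            else:
                parts.append(ch)
                pos += 1
        if closing:
            raise ValueError("unbalanced markup: missing '>'")
        return "".join(parts), entities

    return parse_span(closing=False)


text, entities = parse_annotated("<gu Ústí nad <gh Labem>>")
city = entities[0]
print(city.tag, "->", city.text)
print(city.children[0].tag, "->", city.children[0].text)

_, containers = parse_annotated("<P <pf Jan><ps Novák>>")
print(containers[0].tag, [child.tag for child in containers[0].children])
```

Keeping embedded entities as children of the enclosing entity mirrors the nesting in the notation, so a container such as P exposes its parts directly.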

 

Current version download: Czech Named Entity Corpus 2.0.

 

A detailed description of the corpus, its file formats, the two-level named entity hierarchy, and download links is available for every released version.

Work Published using CNEC

State-of-the-art Results

CNEC 1.0 and 2.0 Results, F1 measure
Scores (CNEC 1.0 Types, CNEC 1.0 Supertypes, CNEC 2.0 Types, CNEC 2.0 Supertypes, CNEC 1.0 Extended, CNEC 2.0 Extended) | Publication | Code | Method
86.39 | Bachelor Thesis of Müller 2020, a rerun of Straková et al., 2019 | Straková et al., 2019 | LSTM-CRF+BERT
86.88 89.91 86.23 84.66 89.37 88.02 | Straka et al., 2019 | | Seq2seq+BERT
86.88 | Straková et al., 2019 | GitHub | Seq2seq+BERT
83.15 86.30 83.27 84.22 | Jana Straková, Milan Straka, Jan Hajič, Martin Popel (2019): Hluboké učení v automatické analýze českého textu. In: Slovo a slovesnost, ISSN 0037-7031, vol. 80, no. 4, pp. 306-327 | | Deep NN
81.05 | Güngör, 2018 | | RNN+WE+CLE
81.20 84.68 79.23 82.78 80.88 80.79 | Straková et al., 2016 | GitHub | RNN+WE+CLE
74.08 | Konkol et al., 2015 | | Latent semantics
75.61 | Demir and Özgür, 2014 | | NN+WE
74.23 74.37 | Konkol and Konopík, 2014 | | CRF+stemming
79.23 82.82 | Straková et al., 2013 | NameTag | Simple NN
79.00 74.08 | Konkol and Konopík, 2013 | | CRF
72.94 | Konkol and Konopík, 2011 | | Maximum entropy
68.00 71.00 | Kravalová and Žabokrtský, 2009 | | SVM
62.00 68.00 | Ševčíková et al., 2007 | | Dec. trees

Please let us know if you have a contribution to this table. Thanks!
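For readers unfamiliar with the measure reported in the table, the sketch below computes an entity-level F1 score. It assumes exact-match scoring, where a predicted entity counts as correct only if its span and type both match a gold entity. This is an assumption for illustration: the evaluation protocol used in each cited publication may differ, and the function name `entity_f1` and the tuple layout are hypothetical.

```python
def entity_f1(gold, predicted):
    """F1 over collections of (sentence_id, start, end, tag) tuples, exact match."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy data: two of three predictions match two of three gold entities.
gold = {(0, 0, 3, "gu"), (0, 2, 3, "gh"), (1, 0, 2, "P")}
pred = {(0, 0, 3, "gu"), (1, 0, 2, "P"), (1, 0, 1, "pf")}
print(f"{100 * entity_f1(gold, pred):.2f}")
```

Because embedded entities are separate tuples, a system that finds the outer entity but misses the inner one is penalized in recall under this scheme.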

Tools

Other

  • Straková Jana, Straka Milan, Ševčíková Magda, Žabokrtský Zdeněk: Czech Named Entity Corpus. In: Handbook of Linguistic Annotation, Copyright © Springer Netherlands, Netherlands, ISBN 978-94-024-0879-9, pp. 855-873, 1459 pp., 2017.
  • Ševčíková Magda, Žabokrtský Zdeněk, Krůza Oldřich: Zpracování pojmenovaných entit v českých textech. Technical report no. 2007/TR-2007-36, Copyright © ÚFAL MFF UK, 60 pp., 2007.

Please Cite this Corpus As:

Ševčíková, M., Žabokrtský, Z., Krůza, O.: Named Entities in Czech: Annotating Data and Developing NE Tagger. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 188–195. Springer, Heidelberg (2007).

@inproceedings{SevcikovaEtAl2007CNEC,
booktitle = {Lecture Notes in Artificial Intelligence, Proceedings of the 10th International Conference on Text, Speech and Dialogue},
series = {Lecture Notes in Computer Science},
title = {Named Entities in Czech: Annotating Data and Developing {NE} Tagger},
editor = {V{\'{a}}clav Matou{\v{s}}ek and Pavel Mautner},
author = {Magda {\v{S}}ev{\v{c}}{\'{\i}}kov{\'{a}} and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}} and Old{\v{r}}ich Kr{\r{u}}za},
year = {2007},
publisher = {Springer},
address = {Berlin / Heidelberg},
volume = {4629},
number = {{XVII}},
pages = {188--195},
isbn = {978-3-540-74627-0},
issn = {0302-9743},
}

Acknowledgements:

  • SVV project number 267 314 (Teoretické základy informatiky a výpočetní lingvistiky)
  • LINDAT/CLARIN (Large infrastructural grant for language resources, data access and distribution and related research), project LM2010013 of the Ministry of Education of the Czech Republic
  • GAČR 406/12/P175 project (Vybrané derivační vztahy pro automatické zpracování češtiny) of the Grant Agency of the Czech Republic
  • PRVOUK P46 project

Authors: