The Czech Named Entity Corpus 1.1 is a corpus of 5868 Czech sentences with manually
annotated 33662 Czech named entities, classified according to a
two-level hierarchy of 62 named entities. It is a minor update to the Czech Named Entity Corpus 1.0, a first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a fine-grained classification. The corpus is available under the CC BY-NC-SA 3.0 license.
The named entities in Czech are classified according to a two-level hierarchy taken from Ševčíková et al., 2007. The hierarchy is the same as in CNEC 1.0.
Named entities are saved in formats:
- plain text – manual annotations in plain text format
- simple xml – simple xml format
- treex – xml format from Treex (formerly TectoMT) with morphologic analysis
- html – html with highlighted named entities
Czech Named Entity Corpus 1.1 can be downloaded from LINDAT/CLARIN repository.
The difference between Czech Named Entity Corpus 1.1 and 1.0 are the following:
- fixed some misannotated entities
- make all formats contain the same data
- provide the same tokenzation in all formats
- add two sentences omitted in some format
- fixed typos in entity names in plain text format
- replaced tmt format by treex format
- removed the original text format
- split all formats into train, dtest and etest, not only treex