Czech Named Entity Corpus 2.0
The Czech Named Entity Corpus 2.0 is a corpus of 8993 Czech sentences containing 35220 manually annotated named entities, classified according to a two-level hierarchy of 46 named entity types. It is a major update to the Czech Named Entity Corpus 1.0, the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a fine-grained classification. The corpus is available under the CC BY-NC-SA 3.0 license.
Classification
Named entities are classified according to an updated version of the two-level hierarchy of CNEC 1.0 described in Ševčíková et al., 2007.
Data Formats
The corpus is provided in the following formats:
- plain text – manual annotations in plain text format
- simple xml – simple xml format
- treex – xml format from Treex (formerly TectoMT) with morphological analysis
- html – html with highlighted named entities
Downloads
The Czech Named Entity Corpus 2.0 can be downloaded from the LINDAT/CLARIN repository.
Changes
Named Entity Hierarchy
The changes in the named entity hierarchy compared to CNEC 1.0 are the following:
- the number entities were overhauled
  - entities of supertype c were merged into n; to accommodate bibliographic entities, a new type nb “vol./page/chap./sec./fig. numbers” was added
    - cs → oa
    - cn → nb
    - cb → nb
    - cp → nb
    - cr → n_, or
  - entities of supertype q were moved into n
  - low-frequency entities of supertype n were removed, and some were renamed and merged
    - nm, nr, nw were removed
    - nc was renamed to ns
    - np → no
    - nq → n_
- some time entities were removed
  - tc → no
  - tp → no
  - tn → nc
  - ts → nc
- a new entity me, representing email addresses, was added
- gp entity was merged into g
- mr and mt were merged into new ms
- oc entity was merged into o
- pb entity was merged into p
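The tag changes listed above can be written down as a lookup table, which may be handy when converting CNEC 1.0 annotations. The following is a minimal sketch, not part of the corpus distribution: the names `RENAMED`, `REMOVED`, `AMBIGUOUS` and `convert_tag` are hypothetical, the targets `g`, `o` and `p` are copied literally from the list, and the entities of supertype q are left out because their individual targets are not listed here.

```python
# Tag changes from CNEC 1.0 to CNEC 2.0, transcribed from the list above.
# All names in this sketch are illustrative assumptions.
RENAMED = {
    # former supertype c
    "cs": "oa", "cn": "nb", "cb": "nb", "cp": "nb",
    # supertype n
    "nc": "ns", "np": "no", "nq": "n_",
    # time entities
    "tc": "no", "tp": "no", "tn": "nc", "ts": "nc",
    # merges (targets written exactly as in the list)
    "gp": "g", "mr": "ms", "mt": "ms", "oc": "o", "pb": "p",
}

# Types that were removed without a replacement.
REMOVED = {"nm", "nr", "nw"}

# The list gives two possible targets for cr, so no automatic choice is made.
AMBIGUOUS = {"cr": ("n_", "or")}


def convert_tag(old_tag):
    """Return the CNEC 2.0 tag for a CNEC 1.0 tag, or None if it was removed.

    Tags not mentioned in the change list are returned unchanged.
    """
    if old_tag in REMOVED:
        return None
    if old_tag in AMBIGUOUS:
        candidates = ", ".join(AMBIGUOUS[old_tag])
        raise ValueError(f"{old_tag} needs a manual decision: {candidates}")
    # A single lookup, never chained: tn becomes nc, and the old nc becomes ns.
    return RENAMED.get(old_tag, old_tag)


if __name__ == "__main__":
    for tag in ("tn", "nc", "nm", "ps"):
        print(tag, "->", convert_tag(tag))
```

The lookup is applied once per tag on purpose. Because the old nc was renamed to ns while tn and ts now map to nc, applying the changes one after another as search-and-replace steps could turn a former tn into ns.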
New Data
New data was annotated and added:
- 125 sentences containing many addresses and emails,
- 3000 sentences containing only a few named entities, so that the resulting corpus better represents the density of named entities (the density of named entities in CNEC 1.1 is too high).
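For orientation, the average density of the resulting corpus can be estimated from the totals stated in the introduction (8993 sentences, 35220 named entities). This is only a rough sketch: the function name `entities_per_sentence` is hypothetical, and the result is a plain average that says nothing about how entities are distributed across sentences.

```python
# Totals for CNEC 2.0 as stated in the corpus description.
SENTENCES = 8993
ENTITIES = 35220


def entities_per_sentence(entities, sentences):
    """Average number of annotated named entities per sentence."""
    return entities / sentences


density = entities_per_sentence(ENTITIES, SENTENCES)
print(f"{density:.2f} named entities per sentence")
```

With these totals the average comes out at roughly 3.9 named entities per sentence.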