Introduction

The Czech Named Entity Corpus 1.1 is a corpus of 5868 Czech sentences with 33662 manually annotated Czech named entities, classified according to a two-level hierarchy of 62 named entity classes. It is a minor update to the Czech Named Entity Corpus 1.0, the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a fine-grained classification. The corpus is available under the CC BY-NC-SA 3.0 license.

NE Hierarchy and Classes

[Figure: CNEC 1.0 NE hierarchy]

The named entities in Czech are classified according to a two-level hierarchy taken from Ševčíková et al., 2007. The hierarchy is the same as in CNEC 1.0.

Data Formats

The named entities are provided in the following formats:

  • plain text – manual annotations in plain text format
  • simple xml – a simple XML format
  • treex – the XML format from Treex (formerly TectoMT), with morphological analysis
  • html – HTML with highlighted named entities

Downloads

The Czech Named Entity Corpus 1.1 can be downloaded from the LINDAT/CLARIN repository.

Evaluation

The Czech Named Entity Corpus 1.1 is evaluated using the canonical script distributed with the corpus. The evaluation metric is strict span-based micro-averaged F1: a predicted entity counts as correct only if both its span and its type match the gold annotation.
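To make the metric concrete, here is a minimal Python sketch of strict span-based micro F1. It is not the evaluation script distributed with the corpus and should not be used to report official results. The entity representation (sentence index, start token, exclusive end token, type), the function name strict_micro_f1 and the type labels in the example are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical entity representation (an assumption, not a corpus format):
# (sentence index, start token, end token exclusive, entity type)
Entity = Tuple[int, int, int, str]


def strict_micro_f1(
    gold: Iterable[Entity], predicted: Iterable[Entity]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) pooled over all entities.

    A predicted entity is a true positive only if an identical tuple
    (same sentence, same span, same type) is present in the gold set.
    """
    gold_set = set(gold)
    pred_set = set(predicted)
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data; the type labels are placeholders, not corpus classes.
gold = [(0, 0, 2, "person"), (0, 4, 5, "city"), (1, 1, 2, "country")]
predicted = [
    (0, 0, 2, "person"),   # exact match: counted as correct
    (0, 4, 5, "country"),  # right span, wrong type: not counted
    (1, 1, 3, "country"),  # right type, wrong span: not counted
]

precision, recall, f1 = strict_micro_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this example only one of the three predictions matches a gold entity exactly, so precision, recall and F1 are all 1/3. Because the counts are pooled over all entities before the ratios are computed, frequent classes weigh more than rare ones, which is what distinguishes micro from macro averaging.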

Changes

The differences between the Czech Named Entity Corpus 1.1 and 1.0 are the following:

  • fixed some misannotated entities
  • made all formats contain the same data
    • provided the same tokenization in all formats
    • added two sentences that were omitted in some formats
    • fixed typos in entity names in the plain text format
  • replaced the tmt format with the treex format
  • removed the original text format
  • split all formats into train, dtest and etest, not only treex