CzEng 1.0 Known Issues

We are aware of the following issues in CzEng 1.0 data:

Input Texts

  • OCR errors. Some texts come from optical character recognition, leading to occasional "typos" or even wrongly joined or split words.
  • Extra hyphens. For hard-wrapped texts, we tried to re-join words that were hyphenated at line breaks. Unfortunately, some of these typesetting relics remain in the data.
  • Page numbers and page headers. We invested significant effort in discovering and removing page numbers and even page headers that remained inside the running text. However, we were not able to identify and discard all of them.
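
Users who want to locate leftover line-break hyphens in their own copy of the data could start from a heuristic like the one below. It is only a sketch, not part of the CzEng tooling: the function name find_hyphenation_relics and the min_count threshold are hypothetical, and the idea (flag a hyphenated token when its de-hyphenated form is attested elsewhere in the same token stream) is an assumption about what such relics look like.

```python
from collections import Counter


def find_hyphenation_relics(tokens, min_count=2):
    """Return (token, joined) pairs for hyphenated tokens whose
    de-hyphenated form occurs at least min_count times in tokens.

    Hypothetical helper: a heuristic, so expect both misses and
    false alarms.
    """
    # Case-insensitive frequency of every token in the stream.
    counts = Counter(t.lower() for t in tokens)
    suspects = []
    for tok in sorted(set(tokens)):
        # Only consider hyphens inside the token, not a bare "-".
        if "-" in tok.strip("-"):
            joined = tok.replace("-", "")
            if counts[joined.lower()] >= min_count:
                suspects.append((tok, joined))
    return suspects


if __name__ == "__main__":
    sample = "the typesetting of type-setting and typesetting well-known".split()
    print(find_hyphenation_relics(sample))
```

Legitimate compounds are left alone as long as their joined form does not occur in the data, so the output is best treated as a list of candidates for manual review.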

Tokenization

  • The following string should be split into tokens on the Czech side but is not: Users|%0
  • Negative numbers are split into two tokens but should be kept as one: -30 => - 30
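
The negative-number split can be partly undone in post-processing. The following is a minimal sketch under stated assumptions, not a fix shipped with CzEng: rejoin_negative_numbers is a hypothetical helper that works on a plain list of token strings, and it assumes that a lone "-" directly after another number is subtraction or a range and should stay separate.

```python
import re

# Unsigned integer or decimal, with "." or "," as the separator.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def rejoin_negative_numbers(tokens):
    """Merge a lone "-" token with the numeric token that follows it.

    Hypothetical helper. The merge is skipped when the previous
    output token is itself a number, e.g. "10 - 30".
    """
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        prev = out[-1] if out else None
        # An already merged "-30" also counts as a number here.
        prev_is_number = prev is not None and bool(NUMBER.fullmatch(prev.lstrip("-")))
        if tok == "-" and nxt is not None and NUMBER.fullmatch(nxt) and not prev_is_number:
            out.append("-" + nxt)
            i += 2
        else:
            out.append(tok)
            i += 1
    return out


if __name__ == "__main__":
    print(rejoin_negative_numbers("teplota - 30 stupňů".split()))
```

The heuristic cannot tell a negative number from a dash used as punctuation before a number, so its output should be spot-checked before it is relied on.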

Lemmatization

  • Incorrect lemmas for hyphenated Czech words, e.g. fyzikálně-chemický.