CzEng 1.0 Known Issues

We are aware of the following issues in CzEng 1.0 data:

OCR errors. Some texts come from optical character recognition, leading to occasional "typos" or even wrongly joined or split words.
Extra hyphens. For hard-wrapped texts, we have tried to re-join words that were hyphenated at line breaks. Unfortunately, some of these typesetting relics remain in the data.
Page numbers and page headers. We invested significant effort in discovering and removing page numbers and even page headers that remained in the contiguous texts. Nevertheless, we were not able to identify and discard all of them.
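
Users who want to screen out some of these leftovers themselves could start from a rough filter like the one below. This is only a minimal sketch, not part of the CzEng release: the function name `flag_segment` and both regular expressions are hypothetical, and they assume that a hyphenation relic shows up as a word fragment followed by a hyphen and a space (as in "typeset- ting") and that a stray page number appears as a segment consisting only of a short number, optionally preceded by the word "Page". Relics that were joined without a space (such as "typeset-ting") cannot be told apart from ordinary compounds this way.

```python
import re

# Hypothetical heuristics for illustration; not taken from the CzEng tools.
# A run of letters, a hyphen, one space, then another run of letters.
# [^\W\d_] matches any Unicode letter, so Czech diacritics are covered.
HYPHEN_RELIC = re.compile(r"\b[^\W\d_]+- [^\W\d_]+\b")

# A segment that is nothing but a 1-4 digit number, optionally after "Page".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)


def flag_segment(text):
    """Return a list of labels for suspected issues in one text segment."""
    flags = []
    if HYPHEN_RELIC.search(text):
        flags.append("hyphen")
    if PAGE_NUMBER.match(text):
        flags.append("page_number")
    return flags


if __name__ == "__main__":
    for segment in ["The typeset- ting was poor.", "Page 12", "A well-known fact."]:
        print(segment, flag_segment(segment))
```

Both patterns are deliberately conservative and will still produce false positives (for example, "pre- and post-processing" is flagged as a hyphen relic), so flagged segments are best reviewed rather than dropped automatically.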
