CzEng 1.0 Known Issues
We are aware of the following issues in CzEng 1.0 data:
Input Texts
-
OCR errors. Some texts come from optical character recognition, leading
to occasional "typos" or even wrongly joined or split words.
-
Extra hyphens. For hard-wrapped texts, we have tried to re-join words
hyphenated at line breaks. Unfortunately, some of these typesetting
relics remain in the data.
-
Page numbers and page headers. We invested significant effort in
discovering and removing page numbers and even page headers that remained in
the contiguous texts. We were not able to identify and discard all of them.
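Users who want to screen for leftover page numbers themselves could start from a simple heuristic such as the one sketched below. This is only an illustration, not the procedure used to build CzEng: the function name, the pattern, and the assumption that a stray page number shows up as a segment of its own (a short number, optionally wrapped in dashes) are all hypothetical. It will also flag legitimate segments that consist of a bare number, so the hits should be reviewed rather than deleted blindly.

```python
import re

# Hypothetical pattern: the whole segment is a short number, optionally
# wrapped in dashes as in "- 12 -". Not taken from the CzEng pipeline.
PAGE_NUMBER = re.compile(r"(?:-\s*)?\d{1,4}(?:\s*-)?")


def looks_like_page_number(segment: str) -> bool:
    """Return True if the segment consists only of a page-number-like string."""
    return PAGE_NUMBER.fullmatch(segment.strip()) is not None


if __name__ == "__main__":
    for seg in ["12", "- 12 -", "Chapter 12 begins here."]:
        print(repr(seg), looks_like_page_number(seg))
```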
Tokenization
-
The following should be split into separate tokens on the Czech side, but it is not:
Users|%0
-
Negative numbers are split into two tokens, but they should not be:
-30 => - 30
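The split negative numbers can be partly repaired in post-processing. The sketch below assumes the sentence is available as a plain list of surface tokens; the function name and the rule for deciding when a minus is unary are hypothetical, not part of the CzEng tools. A lone "-" is merged with a following number unless the preceding token is itself a number or a closing bracket, in which case it is more likely a subtraction or a range. A dash used as punctuation directly before a number would still be merged by mistake.

```python
import re

# Plain number such as "30", "1,5" or "1.000,50" (no sign).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def rejoin_negative_numbers(tokens):
    """Merge a lone "-" with the following number when the minus looks unary."""
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        prev = merged[-1] if merged else None
        # After a number or a closing bracket the minus is probably binary.
        binary_context = prev is not None and (
            NUMBER.fullmatch(prev) is not None or prev in {")", "]"}
        )
        if (
            tok == "-"
            and nxt is not None
            and NUMBER.fullmatch(nxt)
            and not binary_context
        ):
            merged.append("-" + nxt)
            i += 2
        else:
            merged.append(tok)
            i += 1
    return merged


if __name__ == "__main__":
    print(rejoin_negative_numbers("It was - 30 degrees".split()))
    print(rejoin_negative_numbers("pages 10 - 30".split()))
```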
Lemmatization
-
Incorrect lemmas for hyphenated Czech words, e.g. fyzikálně-chemický.
Institute of Formal and Applied Linguistics (ÚFAL)
Ondřej Bojar, bojar <at> ufal.mff.cuni.cz
$Id: known-issues.html 1092 2011-12-14 13:40:09Z bojar $