CzEng 1.7

(Czech-English Parallel Corpus, version 1.7)

CzEng 1.7 is a filtered version of the previous release, CzEng 1.6.

During 2017, we found that CzEng 1.6 contains a considerable number of sentence pairs in which the Czech side is not Czech or the English side is not English. We identified the affected blocks of sentences using automatic language identification. Instead of providing the whole corpus for download again, we provide a Perl script that converts CzEng 1.6 to CzEng 1.7 by filtering out the affected blocks. The script works as a pipe for the plaintext and export formats of CzEng. Note that the filtering does not affect sections 98dtest and 99etest.
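To illustrate what "works as a pipe" means here, the following is a minimal Python sketch of a block-level filter. It is not the Perl script distributed with CzEng 1.7. It assumes, purely for illustration, that each input line is tab-separated with the sentence-pair ID in the first column, and that the block ID is everything before the last "-" in that ID. The names block_of and filter_blocks, and the blocklist file with one block ID per line, are hypothetical.

```python
import sys
from typing import Iterable, Iterator, Set


def block_of(pair_id: str) -> str:
    """Return the block part of a sentence-pair ID.

    Assumption for this sketch: the block ID is everything before
    the last '-' in the pair ID. The real ID layout may differ.
    """
    return pair_id.rsplit("-", 1)[0]


def filter_blocks(lines: Iterable[str], bad_blocks: Set[str]) -> Iterator[str]:
    """Yield only the lines whose block is not in bad_blocks.

    Assumption: the pair ID is the first tab-separated column.
    """
    for line in lines:
        pair_id = line.split("\t", 1)[0]
        if block_of(pair_id) not in bad_blocks:
            yield line


# Run as a pipe only when a (hypothetical) blocklist file is given:
#   python filter_sketch.py bad_blocks.txt < input.txt > output.txt
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as f:
        bad = {row.strip() for row in f if row.strip()}
    for kept in filter_blocks(sys.stdin, bad):
        sys.stdout.write(kept)
```

Because the filter reads standard input and writes standard output line by line, it can sit between a decompression step and any downstream tool without holding the corpus in memory.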

The filtering removes 4,177,894 sentence pairs, leaving 57,065,358 sentence pairs in the training sections of CzEng 1.7.

For data download, the citation reference, an explanation of the file format, and acknowledgements, see CzEng 1.6.