CzEng 2.0 is the sixth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). It is freely available for non-commercial research purposes. The main aim of the current release is to filter and enlarge the collection of parallel sentences.
Here we summarize which CzEng versions should be used in which shared tasks:
WMT16 Translation Task and IT Translation Task | a simplified pre-release, CzEng 1.6pre |
WMT17 Translation Task | CzEng 1.6 |
WMT18 Translation Task | a subset of CzEng 1.6 sentences, released under a new label: CzEng 1.7 |
WMT19 Translation Task | CzEng 1.7 |
WMT20 Translation Task | a release of CzEng 2.0 available from this page. |
CzEng 2.0 is composed of authentic and synthetic parallel data.
The authentic part contains filtered CzEng 1.6 [6] (train+dtest sections) and the following additional resources, which we downloaded from WMT 2020: Europarl, Paracrawl, Common Crawl, News Commentary, Tilde MODEL, Wiki Titles, and WikiMatrix.
The synthetic part contains Czech and English news crawl data translated with CUNI-TRANSFORMER systems [3].
If you want a smaller and cleaner corpus, you may consider:
- further filtering (sentence level or document level) based on the provided scores;
- removing noisier sources, e.g. Paracrawl and WikiMatrix (information about the source is encoded in the ID).
Each file contains six tab-separated columns. All three scores lie between 0 and 1, and higher values mean better scores (cleaner sentence pairs). Documents are separated by empty lines. All the data are deduplicated at the document level and shuffled.
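As a rough illustration of the filtering options mentioned above, the following Python sketch streams a CzEng file document by document and drops sentence pairs from noisier sources or below a score threshold. The column positions (`ID_COL`, `ADQ_COL`), the threshold of 0.5, and the assumption that source names such as "paracrawl" appear as substrings of the ID are all hypothetical; consult the README for the actual column order and ID format.

```python
import gzip
from typing import Iterator, List, Sequence

# Assumed 0-based column positions; NOT confirmed by this page, check the README.
ID_COL = 0
ADQ_COL = 1


def read_documents(path: str) -> Iterator[List[List[str]]]:
    """Yield one document at a time as a list of tab-split rows.

    Documents are separated by empty lines, as described above.
    """
    doc: List[List[str]] = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                # An empty line closes the current document.
                if doc:
                    yield doc
                    doc = []
                continue
            doc.append(line.split("\t"))
    if doc:
        yield doc


def keep_row(
    row: Sequence[str],
    min_adq: float = 0.5,  # illustrative threshold, not a recommended value
    noisy_sources: Sequence[str] = ("paracrawl", "wikimatrix"),
) -> bool:
    """Return True if the sentence pair passes the source and score filters."""
    source_id = row[ID_COL].lower()
    # Assumption: the source name occurs as a substring of the ID.
    if any(name in source_id for name in noisy_sources):
        return False
    return float(row[ADQ_COL]) >= min_adq
```

For document-level filtering, one option is to keep a document only if all (or a chosen fraction) of its rows pass `keep_row`; which variant works better is likely to depend on the downstream task.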
For the synthetic data (csmono and enmono), we set adq_score to 1.0 for all sentences. For the authentic data (train and test), we computed adq_score using conditional cross-entropies (without word-normalization) predicted by the CUNI-TRANSFORMER [3] model:
adq_score = exp(-(|HA - HB| + (HA + HB)/2))
HA = -log P(en|cs)
HB = -log P(cs|en)
After document-level deduplication, we deleted:
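The adq_score formula given above can be sketched in Python as follows. This is only an illustration of the arithmetic, assuming natural logarithms and that the two cross-entropies HA and HB have already been obtained from a translation model; the function and parameter names are hypothetical and not part of any released tool.

```python
import math


def adq_score(h_a: float, h_b: float) -> float:
    """Compute adq_score from two conditional cross-entropies.

    h_a = -log P(en|cs) and h_b = -log P(cs|en), not normalized by length.
    Assumes natural logarithms, so both values are >= 0 and the result
    lies in (0, 1].
    """
    disagreement = abs(h_a - h_b)  # penalizes the two directions disagreeing
    average = (h_a + h_b) / 2      # penalizes both directions being unlikely
    return math.exp(-(disagreement + average))
```

The score reaches 1 only when both cross-entropies are 0, and it falls as either the average cross-entropy or the gap between the two directions grows.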
@article{kocmi2020announcing,
  title={Announcing CzEng 2.0 Parallel Corpus with over 2 Gigawords},
  author={Tom Kocmi and Martin Popel and Ondrej Bojar},
  year={2020},
  journal={arXiv preprint arXiv:2007.03006},
}
To improve the reproducibility of your results, please indicate which sections you used for training and/or evaluation.
To download CzEng 2.0, you have to register by filling in the following form. We will send you a unique username to access the files. If you do not hear from us within a week, fill in the form again or contact us directly.
After registration, you will receive a unique username. That username and the shared password "czeng" will be requested at the following link:
Download | Description | Sentence pairs | Czech words | English words |
README | Readme file with instructions | - | - | - |
czeng20-train.gz [4.4G] | Authentic training set | 61M | 617M | 702M |
czeng20-test.gz [0.02G] | Authentic testing set | 0.5M | 4M | 5M |
czeng20-csmono.gz [4.4G] | Czech mono with English synthetic | 51M | 700M | 833M |
czeng20-enmono.gz [7.7G] | English mono with Czech synthetic | 76M | 1296M | 1474M |
Tip for the Linux wget tool: use the flags --user=YOUR-USERNAME --password=czeng to pass the authorization check, and the flag --continue to resume an interrupted transfer.
Remark for WMT shared task participants (WMT16 and later): There is no intersection between CzEng 1.6 data and WMT dev and evaluation data. However, WMT shared task participants are kindly asked to use only the Training sections of CzEng and to avoid the Test section, so that some held-out data remain for the evaluation of future experiments. In any case, please indicate clearly how much data you actually used and from which sections of CzEng 2.0.
[1] http://data.statmt.org/news-crawl/cs-doc/
[2] http://data.statmt.org/news-crawl/en-doc/
[3] Martin Popel. "CUNI Transformer Neural MT System for WMT18" (2018). https://www.aclweb.org/anthology/W18-6424/
[4] Marcin Junczys-Dowmunt. "Dual conditional cross-entropy filtering of noisy parallel corpora." (2018). https://www.aclweb.org/anthology/W18-6478/
[5] https://fasttext.cc/blog/2017/10/02/blog-post.html
[6] Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, Dušan Variš. "CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered." http://link.springer.com/chapter/10.1007/978-3-319-45510-5_27
We gratefully acknowledge support from:
CzEng 2.0 contains data from previous releases of CzEng that have been supported by: