CzEng 1.6 will be the fifth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL) and freely available for non-commercial research purposes.
For the WMT16 Translation Task and IT Translation Task, we are issuing a simplified pre-release, CzEng 1.6pre. The full release will be available this year, but unfortunately too late to be used in the WMT16 translation task. For the translation task, CzEng 1.6pre is the proper version to use.
CzEng 1.6pre contains about 51 million parallel sentences (almost 0.5 billion Czech words and 0.6 billion English words) from eight different types of sources. The number of sentences and words (wc -w without tokenization) per data source is given in the following table:
Section | Sentences | Czech words (wc -w) | English words (wc -w) |
---|---|---|---|
EU Legislation | 5,482,959 | 149,342,434 | 171,903,496 |
Fiction | 6,251,728 | 66,792,969 | 76,713,014 |
Medical | 587,317 | 8,525,889 | 9,005,151 |
Navajo | 32,305 | 372,429 | 449,022 |
News | 206,302 | 3,926,368 | 4,360,814 |
Parallel Web Pages | 544,539 | 4,793,971 | 5,164,096 |
PDFs from Web | 396,463 | 5,745,904 | 6,508,720 |
Subtitles | 36,836,991 | 227,166,420 | 278,706,498 |
Technical Documentation | 1,085,404 | 6,094,407 | 6,820,548 |
Tweets | 576 | 8,331 | 9,153 |
Total | 51,424,584 | 472,769,122 | 559,640,512 |
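For readers who want to reproduce counts of this kind on their own copy of the data, the following is a minimal Python sketch of whitespace-based counting in the spirit of wc -w. The function name, the column indices and the sample lines are illustrative assumptions, not part of the CzEng distribution. Python's `str.split()` treats any Unicode whitespace as a separator, so the numbers may differ slightly from those of wc -w under some locales.

```python
from typing import Iterable, Tuple


def count_sentences_and_words(
    lines: Iterable[str], cs_col: int, en_col: int
) -> Tuple[int, int, int]:
    """Return (sentence pairs, Czech words, English words).

    cs_col and en_col are the zero-based indices of the Czech and
    English columns; they are assumptions supplied by the caller.
    """
    sents = cs_words = en_words = 0
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        sents += 1
        # split() with no argument splits on runs of whitespace,
        # i.e. no tokenization beyond what wc -w would do.
        cs_words += len(cols[cs_col].split())
        en_words += len(cols[en_col].split())
    return sents, cs_words, en_words


# Invented sample lines with a hypothetical (id, Czech, English) layout.
sample = [
    "id-1\tDobrý den .\tGood morning .\n",
    "id-2\tDěkuji\tThank you\n",
]
print(count_sentences_and_words(sample, cs_col=1, en_col=2))
```

On a real file, the same function can be fed an open file handle (opened with `encoding="utf-8"`) instead of the sample list.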
Since this is a pre-release, we will gladly receive any comments or reports of systematic problems in CzEng data. This is a big collection from varied sources, and although we do quite complex per-source cleanup, even regular errors can inevitably slip through the pipeline.
If you make use of CzEng 1.6 data, please cite the following paper:
Ondřej Bojar et al. 2016. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered. In Text, Speech and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings. Springer Verlag. In press.
@inProceedings{czeng16:2016, author = "Ond\v{r}ej Bojar and Ond\v{r}ej Du\v{s}ek and Tom Kocmi and Jind\v{r}ich Libovick\'{y} and Michal Nov\'{a}k and Martin Popel and Roman Sudarikov and Du\v{s}an Vari\v{s}", title = "{CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered}", booktitle = {Text, Speech and Dialogue: 19th International Conference, {TSD} 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings}, publisher = {Springer Verlag}, venue = {Brno, Czech Republic}, month = {September 12-16}, year = {2016}, note = {In press.} }
To download CzEng 1.6pre, you have to register by filling in the following form. We will send you a unique username to access the files. If you do not hear from us within a week, fill in the form again or contact us directly.
After registration, you will receive a unique username. This username and the shared password "czeng" will be requested at the following link:
If you are not interested in CzEng sections at all, you may prefer the file that has no section IDs and is deduplicated at the level of individual sentences:
CzEng 1.6 will follow the style of CzEng 1.0, with morphological, syntactic and deep-syntactic annotation, deduplicated at the level of documents and shuffled at the level of short sequences of consecutive sentences.
The pre-release for WMT16 is simpler: shuffled sentence pairs, deduplicated within each source domain. (In other words, there can be up to 8 copies of the same sentence pair, each labelled with a different domain.) Also, the pre-release was not subject to our sentence-level filtering, so more noise can be expected in the data.
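Because deduplication in the pre-release is done only within each source domain, users who do not need the domain labels may want to collapse the remaining cross-domain copies themselves. The following Python sketch shows one way to do that, assuming each record has already been parsed into a (domain, Czech, English) triple. That layout, the function name and the sample data are illustrative assumptions, not the official CzEng tooling.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record layout: (domain label, Czech sentence, English sentence).
Pair = Tuple[str, str, str]


def dedup_across_domains(pairs: Iterable[Pair]) -> Iterator[Tuple[str, str]]:
    """Yield each (Czech, English) pair once, ignoring the domain label.

    The first occurrence is kept and the input order is preserved.
    """
    seen = set()
    for _domain, cs, en in pairs:
        key = (cs, en)
        if key not in seen:
            seen.add(key)
            yield key


# Invented sample: the same pair appears under two different domains.
sample = [
    ("subtitles", "Ano.", "Yes."),
    ("fiction", "Ano.", "Yes."),
    ("news", "Ne.", "No."),
]
unique = list(dedup_across_domains(sample))
print(unique)
```

For the full corpus of about 51 million pairs, keeping every pair in an in-memory set may be too costly. Storing a fixed-size hash of each pair instead, or sorting the file externally, are possible alternatives.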
The plaintext format of CzEng 1.6pre is very simple. Each line has three tab-delimited columns containing: