CzEng 1.6pre for WMT16

(Czech-English Parallel Corpus, version 1.6pre, a pre-release for WMT16)

Introduction

CzEng 1.6 will be the fifth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). The corpus is freely available for non-commercial research purposes.

For the WMT16 Translation Task and IT Translation Task, we issue a simplified pre-release. The full release will be available this year, but unfortunately too late to be used in the WMT16 translation task. For the translation task, CzEng 1.6pre is the proper version to use.

CzEng 1.6pre contains about 51 million parallel sentences (almost 0.5G Czech words and 0.6G English words) from eight different types of sources. The number of sentences and words (wc -w without tokenization) per data source is given in the following table:

Section                  Sentences  Czech words (wc -w)  English words (wc -w)
EU Legislation           5,482,959          149,342,434            171,903,496
Fiction                  6,251,728           66,792,969             76,713,014
Medical                    587,317            8,525,889              9,005,151
Navajo                      32,305              372,429                449,022
News                       206,302            3,926,368              4,360,814
Parallel Web Pages         544,539            4,793,971              5,164,096
PDFs from Web              396,463            5,745,904              6,508,720
Subtitles               36,836,991          227,166,420            278,706,498
Technical Documentation  1,085,404            6,094,407              6,820,548
Tweets                         576                8,331                  9,153
Total                   51,424,584          472,769,122            559,640,512
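
The word counts above are plain whitespace-delimited counts. As a rough sketch, assuming Python's `str.split()` is an acceptable stand-in for `wc -w` (the two may disagree on unusual whitespace characters), a comparable count can be taken as follows. The function name and the sample sentences are illustrative only:

```python
def count_words(lines):
    """Count whitespace-separated words, approximating `wc -w`.

    No tokenization is applied, so "světe!" counts as one word.
    """
    return sum(len(line.split()) for line in lines)


# Illustrative sample, not taken from the corpus.
sample = ["Dobrý den, světe!", "Jak se máte?"]
print(count_words(sample))
```

Counts taken after tokenization will be higher, because punctuation is split off into separate tokens.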

Feedback on CzEng 1.6pre

Since this is a pre-release, we welcome any comments or reports of systematic problems in CzEng data. This is a big collection from varied sources, and we do quite complex per-source cleanup, but inevitably, even regular errors can easily slip through the pipeline.

Known Issues

  • There are still some strange characters, but all text should be valid UTF-8.
  • There are occasional errors in alignment and non-parallel sentences. You may want to run your own filtering; please let us know what worked best for you.
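
As one possible starting point for such filtering, the sketch below drops pairs with an empty side, an overly long side, or a strongly mismatched length. This is a generic heuristic, not the filtering used by the CzEng authors. The function name and both thresholds are assumptions chosen for illustration and should be tuned on your own data:

```python
def looks_parallel(cs, en, max_ratio=3.0, max_words=200):
    """Heuristic check that a Czech-English pair is plausibly parallel.

    max_ratio and max_words are illustrative assumptions, not values
    recommended by the corpus authors.
    """
    n_cs, n_en = len(cs.split()), len(en.split())
    if n_cs == 0 or n_en == 0:
        return False  # one side is empty
    if n_cs > max_words or n_en > max_words:
        return False  # suspiciously long "sentence"
    # Ratio of the longer side to the shorter side, in words.
    return max(n_cs, n_en) / min(n_cs, n_en) <= max_ratio


# Illustrative pairs, not taken from the corpus.
pairs = [
    ("Dobrý den .", "Good morning ."),
    ("Ano", "This is a much longer and unrelated sentence"),
]
kept = [pair for pair in pairs if looks_parallel(*pair)]
print(len(kept))
```

A length-ratio test only catches gross misalignments. Pairs of similar length that are not translations of each other will still pass.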

Citing CzEng 1.6pre

If you make use of CzEng 1.6 data, please cite the following paper:

  • Ondřej Bojar, et al. 2016. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered. In Text,
    Speech and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings. Springer Verlag. In press.

    @inProceedings{czeng16:2016,
      author  = "Ond\v{r}ej Bojar and Ond\v{r}ej Du\v{s}ek and Tom Kocmi and
        Jind\v{r}ich Libovick\'{y} and Michal Nov\'{a}k and Martin Popel and Roman
        Sudarikov and Du\v{s}an Vari\v{s}",
      year    = "2016",
      title = "{CzEng 1.6: Enlarged Czech-English Parallel Corpus
                with Processing Tools Dockered}",
      booktitle = {Text, Speech and Dialogue: 19th International Conference, {TSD}
                  2016, Brno, Czech Republic, September 12-16, 2016, Proceedings},
      publisher = {Springer Verlag},
      venue = {Brno, Czech Republic},
      month = {September 12-16},
      note = {In press.}
    }
    
  • URL: http://ufal.mff.cuni.cz/czeng/czeng16pre/

Register

To download CzEng 1.6pre, you have to register by filling in the following form. We will send you a unique username to access the files. If you do not hear from us within a week, fill in the form again or contact us directly.

Name:
E-mail:
Institution:
Country:

I certify that I will use CzEng 1.6pre only for research and non-commercial purposes.

Download

After registration, you will receive a unique username. This username and the shared password "czeng" will be requested at the following link:

If you are not interested in CzEng sections at all, you may prefer the file that omits section IDs and is deduplicated at the level of individual sentences:

 

Brief Note on File Formats

CzEng 1.6 will follow the style of CzEng 1.0, with morphological, syntactic and deep-syntactic annotation, deduplicated at the level of documents and shuffled at the level of short sequences of consecutive sentences.

The pre-release for WMT16 is simpler: shuffled sentence pairs, deduplicated within each source domain. (In other words, there can be up to 8 copies of the same sentence pair, each labelled with a different domain.) Also, the pre-release was not subject to our sentence-level filtering, so more noise can be expected in the data.
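
Because deduplication in the pre-release is done per domain, the same sentence pair can recur under different IDs. The following is a minimal sketch of collapsing such copies, assuming the rows have already been split into (ID, Czech, English) tuples. The function name and the sample IDs are made up for illustration and do not reflect the real ID scheme:

```python
def dedup_pairs(rows):
    """Yield rows, keeping only the first occurrence of each (cs, en) text pair.

    The ID is ignored when comparing, so copies of one sentence pair
    coming from different domains collapse into a single row.
    """
    seen = set()
    for pair_id, cs, en in rows:
        key = (cs, en)
        if key in seen:
            continue
        seen.add(key)
        yield pair_id, cs, en


# Hypothetical IDs and sentences, for illustration only.
rows = [
    ("id-1", "Děkuji.", "Thank you."),
    ("id-2", "Děkuji.", "Thank you."),
    ("id-3", "Prosím.", "Please."),
]
unique_rows = list(dedup_pairs(rows))
print(len(unique_rows))
```

For the full corpus, the set of seen pairs grows large. Storing a hash of each pair instead of the text itself is one way to reduce memory use.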

Plaintext Format

The plaintext format of CzEng 1.6pre is very simple. Each line has three tab-delimited columns:

  1. Sentence pair ID
  2. Czech, not tokenized.
  3. English, not tokenized.

The ID indicates the source domain and the sentence number, which is unique in the whole corpus.
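
The three-column layout can be read with a few lines of Python. The sketch below assumes only what is stated above (one pair per line, three tab-separated fields, UTF-8 text). The function name and the sample line are illustrative, and the sample ID is a placeholder rather than a real CzEng ID:

```python
import io


def read_pairs(stream):
    """Yield (pair_id, czech, english) tuples from a tab-delimited text stream."""
    for line_no, line in enumerate(stream, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 columns, got {len(fields)}"
            )
        yield tuple(fields)


# In-memory stand-in for a corpus file; the ID is a placeholder.
sample = io.StringIO("some-id\tDobrý den, světe!\tHello, world!\n")
for pair_id, cs, en in read_pairs(sample):
    print(pair_id, cs, en, sep=" | ")
```

For a real file, pass a handle opened with `open(path, encoding="utf-8")`. If the downloaded file is gzip-compressed, `gzip.open(path, "rt", encoding="utf-8")` yields the same kind of text stream.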