CzEng 1.6pre for WMT16

(Czech-English Parallel Corpus, version 1.6pre, a pre-release for WMT16)

Introduction

CzEng 1.6 will be the fifth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL) and freely available for non-commercial research purposes.

For the WMT16 Translation Task and IT Translation Task, we are issuing a simplified pre-release. The full release will be available this year, but unfortunately too late to be used in the WMT16 Translation Task. For the translation task, CzEng 1.6pre is the proper version to use.

CzEng 1.6pre contains about 51 million parallel sentences (almost 0.5G Czech words and 0.6G English words) from eight different types of sources. The number of sentences and words (wc -w without tokenization) per data source is given in the following table:

Section	Sents	wc-w Czech	wc-w English
EU Legislation	5,482,959	149,342,434	171,903,496
Fiction	6,251,728	66,792,969	76,713,014
Medical	587,317	8,525,889	9,005,151
Navajo	32,305	372,429	449,022
News	206,302	3,926,368	4,360,814
Parallel Web Pages	544,539	4,793,971	5,164,096
PDFs from Web	396,463	5,745,904	6,508,720
Subtitles	36,836,991	227,166,420	278,706,498
Technical Documentation	1,085,404	6,094,407	6,820,548
Tweets	576	8,331	9,153
Total	51,424,584	472,769,122	559,640,512
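
The word counts above are whitespace-based (wc -w, no tokenization). A minimal Python sketch of the same kind of count is given below; it assumes the sentence pairs are already available as (Czech, English) tuples, and the function names are illustrative only, not part of any CzEng tooling. Python's str.split() and wc -w may treat some unusual whitespace characters differently, so small deviations from the table are possible.

```python
def count_words(sentence: str) -> int:
    """Count whitespace-delimited words, similar in spirit to `wc -w`."""
    return len(sentence.split())


def corpus_stats(pairs):
    """Return (sentence pairs, Czech words, English words) for an iterable of pairs."""
    sents = cs_words = en_words = 0
    for cs, en in pairs:
        sents += 1
        cs_words += count_words(cs)
        en_words += count_words(en)
    return sents, cs_words, en_words


if __name__ == "__main__":
    # Two made-up sentence pairs, not taken from the corpus.
    sample = [
        ("Dobrý den, světe!", "Hello, world!"),
        ("Jak se máš?", "How are you?"),
    ]
    print(corpus_stats(sample))
```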

Feedback on CzEng 1.6pre

Since this is a pre-release, we welcome any comments or reports of systematic problems in CzEng data. This is a big collection from varied sources, and we do quite complex per-source cleanup, but inevitably, even regular errors can easily pass through the pipeline.

Known Issues

There are still some strange characters, but all should be valid UTF-8.
There are occasional errors in alignment and non-parallel sentences. You may want to run your own filtering; please let us know what worked best for you.
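
As one possible starting point for your own filtering, here is a minimal Python sketch of a length-ratio check. This is a generic heuristic, not the filtering used by the CzEng authors; the function name and the threshold of 3.0 are assumptions chosen for illustration and should be tuned on your data.

```python
def looks_parallel(cs: str, en: str, max_ratio: float = 3.0) -> bool:
    """Heuristic check: reject empty sides and pairs with very different lengths.

    Lengths are measured in whitespace-delimited words. The default
    max_ratio is an illustrative assumption, not a recommended value.
    """
    cs_len = len(cs.split())
    en_len = len(en.split())
    if cs_len == 0 or en_len == 0:
        return False
    return max(cs_len, en_len) / min(cs_len, en_len) <= max_ratio


def filter_pairs(pairs, max_ratio: float = 3.0):
    """Yield only the (Czech, English) pairs that pass the length-ratio check."""
    for cs, en in pairs:
        if looks_parallel(cs, en, max_ratio):
            yield cs, en
```

A length ratio only catches gross misalignments; pairs of similar length that are not translations of each other will still pass.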

Citing CzEng 1.6pre

If you make use of CzEng 1.6 data, please cite the following paper:

Ondřej Bojar et al. 2016. CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered. In Text, Speech and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings. Springer Verlag. In press.

@inProceedings{czeng16:2016,
  author  = "Ond\v{r}ej Bojar and Ond\v{r}ej Du\v{s}ek and Tom Kocmi and
    Jind\v{r}ich Libovick\'{y} and Michal Nov\'{a}k and Martin Popel and Roman
    Sudarikov and Du\v{s}an Vari\v{s}",
  year    = "2016",
  title = "{CzEng 1.6: Enlarged Czech-English Parallel Corpus
            with Processing Tools Dockered}",
  booktitle = {Text, Speech and Dialogue: 19th International Conference, {TSD}
              2016, Brno, Czech Republic, September 12-16, 2016, Proceedings},
  publisher = {Springer Verlag},
  venue = {Brno, Czech Republic},
  month = {September 12-16},
  note = {In press.}
}

URL: http://ufal.mff.cuni.cz/czeng/czeng16pre/

Register

To download CzEng 1.6pre, you have to register by filling in the following form. We will send you a unique username to access the files. If you do not hear from us within a week, fill in the form again or contact us directly.

Download

After registration, you will receive a unique username. This username and the shared password "czeng" will be requested at the following link:

CzEng 1.6pre (3.1 GB)

If you are not interested in the CzEng sections at all, you may prefer the file that omits section IDs and is deduplicated at the level of individual sentences:

CzEng 1.6pre deduped ignoring sections (2.0 GB; sorted alphabetically, not shuffled!)

Brief Note on File Formats

CzEng 1.6 will follow the style of CzEng 1.0, with morphological, syntactic and deep-syntactic annotation, deduplicated at the level of documents and shuffled at the level of short sequences of consecutive sentences.

The pre-release for WMT16 is simpler: shuffled sentence pairs, deduplicated within each source domain. (In other words, there can be up to 8 copies of the same sentence pair, each labelled with a different domain.) Also, the pre-release was not subject to our sentence-level filtering, so more noise can be expected in the data.
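
If the copies across domains are unwanted, they can be collapsed by keeping only the first occurrence of each (Czech, English) pair. The Python sketch below illustrates the idea under the assumption that the pairs fit in memory as a set; it is not the procedure used to build the official deduplicated file, and the function name is hypothetical.

```python
def dedup_pairs(pairs):
    """Yield each distinct (Czech, English) pair once, in first-seen order."""
    seen = set()
    for cs, en in pairs:
        key = (cs, en)
        if key not in seen:
            seen.add(key)
            yield cs, en


if __name__ == "__main__":
    # Made-up pairs; the first two stand for the same pair seen in two domains.
    sample = [("Ahoj.", "Hi."), ("Ahoj.", "Hi."), ("Ano.", "Yes.")]
    print(list(dedup_pairs(sample)))
```

For the full corpus, storing a hash of each pair instead of the pair itself would reduce memory use, at the cost of a small risk of collisions.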

Plaintext Format

The plaintext format of CzEng 1.6pre is very simple: three tab-delimited columns containing:

Sentence pair ID
Czech, not tokenized
English, not tokenized

The ID indicates the source domain and the sentence number, which is unique in the whole corpus.
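
The three-column layout can be read with a few lines of Python. The sketch below is illustrative only: the class and function names are made up, the ID in the example is a placeholder (the exact ID syntax is not described here), and lines that do not have exactly three columns are silently skipped, which is an assumption about how you may want to handle noise.

```python
from typing import Iterable, Iterator, NamedTuple


class SentencePair(NamedTuple):
    pair_id: str   # sentence pair ID (first column)
    czech: str     # Czech sentence, not tokenized (second column)
    english: str   # English sentence, not tokenized (third column)


def parse_lines(lines: Iterable[str]) -> Iterator[SentencePair]:
    """Parse tab-delimited lines into SentencePair records.

    Lines without exactly three columns are skipped.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            continue
        yield SentencePair(*fields)


if __name__ == "__main__":
    # Placeholder ID and made-up sentences, not taken from the corpus.
    sample = ["hypothetical-id-001\tDobrý den.\tGood morning.\n"]
    for pair in parse_lines(sample):
        print(pair.pair_id, pair.czech, pair.english, sep=" | ")
```

Any iterable of lines works as input, for example a file object opened with encoding="utf-8".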

Name:
E-mail:
Institution:
Country:
