CzEng 1.6

(Czech-English Parallel Corpus, version 1.6)

Update

Please note that better results can be obtained by using a filtered version of CzEng 1.6 which we released under a new label, CzEng 1.7.

Introduction

CzEng 1.6 is the fifth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). It is freely available for non-commercial research purposes. The main aim of this release is to update and enlarge the collection of sources and to provide CzEng users with all the tools needed to replicate the automatic annotation on other data.

Here we summarize which CzEng versions should be used in which shared tasks:

WMT16 Translation Task and IT Translation Task: a simplified pre-release, CzEng 1.6pre
WMT17 Translation Task: CzEng 1.6, available from this page
WMT18 Translation Task: a subset of CzEng 1.6 sentences, released under a new label, CzEng 1.7

Data

CzEng 1.6 primarily uses the same data sources as the previous versions. Most of the sources grow over time, and some can be better exploited.

Pipeline

We release CzEng 1.6 with the complete monolingual analysis pipeline (for both Czech and English) wrapped as a Docker container. The container is meant to make it easier to run the analysis pipeline provided by the Treex platform. In fact, we release two containers: ufal/treex is a container with the latest version of Treex, and ufal/czeng16 contains Treex frozen at the revision that was used to process the CzEng 1.6 data.

Citing CzEng 1.6

If you make use of CzEng 1.6 data, please cite the following paper:

  • Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, Dušan Variš:
    CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered.
    In: Lecture Notes in Computer Science, No. 9924 , Text, Speech, and Dialogue: 19th International Conference, TSD 2016 , Copyright © Springer International Publishing, Cham / Heidelberg / New York / Dordrecht / London, ISBN 978-3-319-45509-9, ISSN 0302-9743, pp. 231-238, 2016

    @inproceedings{czeng16:2016,
      title = "{CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered}",
      author = {Ond{\v{r}}ej Bojar and Ond{\v{r}}ej Du{\v{s}}ek and Tom Kocmi and
          Jind{\v{r}}ich Libovick{\'{y}} and Michal Nov{\'{a}}k and Martin Popel and
          Roman Sudarikov and Du{\v{s}}an Vari{\v{s}}},
      booktitle = "{Text, Speech, and Dialogue: 19th International Conference, {TSD} 2016}",
      series = {Lecture Notes in Computer Science},
      editor = {Petr Sojka and Ale{\v{s}} Hor{\'{a}}k and Ivan Kope{\v{c}}ek and Karel Pala},
      year = {2016},
      publisher = {Springer International Publishing},
      organization = {Masaryk University},
      address = {Cham / Heidelberg / New York / Dordrecht / London},
      venue = {Hotel Continental},
      number = {9924},
      pages = {231--238},
      isbn = {978-3-319-45509-9},
      issn = {0302-9743},
    }
    
  • URL: http://ufal.mff.cuni.cz/czeng/

To improve the reproducibility of your results, please indicate which sections you have used for training and/or evaluation.

Register

To download CzEng 1.6, you have to register by filling in the following form. We will send you a unique username to access the files. If you do not hear from us within a week, fill in the form again or contact us directly.

Name:
E-mail:
Institution:
Country:

I certify that I will use CzEng 1.6 only for research and non-commercial purposes.

Download

After registering, you will receive a unique username. This username and the shared password "czeng" will be requested at the following link:

To simplify the download, the 100 sections of CzEng are grouped into packs of 10 sections each (the dev and test sections are separate). CzEng 1.6 is shuffled, so you may wish to use just one of the packs as a random sample for your experiments.

File Format | Avg. Download Size | Training Sections | DevTest Section | EvalTest Section
Treex Format, rich XML | 20 GB each | 0* 1* 2* 3* 4* 5* 6* 7* 8* 9[0-7] | 98 (2 GB) | 99 (2 GB)
CONLL-U Format, surface syntax only, tentative | 1.1 GB each | 0* 1* 2* 3* 4* 5* 6* 7* 8* 9[0-7] | 98 (105 MB) | 99 (105 MB)
Export Format, rich factored plaintext | 1.6 GB each | 0* 1* 2* 3* 4* 5* 6* 7* 8* 9[0-7] | 98 (161 MB) | 99 (161 MB)
Plaintext, untokenized | ~400 MB each | 0* 1* 2* 3* 4* 5* 6* 7* 8* 9[0-7] | 98 (40 MB) | 99 (40 MB)

Tip for the Linux wget tool: Use the flags --user=YOUR-USERNAME --password=czeng to pass the authorization check. Use the flag --continue to resume an interrupted transfer.

Remark for WMT shared task participants (WMT16 and later): There is no intersection between CzEng 1.6 data and WMT dev and evaluation data. However, WMT shared task participants are kindly asked to use only the Training sections of CzEng and to avoid the DevTest and EvalTest sections (packs 98 and 99), so that some held-out data remain for the evaluation of future experiments. In any case, please indicate clearly how much data you actually used and from which sections of CzEng 1.6.

 

Brief Note on File Formats

CzEng 1.6 is shuffled at the level of "blocks", sequences of not more than 15 consecutive sentences from one source. The original documents thus cannot be reconstructed but some information for cross-sentence phenomena is preserved. Specifically, the Treex format of CzEng includes Czech grammatical and textual co-reference links that do span sentence boundaries.

Individual text "blocks" are combined into numbered files; each file holds about 200 sentence pairs.

Each "block" comes from one of the domains (EU Legislation, etc.) and the domain is indicated in the sentence ID.
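The structure of a sentence ID can be illustrated with a small Python sketch. It is only a guess at the pattern, inferred from the single sample ID shown in the export format description (subtitles-b2-00train-f00001-s8); the class name SentenceId, the function parse_sentence_id and the regular expression are illustrative assumptions, not part of the CzEng tools, and real IDs may deviate from this pattern.

```python
import re
from typing import NamedTuple


class SentenceId(NamedTuple):
    domain: str    # e.g. "subtitles"
    block: int     # block number within the domain
    section: int   # numeric section, e.g. 0 for "00"
    split: str     # section label as written in the ID, e.g. "train"
    file: int      # file number
    sentence: int  # sentence number within the file


# Assumed pattern: <domain>-b<block>-<section><split>-f<file>-s<sentence>.
# Inferred from one sample ID only; treat it as a hypothesis.
_ID_RE = re.compile(
    r"^(?P<domain>.+)-b(?P<block>\d+)-(?P<section>\d+)(?P<split>[a-z]+)"
    r"-f(?P<file>\d+)-s(?P<sentence>\d+)$"
)


def parse_sentence_id(sentence_id: str) -> SentenceId:
    """Split a sentence ID into its parts, or raise ValueError."""
    match = _ID_RE.match(sentence_id)
    if match is None:
        raise ValueError(f"unexpected sentence ID: {sentence_id!r}")
    return SentenceId(
        domain=match["domain"],
        block=int(match["block"]),
        section=int(match["section"]),
        split=match["split"],
        file=int(match["file"]),
        sentence=int(match["sentence"]),
    )


if __name__ == "__main__":
    print(parse_sentence_id("subtitles-b2-00train-f00001-s8"))
```

Parsing from a single anchored pattern means that any ID with a different shape is reported rather than silently misread, which seems the safer choice when the pattern itself is an assumption.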

Treex Format

The primary format of CzEng 1.6 is Treex XML as used in CzEng 1.0.

Treex XML can be processed using the Treex platform implemented in Perl and available on CPAN.

The best option for manual inspection of the data is the tree editor TrEd, which can read Treex XML using the ttred wrapper provided in Treex.

Export Format

To facilitate access to most of the automatic rich annotation of CzEng 1.6 without the XML hassle, we provide a simple "factored" line-oriented export format. Note that some annotation, e.g. named entities or co-reference links, is not available in the export format at all.

The columns of the export format are listed below, each with an explanation and a sample value.

Column 1: ID specifying the domain, block number, train/dev/test section, file number and sentence within the file.
    Sample: subtitles-b2-00train-f00001-s8
Column 2: Filter score indicating the quality of the sentence pair. A score of 1 indicates a perfect pair; pairs scoring below 0.3 are removed.
    Sample: 0.99261036

Czech

Column 3: Czech a-layer (surface-syntactic tree) in factored form: word-form|lemma|morphological-tag|index-in-sentence|index-of-governor|syntactic-function.
    Sample: Zachránil|zachránit_:W|VpYS---XR-AA---|1|0|Pred mi|já|PH-S3--1-------|2|1|Obj můj|můj_^(přivlast.)|PSYS1-S1-------|3|5|Atr milovaný|milovaný_^(*2t)|AAIS1----1A----|4|5|Atr krk|krk|NNIS1-----A----|5|1|Obj .|.|Z:----...
Column 4: Czech t-layer (tectogrammatical tree): t-lemma|functor|index-in-tree|index-of-governor|nodetype|formeme|semantic-part-of-speech|... and many detailed t-layer attributes.
    Sample: zachránit|PRED|1|0|complex|v:fin|v|-|neg0|ant|ind|decl|-|cpl|-|-|disp0|-|it0|-|-|res0|-|-|1|-|- #PersPron|ADDR|2|1|complex|n:3|n.pron.def.pers|sg|-|-|-|-|-|-|-|-|-|nr|-|1|basic|-|-|-|-|-|- ...
Column 5: Correspondence between Czech a-layer and t-layer for content words. Indexed from 0.
    Sample: 0-0 1-1 2-2 3-3 4-4
Column 6: Correspondence between Czech a-layer and t-layer for auxiliary words. Indexed from 0.
    Sample: (empty in this example)

English

Column 7: English a-layer (surface-syntactic tree) in factored form: word-form|lemma|tag|index-in-sentence|index-of-governor|syntactic-function.
    Sample: He|he|PRP|1|2|Sb saved|save|VBD|2|0|Pred my|my|PRP$|3|4|Atr ever-lovin|ever-lovin|NN|4|6|Atr '|'|''|5|6|AuxG neck|neck|NN|6|2|Obj .|.|.|7|0|AuxK
Column 8: English t-layer (tectogrammatical tree): t-lemma|functor|index-in-tree|index-of-governor|nodetype|formeme|semantic-part-of-speech|... and many detailed t-layer attributes.
    Sample: #PersPron|ACT|1|2|complex|n:subj|n.pron.def.pers|sg|-|-|-|-|-|-|-|-|-|inan|-|3|-|-|-|-|0|-|- save|PRED|2|0|complex|v:fin|v|-|neg0|ant|ind|decl|-|-|-|-|disp0|-|it0|-|-|res0|-|-|1|-|- #PersPron|APP|3|4|complex|n:poss|n.pron...
Column 9: Correspondence between English a-layer and t-layer for content words. Indexed from 0.
    Sample: 0-0 1-1 2-2 3-3 5-4
Column 10: Correspondence between English a-layer and t-layer for auxiliary words. Indexed from 0.
    Sample: 4-4

Cross-Language Alignments Between Surface Czech and English
Always indexed from 0, Czech-English.

Column 11: GIZA++ alignments "there" for cs2en.
    Sample: 0-1 1-2 2-2 3-3 4-5 5-6
Column 12: GIZA++ alignments "back" for cs2en.
    Sample: 0-0 0-1 2-2 3-3 3-4 4-5 5-6
Column 13: GIZA++ alignments symmetrized using grow-diag-final-and for cs2en.
    Sample: 0-0 0-1 1-2 2-2 3-3 3-4 4-5 5-6
Column 14: GIZA++ alignments symmetrized using grow-diag-final-and for en2cs (not the inverse of column 13).
    Sample: 0-0 0-1 1-2 2-2 3-3 3-4 4-5 5-6

Cross-Language Alignments Between T-Layer Czech and English
Always indexed from 0, Czech-English.

Column 15: T-alignment "there" for cs2en.
    Sample: 0-1 1-2 2-2 3-3 4-4
Column 16: T-alignment "back" for cs2en.
    Sample: 0-0 0-1 2-2 3-3 4-4
Column 17: Additional rule-based t-alignment linking esp. generated nodes like #PersPron.
    Sample: (empty in this example)
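As an illustration of how the factored columns might be read, here is a minimal Python sketch that parses an a-layer column (columns 3 and 7) and an alignment column (e.g. column 11), and then pairs up aligned word forms. The names ANode, parse_a_layer, parse_alignment and aligned_forms are hypothetical. The sketch assumes that tokens are separated by single spaces, that every token has exactly six "|"-separated factors, and that no factor itself contains "|"; the format description does not say how such cases would be escaped.

```python
from typing import List, NamedTuple, Tuple


class ANode(NamedTuple):
    form: str      # word form
    lemma: str
    tag: str       # morphological tag (Czech) or tag (English)
    index: int     # index in sentence; starts at 1 in the samples
    governor: int  # index of the governing node; 0 appears at the root
    afun: str      # syntactic function, e.g. "Pred", "Obj"


def parse_a_layer(column: str) -> List[ANode]:
    """Parse a factored a-layer column into a list of nodes."""
    nodes = []
    for token in column.split(" "):
        if not token:
            continue
        factors = token.split("|")
        if len(factors) != 6:
            # Assumption violated (e.g. "|" inside a word form): fail loudly.
            raise ValueError(f"expected 6 factors, got {token!r}")
        form, lemma, tag, index, governor, afun = factors
        nodes.append(ANode(form, lemma, tag, int(index), int(governor), afun))
    return nodes


def parse_alignment(column: str) -> List[Tuple[int, int]]:
    """Parse "0-1 1-2 ..." into (czech_index, english_index) pairs."""
    return [
        (int(cs), int(en))
        for cs, en in (link.split("-") for link in column.split())
    ]


def aligned_forms(cs_nodes, en_nodes, links):
    """Pair up word forms; alignment indices are 0-based list positions."""
    return [(cs_nodes[i].form, en_nodes[j].form) for i, j in links]


if __name__ == "__main__":
    english = parse_a_layer(
        "He|he|PRP|1|2|Sb saved|save|VBD|2|0|Pred my|my|PRP$|3|4|Atr"
    )
    for node in english:
        print(node.index, node.form, "<-", node.governor, node.afun)
```

Note that the alignment columns are indexed from 0, whereas the index-in-sentence factor in the samples starts at 1, so the alignment indices are used here directly as positions in the parsed node lists.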

Plaintext Format

The plaintext format is very simple. Each line has four tab-delimited columns:

  1. Sentence pair ID
  2. Filter score
  3. Czech, not tokenized.
  4. English, not tokenized.
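The four columns above can be read with a few lines of Python. The following is a minimal sketch, assuming UTF-8 text with one sentence pair per line; the names SentencePair and read_plaintext, the optional min_score parameter and the demo lines are illustrative assumptions and do not come from the CzEng distribution.

```python
import io
from typing import Iterable, Iterator, NamedTuple


class SentencePair(NamedTuple):
    pair_id: str
    score: float  # filter score
    czech: str    # untokenized Czech sentence
    english: str  # untokenized English sentence


def read_plaintext(
    lines: Iterable[str], min_score: float = 0.0
) -> Iterator[SentencePair]:
    """Yield sentence pairs whose filter score is at least min_score."""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        columns = line.split("\t")
        if len(columns) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 columns, got {len(columns)}"
            )
        pair_id, score, czech, english = columns
        pair = SentencePair(pair_id, float(score), czech, english)
        if pair.score >= min_score:
            yield pair


if __name__ == "__main__":
    # Made-up demo lines, not taken from the corpus.
    demo = io.StringIO(
        "demo-b1-00train-f00001-s1\t0.95\tAhoj.\tHello.\n"
        "demo-b1-00train-f00001-s2\t0.40\tDobrý den.\tGood day.\n"
    )
    for pair in read_plaintext(demo, min_score=0.5):
        print(pair.pair_id, pair.czech, "|||", pair.english)
```

Because the function accepts any iterable of lines, it can be fed an open file object as well as an in-memory buffer; a stricter min_score can be used to keep only higher-scoring pairs.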

Acknowledgment

We gratefully acknowledge support from: