Ondřej Bojar, Zdeněk Žabokrtský
in cooperation with Miroslav Janíček, Václav Klimeš, Jana Kravalová, David Mareček, Václav Novák, Martin Popel and Jan Ptáček
CzEng 0.9 is the third release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). It is freely available for non-commercial and research purposes.
CzEng 0.9 contains 8.0 million parallel sentences (93 million English and 82 million Czech tokens) from seven different types of sources, automatically annotated at the surface and deep (a- and t-) layers of syntactic representation. The numbers of sentences and of nodes at each layer, per language and data source, are given in the following table:
Source | Sentences | English a-layer | English t-layer | Czech a-layer | Czech t-layer |
---|---|---|---|---|---|
Movie Subtitles | 3,549,367 | 26,550,305 | 16,615,991 | 22,175,284 | 16,675,187 |
EU Legislation | 1,589,036 | 31,725,089 | 19,458,544 | 28,484,512 | 19,310,396 |
Technical Documentation | 1,212,494 | 9,099,748 | 6,339,129 | 8,460,491 | 6,512,247 |
Fiction | 1,036,952 | 17,045,233 | 10,861,341 | 15,031,926 | 11,102,760 |
Parallel Web Pages | 464,522 | 4,946,552 | 3,666,149 | 4,750,757 | 3,667,297 |
News | 140,191 | 3,196,303 | 2,019,758 | 2,945,777 | 2,220,789 |
Project Navajo | 37,239 | 612,826 | 385,292 | 539,659 | 405,484 |
Total | 8,029,801 | 93,176,056 | 59,346,204 | 82,388,406 | 59,894,160 |
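As a quick sanity check on the table above, the following sketch re-adds the per-source counts and compares them with the reported totals, then derives the average number of nodes per sentence. The figures are copied from the table; the names `COUNTS`, `column_sums` and `nodes_per_sentence` are illustrative and not part of any CzEng tooling.

```python
# Counts copied from the table above, per source:
# (sentences, English a-layer, English t-layer, Czech a-layer, Czech t-layer)
COUNTS = {
    "Movie Subtitles": (3_549_367, 26_550_305, 16_615_991, 22_175_284, 16_675_187),
    "EU Legislation": (1_589_036, 31_725_089, 19_458_544, 28_484_512, 19_310_396),
    "Technical Documentation": (1_212_494, 9_099_748, 6_339_129, 8_460_491, 6_512_247),
    "Fiction": (1_036_952, 17_045_233, 10_861_341, 15_031_926, 11_102_760),
    "Parallel Web Pages": (464_522, 4_946_552, 3_666_149, 4_750_757, 3_667_297),
    "News": (140_191, 3_196_303, 2_019_758, 2_945_777, 2_220_789),
    "Project Navajo": (37_239, 612_826, 385_292, 539_659, 405_484),
}

# The "Total" row as reported in the table.
REPORTED_TOTAL = (8_029_801, 93_176_056, 59_346_204, 82_388_406, 59_894_160)


def column_sums(counts):
    """Sum each column over all data sources."""
    return tuple(sum(column) for column in zip(*counts.values()))


def nodes_per_sentence(row):
    """Average node count per sentence for each of the four layer columns."""
    sentences, *layers = row
    return tuple(nodes / sentences for nodes in layers)


if __name__ == "__main__":
    print("totals match:", column_sums(COUNTS) == REPORTED_TOTAL)
    for source, row in COUNTS.items():
        en_a, en_t, cs_a, cs_t = nodes_per_sentence(row)
        print(f"{source}: en a={en_a:.1f} t={en_t:.1f}, cs a={cs_a:.1f} t={cs_t:.1f}")
```

The per-sentence averages make the differences between sources visible at a glance, for example how much shorter subtitle sentences are than sentences from EU legislation.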
All further details about CzEng 0.9 are in the paper cited below.
If you make use of CzEng data, please make sure to cite CzEng properly. To improve the reproducibility of your results, please indicate which sections you have used for training and/or evaluation.
Ondřej Bojar and Zdeněk Žabokrtský. 2009. CzEng 0.9: Large Parallel Treebank with Rich Annotation. Prague Bulletin of Mathematical Linguistics, 92. PDF
@Article{czeng:pbml:2009,
  publicationtype = {article},
  Author = {Ond{\v{r}}ej Bojar and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}}},
  title = "{CzEng0.9: Large Parallel Treebank with Rich Annotation}",
  Journal = {Prague Bulletin of Mathematical Linguistics},
  Volume = {92},
  ISSN = {0032-6585},
  Publisher = {Charles University},
  PubAddress = {Prague},
  Year = {2009},
  note = {in print}
}
To download CzEng 0.9, you have to register by filling in the following form. Within a week we will send you a unique username to access the files.
Once you have registered and received your unique username, you will be asked for that username and the shared password "czeng" at the following links.
To simplify the download, the 100 sections of CzEng are grouped into packs of 10 sections each. CzEng 0.9 is shuffled, so you may wish to use just one of the packs as a random sample for your experiments.
File Format | Avg. Download Size | Training Sections | DevTest Sections | EvalTest Sections |
---|---|---|---|---|
TMT, rich XML | 2.0 GB each | 0* 1* 2* 3* 4* 5* 6* 7* | 8* | 9* |
Export Format, rich factored plaintext | 310 MB each | 0* 1* 2* 3* 4* 5* 6* 7* | 8* | 9* |
Plaintext, untokenized | 45 MB each | 0* 1* 2* 3* 4* 5* 6* 7* | 8* | 9* |
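The split in the table above can be expressed as a small lookup. This is a minimal sketch that assumes sections are identified by two-digit strings ("00" to "99") whose first digit is the pack number, as the `0*` ... `9*` patterns suggest; the actual file names may differ, and the function names are illustrative.

```python
def pack_of(section: str) -> int:
    """Return the pack number (0-9) of a two-digit section ID such as '42'.

    Assumption: the first digit of the section ID is its pack number.
    """
    if len(section) != 2 or not (section.isascii() and section.isdigit()):
        raise ValueError(f"expected a two-digit section ID, got {section!r}")
    return int(section[0])


def role_of(section: str) -> str:
    """Map a section to its role: packs 0-7 train, pack 8 devtest, pack 9 evaltest."""
    pack = pack_of(section)
    if pack <= 7:
        return "train"
    if pack == 8:
        return "devtest"
    return "evaltest"


def sections_for(role: str) -> list:
    """List all section IDs that belong to the given role."""
    all_sections = [f"{i:02d}" for i in range(100)]
    return [s for s in all_sections if role_of(s) == role]


if __name__ == "__main__":
    for role in ("train", "devtest", "evaltest"):
        print(role, len(sections_for(role)), "sections")
```

Recording the output of something like `sections_for("train")` alongside your results is one way to state exactly which sections were used.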
Tip for the Linux wget tool: use the flags --user=YOUR-USERNAME --password=czeng to pass the authorization check, and the flag --continue to resume an interrupted transfer.
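For scripted downloads, the wget flags mentioned above can be assembled programmatically. The sketch below only builds and prints the command line. The URL is a placeholder, not a real CzEng address; use the links you are given after registration. The helper name `wget_command` is illustrative.

```python
import shlex
from typing import List

SHARED_PASSWORD = "czeng"  # the shared password stated in the text


def wget_command(url: str, username: str) -> List[str]:
    """Build a wget argument list with authorization and resume support."""
    return [
        "wget",
        f"--user={username}",
        f"--password={SHARED_PASSWORD}",
        "--continue",  # resume an interrupted transfer
        url,
    ]


if __name__ == "__main__":
    # Placeholder URL and username: replace both with your own values.
    cmd = wget_command("https://example.org/czeng09/pack0.tar", "YOUR-USERNAME")
    print(shlex.join(cmd))
    # To actually run the download, one could call:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting issues if a username or URL contains special characters.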
Remark for WMT'10 participants: There will be no overlap between the CzEng 0.9 EvalTest data and the WMT'10 evaluation data. However, WMT'10 participants are kindly asked not to use the CzEng 0.9 EvalTest sections (pack 9) for any training purposes, so that some held-out data remains for evaluating future experiments, such as combining the outputs of different MT systems. In any case, please indicate clearly how much data and which sections of CzEng 0.9 you actually used.