CzEng 1.0

(Czech-English Parallel Corpus, version 1.0)

Ondřej Bojar, Zdeněk Žabokrtský
in cooperation with
Ondřej Dušek, Petra Galuščáková, Martin Majliš, David Mareček,
Jiří Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna

Introduction

CzEng 1.0 is the fourth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). It is freely available for non-commercial research purposes.

CzEng 1.0 contains 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources, automatically annotated at the surface and deep (a- and t-) layers of syntactic representation. The following table gives the number of sentences per data source and the number of nodes at each layer in each language:

Source Domain	Sentences	Czech a-layer	Czech t-layer	English a-layer	English t-layer
Fiction	4,335,183	57,176,714	41,142,003	64,264,229	38,690,193
EU Legislation	3,992,551	78,022,137	56,446,093	87,488,870	52,717,871
Movie Subtitles	3,076,887	19,571,940	14,614,899	23,353,842	14,917,777
Parallel Web Pages	1,883,804	30,891,696	23,140,879	35,454,657	22,057,255
Technical Documentation	1,613,297	16,015,336	11,941,650	16,836,098	11,207,157
News	201,103	4,280,039	3,207,858	4,736,751	2,963,451
Project Navajo	33,301	484,453	363,317	556,702	343,649
Total	15,136,126	206,442,315	150,856,699	232,691,149	142,897,353

All further details about CzEng 1.0 will be available in a paper (in preparation). The core design is similar to CzEng 0.9; see the paper cited below. The data formats have been slightly updated since CzEng 0.9; see the summary below.

Citing CzEng 1.0

If you make use of CzEng 1.0 data, please cite the following paper:

Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey. PDF

@inProceedings{czeng10:lrec2012,
 author = {
   Ond{\v{r}}ej Bojar and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}} and
   Ond{\v{r}}ej Du{\v{s}}ek and Petra Galu{\v{s}}{\v{c}}{\'{a}}kov{\'{a}} and
   Martin Majli{\v{s}} and David Mare{\v{c}}ek and
   Ji{\v{r}}{\'{\i}} Mar{\v{s}}{\'{\i}}k and Michal Nov{\'{a}}k and
   Martin Popel and Ale{\v{s}} Tamchyna
 },
 title = "{The Joy of Parallelism with CzEng 1.0}",
 booktitle = {Proceedings of LREC2012},
 organization = {ELRA},
 address = {Istanbul, Turkey},
 month = may,
 publisher = {European Language Resources Association},
 year = {2012},
 url = {http://www.lrec-conf.org/proceedings/lrec2012/pdf/645_Paper.pdf}
}

URL: http://ufal.mff.cuni.cz/czeng/

To improve the reproducibility of your results, please indicate which sections you have used for training and/or evaluation.

Register

To download CzEng 1.0, you have to register by filling in the following form. Within a week we will send you a unique username to access the files.

Download

After registering, you will receive a unique username. That username and the shared password "czeng" will be requested at the following links.

To simplify the download, the 100 sections of CzEng are grouped into packs of 10 sections each (the dev and test sections are separate). CzEng 1.0 is shuffled, so you may wish to use just one of the packs for your experiments as a random sample.

File Format	Avg. Download Size	Training Sections	DevTest Section	EvalTest Section
Treex Format, rich XML	4.7 GB each	0* 1* 2* 3* 4* 5* 6* 7* 8* 9[0-7]	98 (~480 MB)	99 (~480 MB)
Export Format, rich factored plaintext	900 MB each	0* 1* 2* 3* 4* 5* 6* 7* 8* 9[0-7]	98 (88 MB)	99 (88 MB)
Plaintext, untokenized	115 MB each	0* 1* 2* 3* 4* 5* 6* 7* 8* 9[0-7]	98 (12 MB)	99 (12 MB)

Tip for Linux wget tool: Use the flags --user=YOUR-USERNAME --password=czeng to pass the authorization check. Use the flag --continue to continue an interrupted transfer.
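
For scripted downloads, the wget flags above can be assembled programmatically. The following is only a sketch: the helper name build_wget_command and the placeholder URL are made up for illustration, and the real pack links must be taken from the download table.

```python
import shlex


def build_wget_command(url, username, password="czeng", resume=True):
    """Return the wget argument list for fetching one CzEng pack.

    Only the flags mentioned in the tip are used:
    --user, --password and (optionally) --continue.
    """
    cmd = ["wget", f"--user={username}", f"--password={password}"]
    if resume:
        cmd.append("--continue")  # resume an interrupted transfer
    cmd.append(url)
    return cmd


# Hypothetical URL, not a real CzEng link.
PLACEHOLDER_URL = "https://example.org/path/to/pack.tar"

# Print the command for inspection; pass the list to subprocess.run()
# to actually start the download.
print(shlex.join(build_wget_command(PLACEHOLDER_URL, "YOUR-USERNAME")))
```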

Remark for WMT shared task participants (WMT12 and later): There is no intersection between CzEng 1.0 data and WMT dev and evaluation data. However, WMT shared task participants are kindly asked to use only the Training sections of CzEng and avoid the DevTest and EvalTest sections (packs 98 and 99), so that some held-out data remain for the evaluation of future experiments. In any case, please indicate clearly how much data, and from which sections of CzEng 1.0, you ultimately used.
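
The split into training, DevTest and EvalTest sections described above can be written down as a small helper, for instance to guard a training script against touching held-out data. This is a sketch only; the function names section_role and pack_label are illustrative and not part of any CzEng tooling.

```python
def section_role(section):
    """Map a section number (0-99) to its role in CzEng 1.0."""
    if not 0 <= section <= 99:
        raise ValueError(f"section out of range: {section}")
    if section == 98:
        return "devtest"
    if section == 99:
        return "evaltest"
    return "train"  # sections 00-97


def pack_label(section):
    """Return the pack label as written in the download table."""
    if section_role(section) != "train":
        return f"{section:02d}"  # 98 and 99 are distributed separately
    tens = section // 10
    # The last training pack holds only sections 90-97.
    return "9[0-7]" if tens == 9 else f"{tens}*"


# List the sections that may be used for training.
training_sections = [s for s in range(100) if section_role(s) == "train"]
print(len(training_sections), "training sections")
```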

Brief Note on File Formats

CzEng 1.0 is shuffled at the level of "blocks", sequences of not more than 15 consecutive sentences from one source. The original documents thus cannot be reconstructed but some information for cross-sentence phenomena is preserved. Specifically, the Treex format of CzEng includes Czech grammatical and textual co-reference links that do span sentence boundaries.

Individual text "blocks" are combined into numbered files; each file holds about 200 sentence pairs.

Each "block" comes from one of the domains (EU Legislation, etc.) and the domain is indicated in the sentence ID.
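
A sentence ID can be split into its parts with a regular expression. The pattern below is an assumption generalized from the single sample ID shown in the export format table (subtitles-b2-00train-f00001-s8); IDs from other domains may be spelled differently, so treat this as a sketch rather than a specification.

```python
import re
from typing import NamedTuple


class SentenceId(NamedTuple):
    domain: str    # e.g. "subtitles"
    block: int     # block number
    section: int   # section number, 0-99
    split: str     # e.g. "train"
    file: int      # file number
    sentence: int  # sentence number within the file


# Assumed layout: <domain>-b<block>-<section><split>-f<file>-s<sentence>
_ID_RE = re.compile(
    r"^(?P<domain>.+)-b(?P<block>\d+)-(?P<section>\d{2})(?P<split>[a-z]+)"
    r"-f(?P<file>\d+)-s(?P<sentence>\d+)$"
)


def parse_sentence_id(sentence_id):
    """Split a CzEng sentence ID into its components."""
    m = _ID_RE.match(sentence_id)
    if m is None:
        raise ValueError(f"unrecognized sentence ID: {sentence_id!r}")
    return SentenceId(
        m["domain"], int(m["block"]), int(m["section"]),
        m["split"], int(m["file"]), int(m["sentence"]),
    )


print(parse_sentence_id("subtitles-b2-00train-f00001-s8"))
```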

Treex Format

The primary format of CzEng is Treex XML, a successor of the TMT format of TectoMT used in CzEng 0.9.

Treex XML can be processed using the Treex platform implemented in Perl and available on CPAN.

The best option for manual inspection of the data is the tree editor TrEd, which can read Treex XML using the ttred wrapper provided in Treex.

Export Format

To facilitate access to most of the automatic rich annotation of CzEng 1.0 without the XML hassle, we provide a simple "factored" line-oriented export format. Note that some annotation, e.g. named entities and co-reference links, is not available in the export format at all.

Column	Sample	Explanation
1	subtitles-b2-00train-f00001-s8	ID specifying the domain, block number, train/dev/test section, file number and sentence within the file.
2	0.99261036	Filter score indicating the quality of the sentence pair. A score of 1 indicates a perfect pair; pairs scoring below 0.3 were removed.
Czech
3	Zachránil|zachránit_:W|VpYS---XR-AA---|1|0|Pred mi|já|PH-S3--1-------|2|1|Obj můj|můj_^(přivlast.)|PSYS1-S1-------|3|5|Atr milovaný|milovaný_^(*2t)|AAIS1----1A----|4|5|Atr krk|krk|NNIS1-----A----|5|1|Obj .|.|Z:----...	Czech a-layer (surface-syntactic tree) in factored form: word-form|lemma|morphological-tag|index-in-sentence|index-of-governor|syntactic-function.
4	zachránit|PRED|1|0|complex|v:fin|v|-|neg0|ant|ind|decl|-|cpl|-|-|disp0|-|it0|-|-|res0|-|-|1|-|- #PersPron|ADDR|2|1|complex|n:3|n.pron.def.pers|sg|-|-|-|-|-|-|-|-|-|nr|-|1|basic|-|-|-|-|-|- ...	Czech t-layer (tectogrammatical tree): t-lemma|functor|index-in-tree|index-of-governor|nodetype|formeme|semantic-part-of-speech|... and many detailed t-layer attributes.
5	0-0 1-1 2-2 3-3 4-4	Correspondence between Czech a-layer and t-layer for content words. Indexed from 0.
6		Correspondence between Czech a-layer and t-layer for auxiliary words. Indexed from 0.
English
7	He|he|PRP|1|2|Sb saved|save|VBD|2|0|Pred my|my|PRP$|3|4|Atr ever-lovin|ever-lovin|NN|4|6|Atr '|'|''|5|6|AuxG neck|neck|NN|6|2|Obj .|.|.|7|0|AuxK	English a-layer (surface-syntactic tree) in factored form: word-form|lemma|tag|index-in-sentence|index-of-governor|syntactic-function.
8	#PersPron|ACT|1|2|complex|n:subj|n.pron.def.pers|sg|-|-|-|-|-|-|-|-|-|inan|-|3|-|-|-|-|0|-|- save|PRED|2|0|complex|v:fin|v|-|neg0|ant|ind|decl|-|-|-|-|disp0|-|it0|-|-|res0|-|-|1|-|- #PersPron|APP|3|4|complex|n:poss|n.pron...	English t-layer (tectogrammatical tree): t-lemma|functor|index-in-tree|index-of-governor|nodetype|formeme|semantic-part-of-speech|... and many detailed t-layer attributes.
9	0-0 1-1 2-2 3-3 5-4	Correspondence between English a-layer and t-layer for content words. Indexed from 0.
10	4-4	Correspondence between English a-layer and t-layer for auxiliary words. Indexed from 0.
Cross-Language Alignments Between Surface Czech and English
Always indexed from 0, Czech-English.
11	0-1 1-2 2-2 3-3 4-5 5-6	GIZA++ alignments "there" for cs2en.
12	0-0 0-1 2-2 3-3 3-4 4-5 5-6	GIZA++ alignments "back" for cs2en.
13	0-0 0-1 1-2 2-2 3-3 3-4 4-5 5-6	GIZA++ alignments symmetrized using grow-diag-final-and for cs2en.
14	0-0 0-1 1-2 2-2 3-3 3-4 4-5 5-6	GIZA++ alignments symmetrized using grow-diag-final-and for en2cs (not the inverse of column 13).
Cross-Language Alignments Between T-Layer Czech and English
Always indexed from 0, Czech-English.
15	0-1 1-2 2-2 3-3 4-4	T-alignment "there" for cs2en.
16	0-0 0-1 2-2 3-3 4-4	T-alignment "back" for cs2en.
17		Additional rule-based t-alignment linking esp. generated nodes like #PersPron.
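
The a-layer and alignment columns of the table above can be unpacked with a few lines of code. The sketch below assumes that nodes within a column are separated by spaces, that factors within a node are separated by a plain vertical bar, and that no word form itself contains a space or a vertical bar; none of this is confirmed beyond the samples shown. The names ANode, parse_a_layer and parse_alignment are illustrative.

```python
from typing import NamedTuple


class ANode(NamedTuple):
    form: str      # word form
    lemma: str
    tag: str       # morphological tag
    index: int     # index in sentence
    governor: int  # index of governor (0 in the samples marks the root)
    function: str  # syntactic function


def parse_a_layer(column):
    """Parse an a-layer column (column 3 or 7) into a list of nodes."""
    nodes = []
    for token in column.split():
        form, lemma, tag, index, governor, function = token.split("|")
        nodes.append(ANode(form, lemma, tag, int(index), int(governor), function))
    return nodes


def parse_alignment(column):
    """Parse an alignment column such as '0-1 1-2' into index pairs.

    An empty column (as in the samples for columns 6 and 17) gives [].
    """
    return [tuple(int(i) for i in pair.split("-")) for pair in column.split()]


english = ("He|he|PRP|1|2|Sb saved|save|VBD|2|0|Pred my|my|PRP$|3|4|Atr "
           "ever-lovin|ever-lovin|NN|4|6|Atr '|'|''|5|6|AuxG "
           "neck|neck|NN|6|2|Obj .|.|.|7|0|AuxK")
for node in parse_a_layer(english):
    print(node.index, node.form, node.function, "<-", node.governor)

print(parse_alignment("0-1 1-2 2-2 3-3 4-5 5-6"))
```

Note that the a-layer node indices in the samples start at 1, while the alignment columns are indexed from 0, so an alignment index i refers to the node at list position i.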

Plaintext Format

The plaintext format is very simple: four tab-delimited columns containing:

Sentence pair ID
Filter score
Czech, not tokenized.
English, not tokenized.
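
Reading the four columns listed above takes only a few lines. The sketch below works on any iterable of text lines and optionally keeps only pairs at or above a chosen filter score. The names read_plaintext and SentencePair are illustrative, and the demo line is invented, not taken from the corpus.

```python
import io
from typing import NamedTuple


class SentencePair(NamedTuple):
    pair_id: str
    score: float  # filter score
    czech: str    # untokenized Czech
    english: str  # untokenized English


def read_plaintext(lines, min_score=0.0):
    """Yield sentence pairs from plaintext-format lines.

    Pairs with a filter score below min_score are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        pair_id, score, czech, english = line.split("\t")
        if float(score) >= min_score:
            yield SentencePair(pair_id, float(score), czech, english)


# Invented demo data standing in for an opened corpus file.
demo = io.StringIO(
    "demo-b1-00train-f00001-s1\t0.95\tDobrý den.\tHello.\n"
    "demo-b1-00train-f00001-s2\t0.40\tDěkuji.\tThank you.\n"
)
pairs = list(read_plaintext(demo, min_score=0.9))
for pair in pairs:
    print(pair.pair_id, pair.czech, "|||", pair.english)
```

To read a real file, pass an open text-mode file object (opened with encoding="utf-8") in place of the demo buffer.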

Online access via PML-TQ

A sample of the first 151 212 sentences (section 00train) is indexed via the PML-TQ service and you can query it here. See the documentation for PML-TQ syntax. A very simple query a-node [lemma="go"] will find all occurrences of the word go. Note the gray line with the filename "f00086.treex.gz (96/213)": by clicking on the arrows (< and >) you can browse to the previous/following sentence.

Sample sentence

Sample sentence annotated on the a-layer and t-layer (with named entities marked in the n-tree).

en: Shaw cursed himself under his breath for not starting here.

cs: Shaw se v duchu na sebe zlobil, že nezačal právě tady.

Image of a sample tree

Acknowledgement

CzEng 1.0 release was funded by the grants EuroMatrix Plus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic), Faust (FP7-ICT-2009-4-247762 of the EU and 7E11041 of the Ministry of Education, Youth and Sports of the Czech Republic), GAČR P406/10/P259, GAUK 116310, and GAUK 4226/2011.
