CzEng 0.5 (http://ufal.mff.cuni.cz/czeng/) is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics, Charles University, Prague in 2005-2006. The corpus contains no manual annotation. It is limited to texts that were already available in electronic form and that are not protected by authors' rights in the Czech Republic. The main purpose of the corpus is to support Czech-English and English-Czech machine translation research with the necessary data. CzEng 0.5 is available free of charge for educational and research purposes; however, users should first become acquainted with the license agreement (http://ufal.mff.cuni.cz/czeng/license.html).
CzEng 0.5 consists of a large set of parallel textual documents, mainly from the fields of European law, information technology, and fiction. All of them have been converted into a uniform XML-based file format and provided with automatic sentence alignment. In total, the corpus contains 7,743 document pairs. Full details on the corpus size are given in the table below.
Full CzEng 0.5 data are available in the following package:
We have used texts from the following publicly available sources:
Acquis Communautaire Parallel Corpus (prefix celex) available at http://wt.jrc.it/lt/Acquis/. It contains a huge body of EU legislative texts written between the 1950s and 2005 (CzEng uses only two of the 20 languages covered by the Acquis Communautaire Corpus).
Corpus OPUS available at http://logos.uio.no/opus/. It is an open source collection of freely available corpora; two of them are used in CzEng:
EU constitution proposal (prefix euconst)
KDE documentation (prefix kde)
Samples from the Official Journal of the European Union (prefix eujournal) available at http://europa.eu.int/eur-lex/lex/JOIndex.do?ihmlang=en. This is a tiny collection of rather randomly chosen issues of the Official Journal of the European Union.
Reader's Digest stories from two sources:
Stories available as a part of Prague Czech-English Dependency Treebank (PCEDT) (prefix pcedt-rd). For more information on PCEDT, see http://ufal.mff.cuni.cz/pcedt/.
Additional Reader's Digest stories (prefix rd2).
Parallel corpus Kačenka (prefix kacenka) available at http://www.phil.muni.cz/angl/kacenka/kachna.html. Because of the authors' rights, CzEng 0.5 can include only its subset, namely the following books: D. H. Lawrence: Sons and Lovers / Synové a milenci, Ch. Dickens: The Pickwick Papers / Pickwickovci, Ch. Dickens: Oliver Twist, T. Hardy: Jude the Obscure / Neblahý Juda, T. Hardy: Tess of the d'Urbervilles / Tess z d'Urbervillu.
E-books (prefix books) freely available on the Internet both in English and Czech (especially at http://www.gutenberg.org and http://www.palmknihy.cz), namely: Jack London: The Star Rover / Tulák po hvězdách, Franz Kafka: Trial / Proces, E.A. Poe: The Narrative of Arthur Gordon Pym of Nantucket / Dobrodružství A.G.Pyma, E.A. Poe: A Descent into the Maelstrom / Pád do Malströmu, Jerome K. Jerome: Three Men in a Boat / Tři muži ve člunu.
The quantitative properties of the individual sources (after performing the necessary preprocessing, as described in the next section) are summarized in the following table:
Source | Document pairs | Sentences (Czech) | Sentences (English) | Words+Punctuation (Czech) | Words+Punctuation (English) |
---|---|---|---|---|---|
Total | 7,743 (100.0%) | 1,418,721 (100.0%) | 1,295,647 (100.0%) | 18,517,624 (100.0%) | 20,994,274 (100.0%) |
Acquis Communautaire | 6,272 (81.0%) | 1,101,610 (77.6%) | 930,626 (71.8%) | 14,619,572 (78.9%) | 16,079,043 (76.6%) |
European Constitution | 47 (0.6%) | 11,506 (0.8%) | 10,380 (0.8%) | 138,853 (0.7%) | 176,096 (0.8%) |
Samples from European Journal | 8 (0.1%) | 5,777 (0.4%) | 4,993 (0.4%) | 104,560 (0.6%) | 133,136 (0.6%) |
Reader's Digest | 927 (12.0%) | 121,203 (8.5%) | 128,305 (9.9%) | 1,794,827 (9.7%) | 2,234,047 (10.6%) |
Kačenka | 5 (0.1%) | 62,696 (4.4%) | 69,951 (5.4%) | 1,034,642 (5.6%) | 1,188,029 (5.7%) |
E-Books | 5 (0.1%) | 17,140 (1.2%) | 17,495 (1.4%) | 330,118 (1.8%) | 399,607 (1.9%) |
KDE | 479 (6.2%) | 98,789 (7.0%) | 133,897 (10.3%) | 495,052 (2.7%) | 784,316 (3.7%) |
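The figures in the table also allow a quick comparison of average sentence length across sources. The following is a minimal sketch, not part of the CzEng distribution: the counts are copied from the table, while the dictionary layout and the function name `tokens_per_sentence` are illustrative assumptions.

```python
# (Czech, English) counts copied from the table above for three rows.
SENTENCES = {
    "Total": (1_418_721, 1_295_647),
    "Acquis Communautaire": (1_101_610, 930_626),
    "KDE": (98_789, 133_897),
}
TOKENS = {
    "Total": (18_517_624, 20_994_274),
    "Acquis Communautaire": (14_619_572, 16_079_043),
    "KDE": (495_052, 784_316),
}


def tokens_per_sentence(source):
    """Return the average number of tokens (words + punctuation) per
    sentence for one source, as a (Czech, English) tuple."""
    cs_sents, en_sents = SENTENCES[source]
    cs_toks, en_toks = TOKENS[source]
    return cs_toks / cs_sents, en_toks / en_sents


for name in SENTENCES:
    cs_avg, en_avg = tokens_per_sentence(name)
    print(f"{name}: {cs_avg:.1f} (Czech), {en_avg:.1f} (English)")
```

On these numbers the KDE segments come out much shorter on average than the legislative texts, which is worth keeping in mind when mixing sources.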
Since the individual sources of parallel texts differ in many aspects, a lot of effort was required to integrate them into a common framework. Depending on the type of the input resource, (some of) the following steps have been applied to the Czech and English documents:
The tokenization and segmentation rules were kept as simple as possible:
This decision leads to some unpleasant differences in tokenization and segmentation compared to the "common standard" of Penn-Treebank-like annotation.
The documents were sentence-aligned using hunalign (http://mokk.bme.hu/resources/hunalign), a freely available tool.
All settings were left at their defaults, and no dictionary was used to bootstrap the alignment. Hunalign collected its own temporary dictionary to improve sentence-level alignments.
The number of alignment pairs, broken down by the number of sentences on the English and Czech sides, is given in the following table:
English-Czech | 1-1 | 0-1 | 1-2 | 2-1 | 1-0 | 1-3 | 0-2 | 3-1 | Other |
---|---|---|---|---|---|---|---|---|---|
Alignment pairs | 924,543 (71.6%) | 97,929 (7.6%) | 70,879 (5.5%) | 69,558 (5.4%) | 64,490 (5.0%) | 23,538 (1.8%) | 8,526 (0.7%) | 6,768 (0.5%) | 24,943 (1.9%) |
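The percentages in the alignment table follow directly from the counts. As a small worked check (a sketch only; the variable names are illustrative and not taken from any CzEng tool), the shares can be recomputed like this:

```python
# Alignment pair counts copied from the table above (English-Czech).
COUNTS = {
    "1-1": 924_543,
    "0-1": 97_929,
    "1-2": 70_879,
    "2-1": 69_558,
    "1-0": 64_490,
    "1-3": 23_538,
    "0-2": 8_526,
    "3-1": 6_768,
    "Other": 24_943,
}

# Total number of alignment pairs over all types.
total_pairs = sum(COUNTS.values())

# Share of each alignment type, in percent, rounded to one decimal place.
shares = {kind: round(100.0 * n / total_pairs, 1) for kind, n in COUNTS.items()}

for kind, pct in shares.items():
    print(f"{kind}: {pct}%")
```

Fewer than three quarters of the pairs are 1-1, so restricting the corpus to 1-1 alignments discards a noticeable part of the data.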
In CzEng 0.5, each document pair is represented by three files:
Example:
Sample from the file data/books/books-two_towers-en.xml
    ...
    <s id='books-two_towers-en-c1p2s6'>
      <w id='books-two_towers-en-c1p2s6w1'>I</w>
      <w id='books-two_towers-en-c1p2s6w2'>wonder</w>
      <w id='books-two_towers-en-c1p2s6w3'>what</w>
      <w id='books-two_towers-en-c1p2s6w4'>he</w>
      <w id='books-two_towers-en-c1p2s6w5'>saw</w>
      <w id='books-two_towers-en-c1p2s6w6' no_space_after='1'>there</w>
      <w id='books-two_towers-en-c1p2s6w7'>?</w>
    </s>
    <s id='books-two_towers-en-c1p2s7'>
      <w id='books-two_towers-en-c1p2s7w1'>But</w>
      <w id='books-two_towers-en-c1p2s7w2'>he</w>
      <w id='books-two_towers-en-c1p2s7w3'>returned</w>
      ...
Sample from the file data/books/books-two_towers-cs.xml
    ...
    <s id='books-two_towers-cs-c1p3s7'>
      <w id='books-two_towers-cs-c1p3s7w1'>Co</w>
      <w id='books-two_towers-cs-c1p3s7w2'>tam</w>
      <w id='books-two_towers-cs-c1p3s7w3'>asi</w>
      <w id='books-two_towers-cs-c1p3s7w4' no_space_after='1'>uviděl</w>
      <w id='books-two_towers-cs-c1p3s7w5'>?</w>
    </s>
    <s id='books-two_towers-cs-c1p3s8'>
      <w id='books-two_towers-cs-c1p3s8w1'>Vracel</w>
      <w id='books-two_towers-cs-c1p3s8w2'>se</w>
      <w id='books-two_towers-cs-c1p3s8w3'>ale</w>
      ...
Sample from the file data/books/books-two_towers-salign.xml
    ...
    <pair type="1-1">
      <members1>
        <member idref="books-two_towers-en-c1p2s6"/>
      </members1>
      <members2>
        <member idref="books-two_towers-cs-c1p3s7"/>
      </members2>
    </pair>
    <pair type="1-1">
      <members1>
        <member idref="books-two_towers-en-c1p2s7"/>
      </members1>
      <members2>
        <member idref="books-two_towers-cs-c1p3s8"/>
      </members2>
    </pair>
    ...
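To illustrate how the three files fit together, here is a minimal Python sketch that resolves the `idref` attributes of an alignment file against the sentence `id`s of the two text files. It is not one of the CzEng tools. The XML fragments are taken from the samples above; the `<doc>` and `<salign>` root elements and the function names are hypothetical, introduced only so that the fragments parse. The sketch also assumes, as in the samples, that `members1` refers to the English side and `members2` to the Czech side.

```python
import xml.etree.ElementTree as ET

# Fragments from the samples above; the root elements are hypothetical wrappers.
EN_XML = """<doc>
<s id='books-two_towers-en-c1p2s6'>
<w id='books-two_towers-en-c1p2s6w1'>I</w>
<w id='books-two_towers-en-c1p2s6w2'>wonder</w>
<w id='books-two_towers-en-c1p2s6w3'>what</w>
<w id='books-two_towers-en-c1p2s6w4'>he</w>
<w id='books-two_towers-en-c1p2s6w5'>saw</w>
<w id='books-two_towers-en-c1p2s6w6' no_space_after='1'>there</w>
<w id='books-two_towers-en-c1p2s6w7'>?</w>
</s>
</doc>"""

CS_XML = """<doc>
<s id='books-two_towers-cs-c1p3s7'>
<w id='books-two_towers-cs-c1p3s7w1'>Co</w>
<w id='books-two_towers-cs-c1p3s7w2'>tam</w>
<w id='books-two_towers-cs-c1p3s7w3'>asi</w>
<w id='books-two_towers-cs-c1p3s7w4' no_space_after='1'>uviděl</w>
<w id='books-two_towers-cs-c1p3s7w5'>?</w>
</s>
</doc>"""

SALIGN_XML = """<salign>
<pair type="1-1">
<members1><member idref="books-two_towers-en-c1p2s6"/></members1>
<members2><member idref="books-two_towers-cs-c1p3s7"/></members2>
</pair>
</salign>"""


def sentences_by_id(xml_text):
    """Map each sentence id to its list of tokens."""
    root = ET.fromstring(xml_text)
    return {s.get("id"): [w.text for w in s.iter("w")] for s in root.iter("s")}


def member_ids(pair, side):
    """Return the idrefs listed under one side of a pair (may be empty)."""
    elem = pair.find(side)
    return [] if elem is None else [m.get("idref") for m in elem.iter("member")]


def aligned_pairs(salign_text, en_sents, cs_sents, pair_type=None):
    """Yield (English sentences, Czech sentences) for each alignment pair,
    optionally keeping only pairs of the given type, e.g. "1-1"."""
    for pair in ET.fromstring(salign_text).iter("pair"):
        if pair_type is not None and pair.get("type") != pair_type:
            continue
        en_side = [en_sents[i] for i in member_ids(pair, "members1")]
        cs_side = [cs_sents[i] for i in member_ids(pair, "members2")]
        yield en_side, cs_side


en = sentences_by_id(EN_XML)
cs = sentences_by_id(CS_XML)
for en_side, cs_side in aligned_pairs(SALIGN_XML, en, cs, pair_type="1-1"):
    print([" ".join(tokens) for tokens in en_side])
    print([" ".join(tokens) for tokens in cs_side])
```

Each side is kept as a list of sentences so that the same function also covers 1-2, 2-1, and other non-1-1 pairs when `pair_type` is left unset.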
The following simple tools for manipulating CzEng files are included in tools/:
restrict_f2_to_11_alignments.pl restricts CzEng text files to contain only sentences that are aligned 1-to-1, according to the corresponding .salign file.
f2_to_m0.pl and f2_to_w.pl convert CzEng text files to a variant of Prague Markup Language (PML, http://ufal.mff.cuni.cz/jazz/PML/doc/pml_doc.html), which was used to annotate the Prague Dependency Treebank.
f2_to_plain.pl converts CzEng text files to plain text, each sentence on a line. The tokenization of CzEng is retained by default, but can be removed if necessary.
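For readers who want to do the plain-text conversion outside Perl, the following Python sketch shows one way to flatten a sentence element into a single line. It is not a port of f2_to_plain.pl and makes no claim about that script's options. The function name `sentence_to_plain` and its `keep_tokenization` parameter are hypothetical, and the reading of `no_space_after='1'` as "no space followed this token in the original text" is an assumption based on the attribute name and the samples.

```python
import xml.etree.ElementTree as ET

# One sentence copied from the English sample in this document.
SENTENCE_XML = """<s id='books-two_towers-en-c1p2s6'>
<w id='books-two_towers-en-c1p2s6w1'>I</w>
<w id='books-two_towers-en-c1p2s6w2'>wonder</w>
<w id='books-two_towers-en-c1p2s6w3'>what</w>
<w id='books-two_towers-en-c1p2s6w4'>he</w>
<w id='books-two_towers-en-c1p2s6w5'>saw</w>
<w id='books-two_towers-en-c1p2s6w6' no_space_after='1'>there</w>
<w id='books-two_towers-en-c1p2s6w7'>?</w>
</s>"""


def sentence_to_plain(s_elem, keep_tokenization=True):
    """Join the <w> tokens of one <s> element into a single line of text.

    With keep_tokenization=True every token is separated by a space.
    Otherwise tokens marked no_space_after='1' are glued to the next token.
    """
    pieces = []
    for w in s_elem.iter("w"):
        pieces.append(w.text or "")
        glued = w.get("no_space_after") == "1"  # assumed meaning, see above
        pieces.append("" if (glued and not keep_tokenization) else " ")
    return "".join(pieces).strip()


sentence = ET.fromstring(SENTENCE_XML)
tokenized = sentence_to_plain(sentence)
detokenized = sentence_to_plain(sentence, keep_tokenization=False)
print(tokenized)
print(detokenized)
```

Keeping the tokenized form is usually the safer default for machine translation experiments, since it matches the token boundaries stored in the corpus.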
If you make use of CzEng data, please make sure to cite CzEng properly:
Ondřej Bojar and Zdeněk Žabokrtský. 2006. CzEng: Czech-English Parallel Corpus, Release version 0.5. Prague Bulletin of Mathematical Linguistics, 86. (in print).
    @Article{czeng:pbml:2006,
      publicationtype = {article},
      Author = {Ond{\v{r}}ej Bojar and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}}},
      title = "{CzEng: Czech-English Parallel Corpus, Release version 0.5}",
      Journal = {Prague Bulletin of Mathematical Linguistics},
      Volume = {86},
      ISSN = {0032-6585},
      Publisher = {Charles University},
      PubAddress = {Prague},
      Year = {2006},
      note = {(in print)}
    }
By using CzEng 0.5 the user agrees to be bound by the license agreement. Briefly said, the license
The user of CzEng 0.5 should be aware of its following properties: