CzEng 0.7 (http://ufal.mff.cuni.cz/czeng/) is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague in 2005-2007. The corpus contains no manual annotation. It is limited to texts that were already available in electronic form and that are not protected by authors' rights in the Czech Republic. The main purpose of the corpus is to support Czech-English and English-Czech machine translation research with the necessary data. CzEng 0.7 is available free of charge for educational and research purposes; however, users should become acquainted with the license agreement (http://ufal.mff.cuni.cz/czeng/license07.html).
CzEng 0.7 consists of a large set of parallel textual documents mainly from the fields of European law, information technology, and fiction, all of them converted into a uniform XML-based file format and provided with automatic sentence alignment. Full details on the corpus size are given in the table below.
Participants of WMT08 shared task: if you downloaded CzEng before February 5, 2008, please download again.
Full CzEng 0.7 data are available in the following package:
We have used texts from the following publicly available sources:
Acquis Communautaire Parallel Corpus (prefix celex) available at http://wt.jrc.it/lt/Acquis/. It contains a huge body of EU legislative texts written between the 1950s and 2005 (CzEng uses only two of the 20 languages covered by the Acquis Communautaire Corpus).
EU constitution proposal (prefix euconst) as made available in Corpus OPUS (http://logos.uio.no/opus/).
Anonymous user translations as provided for the Navajo project (prefix navajo_user_translations) available from http://www.navajo.cz/.
GNOME projects localization files (prefix gnome) available from http://www.gnome.org/projects/.
KDE localization files (prefix kde) available from http://l10n.kde.org/.
Articles from Project Syndicate (prefix project_syndicate) available at http://www.project-syndicate.org/. Copyright: Project Syndicate, 2007. Permission granted to use the data for educational and non-commercial purposes only. Reprinting the material without written consent from Project Syndicate is a violation of international copyright law.
Samples from the Official Journal of the European Union (prefix eujournal) available at http://europa.eu.int/eur-lex/lex/JOIndex.do?ihmlang=en. This is a tiny collection of rather randomly chosen issues of the Official Journal of the European Union.
Reader's Digest stories from two sources:
Stories available as a part of Prague Czech-English Dependency Treebank (PCEDT) (prefix pcedt-rd). For more information on PCEDT, see http://ufal.mff.cuni.cz/pcedt/.
Additional Reader's Digest stories (prefix rd2).
Parallel corpus Kačenka (prefix kacenka) available at http://www.phil.muni.cz/angl/kacenka/kachna.html. Because of the authors' rights, CzEng 0.7 can include only its subset, namely the following books:
D. H. Lawrence: Sons and Lovers / Synové a milenci,
Ch. Dickens: The Pickwick Papers / Pickwickovci,
Ch. Dickens: Oliver Twist,
T. Hardy: Jude the Obscure / Neblahý Juda,
T. Hardy: Tess of the d'Urbervilles / Tess z d'Urbervillu.
E-books (prefix books) freely available on the Internet both in English and Czech (especially at http://www.gutenberg.org and http://www.palmknihy.cz), namely:
Jack London: The Star Rover / Tulák po hvězdách,
Franz Kafka: Trial / Proces,
E.A. Poe: The Narrative of Arthur Gordon Pym of Nantucket / Dobrodružství A.G.Pyma,
E.A. Poe: A Descent into the Maelstrom / Pád do Malströmu,
Jerome K. Jerome: Three Men in a Boat / Tři muži ve člunu.
The quantitative properties of the individual sources (after performing the necessary preprocessing, as described in the next section) are summarized in the following table:
| Source | Document pairs | Sentences (Czech) | Sentences (English) | Words+Punctuation (Czech) | Words+Punctuation (English) |
|---|---|---|---|---|---|
| Total | 13,793 (100%) | 1,375,908 (100%) | 1,383,203 (100%) | 20,967,030 (100%) | 23,415,945 (100%) |
| Acquis Communautaire | 5,945 (43.1%) | 881,348 (64.1%) | 882,965 (63.8%) | 14,465,145 (69.0%) | 15,820,486 (67.6%) |
| Reader's Digest | 927 (6.7%) | 118,972 (8.6%) | 126,975 (9.2%) | 1,794,045 (8.6%) | 2,233,022 (9.5%) |
| Project Syndicate | 2,046 (14.8%) | 89,460 (6.5%) | 88,675 (6.4%) | 1,869,292 (8.9%) | 2,076,702 (8.9%) |
| KDE | 864 (6.3%) | 85,591 (6.2%) | 85,582 (6.2%) | 396,542 (1.9%) | 440,921 (1.9%) |
| GNOME | 224 (1.6%) | 79,021 (5.7%) | 79,083 (5.7%) | 399,933 (1.9%) | 434,039 (1.9%) |
| Kačenka | 5 (0.0%) | 57,157 (4.2%) | 57,580 (4.2%) | 1,034,638 (4.9%) | 1,188,023 (5.1%) |
| Navajo User Translations | 3,722 (27.0%) | 32,288 (2.3%) | 31,578 (2.3%) | 433,941 (2.1%) | 513,989 (2.2%) |
| E-Books | 5 (0.0%) | 15,966 (1.2%) | 16,308 (1.2%) | 330,112 (1.6%) | 399,595 (1.7%) |
| European Constitution | 47 (0.3%) | 11,101 (0.8%) | 9,500 (0.7%) | 138,990 (0.7%) | 176,032 (0.8%) |
| Samples from European Journal | 8 (0.1%) | 5,004 (0.4%) | 4,957 (0.4%) | 104,392 (0.5%) | 133,136 (0.6%) |
Since the individual sources of parallel texts differ in many aspects, a lot of effort was required to integrate them into a common framework. Depending on the type of the input resource, (some of) the following steps have been applied to the Czech and English documents:
The documents were sentence-aligned using hunalign (http://mokk.bme.hu/resources/hunalign), a freely available tool.
All the settings were kept default and we did not use any dictionary to bootstrap from. Hunalign collected its own temporary dictionary to improve sentence-level alignments.
The number of alignment pairs, broken down by the number of sentences on the English and Czech sides, is given in the following table:
| English-Czech | 1-1 | 2-1 | 0-1 | 1-2 | 1-0 | 3-1 | 1-3 | 0-2 | Others |
|---|---|---|---|---|---|---|---|---|---|
| Alignment pairs | 1,096,940 | 68,856 | 63,185 | 43,057 | 30,694 | 11,003 | 4,786 | 3,855 | 13,479 |
| Share | 82.1% | 5.2% | 4.7% | 3.2% | 2.3% | 0.8% | 0.4% | 0.3% | 1.0% |
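The percentages in the alignment table can be recomputed from the raw counts. The following is a minimal Python sketch for that check; the counts are copied from the table, while the names `ALIGNMENT_COUNTS` and `alignment_shares` are illustrative and not part of the CzEng distribution.

```python
# Alignment-pair counts copied from the table above (English-Czech).
ALIGNMENT_COUNTS = {
    "1-1": 1_096_940,
    "2-1": 68_856,
    "0-1": 63_185,
    "1-2": 43_057,
    "1-0": 30_694,
    "3-1": 11_003,
    "1-3": 4_786,
    "0-2": 3_855,
    "Others": 13_479,
}


def alignment_shares(counts):
    """Return each alignment type's share of all alignment pairs, in percent."""
    total = sum(counts.values())
    return {kind: 100.0 * n / total for kind, n in counts.items()}


shares = alignment_shares(ALIGNMENT_COUNTS)
for kind, pct in shares.items():
    # One line per alignment type, rounded to one decimal place.
    print(f"{kind:>6}: {pct:.1f}%")
```

Rounded to one decimal place, the computed shares agree with the percentages listed in the table.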
In CzEng 0.7, each document pair is represented by three files:
*-en.xml - XML file containing the English text structured according to czeng07-text.dtd. Historically, we call this the 'f2' format.
*-cs.xml - XML file containing the Czech counterpart structured according to the same DTD.
*-salign.xml - XML file containing the sentence alignment of the two texts, represented as pairs of identifiers of the corresponding sentences according to czeng07-alignment.dtd.
Example:
Sample from the file data/books/books-two_towers-en.xml
...
<s id='books-two_towers-en-c1p2s6'>
  <w id='books-two_towers-en-c1p2s6w1'>I</w>
  <w id='books-two_towers-en-c1p2s6w2'>wonder</w>
  <w id='books-two_towers-en-c1p2s6w3'>what</w>
  <w id='books-two_towers-en-c1p2s6w4'>he</w>
  <w id='books-two_towers-en-c1p2s6w5'>saw</w>
  <w id='books-two_towers-en-c1p2s6w6' no_space_after='1'>there</w>
  <w id='books-two_towers-en-c1p2s6w7'>?</w>
</s>
<s id='books-two_towers-en-c1p2s7'>
  <w id='books-two_towers-en-c1p2s7w1'>But</w>
  <w id='books-two_towers-en-c1p2s7w2'>he</w>
  <w id='books-two_towers-en-c1p2s7w3'>returned</w>
...
Sample from the file data/books/books-two_towers-cs.xml
...
<s id='books-two_towers-cs-c1p3s7'>
  <w id='books-two_towers-cs-c1p3s7w1'>Co</w>
  <w id='books-two_towers-cs-c1p3s7w2'>tam</w>
  <w id='books-two_towers-cs-c1p3s7w3'>asi</w>
  <w id='books-two_towers-cs-c1p3s7w4' no_space_after='1'>uviděl</w>
  <w id='books-two_towers-cs-c1p3s7w5'>?</w>
</s>
<s id='books-two_towers-cs-c1p3s8'>
  <w id='books-two_towers-cs-c1p3s8w1'>Vracel</w>
  <w id='books-two_towers-cs-c1p3s8w2'>se</w>
  <w id='books-two_towers-cs-c1p3s8w3'>ale</w>
...
Sample from the file data/books/books-two_towers-salign.xml
...
<pair type="1-1">
  <members1>
    <member idref="books-two_towers-en-c1p2s6"/>
  </members1>
  <members2>
    <member idref="books-two_towers-cs-c1p3s7"/>
  </members2>
</pair>
<pair type="1-1">
  <members1>
    <member idref="books-two_towers-en-c1p2s7"/>
  </members1>
  <members2>
    <member idref="books-two_towers-cs-c1p3s8"/>
  </members2>
</pair>
...
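To illustrate how the three files fit together, here is a minimal Python sketch that joins sentences through the alignment file. It relies only on the element and attribute names visible in the samples above (`s`, `w`, `pair`, `members1`, `members2`, `member`, `id`, `idref`, `type`). The `<doc>` and `<alignment>` wrapper elements and all function names are hypothetical; the real root elements are defined by czeng07-text.dtd and czeng07-alignment.dtd, which are not reproduced here. The sketch also assumes, as in the sample, that `members1` holds the English side and `members2` the Czech side.

```python
import xml.etree.ElementTree as ET

# Fragments based on the samples above. The <doc> and <alignment> wrappers are
# hypothetical stand-ins for the real root elements defined by the DTDs.
EN_XML = """<doc>
<s id='books-two_towers-en-c1p2s6'>
<w id='books-two_towers-en-c1p2s6w1'>I</w>
<w id='books-two_towers-en-c1p2s6w2'>wonder</w>
<w id='books-two_towers-en-c1p2s6w3'>what</w>
<w id='books-two_towers-en-c1p2s6w4'>he</w>
<w id='books-two_towers-en-c1p2s6w5'>saw</w>
<w id='books-two_towers-en-c1p2s6w6' no_space_after='1'>there</w>
<w id='books-two_towers-en-c1p2s6w7'>?</w>
</s>
</doc>"""

CS_XML = """<doc>
<s id='books-two_towers-cs-c1p3s7'>
<w id='books-two_towers-cs-c1p3s7w1'>Co</w>
<w id='books-two_towers-cs-c1p3s7w2'>tam</w>
<w id='books-two_towers-cs-c1p3s7w3'>asi</w>
<w id='books-two_towers-cs-c1p3s7w4' no_space_after='1'>uviděl</w>
<w id='books-two_towers-cs-c1p3s7w5'>?</w>
</s>
</doc>"""

SALIGN_XML = """<alignment>
<pair type="1-1">
<members1><member idref="books-two_towers-en-c1p2s6"/></members1>
<members2><member idref="books-two_towers-cs-c1p3s7"/></members2>
</pair>
</alignment>"""


def sentences_by_id(root):
    """Map each sentence id to its list of token strings."""
    return {s.get("id"): [w.text for w in s.iter("w")] for s in root.iter("s")}


def member_ids(pair, side):
    """Return the sentence ids on one side of a pair; empty if the side is absent."""
    group = pair.find(side)
    if group is None:
        return []
    return [m.get("idref") for m in group.iter("member")]


def aligned_pairs(en_root, cs_root, align_root):
    """Yield (type, English sentences, Czech sentences) for every alignment pair."""
    en = sentences_by_id(en_root)
    cs = sentences_by_id(cs_root)
    for pair in align_root.iter("pair"):
        en_sents = [en[i] for i in member_ids(pair, "members1")]
        cs_sents = [cs[i] for i in member_ids(pair, "members2")]
        yield pair.get("type"), en_sents, cs_sents


pairs = list(
    aligned_pairs(ET.fromstring(EN_XML), ET.fromstring(CS_XML), ET.fromstring(SALIGN_XML))
)
for kind, en_sents, cs_sents in pairs:
    print(kind, en_sents, cs_sents)
```

For real files, `ET.parse(path).getroot()` would replace `ET.fromstring(...)`. Each side is returned as a list of sentences so that alignment types other than 1-1 (for example 2-1 or 0-1) fit the same shape.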
For convenience, the CzEng 0.7 release also includes plaintext tokenized versions of sentences that were aligned 1-to-1. You can find this restricted collection in the directory data-1-1-plaintext/.
The following simple tools for manipulating CzEng files are included in tools/:
restrict_f2_to_11_alignments.pl restricts CzEng text files to contain only sentences that are aligned 1-to-1, according to the corresponding .salign file.
f2_to_m0.pl and f2_to_w.pl convert CzEng text files to a variant of Prague Markup Language (PML, http://ufal.mff.cuni.cz/jazz/PML/doc/pml_doc.html), which was used to annotate the Prague Dependency Treebank.
f2_to_plain.pl converts CzEng text files to plain text, one sentence per line. The tokenization of CzEng is retained by default, but can be removed if necessary.
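The idea behind plain-text conversion with or without tokenization can be sketched in Python as follows. This is not the implementation of f2_to_plain.pl, whose options and exact behavior are not documented here. It only assumes, based on the sample data, that the `no_space_after='1'` attribute marks a token that was directly followed by the next token in the original text. The function name `sentence_to_text` and its `keep_tokenization` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET

# One sentence copied from the English sample file shown earlier.
SENTENCE_XML = """<s id='books-two_towers-en-c1p2s6'>
<w id='books-two_towers-en-c1p2s6w1'>I</w>
<w id='books-two_towers-en-c1p2s6w2'>wonder</w>
<w id='books-two_towers-en-c1p2s6w3'>what</w>
<w id='books-two_towers-en-c1p2s6w4'>he</w>
<w id='books-two_towers-en-c1p2s6w5'>saw</w>
<w id='books-two_towers-en-c1p2s6w6' no_space_after='1'>there</w>
<w id='books-two_towers-en-c1p2s6w7'>?</w>
</s>"""


def sentence_to_text(s_elem, keep_tokenization=True):
    """Render one <s> element as a single line of text.

    With keep_tokenization=True every token is separated by a space.
    Otherwise the space is dropped after tokens marked no_space_after='1'
    (assumed meaning: no whitespace followed the token in the original text).
    """
    parts = []
    for w in s_elem.iter("w"):
        parts.append(w.text or "")
        if keep_tokenization or w.get("no_space_after") != "1":
            parts.append(" ")
    return "".join(parts).strip()


sentence = ET.fromstring(SENTENCE_XML)
tokenized = sentence_to_text(sentence)
detokenized = sentence_to_text(sentence, keep_tokenization=False)
print(tokenized)    # I wonder what he saw there ?
print(detokenized)  # I wonder what he saw there?
```

Keeping tokenization as the default in this sketch mirrors the default described above for f2_to_plain.pl.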
If you make use of CzEng data, please make sure to cite CzEng properly:
Ondřej Bojar and Zdeněk Žabokrtský. 2006. CzEng: Czech-English Parallel Corpus, Release version 0.5. Prague Bulletin of Mathematical Linguistics, 86, pp. 59-62.
@Article{czeng:pbml:2006,
  publicationtype = {article},
  Author = {Ond{\v{r}}ej Bojar and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}}},
  title = "{CzEng: Czech-English Parallel Corpus, Release version 0.5}",
  Journal = {Prague Bulletin of Mathematical Linguistics},
  Volume = {86},
  pages = {59--62},
  ISSN = {0032-6585},
  Publisher = {Charles University},
  PubAddress = {Prague},
  Year = {2006}
}
By using CzEng 0.7 the user agrees to be bound by the license agreement. Briefly said, the license
The user of CzEng 0.7 should be aware of its following properties: