CzEng 0.7 (Czech-English Parallel Corpus, version 0.7)

Ondřej Bojar

Zdeněk Žabokrtský

Pavel Češka

Peter Beňa

Miroslav Janíček

Table of Contents

1. Introduction
2. Download
3. Sources of Parallel Texts
4. Text Preprocessing
5. Sentence Alignment
6. Data Formats and File Naming Convention
7. Tools
8. Citing CzEng
9. License
10. Disclaimer
11. Acknowledgement

1. Introduction

CzEng 0.7 (http://ufal.mff.cuni.cz/czeng/) is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague in 2005-2007. The corpus contains no manual annotation. It is limited to texts that were already available in electronic form and that are not protected by authors' rights in the Czech Republic. The main purpose of the corpus is to provide data for Czech-English and English-Czech machine translation research. CzEng 0.7 is available free of charge for educational and research purposes; users should, however, first read the license agreement (http://ufal.mff.cuni.cz/czeng/license07.html).

CzEng 0.7 consists of a large set of parallel textual documents, mainly from the fields of European law, information technology, and fiction, all of them converted into a uniform XML-based file format and provided with automatic sentence alignment. Full details on the corpus size are given in the table in Section 3.

2. Download

Participants of WMT08 shared task: if you downloaded CzEng before February 5, 2008, please download again.

Full CzEng 0.7 data are available in the following package:

http://ufal.mff.cuni.cz/czeng/download.php?f=czeng07.zip (365 MB or 348 MiB)

3. Sources of Parallel Texts

We have used texts from the following publicly available sources:

Acquis Communautaire Parallel Corpus (prefix celex) available at http://wt.jrc.it/lt/Acquis/. It contains a huge body of EU legislative texts written between the 1950s and 2005 (CzEng uses only two of the 20 languages covered by the Acquis Communautaire Corpus).
EU constitution proposal (prefix euconst) as made available in Corpus OPUS (http://logos.uio.no/opus/).
Anonymous user translations as provided for the Navajo project (prefix navajo_user_translations) available from http://www.navajo.cz/.
GNOME projects localization files (prefix gnome) available from http://www.gnome.org/projects/.
KDE localization files (prefix kde) available from http://l10n.kde.org/.
Articles from Project Syndicate (prefix project_syndicate) available at http://www.project-syndicate.org/. Copyright: Project Syndicate, 2007. Permission granted to use the data for educational and non-commercial purposes only. Reprinting the material without written consent from Project Syndicate is a violation of international copyright law.
Samples from the Official Journal of the European Union (prefix eujournal) available at http://europa.eu.int/eur-lex/lex/JOIndex.do?ihmlang=en. This is a tiny collection of rather randomly chosen issues of the Official Journal of the European Union.
Reader's Digest stories from two sources:
- Stories available as part of the Prague Czech-English Dependency Treebank (PCEDT) (prefix pcedt-rd). For more information on PCEDT, see http://ufal.mff.cuni.cz/pcedt/.
- Additional Reader's Digest stories (prefix rd2).
Parallel corpus Kačenka (prefix kacenka) available at http://www.phil.muni.cz/angl/kacenka/kachna.html. Because of the authors' rights, CzEng 0.7 can include only its subset, namely the following books: D. H. Lawrence: Sons and Lovers / Synové a milenci, Ch. Dickens: The Pickwick Papers / Pickwickovci, Ch. Dickens: Oliver Twist, T. Hardy: Jude the Obscure / Neblahý Juda, T. Hardy: Tess of the d'Urbervilles / Tess z d'Urbervillu.
E-books (prefix books) freely available on the Internet both in English and Czech (especially at http://www.gutenberg.org and http://www.palmknihy.cz), namely: Jack London: The Star Rover / Tulák po hvězdách, Franz Kafka: Trial / Proces, E.A. Poe: The Narrative of Arthur Gordon Pym of Nantucket / Dobrodružství A.G.Pyma, E.A. Poe: A Descent into the Maelstrom / Pád do Malströmu, Jerome K. Jerome: Three Men in a Boat / Tři muži ve člunu.

The quantitative properties of the individual sources (after performing the necessary preprocessing, as described in the next section) are summarized in the following table:

	Document pairs	Sentences		Words+Punctuation
		Czech	English	Czech	English
Total	13,793	1,375,908	1,383,203	20,967,030	23,415,945
	100%	100%	100%	100%	100%
Acquis Communautaire	5,945	881,348	882,965	14,465,145	15,820,486
	43.1%	64.1%	63.8%	69.0%	67.6%
Reader's Digest	927	118,972	126,975	1,794,045	2,233,022
	6.7%	8.6%	9.2%	8.6%	9.5%
Project Syndicate	2,046	89,460	88,675	1,869,292	2,076,702
	14.8%	6.5%	6.4%	8.9%	8.9%
KDE	864	85,591	85,582	396,542	440,921
	6.3%	6.2%	6.2%	1.9%	1.9%
GNOME	224	79,021	79,083	399,933	434,039
	1.6%	5.7%	5.7%	1.9%	1.9%
Kačenka	5	57,157	57,580	1,034,638	1,188,023
	0.0%	4.2%	4.2%	4.9%	5.1%
Navajo User Translations	3,722	32,288	31,578	433,941	513,989
	27.0%	2.3%	2.3%	2.1%	2.2%
E-Books	5	15,966	16,308	330,112	399,595
	0.0%	1.2%	1.2%	1.6%	1.7%
European Constitution	47	11,101	9,500	138,990	176,032
	0.3%	0.8%	0.7%	0.7%	0.8%
Samples from European Journal	8	5,004	4,957	104,392	133,136
	0.1%	0.4%	0.4%	0.5%	0.6%

4. Text Preprocessing

Since the individual sources of parallel texts differ in many aspects, considerable effort was required to integrate them into a common framework. Depending on the type of the input resource, some or all of the following steps were applied to the Czech and English documents:

conversion from PDF, PALM (PDB DOC), SGML, HTML and other formats,
encoding conversion (everything converted into UTF-8 character encoding), sometimes manual correction of mis-interpreted character codes,
removing scanning errors, removing end-of-line hyphens,
file renaming, directory restructuring,
sentence segmentation,
tokenization,
removing long text segments having no counterpart in the corresponding document,
adding sentence and token identifiers,
conversion to a common XML format.
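
The last few steps (tokenization, adding identifiers, conversion to XML) can be illustrated with a minimal Python sketch. It is not the pipeline that was used to build CzEng: the regular-expression tokenizer and the function names tokenize and build_sentence are assumptions made for illustration only. The element names, the no_space_after attribute and the identifier pattern are copied from the samples in Section 6.

```python
import re
import xml.etree.ElementTree as ET

def tokenize(sentence):
    """Split a sentence into (token, followed_by_space) pairs.

    ASSUMPTION: a naive tokenizer (runs of word characters, or single
    punctuation marks). The tokenizer actually used for CzEng is not
    described in this document.
    """
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", sentence):
        end = match.end()
        # The last token of a sentence is treated as followed by a space.
        followed_by_space = end == len(sentence) or sentence[end].isspace()
        tokens.append((match.group(), followed_by_space))
    return tokens

def build_sentence(sentence_id, tokens):
    """Build an <s> element whose <w> children carry derived identifiers."""
    s = ET.Element("s", {"id": sentence_id})
    for number, (text, followed_by_space) in enumerate(tokens, start=1):
        w = ET.SubElement(s, "w", {"id": f"{sentence_id}w{number}"})
        if not followed_by_space:
            w.set("no_space_after", "1")
        w.text = text
    return s

element = build_sentence("books-two_towers-en-c1p2s6",
                         tokenize("I wonder what he saw there?"))
print(ET.tostring(element, encoding="unicode"))
```

Token identifiers are derived from the sentence identifier by appending w and the token number, which matches the pattern visible in the released files.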

5. Sentence Alignment

The documents were sentence-aligned using hunalign (http://mokk.bme.hu/resources/hunalign), a freely available tool.

All the settings were kept default and we did not use any dictionary to bootstrap from. Hunalign collected its own temporary dictionary to improve sentence-level alignments.

The number of alignment pairs according to the number of sentences on the English and Czech side is given in the following table:

English-Czech	1-1	2-1	0-1	1-2	1-0	3-1	1-3	0-2	Others
Alignment pairs	1,096,940	68,856	63,185	43,057	30,694	11,003	4,786	3,855	13,479
	82.1%	5.2%	4.7%	3.2%	2.3%	0.8%	0.4%	0.3%	1.0%
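
The percentages in the table can be recomputed from the counts. The following Python sketch only repeats the arithmetic; the counts are copied from the table above, and the names counts, total and share are illustrative.

```python
# Alignment-pair counts copied from the table above (English-Czech).
counts = {
    "1-1": 1_096_940,
    "2-1": 68_856,
    "0-1": 63_185,
    "1-2": 43_057,
    "1-0": 30_694,
    "3-1": 11_003,
    "1-3": 4_786,
    "0-2": 3_855,
    "Others": 13_479,
}

total = sum(counts.values())

def share(kind):
    """Return the percentage of alignment pairs of the given type."""
    return 100.0 * counts[kind] / total

for kind in counts:
    print(f"{kind:>6}  {counts[kind]:>9,}  {share(kind):5.1f}%")
```

The counts sum to 1,335,855 alignment pairs, of which the 1-1 pairs make up 82.1%.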

6. Data Formats and File Naming Convention

In CzEng 0.7, each document pair is represented by three files:

*-en.xml - XML file containing the English text structured according to czeng07-text.dtd. Historically, we call this the 'f2' format.
*-cs.xml - XML file containing the Czech counterpart structured according to the same DTD.
*-salign.xml - XML file containing the sentence alignment of the two texts, represented as pairs of identifiers of the corresponding sentences according to czeng07-alignment.dtd.
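
Given this naming convention, the three files of a document pair can be derived from a common stem. The Python sketch below is a convenience illustration, not one of the tools shipped with CzEng; the function names document_files and document_stems are hypothetical.

```python
from pathlib import Path

# Suffixes taken from the naming convention described above.
SUFFIXES = {"en": "-en.xml", "cs": "-cs.xml", "salign": "-salign.xml"}

def document_files(stem):
    """Map a stem such as 'data/books/books-two_towers' to its three files."""
    return {kind: Path(stem + suffix) for kind, suffix in SUFFIXES.items()}

def document_stems(directory):
    """List the stems of all document pairs in a directory.

    The *-salign.xml files serve as the index; whether the matching
    text files exist is not checked here.
    """
    suffix = SUFFIXES["salign"]
    return sorted(str(path)[: -len(suffix)]
                  for path in Path(directory).glob("*" + suffix))

files = document_files("data/books/books-two_towers")
for kind, path in files.items():
    print(kind, path)
```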

Example:

Sample from the file data/books/books-two_towers-en.xml

 ...
 <s id='books-two_towers-en-c1p2s6'>
   <w id='books-two_towers-en-c1p2s6w1'>I</w>
   <w id='books-two_towers-en-c1p2s6w2'>wonder</w>
   <w id='books-two_towers-en-c1p2s6w3'>what</w>
   <w id='books-two_towers-en-c1p2s6w4'>he</w>
   <w id='books-two_towers-en-c1p2s6w5'>saw</w>
   <w id='books-two_towers-en-c1p2s6w6' no_space_after='1'>there</w>
   <w id='books-two_towers-en-c1p2s6w7'>?</w>
 </s>
 <s id='books-two_towers-en-c1p2s7'>
   <w id='books-two_towers-en-c1p2s7w1'>But</w>
   <w id='books-two_towers-en-c1p2s7w2'>he</w>
   <w id='books-two_towers-en-c1p2s7w3'>returned</w>
 ...

Sample from the file data/books/books-two_towers-cs.xml

 ...
 <s id='books-two_towers-cs-c1p3s7'>
  <w id='books-two_towers-cs-c1p3s7w1'>Co</w>
  <w id='books-two_towers-cs-c1p3s7w2'>tam</w>
  <w id='books-two_towers-cs-c1p3s7w3'>asi</w>
  <w id='books-two_towers-cs-c1p3s7w4' no_space_after='1'>uviděl</w>
  <w id='books-two_towers-cs-c1p3s7w5'>?</w>
 </s>
 <s id='books-two_towers-cs-c1p3s8'>
  <w id='books-two_towers-cs-c1p3s8w1'>Vracel</w>
  <w id='books-two_towers-cs-c1p3s8w2'>se</w>
  <w id='books-two_towers-cs-c1p3s8w3'>ale</w>
 ...
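
Judging from the samples above, the no_space_after attribute marks a token that is directly followed by the next token in the original text. Under that reading (an inference from the samples, not a statement from the DTD), a sentence can be turned back into plain text with a short Python sketch. The function name sentence_text and its keep_tokenization parameter are illustrative.

```python
import xml.etree.ElementTree as ET

# One sentence copied from the Czech sample above.
SAMPLE = """
<s id='books-two_towers-cs-c1p3s7'>
 <w id='books-two_towers-cs-c1p3s7w1'>Co</w>
 <w id='books-two_towers-cs-c1p3s7w2'>tam</w>
 <w id='books-two_towers-cs-c1p3s7w3'>asi</w>
 <w id='books-two_towers-cs-c1p3s7w4' no_space_after='1'>uviděl</w>
 <w id='books-two_towers-cs-c1p3s7w5'>?</w>
</s>
"""

def sentence_text(s, keep_tokenization=False):
    """Join the <w> tokens of an <s> element into one line of text.

    With keep_tokenization=True every token is separated by a space;
    otherwise no space is emitted after tokens marked no_space_after='1'.
    """
    parts = []
    for w in s.iter("w"):
        parts.append(w.text or "")
        if keep_tokenization or w.get("no_space_after") != "1":
            parts.append(" ")
    return "".join(parts).strip()

sentence = ET.fromstring(SAMPLE.strip())
print(sentence_text(sentence))                          # Co tam asi uviděl?
print(sentence_text(sentence, keep_tokenization=True))  # Co tam asi uviděl ?
```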

Sample from the file data/books/books-two_towers-salign.xml

 ...
<pair type="1-1">
  <members1>
    <member idref="books-two_towers-en-c1p2s6"/>
  </members1>
  <members2>
    <member idref="books-two_towers-cs-c1p3s7"/>
  </members2>
</pair>
<pair type="1-1">
  <members1>
    <member idref="books-two_towers-en-c1p2s7"/>
  </members1>
  <members2>
    <member idref="books-two_towers-cs-c1p3s8"/>
  </members2>
</pair>
 ...
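
The alignment file can be read with any XML parser. The following Python sketch extracts the sentence identifiers of each pair and keeps only the 1-1 alignments. The pair elements are copied from the sample above, but the enclosing wrapper element is a placeholder, because the root element of the alignment files is not shown in this document. The function name read_pairs is illustrative.

```python
import xml.etree.ElementTree as ET

# The <pair> elements are copied from the sample above.
# ASSUMPTION: <wrapper> is a placeholder for the real root element.
SAMPLE = """
<wrapper>
<pair type="1-1">
  <members1>
    <member idref="books-two_towers-en-c1p2s6"/>
  </members1>
  <members2>
    <member idref="books-two_towers-cs-c1p3s7"/>
  </members2>
</pair>
<pair type="1-1">
  <members1>
    <member idref="books-two_towers-en-c1p2s7"/>
  </members1>
  <members2>
    <member idref="books-two_towers-cs-c1p3s8"/>
  </members2>
</pair>
</wrapper>
"""

def read_pairs(root):
    """Yield (type, ids on side 1, ids on side 2) for every <pair> element."""
    for pair in root.iter("pair"):
        side1 = [m.get("idref") for m in pair.findall("members1/member")]
        side2 = [m.get("idref") for m in pair.findall("members2/member")]
        yield pair.get("type"), side1, side2

root = ET.fromstring(SAMPLE.strip())
one_to_one = [(side1[0], side2[0])
              for kind, side1, side2 in read_pairs(root)
              if kind == "1-1"]
for english_id, czech_id in one_to_one:
    print(english_id, czech_id)
```

In the sample, members1 holds the English identifiers and members2 the Czech ones. For a real file, the root would come from ET.parse(path).getroot() instead of the embedded string.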

For convenience, the CzEng 0.7 release also includes plaintext tokenized versions of the sentences that were aligned 1-to-1. You can find this restricted collection in the directory data-1-1-plaintext/.

7. Tools

The following simple tools for manipulating CzEng files are included in tools/:

restrict_f2_to_11_alignments.pl restricts CzEng text files to contain only sentences that are aligned 1-to-1, according to the corresponding .salign file.
f2_to_m0.pl and f2_to_w.pl convert CzEng text files to a variant of Prague Markup Language (PML, http://ufal.mff.cuni.cz/jazz/PML/doc/pml_doc.html), which was used to annotate the Prague Dependency Treebank.
f2_to_plain.pl converts CzEng text files to plain text, each sentence on a line. The tokenization of CzEng is retained by default, but can be removed if necessary.

8. Citing CzEng

If you make use of CzEng data, please make sure to cite CzEng properly:

Preferred citation:

Ondřej Bojar and Zdeněk Žabokrtský. 2006. CzEng: Czech-English Parallel Corpus, Release version 0.5. Prague Bulletin of Mathematical Linguistics, 86. pp. 59-62.

@Article{czeng:pbml:2006,
 publicationtype = {article},
 Author = {Ond{\v{r}}ej Bojar and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}}},
 title = "{CzEng: Czech-English Parallel Corpus, Release version 0.5}",
 Journal = {Prague Bulletin of Mathematical Linguistics},
 Volume = {86},
 pages = {59--62},
 ISSN = {0032-6585},
 Publisher = {Charles University},
 PubAddress = {Prague},
 Year = {2006}
}

URL: http://ufal.mff.cuni.cz/czeng/

9. License

By using CzEng 0.7, the user agrees to be bound by the license agreement. In brief, the license

follows the restrictions specified in the individual licenses of the sources of parallel texts,
allows the user to use the data only for non-commercial research or educational purposes,
allows the user to extract statistical information from the texts and/or to make short citations,
requires the user to make a reference to CzEng in any published work in which he/she used the CzEng data.

10. Disclaimer

The user of CzEng 0.7 should be aware of its following properties:

CzEng is not claimed to be a balanced corpus (whatever that means).
CzEng does not indicate which text is the original and which is the translation (usually, though not always, English is the original language, and in some cases both the English and the Czech texts are translations from a third language).
Quality of the contained data (including grammatical correctness, translation accuracy, alignment quality etc.) is not guaranteed and actually can be very diverse, depending especially on the type of the input resource.
CzEng does not contain all the information present in the input resources, and thus they cannot be reconstructed from CzEng. Some text segments as well as parts of the original annotation might be missing (for instance, all the resources have been (re-)segmented and (re-)aligned).

11. Acknowledgement

The CzEng 0.7 release was partially funded by the grant FP6-IST-5-034291-STP (EuroMatrix).