CzEng 0.5 (Czech-English Parallel Corpus, version 0.5)

Ondřej Bojar

Zdeněk Žabokrtský

Table of Contents

1. Introduction
2. Download
3. Sources of Parallel Texts
4. Text Preprocessing
5. Known Limitations of Preprocessing
6. Sentence Alignment
7. Data Formats and File Naming Convention
8. Tools
9. Citing CzEng
10. License
11. Disclaimer

1. Introduction

CzEng 0.5 (http://ufal.mff.cuni.cz/czeng/) is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics, Charles University, Prague in 2005-2006. The corpus contains no manual annotation. It is limited to texts that were already available in electronic form and that are not protected by authors' rights in the Czech Republic. The main purpose of the corpus is to support Czech-English and English-Czech machine translation research with the necessary data. CzEng 0.5 is available free of charge for educational and research purposes; however, users should become acquainted with the license agreement (http://ufal.mff.cuni.cz/czeng/license.html).

CzEng 0.5 consists of a large set of parallel textual documents, mainly from the fields of European law, information technology, and fiction, all of them converted into a uniform XML-based file format and provided with automatic sentence alignment. The corpus contains 7,743 document pairs in total. Full details on the corpus size are given in the table in Section 3.

2. Download

Full CzEng 0.5 data are available in the following package:

http://ufal.mff.cuni.cz/czeng/czeng05.zip (260 MiB or 270 MB)

3. Sources of Parallel Texts

We have used texts from the following publicly available sources:

Acquis Communautaire Parallel Corpus (prefix celex) available at http://wt.jrc.it/lt/Acquis/. It contains a huge body of EU legislative texts written between the 1950s and 2005 (CzEng uses only two of the 20 languages covered by the Acquis Communautaire Corpus).
Corpus OPUS available at http://logos.uio.no/opus/. It is an open source collection of freely available corpora; two of them are used in CzEng:
- EU constitution proposal (prefix euconst)
- KDE documentation (prefix kde)
Samples from the Official Journal of the European Union (prefix eujournal) available at http://europa.eu.int/eur-lex/lex/JOIndex.do?ihmlang=en. This is a tiny collection of more or less randomly chosen issues of the Official Journal of the European Union.
Reader's Digest stories from two sources:
- Stories available as part of the Prague Czech-English Dependency Treebank (PCEDT) (prefix pcedt-rd). For more information on PCEDT, see http://ufal.mff.cuni.cz/pcedt/.
- Additional Reader's Digest stories (prefix rd2).
Parallel corpus Kačenka (prefix kacenka) available at http://www.phil.muni.cz/angl/kacenka/kachna.html. Because of the authors' rights, CzEng 0.5 can include only its subset, namely the following books: D. H. Lawrence: Sons and Lovers / Synové a milenci, Ch. Dickens: The Pickwick Papers / Pickwickovci, Ch. Dickens: Oliver Twist, T. Hardy: Jude the Obscure / Neblahý Juda, T. Hardy: Tess of the d'Urbervilles / Tess z d'Urbervillu.
E-books (prefix books) freely available on the Internet both in English and Czech (especially at http://www.gutenberg.org and http://www.palmknihy.cz), namely: Jack London: The Star Rover / Tulák po hvězdách, Franz Kafka: Trial / Proces, E.A. Poe: The Narrative of Arthur Gordon Pym of Nantucket / Dobrodružství A.G.Pyma, E.A. Poe: A Descent into the Maelstrom / Pád do Malströmu, Jerome K. Jerome: Three Men in a Boat / Tři muži ve člunu.

The quantitative properties of the individual sources (after performing the necessary preprocessing, as described in the next section) are summarized in the following table:

	Document pairs	Sentences		Words+Punctuation
		Czech	English	Czech	English
Total	7,743	1,418,721	1,295,647	18,517,624	20,994,274
Total	100.0%	100.0%	100.0%	100.0%	100.0%
Acquis Communautaire	6,272	1,101,610	930,626	14,619,572	16,079,043
Acquis Communautaire	81.0%	77.6%	71.8%	78.9%	76.6%
European Constitution	47	11,506	10,380	138,853	176,096
European Constitution	0.6%	0.8%	0.8%	0.7%	0.8%
Samples from European Journal	8	5,777	4,993	104,560	133,136
Samples from European Journal	0.1%	0.4%	0.4%	0.6%	0.6%
Reader's Digest	927	121,203	128,305	1,794,827	2,234,047
Reader's Digest	12.0%	8.5%	9.9%	9.7%	10.6%
Kačenka	5	62,696	69,951	1,034,642	1,188,029
Kačenka	0.1%	4.4%	5.4%	5.6%	5.7%
E-Books	5	17,140	17,495	330,118	399,607
E-Books	0.1%	1.2%	1.4%	1.8%	1.9%
KDE	479	98,789	133,897	495,052	784,316
KDE	6.2%	7.0%	10.3%	2.7%	3.7%

4. Text Preprocessing

Since the individual sources of parallel texts differ in many aspects, a lot of effort was required to integrate them into a common framework. Depending on the type of the input resource, some or all of the following steps were applied to the Czech and English documents:

conversion from PDF, PALM (PDB DOC), SGML, HTML and other formats,
encoding conversion (everything converted into UTF-8 character encoding), with occasional manual correction of misinterpreted character codes,
removing scanning errors, removing end-of-line hyphens,
file renaming, directory restructuring,
sentence segmentation,
tokenization,
removing long text segments having no counterpart in the corresponding document,
adding sentence and token identifiers,
conversion to a common XML format.
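
The last two steps (adding identifiers and converting to XML) can be illustrated with a minimal Python sketch. It is not the code used to build CzEng. The element names s and w, the id pattern (sentence id followed by w and a running number) and the no_space_after attribute are taken from the samples in Section 7; the function name build_sentence, the input representation and the sentence id demo-en-c1p1s1 are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_sentence(sentence_id, tokens):
    """Build an <s> element from (text, followed_by_space) pairs.

    Each token becomes a <w> child whose id is the sentence id plus
    'w' and a 1-based token number, as in the Section 7 samples.
    """
    s = ET.Element("s", {"id": sentence_id})
    for n, (text, space_after) in enumerate(tokens, start=1):
        attrs = {"id": f"{sentence_id}w{n}"}
        if not space_after:
            # Assumption: the attribute is only written when it applies.
            attrs["no_space_after"] = "1"
        ET.SubElement(s, "w", attrs).text = text
    return s

# Hypothetical sentence id and tokens, for illustration only.
sentence = build_sentence(
    "demo-en-c1p1s1",
    [("saw", True), ("there", False), ("?", True)],
)
print(ET.tostring(sentence, encoding="unicode"))
```

ElementTree writes attribute values in double quotes, whereas the CzEng samples use single quotes; both are equivalent XML.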

5. Known Limitations of Preprocessing

The tokenization and segmentation rules were kept as simple as possible:

a different character class (digit, letter, punctuation) always starts a new token,
adjacent punctuation characters are encoded as separate tokens.

This decision leads to some unpleasant differences in tokenization and segmentation compared to the "common standard" of Penn-Treebank-like annotation.

Abbreviations were not detected. This hurts especially with titles (Dr.) or abbreviated names (O. Bojar), because a period followed by an upper-case letter is treated as a sentence boundary. All such expressions are thus split into several sentences.
Consecutive periods (...) lead to a sequence of one-token sentences.
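
The effect of these rules can be reproduced with a small Python sketch. It is an illustration under stated assumptions, not the CzEng implementation: the text above gives only the two tokenization rules and the two symptoms, so the exact boundary condition in split_sentences (a period, question mark or exclamation mark followed by an upper-case token, by another period, or by the end of the input) is a guess chosen to reproduce those symptoms. All function names are hypothetical.

```python
def char_class(ch):
    """Coarse character class: S(pace), D(igit), L(etter), P(unctuation/other)."""
    if ch.isspace():
        return "S"
    if ch.isdigit():
        return "D"
    if ch.isalpha():
        return "L"
    return "P"

def tokenize(text):
    """Start a new token at every class change; punctuation is never merged."""
    tokens = []
    prev = "S"
    for ch in text:
        cls = char_class(ch)
        if cls == "S":
            pass  # whitespace only separates tokens
        elif cls == "P" or cls != prev:
            tokens.append(ch)
        else:
            tokens[-1] += ch
        prev = cls
    return tokens

def split_sentences(tokens):
    """Naive segmentation with no abbreviation handling (assumed rule)."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok in {".", "?", "!"} and (nxt is None or nxt == "." or nxt[0].isupper()):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

tokens = tokenize("Dr. Bojar came. He left...")
sentences = split_sentences(tokens)
for sentence in sentences:
    print(sentence)
```

Under these assumptions the title "Dr." ends up as a sentence of its own, and the trailing "..." yields the sentence "He left ." followed by two one-token sentences.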

6. Sentence Alignment

The documents were sentence-aligned using hunalign (http://mokk.bme.hu/resources/hunalign), a freely available tool.

All the settings were kept default and we did not use any dictionary to bootstrap from. Hunalign collected its own temporary dictionary to improve sentence-level alignments.

The number of alignment pairs, broken down by the number of sentences on the English and Czech sides, is given in the following table:

English-Czech	1-1	0-1	1-2	2-1	1-0	1-3	0-2	3-1	Other
Alignment pairs	924,543	97,929	70,879	69,558	64,490	23,538	8,526	6,768	24,943
	71.6%	7.6%	5.5%	5.4%	5.0%	1.8%	0.7%	0.5%	1.9%
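
As a quick consistency check, the percentage row can be recomputed from the counts in the table. The Python snippet below uses only the figures given above; the variable names are arbitrary.

```python
# Alignment-pair counts copied from the table above (English-Czech).
counts = {
    "1-1": 924543, "0-1": 97929, "1-2": 70879, "2-1": 69558, "1-0": 64490,
    "1-3": 23538, "0-2": 8526, "3-1": 6768, "Other": 24943,
}

total = sum(counts.values())
shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}

print("total alignment pairs:", total)
for kind, share in shares.items():
    print(f"{kind}\t{share}%")
```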

7. Data Formats and File Naming Convention

In CzEng 0.5, each document pair is represented by three files:

*-en.xml - XML file containing the English text structured according to czeng05-text.dtd. Historically, we call this the 'f2' format.
*-cs.xml - XML file containing the Czech counterpart structured according to the same DTD
*-salign.xml - XML file containing the sentence alignment of the two texts, represented as pairs of (possibly empty) groups of identifiers of the corresponding sentences according to czeng05-alignment.dtd

Example:

Sample from the file data/books/books-two_towers-en.xml

 ...
 <s id='books-two_towers-en-c1p2s6'>
   <w id='books-two_towers-en-c1p2s6w1'>I</w>
   <w id='books-two_towers-en-c1p2s6w2'>wonder</w>
   <w id='books-two_towers-en-c1p2s6w3'>what</w>
   <w id='books-two_towers-en-c1p2s6w4'>he</w>
   <w id='books-two_towers-en-c1p2s6w5'>saw</w>
   <w id='books-two_towers-en-c1p2s6w6' no_space_after='1'>there</w>
   <w id='books-two_towers-en-c1p2s6w7'>?</w>
 </s>
 <s id='books-two_towers-en-c1p2s7'>
   <w id='books-two_towers-en-c1p2s7w1'>But</w>
   <w id='books-two_towers-en-c1p2s7w2'>he</w>
   <w id='books-two_towers-en-c1p2s7w3'>returned</w>
 ...

Sample from the file data/books/books-two_towers-cs.xml

 ...
 <s id='books-two_towers-cs-c1p3s7'>
  <w id='books-two_towers-cs-c1p3s7w1'>Co</w>
  <w id='books-two_towers-cs-c1p3s7w2'>tam</w>
  <w id='books-two_towers-cs-c1p3s7w3'>asi</w>
  <w id='books-two_towers-cs-c1p3s7w4' no_space_after='1'>uviděl</w>
  <w id='books-two_towers-cs-c1p3s7w5'>?</w>
 </s>
 <s id='books-two_towers-cs-c1p3s8'>
  <w id='books-two_towers-cs-c1p3s8w1'>Vracel</w>
  <w id='books-two_towers-cs-c1p3s8w2'>se</w>
  <w id='books-two_towers-cs-c1p3s8w3'>ale</w>
 ...

Sample from the file data/books/books-two_towers-salign.xml

 ...
<pair type="1-1">
  <members1>
    <member idref="books-two_towers-en-c1p2s6"/>
  </members1>
  <members2>
    <member idref="books-two_towers-cs-c1p3s7"/>
  </members2>
</pair>
<pair type="1-1">
  <members1>
    <member idref="books-two_towers-en-c1p2s7"/>
  </members1>
  <members2>
    <member idref="books-two_towers-cs-c1p3s8"/>
  </members2>
</pair>
 ...
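
The three files can be combined into sentence pairs with a few lines of Python. The sketch below is a minimal illustration, not a tool shipped with CzEng. It relies only on the elements visible in the samples above (s, w, pair, members1, members2, member); the wrapper elements doc and alignment are hypothetical, since the DTDs are not reproduced here, and the reading of members1 as the English side and members2 as the Czech side is inferred from the sample.

```python
import xml.etree.ElementTree as ET

# Inline excerpts; <doc> and <alignment> are hypothetical wrapper elements.
EN_XML = """<doc>
 <s id='books-two_towers-en-c1p2s6'>
   <w id='books-two_towers-en-c1p2s6w1'>I</w>
   <w id='books-two_towers-en-c1p2s6w2'>wonder</w>
   <w id='books-two_towers-en-c1p2s6w3'>what</w>
   <w id='books-two_towers-en-c1p2s6w4'>he</w>
   <w id='books-two_towers-en-c1p2s6w5'>saw</w>
   <w id='books-two_towers-en-c1p2s6w6' no_space_after='1'>there</w>
   <w id='books-two_towers-en-c1p2s6w7'>?</w>
 </s>
</doc>"""

CS_XML = """<doc>
 <s id='books-two_towers-cs-c1p3s7'>
  <w id='books-two_towers-cs-c1p3s7w1'>Co</w>
  <w id='books-two_towers-cs-c1p3s7w2'>tam</w>
  <w id='books-two_towers-cs-c1p3s7w3'>asi</w>
  <w id='books-two_towers-cs-c1p3s7w4' no_space_after='1'>uviděl</w>
  <w id='books-two_towers-cs-c1p3s7w5'>?</w>
 </s>
</doc>"""

SALIGN_XML = """<alignment>
<pair type="1-1">
  <members1><member idref="books-two_towers-en-c1p2s6"/></members1>
  <members2><member idref="books-two_towers-cs-c1p3s7"/></members2>
</pair>
</alignment>"""

def read_sentences(xml_text):
    """Map each sentence id to its list of token strings."""
    root = ET.fromstring(xml_text)
    return {s.get("id"): [w.text or "" for w in s.iter("w")] for s in root.iter("s")}

def read_pairs(xml_text):
    """Return (type, ids on side 1, ids on side 2) for every <pair>."""
    root = ET.fromstring(xml_text)
    return [
        (
            pair.get("type"),
            [m.get("idref") for m in pair.findall("members1/member")],
            [m.get("idref") for m in pair.findall("members2/member")],
        )
        for pair in root.iter("pair")
    ]

def aligned_text(pairs, side1, side2, only_type=None):
    """Yield tokenized text for both sides of each (optionally filtered) pair."""
    for kind, ids1, ids2 in pairs:
        if only_type is not None and kind != only_type:
            continue
        yield (
            " ".join(tok for i in ids1 for tok in side1[i]),
            " ".join(tok for i in ids2 for tok in side2[i]),
        )

en = read_sentences(EN_XML)
cs = read_sentences(CS_XML)
result = list(aligned_text(read_pairs(SALIGN_XML), en, cs, only_type="1-1"))
print(result)
```

For files on disk, ET.parse(path).getroot() can take the place of ET.fromstring. Passing only_type="1-1" keeps just the one-to-one pairs.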

8. Tools

The following simple tools for manipulating CzEng files are included in tools/:

restrict_f2_to_11_alignments.pl restricts CzEng text files to contain only sentences that are aligned 1-to-1, according to the corresponding .salign file.
f2_to_m0.pl and f2_to_w.pl convert CzEng text files to a variant of Prague Markup Language (PML, http://ufal.mff.cuni.cz/jazz/PML/doc/pml_doc.html), which was used to annotate the Prague Dependency Treebank.
f2_to_plain.pl converts CzEng text files to plain text, each sentence on a line. The tokenization of CzEng is retained by default, but can be removed if necessary.
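
The idea behind the plain-text conversion can be sketched in Python as follows. This is not the code of f2_to_plain.pl and does not reflect its options. The reading of no_space_after='1' (the token is directly followed by the next one in the original text) is inferred from the attribute name and the samples in Section 7, and the function name sentence_to_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# One Czech sentence copied from the sample in Section 7.
SENTENCE_XML = """<s id='books-two_towers-cs-c1p3s7'>
  <w id='books-two_towers-cs-c1p3s7w1'>Co</w>
  <w id='books-two_towers-cs-c1p3s7w2'>tam</w>
  <w id='books-two_towers-cs-c1p3s7w3'>asi</w>
  <w id='books-two_towers-cs-c1p3s7w4' no_space_after='1'>uviděl</w>
  <w id='books-two_towers-cs-c1p3s7w5'>?</w>
</s>"""

def sentence_to_text(s_elem, keep_tokenization=True):
    """Render one <s> element as a single line of text.

    With keep_tokenization=True every token is separated by a space.
    Otherwise tokens marked no_space_after='1' are glued to the next token.
    """
    parts = []
    for w in s_elem.iter("w"):
        parts.append(w.text or "")
        glue = not keep_tokenization and w.get("no_space_after") == "1"
        if not glue:
            parts.append(" ")
    return "".join(parts).strip()

sent = ET.fromstring(SENTENCE_XML)
print(sentence_to_text(sent))
print(sentence_to_text(sent, keep_tokenization=False))
```

The first call keeps the CzEng tokenization ("uviděl ?"), the second restores the original spacing ("uviděl?").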

9. Citing CzEng

If you make use of CzEng data, please make sure to cite CzEng properly:

Preferred citation:

Ondřej Bojar and Zdeněk Žabokrtský. 2006. CzEng: Czech-English Parallel Corpus, Release version 0.5. Prague Bulletin of Mathematical Linguistics, 86. (in print).

@Article{czeng:pbml:2006,
 publicationtype = {article},
 Author = {Ond{\v{r}}ej Bojar and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}}},
 title = "{CzEng: Czech-English Parallel Corpus, Release version 0.5}",
 Journal = {Prague Bulletin of Mathematical Linguistics},
 Volume = {86},
 ISSN = {0032-6585},
 Publisher = {Charles University},
 PubAddress = {Prague},
 Year = {2006},
 note = {(in print)}
}

URL: http://ufal.mff.cuni.cz/czeng/

10. License

By using CzEng 0.5, the user agrees to be bound by the license agreement. In brief, the license

follows the restrictions specified in the individual licenses of the sources of parallel texts,
allows the user to use the data only for non-commercial research or educational purposes,
allows the user to extract statistical information from the texts and/or to make short citations,
requires the user to make a reference to CzEng in any published work that uses the CzEng data.

11. Disclaimer

The user of CzEng 0.5 should be aware of the following properties:

CzEng is not claimed to be a balanced corpus (whatever that means).
CzEng does not indicate which text is the original and which is the translation. English is usually the original language, but in some cases both the English and the Czech texts are translations from a third language.
The quality of the data (including grammatical correctness, translation accuracy, alignment quality, etc.) is not guaranteed and can vary widely, depending especially on the type of the input resource.
CzEng does not contain all the information present in the input resources, and thus they cannot be reconstructed from CzEng. Some text segments as well as parts of the original annotation might be missing (for instance, all the resources have been (re-)segmented and (re-)aligned).