CzEng 0.5 (Czech-English Parallel Corpus, version 0.5)

Ondřej Bojar

Zdeněk Žabokrtský


Table of Contents

1. Introduction
2. Download
3. Sources of Parallel Texts
4. Text Preprocessing
5. Known Limitations of Preprocessing
6. Sentence Alignment
7. Data Formats and File Naming Convention
8. Tools
9. Citing CzEng
10. License
11. Disclaimer

1. Introduction

CzEng 0.5 (http://ufal.mff.cuni.cz/czeng/) is a Czech-English parallel corpus compiled at the Institute of Formal and Applied Linguistics, Charles University, Prague, in 2005-2006. The corpus contains no manual annotation. It is limited to texts that were already available in electronic form and that are not protected by authors' rights in the Czech Republic. The main purpose of the corpus is to provide data for Czech-English and English-Czech machine translation research. CzEng 0.5 is available free of charge for educational and research purposes; users should, however, acquaint themselves with the license agreement (http://ufal.mff.cuni.cz/czeng/license.html).

CzEng 0.5 consists of a large set of parallel textual documents, mainly from the fields of European law, information technology, and fiction, all converted into a uniform XML-based file format and provided with automatic sentence alignment. The corpus contains 7,743 document pairs in total. Full details on the corpus size are given in the table in Section 3.

2. Download

Full CzEng 0.5 data are available in the following package:

3. Sources of Parallel Texts

We have used texts from the following publicly available sources:

  • Acquis Communautaire Parallel Corpus (prefix celex) available at http://wt.jrc.it/lt/Acquis/. It contains a huge body of EU legislative texts written between the 1950s and 2005 (CzEng uses only two of the 20 languages covered by the Acquis Communautaire corpus).

  • Corpus OPUS available at http://logos.uio.no/opus/. It is an open source collection of freely available corpora; two of them are used in CzEng:

    • EU constitution proposal (prefix euconst)

    • KDE documentation (prefix kde)

  • Samples from the Official Journal of the European Union (prefix eujournal) available at http://europa.eu.int/eur-lex/lex/JOIndex.do?ihmlang=en. This is a tiny collection of more or less randomly chosen issues of the Official Journal of the European Union.

  • Reader's Digest stories from two sources:

    • Stories available as a part of Prague Czech-English Dependency Treebank (PCEDT) (prefix pcedt-rd). For more information on PCEDT, see http://ufal.mff.cuni.cz/pcedt/.

    • Additional Reader's Digest stories (prefix rd2).

  • Parallel corpus Kačenka (prefix kacenka) available at http://www.phil.muni.cz/angl/kacenka/kachna.html. Because of authors' rights, CzEng 0.5 can include only a subset of it, namely the following books: D. H. Lawrence: Sons and Lovers / Synové a milenci, Ch. Dickens: The Pickwick Papers / Pickwickovci, Ch. Dickens: Oliver Twist, T. Hardy: Jude the Obscure / Neblahý Juda, T. Hardy: Tess of the d'Urbervilles / Tess z d'Urbervillu.

  • E-books (prefix books) freely available on the Internet in both English and Czech (especially at http://www.gutenberg.org and http://www.palmknihy.cz), namely: Jack London: The Star Rover / Tulák po hvězdách, Franz Kafka: Trial / Proces, E.A. Poe: The Narrative of Arthur Gordon Pym of Nantucket / Dobrodružství A.G.Pyma, E.A. Poe: A Descent into the Maelstrom / Pád do Malströmu, Jerome K. Jerome: Three Men in a Boat / Tři muži ve člunu.

The quantitative properties of the individual sources (after performing the necessary preprocessing, as described in the next section) are summarized in the following table:

                              Document pairs  Sentences               Words+Punctuation
                                              Czech       English     Czech        English
Total                         7,743           1,418,721   1,295,647   18,517,624   20,994,274
                              100.0%          100.0%      100.0%      100.0%       100.0%
Acquis Communautaire          6,272           1,101,610   930,626     14,619,572   16,079,043
                              81.0%           77.6%       71.8%       78.9%        76.6%
European Constitution         47              11,506      10,380      138,853      176,096
                              0.6%            0.8%        0.8%        0.7%         0.8%
Samples from European Journal 8               5,777       4,993       104,560      133,136
                              0.1%            0.4%        0.4%        0.6%         0.6%
Reader's Digest               927             121,203     128,305     1,794,827    2,234,047
                              12.0%           8.5%        9.9%        9.7%         10.6%
Kačenka                       5               62,696      69,951      1,034,642    1,188,029
                              0.1%            4.4%        5.4%        5.6%         5.7%
E-Books                       5               17,140      17,495      330,118      399,607
                              0.1%            1.2%        1.4%        1.8%         1.9%
KDE                           479             98,789      133,897     495,052      784,316
                              6.2%            7.0%        10.3%       2.7%         3.7%

4. Text Preprocessing

Since the individual sources of parallel texts differ in many respects, a lot of effort was required to integrate them into a common framework. Depending on the type of the input resource, some or all of the following steps were applied to the Czech and English documents:

  • conversion from PDF, PALM (PDB DOC), SGML, HTML and other formats,
  • encoding conversion (everything converted into UTF-8 character encoding), sometimes manual correction of mis-interpreted character codes,
  • removing scanning errors, removing end-of-line hyphens,
  • file renaming, directory restructuring,
  • sentence segmentation,
  • tokenization,
  • removing long text segments having no counterpart in the corresponding document,
  • adding sentence and token identifiers,
  • conversion to a common XML format.

5. Known Limitations of Preprocessing

The tokenization and segmentation rules were kept as simple as possible:

  • a different character class (digit, letter, punctuation) always starts a new token,
  • adjacent punctuation characters are encoded as separate tokens.

This decision leads to some unpleasant differences in tokenization and segmentation compared to the "common standard" of Penn-Treebank-like annotation:

  • No abbreviations were searched for. This hurts especially with titles (Dr.) or abbreviated names (O. Bojar), because a period followed by an upper-case letter is treated as a sentence boundary. All such expressions are thus split into several sentences.
  • Consecutive periods (...) lead to a sequence of one-token sentences.
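
The two tokenization rules above can be sketched in a few lines of Python. This is an illustration under assumptions, not the code used to build CzEng: the function names char_class and tokenize are hypothetical, whitespace is assumed to separate tokens, and Python's Unicode-aware str.isdigit and str.isalpha stand in for the character classes, whose exact definition is not given here.

```python
def char_class(ch):
    """Return a coarse character class (assumed definition, see lead-in)."""
    if ch.isspace():
        return "space"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "punct"


def tokenize(text):
    """Split text so that every change of character class starts a new token
    and every punctuation character becomes a token of its own."""
    tokens = []
    current = ""          # token being built
    current_class = None  # class of the characters in `current`
    for ch in text:
        cls = char_class(ch)
        if cls == "space":
            # Whitespace only ends the current token.
            if current:
                tokens.append(current)
            current, current_class = "", None
        elif cls == "punct":
            # Adjacent punctuation characters are never merged.
            if current:
                tokens.append(current)
            tokens.append(ch)
            current, current_class = "", None
        elif cls == current_class:
            current += ch
        else:
            # Letter after digit or digit after letter: start a new token.
            if current:
                tokens.append(current)
            current, current_class = ch, cls
    if current:
        tokens.append(current)
    return tokens


print(tokenize("Dr. Bojar paid 25Kc..."))
# ['Dr', '.', 'Bojar', 'paid', '25', 'Kc', '.', '.', '.']
```

The output shows both limitations at the token level: the title "Dr." is broken into two tokens, and the ellipsis becomes three separate period tokens.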

6. Sentence Alignment

The documents were sentence-aligned using hunalign (http://mokk.bme.hu/resources/hunalign), a freely available tool.

All settings were kept at their defaults, and we did not use any dictionary to bootstrap from. Hunalign collected its own temporary dictionary to improve sentence-level alignments.

The number of alignment pairs, broken down by the number of sentences on the English and Czech sides, is given in the following table:

English-Czech    1-1      0-1     1-2     2-1     1-0     1-3     0-2    3-1    Other
Alignment pairs  924,543  97,929  70,879  69,558  64,490  23,538  8,526  6,768  24,943
                 71.6%    7.6%    5.5%    5.4%    5.0%    1.8%    0.7%   0.5%   1.9%

7. Data Formats and File Naming Convention

In CzEng 0.5, each document pair is represented by three files:

  • *-en.xml - XML file containing the English text structured according to czeng05-text.dtd. Historically, we call this the 'f2' format.
  • *-cs.xml - XML file containing the Czech counterpart structured according to the same DTD.
  • *-salign.xml - XML file containing the sentence alignment of the two texts, represented as pairs of groups of identifiers of the corresponding sentences according to czeng05-alignment.dtd.

Example:

  • Sample from the file data/books/books-two_towers-en.xml

     ...
     <s id='books-two_towers-en-c1p2s6'>
       <w id='books-two_towers-en-c1p2s6w1'>I</w>
       <w id='books-two_towers-en-c1p2s6w2'>wonder</w>
       <w id='books-two_towers-en-c1p2s6w3'>what</w>
       <w id='books-two_towers-en-c1p2s6w4'>he</w>
       <w id='books-two_towers-en-c1p2s6w5'>saw</w>
       <w id='books-two_towers-en-c1p2s6w6' no_space_after='1'>there</w>
       <w id='books-two_towers-en-c1p2s6w7'>?</w>
     </s>
     <s id='books-two_towers-en-c1p2s7'>
       <w id='books-two_towers-en-c1p2s7w1'>But</w>
       <w id='books-two_towers-en-c1p2s7w2'>he</w>
       <w id='books-two_towers-en-c1p2s7w3'>returned</w>
     ...
    

  • Sample from the file data/books/books-two_towers-cs.xml

     ...
     <s id='books-two_towers-cs-c1p3s7'>
      <w id='books-two_towers-cs-c1p3s7w1'>Co</w>
      <w id='books-two_towers-cs-c1p3s7w2'>tam</w>
      <w id='books-two_towers-cs-c1p3s7w3'>asi</w>
      <w id='books-two_towers-cs-c1p3s7w4' no_space_after='1'>uviděl</w>
      <w id='books-two_towers-cs-c1p3s7w5'>?</w>
     </s>
     <s id='books-two_towers-cs-c1p3s8'>
      <w id='books-two_towers-cs-c1p3s8w1'>Vracel</w>
      <w id='books-two_towers-cs-c1p3s8w2'>se</w>
      <w id='books-two_towers-cs-c1p3s8w3'>ale</w>
     ...
    

  • Sample from the file data/books/books-two_towers-salign.xml

     ...
    <pair type="1-1">
      <members1>
        <member idref="books-two_towers-en-c1p2s6"/>
      </members1>
      <members2>
        <member idref="books-two_towers-cs-c1p3s7"/>
      </members2>
    </pair>
    <pair type="1-1">
      <members1>
        <member idref="books-two_towers-en-c1p2s7"/>
      </members1>
      <members2>
        <member idref="books-two_towers-cs-c1p3s8"/>
      </members2>
    </pair>
     ...
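
As a rough illustration of how the alignment file can be read, the following Python sketch extracts the pairs from the sample above using the standard xml.etree.ElementTree module. It rests on assumptions: the wrapping <salign> root element and the function names read_pairs and one_to_one are hypothetical (the real root element is defined by czeng05-alignment.dtd), and only the pair, members1, members2 and member elements shown in the sample are relied upon.

```python
import xml.etree.ElementTree as ET

# The two pairs from the sample above, wrapped in a hypothetical root element.
SAMPLE = """<salign>
  <pair type="1-1">
    <members1><member idref="books-two_towers-en-c1p2s6"/></members1>
    <members2><member idref="books-two_towers-cs-c1p3s7"/></members2>
  </pair>
  <pair type="1-1">
    <members1><member idref="books-two_towers-en-c1p2s7"/></members1>
    <members2><member idref="books-two_towers-cs-c1p3s8"/></members2>
  </pair>
</salign>"""


def read_pairs(xml_text):
    """Return a list of (type, ids_side1, ids_side2) tuples.

    In the sample, members1 holds English ids and members2 holds Czech ids.
    A missing or empty members element simply yields an empty list.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for pair in root.iter("pair"):
        side1 = [m.get("idref") for m in pair.findall("members1/member")]
        side2 = [m.get("idref") for m in pair.findall("members2/member")]
        pairs.append((pair.get("type"), side1, side2))
    return pairs


def one_to_one(pairs):
    """Keep only 1-1 pairs and flatten them to (id_side1, id_side2) tuples."""
    return [(s1[0], s2[0]) for kind, s1, s2 in pairs if kind == "1-1"]


for en_id, cs_id in one_to_one(read_pairs(SAMPLE)):
    print(en_id, "<->", cs_id)
```

For a real file on disk, ET.parse(path).getroot() would replace ET.fromstring. The returned identifiers can then be looked up in the corresponding *-en.xml and *-cs.xml files.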
    
    

8. Tools

The following simple tools for manipulating CzEng files are included in tools/:

  • restrict_f2_to_11_alignments.pl restricts CzEng text files to contain only sentences that are aligned 1-to-1, according to the corresponding .salign file.

  • f2_to_m0.pl and f2_to_w.pl convert CzEng text files to a variant of Prague Markup Language (PML, http://ufal.mff.cuni.cz/jazz/PML/doc/pml_doc.html), which was used to annotate the Prague Dependency Treebank.

  • f2_to_plain.pl converts CzEng text files to plain text, each sentence on a line. The tokenization of CzEng is retained by default, but can be removed if necessary.
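
For readers who prefer not to use the Perl scripts, the plain-text conversion can be approximated in Python. The sketch below is not a port of f2_to_plain.pl; it only illustrates the idea on the English sample from Section 7. The <doc> root element and the function names are hypothetical, and the reading of no_space_after='1' as "no space between this token and the next one" is inferred from the samples rather than taken from the DTD.

```python
import xml.etree.ElementTree as ET

# One sentence from the English sample, wrapped in a hypothetical root element.
SAMPLE = """<doc>
  <s id='books-two_towers-en-c1p2s6'>
    <w id='books-two_towers-en-c1p2s6w1'>I</w>
    <w id='books-two_towers-en-c1p2s6w2'>wonder</w>
    <w id='books-two_towers-en-c1p2s6w3'>what</w>
    <w id='books-two_towers-en-c1p2s6w4'>he</w>
    <w id='books-two_towers-en-c1p2s6w5'>saw</w>
    <w id='books-two_towers-en-c1p2s6w6' no_space_after='1'>there</w>
    <w id='books-two_towers-en-c1p2s6w7'>?</w>
  </s>
</doc>"""


def sentence_to_text(s_elem, keep_tokenization=True):
    """Render one <s> element as a single line of text."""
    words = list(s_elem.iter("w"))
    if keep_tokenization:
        # Tokens separated by single spaces, as in the corpus tokenization.
        return " ".join(w.text or "" for w in words)
    parts = []
    for w in words:
        parts.append(w.text or "")
        # Assumed meaning: suppress the space that would follow this token.
        if w.get("no_space_after") != "1":
            parts.append(" ")
    return "".join(parts).rstrip()


def to_plain(xml_text, keep_tokenization=True):
    """Return one string per sentence, in document order."""
    root = ET.fromstring(xml_text)
    return [sentence_to_text(s, keep_tokenization) for s in root.iter("s")]


print("\n".join(to_plain(SAMPLE)))
# I wonder what he saw there ?
print("\n".join(to_plain(SAMPLE, keep_tokenization=False)))
# I wonder what he saw there?
```

Keeping the tokenization is usually what word-alignment and translation-model training expect, while the detokenized form is easier to read.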

9. Citing CzEng

If you make use of CzEng data, please make sure to cite CzEng properly:

  • Preferred citation:

    Ondřej Bojar and Zdeněk Žabokrtský. 2006. CzEng: Czech-English Parallel Corpus, Release version 0.5. Prague Bulletin of Mathematical Linguistics, 86. (in print).

    @Article{czeng:pbml:2006,
     publicationtype = {article},
     Author = {Ond{\v{r}}ej Bojar and Zden{\v{e}}k {\v{Z}}abokrtsk{\'{y}}},
     title = "{CzEng: Czech-English Parallel Corpus, Release version 0.5}",
     Journal = {Prague Bulletin of Mathematical Linguistics},
     Volume = {86},
     ISSN = {0032-6585},
     Publisher = {Charles University},
     PubAddress = {Prague},
     Year = {2006},
     note = {(in print)}
    }
    
  • URL: http://ufal.mff.cuni.cz/czeng/

10. License

By using CzEng 0.5, the user agrees to be bound by the license agreement. In brief, the license

  • follows the restrictions specified in the individual licenses of the sources of parallel texts,
  • allows the user to use the data only for non-commercial research or educational purposes,
  • allows the user to extract statistical information from the texts and/or to make short citations,
  • requires the user to make a reference to CzEng in any published work in which he/she used the CzEng data.

11. Disclaimer

The user of CzEng 0.5 should be aware of the following properties:

  • CzEng is not claimed to be a balanced corpus (whatever that means).
  • CzEng does not record which text was the original and which was the translation. English is usually the original language; in some cases, however, both the English and the Czech texts are translations from a third language.
  • The quality of the data (including grammatical correctness, translation accuracy, alignment quality, etc.) is not guaranteed and can in fact vary widely, depending especially on the type of the input resource.
  • CzEng does not contain all the information present in the input resources, and thus they cannot be reconstructed from CzEng. Some text segments as well as parts of the original annotation might be missing (for instance, all the resources have been (re-)segmented and (re-)aligned).