UMC 0.1 and UMC003: Czech-English-Russian

Introduction

UMC 0.1 Czech-English-Russian is a multilingual parallel corpus of texts in Czech, Russian and English with automatic pairwise sentence alignments. The primary aim of UMC is to extend the set of languages covered by the CzEng corpus, mainly for the purposes of machine translation.

All the texts were downloaded from a single source, Project Syndicate (Copyright: Project Syndicate 1995-2008), which contains a large collection of high-quality news articles and commentaries. We were given permission to use the texts for research and non-commercial purposes.

UMC003 is a cleaned, tokenized development and test set that accompanies the training data in UMC 0.1. For more information about the test set, see the README file in the UMC003 package.

Licensing

UMC 0.1 and UMC003 are available for research, educational and non-profit use free of charge. Contact us if you are interested in obtaining a different type of license.

Download

UMC 0.1 File Formats

UMC 0.1 is released as plain text files (Unicode in UTF-8, Unix line breaks). Each language pair is stored in separate files.

For each language pair, we provide the full data (e.g. Czech-Russian.txt) and a separate file containing only sentences that were aligned 1-1 (e.g. Czech-Russian.1-1.txt).

Each line in the files corresponds to one alignment pair. We use tab to delimit columns.

The full data files have four columns: the alignment type (the number of source and target sentences in the alignment pair, e.g. 2-1), the quality of the alignment pair (the higher, the better), the source language sentence(s), and the target language sentence(s). Here is an example of two alignment pairs. In the first, one Czech sentence corresponds to one Russian sentence; in the second, two Czech sentences correspond to one Russian sentence:

1-1	0.55641	Mušarafův poslední výstup ?	Последний ход Мушаррафа ?
2-1	0.765517	My je ale vyslovit musíme . Je to naše povinnost .	Но мы не только должны осмелиться , мы обязаны это сделать .
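The four-column layout described above can be read with a few lines of Python. This is only a minimal sketch: the names `AlignmentPair`, `parse_full_line` and `read_full_file` are illustrative and not part of the UMC distribution, and the code assumes that an empty side of a 0-1 or 1-0 alignment appears as an empty column.

```python
from typing import Iterator, NamedTuple


class AlignmentPair(NamedTuple):
    """One line of a full UMC 0.1 data file (hypothetical helper type)."""
    alignment_type: str  # e.g. "1-1" or "2-1"
    quality: float       # alignment quality, higher is better
    source: str          # source language sentence(s)
    target: str          # target language sentence(s)


def parse_full_line(line: str) -> AlignmentPair:
    # Strip only the line break so that empty trailing columns survive.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated columns, got {len(fields)}")
    alignment_type, quality, source, target = fields
    return AlignmentPair(alignment_type, float(quality), source, target)


def read_full_file(path: str) -> Iterator[AlignmentPair]:
    # The files are UTF-8 with Unix line breaks.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_full_line(line)


# The second example line from above, with tabs written as \t.
sample = (
    "2-1\t0.765517\t"
    "My je ale vyslovit musíme . Je to naše povinnost .\t"
    "Но мы не только должны осмелиться , мы обязаны это сделать ."
)
pair = parse_full_line(sample)
print(pair.alignment_type, pair.quality)
```

With such a reader, a call like `read_full_file("Czech-Russian.txt")` could be filtered by `alignment_type` or by a quality threshold of your choosing; the corpus itself does not prescribe any threshold.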

The 1-1 files have just two columns: source language sentence and target language sentence:

Mušarafův poslední výstup ?	Последний ход Мушаррафа ?
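Many machine translation tools expect two separate files with one sentence per line, so a common first step is to split the two-column 1-1 data into a source side and a target side. The sketch below shows one way to do this; the function names are illustrative assumptions, not part of the UMC distribution.

```python
from typing import Iterable, List, Tuple


def parse_one_to_one_line(line: str) -> Tuple[str, str]:
    # A 1-1 file has exactly two tab-separated columns.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-separated columns, got {len(fields)}")
    return fields[0], fields[1]


def split_sides(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return the source sentences and the target sentences as parallel lists."""
    sources: List[str] = []
    targets: List[str] = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines, if any
        source, target = parse_one_to_one_line(line)
        sources.append(source)
        targets.append(target)
    return sources, targets


# The example line from above, with the tab written as \t.
sample_lines = ["Mušarafův poslední výstup ?\tПоследний ход Мушаррафа ?\n"]
czech, russian = split_sides(sample_lines)
print(czech[0])
print(russian[0])
```

To process a whole file, pass an open file handle instead of the sample list, for example `split_sides(open("Czech-Russian.1-1.txt", encoding="utf-8"))`.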

UMC 0.1 Statistics

Monolingual Statistics

In the following table we summarize the monolingual statistics of the corpus:

            Czech       Russian     English
Words       1,747,997   1,815,550   1,920,164
Tokens      2,022,990   2,152,326   2,255,901
Sentences   96,335      101,528     97,250

Sentence Alignment

The texts were aligned at the sentence level with the help of hunalign. The distribution of the alignment types is shown in the table below:

Alignment type   1-1       2-1      0-1      1-2      1-0      Others
Pairs            259,599   9,074    8,551    7,686    834      2,434
Share            90.1 %    3.1 %    3.0 %    2.7 %    0.3 %    0.8 %
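The percentages in the table follow directly from the counts. The short sketch below recomputes them; the counts are copied from the table, while the variable names are illustrative. The total of 288,178 equals the sum of the many-to-many alignment pairs over the three language pairs reported under Bilingual Statistics, which suggests that the distribution covers all three pairs together.

```python
# Number of alignment pairs per alignment type, copied from the table.
counts = {
    "1-1": 259599,
    "2-1": 9074,
    "0-1": 8551,
    "1-2": 7686,
    "1-0": 834,
    "Others": 2434,
}

total = sum(counts.values())

# Share of each alignment type in percent, rounded to one decimal place.
shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}

for kind, n in counts.items():
    print(f"{kind:>6}  {n:>7,}  {shares[kind]:.1f} %")
print(f"{'Total':>6}  {total:>7,}")
```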

Bilingual Statistics

                          Czech-Russian           English-Russian         Czech-English
Many-to-many alignments
  Alignment pairs         97,373                  97,656                  93,149
  Tokens                  2,022,950 / 2,151,946   2,255,889 / 2,151,948   2,022,950 / 2,255,889
1-to-1 alignments
  Alignment pairs         88,092                  86,604                  84,903
  Tokens                  1,867,392 / 1,927,283   2,062,849 / 1,885,683   1,788,227 / 2,000,684

Token counts are listed in the order of the languages named in the column header.
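As a quick consistency check, the bilingual figures can be compared with the alignment-type distribution: the 1-to-1 pairs of the three language pairs should add up to the 1-1 count, and all alignment pairs to the overall total. The sketch below does this and also derives the share of 1-to-1 pairs per language pair. The numbers come from the tables in this document; the dictionary layout and names are illustrative.

```python
# Alignment pairs per language pair, copied from the bilingual statistics.
pairs = {
    "Czech-Russian":   {"all": 97373, "one_to_one": 88092},
    "English-Russian": {"all": 97656, "one_to_one": 86604},
    "Czech-English":   {"all": 93149, "one_to_one": 84903},
}

total_all = sum(row["all"] for row in pairs.values())
total_one_to_one = sum(row["one_to_one"] for row in pairs.values())

# Fraction of alignment pairs that are 1-to-1, per language pair.
one_to_one_share = {
    name: row["one_to_one"] / row["all"] for name, row in pairs.items()
}

for name, share in one_to_one_share.items():
    print(f"{name}: {100 * share:.1f} % of alignment pairs are 1-to-1")

# Both totals agree with the alignment-type distribution table.
print(total_all == 288178, total_one_to_one == 259599)
```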

Citing

If you want to cite UMC 0.1, please use the following reference and optionally the URL:

  • Natalia Klyueva and Ondřej Bojar. UMC 0.1: Czech-Russian-English Multilingual Corpus. Proceedings of International Conference Corpus Linguistics, pages 188-195, October 2008.
    @inProceedings{umc:2008,
      publicationtype = {inProceedings},
      author = {Natalia Klyueva and Ond{\v{r}}ej Bojar},
      title = "{UMC 0.1: Czech-Russian-English Multilingual Corpus}",
      booktitle = {Proceedings of International Conference Corpus Linguistics},
      pages = {188--195},
      year = {2008},
    }
    
  • http://ufal.mff.cuni.cz/umc

Acknowledgment

The work on UMC 0.1 was supported by the grant FP6-IST-5-034291-STP (EuroMatrix).

The work on UMC003 was supported by the grant FP7-ICT-2007-3-231720 (EuroMatrix Plus).


    Institute of Formal and Applied Linguistics (ÚFAL)
    Ondřej Bojar, bojar <at> ufal.mff.cuni.cz
    Natalia Kljueva, kljueva <at> ufal.mff.cuni.cz
    David Kolovratník, david <at> kolovratnik.net