UMC 0.1 Czech-English-Russian

Introduction

UMC 0.1 Czech-English-Russian is a multilingual parallel corpus of texts in Czech, Russian and English with automatic pairwise sentence alignments. The primary aim of UMC is to extend the set of languages covered by the CzEng corpus, mainly for the purposes of machine translation.

All the texts were downloaded from a single source, Project Syndicate (Copyright: Project Syndicate 1995-2008), which contains a large collection of high-quality news articles and commentaries. We were given permission to use the texts for research and non-commercial purposes.

Licensing

UMC 0.1 is available for research, educational and non-profit use free of charge. Contact us if you are interested in obtaining a different type of license.

Download

Corpus File Formats

UMC 0.1 is released as plain text files (Unicode in UTF-8, Unix line breaks). Each language pair is stored in separate files.

For each language pair, we provide the full data (e.g. Czech-Russian.txt) and a separate file containing only sentences that were aligned 1-1 (e.g. Czech-Russian.1-1.txt).

Each line in the files corresponds to one alignment pair. We use tab to delimit columns.

The full data files have four columns: the number of sentences on each side of the alignment pair (e.g. 2-1), the quality of the alignment pair (the higher, the better), the source language sentence, and the target language sentence. Here is an example of two alignment pairs. In the first, one Czech sentence corresponds to one Russian sentence; in the second, two Czech sentences correspond to one Russian sentence:

1-1	0.55641	Mušarafův poslední výstup ?	Последний ход Мушаррафа ?
2-1	0.765517	My je ale vyslovit musíme . Je to naše povinnost .	Но мы не только должны осмелиться , мы обязаны это сделать .
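A minimal sketch of reading one line of a full data file, assuming exactly four tab-separated columns as described above. The names AlignmentPair and parse_full_line are hypothetical and not part of the corpus distribution, and it is an assumption that 0-1 and 1-0 pairs keep an empty column for the missing side.

```python
from typing import NamedTuple


class AlignmentPair(NamedTuple):
    """One line of a full data file (hypothetical helper type)."""
    src_count: int   # number of source sentences in the pair
    tgt_count: int   # number of target sentences in the pair
    quality: float   # alignment quality, higher is better
    source: str      # source language sentence(s)
    target: str      # target language sentence(s)


def parse_full_line(line: str) -> AlignmentPair:
    # Columns are tab-delimited; a line with a different number of
    # columns raises ValueError here instead of being silently accepted.
    kind, quality, source, target = line.rstrip("\n").split("\t")
    src_count, tgt_count = kind.split("-")
    return AlignmentPair(int(src_count), int(tgt_count),
                         float(quality), source, target)


sample = ("2-1\t0.765517\t"
          "My je ale vyslovit musíme . Je to naše povinnost .\t"
          "Но мы не только должны осмелиться , мы обязаны это сделать .")
pair = parse_full_line(sample)
print(pair.src_count, pair.tgt_count, pair.quality)
```

When reading a whole file, open it with encoding="utf-8", since the corpus is released in UTF-8.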

The 1-1 files have just two columns: source language sentence and target language sentence:

Mušarafův poslední výstup ?	Последний ход Мушаррафа ?
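The relation between the two formats can be sketched as a filter over the full data: keep the lines marked 1-1 and drop the first two columns. Whether the released 1-1 files were produced exactly this way is an assumption, and the function name to_one_to_one is hypothetical.

```python
def to_one_to_one(full_lines):
    """Yield two-column lines for the pairs marked 1-1 in full-format lines."""
    for line in full_lines:
        kind, _quality, source, target = line.rstrip("\n").split("\t")
        if kind == "1-1":
            yield f"{source}\t{target}"


full = [
    "1-1\t0.55641\tMušarafův poslední výstup ?\tПоследний ход Мушаррафа ?",
    "2-1\t0.765517\tMy je ale vyslovit musíme . Je to naše povinnost .\t"
    "Но мы не только должны осмелиться , мы обязаны это сделать .",
]
one_to_one = list(to_one_to_one(full))
print(one_to_one)
```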

Corpus Statistics

Monolingual Statistics

In the following table we summarize the monolingual statistics of the corpus:

            Czech       Russian     English
Words       1,747,997   1,815,550   1,920,164
Tokens      2,022,990   2,152,326   2,255,901
Sentences   96,335      101,528     97,250

Sentence Alignment

The texts were aligned at sentence level with the help of hunalign. The distribution of the alignment types is shown in the table below:

Type    1-1      2-1     0-1     1-2     1-0     Others
Count   259599   9074    8551    7686    834     2434
Share   90.1 %   3.1 %   3.0 %   2.7 %   0.3 %   0.8 %
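As a quick arithmetic check, the percentages can be recomputed from the counts. The sketch below reads the counts as 259599, 9074, 8551, 7686, 834 and 2434 for the six alignment types; the variable names are illustrative only.

```python
# Alignment type counts as read from the table above.
counts = {
    "1-1": 259599,
    "2-1": 9074,
    "0-1": 8551,
    "1-2": 7686,
    "1-0": 834,
    "Others": 2434,
}

total = sum(counts.values())

# Share of each alignment type in percent, rounded to one decimal place.
shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}

print(total)
print(shares)
```

The total of 288178 equals the sum of the alignment pairs over the three language pairs (97,373 + 97,656 + 93,149), so the distribution appears to be computed over all three pairwise alignments together.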

Bilingual Statistics

                                            Czech-Russian         English-Russian       Czech-English
Many-to-many alignments   Alignment pairs   97,373                97,656                93,149
                          Tokens            2,022,950 2,151,946   2,255,889 2,151,948   2,022,950 2,255,889
1-to-1 alignments         Alignment pairs   88,092                86,604                84,903
                          Tokens            1,867,392 1,927,283   2,062,849 1,885,683   1,788,227 2,000,684
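Figures of this kind can be recomputed from a full data file along the following lines. This is a sketch under two assumptions: that the lines have the four tab-separated columns described under Corpus File Formats, and that tokens are separated by whitespace, which may differ from how the published counts were obtained. The name pair_statistics is hypothetical.

```python
def pair_statistics(full_lines):
    """Return (alignment pairs, source tokens, target tokens)."""
    pairs = src_tokens = tgt_tokens = 0
    for line in full_lines:
        _kind, _quality, source, target = line.rstrip("\n").split("\t")
        pairs += 1
        # Assumption: the text is already tokenized, one space between tokens.
        src_tokens += len(source.split())
        tgt_tokens += len(target.split())
    return pairs, src_tokens, tgt_tokens


full = [
    "1-1\t0.55641\tMušarafův poslední výstup ?\tПоследний ход Мушаррафа ?",
    "2-1\t0.765517\tMy je ale vyslovit musíme . Je to naše povinnost .\t"
    "Но мы не только должны осмелиться , мы обязаны это сделать .",
]
stats = pair_statistics(full)
print(stats)
```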

Citing

If you want to cite us, you can use either of the following:

  • http://ufal.mff.cuni.cz/umc
  • Natalia Klyueva and Ondřej Bojar. 2008. UMC 0.1: Czech-Russian-English Multilingual Corpus. In Proceedings of the Conference "Corpora 2008".
    @article{umc:2008,
    publicationtype = {article},
    author = {Natalia Klyueva and Ondřej Bojar},
    }
    
Acknowledgment

The work on UMC 0.1 was supported by the following grant: FP6-IST-5-034291-STP (EuroMatrix).


    Institute of Formal and Applied Linguistics (ÚFAL)
    Ondřej Bojar, bojar <at> ufal.mff.cuni.cz
    Natalia Kljueva, kljueva <at> ufal.mff.cuni.cz
    $Id: index.html 72 2008-10-02 09:33:49Z bojar $