UMC 0.1 and UMC003: Czech-English-Russian

UMC 0.1 Czech-English-Russian is a multilingual parallel corpus of Czech, Russian and English texts with automatic pairwise sentence alignments. The primary aim of UMC is to extend the set of languages covered by the CzEng corpus, mainly for the purposes of machine translation.

All the texts were downloaded from a single source, Project Syndicate (Copyright: Project Syndicate 1995-2008), which contains a large collection of high-quality news articles and commentaries. We were given permission to use the texts for research and non-commercial purposes.

UMC003 is a cleaned, tokenized development and test set that accompanies the training data in UMC 0.1. For more information about the test set, see the README file in the UMC003 package.

Licensing

UMC 0.1 and UMC003 are available for research, educational and non-profit use free of charge. Contact us if you are interested in obtaining a different type of license.

Download

UMC 0.1 File Formats

UMC 0.1 is released as plain text files (Unicode in UTF-8, Unix line breaks). Each language pair is stored in separate files.

For each language pair, we provide the full data (e.g. Czech-Russian.txt) and a separate file containing only sentences that were aligned 1-1 (e.g. Czech-Russian.1-1.txt).

Each line in the files corresponds to one alignment pair. Columns are delimited by tabs.

The full data files have four columns: the alignment type (the number of source sentences and the number of target sentences in the pair, e.g. 2-1), the quality of the alignment pair (the higher, the better), the source language sentence(s), and the target language sentence(s). Here is an example of two alignment pairs. In the first, one Czech sentence corresponds to one Russian sentence; in the second, two Czech sentences correspond to one Russian sentence:

1-1	0.55641	Mušarafův poslední výstup ?	Последний ход Мушаррафа ?
2-1	0.765517	My je ale vyslovit musíme . Je to naše povinnost .	Но мы не только должны осмелиться , мы обязаны это сделать .
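As an illustration of the four-column layout described above, a line of a full data file could be parsed roughly as follows. This is a minimal sketch, not part of the UMC distribution: the names AlignmentPair and parse_full_line are hypothetical, and it assumes that every line has exactly four tab-separated fields (with an empty string on the unaligned side of 0-1 and 1-0 pairs).

```python
from typing import NamedTuple


class AlignmentPair(NamedTuple):
    """One line of a full data file (hypothetical helper type)."""
    src_count: int   # number of source sentences, e.g. 2 in "2-1"
    tgt_count: int   # number of target sentences, e.g. 1 in "2-1"
    quality: float   # alignment quality, higher is better
    source: str      # source language sentence(s)
    target: str      # target language sentence(s)


def parse_full_line(line: str) -> AlignmentPair:
    """Split one tab-delimited line into its four columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        # Assumption: anything other than four columns is malformed.
        raise ValueError(f"expected 4 columns, got {len(fields)}")
    kind, quality, source, target = fields
    src_count, tgt_count = kind.split("-")
    return AlignmentPair(int(src_count), int(tgt_count),
                         float(quality), source, target)


# The second example line from the text above.
example = ("2-1\t0.765517\t"
           "My je ale vyslovit musíme . Je to naše povinnost .\t"
           "Но мы не только должны осмелиться , мы обязаны это сделать .")
pair = parse_full_line(example)
print(pair.src_count, pair.tgt_count, pair.quality)
```

Keeping the alignment type as two integers makes it easy to select, for instance, only pairs with one sentence on each side or only pairs above a chosen quality threshold.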

The 1-1 files have just two columns: source language sentence and target language sentence:

Mušarafův poslední výstup ?	Последний ход Мушаррафа ?
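The two-column 1-1 files can be read in much the same way. The sketch below is only an illustration; read_one_to_one is a hypothetical helper name, and the token counts assume that tokens are separated by single spaces, as the example line suggests.

```python
from typing import Iterable, Iterator, Tuple


def read_one_to_one(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from lines of a 1-1 file."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines, if any
        # Raises ValueError if the line does not have exactly two columns.
        source, target = line.split("\t")
        yield source, target


# The example line from the text above.
sample = ["Mušarafův poslední výstup ?\tПоследний ход Мушаррафа ?\n"]
pairs = list(read_one_to_one(sample))

# Whitespace tokenization (an assumption based on the example line).
token_counts = [(len(s.split()), len(t.split())) for s, t in pairs]
print(pairs[0], token_counts[0])
```

To process a real file, the same function can be given an open file object, for example open("Czech-Russian.1-1.txt", encoding="utf-8").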

UMC 0.1 Statistics

Monolingual Statistics

In the following table we summarize the monolingual statistics of the corpus:

	Czech	Russian	English
Words	1,747,997	1,815,550	1,920,164
Tokens	2,022,990	2,152,326	2,255,901
Sentences	96,335	101,528	97,250

Sentence Alignment

The texts were aligned at the sentence level with the help of hunalign. The distribution of the alignment types is shown in the table below:

Alignment type	1-1	2-1	0-1	1-2	1-0	Others
Alignment pairs	259,599	9,074	8,551	7,686	834	2,434
Share	90.1 %	3.1 %	3.0 %	2.7 %	0.3 %	0.8 %
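The percentages in the table can be recomputed from the raw counts. The short sketch below does this and also checks that the total equals the sum of the alignment pairs reported for the three language pairs under Bilingual Statistics (97,373 + 97,656 + 93,149), which suggests that the counts here are summed over all three pairs. The variable names are illustrative only.

```python
# Alignment type counts copied from the table above.
counts = {
    "1-1": 259599,
    "2-1": 9074,
    "0-1": 8551,
    "1-2": 7686,
    "1-0": 834,
    "Others": 2434,
}

total = sum(counts.values())

# Share of each alignment type in percent, rounded to one decimal place.
shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}

# Alignment pairs per language pair, from the Bilingual Statistics table.
pairs_per_language_pair = [97373, 97656, 93149]

print(total, total == sum(pairs_per_language_pair))
for kind, share in shares.items():
    print(f"{kind}\t{share} %")
```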

Bilingual Statistics

		Czech-Russian		English-Russian		Czech-English
		Czech	Russian	English	Russian	Czech	English
Many-to-many alignments	Alignment pairs	97,373		97,656		93,149
Many-to-many alignments	Tokens	2,022,950	2,151,946	2,255,889	2,151,948	2,022,950	2,255,889
1-to-1 alignments	Alignment pairs	88,092		86,604		84,903
1-to-1 alignments	Tokens	1,867,392	1,927,283	2,062,849	1,885,683	1,788,227	2,000,684
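As a worked example of reading this table, the following sketch computes what fraction of the alignment pairs of each language pair is kept when only 1-to-1 alignments are used. The figures are copied from the table; the variable names are illustrative only.

```python
# Alignment pairs per language pair, copied from the table above.
all_pairs = {
    "Czech-Russian": 97373,
    "English-Russian": 97656,
    "Czech-English": 93149,
}
one_to_one_pairs = {
    "Czech-Russian": 88092,
    "English-Russian": 86604,
    "Czech-English": 84903,
}

# Percentage of alignment pairs that are 1-to-1, per language pair.
retained = {
    name: 100 * one_to_one_pairs[name] / all_pairs[name]
    for name in all_pairs
}

for name, pct in retained.items():
    print(f"{name}: {pct:.1f} % of alignment pairs are 1-to-1")
```

In each of the three language pairs, roughly nine out of ten alignment pairs are 1-to-1.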

Citing

If you want to cite UMC 0.1, please use the following reference and optionally the URL:

Natalia Klyueva and Ondřej Bojar. UMC 0.1: Czech-Russian-English Multilingual Corpus. Proceedings of International Conference Corpus Linguistics, pages 188-195, October 2008.

@inProceedings{umc:2008,
  publicationtype = {inProceedings},
  author = {Natalia Klyueva and Ond{\v{r}}ej Bojar},
  title = "{UMC 0.1: Czech-Russian-English Multilingual Corpus}",
  booktitle = {Proceedings of International Conference Corpus Linguistics},
  pages = {188--195},
  year = {2008},
}

http://ufal.mff.cuni.cz/umc

Acknowledgment

The work on UMC 0.1 was supported by the grant FP6-IST-5-034291-STP (EuroMatrix).

The work on UMC003 was supported by the grant FP7-ICT-2007-3-231720 (EuroMatrix Plus).