UMC 0.1 Czech-English-Russian is a multilingual parallel corpus of Czech, Russian and English texts with automatic pairwise sentence alignments. The primary aim of UMC is to extend the set of languages covered by the CzEng corpus, mainly for the purposes of machine translation.
All the texts were downloaded from a single source, Project Syndicate (Copyright: Project Syndicate 1995-2008), which contains a large collection of high-quality news articles and commentaries. We were given permission to use the texts for research and non-commercial purposes.
UMC003 is a cleaned, tokenized development and test set that accompanies the training data in UMC 0.1. For more information about the test set, see the README file in the UMC003 package.
UMC 0.1 and UMC003 are available for research, educational and non-profit use free of charge. Contact us if you are interested in obtaining a different type of license.
UMC 0.1 is released as plain text files (Unicode in UTF-8, Unix line breaks). Each language pair is stored in separate files.
For each language pair, we provide the full data (e.g. Czech-Russian.txt) and a separate file containing only sentences that were aligned 1-1 (e.g. Czech-Russian.1-1.txt).
Each line in the files corresponds to one alignment pair. We use tab to delimit columns.
The full data files have the following columns: number of sentences on each side of the alignment pair (e.g. 2-1), quality of the alignment pair (the higher, the better), source language sentence, target language sentence. Here is an example of two alignment pairs. In the first one, a single Czech sentence corresponds to a single Russian sentence; in the second one, two Czech sentences correspond to one Russian sentence:
1-1	0.55641	Mušarafův poslední výstup ?	Последний ход Мушаррафа ?

2-1	0.765517	My je ale vyslovit musíme . Je to naše povinnost .	Но мы не только должны осмелиться , мы обязаны это сделать .
The 1-1 files have just two columns: source language sentence and target language sentence:
Mušarafův poslední výstup ?	Последний ход Мушаррафа ?
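The column layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official tool shipped with the corpus: the names `AlignmentPair`, `read_full` and `read_one_to_one` are hypothetical, and the quality threshold of 0.6 is an arbitrary value chosen only for illustration. It assumes every line has exactly the number of tab-separated columns listed above.

```python
import io
from typing import Iterable, Iterator, NamedTuple, Tuple


class AlignmentPair(NamedTuple):
    alignment_type: str  # e.g. "1-1" or "2-1"
    quality: float       # the higher, the better
    source: str
    target: str


def read_full(lines: Iterable[str]) -> Iterator[AlignmentPair]:
    """Parse lines of a full data file (four tab-separated columns)."""
    for line in lines:
        # Strip only the newline so that an empty last column is preserved.
        line = line.rstrip("\n")
        if not line:
            continue
        alignment_type, quality, source, target = line.split("\t")
        yield AlignmentPair(alignment_type, float(quality), source, target)


def read_one_to_one(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Parse lines of a 1-1 file (two tab-separated columns)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        source, target = line.split("\t")
        yield source, target


# The two example alignment pairs from the text, with tabs written as \t.
sample = (
    "1-1\t0.55641\tMušarafův poslední výstup ?\tПоследний ход Мушаррафа ?\n"
    "2-1\t0.765517\tMy je ale vyslovit musíme . Je to naše povinnost .\t"
    "Но мы не только должны осмелиться , мы обязаны это сделать .\n"
)
pairs = list(read_full(io.StringIO(sample)))

# Keep only pairs above an arbitrary, illustrative quality threshold.
good = [p for p in pairs if p.quality >= 0.6]
for p in good:
    print(p.alignment_type, p.quality, p.source, "|||", p.target)
```

To read a real file, pass an open file object instead of the in-memory sample, for example `open("Czech-Russian.txt", encoding="utf-8")`.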
In the following table we summarize the monolingual statistics of the corpus:
| | Czech | Russian | English |
|---|---|---|---|
| Words | 1,747,997 | 1,815,550 | 1,920,164 |
| Tokens | 2,022,990 | 2,152,326 | 2,255,901 |
| Sentences | 96,335 | 101,528 | 97,250 |
The texts were aligned at the sentence level with the help of hunalign. You can see the distribution of the alignment types in the table below:
1-1 | 2-1 | 0-1 | 1-2 | 1-0 | Others |
---|---|---|---|---|---|
259,599 | 9,074 | 8,551 | 7,686 | 834 | 2,434 |
90.1 % | 3.1 % | 3.0 % | 2.7 % | 0.3 % | 0.8 % |
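As a quick sanity check, the percentages in the table can be recomputed from the raw counts. The sketch below uses only the figures given on this page; the variable names are illustrative. The counts sum to 288,178, which equals the sum of the alignment pairs reported for the three language pairs (97,373 + 97,656 + 93,149), so the distribution appears to cover all three pairs together.

```python
# Alignment type counts as listed in the table above.
counts = {
    "1-1": 259599,
    "2-1": 9074,
    "0-1": 8551,
    "1-2": 7686,
    "1-0": 834,
    "Others": 2434,
}

total = sum(counts.values())

# Share of each alignment type in percent, rounded to one decimal place.
shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}

for kind, share in shares.items():
    print(f"{kind}: {share} %")
print("total alignment pairs:", total)
```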
| | | Czech-Russian | | English-Russian | | Czech-English | |
|---|---|---|---|---|---|---|---|
| Many-to-many alignments | Alignment pairs | 97,373 | | 97,656 | | 93,149 | |
| | Tokens | 2,022,950 | 2,151,946 | 2,255,889 | 2,151,948 | 2,022,950 | 2,255,889 |
| 1-to-1 alignments | Alignment pairs | 88,092 | | 86,604 | | 84,903 | |
| | Tokens | 1,867,392 | 1,927,283 | 2,062,849 | 1,885,683 | 1,788,227 | 2,000,684 |
If you want to cite UMC 0.1, please use the following reference and optionally the URL:
@inProceedings{umc:2008,
  publicationtype = {inProceedings},
  author = {Natalia Klyueva and Ond{\v{r}}ej Bojar},
  title = "{UMC 0.1: Czech-Russian-English Multilingual Corpus}",
  booktitle = {Proceedings of International Conference Corpus Linguistics},
  pages = {188--195},
  year = {2008},
}
The work on UMC 0.1 was supported by the grant FP6-IST-5-034291-STP (EuroMatrix).
The work on UMC003 was supported by the grant FP7-ICT-2007-3-231720 (EuroMatrix Plus).