UMC005: English-Urdu Parallel Corpus

by Bushra Jawaid and Daniel Zeman
2010

Introduction

UMC005 English-Urdu is a sentence-aligned parallel corpus of English and Urdu texts. The corpus can be used for experiments with statistical machine translation.

The texts come from four different sources:

  • Quran
  • Bible
  • Penn Treebank
  • Emille corpus

We provide the religious texts (the Quran and the Bible) for direct download. For licensing reasons, the Penn and Emille texts cannot be redistributed freely. However, if you already hold a license for the original corpora, we can provide scripts that will recreate our data on your disk. Our modifications include but are not limited to the following:

Licensing

UMC005 is available free of charge for research, educational and non-profit use. Contact us if you are interested in obtaining a different type of license. Note: These terms apply to our modifications of and additions to the data. You need to obtain a separate license for the English texts of the Penn Treebank and for the Emille corpus. The Urdu translation of the Penn Treebank texts has been provided by CRULP and is distributed under the GNU GPL license. The Quran and the Bible are religious texts whose copyright expired long ago; our version has been collected from the web.

Download

UMC005 File Formats

UMC005 is released as plain text files (Unicode in UTF-8, Unix line breaks).

One file corresponds to one part (training/development/test) of one source (Quran/Bible) in one language (English/Urdu). The English and Urdu versions of the same source and part have the same number of lines. Each line corresponds to one segment of the text, usually a sentence, and lines with the same number are translations of each other.
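Because the two files of a pair are aligned line by line, reading them amounts to iterating over both in lockstep. The following Python sketch illustrates the idea under the format described above (UTF-8, Unix line breaks). The function name `read_parallel` is hypothetical and not part of the release; the actual file names are whatever the downloaded archive contains.

```python
from itertools import zip_longest


def read_parallel(en_path, ur_path):
    """Yield (english, urdu) segment pairs from two line-aligned UTF-8 files.

    Hypothetical helper: it assumes one segment per line and raises an
    error if one file runs out of lines before the other.
    """
    with open(en_path, encoding="utf-8", newline="\n") as en_file, \
         open(ur_path, encoding="utf-8", newline="\n") as ur_file:
        pairs = zip_longest(en_file, ur_file)
        for line_no, (en, ur) in enumerate(pairs, start=1):
            if en is None or ur is None:
                raise ValueError(f"files differ in length at line {line_no}")
            # Strip only the line terminator; keep the segment text as is.
            yield en.rstrip("\n"), ur.rstrip("\n")
```

Checking the line counts while reading is a cheap way to detect a truncated download or a mismatched pair of files before the data reaches a translation system.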

UMC005 Statistics

Corpus  Sentence pairs  EN tokens  UR tokens  EN vocabulary  UR vocabulary  UR normalized vocabulary
Quran             6414     252603     269991           8135           8027                      7183
Bible             7957     210597     203927           5969           8995                      6980
Penn              6215     161294     185690          13826          12883                     12457
Emille            8736     153519     200179           9087          10042                      9626
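
Counts of this kind can be approximated with a few lines of Python. The sketch below assumes whitespace tokenization and a case-sensitive vocabulary, which may differ from the processing behind the published figures, so the results need not match exactly. The normalized Urdu vocabulary is not reproduced because the normalization procedure is not described here. The function name `corpus_stats` is hypothetical.

```python
def corpus_stats(pairs):
    """Compute basic statistics over an iterable of (english, urdu) pairs.

    Assumption: tokens are separated by whitespace; the vocabulary is the
    set of distinct token strings (case-sensitive).
    """
    sentence_pairs = 0
    en_tokens = ur_tokens = 0
    en_vocab, ur_vocab = set(), set()
    for en, ur in pairs:
        sentence_pairs += 1
        en_words, ur_words = en.split(), ur.split()
        en_tokens += len(en_words)
        ur_tokens += len(ur_words)
        en_vocab.update(en_words)
        ur_vocab.update(ur_words)
    return {
        "sentence_pairs": sentence_pairs,
        "en_tokens": en_tokens,
        "ur_tokens": ur_tokens,
        "en_vocabulary": len(en_vocab),
        "ur_vocabulary": len(ur_vocab),
    }
```

The function accepts any iterable of string pairs, so it can be fed from a list in memory or from a generator that reads the aligned files line by line.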

Citing

If you want to cite UMC005, please use the following reference and the URL:

  • Bushra Jawaid, Daniel Zeman: Word-Order Issues in English-to-Urdu Statistical Machine Translation. Submitted for publication in: The Prague Bulletin of Mathematical Linguistics, No. 95, Copyright © Univerzita Karlova, Praha, Czechia, ISSN 0032-6585, May 2011
    @unpublished{JaZeWordOrderIssues2011,
    author      = {Bushra Jawaid and Daniel Zeman},
    title       = {Word-Order Issues in {English}-to-{Urdu} Statistical Machine Translation},
    year        = {2011},
    journal     = {The Prague Bulletin of Mathematical Linguistics},
    number      = {95},
    institution = {Univerzita Karlova},
    address     = {Praha, Czechia},
    issn        = {0032-6585},
    }
    
  • http://ufal.mff.cuni.cz/umc

Acknowledgement

The work on UMC005 was supported by the grants No. MSM0021620838 of the Czech Ministry of Education, GAP406/11/1499 of the Czech Science Foundation and SVV Project 261314/2010 (University-funded).

Charles University in Prague, Institute of Formal and Applied Linguistics (ÚFAL)
Daniel Zeman, zeman <at> ufal.mff.cuni.cz
Bushra Jawaid, jawaid <at> ufal.mff.cuni.cz