UrMonoCorp is a monolingual Urdu corpus freely available for non-commercial research purposes. Version 1.0, released in 2014, contains about 95.4 million tokens in 5.4 million sentences. The corpus is an unlabeled mix of the following major domains: News, Religion, Blogs, Literature, Science, Education and numerous others.

We release the corpus in two forms: as plain text and automatically tagged with part-of-speech tags. The automatic part-of-speech tagging extends the work of Jawaid and Bojar (2012), who use three different taggers and apply a voting scheme to disambiguate among the choices suggested by each tagger. Their complex voting ensemble is applied here to the large plain-text monolingual corpus to produce the tagged version.
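As a rough illustration of how a voting ensemble over three taggers can work, the sketch below uses simple majority voting with a fallback to one designated tagger when all three disagree. This is an assumption made for illustration only: the actual voting scheme of Jawaid and Bojar (2012) is more complex and is described in their paper. The function names `vote` and `combine`, the `fallback_index` parameter, and the tag labels in the usage example are hypothetical and are not taken from the corpus or its tagset.

```python
from collections import Counter


def vote(tags, fallback_index=0):
    """Pick one tag for a single token from the tags proposed by several taggers.

    Assumption: plain majority voting. If no tag is proposed by more than
    one tagger, the proposal of the tagger at `fallback_index` is used.
    """
    tag, count = Counter(tags).most_common(1)[0]
    if count > 1:
        return tag
    return tags[fallback_index]


def combine(tagger_outputs, fallback_index=0):
    """Combine token-aligned tag sequences, one sequence per tagger."""
    lengths = {len(sequence) for sequence in tagger_outputs}
    if len(lengths) != 1:
        raise ValueError("all taggers must tag the same number of tokens")
    # zip(*...) groups the proposals of all taggers token by token.
    return [vote(list(proposals), fallback_index)
            for proposals in zip(*tagger_outputs)]


if __name__ == "__main__":
    # Hypothetical outputs of three taggers for a two-token sentence.
    outputs = [
        ["NN", "VB"],
        ["NN", "ADJ"],
        ["PN", "ADJ"],
    ]
    print(combine(outputs))
```

Falling back to a fixed tagger keeps the result deterministic when all three taggers disagree; in practice one would likely choose the tagger with the best standalone accuracy for that role.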

UrMonoCorp can be downloaded from the LINDAT repository, and a full description of the data can be found in the LREC paper.

How to cite

If you use UrMonoCorp in your work, please cite the following paper:

  author = {Bushra Jawaid and Amir Kamran and Ondrej Bojar},
  title = {A Tagged Corpus and a Tagger for Urdu},
  booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year = {2014},
  month = {may},
  date = {26-31},
  address = {Reykjavik, Iceland},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-8-4},
  language = {english}