OdiEnCorp

This project was initiated to enrich Odia language NLP resources, particularly for machine translation. OdiEnCorp is a collection of Odia-English parallel and Odia monolingual sentences gathered from sources such as Odia Wikipedia, websites, books, and dictionaries, using manual and machine learning techniques including web scraping and optical character recognition. We describe the need for such a corpus, its development process, and its benefits [here].

Two releases of the English-Odia corpus were created: OdiEnCorp 1.0 and OdiEnCorp 2.0.

The latter (OdiEnCorp 2.0) serves in the WAT 2020 English-Odia Indic Task. To use additional resources, please refer to the Odia NLP Resource Catalog for English-Odia parallel and Odia monolingual data, and mention them in your system description paper. Ask the organizers before using any corpora other than those listed in the Odia NLP Resource Catalog.

Please refer to the WAT 2020 webpage for registration, timeline, and submission details.

Paper and References:

[OdiEnCorp 1.0] : OdiEnCorp: Odia–English and Odia-Only Corpus for Machine Translation

[OdiEnCorp 2.0] : OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation

Organizers

  • Ondřej Bojar (Charles University, Czech Republic)
  • Shantipriya Parida (Idiap Research Institute, Switzerland)

Contact

email: wat-multimodal-task@ufal.mff.cuni.cz

License

The data is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.

Acknowledgement

This shared task is supported by the following projects and grants from Idiap Research Institute (Switzerland) and Charles University (Czech Republic).

  • European Union, Project code: EC/H2020/833635, Project name: ROXANNE - Real time network, text, and speaker analytics for combating organized crime
  • InnoSuisse, Project code: 29814.1 IP-ICT, Project name: SM2: Extracting Semantic Meaning from Spoken Material
  • Grantová agentura České republiky, Project code: 18-24210S, Project name: Multilingual Machine Translation

How to cite

If you use OdiEnCorp 1.0 or OdiEnCorp 2.0, please cite the respective paper:

@incollection{parida2020odiencorp,
  title={OdiEnCorp: Odia--English and Odia-Only Corpus for Machine Translation},
  author={Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},
  booktitle={Smart Intelligent Computing and Applications},
  pages={495--504},
  year={2020},
  publisher={Springer}
}

@inproceedings{parida2020odiencorp2,
  title={OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation},
  author={Parida, Shantipriya and Dash, Satya Ranjan and Bojar, Ond{\v{r}}ej and Motlicek, Petr and Pattnaik, Priyanka and Mallick, Debasish Kumar},
  booktitle={Proceedings of the WILDRE5--5th Workshop on Indian Language Data: Resources and Evaluation},
  pages={14--19},
  year={2020}
}