WAT2023 English-Hindi Multi-Modal Translation Task

After four successive editions of the English-Hindi Multimodal Translation Task (WAT2019, WAT2020, WAT2021, and WAT2022), the Workshop on Asian Translation 2023 (WAT2023) will continue the multimodal English-to-Hindi translation task, which was the first multimodal translation task for any Indian language. The task relies on our “Hindi Visual Genome,” a multimodal dataset of text and images suitable for English-Hindi machine translation tasks and multimodal research.


  • July 07: Translations need to be submitted to the organizers
  • July 14: System description paper submission deadline
  • July 28: Review feedback for system description
  • Aug 4: Camera-ready
  • Sep 4: WAT2023 takes place

Task Description

The setup of the WAT2023 task is as follows:

  • Inputs:
    • An image,
    • A rectangular region in that image
    • A short English caption of the rectangular region.
  • Output:
    • The caption translated to Hindi.
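The inputs listed above can be pictured as a small record type. The following is a minimal sketch in Python, not the official data loader: the class name `CaptionedRegion`, the helper `parse_segment`, and the tab-separated field order (image id, X, Y, width, height, English caption) are assumptions made for illustration and should be checked against the released data files.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CaptionedRegion:
    """One task input: an image id, a rectangle in that image, and its English caption."""
    image_id: str
    x: int        # left edge of the region, in pixels (assumed convention)
    y: int        # top edge of the region, in pixels (assumed convention)
    width: int
    height: int
    english: str

    def box(self) -> Tuple[int, int, int, int]:
        """Return the region as (left, top, right, bottom)."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)


def parse_segment(line: str) -> CaptionedRegion:
    """Parse one tab-separated line; the field order is an assumption."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 tab-separated fields, got {len(fields)}")
    image_id, x, y, width, height, english = fields[:6]
    return CaptionedRegion(image_id, int(x), int(y), int(width), int(height), english)


if __name__ == "__main__":
    # Hypothetical example line, not taken from the dataset.
    segment = parse_segment("12345\t10\t20\t100\t50\ta man riding a horse\n")
    print(segment.box(), segment.english)
```

A text-only system would read just the `english` field, while a multi-modal system could additionally crop the image to `box()` before extracting visual features.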

Types of Submissions Expected

The following types of submissions are expected:

  • Text-only translation
  • Hindi-only image captioning
  • Multi-modal translation (uses both the image and the text)

Training Data

The Hindi Visual Genome consists of:

  • 29k training examples
  • 1k dev set
  • 1.6k evaluation set


The WAT2023 Multi-Modal Task will be evaluated on:

  • 1.6k evaluation set of Hindi Visual Genome
  • 1.4k challenge set of Hindi Visual Genome

Means of evaluation:

  • Automatic metrics: BLEU, chrF3, and others
  • Manual evaluation, subject to the availability of Hindi speakers
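To give a feel for what a chrF3 score measures, here is a simplified, sentence-level sketch in Python. It is not the organizers' scoring script: the function names, the default maximum n-gram order of 6, and the choice to skip orders for which either string is too short are assumptions of this sketch, and official tools may differ in tokenization, smoothing, and corpus-level aggregation. The "3" corresponds to `beta = 3`, which weights recall more heavily than precision.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_order: int = 6, beta: float = 3.0) -> float:
    """Simplified chrF: F-beta over character n-gram precision and recall,
    averaged over n-gram orders 1..max_order."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical hypothesis/reference pair for illustration only.
    print(round(chrf("एक आदमी घोड़े पर", "एक आदमी घोड़े पर सवार है"), 3))
```

Note that Python strings are sequences of Unicode code points, so this sketch treats Devanagari combining vowel signs as separate characters.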

Participants of the task need to indicate which track their translations belong to:

  • Text-only / Image-only / Multi-modal
    • see above
  • Domain-Aware / Domain-Unaware
    • Whether or not the full (English) Visual Genome was used in training.
  • Constrained / Non-Constrained
    • Constrained submissions may use only the following data:
      • 29k training segments from the Hindi Visual Genome
      • HindEnCorp 0.5
      • (English-only) Visual Genome (if used, the submission counts as a domain-aware run)
    • Non-constrained submissions may use other data but need to specify what data was used.

Download Link

Submission Requirement

The system description should be a short report (4 to 6 pages) submitted to WAT 2023 describing the method(s).

Each participating team can submit at most two systems for each task (e.g., Text-only, Hindi-only image captioning, multimodal translation using text and image). Please submit through the submission link available on the WAT2022 website and select the task for submission.   

Paper and References

Please refer to the following papers:

[paper] : https://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/3294

[arxiv] : https://arxiv.org/abs/1907.08948

[WAT 2022 Proceedings] : https://www.aclweb.org/anthology/2022.wat-1.0/

[WAT 2021 Proceedings] : https://www.aclweb.org/anthology/2021.wat-1.0/

[WAT 2020 Proceedings] : https://www.aclweb.org/anthology/2020.wat-1.0/

[WAT 2019 Proceedings] : https://www.aclweb.org/anthology/D19-5200/


[Reference Papers]

Silo NLP’s Participation at WAT2022

Improved English to Hindi Multimodal Neural Machine Translation

IITP at WAT 2021: System description for English-Hindi Multimodal Translation Task

ViTA: Visual-Linguistic Translation by Aligning Object Tags

NLPHut’s Participation at WAT2021

ODIANLP’s Participation in WAT2020

Multimodal Neural Machine Translation for English to Hindi

Idiap NMT System for WAT 2019 Multimodal Translation Task

English to Hindi Multi-modal Neural Machine Translation and Hindi Image Captioning

WAT2019: English-Hindi Translation on Hindi Visual Genome Dataset



  • Shantipriya Parida (Silo AI, Finland)
  • Ondřej Bojar (Charles University, Czech Republic)


email: wat-multimodal-task@ufal.mff.cuni.cz


The data is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.


This shared task is supported by the following projects/grants from Charles University (Czech Republic).

  • Grantová agentura České republiky, Project code: 19-26934X, Project name: Neural Representations in Multi-modal and Multi-lingual Modelling