WAT2024 English-to-Lowres Multi-Modal Translation Task

After five years of various versions of the Multimodal Translation Task at the Workshop on Asian Translation, WAT 2024 continues with a merged task: “WAT 2024 English-to-Lowres Multimodal Translation Task”. The task relies on our “{Hindi, Bengali, Malayalam, Hausa} Visual Genome” datasets, all of which provide text and images suitable for English-{Hindi, Bengali, Malayalam, Hausa} machine translation tasks and multimodal research.

Hindi, Bengali, and Malayalam are medium-to-low-resource Indic languages, while Hausa is a low-resource African language.


Important Dates

  • XXX: Translations need to be submitted to the organizers
  • XXX: System description paper submission deadline
  • XXX: Review feedback for system description
  • XXX: Camera-ready
  • XXX: WAT2024 takes place

Task Description

The setup of the WAT2024 task is as follows:

  • Inputs:
    • An image,
    • A rectangular region in that image,
    • A short English caption of the rectangular region.
  • Expected Output:
    • The caption of the rectangular region in {Hindi, Bengali, Malayalam, Hausa}.
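Concretely, one task instance can be thought of as the following record (a minimal sketch; the field names are illustrative, not the official data format):

```python
from dataclasses import dataclass


@dataclass
class TaskExample:
    """One instance of the multimodal translation task (illustrative field names)."""
    image_path: str       # the full source image
    x: int                # top-left corner of the rectangular region
    y: int
    width: int            # size of the rectangular region
    height: int
    english_caption: str  # short English caption of the region (input)
    target_caption: str   # caption in Hindi/Bengali/Malayalam/Hausa (expected output)


# A hypothetical example; the image id and box values are made up.
example = TaskExample(
    image_path="images/2405722.jpg",
    x=46, y=28, width=339, height=143,
    english_caption="a red double decker bus",
    target_caption="",  # to be produced by the participating system
)
```

A text-only system would look at `english_caption` alone; an image-captioning system at the image region alone; a multimodal system at both.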

Types of Submissions Expected

Participants are welcome to submit outputs for any subsets of the task languages (Hindi, Bengali, Malayalam, Hausa) for any subset of the following task modalities:

  • Text-only translation (Source image not used)
  • Image captioning (English source text not used)
  • Multi-modal translation (uses both the image and the text)

Participants must indicate to which track their translations belong:

  • Text-only / Image-only / Multi-modal
    • see above
  • Domain-Aware / Domain-Unaware
    • Whether or not the full (English) Visual Genome was used in training.
  • Bilingual / Multilingual
    • Whether your model is multilingual, translating from English into all of the desired languages, or whether you used individual pairwise models.
  • Constrained / Non-Constrained
    • The limitations for the constrained systems track are as follows:
      • Allowed pretrained LLMs:
      • Allowed multimodal LLMs:
        • TO BE DETERMINED
      • Allowed pretrained LMs:
        • mBART, BERT, RoBERTa, XLM-RoBERTa, sBERT, LaBSE (in all publicly available model sizes)
      • You may ONLY use the training data allowed for this year (linked below).
      • You may use any publicly available automatic quality metric during development.
      • You may use any basic linguistic tools (taggers, parsers, morphological analyzers, etc.).
    • Non-constrained submissions may use other data or pretrained models but need to specify what was used.

Training Data

The {Hindi, Bengali, Malayalam, Hausa} Visual Genome consists of:

  • 29k training examples
  • 1k development examples
  • 1.6k evaluation examples

All the datasets use the same underlying set of images, with a handful of differences due to sanity checks carried out independently in each language.
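The text portion of the Visual Genome datasets is distributed as tab-separated files; assuming a line layout of image id, box coordinates, English text, and target text (an assumption — please verify against the dataset README), a minimal loader could look like:

```python
import csv


def load_visual_genome(path):
    """Parse a {Hindi,Bengali,Malayalam,Hausa} Visual Genome text file.

    Assumed line layout (verify against the dataset README):
    image_id <TAB> x <TAB> y <TAB> width <TAB> height <TAB> English text <TAB> target text
    """
    examples = []
    with open(path, encoding="utf-8") as f:
        # QUOTE_NONE: captions may contain quote characters that are not CSV quoting.
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            image_id, x, y, w, h, src, tgt = row
            examples.append({
                "image_id": image_id,
                "box": (int(x), int(y), int(w), int(h)),  # rectangular region
                "en": src,                                # English caption (input)
                "target": tgt,                            # target-language caption
            })
    return examples
```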


Evaluation

The WAT2024 Multi-Modal Task will be evaluated on:

  • 1.6k evaluation set of {Hindi, Bengali, Malayalam, Hausa} Visual Genome
  • 1.4k challenge set of {Hindi, Bengali, Malayalam, Hausa} Visual Genome

Means of evaluation:

  • Automatic metrics: BLEU, CHRF3, and others
  • Manual evaluation, subject to the availability of {Hindi, Bengali, Malayalam, Hausa} speakers
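Official scores should be computed with standard tooling (e.g. sacrebleu), but as an illustration of what CHRF3 measures, here is a simplified sentence-level sketch: character n-gram precision and recall averaged over orders 1–6, combined into an F-beta score with β = 3 (recall-weighted). It omits details of the reference implementation such as whitespace handling:

```python
from collections import Counter


def chrf3(hypothesis, reference, max_order=6, beta=3.0):
    """Simplified sentence-level chrF3 on a 0-100 scale (illustrative only)."""
    # Strip spaces so the score is over character n-grams, not words.
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        matches = sum((hyp_ngrams & ref_ngrams).values())  # clipped overlap
        if hyp_ngrams:
            precisions.append(matches / sum(hyp_ngrams.values()))
        if ref_ngrams:
            recalls.append(matches / sum(ref_ngrams.values()))
    p = sum(precisions) / len(precisions) if precisions else 0.0
    r = sum(recalls) / len(recalls) if recalls else 0.0
    if p + r == 0:
        return 0.0
    # F-beta with beta=3 weights recall 9x more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A perfect match scores 100, a hypothesis sharing no characters with the reference scores 0.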

Download Links

Submission Requirement

The system description should be a short report (4 to 6 pages) submitted to WAT 2024 describing the method(s).

Each participating team can submit at most two systems for each modality (i.e., text-only translation, image captioning, multimodal translation) and each target language.

Please submit through the submission link available on the WAT2024 website and select the task for submission.   

Paper and References

Please refer to the papers below:

[paper] : https://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/3294

[arxiv] : https://arxiv.org/abs/1907.08948

[WAT 2023 Proceedings] : https://www.aclweb.org/anthology/2023.wat-1.0/

[WAT 2022 Proceedings] : https://www.aclweb.org/anthology/2022.wat-1.0/

[WAT 2021 Proceedings] : https://www.aclweb.org/anthology/2021.wat-1.0/

[WAT 2020 Proceedings] : https://www.aclweb.org/anthology/2020.wat-1.0/

[WAT 2019 Proceedings] : https://aclanthology.org/events/emnlp-2019/#d19-52


[Reference Papers]

Silo NLP’s Participation at WAT2022

Improved English to Hindi Multimodal Neural Machine Translation

IITP at WAT 2021: System description for English-Hindi Multimodal Translation Task

ViTA: Visual-Linguistic Translation by Aligning Object Tags

NLPHut’s Participation at WAT2021

ODIANLP’s Participation in WAT2020

Multimodal Neural Machine Translation for English to Hindi

Bengali Visual Genome: A Multimodal Dataset for Machine Translation and Image Captioning

Hausa Visual Genome: A Dataset for Multi-Modal English-to-Hausa Machine Translation



Organizers

  • Shantipriya Parida (Silo AI, Finland)
  • Ondřej Bojar (Charles University, Czech Republic)
  • Idris Abdulmumin (University of Pretoria, South Africa)
  • Shamsuddeen Hassan Muhammad (Bayero University Kano, Nigeria)


email: wat-multimodal-task@ufal.mff.cuni.cz


The data is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.


The datasets in this shared task were supported by the grant 19-26934X (Neural Representations in Multi-modal and Multi-lingual Modelling) of the Grant Agency of the Czech Republic.