After six years of evolving multimodal translation tasks at the Workshop on Asian Translation (WAT), this year's edition introduces a new merged task: the “English-to-Indic Multimodal Translation Task.” The task is based on our “{Hindi, Bengali, Malayalam, Odia} Visual Genome” datasets, which offer both text and image data to support English-to-{Hindi, Bengali, Malayalam, Odia} machine translation and multimodal research.
Important dates:
TBA: Deadline for submitting translations to the organizers
September 29, 2025: System description paper submission deadline
November 03, 2025: Review feedback for system description
November 11, 2025: Camera-ready
December 23 or 24, 2025: Workshop takes place
The setup of the task is as follows:
Inputs:
An image,
A rectangular region in that image,
A short English caption of the rectangular region.
Expected Output:
The caption of the rectangular region in {Hindi, Bengali, Malayalam, Odia}.
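For concreteness, a minimal sketch of how one training example could be represented follows. It assumes the tab-separated layout (image id, region coordinates, English text, target text) used in earlier Visual Genome releases; the field and function names are illustrative, and the official data package is authoritative.

```python
# A minimal sketch of one dataset record, ASSUMING the tab-separated layout
# of earlier Visual Genome releases:
# image_id <TAB> X <TAB> Y <TAB> Width <TAB> Height <TAB> English <TAB> target
from dataclasses import dataclass

@dataclass
class VGExample:
    image_id: str  # identifies the shared underlying Visual Genome image
    x: int         # left edge of the rectangular region (pixels)
    y: int         # top edge of the rectangular region (pixels)
    width: int     # width of the region (pixels)
    height: int    # height of the region (pixels)
    src_text: str  # short English caption of the region
    tgt_text: str  # reference caption in Hindi/Bengali/Malayalam/Odia

def parse_line(line: str) -> VGExample:
    # Split one tab-separated record into its seven fields.
    image_id, x, y, w, h, src, tgt = line.rstrip("\n").split("\t")
    return VGExample(image_id, int(x), int(y), int(w), int(h), src, tgt)
```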
Participants are welcome to submit outputs for any subset of the task languages (Hindi, Bengali, Malayalam, Odia) and for any subset of the following task modalities:
Text-only translation (source image not used)
Image captioning (English source text not used)
Multi-modal translation (uses both the image and the text)
Participants must indicate to which track their translations belong:
Text-only / Image-only / Multi-modal
See the modalities listed above.
Domain-Aware / Domain-Unaware
Whether or not the full (English) Visual Genome was used in training.
Bilingual / Multilingual
Whether your model is multilingual, translating from English into all of the target languages, or whether you used individual pairwise models.
Constrained / Non-Constrained
The limitations for the constrained systems track are as follows:
Allowed pretrained LLMs:
Llama-2-7B, Llama-2-13B, Mistral-7B
You may adapt, fine-tune, or otherwise modify these models using the provided constrained data only.
Allowed multimodal LLMs:
CLIP, DALL-E, Gemini, LLaVA
You may adapt, fine-tune, or otherwise modify these models using the provided constrained data only; see the sketch below for one illustrative way to use CLIP. If you use any other multimodal LLMs, mention it explicitly.
Allowed pretrained LMs:
mBART, BERT, RoBERTa, XLM-RoBERTa, sBERT, LaBSE (in all publicly available model sizes)
You may ONLY use the training data allowed for this year (linked below).
You may use any publicly available automatic quality metric during your development.
Any basic linguistic tools (taggers, parsers, morphological analyzers, etc.) are allowed.
Non-constrained submissions may use other data or pretrained models, but must specify what was used.
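As an illustration of how an allowed multimodal model can consume the task inputs, the following sketch crops the annotated rectangle and encodes it with CLIP via the Hugging Face transformers library. The checkpoint name and the crop-then-encode strategy are illustrative assumptions, not task requirements.

```python
# A minimal sketch of one way to use CLIP (an allowed model) on the task
# inputs: crop the annotated rectangle, then encode it as an image embedding.
# The checkpoint and the crop-then-encode strategy are illustrative choices.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def region_embedding(image_path: str, x: int, y: int, w: int, h: int):
    image = Image.open(image_path).convert("RGB")
    region = image.crop((x, y, x + w, y + h))  # keep only the captioned region
    inputs = processor(images=region, return_tensors="pt")
    return model.get_image_features(**inputs)  # tensor of shape (1, 512)
```

The resulting embedding can then be fed to the translation model alongside the English source text, e.g. as a prefix vector or via cross-attention; how to fuse it is left to the participants.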
The {Hindi, Bengali, Malayalam, Odia} Visual Genome consists of:
29k training examples
a 1k development set
a 1.6k evaluation set
All the datasets use the same underlying set of images, with a handful of differences due to sanity checks carried out independently for each language.
The WAT2025 Multi-Modal Task will be evaluated on:
1.6k evaluation set of {Hindi, Bengali, Malayalam, Odia} Visual Genome
1.4k challenge set of {Hindi, Bengali, Malayalam, Odia} Visual Genome
Means of evaluation:
Automatic metrics: BLEU, CHRF3, and others (see the scoring sketch below)
Manual evaluation, subject to the availability of {Hindi, Bengali, Malayalam, Odia} speakers
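During development, participants can compute comparable scores themselves with sacrebleu. A minimal sketch follows, assuming plain-text output and reference files (the file names are illustrative); the organizers' exact metric settings may differ.

```python
# A minimal sketch of scoring outputs with sacrebleu before submission.
# File names are illustrative; the official evaluation settings
# (e.g. tokenization for Indic scripts) may differ.
from sacrebleu.metrics import BLEU, CHRF

with open("system_output.txt", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("reference.txt", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

bleu = BLEU()         # corpus-level BLEU with sacrebleu defaults
chrf3 = CHRF(beta=3)  # beta=3 yields CHRF3 (recall weighted 3x)
print(bleu.corpus_score(hyps, [refs]))
print(chrf3.corpus_score(hyps, [refs]))
```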
Please register your team by sending an application following the WAT application procedure here.
References:
Odia Visual Genome (TBA)
Findings of WMT2024 English-to-Low Resource Multimodal Translation
Hindi Visual Genome: A Dataset for Multi-Modal English to Hindi Machine Translation
Bengali Visual Genome: A Multimodal Dataset for Machine Translation and Image Captioning
Hausa Visual Genome: A Dataset for Multi-Modal English-to-Hausa Machine Translation
IITP at WAT 2021: System description for English-Hindi Multimodal Translation Task
Organizers:
Shantipriya Parida (AMD Silo AI, Finland)
Ondřej Bojar (Charles University, Czech Republic)