After six years of evolving multimodal translation tasks at the Workshop on Asian Translation (WAT), this year's edition introduces a new merged task: the “English-to-Indic Multimodal Translation Task.” The task is based on our “{Hindi, Bengali, Malayalam, Odia} Visual Genome” datasets, which offer both text and image data to support English-to-{Hindi, Bengali, Malayalam, Odia} machine translation and multimodal research.
Important dates:
TBA: Deadline for submitting translations to the organizers
September 29, 2025: System description paper submission deadline
November 03, 2025: Review feedback for system description
November 11, 2025: Camera-ready
December 23 or 24, 2025: Workshop takes place
The setup of the task is as follows:
Inputs:
An image,
A rectangular region in that image,
A short English caption of the rectangular region.
Expected Output:
The caption of the rectangular region in {Hindi, Bengali, Malayalam, Odia}.
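For concreteness, a minimal sketch of how one training example could be represented follows. It assumes the tab-separated layout (image id, region coordinates, English text, target text) used in earlier Visual Genome releases; the field and function names are illustrative, and the official data package is authoritative.

```python
# A minimal sketch of one dataset record, ASSUMING the tab-separated layout
# of earlier Visual Genome releases:
# image_id <TAB> X <TAB> Y <TAB> Width <TAB> Height <TAB> English <TAB> target
from dataclasses import dataclass

@dataclass
class VGExample:
    image_id: str  # identifies the shared underlying Visual Genome image
    x: int         # left edge of the rectangular region (pixels)
    y: int         # top edge of the rectangular region (pixels)
    width: int     # width of the region (pixels)
    height: int    # height of the region (pixels)
    src_text: str  # short English caption of the region
    tgt_text: str  # reference caption in Hindi/Bengali/Malayalam/Odia

def parse_line(line: str) -> VGExample:
    # Split one tab-separated record into its seven fields.
    image_id, x, y, w, h, src, tgt = line.rstrip("\n").split("\t")
    return VGExample(image_id, int(x), int(y), int(w), int(h), src, tgt)
```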
Participants are welcome to submit outputs for any subset of the task languages (Hindi, Bengali, Malayalam, Odia) and for any subset of the following task modalities:
Text-only translation (source image not used)
Image captioning (English source text not used)
Multi-modal translation (uses both the image and the text)
Participants must indicate to which track their translations belong:
Text-only / Image-only / Multi-modal
See the modalities listed above.
Domain-Aware / Domain-Unaware
Whether or not the full (English) Visual Genome was used in training.
Bilingual / Multilingual
Whether your model is multilingual, translating from English into all of the target languages, or whether you used individual pairwise models.
Constrained / Non-Constrained
The limitations for the constrained systems track are as follows:
Allowed pretrained LLMs:
Llama-2-7B, Llama-2-13B, Mistral-7B
You may adapt, fine-tune, or otherwise modify these models using the provided constrained data only.
Allowed multimodal LLMs:
CLIP, DALL-E, Gemini, LLaVA
You may adapt, fine-tune, or otherwise modify these models using the provided constrained data only; see the sketch below for one illustrative way to use CLIP. If you use any other multimodal LLMs, mention it explicitly.
Allowed pretrained LMs:
mBART, BERT, RoBERTa, XLM-RoBERTa, sBERT, LaBSE (in all publicly available model sizes)
You may ONLY use the training data allowed for this year (linked below).
You may use any publicly available automatic quality metric during your development.
Any basic linguistic tools (taggers, parsers, morphological analyzers, etc.) are allowed.
Non-constrained submissions may use other data or pretrained models, but must specify what was used.
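As an illustration of how an allowed multimodal model can consume the task inputs, the following sketch crops the annotated rectangle and encodes it with CLIP via the Hugging Face transformers library. The checkpoint name and the crop-then-encode strategy are illustrative assumptions, not task requirements.

```python
# A minimal sketch of one way to use CLIP (an allowed model) on the task
# inputs: crop the annotated rectangle, then encode it as an image embedding.
# The checkpoint and the crop-then-encode strategy are illustrative choices.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def region_embedding(image_path: str, x: int, y: int, w: int, h: int):
    image = Image.open(image_path).convert("RGB")
    region = image.crop((x, y, x + w, y + h))  # keep only the captioned region
    inputs = processor(images=region, return_tensors="pt")
    return model.get_image_features(**inputs)  # tensor of shape (1, 512)
```

The resulting embedding can then be fed to the translation model alongside the English source text, e.g. as a prefix vector or via cross-attention; how to fuse it is left to the participants.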
The {Hindi, Bengali, Malayalam, Odia} Visual Genome consists of:
29k training examples
a 1k development set
a 1.6k evaluation set
All the datasets use the same underlying set of images, with a handful of differences due to sanity checks carried out independently for each language.
The WAT2025 Multi-Modal Task will be evaluated on:
1.6k evaluation set of {Hindi, Bengali, Malayalam, Odia} Visual Genome
1.4k challenge set of {Hindi, Bengali, Malayalam, Odia} Visual Genome
Means of evaluation:
Automatic metrics: BLEU, CHRF3, and others (see the scoring sketch below)
Manual evaluation, subject to the availability of {Hindi, Bengali, Malayalam, Odia} speakers
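During development, participants can compute comparable scores themselves with sacrebleu. A minimal sketch follows, assuming plain-text output and reference files (the file names are illustrative); the organizers' exact metric settings may differ.

```python
# A minimal sketch of scoring outputs with sacrebleu before submission.
# File names are illustrative; the official evaluation settings
# (e.g. tokenization for Indic scripts) may differ.
from sacrebleu.metrics import BLEU, CHRF

with open("system_output.txt", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("reference.txt", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

bleu = BLEU()         # corpus-level BLEU with sacrebleu defaults
chrf3 = CHRF(beta=3)  # beta=3 yields CHRF3 (recall weighted 3x)
print(bleu.corpus_score(hyps, [refs]))
print(chrf3.corpus_score(hyps, [refs]))
```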
Please register your team by sending an application following the WAT application procedure here.
References:
Odia Visual Genome (TBA)
Findings of WMT2024 English-to-Low Resource Multimodal Translation
Hindi Visual Genome: A Dataset for Multi-Modal English to Hindi Machine Translation
Bengali Visual Genome: A Multimodal Dataset for Machine Translation and Image Captioning
Hausa Visual Genome: A Dataset for Multi-Modal English-to-Hausa Machine Translation
IITP at WAT 2021: System description for English-Hindi Multimodal Translation Task
Organizers:
Shantipriya Parida (AMD Silo AI, Finland)
Ondřej Bojar (Charles University, Czech Republic)