Multimodality in natural language processing

Guidelines

Integrating multiple modalities into natural language processing (NLP) is referred to as multimodal NLP. Research in this direction primarily aims at processing textual content together with visual information (e.g., images and, in some cases, video) to support various tasks (e.g., machine translation). Its motivation stems mainly from two linguistic challenges: lexical ambiguity and out-of-vocabulary words. Current studies show that visual information is indeed useful for translation, yielding modest but encouraging improvements in translation quality (Elliott et al. 2017, Calixto et al. 2017, Caglayan et al. 2018). Recent work also shows that visual information helps in interpreting language whose meaning is left implicit (Collell et al. 2018). The aim of this work is to investigate the effect of multimodal data processing on various NLP tasks.
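
To make the idea of combining textual and visual information concrete, the following is a minimal, hypothetical sketch in PyTorch of one possible fusion scheme: a pooled image feature attends over text token states, and the attended text summary is combined with the image feature into a single representation. This is not the architecture of any of the cited systems (which typically integrate visual attention inside an encoder-decoder translation model); all module names, dimensions, and design choices below are illustrative assumptions.

```python
# Illustrative sketch only: one hypothetical way to fuse a pooled image
# feature with text token states. Names and dimensions are assumptions,
# not the method of any system cited above.

import torch
import torch.nn as nn


class SimpleMultimodalFusion(nn.Module):
    """Attend over text token states conditioned on a global image feature,
    then combine the attended text summary with the image feature."""

    def __init__(self, text_dim: int = 256, image_dim: int = 2048, hidden_dim: int = 256):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, hidden_dim)   # project CNN image feature
        self.text_proj = nn.Linear(text_dim, hidden_dim)     # project token states
        self.score = nn.Linear(hidden_dim, 1)                # additive attention scores
        self.out = nn.Linear(2 * hidden_dim, hidden_dim)     # fused representation

    def forward(self, text_states: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, seq_len, text_dim); image_feat: (batch, image_dim)
        img = self.image_proj(image_feat)                         # (batch, hidden)
        txt = self.text_proj(text_states)                         # (batch, seq, hidden)
        scores = self.score(torch.tanh(txt + img.unsqueeze(1)))   # (batch, seq, 1)
        weights = torch.softmax(scores, dim=1)                    # attention over tokens
        text_summary = (weights * txt).sum(dim=1)                 # (batch, hidden)
        return torch.tanh(self.out(torch.cat([text_summary, img], dim=-1)))


if __name__ == "__main__":
    fusion = SimpleMultimodalFusion()
    text_states = torch.randn(4, 10, 256)   # e.g., encoder states for 10 tokens
    image_feat = torch.randn(4, 2048)       # e.g., a pooled CNN image feature
    print(fusion(text_states, image_feat).shape)  # torch.Size([4, 256])
```

In the cited multimodal translation work, a fused representation of this kind would typically condition a decoder; here it simply stands in for the general principle of grounding textual representations in visual features.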

References

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.

Firat, O., Cho, K., & Bengio, Y. (2016). Multi-way, multilingual neural machine translation with a shared attention mechanism. Proc. NAACL-HLT (pp. 866–875).

Elliott, D., et al. (2016). Multi30K: Multilingual English-German image descriptions. Proc. of the 5th Workshop on Vision and Language (pp. 70-74).

Caglayan, O., et al. (2018). LIUM-CVC submissions for WMT18 multimodal translation task. Proc. WMT.

Calixto, I., et al. (2017). Doubly-attentive decoder for multi-modal neural machine translation. Proc. ACL.

Libovický, J. & Helcl, J. (2017). Attention strategies for multi-source sequence-to-sequence learning. Proc. ACL.

Collell, G., et al. (2018). Acquiring common sense spatial knowledge through implicit spatial templates. Proc. AAAI.