Transfer Learning in NMT for Related and Less Related Languages

Guidelines

The state of the art in machine translation (MT) is based on deep learning. Neural MT (NMT) has surpassed the previous phrase-based approach by a wide margin in most settings, but it is known to perform poorly in low-resource conditions. At the same time, NMT offers an interesting opportunity for so-called transfer learning, where better performance on a primary task, i.e. translation quality, is achieved with the help of additional training data from a secondary task.
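For illustration, the following is a minimal sketch of the "trivial" transfer learning recipe of Kocmi and Bojar (2018): a parent model is trained on a high-resource language pair, and the very same parameters are then fine-tuned on the low-resource child pair, assuming a vocabulary (e.g. of subwords) shared by both pairs. The tiny GRU sequence-to-sequence model and the synthetic batches below are illustrative stand-ins, not a real NMT system.

    import torch
    import torch.nn as nn

    VOCAB = 1000  # shared subword vocabulary for parent and child (assumption)
    DIM = 64

    class TinySeq2Seq(nn.Module):
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(VOCAB, DIM)
            self.enc = nn.GRU(DIM, DIM, batch_first=True)
            self.dec = nn.GRU(DIM, DIM, batch_first=True)
            self.out = nn.Linear(DIM, VOCAB)

        def forward(self, src, tgt):
            _, h = self.enc(self.emb(src))          # encode the source sentence
            states, _ = self.dec(self.emb(tgt), h)  # teacher-forced decoding
            return self.out(states)                 # logits over the vocabulary

    def train(model, next_batch, steps, lr=1e-3):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps):
            src, tgt = next_batch()                 # (batch, seq_len) token ids
            logits = model(src, tgt[:, :-1])        # predict each next target token
            loss = loss_fn(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()

    def synthetic_batches(seed):                    # stand-in for real parallel data
        g = torch.Generator().manual_seed(seed)
        return lambda: (torch.randint(0, VOCAB, (8, 12), generator=g),
                        torch.randint(0, VOCAB, (8, 12), generator=g))

    model = TinySeq2Seq()
    train(model, synthetic_batches(0), steps=100)   # parent: high-resource pair
    torch.save(model.state_dict(), "parent.pt")     # parent checkpoint

    model.load_state_dict(torch.load("parent.pt"))  # child starts from parent weights
    train(model, synthetic_batches(1), steps=20, lr=1e-4)  # fine-tune on low-resource pair

The key design choice is that nothing changes between the two phases except the training data (and possibly the learning rate); the child inherits all parent parameters rather than starting from random initialization.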

The goal of the thesis is to explore the area of transfer learning for NMT. While it seems natural that related language pairs should lend themselves better to transfer learning, experiments indicate that training data size can be a more important factor. Explaining which commonalities of the data for the primary and secondary tasks are critical for the performance gains would be a very interesting contribution. The explanation can be sought in directly observable properties (e.g. typical sentence lengths, or common patterns in sequences of surface tokens) as well as in deeper representations, either learnt automatically by the network or predicted by various linguistic theories (e.g. tectogrammatical lemmas, which correspond to content-bearing words of a single lexical meaning regardless of their surface form).
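As a starting point, the directly observable properties can be quantified with simple corpus statistics. The sketch below compares sentence-length distributions and the overlap of surface token trigrams between the parent and child training data; the file names parent.src and child.src are hypothetical placeholders for the source sides of the two corpora, one tokenized sentence per line.

    from collections import Counter

    def token_stats(path, n=3):
        # Sentence-length histogram and surface token n-gram counts of a corpus.
        lengths, ngrams = Counter(), Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                toks = line.split()
                lengths[len(toks)] += 1
                ngrams.update(zip(*(toks[i:] for i in range(n))))
        return lengths, ngrams

    def mean_length(lengths):
        return sum(l * c for l, c in lengths.items()) / max(1, sum(lengths.values()))

    def overlap(parent, child):
        # Fraction of child n-gram occurrences also attested in the parent data.
        shared = sum(c for g, c in child.items() if g in parent)
        return shared / max(1, sum(child.values()))

    parent_len, parent_ng = token_stats("parent.src")  # hypothetical corpus files
    child_len, child_ng = token_stats("child.src")
    print("mean sentence length (parent/child): %.1f / %.1f"
          % (mean_length(parent_len), mean_length(child_len)))
    print("trigram overlap: %.2f" % overlap(parent_ng, child_ng))

Correlating such measures with the observed transfer gains across several parent-child pairs would be one concrete way to test which commonalities matter.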

Depending on data availability, the majority of experiments should be carried out with morphologically rich Indian languages (e.g. Hindi, Urdu, Telugu, Odia, or Tamil) as well as with European languages.

References

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6000–6010. Curran Associates, Inc., 2017.

Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California, June 2016. Association for Computational Linguistics.

Tom Kocmi and Ondřej Bojar. Trivial Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of the Third Conference on Machine Translation, Volume 1: Research Papers, pages 244–252. Association for Computational Linguistics, 2018.

Shantipriya Parida and Ondřej Bojar. OdiEnCorp: Odia-English and Odia-Only Corpus for Machine Translation. Accepted for publication, 2019.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, MA, USA, 2016.

Jindřich Helcl, Jindřich Libovický, Tom Kocmi, Tomáš Musil, Ondřej Cífka, Dušan Variš, and Ondřej Bojar. Neural Monkey: The Current State and Beyond. In The 13th Conference of the Association for Machine Translation in the Americas, Vol. 1: MT Researchers' Track, pages 168–176. The Association for Machine Translation in the Americas, 2018.

Petr Sgall, Eva Hajičová, and Jarmila Panevová. The Meaning of the Sentence in Its Semantic and Pragmatic Aspects. Academia/Reidel Publishing Company, Prague/Dordrecht, 1986.

Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. Announcing Prague Czech-English Dependency Treebank 2.0. In Proceedings of the Eighth International Language Resources and Evaluation Conference (LREC'12), pages 3153–3160, Istanbul, Turkey, May 2012. ELRA, European Language Resources Association.