Neural machine translation for low-resource languages

Principal investigator (ÚFAL):

Ivana Kvapilíková

Project Manager (ÚFAL):

Hana Kubištová

Provider:

GAUK

Grant id:

1050119

Duration:

2019-2021

Tags:

Machine Translation

Monolingual

People:

Ondřej Bojar

The aim of the project is to investigate methods of training machine translation models with limited access to manually translated texts. Most current deep learning approaches rely on labeled data; for machine translation, this means translated (parallel) texts. Model architectures and training methods for unsupervised learning of deep neural networks are an active research topic for major players in the field, e.g. Facebook or Google. In the case of translation, the machine should be able to learn solely from monolingual texts or from parallel data available for other language pairs.

The project will build upon the newest approaches and investigate their application to several low-resource language pairs (e.g. English-Albanian). In the first year, the work will focus on comparing existing approaches to unsupervised machine translation. These include a multilingual model trained on other language pairs and a model trained iteratively on monolingual texts.

Based on the results of the first experiments, the project will continue in one of the evaluated directions. The goal will be to improve translation quality over the baseline. The investigation will cover the design of different model architectures and the selection of training languages, where there is a trade-off between learning from a limited amount of data in a related language and learning from large-scale data in a distant language.

Publications

Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar (2020): Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp. 255-262, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-952148-03-3

Ivana Kvapilíková, Tom Kocmi, Ondřej Bojar (2020): CUNI Systems for the Unsupervised and Very Low Resource Translation Task in WMT20. In: Fifth Conference on Machine Translation - Proceedings of the Conference, pp. 1123-1128, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-948087-81-0

Ivana Kvapilíková, Dominik Macháček, Ondřej Bojar (2019): CUNI Systems for the Unsupervised News Translation Task in WMT 2019. In: Fourth Conference on Machine Translation - Proceedings of the Conference, pp. 241-248, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-950737-27-7

Institute of Formal and Applied Linguistics

Charles University, Czech Republic
Faculty of Mathematics and Physics
