NPFL116 – Compendium of Neural Machine Translation

This seminar should familiarize students with current research trends in machine translation using deep neural networks. Most importantly, students should learn how to deal with the ever-growing body of literature on empirical research in machine translation and critically assess its content. The semester consists of a few lectures summarizing the state of the art, discussions of reading assignments, and student presentations of selected papers.


SIS code: NPFL116
Semester: summer
E-credits: 3
Examination: 0/2 C
Instructors: Jindřich Libovický, Jindřich Helcl

Timespace Coordinates

The course is not taught this semester. We look forward to seeing you in 2020.


1. Introductory notes on machine translation and deep learning Logistics NN Intro Reading Questions

2. Neural architectures for NLP NN Intro Reading Questions

3. Attentive sequence-to-sequence learning using RNNs Slides Reading Questions

4. Sequence-to-sequence learning with self-attention, a.k.a Transformer Slides Reading: BPE Reading: Backtranslation Questions

5. Tricks for improving NMT performance Slides Reading Questions

6. Unsupervised Neural Machine Translation

7. Generative Adversarial Networks

8. Non-autoregressive Neural Machine Translation

9. Convolutional Sequence-to-sequence Learning

1. Introductory notes on machine translation and deep learning

 Feb 20 Logistics NN Intro


Reading  1.5 hours

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature 521.7553 (2015): 436.


  • Can you identify some implicit assumptions the authors make about sentence meaning while talking about NMT?
  • Do you think they are correct?
  • How do the properties that the authors attribute to LSTM networks correspond to your own ideas about how language should be computationally processed?

2. Neural architectures for NLP

 Feb 28 NN Intro

Covered topics: embeddings, RNNs, vanishing gradient, LSTM, 1D convolution

Reading  2 hours

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).


  • The authors report a score 5 BLEU points worse than that of the previous encoder-decoder architecture (Sutskever et al., 2014). Why is their model better then?
  • If someone asked you to automatically create a dictionary, would you use the attention mechanism for it? Why or why not?

3. Attentive sequence-to-sequence learning using RNNs

 Mar 6 Slides

Covered topics: recurrent language model, RNN decoder, feedforward attention mechanism

Reading  2 hours

Vaswani, Ashish, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017.


  • The model uses scaled dot-product attention, which is a non-parametric variant of the attention mechanism. Why do you think it is sufficient in this setup?
  • Do you think it would work in the recurrent model as well?
  • The way the model processes the sequence is fundamentally different from RNNs or CNNs. Does it agree with your intuition of how language should be processed?
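
As a reminder of what the first question refers to, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, in plain Python. The function names and the toy vectors are illustrative assumptions, not code from the paper. Note that the function itself has no trainable parameters; in the paper, the learned projections are applied to the queries, keys and values before this step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (contexts, weights) for lists of query, key and value vectors.

    queries: n_q vectors of size d_k
    keys:    n_k vectors of size d_k
    values:  n_k vectors of size d_v
    """
    d_k = len(keys[0])
    contexts, all_weights = [], []
    for query in queries:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [
            sum(q_i * k_i for q_i, k_i in zip(query, key)) / math.sqrt(d_k)
            for key in keys
        ]
        weights = softmax(scores)
        # The context vector is the weighted average of the value vectors.
        context = [
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


# Toy example: 2 queries attending over 3 key-value pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
contexts, weights = scaled_dot_product_attention(queries, keys, values)
```

Each row of `weights` is a probability distribution over the keys, and each context vector has the dimensionality of the value vectors.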

4. Sequence-to-sequence learning with self-attention, a.k.a Transformer

 Mar 27 Slides

Covered topics:

Reading: BPE Reading: Backtranslation  2 hours in total

  • Sennrich, Rico, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. ACL 2016.
  • Sennrich, Rico, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. ACL 2016.


  • Describe the main issues with open/large vocabulary.
  • Can you think of another method of how to address the open/large vocabulary problem?
  • Why do you think the artificial training data are used on the source side only?
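
To make the subword idea from the first reading concrete, the following is a minimal sketch of learning byte-pair-encoding merges: repeatedly count adjacent symbol pairs over a word-frequency table and merge the most frequent pair. The helper names, the toy word counts and the tie-breaking rule are assumptions made for this illustration; this is not the reference implementation that accompanies the paper.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Return the list of learned merges and the final segmented vocabulary."""
    # Words start as character sequences with an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        # Assumption: ties are broken lexicographically to keep runs deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Hypothetical toy corpus statistics.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
```

With these toy counts, three merges are enough to build the frequent suffix `est</w>` as a single symbol, while the rarer word "lower" stays split into characters.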

5. Tricks for improving NMT performance

 Apr 3 Slides

Covered topics:

Reading  1 hour

Yun Chen, Kyunghyun Cho, Samuel R. Bowman, Victor O.K. Li: Stable and Effective Trainable Greedy Decoding for Sequence to Sequence Learning. Accepted to ICLR 2018.


  • The paper introduces a surprisingly simple way of fine-tuning an already trained MT model. How would you interpret the achieved results?
  • What do you think the existence of such a simple trick says about the way MT systems are trained?

Student Presentations

6. Unsupervised Neural Machine Translation

 Apr 10

Unsupervised machine translation is an active research topic whose goal is to create a machine translation system without needing huge corpora of parallel data to train the models.

So far, there have been two papers on this topic:

7. Generative Adversarial Networks

 Apr 17

Two years ago, there were many papers attempting to use reinforcement learning for machine translation and to optimize the model directly towards the sentence-level BLEU score instead of cross-entropy, which appears to be clearly sub-optimal. These methods have not been very successful, mainly because of the inherent limitations of the BLEU score.

Generative Adversarial Networks with the generator-discriminator setup are a follow-up to this research. A trained discriminator plays the role of the optimization metric: its goal is to discriminate between generated and human translations. The generator, on the other hand, tries to fool the discriminator by generating translations that are as close to the human reference as possible.
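
For orientation, the generator-discriminator game can be written as the minimax objective from the original GAN paper (Goodfellow et al., 2014), where D is the discriminator, G the generator, p_data the distribution of real examples and p_z the noise prior. In the translation setting one would assume that the generator is conditioned on the source sentence rather than on noise; that reading is an interpretation for this course page, not a formula quoted from the papers listed here.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```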

The following papers will be presented:

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative Adversarial Nets. In Advances in neural information processing systems (pp. 2672-2680).

  • Wu, L., Xia, Y., Zhao, L., Tian, F., Qin, T., Lai, J., & Liu, T. Y. (2017). Adversarial neural machine translation. arXiv preprint arXiv:1704.06933.

8. Non-autoregressive Neural Machine Translation

 Apr 24

Non-autoregressive models can generate the whole output sequence in parallel and do not need to wait until the previous word is generated to update the hidden state.
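
The difference in dependency structure can be illustrated with a toy sketch. The two lookup tables below are hypothetical stand-ins for trained models: the autoregressive "model" needs the previous output token at every step, whereas the non-autoregressive one predicts each position independently, so all positions could be computed at the same time.

```python
def autoregressive_decode(step_fn, length, bos="<s>"):
    """Generate tokens one by one; each step consumes the previous token."""
    output, previous = [], bos
    for position in range(length):
        previous = step_fn(previous, position)
        output.append(previous)
    return output


def non_autoregressive_decode(position_fn, length):
    """Predict every position independently of the other output tokens.

    The positions are visited in a loop here, but no iteration depends on
    another, so they could run in parallel.
    """
    return [position_fn(position) for position in range(length)]


# Hypothetical toy "models" implemented as lookup tables.
next_token = {"<s>": "the", "the": "cat", "cat": "sleeps"}
token_at_position = ["the", "cat", "sleeps"]

sequential = autoregressive_decode(lambda prev, pos: next_token[prev], 3)
parallel = non_autoregressive_decode(lambda pos: token_at_position[pos], 3)
```

Both calls produce the same three tokens in this toy case; the point is only that the second function never reads a previously generated token.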

9. Convolutional Sequence-to-sequence Learning

 May 15

Facebook recently came up with a sequence-to-sequence architecture that is based entirely on convolutional networks. This allows parallel processing of the input sentence. The autoregressive nature of the decoder does not allow parallel decoding at inference time; however, parallel processing is still possible at training time, when the target sentence is known.

The architecture was introduced in a series of two papers:

Reading assignments

There will be a reading assignment after every class. You will be given a few questions about the reading, and you should submit your answers before the next lecture.

Student presentations

Students will form teams and present one of the selected groups of papers to their fellow students. Each team will prepare not only a presentation of the papers but also questions for the discussion that follows.

The other students should also familiarize themselves with the papers so that they can participate in the discussion.

It is strongly encouraged to arrange a consultation with the course instructors at least one day before the presentation.

Final written test

There will be a final written test that will not be graded.