NPFL087 — Statistical Machine Translation

The course covers the area of machine translation (MT) in its current breadth, delving deep enough in each approach to let you know how to confuse every existing MT system. We put a balanced emphasis on several important types of state-of-the-art systems: phrase-based MT, surface-syntactic MT and (a typically Praguian) deep-syntactic MT. We do not forget common prerequisites and surrounding fields: extracting translation equivalents from parallel texts (including word alignment techniques), MT evaluation, and methods of system combination.

We aim to provide a unifying view of machine translation as statistical search in a large search space, well supported with practical experience during your project work in a team or alone. Finally, we also attempt to give a gist of emerging approaches in MT, such as neural networks.


SIS code: NPFL087
Semester: summer
E-credits: 5
Examination: 2/2 C+Ex
Instructor: Ondřej Bojar

New in 2021: Inverted Classroom / Flipped Class

Important news for 2021: We're trialling inverted classroom style. In other words, you are expected:

  • to watch the Main Lecture Video for the week beforehand
  • to write down any questions, corrections, possible extensions
    • be excessive, your goal is to 'grill the teacher' for as long as possible
    • you may write these to the Shared Doc for the week
  • to attend the lecture call
    • if you do not grill your teacher, the teacher will grill you ;-)
    • collectively take notes into the Shared Doc
    • try hard to have your camera on (use virtual background if needed)


  • The lectures from last year are pre-recorded.
  • The debates will not be recorded.
  • If you are absent, recover by:
    • browsing last week's document
    • asking for details the following week

Timespace Coordinates

Additional Sources


Key requirements:

  • Work on a project (alone or in a group of two to three).
  • Present project results (~30-minute talk).
  • Write a report (~4-page scientific paper).

Contributions to the grade:

  • 10% homework and activity,
  • 30% written exam,
  • 50% project report,
  • 10% project presentation.

Final Grade: ≥50% good, ≥70% very good, ≥90% excellent.

Legend: Slides Main Content Illustrative Content Optional Reading Homework Assignment

The dates below indicate when we discuss each topic in the lecture call. Remember to watch the Full Lecture Video well beforehand.

If you see a 2020 date, the entry has not yet been updated.

1. Metrics of MT Quality

 Mar 11, 2021 Lecture Slides Full Lecture Video Shared Doc

MT Talks: Evaluation in General MT Talks: Automatic Evaluation (PER and BLEU)

  • The task of machine translation.
  • Methods of manual evaluation.
  • Methods of automatic evaluation.
  • Empirical confidence bounds, bootstrapping.
  • End-to-end vs. component evaluation.
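
To make the "empirical confidence bounds, bootstrapping" item concrete, here is a minimal sketch of a percentile bootstrap over a test set. The function name `bootstrap_ci`, its parameters, and the choice of the percentile method are assumptions made for illustration, not the procedure prescribed in the lecture. The `metric` argument is any function that scores a list of test items (for a corpus-level metric such as BLEU, it would recompute the corpus score on each resample).

```python
import random


def bootstrap_ci(items, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` computed over `items`.

    `items` is the list of test-set units (e.g. per-sentence statistics),
    `metric` maps such a list to a single number.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(items)
    stats = []
    for _ in range(n_resamples):
        # Resample the test set with replacement, keeping its original size.
        resample = [items[rng.randrange(n)] for _ in range(n)]
        stats.append(metric(resample))
    stats.sort()
    # Cut off alpha/2 of the resampled scores at each end.
    lower = stats[int(alpha / 2 * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical per-sentence scores of one system.
    scores = [0.2, 0.4, 0.35, 0.5, 0.1, 0.45, 0.3, 0.25]
    print(bootstrap_ci(scores, mean))
```

If the intervals of two systems overlap heavily, the observed difference in the metric may well be due to the particular choice of test sentences.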

2. Overview of Approaches to MT: SMT, PBMT, NMT

 Mar 18, 2021 Lecture Slides Full Lecture Video Shared Doc

MT Talks: Overview of MT (except NMT) MT Talks: MT that Deceives (Errors in MT)

  • Approaches to MT.
  • What makes MT statistical
    • Probability of a sentence, Bayes' law.
    • Log-linear model.
  • Phrase-Based MT.
    • Features used.
    • Training Pipeline.
    • Unjustified independence assumptions.
  • Neural MT.
    • Deep learning summary.
    • Representing text.
    • Encoder-decoder architecture overview.
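
As a compact reminder of how "Bayes' law" and the "log-linear model" items fit together, the standard formulation can be written as follows. The notation is an assumption made here, not taken from the slides: f is the source sentence, e a candidate translation, h_i are feature functions and λ_i their weights.

```latex
\begin{align}
\hat{e} &= \arg\max_{e} p(e \mid f)
         = \arg\max_{e} \frac{p(f \mid e)\, p(e)}{p(f)}
         = \arg\max_{e} p(f \mid e)\, p(e) \\
p(e \mid f) &= \frac{\exp \sum_{i} \lambda_i h_i(e, f)}
                    {\sum_{e'} \exp \sum_{i} \lambda_i h_i(e', f)},
\qquad
\hat{e} = \arg\max_{e} \sum_{i} \lambda_i h_i(e, f)
\end{align}
```

Choosing h_1(e, f) = log p(f | e), h_2(e, f) = log p(e) and λ_1 = λ_2 = 1 recovers the first line, so the noisy channel model is a special case of the log-linear one.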

3. Introduction to Neural Machine Translation (NMT)

 Mar 25, 2021 Lecture Slides Full Lecture Video Shared Doc

  • Basic NN building blocks for NMT.
  • Representing text in NNs.
  • Neural LMs.
  • Vanilla Sequence-to-Sequence Model (Encoder-Decoder Framework).
  • Attention.
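
The core idea of attention can be sketched in a few lines of plain Python. This is a simplified dot-product variant chosen for brevity; the lecture may present a different scoring function (for example an additive one), and the names `softmax` and `attend` are illustrative assumptions.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to one."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Dot-product attention of one decoder state over all encoder states.

    Returns the attention weights and the context vector, i.e. the
    weighted average of the encoder states.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    # Toy example: two encoder states, the query is similar to the first one.
    weights, context = attend([1.0, 0.0], [[5.0, 0.0], [0.0, 5.0]])
    print(weights, context)
```

The decoder thus gets a fresh summary of the source sentence at every step instead of relying on a single fixed-size vector.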

4. Alignment

 Apr 1, 2021

Lecture Slides Koehn's Slides with Formulas Full Lecture Video Shared Doc

MT Talks: Data Acquisition MT Talks: Sentence Alignment (Gale&Church) MT Talks: Word Alignment (IBM1) Optional: Church&Gale 1993 Optional: Collins' Notes on IBM1 and IBM2 Homework: IBM Model 1

  • Parallel Data Acquisition.
  • Document Alignment.
  • Sentence Alignment.
  • Word Alignment, IBM1 in Detail.
  • Linguistic Adequacy of Word Alignment

5. Phrase-Based Machine Translation

 Apr 8, 2021

Lecture Slides Haddow's Slides for Recombination, Pruning, Future Cost Koehn's Slides for Future Cost Full Lecture Video Shared Doc MT Talks: Phrase-Based MT

  • PBMT Overview.
    • Phrase Extraction.
    • Reminder: Log-linear model.
  • PBMT Model.
    • Features Used.
    • Traditional PBMT "Training Pipeline"
  • Translating with PBMT (Decoding)
    • Translation Options and Stack-Based Beam Search
    • Interlude: MT is NP-Hard (in general, not just PBMT).
    • Pruning, Future Cost Estimation.
    • Local and Non-Local Features.
  • Minimum Error-Rate Training.

6. Morphology in MT

 Apr 15, 2021

Lecture Slides Koehn's Slides for Factored Models Full Lecture Video Shared Doc MT Talks: Rich Morphology Makes Everything Harder

  • Problems caused by rich morphology.
    • Morphological richness of Czech.
    • Margin for improvement in BLEU.
  • Combinatorial explosion of Czech word forms.
  • Morphology in PBMT:
    • Factored PBMT.
    • Reverse self-training.
  • Morphology in NMT.
    • Subword Units, BPE.
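
To illustrate the "Subword Units, BPE" item, here is a minimal sketch of learning byte-pair-encoding merges from a word frequency list. It is a simplified illustration under stated assumptions: the function name `learn_bpe`, the `</w>` end-of-word marker and the tie-breaking by first occurrence are choices made here, and production toolkits differ in details.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` BPE merge operations from word frequencies."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


if __name__ == "__main__":
    # Hypothetical toy corpus statistics.
    merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
    print(merges)
    print(vocab)
```

Frequent character sequences end up as single units, while rare word forms remain decomposable into smaller known pieces, which helps with the many word forms of a morphologically rich language.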

7. Syntax in SMT

 Apr 9, 2020 Raw Full Lecture Video Lecture Slides

MT Talks: Constituency Trees in MT MT Talks: Dependency Trees in MT MT Talks: Deep Syntax in MT

  • Motivation for grammar in MT.
  • Hierarchical Model.
  • Proper syntax: Constituency vs. dependency trees.

Constituency Syntax:

  • Context Free Grammars.
  • MT as parsing.
    • Synchronous CFG, LM integration.
  • Why real source/target parse trees make it harder.

Dependency Syntax:

  • Surface syntax (STSG), problems.
  • Deep syntax (t-layer); TectoMT, HMTM.

8. Transformer; Syntax in NMT

 Apr 16, 2020 Raw Full Lecture Video Lecture Slides Transformer Illustrated Transformer in Pytorch Transformer at Medium Replacing Linguists with Dummies Promoting the Knowledge of Source Syntax in Transformer NMT Is Not Needed

  • Transformer Architecture (Attention is All You Need)
  • Syntax in NMT:
    • Source Syntax in Network Structure
    • Source Syntax Attached to Tokens
    • Target Syntax through Interleave or Multi-Task.

9. Does MT Understand? Word and Sentence Representations

 Apr 23, 2020 Raw Full Lecture Video Lecture Slides

  • Introducing Semiotics.
  • Do Current MT Systems Understand?
  • Continuous Representations.
    • What are Good Representations?
    • Continuous Word Representations.
    • Continuous Sentence Representations.
  • Aspects of Meaning.
  • Evaluating Sentence Representations
    • How Meaningful is Seq2Seq Representation?

10. Multi-Lingual MT

 Apr 30, 2020 Raw Full Lecture Video Lecture Slides

  • Motivation for using more than 2 languages.
  • Transfer Learning.
    • Catastrophic Forgetting.
    • Trivial Transfer Learning.
  • Multi-Lingual NMT.
  • Massively Multi-Lingual NMT.

11. Multi-Modal Translation

 May 7, 2020 Raw Full Lecture Video Lecture Slides

  • Overview of Multi-Modal Translation.
  • Spoken Language Translation = ASR + MT.
    • Problems at ASR-MT boundary.
    • End-to-end SLT approaches.
  • Visual information for MT.
    • Motivation
    • Are pictures really helpful?

Project Presentations

 May 14+21, 2020

  • Be prepared to present your project.
    • Up to 30 minutes per talk.
    • Present work in progress, no need to have final results.

** The following will still be updated for 2019/2020. **

Written Exam (and remaining presentations)

 May 30, 2019

  • Approximately hour-long written exam.
  • Seven open questions
    • For a full answer, you usually need to write half a page or a page, including illustrations
    • Examples of typical questions.
    • The questions cover everything we discussed in the lectures.
  • Presentations of those absent on May 23.

For older versions of the lectures, you can browse the course history in SVN:

Doc-Level Manual Evaluation and/or Manual Post-Editing

  • Czech speakers: Evaluate the provided translations:
    • Indicate: Adequacy, Fluency, Overall Quality and Number of Context Errors
  • Others: Provide your independent translation, and MT post-editing of a short document

IBM Model 1 Implementation

 Deadline: 20th April 2020

The exam is written and consists of 7 questions, each equally important. In general, the exam questions will cover the full range of topics discussed in the lectures.

Here are the exam questions used in the past, for illustration:

Training Data for MT

  • What types of data are critical for training MT systems and what are the stages of their preparation?

Word Alignment

  • Describe IBM Model 1 for word alignment, highlighting the EM structure of the algorithm. You may or may not use formulas.

  • Suggest limitations of IBM Model 1. Provide examples of sentences and their translations where the model is inadequate, suggest a solution for at least one of them.

  • Illustrate the problems of word alignment task as such.

  • Come up with as many problems as you can for automatic word alignment when used in phrase-based MT.

Phrase-Based MT

  • Use a graph and/or the notation of deductive logic to illustrate the full space of partial (incl. complete) derivations translating "Marii miluje Jan" into English given the following translation dictionary:

    • Jan = John, miluje = loves, Marii = Mary and a model that:
      • translates each input word exactly once
      • allows any permutations of words,
      • ignores translation probabilities.
  • Make up an example sentence and phrase table snippets. Illustrate the process of phrase-based translation. Remember to cover both the preparation of translation options as well as the hypothesis expansion.

  • Make up an example input sentence, phrase table snippets and the process of hypothesis expansion and pruning to illustrate why future cost estimation is needed in phrase-based MT. Ignore the cost of reordering.

  • In the first step of phrase-based translation, all relevant phrase translations are considered for an input sentence. How were the phrase translations obtained? What scores are associated with phrase translations? Roughly suggest how the scores can be estimated.

  • What is the relation between noisy channel model and log-linear model for MT? Try to use formulas. Remember to explain your notation.

  • Describe in detail the process of hypothesis expansion in phrase-based MT. Provide examples for local and non-local features for scoring the hypotheses. How can non-local features be turned into local ones?

Hierarchical MT, Treelet MT

  • Illustrate the extraction of "gappy phrases" for the hierarchical model from a word-aligned sentence pair (e.g. 4x5 words). List (some of) the extracted phrases in the order of extraction.

  • Illustrate chart parsing as used in both the hierarchical and the (surface-) syntactic translation models. You will need to provide a sample: input sentence, some rules, some rule applications.

  • What is the difference between the hierarchical and the (surface-) syntactic translation models? What new complications does syntax bring and how can they be solved?

  • What are all the causes of data sparseness in (some variant of) treelet translation?

Syntax in MT

  • Make up a sample sentence containing non-projectivity.

  • Why is non-projectivity important in MT? Provide an example.

  • For (a) phrase-based model (think Moses) and (b) deep-syntactic translation (think TectoMT) provide examples of as many problems as you can (e.g. syntactic constructions where you can prove the model will fail, situations with a high risk of mismatch between training and test data).

  • Compare (a) phrase-based model (think Moses) and (b) constituency-based syntactic model (Joshua). Provide sample syntactic constructions for a language pair that includes English where (1) one of them is bound to fail and (2) both of them are bound to fail. Describe what new problems the syntactic model brings and how to tackle them (hint: coverage and sparseness).

Factored MT, Language Models in MT

  • When factors are used for target-side morphology, what are they meant to solve? Provide a (not very frequent) counterexample when the part added to the setup hurts instead of helping.

  • Provide 3 examples of factored phrase-based MT setups addressing various linguistic phenomena, explaining what their potential benefits are.

  • Compare language models based on word forms and language models based on POS tags (N, V, A, ... or more detailed like Nsg, Npl at your option) by making up cases where the increased generality of the POS LM helps and where it hurts in distinguishing good vs. bad sentences. You may need to say which patterns are frequent in your training data prior to saying how this misleads the model given some test data. Use monolingual or bilingual examples as you wish.

  • Sketch the idea of the reverse self-training approach. What benefits does it bring?


  • Why is MT NP-complete? Try providing a (polynomial) reduction of an NP-complete problem onto a task in MT.

  • What are "local" vs. "non-local" features in search? Provide examples for phrase-based MT and also for an arbitrary syntactic model you come up with. You will probably need to sketch a small sample of the search space of each of the models with partial hypotheses.

  • What are the complications of introducing a language model to the hierarchical model (model based on chart parsing)? Illustrate state splitting.

MT Evaluation

  • Describe BLEU. Explain its core properties and limitations, sketch the formula and provide its explanation.

  • How does BLEU defeat (score low) hypotheses like "The the the the the." and (separately) "The."?
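
The two BLEU ingredients this question refers to can be sketched as follows. This is a minimal illustration restricted to unigrams and a single reference; the function names and the example reference sentence are assumptions made here, and full BLEU combines clipped precisions for n-grams of orders 1 to 4 over the whole test set.

```python
import math
from collections import Counter


def clipped_unigram_precision(hypothesis, reference):
    """Share of hypothesis words confirmed by the reference, where each word
    counts at most as many times as it occurs in the reference."""
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    clipped = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    return clipped / len(hypothesis)


def brevity_penalty(hyp_len, ref_len):
    """Penalty that shrinks the score of hypotheses shorter than the reference."""
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)


if __name__ == "__main__":
    reference = "the cat is on the mat".split()  # hypothetical reference
    repeated = "the the the the the".split()
    short = "the".split()
    print(clipped_unigram_precision(repeated, reference))
    print(clipped_unigram_precision(short, reference),
          brevity_penalty(len(short), len(reference)))
```

Clipping handles the repeated-word hypothesis, since only two of its five words can be matched, while the brevity penalty handles the one-word hypothesis, whose precision is perfect but whose length is far below that of the reference.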

  • Why does BLEU perform poorly when evaluating Czech? There are at least two reasons. Provide examples.

  • What are the problems of (a) (automatic) word alignment and (b) phrase extraction as used in the "Moses pipeline" in general or when used in phrase-based translation?

  • Suggest 3 different manual MT evaluation techniques and highlight their respective positive and negative aspects.

Model Optimization

  • Describe the loop of weight optimization for the log-linear model as used in phrase-based MT.

  • Describe MERT, minimum error-rate training. Remember to talk about both the outer loop and inner loop, as well as both situations where "lines" appear in the algorithm. Why is the outer loop needed?

Transfer-Based MT

  • Describe what a "transfer-based" MT architecture means, illustrate the design of the deep-syntactic layer used for Czech-English translation. What are the potential benefits of transferring at this deep-syntactic layer?

  • What are the problems of transfer-based MT?

  • Describe the statistical model that is used in TectoMT tree-to-tree transfer. What component of the model serves as a "language model"? What unit does this language model operate with?

MT System Combination

  • Describe one possible approach of combining an external MT system with a phrase-based MT system. What benefits can this approach have?

Neural MT

  • Sketch the structure of an encoder-decoder architecture of neural MT, remember to describe the components in the picture.

  • What problem does attention in neural MT address? Provide the key idea of the method.


All lecture materials for the years 2008–2017 are available in the course SVN:
For read-only access use username: student and password: student