# NPFL087 — Statistical Machine Translation

The course covers the area of machine translation (MT) in its current breadth, delving deep enough in each approach to let you know how to confuse every existing MT system. We put a balanced emphasis on several important types of state-of-the-art systems: phrase-based MT, surface-syntactic MT and (a typically Praguian) deep-syntactic MT. We do not forget common prerequisites and surrounding fields: extracting translation equivalents from parallel texts (including word alignment techniques), MT evaluation and methods of system combination.

We aim to provide a unifying view of machine translation as statistical search in a large search space, well supported with practical experience during your project work in a team or alone. Finally, we also attempt to give a gist of emerging approaches in MT, such as neural networks.

SIS code: NPFL087
Semester: summer
E-credits: 6
Examination: 2/2 C+Ex
Instructor: Ondřej Bojar

### Timespace Coordinates

• lecture: Thursdays 9:00-10:30 in SU1 (lecture held in English)
• lab/discussions: Thursdays 10:40-12:20 in SU1

### Outline

For upcoming lectures, you can browse the course history in SVN (see the Archive section at the end of this page).

### Requirements

Key requirements:

• Work on a project (alone or in a group of two to three).
• Present project results (~30-minute talk).
• Write a report (~4-page scientific paper).

• 10% homework and activity,
• 30% written exam,
• 50% project report,
• 10% project presentation.

Final Grade: ≥50% good, ≥70% very good, ≥90% excellent.


**There was no lecture on Feb 21.**

### 1. Metrics of MT Quality

• The task of machine translation.
• Methods of manual evaluation.
• Methods of automatic evaluation.
• Empirical confidence bounds, bootstrapping.
• End-to-end vs. component evaluation.
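To make the automatic-evaluation item concrete, here is a minimal sketch of a BLEU-style score. It is a simplification, not a reference implementation: it assumes a single reference, whitespace tokenization, sentence-level scoring and no smoothing, and the function names `ngrams` and `bleu` are chosen only for this illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Simplified BLEU: one sentence, one reference, no smoothing."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipping: an n-gram is credited at most as often as it occurs
        # in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # the geometric mean collapses to zero
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: hypotheses shorter than the reference are punished.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)
```

Clipping handles hypotheses that repeat a frequent word, while the brevity penalty handles hypotheses that are precise only because they are very short. Real evaluations compute the statistics over a whole test set rather than per sentence.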

### 2. Overview of Approaches to MT: SMT, PBMT, NMT

• Approaches to MT.

• What makes MT statistical

• Probability of a sentence, Bayes' law.
• Log-linear model.
• Phrase-Based MT.

• Features used.
• Training Pipeline.
• Unjustified independence assumptions.
• Neural MT.

• Deep learning summary.
• Representing text.
• Encoder-decoder architecture overview.
• Introduction to Neural Monkey

• Proposed project topics based on Neural Monkey:

• Multi-source translation (robustness)
• Visualizations
• Insertion-based Transformer
• 2D convolutions for sequence-to-sequence (arxiv: Pervasive attention using...)
• ...or choose your own paper.
• Homework:

• Decide on your project topic.
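As a compact reminder of the "Bayes' law" and "Log-linear model" items of this lecture, the two formulations can be written side by side. The notation is assumed here rather than taken from the slides: $f$ is the source sentence, $e$ a candidate translation, and $h_m$ are feature functions with weights $\lambda_m$.

```latex
% Noisy channel: Bayes' law, dropping the constant P(f)
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f)
        = \operatorname*{argmax}_{e} P(f \mid e)\, P(e)

% Log-linear model: weighted sum of M feature functions
\hat{e} = \operatorname*{argmax}_{e} \exp \sum_{m=1}^{M} \lambda_m h_m(e, f)
        = \operatorname*{argmax}_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f)

% The noisy channel is the special case
% M = 2,\; h_1 = \log P(f \mid e),\; h_2 = \log P(e),\; \lambda_1 = \lambda_2 = 1
```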

### 3. Introduction to Neural Machine Translation (NMT)

Mar 14

• Basic NN building blocks for NMT.
• Representing text in NNs.
• Neural LMs.
• Vanilla Sequence-to-Sequence Model (Encoder-Decoder Framework).
• Attention.
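The attention item can be illustrated with a small sketch of one decoder step. It assumes plain dot-product scoring, which is only one of several scoring functions used in practice, and it uses Python lists instead of a tensor library so that it stays self-contained. The names `softmax` and `dot_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(query, encoder_states):
    """Return (weights, context) for one decoder step.

    query: decoder state, a list of floats.
    encoder_states: one vector per source position, same dimension as query.
    """
    # Score each source position by its dot product with the decoder state.
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted average of encoder states.
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context
```

The point of the mechanism is that the decoder no longer depends on a single fixed-size sentence vector: at every step it recomputes a context vector focused on the source positions that are currently relevant.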

### 4. Alignment

• Parallel Data Acquisition.
• Document Alignment.
• Sentence Alignment.
• Word Alignment, IBM1 in Detail.
• Ultimate Goal of Alignment.
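The EM structure of IBM Model 1 can be sketched in a few lines. This is a minimal illustration under stated simplifications: there is no NULL word, the corpus is assumed to be already tokenized, and the function name `train_ibm1` and the toy corpus in the usage note are made up for this example.

```python
from collections import defaultdict


def train_ibm1(corpus, iterations=10):
    """Estimate lexical translation probabilities with EM.

    corpus: list of (source_tokens, target_tokens) pairs.
    Returns a dict t with t[(tgt_word, src_word)] = P(tgt_word | src_word).
    """
    tgt_vocab = {tw for _, tgt in corpus for tw in tgt}
    uniform = 1.0 / len(tgt_vocab)
    # Uniform initialization over all co-occurring word pairs.
    t = {(tw, sw): uniform for src, tgt in corpus for tw in tgt for sw in src}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (tgt, src) links
        total = defaultdict(float)  # expected count of src being linked
        # E-step: distribute each target word over the source words.
        for src, tgt in corpus:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize expected counts into probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t
```

On a toy corpus such as `[("das Haus".split(), "the house".split()), ("das Buch".split(), "the book".split()), ("ein Buch".split(), "a book".split())]`, the probability mass for `das` gradually concentrates on `the`, even though no single sentence pair disambiguates it.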

### 5. Phrase-Based Machine Translation

• PBMT Overview.
• Phrase Extraction.
• Reminder: Log-linear model.
• PBMT Model.
• Features Used.
• Translating with PBMT (Decoding)
• Translation Options and Stack-Based Beam Search
• Interlude: MT is NP-Hard (in general, not just PBMT).
• Pruning, Future Cost Estimation.
• Local and Non-Local Features.
• Minimum Error-Rate Training.
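The phrase extraction step listed above can be sketched as follows. This is a simplified version that keeps only the tightest target span for every source span and does not extend phrases over unaligned boundary words; the function name `extract_phrases` and the alignment used in the usage note are assumptions made for this illustration.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment.

    src, tgt: lists of tokens.
    alignment: set of (src_index, tgt_index) links.
    Returns a set of (source_phrase, target_phrase) string pairs.
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [j for (i, j) in alignment if s_start <= i <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency: no word inside the target span may be aligned
            # to a source word outside the source span.
            if any(
                t_start <= j <= t_end and not (s_start <= i <= s_end)
                for (i, j) in alignment
            ):
                continue
            pairs.add(
                (" ".join(src[s_start:s_end + 1]), " ".join(tgt[t_start:t_end + 1]))
            )
    return pairs
```

For example, `extract_phrases("Marii miluje Jan".split(), "John loves Mary".split(), {(0, 2), (1, 1), (2, 0)})` yields the three word pairs, the pairs ("Marii miluje", "loves Mary") and ("miluje Jan", "John loves"), and the whole sentence pair.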

### 6. Morphology in MT

• Problems caused by rich morphology.
• Morphological richness of Czech.
• Margin for improvement in BLEU.
• Combinatorial explosion of Czech word forms.
• Morphology in PBMT:
• Factored PBMT.
• Reverse self-training.
• Morphology in NMT.
• Subword Units, BPE.
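For the subword item, here is a minimal sketch of how BPE merges can be learned from a word-frequency list. It assumes words are split into characters with an end-of-word marker `</w>`, and that ties between equally frequent pairs are broken by first occurrence. The function name `learn_bpe` is hypothetical, and production toolkits differ in many details.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations.

    word_freqs: dict mapping word -> corpus frequency.
    Returns the list of merged symbol pairs, in the order they were learned.
    """
    # Each word is a tuple of symbols ending with an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair by one merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges
```

With `learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)` the learned merges are `("e", "s")`, `("es", "t")` and `("est", "</w>")`, so the frequent suffix becomes a single unit while rare words remain decomposable into smaller pieces.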

### 7. Syntax in SMT

• Motivation for grammar in MT.
• Constituency vs. dependency trees.

Constituency Syntax:

• Context Free Grammars.
• MT as parsing.
• Hierarchical phrase-based model (Hiero, Joshua).
• Synchronous CFG, LM integration.
• Using real source/target parse trees.
• Tricks to avoid data loss.

Dependency Syntax:

• Surface syntax (STSG), problems.
• Deep syntax (t-layer); factorization is a must.

**There was no lecture on Apr 18.**

### 8. Transformer; Syntax in NMT

• Transformer Architecture (Attention is All You Need)
• Syntax in NMT:
• Source Syntax Attached to Tokens
• Source Syntax in Network Structure
• Target Syntax through Interleave or Multi-Task.
• Intro: Dijkstra and A-star search.
• MT is NP-hard.
• Fast and optimal decoding.
• Stacks and future cost.
• Cube pruning.
• Hypergraph decoding.

### 10. Word and Sentence Representations

May 9

• Semiotic Triangle: Towards Understanding.
• Continuous Word Representations.
• Continuous Phrase Representations.
• Continuous Sentence Representations.
• Relating Human and NN Meaning Representations.

### 11. Advanced NMT, Chef's Tips

May 16

• Multi-modal and multi-lingual MT.
• Components of best-performing setups.

### Project Presentations

May 23

• Be prepared to present your project.
• Up to 30 minutes per talk.
• Present work in progress, no need to have final results.

### Written Exam (and remaining presentations)

May 30

• Approximately hour-long written exam.
• Seven open questions
• For a full answer, you usually need to write half a page or a page, including illustrations
• Examples of typical questions.
• The questions cover everything we discussed in the lectures.
• Presentations of those absent on May 23.

For older versions of the lectures, you can browse the course history in SVN (see the Archive section below).

### Doc-Level Manual Evaluation

• Evaluate the provided translations:
• Indicate: Adequacy, Fluency, Overall Quality and Number of Context Errors

### IBM1 Implementation

• Implement IBM Model 1 in your favourite language

The exam is written and consists of 7 questions, each equally important. In general, the exam questions will cover the full range of topics discussed in the lectures.

Here are the exam questions used in the past, for illustration:

### Word Alignment

• Describe IBM Model 1 for word alignment, highlighting the EM structure of the algorithm. You may or may not use formulas.

• Suggest limitations of IBM Model 1. Provide examples of sentences and their translations where the model is inadequate, and suggest a solution for at least one of them.

• Illustrate the problems of the word alignment task as such.

• Come up with as many problems as you can for automatic word alignment when used in phrase-based MT.

### Phrase-Based MT

• Use a graph and/or the notation of deductive logic to illustrate the full space of partial (incl. complete) derivations translating "Marii miluje Jan" into English given the following translation dictionary:

• Jan = John, miluje = loves, Marii = Mary and a model that:
• translates each input word exactly once
• allows any permutations of words,
• ignores translation probabilities.
• Make up an example sentence and phrase table snippets. Illustrate the process of phrase-based translation. Remember to cover both the preparation of translation options as well as the hypothesis expansion.

• Make up an example input sentence, phrase table snippets and the process of hypothesis expansion and pruning to illustrate why future cost estimation is needed in phrase-based MT. Ignore the cost of reordering.

• In the first step of phrase-based translation, all relevant phrase translations are considered for an input sentence. How were the phrase translations obtained? What scores are associated with phrase translations? Roughly suggest how the scores can be estimated.

• What is the relation between noisy channel model and log-linear model for MT? Try to use formulas. Remember to explain your notation.

• Describe in detail the process of hypothesis expansion in phrase-based MT. Provide examples for local and non-local features for scoring the hypotheses. How can non-local features be turned into local ones?

### Hierarchical MT, Treelet MT

• Illustrate the extraction of "gappy phrases" for the hierarchical model from a word-aligned sentence pair (e.g. 4x5 words). List (some of) the extracted phrases in the order of extraction.

• Illustrate chart parsing as used in both hierarchical and (surface-) syntactic translation model. You will need to provide a sample: input sentence, some rules, some rule applications.

• What is the difference between the hierarchical and (surface-) syntactic translation model? What new complications does syntax bring and how can they be solved?

• What are all the causes of data sparseness in (some variant of) treelet translation?

### Syntax in MT

• Make up a sample sentence containing non-projectivity.

• Why is non-projectivity important in MT? Provide an example.

• For (a) phrase-based model (think Moses) and (b) deep-syntactic translation (think TectoMT) provide examples of as many problems as you can (e.g. syntactic constructions where you can prove the model will fail, situations with a high risk of mismatch between training and test data).

• Compare (a) phrase-based model (think Moses) and (b) constituency-based syntactic model (Joshua). Provide sample syntactic constructions for a language pair that includes English where (1) one of them is bound to fail and (2) both of them are bound to fail. Describe what new problems the syntactic model brings and how to tackle them (hint: coverage and sparseness).

### Factored MT, Language Models in MT

• When factors are used for target-side morphology, what are they meant to solve? Provide a (not very frequent) counterexample when the part added to the setup hurts instead of helping.

• Provide 3 examples of factored phrase-based MT setups addressing various linguistic phenomena, explaining what their potential benefits are.

• Compare language models based on word forms and language models based on POS tags (N, V, A, ... or more detailed like Nsg, Npl at your option) by making up cases where the increased generality of the POS LM helps and where it hurts in distinguishing good vs. bad sentences. You may need to say which patterns are frequent in your training data prior to saying how this misleads the model given some test data. Use monolingual or bilingual examples as you wish.

• Sketch the idea of the reverse self-training approach. What benefits does it bring?

### Search

• Why is MT NP-complete? Try providing a (polynomial) reduction of an NP-complete problem to a task in MT.

• What are "local" vs. "non-local" features in search? Provide examples for phrase-based MT and also for an arbitrary syntactic model you come up with. You will probably need to sketch a small sample of the search space of each of the models with partial hypotheses.

• What are the complications of introducing a language model to the hierarchical model (model based on chart parsing)? Illustrate state splitting.

### MT Evaluation

• Describe BLEU. Explain its core properties and limitations, sketch the formula and provide its explanation.

• How does BLEU defeat (score low) hypotheses like "The the the the the." and (separately) "The."?

• Why does BLEU perform poorly when evaluating Czech? There are at least two reasons. Provide examples.

• What are the problems of (a) (automatic) word alignment and (b) phrase extraction as used in the "Moses pipeline", in general or when used in phrase-based translation?

• Suggest 3 different manual MT evaluation techniques and highlight their respective positive and negative aspects.

### Model Optimization

• Describe the loop of weight optimization for the log-linear model as used in phrase-based MT.

• Describe MERT, minimum error-rate training. Remember to talk about both the outer loop and inner loop, as well as both situations where "lines" appear in the algorithm. Why is the outer loop needed?

### Transfer-Based MT

• Describe what a "transfer-based" MT architecture means, illustrate the design of the deep-syntactic layer used for Czech-English translation. What are the potential benefits of transferring at this deep-syntactic layer?

• What are the problems of transfer-based MT?

• Describe the statistical model that is used in TectoMT tree-to-tree transfer. What component of the model serves as a "language model"? What unit does this language model operate with?

### MT System Combination

• Describe one possible approach to combining an external MT system with a phrase-based MT system. What benefits can this approach have?

### Neural MT

• Sketch the structure of an encoder-decoder architecture of neural MT, and remember to describe the components in the picture.

• What problem does attention in neural MT address? Provide the key idea of the method.

### Archive

All lecture materials for the years 2008-2017 are available in the course SVN:

https://svn.ms.mff.cuni.cz/projects/NPFL087