Introduction to Natural Language Processing (Úvod do zpracování přirozeného jazyka)
week | lecture | lab | homework |
---|---|---|---|
1: 3/10/2018 | JH: Motivation for NLP. Basic notions from probability and information theory. [slides] | ZŽ: Using basic bash command line tools for text processing. Collecting counts for a bigram language model in bash (see the list of exercises). Optional reading: | |
2: 10/10/2018 | PP: Language models. The noisy channel model. [slides] | ZŽ: Character encoding. [slides] Optional reading: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) | |
3: 17/10/2018 | JH/PP(?): Markov Models. [slides] | ZŽ: Language model exercises. Optional reading: | HW01: diacritics restoration in Czech texts using a letter-trigram model, deadline: see below |
4: 24/10/2018 | ZŽ: Language data resources. [slides] | ZŽ: Evaluation measures in NLP. [slides] (provisional!) | Register as a user of the Czech National Corpus (you will need it in the following week). |
5: 31/10/2018 | PP: Introduction to information retrieval, Boolean model, Inverted index. [slides] | PP: Vector space model, TF-IDF weighting, Evaluation. | |
6: 7/11/2018 | PP: Probabilistic models for information retrieval. [slides] | PP: Language models for information retrieval. | HW02: Experiments with an open-source IR toolkit. [slides], deadline T.B.A. |
7: 14/11/2018 | DZ: Morphological analysis. [slides] | DZ: Czech National Corpus. [notes/googledoc] | |
8: 21/11/2018 | DZ: Syntactic analysis. [slides] | DZ: Syntactically annotated corpora. [slides] | HW03: valency dictionary of verbs, deadline 31.12.2018 |
9: 28/11/2018 | JL: Introduction to deep learning in NLP. [slides] | JL: Sentence classification in PyTorch. [slides], [ipython] | |
10: 5/12/2018 | JL: Applications of deep learning in NLP. [slides] | JL: Recurrent Neural Networks for checking y/i spelling in Czech in TensorFlow. [slides], [ipython] | |
11: 12/12/2018 | OB: Machine Translation (overview, evaluation) and alignment. [slides] | OB: Word alignment. | Finish IBM1, start working on HW04 |
12: 19/12/2018 | OB: Statistical Machine Translation: PBMT and NMT. [main slides, extra illustrations: PBMT decoding (P. Koehn)] | OB: to be updated: Neural MT with Marian at MetaCentrum. | HW04: Empirical comparison of NMT attention and your IBM1 alignment. Deadline 09/01/2019 |
13: 2/1/2019 | NO CLASS | NO LAB | NO CLASS |
14: 9/1/2019 | OB: Linguistic features in SMT and NMT, Advanced NMT. [to be updated main slides, factored PBMT (P. Koehn), TectoMT (M. Popel), Neural MT (R. Sennrich), ACL 2016 tutorial on Neural MT (T. Luong, K. Cho, C. Manning)]. | Finalize HW04, resolve any issues. | |
Most probably 16/01/2019 | Written final exam | | |
Instructors
- JH: Prof. RNDr. Jan Hajič, Dr.
- ZŽ: Doc. Ing. Zdeněk Žabokrtský, Ph.D.
- DZ: RNDr. Daniel Zeman, Ph.D.
- PP: Doc. RNDr. Pavel Pecina, Ph.D.
- OB: RNDr. Ondřej Bojar, Ph.D.
- JL: Mgr. Jindřich Libovický
Homework tasks
- HW01 - diacritics restoration
- Implement a program that reads a Czech text with diacritics removed from STDIN and prints the same text with diacritics restored to STDOUT.
- Possible solution: build a Czech corpus of your own (e.g. by downloading a few e-books, news articles, Wikipedia pages, ...) that contains at least 100k words. Create a mutation of the corpus in which all Czech diacritics are removed. Extract a mapping from words without diacritics to words with diacritics. For out-of-vocabulary words, use a letter-trigram language model (see the sketch after this task description).
- Evaluate the accuracy of the restoration as the percentage of correct non-whitespace characters in the output.
- Evaluation datasets - two randomly chosen recent articles from vesmir.cz:
- development set
- evaluation set (to be used only for evaluating the very final version of your system!)
- You can use any programming language as long as it can be compiled/executed on Linux without too much tweaking (esp. without purchasing any license). Recommended choice: Python 3.
- You can use the devtest data as many times as you need, but you should use the etest data for evaluation only once.
- Organize the execution of the whole experiment into a Makefile that (after typing make all) downloads your training data, as well as the development and evaluation test sets from the links above, trains the model, applies it to the development data, and evaluates the accuracy.
- Submission: please send an archive file containing your source code, Makefile, and a short README describing your approach and the accuracy achieved on devtest and etest by email to Zdeněk Žabokrtský.
- Deadline: 7th November 2018, 23:59:59
- Extended deadline: 21st November 2018 (warning: points obtained will be reduced by 1/2 if you submit within the extended deadline period)
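A minimal Python 3 sketch of the mapping-based approach suggested above. The corpus file name `train.txt`, the plain whitespace tokenization, and the omitted trigram fallback are all assumptions of this sketch, not part of the assignment:

```python
#!/usr/bin/env python3
"""Sketch of the unigram-mapping approach to HW01 (illustrative only)."""
import sys
import unicodedata
from collections import Counter, defaultdict

def strip_diacritics(text):
    """Decompose to NFD and drop the combining marks, i.e., the diacritics."""
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(c))

def accuracy(gold, system):
    """HW01 metric: share of matching non-whitespace characters.
    Assumes the two strings are aligned character by character."""
    pairs = [(g, s) for g, s in zip(gold, system) if not g.isspace()]
    return sum(g == s for g, s in pairs) / len(pairs)

# Training: for every diacritics-stripped form, remember the most frequent
# diacritized variant observed in the corpus ("train.txt" is a placeholder).
counts = defaultdict(Counter)
with open("train.txt", encoding="utf-8") as corpus:
    for line in corpus:
        for word in line.split():
            counts[strip_diacritics(word)][word] += 1
mapping = {bare: variants.most_common(1)[0][0]
           for bare, variants in counts.items()}

# Restoration: look up known words; OOV words pass through unchanged here --
# this is exactly where the letter-trigram model should take over. Note that
# split()/join() collapses whitespace; a real solution should preserve the
# original spacing so that the character-level metric stays meaningful.
for line in sys.stdin:
    print(" ".join(mapping.get(word, word) for word in line.split()))
```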
- HW03 – valency dictionary of verbs, extracted from a treebank
Knowing the number and nature of arguments (valency) of a verb is important for decoding the meaning of a sentence and mapping the words to semantic roles, that is, “who did what to whom.” Sometimes different valency frames signal completely different meanings of two otherwise identical verbs. Valency is normally indicated in dictionaries, if they are available. In this assignment, we will approximate a valency dictionary using information acquired from a treebank.
- download the latest release of the Universal Dependencies (UD) treebanks (there is a download link for the entire collection at the bottom of the title page; at the time of assignment, the most recent release is 2.3)
- write a tool that extracts information about core arguments of verbs in any UD language
- the tool should be able to take any CoNLL-U file in any language as input
- you must test it on at least two UD languages:
- select only treebanks that contain both lemmas (the LEMMA column) and features (the FEATS column) and whose size is at least 20K tokens (see the UD website for an overview):
- choose one language freely (provided the above constraints are met)
- the other is assigned pseudo-randomly: order the English names of the languages alphabetically (as on http://universaldependencies.org/), insert your last name into the sorted list, and take the next language after your last name (or the first language in the list if your surname sorts after the last language); see the snippet after this task description
- for each treebank, concatenate its training + development + test data and produce one output for the entire treebank
- the tool finds all occurrences of non-auxiliary verbs (the tag is VERB) in the data
- for each verb it finds all its arguments in the sentence; the arguments can be distinguished from non-arguments by their dependency relation label, i.e., the value of the DEPREL column in the CoNLL-U file format (a minimal sketch of such an extractor appears after this task description). The following count as arguments:
- nsubj, csubj, obj, iobj, ccomp, xcomp, expl;
- relations that start with one of the above labels and contain a language-specific extension, e.g. “obj:caus”;
- relations labeled “obl:arg” or “obl:agent”; but not any other extension of “obl” and not the bare “obl” itself.
- a “verb valency frame” for the purpose of this task is the following information:
- lemma of the verb
- VerbForm and Voice features of the verb, if available
- for each argument of the verb:
- its dependency relation to the verb (e.g. “nsubj” or “obj”)
- its Case feature, if available
- if the argument has any dependent with the “case” or “mark” relation (usually prepositions and subordinating conjunctions), the lemma of this dependent is also included; if there are several such dependents, all are included
- order of the arguments in the sentence is not significant, i.e. Czech “dal slečně kytku”, “dal kytku slečně” and “jemu dal dárek” are instances of the same valency frame
- the frame does not include the actual word form, neither of the verb, nor of any dependent! Thus Czech “koupil auto”, “koupím auto” and “koupím dům” are not three different frames! These are three instances of the same frame, “koupit obj-Acc” (the verb “koupit” = “to buy” with just one argument, which is an accusative object).
- output of the tool: a list of verbs and their valency frames, either as plain text or as an HTML table
- one frame per line; the frame is accompanied by its frequency at the end of the line
- verb lemmas are sorted alphabetically
- within one verb, frames are ordered by their VerbForm and Voice features (that is, the value of VerbForm is the first sorting criterion, Voice is the second)
- within one verb lemma + VerbForm + Voice combination, frames are sorted according to their frequencies (the most frequent frame comes first)
- example output:
```
adaptovat Fin Act : iobj-Acc(case-na), nsubj-Nom, obj-Acc = 1
adaptovat Inf null: obj-Acc(case-na) = 2
adaptovat Inf null: obj-Acc = 1
adaptovat Part Act : iobj-Acc(case-na), obj-Acc = 1
```
- So the first frame in the above example says that there was a sentence containing an active finite form of the verb adaptovat, and in the same sentence there were three words whose parent (head) was this verb, and their dependency relations (deprel) were iobj, nsubj and obj, respectively. The two objects (obj and iobj) had the feature Case=Acc. The subject (nsubj) had the feature Case=Nom. Furthermore, in the sentence there was a preposition whose head was the first accusative object, the lemma of the preposition was na and the relation (deprel) between the object and the preposition was case.
- ideally, the solution should not depend on a particular operating system; nevertheless, it will be tested on Ubuntu Linux, so if you cannot guarantee platform independence, at least make sure that it runs correctly on Ubuntu
- within these limits, any programming language can be used. Python (both 2.7 and 3), Perl, Bash, Java (1.6, 1.7, 1.8), JavaScript, C, C++ all should be fine; but please avoid non-standard libraries. Ideally, your program should be able to run on a common Linux system without installing new stuff. When in doubt, get in touch with me before you start coding
- make sure to document how the program is invoked from the console, whether it simply reads STDIN and writes STDOUT, and if not, how the input and output paths can be specified on the command line
- for each of the input treebanks, add the output file (but do not add the input files – they are too large for e-mail communication and I already have them)
- pack it all (script + documentation + outputs) into a zip file and submit it by e-mail to zeman@ufal.mff.cuni.cz (please put “FEL-HW03” in the subject); for deadline, see the table above
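The pseudo-random language assignment described above can be illustrated in a few lines of Python. The language list below is a hypothetical, heavily truncated stand-in for the real alphabetical list on http://universaldependencies.org/:

```python
import bisect

# Hypothetical, truncated stand-in for the alphabetically sorted list of
# English names of UD languages from http://universaldependencies.org/.
languages = ["Afrikaans", "Arabic", "Czech", "English", "Finnish", "Yoruba"]
surname = "Novak"  # example surname

position = bisect.bisect_right(languages, surname)  # where the surname would sit
assigned = languages[position % len(languages)]     # wrap around to the first
print(assigned)  # -> Yoruba ("Novak" sorts after "Finnish", before "Yoruba")
```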
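And a minimal Python 3 sketch of the frame extractor itself, as referenced above: it reads one CoNLL-U file from STDIN and prints frames in the format of the example output. The CoNLL-U column indices are standard, but the function names, the "null" placeholder for missing features, and the choice to keep full deprel subtypes in the frame are this sketch's own decisions, not a reference solution:

```python
#!/usr/bin/env python3
"""Sketch of a valency-frame extractor for HW03 (illustrative only).

Usage: python3 frames.py < treebank.conllu
"""
import sys
from collections import Counter

CORE = {"nsubj", "csubj", "obj", "iobj", "ccomp", "xcomp", "expl"}

def is_argument(deprel):
    """Core relations (with or without a subtype), plus obl:arg / obl:agent."""
    if deprel.split(":")[0] in CORE:
        return True
    return deprel in ("obl:arg", "obl:agent")   # plain "obl" does not count

def feat(token, name):
    """Value of one feature from the FEATS column, or "null" if absent."""
    for fv in token[5].split("|"):
        if fv.startswith(name + "="):
            return fv.split("=", 1)[1]
    return "null"

def sentences(stream):
    """Yield sentences as lists of 10-column token rows."""
    sent = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            if sent:
                yield sent
            sent = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():   # skip multiword ranges (1-2), empty nodes (1.1)
                sent.append(cols)
    if sent:
        yield sent

frames = Counter()
for sent in sentences(sys.stdin):
    children = {}
    for tok in sent:
        children.setdefault(tok[6], []).append(tok)   # HEAD id -> dependents
    for tok in sent:
        if tok[3] != "VERB":                          # UPOS column
            continue
        args = []
        for dep in children.get(tok[0], []):
            if not is_argument(dep[7]):               # DEPREL column
                continue
            case = feat(dep, "Case")
            arg = dep[7] if case == "null" else dep[7] + "-" + case
            # lemmas of the argument's own case/mark dependents, e.g. (case-na)
            adps = [d[7].split(":")[0] + "-" + d[2]
                    for d in children.get(dep[0], [])
                    if d[7].split(":")[0] in ("case", "mark")]
            if adps:
                arg += "(" + ",".join(adps) + ")"
            args.append(arg)
        # argument order in the sentence is irrelevant, hence sorted()
        frames[(tok[2], feat(tok, "VerbForm"), feat(tok, "Voice"),
                tuple(sorted(args)))] += 1

# lemma alphabetically, then VerbForm, then Voice, then frequency (descending)
for (lemma, vform, voice, args), n in sorted(
        frames.items(),
        key=lambda kv: (kv[0][0], kv[0][1], kv[0][2], -kv[1])):
    print(f"{lemma} {vform} {voice}: {', '.join(args)} = {n}")
```

Per the assignment, run it on the concatenation of each treebank's splits, e.g. `cat cs_pdt-ud-*.conllu | python3 frames.py > cs_pdt.txt` (file names follow the usual UD naming convention).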
- HW04 - Comparing your IBM1 alignment with the attention of a sequence-to-sequence model
- See the details on the Lab 10 & 11 page.
Requirements for passing the course
- obtaining the course credit
- There will be 4 homework assignments.
- For each assignment, you will get up to 12.5 points, i.e. up to 50 points in total.
- Solutions of homework tasks are to be created by each student individually; any plagiarism will be strongly penalized.
- All assignments will have a fixed deadline (usually two weeks).
- If you submit the assignment after the deadline, you will get:
- up to 50% of the maximum points if it is less than 2 weeks after the deadline;
- 0 points if it is more than 2 weeks after the deadline.
- To be allowed to take the final written test, you need to get at least 50% of the total points from the assignments.
- passing the exam
- each student must write the final written test
- the final grade will be fully determined by the integer-rounded number of points, as follows (according to the dean's directive):
- A - excellent: 90-100 points
- B - very good: 80-89 points
- C - good: 70-79 points
- D - satisfactory: 60-69 points
- E - sufficient: 50-59 points
- F - failed: fewer than 50 points
- the total number of points will be determined as the sum of:
- homework tasks points: maximum 50 points (all four tasks equally weighted)
- written test points: maximum 50 points (see the set of possible test questions)
- Example:
- Honza Hloupý submitted all four homework solutions. His solutions were quite good (though not perfect) and by coincidence he gained 10 points (out of 12.5) for each. He was late with the second homework, but luckily he completed it within the extended deadline. This leads to 10+5+10+10=35 points.
- Honza received 46 points (out of 50) from the final written test.
- Total points: 35+46=81. Final grade: very good.