Introduction to Natural Language Processing (Úvod do zpracování přirozeného jazyka)
week | lecture | lab | homework |
---|---|---|---|
1: 3/10/2018 | JH: Motivation for NLP. Basic notions from probability and information theory. [slides] | ZŽ: Using basic bash command line tools for text processing. Collecting counts for a bigram language model in bash (see the list of exercises). Optional reading: | |
2: 10/10/2018 | PP: Language models. The noisy channel model. [slides] | ZŽ: Character encoding. [slides] Optional reading: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) | |
3: 17/10/2018 | JH/PP(?): Markov Models. [slides] | ZŽ: Language model exercises. Optional reading: | HW01: diacritics restoration in Czech texts using a letter-trigram model, deadline: see below |
4: 24/10/2018 | ZŽ: Language data resources. [slides] | ZŽ: Evaluation measures in NLP. [slides] (provisional!) | Register as a user of the Czech National Corpus (you will need it in the following week). |
5: 31/10/2018 | PP: Introduction to information retrieval, Boolean model, Inverted index. [slides] | PP: Vector space model, TF-IDF weighting, Evaluation. | |
6: 7/11/2018 | PP: Probabilistic models for information retrieval. [slides] | PP: Language models for information retrieval. | HW02: Experiments with an open-source IR toolkit. [slides], deadline T.B.A. |
7: 14/11/2018 | DZ: Morphological analysis. [slides] | DZ: Czech National Corpus. [notes/googledoc] | |
8: 21/11/2018 | DZ: Syntactic analysis. [slides] | DZ: Syntactically annotated corpora. [slides] | HW03: valency dictionary of verbs, deadline 31.12.2018 |
9: 28/11/2018 | JL: Introduction to deep learning in NLP. [slides] | JL: Sentence classification in PyTorch. [slides], [ipython] | |
10: 5/12/2018 | JL: Applications of deep learning in NLP. [slides] | JL: Recurrent Neural Networks for checking y/i spelling in Czech in TensorFlow. [slides], [ipython] | |
11: 12/12/2018 | OB: Machine Translation (overview, evaluation) and alignment. [slides] | OB: Word alignment. | Finish IBM1, start working on HW04 |
12: 19/12/2018 | OB: Statistical Machine Translation: PBMT and NMT. [main slides, extra illustrations: PBMT decoding (P. Koehn)] | OB: to be updated: Neural MT with Marian at MetaCentrum. | HW04: Empirical comparison of NMT attention and your IBM1 alignment. Deadline 09/01/2019 |
13: 2/1/2019 | NO CLASS | NO LAB | NO CLASS |
14: 9/1/2019 | OB: Linguistic features in SMT and NMT, Advanced NMT. [to be updated main slides, factored PBMT (P. Koehn), TectoMT (M. Popel), Neural MT (R. Sennrich), ACL 2016 tutorial on Neural MT (T. Luong, K. Cho, C. Manning)]. | Finalize HW04, resolve any issues. | |
Most probably 16/01/2019 | Written final exam | | |
Instructors
- JH: Prof. RNDr. Jan Hajič, Dr.
- ZŽ: Doc. Ing. Zdeněk Žabokrtský, Ph.D.
- DZ: RNDr. Daniel Zeman, Ph.D.
- PP: Doc. RNDr. Pavel Pecina, Ph.D.
- OB: RNDr. Ondřej Bojar, Ph.D.
- JL: Mgr. Jindřich Libovický
Homework tasks
- HW01 - diacritics restoration
- Implement a program that reads a Czech text with diacritics removed from STDIN and prints the same text with diacritics restored to STDOUT.
- Possible solution: build a Czech corpus of your own (e.g. by downloading a few e-books, news articles, Wikipedia pages, ...) that contains at least 100k words. Create a mutation of the corpus in which all Czech diacritics are removed. Extract a mapping from words without diacritics to words with diacritics. For out-of-vocabulary words, use a letter-trigram language model (see the sketch after this task description).
- Evaluate the accuracy of the restoration as the percentage of correct non-whitespace characters in the output.
- Evaluation datasets - two randomly chosen recent articles from vesmir.cz:
- development set
- evaluation set (to be used only for evaluating the very final version of your system!)
- You can use any programming language as long as it can be compiled/executed on Linux without too much tweaking (esp. without purchasing any license). Recommended choice: Python 3.
- You can use the devtest data as many times as you need, but you should use the etest data for evaluation only once.
- Organize the execution of the whole experiment into a Makefile that (after typing make all) downloads your training data, as well as the development and evaluation test sets from the links above, trains the model, applies it to the development data, and evaluates the accuracy.
- Submission: please send an archive file containing your source code, Makefile, and a short README describing your approach and the accuracy achieved on devtest and etest by email to Zdeněk Žabokrtský.
- Deadline: 7th November 2018, 23:59:59
- Extended deadline: 21st November 2018 (warning: points obtained will be reduced by 1/2 if you submit within the extended deadline period)
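A minimal Python 3 sketch of the mapping-based approach suggested above. The corpus file name `train.txt`, the plain whitespace tokenization, and the omitted trigram fallback are all assumptions of this sketch, not part of the assignment:

```python
#!/usr/bin/env python3
"""Sketch of the unigram-mapping approach to HW01 (illustrative only)."""
import sys
import unicodedata
from collections import Counter, defaultdict

def strip_diacritics(text):
    """Decompose to NFD and drop the combining marks, i.e., the diacritics."""
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(c))

def accuracy(gold, system):
    """HW01 metric: share of matching non-whitespace characters.
    Assumes the two strings are aligned character by character."""
    pairs = [(g, s) for g, s in zip(gold, system) if not g.isspace()]
    return sum(g == s for g, s in pairs) / len(pairs)

# Training: for every diacritics-stripped form, remember the most frequent
# diacritized variant observed in the corpus ("train.txt" is a placeholder).
counts = defaultdict(Counter)
with open("train.txt", encoding="utf-8") as corpus:
    for line in corpus:
        for word in line.split():
            counts[strip_diacritics(word)][word] += 1
mapping = {bare: variants.most_common(1)[0][0]
           for bare, variants in counts.items()}

# Restoration: look up known words; OOV words pass through unchanged here --
# this is exactly where the letter-trigram model should take over. Note that
# split()/join() collapses whitespace; a real solution should preserve the
# original spacing so that the character-level metric stays meaningful.
for line in sys.stdin:
    print(" ".join(mapping.get(word, word) for word in line.split()))
```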
- HW03 – valency dictionary of verbs, extracted from a treebank
Knowing the number and nature of arguments (valency) of a verb is important for decoding the meaning of a sentence and mapping the words to semantic roles, that is, “who did what to whom.” Sometimes different valency frames signal completely different meanings of two otherwise identical verbs. Valency is normally indicated in dictionaries, if they are available. In this assignment, we will approximate a valency dictionary using information acquired from a treebank.
- download the latest release of the Universal Dependencies (UD) treebanks (there is a download link for the entire collection at the bottom of the title page; at the time of assignment, the most recent release is 2.3)
- write a tool that extracts information about core arguments of verbs in any UD language
- the tool should be able to take any CoNLL-U file in any language as input
- you must test it on at least two UD languages:
- select only treebanks that contain both lemmas (the LEMMA column) and features (the FEATS column) and whose size is at least 20K tokens (see the UD website for an overview):
- choose one language freely (provided the above constraints are met)
- the other is assigned pseudo-randomly: order the English names of the languages alphabetically (as on http://universaldependencies.org/), insert your last name into the sorted list, and take the next language after your last name (or the first language in the list if your surname sorts after the last language); see the snippet after this task description
- for each treebank, concatenate its training + development + test data and produce one output for the entire treebank
- the tool finds all occurrences of non-auxiliary verbs (the tag is VERB) in the data
- for each verb it finds all its arguments in the sentence; the arguments can be distinguished from non-arguments by their dependency relation label, i.e., the value of the DEPREL column in the CoNLL-U file format (a minimal sketch of such an extractor appears after this task description). The following count as arguments:
- nsubj, csubj, obj, iobj, ccomp, xcomp, expl;
- relations that start with one of the above labels and contain a language-specific extension, e.g. “obj:caus”;
- relations labeled “obl:arg” or “obl:agent”; but not any other extension of “obl” and not the bare “obl” itself.
- a “verb valency frame” for the purpose of this task is the following information:
- lemma of the verb
- VerbForm and Voice features of the verb, if available
- for each argument of the verb:
- its dependency relation to the verb (e.g. “nsubj” or “obj”)
- its Case feature, if available
- if the argument has any dependent with the “case” or “mark” relation (usually prepositions and subordinating conjunctions), the lemma of this dependent is also included; if there are several such dependents, all are included
- order of the arguments in the sentence is not significant, i.e. Czech “dal slečně kytku”, “dal kytku slečně” and “jemu dal dárek” are instances of the same valency frame
- the frame does not include the actual word form, neither of the verb, nor of any dependent! Thus Czech “koupil auto”, “koupím auto” and “koupím dům” are not three different frames! These are three instances of the same frame, “koupit obj-Acc” (the verb “koupit” = “to buy” with just one argument, which is an accusative object).
- output of the tool: a list of verbs and their valency frames, either as plain text or as an HTML table
- one frame per line; the frame is accompanied by its frequency at the end of the line
- verb lemmas are sorted alphabetically
- within one verb, frames are ordered by their VerbForm and Voice features (that is, the value of VerbForm is the first sorting criterion, Voice is the second)
- within one verb lemma + VerbForm + Voice combination, frames are sorted according to their frequencies (the most frequent frame comes first)
- example output:
```
adaptovat Fin Act : iobj-Acc(case-na), nsubj-Nom, obj-Acc = 1
adaptovat Inf null: obj-Acc(case-na) = 2
adaptovat Inf null: obj-Acc = 1
adaptovat Part Act : iobj-Acc(case-na), obj-Acc = 1
```
- So the first frame in the above example says that there was a sentence containing an active finite form of the verb adaptovat, and in the same sentence there were three words whose parent (head) was this verb, and their dependency relations (deprel) were iobj, nsubj and obj, respectively. The two objects (obj and iobj) had the feature Case=Acc. The subject (nsubj) had the feature Case=Nom. Furthermore, in the sentence there was a preposition whose head was the first accusative object, the lemma of the preposition was na and the relation (deprel) between the object and the preposition was case.
- ideally, the solution should not depend on a particular operating system; nevertheless, it will be tested on Ubuntu Linux, so if you cannot guarantee platform independence, at least make sure that it runs correctly on Ubuntu
- within these limits, any programming language can be used. Python (both 2.7 and 3), Perl, Bash, Java (1.6, 1.7, 1.8), JavaScript, C, C++ all should be fine; but please avoid non-standard libraries. Ideally, your program should be able to run on a common Linux system without installing new stuff. When in doubt, get in touch with me before you start coding
- make sure to document how the program is invoked from the console, whether it simply reads STDIN and writes STDOUT, and if not, how the input and output paths can be specified on the command line
- for each of the input treebanks, add the output file (but do not add the input files – they are too large for e-mail communication and I already have them)
- pack it all (script + documentation + outputs) into a zip file and submit it by e-mail to zeman@ufal.mff.cuni.cz (please put “FEL-HW03” in the subject); for deadline, see the table above
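The pseudo-random language assignment described above can be illustrated in a few lines of Python. The language list below is a hypothetical, heavily truncated stand-in for the real alphabetical list on http://universaldependencies.org/:

```python
import bisect

# Hypothetical, truncated stand-in for the alphabetically sorted list of
# English names of UD languages from http://universaldependencies.org/.
languages = ["Afrikaans", "Arabic", "Czech", "English", "Finnish", "Yoruba"]
surname = "Novak"  # example surname

position = bisect.bisect_right(languages, surname)  # where the surname would sit
assigned = languages[position % len(languages)]     # wrap around to the first
print(assigned)  # -> Yoruba ("Novak" sorts after "Finnish", before "Yoruba")
```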
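And a minimal Python 3 sketch of the frame extractor itself, as referenced above: it reads one CoNLL-U file from STDIN and prints frames in the format of the example output. The CoNLL-U column indices are standard, but the function names, the "null" placeholder for missing features, and the choice to keep full deprel subtypes in the frame are this sketch's own decisions, not a reference solution:

```python
#!/usr/bin/env python3
"""Sketch of a valency-frame extractor for HW03 (illustrative only).

Usage: python3 frames.py < treebank.conllu
"""
import sys
from collections import Counter

CORE = {"nsubj", "csubj", "obj", "iobj", "ccomp", "xcomp", "expl"}

def is_argument(deprel):
    """Core relations (with or without a subtype), plus obl:arg / obl:agent."""
    if deprel.split(":")[0] in CORE:
        return True
    return deprel in ("obl:arg", "obl:agent")   # plain "obl" does not count

def feat(token, name):
    """Value of one feature from the FEATS column, or "null" if absent."""
    for fv in token[5].split("|"):
        if fv.startswith(name + "="):
            return fv.split("=", 1)[1]
    return "null"

def sentences(stream):
    """Yield sentences as lists of 10-column token rows."""
    sent = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            if sent:
                yield sent
            sent = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():   # skip multiword ranges (1-2), empty nodes (1.1)
                sent.append(cols)
    if sent:
        yield sent

frames = Counter()
for sent in sentences(sys.stdin):
    children = {}
    for tok in sent:
        children.setdefault(tok[6], []).append(tok)   # HEAD id -> dependents
    for tok in sent:
        if tok[3] != "VERB":                          # UPOS column
            continue
        args = []
        for dep in children.get(tok[0], []):
            if not is_argument(dep[7]):               # DEPREL column
                continue
            case = feat(dep, "Case")
            arg = dep[7] if case == "null" else dep[7] + "-" + case
            # lemmas of the argument's own case/mark dependents, e.g. (case-na)
            adps = [d[7].split(":")[0] + "-" + d[2]
                    for d in children.get(dep[0], [])
                    if d[7].split(":")[0] in ("case", "mark")]
            if adps:
                arg += "(" + ",".join(adps) + ")"
            args.append(arg)
        # argument order in the sentence is irrelevant, hence sorted()
        frames[(tok[2], feat(tok, "VerbForm"), feat(tok, "Voice"),
                tuple(sorted(args)))] += 1

# lemma alphabetically, then VerbForm, then Voice, then frequency (descending)
for (lemma, vform, voice, args), n in sorted(
        frames.items(),
        key=lambda kv: (kv[0][0], kv[0][1], kv[0][2], -kv[1])):
    print(f"{lemma} {vform} {voice}: {', '.join(args)} = {n}")
```

Per the assignment, run it on the concatenation of each treebank's splits, e.g. `cat cs_pdt-ud-*.conllu | python3 frames.py > cs_pdt.txt` (file names follow the usual UD naming convention).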
- HW04 - Comparing your IBM1 alignment with the attention of a sequence-to-sequence model
- See the details on the Lab 10 & 11 page.
Requirements for passing the course
- obtaining the course credit
- There will be 4 homework assignments.
- For each assignment, you will get up to 12.5 points, i.e. up to 50 points in total.
- Solutions of homework tasks are to be created by each student individually; any plagiarism will be strongly penalized.
- All assignments will have a fixed deadline (usually two weeks).
- If you submit the assignment after the deadline, you will get:
- up to 50% of the maximum points if it is less than 2 weeks after the deadline;
- 0 points if it is more than 2 weeks after the deadline.
- To be allowed to take the final written test, you need to get at least 50% of the total points from the assignments.
- passing the exam
- each student must write the final written test
- the final grade will be fully determined by the integer-rounded number of points, as follows (according to the dean's directive):
- A - excellent: 90-100 points
- B - very good: 80-89 points
- C - good: 70-79 points
- D - satisfactory: 60-69 points
- E - sufficient: 50-59 points
- F - failed: fewer than 50 points
- the total number of points will be determined as the sum of:
- homework tasks points: maximum 50 points (all four tasks equally weighted)
- written test points: maximum 50 points (see the set of possible test questions)
- Example:
- Honza Hloupý submitted all four homework solutions. His solutions were quite good (though not perfect) and by coincidence he gained 10 points (out of 12.5) for each. He was late with the second homework, but luckily he completed it within the extended deadline. This leads to 10+5+10+10=35 points.
- Honza received 46 points (out of 50) from the final written test.
- Total points: 35+46=81. Final grade: very good.