Starting in the academic year 2025/2026, this course replaces the discontinued NPFL067 and NPFL068.
SIS code: NPFL147
Semester: winter
E-credits: 6
Examination: 2/2 C+Ex
Lecturers: Pavel Pecina, pecina@ufal.mff.cuni.cz; Jindřich Helcl, helcl@ufal.mff.cuni.cz
Language: The course is taught in English. All materials are in English; the homework assignments and the exam can each be completed in either English or Czech.
No formal prerequisites are required. Students should have substantial programming experience and be familiar with basic algorithms, data structures, and statistical/probabilistic concepts. No background in NLP is necessary.
To pass the course, students need to complete three homework assignments and pass a written test. See grading for more details.
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
Deadline: 25th November 2025, 20:00
In this assignment you will explore the entropy of natural language and n-gram language model smoothing across multiple languages. Your task is to obtain a dataset from the Hugging Face repository and calculate the conditional entropy of text in three languages. You will experiment with how the entropy changes with the tokenization strategy (i.e., how you split text into sequential inputs for the model). Then you will implement interpolated smoothing for trigram language models, use the EM algorithm to optimize the smoothing parameters, and evaluate model performance through cross-entropy on test data.
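The conditional entropy in question is H(J|I) = -sum over pairs (i,j) of p(i,j) log2 p(j|i), computed over adjacent tokens; the trigram model is then smoothed by linearly interpolating uniform, unigram, bigram, and trigram estimates. Below is a minimal sketch of the EM weight re-estimation on held-out data, for orientation only and not as the required implementation; the callables p1, p2, p3 standing for maximum-likelihood distributions trained beforehand are assumptions of this sketch.

```python
def em_interpolation_weights(heldout_trigrams, p1, p2, p3, vocab_size,
                             iterations=50):
    """Re-estimate the weights of the interpolated trigram model
    p(w3|w1,w2) = l0/V + l1*p1(w3) + l2*p2(w3|w2) + l3*p3(w3|w1,w2)
    by EM on held-out trigrams (w1, w2, w3)."""
    lambdas = [0.25, 0.25, 0.25, 0.25]              # uniform start
    for _ in range(iterations):
        expected = [0.0, 0.0, 0.0, 0.0]
        for w1, w2, w3 in heldout_trigrams:
            # E-step: responsibility of each component for this token
            contrib = [lambdas[0] / vocab_size,      # uniform
                       lambdas[1] * p1(w3),          # unigram
                       lambdas[2] * p2(w3, w2),      # bigram
                       lambdas[3] * p3(w3, w1, w2)]  # trigram
            total = sum(contrib)
            for j in range(4):
                expected[j] += contrib[j] / total
        # M-step: normalize expected counts into new weights
        norm = sum(expected)
        lambdas = [e / norm for e in expected]
    return lambdas
```

Cross-entropy on test data, -(1/N) sum of log2 p(w3|w1,w2), then scores the smoothed model in bits per token.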
The submission will consist of a single Google Colab notebook, plus a filled-in checklist.
For detailed instructions, please see the assignment slides.
Deadline: 16th December 2025, 20:00
In the second assignment you will practice your coding skills by implementing the word class algorithm presented during the lectures. You will download Czech and English data from Project Gutenberg (translations of the same book, in fact) and compute the class hierarchy on a smaller sample, observing the similarities and differences between the two languages. In the second part of this assignment you will obtain pre-trained word embeddings for selected words and visualize the classes by their embeddings. Then you will run a clustering algorithm on the embeddings and compare the resulting clusters with the original word classes.
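For the embedding part, a small sketch of what the clustering step could look like, assuming the selected words and their pre-trained embedding vectors are already loaded; the names `words` and `vectors`, and the choice of k-means, are illustrative rather than prescribed by the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_embeddings(words, vectors, n_clusters=15, seed=0):
    """Group words by k-means over their pre-trained embeddings and
    return {cluster id: [words]} for side-by-side comparison with the
    mutual-information word classes from the first part."""
    vectors = np.asarray(vectors)
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(vectors)
    clusters = {}
    for word, label in zip(words, labels):
        clusters.setdefault(int(label), []).append(word)
    return clusters
```

Listing both partitions side by side is a simple way to inspect where the distributional word classes and the embedding clusters agree or diverge.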
The submission will consist of a single Google Colab notebook, plus a filled-in checklist.
For detailed instructions, please see the assignment slides.
Deadline: 27th January 2026, 20:00
In the third assignment you will experiment with different approaches to part-of-speech tagging. You will obtain data from Universal Dependencies, evaluate an off-the-shelf neural-network-based tagger, and compare it to a supervised HMM-based tagger that you train yourself. You will use the interpolated smoothing algorithm from assignment 1 and implement the Viterbi algorithm for decoding the tag sequence from a trained HMM. Additionally, you will try to improve a weak model trained on only a fraction of the labeled data by applying the Baum-Welch algorithm to unlabeled data.
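As a rough guide, here is a bare-bones sketch of Viterbi decoding for a bigram HMM tagger in log space. The tables `log_init`, `log_trans`, and `log_emit` are assumed inputs of this sketch; in the assignment they would come from the smoothed estimates of assignment 1 rather than the raw dictionary lookups used here.

```python
import math

def viterbi(words, tags, log_init, log_trans, log_emit):
    """Find the most probable tag sequence for `words` under a bigram
    HMM given log-probability tables log_init[t], log_trans[s][t],
    and log_emit[t][w]."""
    # delta[t]: best log-score of any tag sequence ending in tag t
    delta = {t: log_init[t] + log_emit[t].get(words[0], -math.inf)
             for t in tags}
    backptrs = []
    for w in words[1:]:
        new_delta, prev = {}, {}
        for t in tags:
            # best predecessor state for tag t at this position
            best = max(tags, key=lambda s: delta[s] + log_trans[s][t])
            prev[t] = best
            new_delta[t] = (delta[best] + log_trans[best][t]
                            + log_emit[t].get(w, -math.inf))
        backptrs.append(prev)
        delta = new_delta
    # reconstruct the path from the best final tag backwards
    tag = max(tags, key=delta.get)
    path = [tag]
    for prev in reversed(backptrs):
        tag = prev[tag]
        path.append(tag)
    return list(reversed(path))
```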
The submission will consist of a single Google Colab notebook, plus a filled-in checklist.
For detailed instructions, please see the assignment slides.
Manning, C. D. and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press. 1999. ISBN 0-262-13360-1.
Jurafsky, D. and J. H. Martin. Speech and Language Processing. Prentice-Hall. 2000. ISBN 0-13-095069-6.
Allen, J. Natural Language Understanding. Benjamin/Cummings Publishing Company. 1994. ISBN 0-8053-0334-0.
Cover, T. M. and J. A. Thomas. Elements of Information Theory. Wiley. 1991. ISBN 0-471-06259-6.
Charniak, E. Statistical Language Learning. MIT Press. 1996. ISBN 0-262-53141-0.
Jelinek, F. Statistical Methods for Speech Recognition. MIT Press. 1998. ISBN 0-262-10066-5.