Starting in 2025/2026, this course replaces the discontinued NPFL067 and NPFL068.
SIS code: NPFL147
Semester: winter
E-credits: 6
Examination: 2/2 C+Ex
Lecturers: Pavel Pecina, pecina@ufal.mff.cuni.cz; Jindřich Helcl, helcl@ufal.mff.cuni.cz
Language: The course is taught in English. All materials are in English; the homework assignments and the exam can be completed in either English or Czech.
No formal prerequisites are required. Students should have substantial programming experience and be familiar with basic algorithms, data structures, and statistical/probabilistic concepts. No background in NLP is necessary.
To pass the course, students need to complete three homework assignments and pass a written test. See grading for more details.
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
Deadline: 25th November 2025, 20:00
In this assignment you will explore the entropy of natural language and n-gram language model smoothing across multiple languages. Your task is to obtain a dataset from the Hugging Face repository and calculate the conditional entropy of text in three languages. You will experiment with how the entropy changes with the tokenization strategy (i.e., how you split the text into sequential inputs for the model). Then you will implement interpolated smoothing for trigram language models, use the EM algorithm to optimize the smoothing parameters, and evaluate model performance through cross-entropy on test data.
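As a first sanity check, here is a minimal sketch of estimating conditional entropy H(J|I) from bigram counts; the whitespace tokenization and the toy sentence are stand-ins for whatever dataset and tokenization strategy you settle on:

```python
from collections import Counter
from math import log2

def conditional_entropy(tokens):
    """MLE estimate of H(J|I) = -sum_{i,j} p(i,j) * log2 p(j|i)."""
    bigrams = Counter(zip(tokens, tokens[1:]))   # counts of (w_{i-1}, w_i)
    histories = Counter(tokens[:-1])             # counts of w_{i-1}
    total = sum(bigrams.values())
    return -sum(
        (n / total) * log2(n / histories[i])
        for (i, j), n in bigrams.items()
    )

text = "a cat sat on a mat".split()
print(f"H(J|I) = {conditional_entropy(text):.3f} bits")
```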
The submission will consist of a single Google Colab notebook, plus a filled-in checklist.
For detailed instructions, please see the assignment slides.
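The smoothing part can be prototyped along the following lines. This is a sketch of EM re-estimation of the interpolation weights for p'(w3|w1,w2) = l3·p3(w3|w1,w2) + l2·p2(w3|w2) + l1·p1(w3) + l0/|V|; the dictionaries p3, p2, p1 are assumed to hold MLE conditional probabilities precomputed from training data, and all names here are illustrative:

```python
def em_lambdas(heldout_trigrams, p3, p2, p1, vocab_size, iterations=20):
    """Re-estimate interpolation weights (l0, l1, l2, l3) on held-out data."""
    lambdas = [0.25, 0.25, 0.25, 0.25]  # uniform start
    for _ in range(iterations):
        expected = [0.0, 0.0, 0.0, 0.0]
        for w1, w2, w3 in heldout_trigrams:
            # Weighted contribution of each model component.
            parts = [
                lambdas[0] / vocab_size,                # uniform
                lambdas[1] * p1.get(w3, 0.0),           # unigram
                lambdas[2] * p2.get((w2, w3), 0.0),     # bigram
                lambdas[3] * p3.get((w1, w2, w3), 0.0), # trigram
            ]
            z = sum(parts)  # > 0 thanks to the uniform component
            for k in range(4):
                expected[k] += parts[k] / z
        total = sum(expected)
        lambdas = [e / total for e in expected]
    return lambdas
```

The test-set cross-entropy of the smoothed model is then -1/N · Σ log2 p'(w3|w1,w2) over the N test trigrams.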
Deadline: 23rd December 2025, 20:00
In the second assignment you will practice your coding skills by implementing the word class algorithm presented in the lectures. You will download Czech and English data from Project Gutenberg (translations of the same book, in fact) and compute the class hierarchy on a smaller sample, observing the similarities and differences between the two languages. In the second part of this assignment you will obtain pre-trained word embeddings for selected words and visualize the classes via their embeddings. Then you will run a clustering algorithm on the embeddings and compare the resulting clusters with the original word classes.
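To get a feel for the merging before scaling up, here is a naive greedy sketch in the spirit of the class-based algorithm from the lectures: each word starts in its own class, and the pair of classes whose merge keeps the average mutual information of adjacent class pairs highest is merged until the target number of classes remains. Its cubic behaviour limits it to tiny samples:

```python
from collections import Counter
from math import log2

def avg_mutual_information(class_bigrams):
    """I(C1;C2) over adjacent class pairs, from their joint counts."""
    total = sum(class_bigrams.values())
    left, right = Counter(), Counter()
    for (a, b), n in class_bigrams.items():
        left[a] += n
        right[b] += n
    return sum(
        (n / total) * log2(n * total / (left[a] * right[b]))
        for (a, b), n in class_bigrams.items()
    )

def greedy_word_classes(tokens, num_classes):
    word2class = {w: i for i, w in enumerate(sorted(set(tokens)))}
    while len(set(word2class.values())) > num_classes:
        classes = sorted(set(word2class.values()))
        best = None
        for i, a in enumerate(classes):
            for b in classes[i + 1:]:
                # Tentatively merge class b into class a and score it.
                merged = {w: (a if c == b else c) for w, c in word2class.items()}
                seq = [merged[w] for w in tokens]
                ami = avg_mutual_information(Counter(zip(seq, seq[1:])))
                if best is None or ami > best[0]:
                    best = (ami, a, b)
        _, a, b = best
        word2class = {w: (a if c == b else c) for w, c in word2class.items()}
    return word2class

text = "the cat sat on the mat the dog sat on the rug".split()
print(greedy_word_classes(text, num_classes=3))
```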
The submission will consist of a single Google Colab notebook, plus a filled-in checklist.
For detailed instructions, please see the assignment slides.
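For the embedding part, the clustering and visualization can look roughly like this; the random vectors are placeholders for the pre-trained embeddings you obtain, and KMeans/PCA come from scikit-learn:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

words = ["cat", "dog", "mat", "rug", "sat", "ran"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(words), 50))  # placeholder vectors

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
points = PCA(n_components=2).fit_transform(embeddings)  # 2-D projection

plt.scatter(points[:, 0], points[:, 1], c=labels)
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.title("KMeans clusters of word embeddings (PCA projection)")
plt.show()
```

Comparing these clusters with the word classes from the first part (by inspection, or with a metric such as the adjusted Rand index) is the point of the exercise.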
Manning, C. D. and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press. 1999. ISBN 0-262-13360-1.
Jurafsky, D. and J. H. Martin. Speech and Language Processing. Prentice-Hall. 2000. ISBN 0-13-095069-6.
Allen, J. Natural Language Understanding. Benjamin/Cummings Publishing Company. 1994. ISBN 0-8053-0334-0.
Cover, T. M. and J. A. Thomas. Elements of Information Theory. Wiley. 1991. ISBN 0-471-06259-6.
Charniak, E. Statistical Language Learning. MIT Press. 1996. ISBN 0-262-53141-0.
Jelinek, F. Statistical Methods for Speech Recognition. MIT Press. 1998. ISBN 0-262-10066-5.