Statistical Methods in Natural Language Processing

Starting in 2025/2026, this course replaces the discontinued NPFL067 and NPFL068.

About

SIS code: NPFL147
Semester: winter
E-credits: 6
Examination: 2/2 C+Ex
Lecturers: Pavel Pecina, pecina@ufal.mff.cuni.cz; Jindřich Helcl helcl@ufal.mff.cuni.cz

Language: The course is taught in English. All materials are in English; the homework assignments and the exam can be completed in English or Czech.

Timespace Coordinates

  • Tuesdays, 12:20-13:50 (S9), 14:00-15:30 (S1)
  • Consultations upon request.

News

  • Assignment 1 published. Deadline Nov 25, 2025 at 8pm CET.
  • No lecture/practicals Oct 28, 2025 and Nov 25, 2025
  • The course will start on Oct 7, 2025.

Prerequisites

No formal prerequisites are required. Students should have substantial programming experience and be familiar with basic algorithms, data structures, and statistical/probabilistic concepts. No background in NLP is necessary.

Passing Requirements

To pass the course, students need to complete three homework assignments and pass a written test. See grading for more details.

License

Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.

Entropy and LM Smoothing

 Deadline: 25th November 2025, 20:00

Submission form

In this assignment you will explore the entropy of natural language and n-gram language model smoothing across multiple languages. Your task is to obtain a dataset from the Hugging Face repository and calculate the conditional entropy of texts in three languages. You will experiment with how the entropy changes with the tokenization strategy (i.e., how you split text into sequential inputs for the model). Then you will implement interpolated smoothing for trigram language models, use the EM algorithm to optimize the smoothing parameters, and evaluate model performance through cross-entropy on test data.
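As a rough illustration of the entropy part (a toy sketch with names of our own choosing, not the required solution), the conditional entropy H(W_i | W_{i-1}) can be estimated from maximum-likelihood bigram counts, here with character-level tokenization of a toy string:

```python
from collections import Counter
from math import log2

def conditional_entropy(tokens):
    """Estimate H(W_i | W_{i-1}) from maximum-likelihood bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])  # counts of the conditioning token
    n = len(tokens) - 1             # number of bigram events
    h = 0.0
    for (w1, w2), c in bigrams.items():
        p_joint = c / n           # P(w1, w2)
        p_cond = c / history[w1]  # P(w2 | w1)
        h -= p_joint * log2(p_cond)
    return h

# Character-level tokenization: a fully predictable sequence has H = 0
print(conditional_entropy(list("abababab")))  # → 0.0
```

Swapping `list(text)` for `text.split()` switches to whitespace tokenization, which is the kind of comparison the tokenization experiments involve.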

Submissions consist of a single Google Colab notebook plus a filled-in checklist.

For detailed instructions please see the assignment slides.
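One possible shape of the smoothing part (a sketch under simplifying assumptions: no sentence-boundary handling, and all function and variable names are hypothetical) is linear interpolation of the trigram, bigram, unigram, and uniform distributions, with weights fitted by EM on held-out data:

```python
from collections import Counter

def em_interpolation(train, heldout, iters=20):
    """Fit weights l0..l3 of the interpolated trigram model
    P'(w3|w1,w2) = l3*P(w3|w1,w2) + l2*P(w3|w2) + l1*P(w3) + l0/|V|
    by EM on held-out data."""
    uni = Counter(train)
    bi = Counter(zip(train, train[1:]))
    tri = Counter(zip(train, train[1:], train[2:]))
    vocab = len(set(train) | set(heldout))
    n = len(train)
    lam = [0.25] * 4  # uniform initialization

    def component_probs(w1, w2, w3):
        p_uni = uni[w3] / n
        p_bi = bi[w2, w3] / uni[w2] if uni[w2] else 0.0
        p_tri = tri[w1, w2, w3] / bi[w1, w2] if bi[w1, w2] else 0.0
        return [1.0 / vocab, p_uni, p_bi, p_tri]

    for _ in range(iters):
        expected = [0.0] * 4
        for trigram in zip(heldout, heldout[1:], heldout[2:]):
            ps = component_probs(*trigram)
            mix = sum(l * p for l, p in zip(lam, ps))
            for j in range(4):  # E-step: posterior of each component
                expected[j] += lam[j] * ps[j] / mix
        total = sum(expected)
        lam = [e / total for e in expected]  # M-step: renormalize
    return lam

weights = em_interpolation(list("abcabcabcabc"), list("abcabc"))
print([round(l, 3) for l in weights])
```

On this toy character data the trigram component receives most of the weight, since the held-out trigrams are nearly deterministic given the training counts.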

Word Classes and Embeddings

 Deadline: 23rd December 2025, 20:00

Submission form

In the second assignment you will practice your coding skills by implementing the word-class algorithm presented in the lectures. You will download Czech and English data from Project Gutenberg (translations of the same book, in fact) and compute the class hierarchy on a smaller sample, observing the similarities and differences between the two languages. In the second part of this assignment you will obtain pre-trained word embeddings for selected words and visualize the classes by their embeddings. Then you will run a clustering algorithm on the embeddings and compare the clusters with the original word classes.
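As an illustration of the merging criterion (a toy sketch, not at the scale or efficiency the assignment requires; all names are our own): greedy bottom-up class merging picks, at each step, the merge that loses the least average mutual information between the classes of adjacent tokens.

```python
from collections import Counter
from math import log2

def avg_mutual_information(tokens, cls):
    """I(C_{i-1}; C_i): average mutual information between the classes
    of adjacent tokens under the class map `cls`."""
    pairs = Counter((cls[a], cls[b]) for a, b in zip(tokens, tokens[1:]))
    left = Counter(cls[w] for w in tokens[:-1])
    right = Counter(cls[w] for w in tokens[1:])
    n = len(tokens) - 1
    return sum((c / n) * log2((c / n) / ((left[x] / n) * (right[y] / n)))
               for (x, y), c in pairs.items())

def greedy_merge(tokens, target):
    """Merge the pair of classes with the smallest mutual-information
    loss, repeatedly, until `target` classes remain (brute force)."""
    cls = {w: w for w in set(tokens)}  # start with one class per word
    while len(set(cls.values())) > target:
        best = None
        classes = sorted(set(cls.values()))
        for i, a in enumerate(classes):
            for b in classes[i + 1:]:
                trial = {w: (a if c == b else c) for w, c in cls.items()}
                ami = avg_mutual_information(tokens, trial)
                if best is None or ami > best[0]:
                    best = (ami, a, b)
        _, a, b = best
        cls = {w: (a if c == b else c) for w, c in cls.items()}
    return cls

toy = "the cat ate the dog ate the cat ran the dog ran".split()
classes = greedy_merge(toy, 3)
print(classes["cat"] == classes["dog"])  # → True: identical contexts merge
```

Words with identical left and right contexts ("cat"/"dog", "ate"/"ran") can be merged without any loss of mutual information, so the greedy procedure groups them first.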

Submissions consist of a single Google Colab notebook plus a filled-in checklist.

For detailed instructions please see the assignment slides.
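For the clustering part, a minimal k-means sketch over toy 2-D "embeddings" (a real solution would load pre-trained vectors and might use a library implementation; everything here, including the deterministic initialization, is an illustrative assumption):

```python
import numpy as np

def kmeans(vectors, k, iters=20):
    """Minimal k-means over embedding vectors (Euclidean distance)."""
    step = max(1, len(vectors) // k)
    centers = vectors[::step][:k]  # simple deterministic initialization
    for _ in range(iters):
        # assign each vector to its nearest center
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers; keep the old center if a cluster is empty
        centers = np.array([vectors[labels == j].mean(axis=0)
                            if (labels == j).any() else centers[j]
                            for j in range(k)])
    return labels

# Two well-separated groups of toy "embeddings"
emb = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
print(kmeans(emb, k=2))  # → [0 0 1 1]
```

The resulting cluster labels can then be cross-tabulated against the word classes from the first part of the assignment.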

Homework assignments

  • There are three homework assignments during the semester with a fixed deadline announced on the webpage.
  • The assignments are to be worked on independently and require a substantial amount of programming, experimentation, and reporting to complete.
  • Each assignment will be awarded 0-100 points.
  • Late submissions received up to 2 weeks after the deadline will be penalized by a 50% point reduction.
  • Submissions received later than 2 weeks after the deadline will be awarded 0 points.
  • One two-week no-penalty extension will be granted upon request sent by email before the deadline.

Exam

  • The exam takes the form of a written test at the end of the semester (up to three exam terms are offered).
  • The maximum duration of the test is 90 minutes.
  • The test will be graded with 0-100 points.

Final Grading

  • Completion of both the homework assignments and exam is required to pass the course.
  • Students need to earn at least 50 points for each assignment (before any late-submission penalty) and at least 50 points for the test.
  • The points received for the assignments and the test will be available in SIS.
  • The final grade will be based on the average result of the exam test and the three homework assignments, all four weighted equally:
    • ≥ 90%: grade 1 (excellent)
    • ≥ 70%: grade 2 (very good)
    • ≥ 50%: grade 3 (good)
    • < 50%: grade 4 (fail)

Plagiarism

  • No plagiarism will be tolerated.
  • All cases of plagiarism will be reported to the Student Office.

Required Reading

Foundations of Statistical Natural Language Processing

Manning, C. D. and H. Schütze. MIT Press. 1999. ISBN 0-262-13360-1.

Recommended & Reference Readings

Speech and Language Processing

Jurafsky, D. and J. H. Martin. Prentice-Hall. 2000. ISBN 0-13-095069-6.

Natural Language Understanding

Allen, J. Benjamins/Cummings Publishing Company. 1994. ISBN 0-8053-0334-0.

Elements of Information Theory

Cover, T. M. and J. A. Thomas. Wiley. 1991. ISBN 0-471-06259-6.

Statistical Language Learning

Charniak, E. MIT Press. 1996. ISBN 0-262-53141-0.

Statistical Methods for Speech Recognition

Jelinek, F. MIT Press. 1998. ISBN 0-262-10066-5.