Statistical Methods in Natural Language Processing

Starting in 2025/2026, this course replaces the discontinued NPFL067 and NPFL068.

About

SIS code: NPFL147
Semester: winter
E-credits: 6
Examination: 2/2 C+Ex
Lecturers: Pavel Pecina, pecina@ufal.mff.cuni.cz; Jindřich Helcl helcl@ufal.mff.cuni.cz

Language: The course is taught in English. All materials are in English; the homework assignments and the exam can be completed in English or Czech.

Timespace Coordinates

  • Tuesdays, 12:20-13:50 (S9), 14:00-15:30 (S1)
  • Consultations upon request.

News

  • Assignment 1 published. Deadline Nov 25, 2025 at 8pm CET.
  • No lecture/practicals Oct 28, 2025 and Nov 25, 2025
  • The course will start on Oct 7, 2025.

Prerequisites

No formal prerequisites are required. Students should have substantial programming experience and be familiar with basic algorithms, data structures, and statistical/probabilistic concepts. No background in NLP is necessary.

Passing Requirements

To pass the course, students need to complete three homework assignments and pass a written test. See grading for more details.

License

Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.

Entropy and LM Smoothing

 Deadline: 25th November 2025, 20:00

Submission form

In this assignment you will explore the entropy of natural language and n-gram language model smoothing across multiple languages. Your task is to obtain a dataset from the Hugging Face repository and calculate the conditional entropy of texts in three languages. You will experiment with how the entropy changes with the tokenization strategy (i.e., how you split text into sequential inputs for the model). Then you will implement interpolated smoothing for trigram language models, use the EM algorithm to optimize the smoothing parameters, and evaluate model performance through cross-entropy on test data.
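As a rough illustration of the entropy part (a toy sketch with names of our own choosing, not the required solution), the conditional entropy H(W_i | W_{i-1}) can be estimated from maximum-likelihood bigram counts, here with character-level tokenization of a toy string:

```python
from collections import Counter
from math import log2

def conditional_entropy(tokens):
    """Estimate H(W_i | W_{i-1}) from maximum-likelihood bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])  # counts of the conditioning token
    n = len(tokens) - 1             # number of bigram events
    h = 0.0
    for (w1, w2), c in bigrams.items():
        p_joint = c / n           # P(w1, w2)
        p_cond = c / history[w1]  # P(w2 | w1)
        h -= p_joint * log2(p_cond)
    return h

# Character-level tokenization: a fully predictable sequence has H = 0
print(conditional_entropy(list("abababab")))  # → 0.0
```

Swapping `list(text)` for `text.split()` switches to whitespace tokenization, which is the kind of comparison the tokenization experiments involve.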

Submissions consist of a single Google Colab notebook plus a filled-in checklist.

For detailed instructions please see the assignment slides.
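One possible shape of the smoothing part (a sketch under simplifying assumptions: no sentence-boundary handling, and all function and variable names are hypothetical) is linear interpolation of the trigram, bigram, unigram, and uniform distributions, with weights fitted by EM on held-out data:

```python
from collections import Counter

def em_interpolation(train, heldout, iters=20):
    """Fit weights l0..l3 of the interpolated trigram model
    P'(w3|w1,w2) = l3*P(w3|w1,w2) + l2*P(w3|w2) + l1*P(w3) + l0/|V|
    by EM on held-out data."""
    uni = Counter(train)
    bi = Counter(zip(train, train[1:]))
    tri = Counter(zip(train, train[1:], train[2:]))
    vocab = len(set(train) | set(heldout))
    n = len(train)
    lam = [0.25] * 4  # uniform initialization

    def component_probs(w1, w2, w3):
        p_uni = uni[w3] / n
        p_bi = bi[w2, w3] / uni[w2] if uni[w2] else 0.0
        p_tri = tri[w1, w2, w3] / bi[w1, w2] if bi[w1, w2] else 0.0
        return [1.0 / vocab, p_uni, p_bi, p_tri]

    for _ in range(iters):
        expected = [0.0] * 4
        for trigram in zip(heldout, heldout[1:], heldout[2:]):
            ps = component_probs(*trigram)
            mix = sum(l * p for l, p in zip(lam, ps))
            for j in range(4):  # E-step: posterior of each component
                expected[j] += lam[j] * ps[j] / mix
        total = sum(expected)
        lam = [e / total for e in expected]  # M-step: renormalize
    return lam

weights = em_interpolation(list("abcabcabcabc"), list("abcabc"))
print([round(l, 3) for l in weights])
```

On this toy character data the trigram component receives most of the weight, since the held-out trigrams are nearly deterministic given the training counts.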

Word Classes and Embeddings

 Deadline: 23rd December 2025, 20:00

Submission form

In the second assignment you will practice your coding skills by implementing the word-class algorithm presented in the lectures. You will download Czech and English data from Project Gutenberg (translations of the same book, in fact) and compute the class hierarchy on a smaller sample, observing the similarities and differences between the two languages. In the second part of this assignment you will obtain pre-trained word embeddings for selected words and visualize the classes by their embeddings. Then you will run a clustering algorithm on the embeddings and compare the clusters with the original word classes.
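As an illustration of the merging criterion (a toy sketch, not at the scale or efficiency the assignment requires; all names are our own): greedy bottom-up class merging picks, at each step, the merge that loses the least average mutual information between the classes of adjacent tokens.

```python
from collections import Counter
from math import log2

def avg_mutual_information(tokens, cls):
    """I(C_{i-1}; C_i): average mutual information between the classes
    of adjacent tokens under the class map `cls`."""
    pairs = Counter((cls[a], cls[b]) for a, b in zip(tokens, tokens[1:]))
    left = Counter(cls[w] for w in tokens[:-1])
    right = Counter(cls[w] for w in tokens[1:])
    n = len(tokens) - 1
    return sum((c / n) * log2((c / n) / ((left[x] / n) * (right[y] / n)))
               for (x, y), c in pairs.items())

def greedy_merge(tokens, target):
    """Merge the pair of classes with the smallest mutual-information
    loss, repeatedly, until `target` classes remain (brute force)."""
    cls = {w: w for w in set(tokens)}  # start with one class per word
    while len(set(cls.values())) > target:
        best = None
        classes = sorted(set(cls.values()))
        for i, a in enumerate(classes):
            for b in classes[i + 1:]:
                trial = {w: (a if c == b else c) for w, c in cls.items()}
                ami = avg_mutual_information(tokens, trial)
                if best is None or ami > best[0]:
                    best = (ami, a, b)
        _, a, b = best
        cls = {w: (a if c == b else c) for w, c in cls.items()}
    return cls

toy = "the cat ate the dog ate the cat ran the dog ran".split()
classes = greedy_merge(toy, 3)
print(classes["cat"] == classes["dog"])  # → True: identical contexts merge
```

Words with identical left and right contexts ("cat"/"dog", "ate"/"ran") can be merged without any loss of mutual information, so the greedy procedure groups them first.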

Submissions consist of a single Google Colab notebook plus a filled-in checklist.

For detailed instructions please see the assignment slides.
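For the clustering part, a minimal k-means sketch over toy 2-D "embeddings" (a real solution would load pre-trained vectors and might use a library implementation; everything here, including the deterministic initialization, is an illustrative assumption):

```python
import numpy as np

def kmeans(vectors, k, iters=20):
    """Minimal k-means over embedding vectors (Euclidean distance)."""
    step = max(1, len(vectors) // k)
    centers = vectors[::step][:k]  # simple deterministic initialization
    for _ in range(iters):
        # assign each vector to its nearest center
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers; keep the old center if a cluster is empty
        centers = np.array([vectors[labels == j].mean(axis=0)
                            if (labels == j).any() else centers[j]
                            for j in range(k)])
    return labels

# Two well-separated groups of toy "embeddings"
emb = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
print(kmeans(emb, k=2))  # → [0 0 1 1]
```

The resulting cluster labels can then be cross-tabulated against the word classes from the first part of the assignment.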

Homework assignments

  • There are three homework assignments during the semester with a fixed deadline announced on the webpage.
  • The assignments are to be worked on independently and require a substantial amount of programming, experimentation, and reporting to complete.
  • Each assignment will be awarded 0-100 points.
  • Late submissions received up to 2 weeks after the deadline will be penalized by a 50% point reduction.
  • Submissions received later than 2 weeks after the deadline will be awarded 0 points.
  • One two-week no-penalty extension will be granted upon request sent by email before the deadline.

Exam

  • The exam takes the form of a written test at the end of the semester (up to three exam terms are offered).
  • The maximum duration of the test is 90 minutes.
  • The test will be graded with 0-100 points.

Final Grading

  • Completion of both the homework assignments and exam is required to pass the course.
  • Students need to earn at least 50 points for each assignment (before any late-submission penalty) and at least 50 points for the test.
  • The points received for the assignments and the test will be available in SIS.
  • The final grade will be based on the average result of the exam test and the three homework assignments, all four weighted equally:
    • ≥ 90%: grade 1 (excellent)
    • ≥ 70%: grade 2 (very good)
    • ≥ 50%: grade 3 (good)
    • < 50%: grade 4 (fail)

Plagiarism

  • No plagiarism will be tolerated.
  • All cases of plagiarism will be reported to the Student Office.

Required Reading

Foundations of Statistical Natural Language Processing

Manning, C. D. and H. Schütze. MIT Press. 1999. ISBN 0-262-13360-1.

Recommended & Reference Readings

Speech and Language Processing

Jurafsky, D. and J. H. Martin. Prentice-Hall. 2000. ISBN 0-13-095069-6.

Natural Language Understanding

Allen, J. Benjamins/Cummings Publishing Company. 1994. ISBN 0-8053-0334-0.

Elements of Information Theory

Cover, T. M. and J. A. Thomas. Wiley. 1991. ISBN 0-471-06259-6.

Statistical Language Learning

Charniak, E. MIT Press. 1996. ISBN 0-262-53141-0.

Statistical Methods for Speech Recognition

Jelinek, F. MIT Press. 1998. ISBN 0-262-10066-5.