SIS code: NPFL067, NPFL068
winter + summer
2/2 C+Ex, 2/2 C+Ex

Statistical NLP I (NPFL067), II (NPFL068) - 2017/2018

[Statistické metody zpracování přirozených jazyků I, II]

Lectures (fall semester):
Where: Malostranské nam. 25, 1st floor, S1
When: Tue 12:20-13:50

Seminars (fall semester):
Where: Malostranské nam. 25, 1st floor, S1
When: Tue 14:00-15:30

Instructors: Jan Hajič, Pavel Pecina
Offices: MFF UK, Malostranské nam., 4th floor, rooms 420/422

Prerequisites & Relation to Other Courses

Students should have substantial programming experience in C, C++, Java and/or Perl, and should preferably have taken Data Structures (Datove struktury, NTIN060), Unix (NSWI095), and Intro to Probability (NMAI059) or their equivalents, even though all the probability theory needed will be re-explained. Knowledge of Perl, or a willingness to learn its basics as you go (and on your own), is also important. One benefit of the course is that it is given in English: this should help you read current NLP literature more smoothly, since that literature is almost exclusively in English. Czech terminology will be explained for those interested.

The material covered in this course is selected so that, on completing it, you should be able to understand papers in the field of Natural Language Processing; it should also make your life easier when taking more advanced courses, either at UFAL MFF UK or elsewhere.

No background in NLP is necessary.




Manning, C. D. and H. Schütze: Foundations of Statistical Natural Language Processing. The MIT Press. 1999. ISBN 0-262-13360-1.

Eight copies of this book are available at the CS library for borrowing. Please be considerate to other students and do not keep the book(s) longer than absolutely necessary.

Recommended & Reference Readings:

Jurafsky, D. and J. H. Martin: Speech and Language Processing. Prentice-Hall. 2000. ISBN 0-13-095069-6.

Three copies of Jurafsky's book are available at UFAL's library.

Wall, L., Christiansen, T. and R. L. Schwartz: Programming PERL, 3rd ed. O'Reilly. 1996. ISBN 0-596-00027-8.

Allen, J.: Natural Language Understanding. The Benjamin/Cummings Publishing Company Inc. 1994. ISBN 0-8053-0334-0.

Cover, T. M. and J. A. Thomas: Elements of Information Theory. Wiley. 1991. ISBN 0-471-06259-6.

Charniak, E.: Statistical Language Learning. The MIT Press. 1996. ISBN 0-262-53141-0.

Jelinek, F.: Statistical Methods for Speech Recognition. The MIT Press. 1998. ISBN 0-262-10066-5.

Four copies of Jelinek's book are available at UFAL's library, but they are primarily reserved for those taking Nino Peterek's and/or Filip Jurcicek's courses.


Proceedings of major conferences (related to Natural Language Processing):

Some of the Proceedings are available at UFAL's library, physically and/or in electronic form. Most of them, however, are freely available through the ACL Anthology, including all volumes of the Computational Linguistics journal and the new Transactions of the ACL journal.

Other Resources:

  • CLSP Workshops: Language Engineering for Students and Professionals Integrating Research and Education


Assignments & Due Dates

Unix lab accounts

For MFF UK students, please see For others, please visit

Turning in the Assignments


  • Use a separate directory for each assignment. Create a main web page called index.html or index.htm in that directory. Create as many other web pages as necessary. Put all the other necessary files (.ps files, pictures, source code, ...) into the same directory and make relative links to them from your main or other linked web pages. If you use "content creation" tools related to MSFT software, please make sure that the file names in your links match the actual file names in case (uppercase/lowercase).


  • Pack everything into a single .tgz file:

    tar -czvf ~/username.assignx.tgz ./*


  • Send the resulting file by e-mail (as an attachment) to

    with a subject line giving your name and the assignment number, for example:

    Subject: Jan.Novak 2

    for Jan Novák, turning in the second assignment.
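
For those who prefer to script the packing step, the following is a minimal Python sketch of what the tar command above does, plus a check that the main page exists. It is only an illustration, not a required tool: the function name pack_assignment, its parameters, and the reading of "assignx" as "assign" followed by the assignment number are all assumptions.

```python
import tarfile
from pathlib import Path


def pack_assignment(src_dir, username, number, out_dir):
    """Pack the contents of src_dir into <out_dir>/<username>.assign<number>.tgz.

    Hypothetical helper: the names and the archive naming are assumptions
    based on the tar command shown in the instructions.
    """
    src = Path(src_dir)
    # The instructions ask for a main page called index.html or index.htm.
    if not any((src / name).is_file() for name in ("index.html", "index.htm")):
        raise FileNotFoundError("no index.html or index.htm in " + str(src))

    # Like the shell glob ./*, skip hidden top-level entries.
    entries = sorted(e for e in src.iterdir() if not e.name.startswith("."))

    archive = Path(out_dir) / f"{username}.assign{number}.tgz"
    with tarfile.open(str(archive), "w:gz") as tar:
        for entry in entries:
            # Directories are added recursively; paths stay relative to src.
            tar.add(str(entry), arcname=entry.name)
    return archive
```

As with the tar command, the archive should be written outside the assignment directory (for example, to your home directory) so that it does not end up inside itself on a later run.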


No plagiarism will be tolerated. The assignments are to be worked on individually; please respect this. If the instructor finds substantial similarities between submissions, beyond what could plausibly arise by chance, he will call the two (or more) students to explain them and possibly to take an immediate test (or assignment, at the discretion of the instructor, not to exceed four hours of work) to determine each student's abilities related to the offending work. *All* cases of confirmed plagiarism will be reported to the Student Office.


For each day your submission is late, 5 points will be subtracted from the points awarded to the solution or a part of it, up to a maximum of 50 points per homework. Submissions received less than 4 weeks before the closing date of the term will not be graded and will be awarded 0 points.
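
To make the late-submission arithmetic concrete, here is a small Python sketch of the rule as stated. Several details are assumptions rather than course policy: lateness is counted in whole calendar days, the result is not allowed to drop below zero, and the function name adjusted_points and the term closing date used in the example are hypothetical.

```python
from datetime import date, timedelta

PENALTY_PER_DAY = 5  # points subtracted per late day (from the rule above)
MAX_PENALTY = 50     # maximum penalty per homework (from the rule above)


def adjusted_points(awarded, due, received, term_close):
    """Return the points left after the late penalty (hypothetical helper)."""
    # Submissions received less than 4 weeks before the term closes get 0.
    if term_close - received < timedelta(weeks=4):
        return 0
    # Assumption: lateness is counted in whole calendar days.
    days_late = max((received - due).days, 0)
    penalty = min(PENALTY_PER_DAY * days_late, MAX_PENALTY)
    # Assumption: the result never goes below zero.
    return max(awarded - penalty, 0)


# Example with a made-up closing date: 3 days late costs 15 points.
print(adjusted_points(80, date(2018, 2, 28), date(2018, 3, 3), date(2018, 9, 30)))
```

With these assumptions, a solution awarded 80 points and handed in 3 days late keeps 65 points, and anything 10 or more days late loses the full 50.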

The Assignments

No.  Course   Due date           Task                                     Resources
#1   NPFL067  February 28, 2018  Exploring Entropy and Language Modeling  TEXTEN1.txt (large!), TEXTCZ1.txt (large!)
#2   NPFL068  June 30, 2018      Word Classes                             TEXTEN1.txt (large!), TEXTCZ1.txt (large!), TEXTEN1.ptg (large!), TEXTCZ1.ptg (large!)
#3   NPFL068  July 31, 2018      Tagging                                  texten2.ptg (large!), textcz2.ptg (large!)


Exam                Date, Time                  Where
NPFL067             Jan. 16, 2018, 12:20-13:30  S1
NPFL068             May 22, 2018, 9:00-10:45    S1
NPFL068 (2nd date)  tbd                         tbd

Both the mid-term and the final exams will be written (not oral), with about 6 major questions and some subquestions. You will have 60 minutes for the mid-term (fall semester or "ZS", NPFL067), and up to 90 minutes for the final exam (i.e., spring semester or "LS", NPFL068) to write down the answers.

To get an idea of the type of exam questions, please see the questionnaire from a previous year's final exam (Questionnaire).

As stated above, your final grade (or pass/fail for PhD students) will be determined by both the final exam and your assignment results: in a 50:50 ratio for NPFL067, and in a 1:1:1 ratio (in other words, roughly 33:33:33) for NPFL068.
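
As a rough illustration of these ratios, the Python sketch below treats every component as equally weighted. It rests on assumptions that the text does not spell out: that the three parts for NPFL068 are the exam and the two NPFL068 assignments, and that all components are scored on the same scale (0-100 is used here). The function name combined_score is hypothetical, and the mapping from the combined score to a grade is not covered.

```python
def combined_score(exam, assignments):
    """Equal-weight average of the exam and each assignment.

    Hypothetical helper; assumes all scores share one scale (e.g. 0-100).
    One assignment gives the 50:50 case, two assignments give 1:1:1.
    """
    parts = [exam, *assignments]
    return sum(parts) / len(parts)


# NPFL067: exam and one assignment, weighted 50:50.
print(combined_score(70, [90]))
# NPFL068: exam and two assignments, weighted 1:1:1.
print(combined_score(60, [90, 75]))
```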

In special circumstances (long-term absence etc.), some other schedule and grading scheme could be worked out individually, but please try hard to hand in all assignments on time and to come to the final exam on the regular date.

NPFL067 Grades

The official, 'usual' grading table is now available here. You will need a username and password to access it - I will email them to you.

NPFL068 Grades

The official, 'usual' grading table is now available here. You will need a username and password to access it - the same as above (they have been mailed to you).


The web pages from 2012/2013, including grading (password needed as usual) are available at

The original web pages for this course are also still active at