Students should have a substantial programming experience in either C, C++, Java and/or Perl, and have preferably taken Data Structures (NTIN060), Unix (NSWI095), and Intro to Probability (NMAI059) or their equivalents, even though all the probability theory needed will be re-explained. Knowledge of, or willingness to learn the basics of Perl as-you-go (and on your own) is also important. One of the benefits of the course is that it is given in English; it should enable you to read current literature on NLP more smoothly, since the literature is almost exclusively in English. Czech terminology will be explained for those interested.
The material covered in this course is selected in such a way that at its completion you should be able to understand papers in the field of Natural Language Processing, and it should also make your life easier when taking more advanced courses either at UFAL MFF UK or elsewhere.
No background in NLP is necessary.
The original web pages for this course are also still active at http://www.cs.jhu.edu/~hajic/courses/cs465/syllabus.html.
Deadline: Jan 31 23:59 100 points
For all parts of this homework, work alone. On top of the results/requirements specific to a certain part of the homework, turn in all of your code, commented in such a way that it is possible to determine what, how and why you did what you did solely from the comments, and a discussion/comments of/on the results (in a plain text/html) file. Technically, follow the usual pattern (see below):
In this task you will do a simple exercise to find out the best word association pairs using the pointwise mutual information method.
First, you will have to prepare the data: take the same texts as in the previous assignment, i.e.
(For this part of Assignment 2, there is no need to split the data in any way.)
Compute the pointwise mutual information for all the possible word pairs appearing consecutively in the data, disregarding pairs in which one or both words appear less than 10 times in the corpus, and sort the results from the best to the worst (did you get any negative values? Why?) Tabulate the results, and show the best 20 pairs for both data sets.
Do the same now but for distant words, i.e. words which are at least 1 word apart, but not farther than 50 words (both directions). Again, tabulate the results, and show the best 20 pairs for both data sets.
These are your data. They are almost the same as the .txt data you have used so far, except they now contain the part of speech tags in the following form:
where the tag is separated from the word by a slash ('/'). Be careful: the tags might contain everything (including slashes, dollar signs and other weird characters). It is guaranteed however that there is no slash-word.
Similarly for the English texts (except the tags are shorter of course).
Compute a full class hierarchy of words using the first 8,000 words of those data, and only for words occurring 10 times or more (use the same setting for both languages). Ignore the other words for building the classes, but keep them in the data for the bigram counts. For details on the algorithm, use the Brown et al. paper distributed in the class; some formulas are wrong, however, so please see the corrections on the web (Class 12, formulas for Trick #4). Note the history of the merges, and attach it to your homework. Now run the same algorithm again, but stop when reaching 15 classes. Print out all the members of your 15 classes and attach them too.
The initial mutual information is (English, words, limit 8000):
4.99726326162518 (if you add one extra word at the beginning of the data)
4.99633675507535 (if you use the data as they are and are carefull at the beginning and end).
NB: the above numbers are finally confirmed from an independent source :-).
The first 5 merges you get on the English data should be:
The loss of Mutual Information when merging the words
Minimal loss: 0.00219656653357569 for case+subject
Use the same original data as above, but this time, you will compute the classes for tags (the strings after slashes). Compute tag classes for all tags appearing 5 times or more in the data. Use as much data as time allows. You will be graded relative to the other student's results. Again, note the full history of merges, and attach it to your homework. Pick three interesting classes as the algorithm goes (English data only; Czech optional), and comment on them (why you think you see those tags there together (or not), etc.).
Deadline: Jan 31 23:59 100 points
For all parts of this homework, work either alone or in a group of max. two people (identical grade will be assigned to both of you in such a case - thus please make sure you understand what your colleague is doing, and that s/he is doing it right!). On top of the results/requirements specific to a certain part of the homework, turn in all of your code, commented in such a way that it is possible to determine what, how and why you did what you did solely from the comments, and a discussion/comments of/on the results (in a plain text/html) file. Technically, follow the usual pattern (see the Syllabus): For this whole homework, use data found in
In the following, "the data" refers to both English and Czech, as usual.
Split the data in the following way: use last 40,000 words for testing (data S), and from the remaining data, use the last 20,000 for smoothing (data H, if any). Call the rest "data T" (training).
Download Eric Brill's supervised tagger from UFAL's course assignment space. Install it (i.e., uncompress (gunzip), untar, and make).
You might need to make some changes in his makefile of course (it's and OLD program, in this fast changing world...).
After installation, get the data, train it on as much data from T as time allows (in the package, there is an extensive documentation on how to train it on new data), and evaluate on data S. Tabulate the results.
Do cross-validation of the results: split the data into S', [H',] T' such that S' is the first 40,000 words, and T' is the last but the first 20,000 words from the rest. Train Eric Brill's tagger on T' (again, use as much data as time allows) and evaluate on S'. Again, tabulate the results.
Do three more splits of your data (using the same formula: 40k/20k/the rest) in some way or another (as different as possible), and get another three sets of results. Compute the mean (average) accuracy and the standard deviation of the accuracy. Tabulate all results.
Use the datasets T, H, and S. Estimate the parameters of an HMM tagger using supervised learning off the T data (trigram and lower models for tags). Smooth (both the trigram tag model as well as the lexical model) in the same way as in Homework No. 1 (use data H). Evaluate your tagger on S, using the Viterbi algorithm.
Now use only the first 10,000 words of T to estimate the initial (raw) parameters of the HMM tagging model. Strip off the tags from the remaining data T. Use the Baum-Welch algorithm to improve on the initial parameters. Smooth as usual. Evaluate your unsupervised HMM tagger and compare the results to the supervised HMM tagger.
Tabulate and compare the results of the HMM tagger vs. the Brill's tagger.
Create a separate directory
assign for your submission. Create a main web page called
index.htm in that directory. Create as many other web pages as necessary. Put all the other necessary files (.ps and .pdf files, pictures, source code, ...) into the same directory and make relative links to them from your main or other linked web pages. If you use some "content creation" tools related to MSFT software please make sure the references use the correct case (matching uppercase/lowercase).
Pack everything into a single
tar -czvf ~/FirstName.LastName.assignX.tgz ./*
e.g. for Jan Novák turning in Assignment !
tar -czvf ~/Jan.Novak.assign1.tgz ./*
Send the resulting file by e-mail (as an attachment) to
with the following subject line:
Subject: FirstName.LastName NPFL068 Assignment X
e.g. for Jan Novák
Subject: Jan.Novak NPFL068 Assignment X
For MFF UK students, please see http://www.ms.mff.cuni.cz/labs/unix. For others, please visit http://www.ms.mff.cuni.cz/students/externisti.html.
The grading table is available in SIS.
The final grade (or pass/fail for PhD students) will be determined by both the final exam and your assignment results in a 1:1:1 ratio.
To get an idea of the type of exam questions, please see the questionaire for one of the previous year's final exam (Questionnaire).
No plagiarism will be tolerated. The assignments are to be worked on on your own; please respect it. If the instructor determines that there are substantial similarities exceeding the likelihood of such an event, he will call the two (or more) students to explain them and possibly to take an immediate test (or assignment, at the discretion of the instructor, not to exceed four hours of work) to determine the student's abilities related to the offending work. All cases of confirmed plagiarism will be reported to the Student Office.
For each day your submission is late, 5 points will be subtracted from the points awarded to the solution or a part of it, up to max. of 50 points per homework.
Submissions received less then 4 weeks before the closing date of the term will not be graded and will be awarded 0 points.
Manning, C. D. and H. Schütze. MIT Press. 1999. ISBN 0-262-13360-1.
Eight copies of this book are available at the CS library for borrowing. Please be considerate to other students and do not keep the book(s) longer than absolutely necessary.
Jurafsky, D. and J. H. Martin. Prentice-Hall. 2000. ISBN 0-13-095069-6.
Three copies of Jurafsky's book are available at UFAL's library.
Wall, L., Christiansen, T. and R. L. Schwartz. O'Reilly. 1996. ISBN 0-596-00027-8.
(Sorry no large cover picture available.)
Allen, J.. Benajmins/Cummings Publishing Company 1994. ISBN 0-8053-0334-0.
Cover, T. M. and J. A. Thomas. Wiley. 1991. ISBN 0-471-06259-6.
Charniak, E. MIT Press. 1996. ISBN 0-262-53141-0.
Jelinek, F. MIT Press. 1998. ISBN 0-262-10066-5.
Four copies of Jelinek's book are available at UFAL's library, but they are primarily reserved for those taking Nino Peterek's and/or Filip Jurcicek's courses.
Some of the Proceedings are available at UFAL's library, physically and/or in electronic form. Most of them are, however, freely available through the ACL Anthology, including all volumes of the Computational Linguistics journal and the new Transactions of the ACL journal.
CLSP Workshops: Language Engineering for Students and Professionals Integrating Research - and Education
Eric Brill's short guide to Perl.