Assignment #2: PFL067 Statistical NLP

Words and The Company They Keep

Instructor: Jan Hajic <hajic@ufal.mff.cuni.cz>
TA: Pavel Pecina <pecina@ufal.mff.cuni.cz>


Requirements

For all parts of this homework, work alone. In addition to the results and requirements specific to each part, turn in all of your code, commented in such a way that what you did, how, and why can be determined solely from the comments, together with a discussion of the results (in a plain-text or HTML file). Technically, follow the usual submission pattern (see the Syllabus).

1. Best Friends

In this task you will do a simple exercise to find the best word association pairs using the pointwise mutual information method.
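
For reference, the pointwise mutual information of a word pair (a, b) is

PMI(a, b) = log2( P(a, b) / ( P(a) * P(b) ) )

where the probabilities are maximum-likelihood estimates from the corpus counts (P(a, b) being the probability of the pair itself).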

First, you will have to prepare the data: take the same texts as in the previous assignment, i.e.

TEXTEN1.txt and TEXTCZ1.txt

(For this part of Assignment 2, there is no need to split the data in any way.)

Compute the pointwise mutual information for all possible word pairs appearing consecutively in the data, disregarding pairs in which one or both words appear fewer than 10 times in the corpus, and sort the results from best to worst. (Did you get any negative values? Why?) Tabulate the results, and show the best 20 pairs for both data sets.
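
A minimal sketch of this computation in Python follows; the file encoding and the exact tokenization (whitespace-separated tokens) are my assumptions, not part of the assignment.

import math
from collections import Counter

def consecutive_pmi(path, min_count=10, top=20):
    with open(path, encoding='iso-8859-2') as f:  # encoding is an assumption
        words = f.read().split()
    n = len(words)
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))      # consecutive pairs only
    results = []
    for (a, b), c_ab in bigrams.items():
        if unigrams[a] < min_count or unigrams[b] < min_count:
            continue                              # drop rare-word pairs
        # PMI = log2( P(a,b) / (P(a) * P(b)) ), all probabilities MLE.
        pmi = math.log2((c_ab / (n - 1)) / ((unigrams[a] / n) * (unigrams[b] / n)))
        results.append((pmi, a, b))
    results.sort(reverse=True)
    return results[:top]

for pmi, a, b in consecutive_pmi('TEXTEN1.txt'):
    print('%8.4f  %s %s' % (pmi, a, b))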

Now do the same for distant words, i.e. words that are at least 1 word apart but no farther than 50 words away (in both directions). Again, tabulate the results and show the best 20 pairs for both data sets.
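
The distant-pair variant only changes how pairs are counted. Here I read "at least 1 word apart" as at least one intervening word (position offsets 2 through 51); that reading, like the rest of the sketch, is an assumption.

import math
from collections import Counter

def distant_pmi(path, min_count=10, top=20):
    with open(path, encoding='iso-8859-2') as f:
        words = f.read().split()
    n = len(words)
    unigrams = Counter(words)
    pairs = Counter()
    for i in range(n):
        # Offsets 2..51 = 1 to 50 intervening words; counting the mirrored
        # pair as well covers "both directions".
        for d in range(2, 52):
            if i + d < n:
                pairs[(words[i], words[i + d])] += 1
                pairs[(words[i + d], words[i])] += 1
    total = sum(pairs.values())
    results = []
    for (a, b), c_ab in pairs.items():
        if unigrams[a] < min_count or unigrams[b] < min_count:
            continue
        pmi = math.log2((c_ab / total) / ((unigrams[a] / n) * (unigrams[b] / n)))
        results.append((pmi, a, b))
    results.sort(reverse=True)
    return results[:top]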

2. Word Classes

The Data

Get

TEXTEN1.ptg
TEXTCZ1.ptg

These are your data. They are almost the same as the .txt data you have used so far, except that they now contain part-of-speech tags in the following form:

rady/NNFS2-----A----
,/Z:-------------

where the tag is separated from the word by a slash ('/'). Be careful: the tags may contain anything (including slashes, dollar signs, and other weird characters). It is guaranteed, however, that the word itself never contains a slash.

The same holds for the English texts (except that the tags are shorter, of course).
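
Since only the tag can contain a slash, splitting each token at its first slash is safe. A tiny illustration (the function name is mine):

def split_token(token):
    # Split at the FIRST slash only; any further slashes belong to the tag.
    word, tag = token.split('/', 1)
    return word, tag

print(split_token('rady/NNFS2-----A----'))   # -> ('rady', 'NNFS2-----A----')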

The Task

Compute a full class hierarchy of words using the first 8,000 words of those data, and only for words occurring 10 times or more (use the same setting for both languages). Ignore the other words when building the classes, but keep them in the data for the bigram counts. For details of the algorithm, use the Brown et al. paper distributed in class; some of its formulas are wrong, however, so please see the corrections on the web (Class 12, formulas for Trick #4). Record the history of the merges and attach it to your homework. Then run the same algorithm again, but stop when you reach 15 classes. Print out all the members of your 15 classes and attach them as well.
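
To make the objective concrete, here is a deliberately naive sketch of the greedy merging loop: it recomputes the class-bigram mutual information from scratch for every candidate merge, which is hopelessly slow on real data; avoiding exactly this recomputation is what the paper's tricks (and the corrected Trick #4) are for. All names and the structure are mine, not Brown et al.'s.

import math
from collections import Counter

def bigram_mi(seq):
    # Mutual information (in bits) of the class-bigram distribution of seq.
    bigrams = Counter(zip(seq, seq[1:]))
    n = sum(bigrams.values())
    left, right = Counter(), Counter()
    for (a, b), c in bigrams.items():
        left[a] += c
        right[b] += c
    return sum((c / n) * math.log2(c * n / (left[a] * right[b]))
               for (a, b), c in bigrams.items())

def greedy_merges(words, mergeable, stop_at=1):
    # Every word starts in its own class; only `mergeable` words (those with
    # count >= 10 in the first 8,000 positions) are merge candidates, but all
    # words stay in the sequence for the bigram counts.
    cls = {w: w for w in set(words)}
    active = set(mergeable)
    history = []
    while len(active) > stop_at:
        seq = [cls[w] for w in words]
        base = bigram_mi(seq)
        best = None
        for a in sorted(active):
            for b in sorted(active):
                if a >= b:
                    continue
                loss = base - bigram_mi([b if c == a else c for c in seq])
                if best is None or loss < best[0]:
                    best = (loss, a, b)
        loss, a, b = best
        history.append((a, b, loss))       # record the merge and its MI loss
        for w in cls:
            if cls[w] == a:
                cls[w] = b
        active.discard(a)
    return history, cls

# Usage sketch: stop_at=1 gives the full hierarchy, stop_at=15 the 15-class
# run, e.g.: history, final = greedy_merges(words, frequent, stop_at=15)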

Hints:

The initial mutual information is (English, words, limit 8000):

4.99726326162518 (if you add one extra word at the beginning of the data)
4.99633675507535 (if you use the data as they are and are careful at the beginning and end).

NB: the above numbers have finally been confirmed by an independent source :-).

The first 5 merges you get on the English data should be:

case subject
cannot may
individuals structure
It there
even less

The loss of mutual information when merging the words "case" and "subject":

Minimal loss: 0.00219656653357569 for case+subject

3. Tag Classes

Use the same original data as above, but this time compute the classes for tags (the strings after the slashes). Compute tag classes for all tags appearing 5 times or more in the data. Use as much data as time allows; you will be graded relative to the other students' results. Again, record the full history of merges and attach it to your homework. Pick three interesting classes that emerge as the algorithm proceeds (English data only; Czech optional), and comment on them: why you think you see those tags together (or not), etc.
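
A sketch of the data preparation for this part, assuming one word/tag token per line as in the excerpt above (the file encoding is also an assumption):

from collections import Counter

with open('TEXTEN1.ptg', encoding='iso-8859-2') as f:
    # Keep the tag part of every token: everything after the FIRST slash.
    tags = [line.rstrip('\n').split('/', 1)[1] for line in f if '/' in line]

tag_counts = Counter(tags)
mergeable = {t for t, c in tag_counts.items() if c >= 5}
print(len(tag_counts), 'distinct tags,', len(mergeable), 'occurring 5+ times')

# `tags` and `mergeable` can then be fed to the same greedy merging code
# sketched in part 2.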