TA: Gideon S. Mann
In this task you will do a simple exercise to find the best word association pairs using the pointwise mutual information method.
First, you will have to prepare the data: take the same texts as in the previous assignment, i.e.
barley:~hajic/cs465/TEXTEN1.txt and barley:~hajic/cs465/TEXTCZ1.txt
(For this part of Assignment 2, there is no need to split the data in any way.)
Compute the pointwise mutual information for all word pairs that appear consecutively in the data, disregarding pairs in which one or both words appear fewer than 10 times in the corpus. Sort the results from best to worst. (Did you get any negative values? Why?) Tabulate the results and show the best 20 pairs for both data sets.
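As a rough illustration of the computation described above, the following Python sketch estimates the PMI of a consecutive pair as log2( p(w1,w2) / (p(w1) p(w2)) ). The function name adjacent_pmi, the min_count parameter, and the choice to estimate pair probabilities over the N-1 adjacent positions are assumptions made for this example, not part of the assignment.

```python
import math
from collections import Counter

def adjacent_pmi(words, min_count=10):
    """Return [(pmi, w1, w2), ...] for consecutive word pairs, best first.

    Assumption: p(w) = c(w) / N and p(w1, w2) = c(w1, w2) / (N - 1),
    where N is the number of tokens in `words`.
    """
    n = len(words)
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_pairs = n - 1
    results = []
    for (w1, w2), c12 in bigrams.items():
        # Skip pairs in which either word is too rare.
        if unigrams[w1] < min_count or unigrams[w2] < min_count:
            continue
        p12 = c12 / n_pairs
        p1 = unigrams[w1] / n
        p2 = unigrams[w2] / n
        results.append((math.log2(p12 / (p1 * p2)), w1, w2))
    results.sort(reverse=True)  # highest PMI first
    return results
```

Calling adjacent_pmi(words)[:20] on the token list of one text would then give the 20 best pairs under these assumptions.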
Now do the same for distant words, i.e. words which are at least 1 word apart but no farther than 50 words apart (in both directions). Again, tabulate the results and show the best 20 pairs for both data sets.
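The distant-pair variant can be sketched along the same lines. The wording "at least 1 word apart" is read here as a position difference of at least 2 (one word in between); that reading, the function name distant_pmi, its parameters, and the normalization of pair counts by the total number of ordered position pairs in the window are all assumptions of this example. Set min_dist=1 if you interpret the distance differently.

```python
import math
from collections import Counter

def distant_pmi(words, min_dist=2, max_dist=50, min_count=10):
    """Return [(pmi, w1, w2), ...] for words min_dist..max_dist positions apart.

    Each co-occurrence is counted in both directions, so the pair
    counts are symmetric.
    """
    n = len(words)
    unigrams = Counter(words)
    pairs = Counter()
    for i, w1 in enumerate(words):
        if unigrams[w1] < min_count:
            continue
        last = min(i + max_dist, n - 1)
        for j in range(i + min_dist, last + 1):
            w2 = words[j]
            if unigrams[w2] < min_count:
                continue
            pairs[(w1, w2)] += 1
            pairs[(w2, w1)] += 1
    # Total number of ordered position pairs inside the window,
    # including those that were filtered out above.
    total = 2 * sum(max(n - d, 0) for d in range(min_dist, max_dist + 1))
    results = []
    for (w1, w2), c12 in pairs.items():
        p12 = c12 / total
        p1 = unigrams[w1] / n
        p2 = unigrams[w2] / n
        results.append((math.log2(p12 / (p1 * p2)), w1, w2))
    results.sort(reverse=True)
    return results
```

The inner loop makes this quadratic in the window size, which is acceptable for texts of this length but is not the only way to organize the counting.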
barley:~hajic/cs465/TEXTEN1.ptg
barley:~hajic/cs465/TEXTCZ1.ptg
These are your data. They are almost the same as the .txt data you have used so far, except that each word is now followed by its part-of-speech tag, in the following form:
rady/NNFS2-----A----
,/Z:-------------
where the tag is separated from the word by a slash ('/'). Be careful: the tags may contain anything (including slashes, dollar signs and other weird characters). It is guaranteed, however, that no word is itself a slash.
The English texts use the same format, except that the tags are of course shorter.
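A minimal Python sketch of reading this format follows. It assumes that tokens are separated by whitespace and that the first slash in a token is the word/tag separator (tags may contain further slashes); the names split_token and read_tagged are made up for this example.

```python
def split_token(token):
    """Split 'word/TAG' at the FIRST slash; the tag may contain more slashes."""
    word, sep, tag = token.partition("/")
    if not sep:
        raise ValueError(f"token has no word/tag separator: {token!r}")
    return word, tag

def read_tagged(text):
    """Return a list of (word, tag) pairs from whitespace-separated tokens."""
    return [split_token(token) for token in text.split()]
```

For example, split_token("rady/NNFS2-----A----") yields the pair ("rady", "NNFS2-----A----"). Splitting at the last slash instead would break on tags that themselves contain a slash.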
The initial mutual information is (English, words, limit 8000):
4.99726326162518
(if you add one extra word at the beginning of the data)
4.99633675507535
(if you use the data as they are and are careful at the beginning and end).
NB: the above numbers are finally confirmed from an independent source :-).
The first 5 merges you get on the English data should be:
case subject
cannot may
individuals structure
It there
even less
The loss of Mutual Information when merging the words "case" and "subject":
Minimal loss: 0.00219656653357569 for case+subject
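For orientation, here is a naive Python sketch of the two quantities quoted above: the mutual information of the bigram distribution, and the loss incurred by merging two words into one class. It simply recomputes the mutual information after the merge, which is far too slow for a full run over all candidate pairs. It makes no attempt to reproduce the reference numbers, which depend on the data limit and on how the beginning and end of the data are handled. The function names and the use of bigram marginals for the left and right probabilities are assumptions of this example.

```python
import math
from collections import Counter

def bigram_mutual_information(words):
    """Sum over bigrams of p(l, r) * log2( p(l, r) / (p_left(l) * p_right(r)) )."""
    bigrams = Counter(zip(words, words[1:]))
    total = sum(bigrams.values())
    left, right = Counter(), Counter()
    for (l, r), c in bigrams.items():
        left[l] += c   # count of l in the left position
        right[r] += c  # count of r in the right position
    mi = 0.0
    for (l, r), c in bigrams.items():
        mi += (c / total) * math.log2(c * total / (left[l] * right[r]))
    return mi

def merge_loss(words, w1, w2):
    """Mutual information lost when w1 and w2 are replaced by a single class."""
    merged_name = w1 + "+" + w2
    merged = [merged_name if w in (w1, w2) else w for w in words]
    return bigram_mutual_information(words) - bigram_mutual_information(merged)
```

Under this sketch, the pair to merge at each step is the one with the smallest merge_loss value among the candidates.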