Every morphological analyzer must somehow encode two things: the lexicon, and the morphological rules. Of these two, the lexicon is more difficult to obtain. It is not just a list of words in the language. For each word it must also encode its category, i.e. part of speech and inflectional type (paradigm).
In this lab exercise, we will experiment with heuristics that may be helpful in rapid development of morphological analyzers for new languages. We will try to automatically categorize words found in corpora (both raw corpora and tagged corpora).
Our English data comes without part-of-speech tags, but it is not really raw text. It comes from the Penn Treebank / Wall Street Journal. I removed the part-of-speech tags and the syntactic annotation but I kept the tokenization (i.e. punctuation symbols are not stuck to the neighboring words). Furthermore, the text contains traces from the syntactic annotation that do not correspond to any surface word. Open a console/terminal window, then download and unpack the data to your machine:
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/wsj.txt.gz
gunzip wsj.txt.gz
Hint: Are you bored by getting results that are already described on this page? Try another English corpus and see how the results differ. Here is how you get text from the English Web Corpus (the UD_English-EWT treebank of Universal Dependencies):
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/ewt.txt.gz
gunzip ewt.txt.gz
Look what is inside (press "q" to quit):
less wsj.txt
Transform the text so that every word is on a separate line:
cat wsj.txt | perl -CDS -pe 's/\s+/\n/g' > wsj-wpl.txt
less wsj-wpl.txt
Note: There are of course numerous possible ways to achieve our goals. You can use your own favorite method. The examples here heavily rely on the Perl scripting language, and filtering of the lists is done using Perl regular expressions. You can find tons of documentation on Perl RE on the web, e.g. perlretut at perldoc.
Count the occurrences of every word and create a list of unique words with frequencies:
cat wsj-wpl.txt | perl -CDS -e 'while(<>) { chomp; $h{$_}++ } @k = sort {$h{$b} <=> $h{$a}} keys(%h); foreach $w (@k) { print("$w\t$h{$w}\n") }' > wsj-freq.txt
less wsj-freq.txt
Remove the traces from the Penn Treebank, i.e. remove all words containing the "*" character:
cat wsj-freq.txt | grep -vP '\*' > wsj-01.txt
less wsj-01.txt
Some words are capitalized because they occurred in a sentence-initial position. We do not want to count The and the as two distinct word types, so we may want to lowercase all words before adding them to the list. That of course means that we also lose the possibility to detect proper nouns, which would be useful too. But detecting them would be more difficult, so let's just ignore proper nouns here and lowercase everything. The following modification of the above commands (note the lc function in the Perl code) will do the trick.
cat wsj-wpl.txt | perl -CDS -e 'while(<>) { chomp; $h{lc($_)}++ } @k = sort {$h{$b} <=> $h{$a}} keys(%h); foreach $w (@k) { print("$w\t$h{$w}\n") }' > wsj-freq.txt
cat wsj-freq.txt | grep -vP '\*' > wsj-01.txt
We can use the wc command to count the words on the list:
cat wsj-01.txt | wc -l
43764
For our morphological lexicon we are interested in real words, not numbers and not punctuation symbols. Remove words that contain any punctuation or digit. Remember that our current list also contains frequencies, i.e. every line contains at least one digit; that is why the second filter is more complex and looks for digits only in the first column. The first filter explicitly lists the grave accent ("`") and the dollar sign ("$") because Unicode classifies them as symbols rather than punctuation, so \pP alone would not match them.
cat wsj-01.txt | grep -vP '[\pP\`\$]' > wsj-02.txt
cat wsj-02.txt | grep -vP '\d.*\t' > wsj-03.txt
You can use the diff command to check what words were removed between two versions of the list. Note that we actually removed abbreviations (because they contain the period) and compounds with hyphen (e.g. third-quarter).
diff wsj-01.txt wsj-03.txt | grep -P '^<' | less
Now look at the list (less wsj-03.txt). Many of the most frequent words belong to closed classes. For these it may be easier to just enumerate them manually (of course only if we know enough about the target language to identify them!). Look for pronouns, determiners, numerals, auxiliary verbs, pronominal adverbs, prepositions, conjunctions, particles.
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/en-closed-class-list.txt
perl -CDS -e 'open(CCL, "en-closed-class-list.txt"); while(<CCL>) { chomp; $ccl{lc($_)}++ } while(<>) { ($w, $n) = split("\t"); next if exists $ccl{$w}; print }' < wsj-03.txt > wsj-04.txt
The file wsj-04.txt contains over 30K open-class words. There are nouns, verbs, adjectives and adverbs. Can we tell them apart and identify their base forms? Without manually tagging each occurrence of each word? The answer is yes—partially. Knowing how English grammar works, we can find words that follow typical behavior of nouns, verbs etc. For example, if we see both book and books, we can deduce that either book is a singular noun and books is the corresponding plural form, or book is a verb and books is its 3rd person singular present form. We will miss many words that did not occur in both forms in our corpus. But we still have a good chance of identifying thousands of words which are very likely to be either nouns or verbs (or both).
You may write the program that finds pairs like book-books in your favorite programming language, or you may download the following Perl script, en-lexicon-patterns.pl, inspect it and modify it for the subsequent tasks.
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/en-lexicon-patterns.pl
chmod 755 en-lexicon-patterns.pl
cat wsj-04.txt | perl en-lexicon-patterns.pl > wsj-pairs-05.txt
cat wsj-pairs-05.txt | wc -l
4448
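If you decide to write the pairing program yourself instead of using en-lexicon-patterns.pl, the following minimal sketch shows one way to start (it only checks the bare -s pattern; the downloadable script may use additional patterns, such as -es, -ies, -ed or -ing, and a different output format):

#!/usr/bin/env perl
# Sketch: find candidate pairs like "book" / "books".
# Input: one "word<TAB>frequency" record per line (the wsj-04.txt format).
# Output: tab-separated candidate pairs (base form, form with -s).
use strict;
use warnings;
use open qw(:std :utf8);

my %freq;
while (<>) {
    chomp;
    my ($word, $n) = split(/\t/);
    $freq{$word} = $n;
}

foreach my $word (sort keys %freq) {
    # If both "book" and "books" occur, "book" is probably either a singular
    # noun (and "books" its plural) or a verb (and "books" its 3rd person
    # singular present form).
    if (exists $freq{$word . 's'}) {
        print "$word\t${word}s\n";
    }
}

Save it, e.g., as my-pairs.pl (the name is just a suggestion) and run it the same way as above: cat wsj-04.txt | perl my-pairs.pl | less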
Sometimes a tagged corpus is available but the morphological analyzer is not, and we have to build it ourselves. We can use the tags to determine the part of speech of each word. However, we also need to separate words of different inflection classes: our MA lexicon has to know the inflection class for each word.
Our Czech data comes from two treebanks, PDT and CAC, together comprising about 2M words. Every word appears on a separate line, empty lines delimit sentences. Non-empty lines always contain the word, then a TAB character (referred to by \t in regular expressions) and the morphological tag, which is usually a string of 15 characters. See here for documentation of the tagset. In our experiments we will pretend that the underlying MA is not available although it actually can be downloaded and is also available as a web service. You can try the analysis online and there is also a reversed interface where you can enter a lemma and generate all forms with tags.
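For illustration, a non-empty line of the file could look like this (a made-up example, not copied from the corpus; consult the tagset documentation for the meaning of the individual tag positions):

vláda	NNFS1-----A----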
As with the English data, we will download and unpack the corpus. Then we will remove the empty lines between sentences and convert the text to a list with frequencies. The unit of the list is now not just the word, but a word-tag pair. We will lowercase the words but not the tags.
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/cs-tagged.txt.gz
gunzip cs-tagged.txt.gz
cat cs-tagged.txt | grep -vP '^\s*$' | grep -vP '[\d\pP\`\$\|].*\t' > cs-tagged-nempty.txt
cat cs-tagged-nempty.txt | perl -CSD -e 'while(<>) { chomp; ($w, $t) = split(/\t/); $h{lc($w)."\t$t"}++ } @k = sort {$h{$b} <=> $h{$a}} keys(%h); foreach $w (@k) { print("$w\t$h{$w}\n") }' > cs-tagged-freq.txt
Now let's look at Czech nouns. Each of the three genders inflects differently, so let's focus on just one gender, the feminine nouns. Filtering them is easy as the tags encode the gender in their third character. We will also require the last (fifteenth) character to be "-", which should rule out abbreviations and non-standard forms.
cat cs-tagged-freq.txt | grep -P '\tNNF...........-\t' > cs-nf-01.txt
Unfortunately, the gender is not enough to determine the inflection class of a noun. Czech feminine nouns are divided into four main inflection types, traditionally identified by model nouns: žena (woman), růže (rose), píseň (song) and kost (bone). Here is an overview (the seven suffixes in each column are the endings of cases 1–7, nominative through instrumental; 0 marks a zero ending):
Model | Singular suffixes | Plural suffixes    | Examples
žena  | a y e u o e ou    | y 0 ám y y ách ami | Praha, koruna, doba, strana, vláda
růže  | e e i i e i í     | e í ím e e ích emi | práce, země, informace, situace, akcie
píseň | 0 e i 0 i i í     | e í ím e e ích emi | úroveň, soutěž, daň, zbraň, Plzeň
kost  | 0 i i 0 i i í     | i í em i i ech mi  | společnost, oblast, činnost, možnost, souvislost
We can focus on the base forms, i.e. nominative singular, and sort them according to the final letters:
cat cs-nf-01.txt | grep -P 'a\tNNFS1' > cs-nf-02-zena.txt
cat cs-nf-01.txt | grep -P '[eě]\tNNFS1' > cs-nf-02-ruze.txt
cat cs-nf-01.txt | grep -P '[jňřxž]\tNNFS1' > cs-nf-02-pisen.txt
cat cs-nf-01.txt | grep -P '[bmst]\tNNFS1' > cs-nf-02-kost.txt
The following letters almost never occur at the end of domestic feminine lemmas (one known exception being paní (lady)): d, é, f, g, h, i, í, k, n, o, ó, p, q, r, u, ú, ů, w, y, ý. If they appear in loanwords or foreign names, the noun probably does not inflect in Czech. However, there are still a number of consonants that do not disambiguate between the píseň and kost classes:
cat cs-nf-01.txt | grep -P '[cčďlšťvz]\tNNFS1' > cs-nf-02-pisen-kost.txt
If we want to add these to the lexicon, we have to look for their non-base forms. Moreover, we may want to search the non-base forms for nouns that never occurred in the nominative singular.
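One possible heuristic (a rough sketch, not the only option): according to the table above, the genitive singular ends in -e for the píseň type but in -i for the kost type. We can therefore take each ambiguous nominative singular and check whether the corpus contains a corresponding genitive singular form (tag starting with NNFS2). The naive concatenation lemma+ending ignores spelling alternations (ď+e is written dě, ť+e is tě) and the fleeting -e- in words like mrkev (genitive mrkve), so some nouns will remain undecided; the output file name below is just a suggestion.

perl -CSD -Mutf8 -e 'open(AMB, "cs-nf-02-pisen-kost.txt"); while(<AMB>) { ($l) = split(/\t/); push(@amb, $l) } open(ALL, "cs-tagged-freq.txt"); while(<ALL>) { ($f, $t) = split(/\t/); $gen{$f}++ if $t =~ m/^NNFS2/ } foreach $l (@amb) { if($gen{$l."i"}) { print("$l\tkost\n") } elsif($gen{$l."e"}) { print("$l\tpíseň\n") } }' > cs-nf-03-guess.txt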
Pick a language, get a corpus, extract the lexicon—as large and as good as possible. Focus on the principal open classes: nouns, verbs, adjectives and adverbs. Remember that the purpose of the resulting lexicon is to be used as a part of a morphological analyzer/generator (and you will probably want to use this lexicon when creating your own analyzer for Homework 2). We will cover the details of such a system later but here we can at least summarize the important properties that the lexicon should have.
I provide corpora for a few languages below in this section. If you prefer to work with another language, you can do so, but you have to obtain the corpus on your own (talk to me first, though – maybe I can help with the data).
Submission: Your solution should contain:
1. The file with the resulting lexicon.
2. The script(s) necessary to create the lexicon from the corpus, ideally including a Makefile with the commands needed to run the scripts and regenerate the lexicon on a Linux system (the path to the input corpus can be configurable in the Makefile but it should not be hardcoded in the scripts).
3. A documentation/report file: what language you work with, what the input looks like (plain text or tagged), what part of the language you covered, what heuristics you used, and any other interesting observations you made. Also document how your scripts are invoked, unless you provided a clearly readable and commented Makefile. If you chose to work with your own data, briefly describe where the data comes from.
Zip all these files as hw1.zip and send them by e-mail to zeman@ufal.mff.cuni.cz.
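If you have never written a Makefile, something along these lines is enough (a sketch only; extract_lexicon.pl and the file names are placeholders for your own scripts, and recipe lines must be indented with a tab):

# Path to the input corpus; override it on the command line if needed:
#   make CORPUS=/path/to/corpus.txt
CORPUS = wsj.txt

lexicon.txt: $(CORPUS) extract_lexicon.pl
	perl extract_lexicon.pl < $(CORPUS) > lexicon.txt

clean:
	rm -f lexicon.txt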