Lexicon Acquisition (lab)

Every morphological analyzer must somehow encode two things: the lexicon, and the morphological rules. Of these two, the lexicon is more difficult to obtain. It is not just a list of words in the language. For each word it must also encode its category, i.e. part of speech and inflectional type (paradigm).

In this lab exercise, we will experiment with heuristics that may be helpful in rapid development of morphological analyzers for new languages. We will try to automatically categorize words found in corpora (both raw corpora and tagged corpora).

English Untagged

Our English data comes without part-of-speech tags but it is not really raw text. It comes from the Penn Treebank / Wall Street Journal. I removed the part-of-speech tags and the syntactic annotation but kept the tokenization (i.e. punctuation symbols are not attached to the neighboring words). Furthermore, the text still contains traces from the syntactic annotation that do not correspond to any surface word. Open a console/terminal window, then download and unpack the data to your machine:

wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/wsj.txt.gz
gunzip wsj.txt.gz

Hint: Are you bored by getting results that are already described on this page? Try another English corpus and see how the results differ. Here is how you get text from the English Web Corpus (the UD_English-EWT treebank of Universal Dependencies):

wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/ewt.txt.gz
gunzip ewt.txt.gz

Look at what is inside (press "q" to quit):

less wsj.txt

Transform the text so that every word is on a separate line:

cat wsj.txt | perl -pe 's/\s+/\n/g' > wsj-wpl.txt
less wsj-wpl.txt

Note: There are of course numerous possible ways to achieve our goals. You can use your own favorite method. The examples here heavily rely on the Perl scripting language, and filtering of the lists is done using Perl regular expressions. You can find tons of documentation on Perl RE on the web, e.g. perlretut at perldoc.

Count occurrences of every word and create a list of unique words with frequencies:

cat wsj-wpl.txt | perl -e 'while(<>) { chomp; $h{$_}++ } @k = sort {$h{$b} <=> $h{$a}} keys(%h); foreach $w (@k) { print("$w\t$h{$w}\n") }' > wsj-freq.txt
less wsj-freq.txt

Remove the traces from the Penn Treebank, i.e. remove all words containing the "*" character:

cat wsj-freq.txt | grep -vP '\*' > wsj-01.txt
less wsj-01.txt

Some words are capitalized because they occurred in a sentence-initial position. We do not want to count The and the as two distinct word types, so we will lowercase all words before adding them to the list. That of course means that we also lose the possibility to detect proper nouns, which would be useful too. Detecting them would be more difficult, however, so let's just ignore proper nouns here and lowercase everything. The following modification of the above commands (note the lc function in the Perl code) will do the trick.

cat wsj-wpl.txt | perl -e 'while(<>) { chomp; $h{lc($_)}++ } @k = sort {$h{$b} <=> $h{$a}} keys(%h); foreach $w (@k) { print("$w\t$h{$w}\n") }' > wsj-freq.txt
cat wsj-freq.txt | grep -vP '\*' > wsj-01.txt
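
For readers who prefer Python to Perl one-liners, the steps so far (splitting on whitespace, lowercasing, counting, and dropping the Penn Treebank traces) can be sketched as follows. This is only an illustrative equivalent of the commands above; the function name build_frequency_list and the sample sentences are assumptions, not part of the lab materials.

```python
from collections import Counter


def build_frequency_list(lines):
    """Return (word, count) pairs, most frequent first.

    Tokens are split on whitespace and lowercased; tokens containing "*"
    (Penn Treebank traces) are skipped.
    """
    counts = Counter()
    for line in lines:
        for token in line.split():
            if "*" in token:
                continue
            counts[token.lower()] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical in-memory sample; with the real data one could pass
    # an open file handle for wsj.txt instead of this list.
    sample = ["The company said *-1 that the shares rose .", "The shares fell ."]
    for word, n in build_frequency_list(sample):
        print(f"{word}\t{n}")
```

Unlike the shell pipeline, this sketch filters the traces before counting rather than after, which gives the same list in a single pass.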

We can use the wc command to count the words on the list:

cat wsj-01.txt | wc -l

For our morphological lexicon we are interested in real words, not numbers and not punctuation symbols. Remove words that contain any punctuation or digit. Remember that our current list also contains frequencies, i.e. every line contains at least one digit. That is why the second filter is more complex: it looks for digits only in the first column, i.e. before the TAB character. The first filter explicitly lists the grave accent ("`") and the dollar sign ("$") because Unicode classifies them as symbols rather than punctuation, so \pP alone does not match them.

cat wsj-01.txt | grep -vP '[\pP\`\$]' > wsj-02.txt
cat wsj-02.txt | grep -vP '\d.*\t' > wsj-03.txt
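
The same two filters can be approximated in Python with the standard unicodedata module. This is a sketch under stated assumptions: the names is_plain_word, keep_plain_words and EXTRA_SYMBOLS are hypothetical, and str.isdigit() accepts a somewhat wider set of characters than the \d used in the grep command.

```python
import unicodedata

# Symbols that the lab filters out in addition to Unicode punctuation.
EXTRA_SYMBOLS = {"`", "$"}


def is_plain_word(word):
    """True if the word has no punctuation, no digit and no extra symbol."""
    for ch in word:
        if ch in EXTRA_SYMBOLS:
            return False
        if ch.isdigit():
            return False
        # Unicode general categories starting with "P" are punctuation (cf. \pP).
        if unicodedata.category(ch).startswith("P"):
            return False
    return True


def keep_plain_words(freq_pairs):
    """Filter (word, count) pairs; only the word is inspected, never the count."""
    return [(w, n) for w, n in freq_pairs if is_plain_word(w)]
```

Because the word and its frequency are kept as separate values here, the digits of the frequency cannot trigger the filter, so no special handling of the first column is needed.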

You can use the diff command to check what words were removed between two versions of the list. Note that we have also removed abbreviations (because they contain a period) and hyphenated compounds (e.g. third-quarter).

diff wsj-01.txt wsj-03.txt | grep -P '^<' | less

Now look at the list (less wsj-03.txt). Many of the most frequent words are closed-class. It may be easier to just enumerate them manually (of course only if we have enough information on the target language to identify them). Look for pronouns, determiners, numerals, auxiliary verbs, pronominal adverbs, prepositions, conjunctions and particles.

wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/en-closed-class-list.txt
perl -e 'open(CCL, "en-closed-class-list.txt"); while(<CCL>) { chomp; $ccl{lc($_)}++ } while(<>) { ($w, $n) = split("\t"); next if exists $ccl{$w}; print }' < wsj-03.txt > wsj-04.txt
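
A Python counterpart of the closed-class filter might look like the sketch below. The function name remove_closed_class is an assumption; the closed-class words are expected one per line, as the Perl command above reads them.

```python
def remove_closed_class(freq_pairs, closed_class_words):
    """Drop (word, count) pairs whose word is on the closed-class list.

    The closed-class words are lowercased before comparison, because the
    frequency list has already been lowercased.
    """
    closed = {w.strip().lower() for w in closed_class_words if w.strip()}
    return [(w, n) for w, n in freq_pairs if w not in closed]


if __name__ == "__main__":
    # Hypothetical sample data; the real list comes from en-closed-class-list.txt.
    closed_list = ["The", "of", "and"]
    freqs = [("the", 100), ("of", 80), ("company", 12), ("shares", 9)]
    for word, n in remove_closed_class(freqs, closed_list):
        print(f"{word}\t{n}")
```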

The file wsj-04.txt contains over 30K open-class words: nouns, verbs, adjectives and adverbs. Can we tell them apart and identify their base forms without manually tagging each occurrence of each word? The answer is yes, partially. Knowing how English grammar works, we can find words that follow the typical behavior of nouns, verbs etc. For example, if we see both book and books, we can deduce that either book is a singular noun and books is the corresponding plural form, or book is a verb and books is its 3rd person singular present form. We will miss many words that did not occur in both forms in our corpus. But we still have a good chance of identifying thousands of words which are very likely to be either nouns or verbs (or both).

You may write the program to find pairs like book-books in your favorite programming language, or you may download the following Perl script, en-lexicon-patterns.pl, inspect it and modify it for the subsequent tasks.

wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/en-lexicon-patterns.pl
chmod 755 en-lexicon-patterns.pl
cat wsj-04.txt | perl en-lexicon-patterns.pl > wsj-pairs-05.txt
cat wsj-pairs-05.txt | wc -l
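
If you would rather write the pair finder yourself, the book-books heuristic can be sketched in Python as shown here. This is not the content of en-lexicon-patterns.pl, which may work differently; the function names and the three spelling rules (plain -s, -es after sibilants, consonant + y becoming -ies) are assumptions made for illustration.

```python
def s_form_candidates(base):
    """Return the spellings an -s form of `base` would be expected to have."""
    if base.endswith(("s", "x", "z", "ch", "sh")):
        return [base + "es"]            # box -> boxes
    if base.endswith("y") and len(base) > 1 and base[-2] not in "aeiou":
        return [base[:-1] + "ies"]      # company -> companies
    return [base + "s"]                 # book -> books


def find_s_pairs(vocabulary):
    """Return (base, s_form) pairs where both forms occur in the vocabulary."""
    vocab = set(vocabulary)
    pairs = []
    for base in sorted(vocab):
        for form in s_form_candidates(base):
            if form in vocab:
                pairs.append((base, form))
    return pairs


if __name__ == "__main__":
    # Hypothetical mini-vocabulary; the real input would be the first
    # column of wsj-04.txt.
    words = ["book", "books", "box", "boxes", "company", "companies", "quickly"]
    for base, form in find_s_pairs(words):
        print(f"{base}\t{form}")
```

A pair found this way only says that the base is a noun or a verb (or both); telling the two apart needs further evidence, such as an -ed or -ing form occurring for the same base.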

Czech Tagged

Sometimes a tagged corpus is available but the morphological analyzer is not, and we have to build it ourselves. We can use the tags to determine the part of speech of each word. However, we also need to separate words of different inflection classes: our MA lexicon has to know the inflection class for each word.

Our Czech data comes from two treebanks, PDT and CAC, together comprising about 2M words. Every word appears on a separate line; empty lines delimit sentences. Non-empty lines always contain the word, then a TAB character (referred to by \t in regular expressions) and the morphological tag, which is usually a string of 15 characters. See here for documentation of the tagset. In our experiments we will pretend that the underlying MA is not available although it actually can be downloaded and is also available as a web service. You can try the analysis online and there is also a reversed interface where you can enter a lemma and generate all forms with tags.

As with the English data, we will download and unpack the corpus. Then we will remove the empty lines between sentences and convert the text to a list with frequencies. The unit of the list is now not just the word, but a word-tag pair. We will lowercase the words but not the tags.

wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/cs-tagged.txt.gz
gunzip cs-tagged.txt.gz
cat cs-tagged.txt | grep -vP '^\s*$' | grep -vP '[\d\pP\`\$\|].*\t' > cs-tagged-nempty.txt
cat cs-tagged-nempty.txt | perl -CSD -e 'while(<>) { chomp; ($w, $t) = split(/\t/); $h{lc($w)."\t$t"}++ } @k = sort {$h{$b} <=> $h{$a}} keys(%h); foreach $w (@k) { print("$w\t$h{$w}\n") }' > cs-tagged-freq.txt

Now let's look at Czech nouns. Each of the three genders inflects differently, so let's focus on just one gender, the feminine nouns. Filtering them is easy as the tags encode the gender in their third character. We will also require the last (fifteenth) character to be "-", which should rule out abbreviations and non-standard forms.

cat cs-tagged-freq.txt | grep -P '\tNNF...........-\t' > cs-nf-01.txt
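
The positional nature of the tags makes the same selection easy to express in Python. The sketch below mirrors the grep pattern above (noun tag starting with NN, gender F in the third position, "-" in the fifteenth); the function names and the sample entries are hypothetical.

```python
def is_feminine_noun_tag(tag):
    """True for 15-character tags that start with NN, have F in position 3
    and "-" in position 15 (positions are 1-based, as in the text)."""
    return (
        len(tag) == 15
        and tag[0:2] == "NN"
        and tag[2] == "F"
        and tag[14] == "-"
    )


def select_feminine_nouns(entries):
    """Filter (word, tag, count) triples, keeping feminine nouns only."""
    return [e for e in entries if is_feminine_noun_tag(e[1])]


if __name__ == "__main__":
    # Hypothetical entries in the word / tag / frequency layout of cs-tagged-freq.txt.
    sample = [
        ("žena", "NNFS1-----A----", 120),
        ("muž", "NNMS1-----A----", 150),
        ("ženy", "NNFP1-----A----", 80),
    ]
    for word, tag, n in select_feminine_nouns(sample):
        print(f"{word}\t{tag}\t{n}")
```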

Unfortunately, the gender is not enough to determine the inflection class of a noun. Czech feminine nouns are divided into four main inflection types, traditionally identified by model nouns: žena (woman), růže (rose), píseň (song) and kost (bone). Here is an overview:

Model | Singular suffixes | Plural suffixes | Examples
žena | a y e u o e ou | y 0 ám y y ách ami | Praha, koruna, doba, strana, vláda
růže | e e i i e i í | e í ím e e ích emi | práce, země, informace, situace, akcie
píseň | 0 e i 0 i i í | e í ím e e ích emi | úroveň, soutěž, daň, zbraň, Plzeň
kost | 0 i i 0 i i í | i í em i i ech mi | společnost, oblast, činnost, možnost, souvislost

We can focus on the base forms, i.e. nominative singular, and sort them according to the final letters:

cat cs-nf-01.txt | grep -P 'a\tNNFS1' > cs-nf-02-zena.txt
cat cs-nf-01.txt | grep -P '[eě]\tNNFS1' > cs-nf-02-ruze.txt
cat cs-nf-01.txt | grep -P '[jňřxž]\tNNFS1' > cs-nf-02-pisen.txt
cat cs-nf-01.txt | grep -P '[bmst]\tNNFS1' > cs-nf-02-kost.txt

The following letters do not occur at the end of domestic feminine lemmas: d, é, f, g, h, i, í, k, n, o, ó, p, q, r, u, ú, ů, w, y, ý. If they appear in loanwords or foreign names, the noun probably does not inflect in Czech. However, there are still a number of consonants that do not disambiguate between the píseň and kost classes:

cat cs-nf-01.txt | grep -P '[cčďlšťvz]\tNNFS1' > cs-nf-02-pisen-kost.txt
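
The sorting by final letter can also be written as a small lookup in Python. The letter groups below are copied from the grep patterns above; the function name guess_class and the class labels are assumptions, and the input is assumed to use precomposed (NFC) characters.

```python
# Final letters of the nominative singular, grouped as in the grep commands.
ENDING_CLASSES = [
    ("a", "zena"),
    ("eě", "ruze"),
    ("jňřxž", "pisen"),
    ("bmst", "kost"),
    ("cčďlšťvz", "pisen-or-kost"),   # ambiguous between the two classes
]


def guess_class(lemma):
    """Guess the inflection class of a feminine lemma from its last letter.

    Returns None for endings that do not occur in domestic feminine
    lemmas (the noun then probably does not inflect).
    """
    if not lemma:
        return None
    last = lemma[-1].lower()
    for letters, label in ENDING_CLASSES:
        if last in letters:
            return label
    return None


if __name__ == "__main__":
    for lemma in ["koruna", "práce", "úroveň", "oblast", "věc"]:
        print(f"{lemma}\t{guess_class(lemma)}")
```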

If we want to add these to the lexicon, we have to look for their non-base forms. Moreover, we may want to search the non-base forms for nouns that never occurred in the nominative singular.
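
One possible way to use the non-base forms is to check the genitive singular, where the table above shows the suffix e for the píseň class and i for the kost class. The Python sketch below is a deliberately simple illustration under several assumptions: the function names are hypothetical, tags starting with NNFS2 are taken to mark the feminine genitive singular, final ď, ť, ň are respelled as dě, tě, ně or di, ti, ni, and stem alternations such as a dropped vowel are not handled at all.

```python
# Soft consonants that are written as a plain consonant before ě or i.
SOFT_TO_HARD = {"ď": "d", "ť": "t", "ň": "n"}


def genitive_candidates(lemma):
    """Return (pisen_form, kost_form): the genitive singular expected
    if the lemma follows the píseň class or the kost class."""
    last = lemma[-1]
    if last in SOFT_TO_HARD:
        stem = lemma[:-1] + SOFT_TO_HARD[last]
        return stem + "ě", stem + "i"
    return lemma + "e", lemma + "i"


def disambiguate(lemma, observed):
    """Decide between the two classes using observed (word, tag) pairs.

    Returns "pisen", "kost", or None when the evidence is missing or
    contradictory.
    """
    pisen_form, kost_form = genitive_candidates(lemma)
    genitives = {w for w, t in observed if t.startswith("NNFS2")}
    seen_pisen = pisen_form in genitives
    seen_kost = kost_form in genitives
    if seen_pisen and not seen_kost:
        return "pisen"
    if seen_kost and not seen_pisen:
        return "kost"
    return None
```

Other case forms in which the two classes differ (for example the plural suffixes ím versus em) could be added as further evidence in the same way.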


Pick a language, get a corpus, extract the lexicon—as large and as good as possible. Focus on the principal open classes: nouns, verbs, adjectives and adverbs.

I am providing corpora of a few languages here in this section. If you prefer to work with another language, you can do so, but then you have to obtain the corpus on your own (talk to me first, though; maybe I can help with the data). Send an e-mail to zeman@ufal.mff.cuni.cz with the lexicon and with a text describing how you proceeded (a commented list of commands or a commented script / source code should suffice). If you choose to work with your own data, also briefly describe what language it is and where the data comes from.