Lexicon Acquisition (lab)

Every morphological analyzer must somehow encode two things: the lexicon, and the morphological rules. Of these two, the lexicon is more difficult to obtain. It is not just a list of words in the language. For each word it must also encode its category, i.e. part of speech and inflectional type (paradigm).
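
For illustration, an entry in such a lexicon could look like the following (both the format and the entries are made up here; the paradigm labels can be whatever your analyzer understands):

book    noun    regular (plural books)
book    verb    regular (3rd person singular books, past booked)
child   noun    irregular (plural children)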

In this lab exercise, we will experiment with heuristics that may be helpful in rapid development of morphological analyzers for new languages. We will try to automatically categorize words found in corpora (both raw corpora and tagged corpora).

English Untagged

Our English data comes without part-of-speech tags, but it is not really raw text either. It comes from the Penn Treebank / Wall Street Journal. I removed the part-of-speech tags and the syntactic annotation but kept the tokenization (i.e., punctuation symbols are not stuck to the neighboring words). Note, however, that the text still contains traces from the syntactic annotation that do not correspond to any surface word. Open a console/terminal window, then download and unpack the data to your machine:

wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/wsj.txt.gz
gunzip wsj.txt.gz

Hint: Are you bored by getting results that are already described on this page? Try another English corpus and see how the results differ. Here is how you get text from the English Web Corpus (the UD_English-EWT treebank of Universal Dependencies):

wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/ewt.txt.gz
gunzip ewt.txt.gz

Look what is inside (press "q" to quit):

less wsj.txt

Transform the text so that every word is on a separate line:

cat wsj.txt | perl -pe 's/\s+/\n/g' > wsj-wpl.txt
less wsj-wpl.txt

Note: There are of course numerous possible ways to achieve our goals. You can use your own favorite method. The examples here heavily rely on the Perl scripting language, and filtering of the lists is done using Perl regular expressions. You can find tons of documentation on Perl RE on the web, e.g. perlretut at perldoc.

Count occurrences of every word and create a list of unique words with frequencies:

cat wsj-wpl.txt | perl -e 'while(<>) { chomp; $h{$_}++ } @k = sort {$h{$b} <=> $h{$a}} keys(%h); foreach $w (@k) { print("$w\t$h{$w}\n") }' > wsj-freq.txt
less wsj-freq.txt

Remove the traces from the Penn Treebank, i.e. remove all words containing the "*" character:

cat wsj-freq.txt | grep -vP '\*' > wsj-01.txt
less wsj-01.txt

Some words are capitalized because they occurred in sentence-initial position. We do not want to count The and the as two distinct word types, so we may want to lowercase all words before adding them to the list. That of course means that we also lose the ability to detect proper nouns, which would be useful too. But detecting them would be more difficult, so let's ignore proper nouns here and lowercase everything. The following modification of the above commands (note the lc function in the Perl code) will do the trick.

cat wsj-wpl.txt | perl -e 'while(<>) { chomp; $h{lc($_)}++ } @k = sort {$h{$b} <=> $h{$a}} keys(%h); foreach $w (@k) { print("$w\t$h{$w}\n") }' > wsj-freq.txt
cat wsj-freq.txt | grep -vP '\*' > wsj-01.txt

We can use the wc command to count the words on the list:

cat wsj-01.txt | wc -l
43764

For our morphological lexicon we are interested in real words, not in numbers or punctuation symbols. Remove words that contain any punctuation or digit. Remember that our current list also contains frequencies, i.e. every line contains at least one digit; that is why the second filter is more complex and only looks for digits in the first column. The first filter explicitly mentions the grave accent ("`") and the dollar sign ("$") because the \pP character class does not treat them as punctuation (in Unicode they are classified as symbols).

cat wsj-01.txt | grep -vP '[\pP\`\$]' > wsj-02.txt
cat wsj-02.txt | grep -vP '\d.*\t' > wsj-03.txt

You can use the diff command to check which words were removed between two versions of the list. Note that we have actually also removed abbreviations (because they contain a period) and hyphenated compounds (e.g. third-quarter).

diff wsj-01.txt wsj-03.txt | grep -P '^<' | less

Now look at the list (less wsj-03.txt). Many of the most frequent words are closed-class. It may be easier to simply enumerate these manually (of course, only if we have enough information about the target language to identify them!). Look for pronouns, determiners, numerals, auxiliary verbs, pronominal adverbs, prepositions, conjunctions, and particles.

wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/en-closed-class-list.txt
perl -e 'open(CCL, "en-closed-class-list.txt"); while(<CCL>) { chomp; $ccl{lc($_)}++ } while(<>) { ($w, $n) = split("\t"); next if exists $ccl{$w}; print }' < wsj-03.txt > wsj-04.txt

The file wsj-04.txt contains over 30K open-class words. There are nouns, verbs, adjectives and adverbs. Can we tell them apart and identify their base forms? Without manually tagging each occurrence of each word? The answer is yes—partially. Knowing how English grammar works, we can find words that follow typical behavior of nouns, verbs etc. For example, if we see both book and books, we can deduce that either book is a singular noun and books is the corresponding plural form, or book is a verb and books is its 3rd person singular present form. We will miss many words that did not occur in both forms in our corpus. But we still have a good chance of identifying thousands of words which are very likely to be either nouns or verbs (or both).

You may write the program that finds pairs like book-books in your favorite programming language, or you may download the following Perl script, en-lexicon-patterns.pl, inspect it, and modify it for the subsequent tasks.

wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/en-lexicon-patterns.pl
chmod 755 en-lexicon-patterns.pl
cat wsj-04.txt | perl en-lexicon-patterns.pl > wsj-pairs-05.txt
cat wsj-pairs-05.txt | wc -l
4448
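
If you decide to write the pairing program yourself, here is a minimal sketch of the book/books idea described above. It is not the en-lexicon-patterns.pl script (which may use different or additional patterns); it merely reports every word from the frequency list that also occurs with -s or -es appended. Save it, say, as pair-sketch.pl (the name is arbitrary):

#!/usr/bin/env perl
# Minimal sketch: find potential singular/plural (noun) or base/3rd-person (verb) pairs.
# Reads a word TAB frequency list on standard input.
use strict;
use warnings;

my %freq;
my @words;
while (<STDIN>) {
    chomp;
    my ($w, $n) = split(/\t/);
    next unless defined($w) && $w ne '';
    $freq{$w} = $n;
    push(@words, $w);
}
foreach my $w (@words) {
    foreach my $suffix ('s', 'es') {
        my $longer = $w.$suffix;
        if (exists($freq{$longer})) {
            # Could be noun singular/plural or verb base/3rd person singular; we cannot tell yet.
            # Alternations like company/companies (y ~ ies) are not covered by this sketch.
            print("$w\t$longer\t$freq{$w}\t$freq{$longer}\n");
        }
    }
}

cat wsj-04.txt | perl pair-sketch.pl | less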

Czech Tagged

Sometimes a tagged corpus is available but the morphological analyzer is not, and we have to build it ourselves. We can use the tags to determine the part of speech of each word. However, we also need to separate words of different inflection classes: our MA lexicon has to know the inflection class for each word.

Our Czech data comes from two treebanks, PDT and CAC, together comprising about 2M words. Every word appears on a separate line, empty lines delimit sentences. Non-empty lines always contain the word, then a TAB character (referred to by \t in regular expressions) and the morphological tag, which is usually a string of 15 characters. See here for documentation of the tagset. In our experiments we will pretend that the underlying MA is not available although it actually can be downloaded and is also available as a web service. You can try the analysis online and there is also a reversed interface where you can enter a lemma and generate all forms with tags.
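
For illustration, a feminine noun in the nominative singular could appear on a line like the one below (this line is constructed for the example, not copied from the corpus; see the tagset documentation for the meaning of the individual tag positions):

žena    NNFS1-----A----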

As with the English data, we will download and unpack the corpus. Then we will remove the empty lines between sentences and convert the text to a list with frequencies. The unit of the list is now not just the word, but a word-tag pair. We will lowercase the words but not the tags.

wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-lexicon/cs-tagged.txt.gz
gunzip cs-tagged.txt.gz
cat cs-tagged.txt | grep -vP '^\s*$' | grep -vP '[\d\pP\`\$\|].*\t' > cs-tagged-nempty.txt
cat cs-tagged-nempty.txt | perl -CSD -e 'while(<>) { chomp; ($w, $t) = split(/\t/); $h{lc($w)."\t$t"}++ } @k = sort {$h{$b} <=> $h{$a}} keys(%h); foreach $w (@k) { print("$w\t$h{$w}\n") }' > cs-tagged-freq.txt

Now let's look at Czech nouns. Each of the three genders inflects differently, so let's focus on just one gender, the feminine nouns. Filtering them is easy as the tags encode the gender in their third character. We will also require the last (fifteenth) character to be "-", which should rule out abbreviations and non-standard forms.

cat cs-tagged-freq.txt | grep -P '\tNNF...........-\t' > cs-nf-01.txt

Unfortunately, the gender is not enough to determine the inflection class of a noun. Czech feminine nouns are divided into four main inflection types, traditionally identified by model nouns: žena (woman), růže (rose), píseň (song) and kost (bone). Here is an overview; the suffixes in each cell are listed in the order of the seven Czech cases (nominative, genitive, dative, accusative, vocative, locative, instrumental):

Model  | Singular suffixes | Plural suffixes    | Examples
žena   | a y e u o e ou    | y 0 ám y y ách ami | Praha, koruna, doba, strana, vláda
růže   | e e i i e i í     | e í ím e e ích emi | práce, země, informace, situace, akcie
píseň  | 0 e i 0 i i í     | e í ím e e ích emi | úroveň, soutěž, daň, zbraň, Plzeň
kost   | 0 i i 0 i i í     | i í em i i ech mi  | společnost, oblast, činnost, možnost, souvislost

We can focus on the base forms, i.e. the nominative singular, and separate them according to their final letters:

cat cs-nf-01.txt | grep -P 'a\tNNFS1' > cs-nf-02-zena.txt
cat cs-nf-01.txt | grep -P '[eě]\tNNFS1' > cs-nf-02-ruze.txt
cat cs-nf-01.txt | grep -P '[jňřxž]\tNNFS1' > cs-nf-02-pisen.txt
cat cs-nf-01.txt | grep -P '[bmst]\tNNFS1' > cs-nf-02-kost.txt
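
At this point we could already start assembling the lexicon. One possible way (just a sketch; the class labels are simply the model nouns) to merge the four unambiguous files into lemma TAB class pairs is the following; the environment variable C passes the class name from the shell loop into the Perl one-liner:

for c in zena ruze pisen kost ; do
  C=$c perl -ne 'chomp; my @f = split(/\t/); print("$f[0]\t$ENV{C}\n")' cs-nf-02-$c.txt
done > cs-nf-lexicon.txt
less cs-nf-lexicon.txt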

The following letters do not occur at the end of domestic feminine lemmas: d, é, f, g, h, i, í, k, n, o, ó, p, q, r, u, ú, ů, w, y, ý. If they appear in loanwords or foreign names, the noun probably does not inflect in Czech. However, there are still a number of consonants that do not disambiguate between the píseň and kost classes:

cat cs-nf-01.txt | grep -P '[cčďlšťvz]\tNNFS1' > cs-nf-02-pisen-kost.txt

If we want to add these to the lexicon, we have to look for their non-base forms. Moreover, we may want to search the non-base forms for nouns that never occurred in the nominative singular.
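
The script below is one possible sketch of that search (it is not part of the lab materials; the file names follow the steps above, and stem-final alternations are handled only very roughly, so treat the output as a heuristic guess). For each ambiguous lemma it looks for a genitive singular form (tag starting with NNFS2) in the corpus: the píseň class takes -e/-ě there, while the kost class takes -i. Save it, say, as cs-guess-pisen-kost.pl:

#!/usr/bin/env perl
# Sketch: guess the pisen vs. kost class by looking up genitive singular forms.
# Input files: cs-tagged-freq.txt (form TAB tag TAB frequency)
#              cs-nf-02-pisen-kost.txt (ambiguous lemmas, same format)
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':utf8');

# Collect all forms tagged as feminine noun in the genitive singular.
my %gensg;
open(my $freq, '<:utf8', 'cs-tagged-freq.txt') or die($!);
while (<$freq>) {
    chomp;
    my ($form, $tag) = split(/\t/);
    $gensg{$form}++ if defined($tag) && $tag =~ m/^NNFS2/;
}
close($freq);

open(my $amb, '<:utf8', 'cs-nf-02-pisen-kost.txt') or die($!);
while (<$amb>) {
    chomp;
    my ($lemma) = split(/\t/);
    # In Czech orthography, ď/ť/ň are written d/t/n before ě/i (loď ~ lodě, odpověď ~ odpovědi).
    my $stem = $lemma;
    my $soft = ($stem =~ s/ď$/d/ || $stem =~ s/ť$/t/ || $stem =~ s/ň$/n/);
    # Candidate genitive singular forms; fleeting -e- (obec ~ obce) is ignored in this sketch.
    my $pisen = $stem.($soft ? 'ě' : 'e');
    my $kost  = $stem.'i';
    my @classes;
    push(@classes, 'pisen') if exists($gensg{$pisen});
    push(@classes, 'kost') if exists($gensg{$kost});
    my $class = @classes ? join('|', @classes) : '?';
    print("$lemma\t$class\n");
}
close($amb);

perl cs-guess-pisen-kost.pl > cs-nf-02-pisen-kost-guessed.txt
less cs-nf-02-pisen-kost-guessed.txt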

Homework

Pick a language, get a corpus, extract the lexicon—as large and as good as possible. Focus on the principal open classes: nouns, verbs, adjectives and adverbs. Remember that the purpose of the resulting lexicon is to be used as a part of a morphological analyzer/generator (and you will probably want to use this lexicon when creating your own analyzer for Homework 2). We will cover the details of such a system later but here we can at least summarize the important properties that the lexicon should have.

I am providing corpora for a few languages below in this section. If you prefer to work with another language, you can do so, but then you have to obtain the corpus on your own (talk to me first, though – maybe I can help with the data).

Submission: Your solution should contain:
1. The file with the resulting lexicon.
2. The script(s) necessary to create the lexicon from the corpus, ideally including a Makefile with the commands needed to run the scripts and regenerate the lexicon on a Linux system (the path to the input corpus can be configurable in the Makefile, but it should not be hardcoded in the scripts).
3. A documentation/report file: what language you work with, what the input looks like (plain text? tagged?), what part of the language you covered, what heuristics you used, and any other interesting observations you made. Also document how your scripts are invoked, unless you provided a clearly readable and commented Makefile. If you choose to work with your own data, briefly describe where it comes from.
Zip all these files as hw1.zip and send them by e-mail to zeman@ufal.mff.cuni.cz.
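
To illustrate the Makefile mentioned in point 2, a minimal sketch could look like the following (the script and file names are placeholders to be replaced by your own; note that the recipe lines under each target must be indented with a tab character). The corpus path can then be overridden on the command line, e.g. make CORPUS=/path/to/corpus.txt:

# Hypothetical example; replace the variable and the script names with your own.
CORPUS = corpus.txt

all: lexicon.txt

lexicon.txt: extract-lexicon.pl $(CORPUS)
	perl extract-lexicon.pl < $(CORPUS) > lexicon.txt

clean:
	rm -f lexicon.txt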