In this lab session we will combine finite-state machines with a unification grammar and feature structures. We will experiment with extended capabilities of the PCKIMMO program; the unification grammar is called word grammar there. You should have the PCKIMMO binary executable from the previous lab session; if you don't, this is how you get it:
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-twolm/pckimmo2-binary-linux.zip
unzip pckimmo2-binary-linux.zip
chmod 755 bin/*
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-twolm/englex.zip
unzip englex.zip
The englex description of English includes a grammar (you may remember that last time we had to call “set grammar off” to get it out of our way). Try recognizing the word enlargements and see what kinds of outputs PCKIMMO delivers. There should be the lexical string and glosses as before, plus a derivation tree and the resulting feature structure.
You need a different set of Czech files than last time. Download them now:
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-ug/pckimmo-cs-gram.zip
unzip pckimmo-cs-gram.zip
cd cs-gram
../bin/pckimmo -t cs.tak

r žena
žen+a    N(žena)+a(žena)

1:
      Word
 _____|_____
 N         INFL
žen         +a
N(žena)   +a(žena)

Word:
[ cat:Word
  pos:N
  pat:žena
  gen:fem
  num:sg
  case:nom ]

1 parse found
Inspect the grammar file cs.grm and look at the feature shortcuts in the lexicon entries. Note that continuation (ALTERNATION) classes no longer play a key role in determining which stem combines with which endings. Instead, the pat feature identifies the “pattern” (inflectional class) in both the stem and the ending, so a stem and an ending combine only if their pat values agree.
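To make the role of the pat feature concrete, here is a minimal Python sketch of unification over flat feature structures. It is not how PCKIMMO is implemented, and the lexicon entries are assumptions modelled on the r žena output above: the function unify, the variable names, and the split of features between stem and ending are all hypothetical.

```python
from typing import Dict, Optional

FeatureStructure = Dict[str, str]


def unify(a: FeatureStructure, b: FeatureStructure) -> Optional[FeatureStructure]:
    """Unify two flat feature structures; return None if any value clashes."""
    result = dict(a)
    for feature, value in b.items():
        if feature in result and result[feature] != value:
            return None  # conflicting values: unification fails
        result[feature] = value
    return result


# Hypothetical entries; the real ones live in the Czech lexicon files.
stem_zen = {"pos": "N", "pat": "žena"}
ending_a = {"pat": "žena", "gen": "fem", "num": "sg", "case": "nom"}
# Hypothetical ending that belongs to a different pattern.
ending_other = {"pat": "růže", "gen": "fem", "num": "sg", "case": "nom"}

print(unify(stem_zen, ending_a))      # merged structure: the patterns agree
print(unify(stem_zen, ending_other))  # None: the pat values clash
```

The stem and the ending never refer to each other by name. The only thing that licenses the combination is that both carry the same value of pat.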
Note that within this package, the noun classes žena and růže are currently handled slightly differently. Try recognizing the -e form in both classes: r ženě and r růže. The former gives you multiple (but identical) segmentations into morphemes, each with one parse. The latter gives you just one segmentation with three parses. Semantically the two approaches are of course equivalent. They could be unified by redesigning one of the classes internally: instead of listing the ambiguous endings of žena as multiple entries with varying features, we could list each ending just once with a disjunction of feature structures (as is done with růže now).
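The equivalence of the two encodings can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper functions are hypothetical, and the two readings of the ending (dative and locative singular) are chosen for the example and are not copied from the lexicon files.

```python
from typing import Dict, List, Optional

FS = Dict[str, str]


def unify(a: FS, b: FS) -> Optional[FS]:
    """Unify two flat feature structures; return None on a clash."""
    result = dict(a)
    for feature, value in b.items():
        if result.get(feature, value) != value:
            return None
        result[feature] = value
    return result


def parses_from_entries(stem: FS, entries: List[FS]) -> List[FS]:
    """One lexicon entry per reading: each entry is unified separately."""
    parses = []
    for entry in entries:
        fs = unify(stem, entry)
        if fs is not None:
            parses.append(fs)
    return parses


def parses_from_disjunction(stem: FS, shared: FS, alternatives: List[FS]) -> List[FS]:
    """One lexicon entry whose features contain a disjunction of alternatives."""
    base = unify(stem, shared)
    if base is None:
        return []
    return parses_from_entries(base, alternatives)


stem = {"pos": "N", "pat": "žena"}
# Encoding 1 (illustrative): the ending is listed twice with different features.
listed = [
    {"pat": "žena", "num": "sg", "case": "dat"},
    {"pat": "žena", "num": "sg", "case": "loc"},
]
# Encoding 2 (illustrative): the ending is listed once with a disjunction.
shared = {"pat": "žena"}
alternatives = [{"num": "sg", "case": "dat"}, {"num": "sg", "case": "loc"}]

print(parses_from_entries(stem, listed) == parses_from_disjunction(stem, shared, alternatives))
```

Both encodings yield the same set of feature structures. They differ only in whether the ambiguity shows up as several segmentations or as several parses of one segmentation.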
Finally, features are also used to check that the superlative prefix does not occur without a comparative suffix. Inspect the grammar rules at the end of the cs.grm file and figure out how they work. Then verify that a correct superlative (e.g., nejmladší “youngest”) is recognized, while an incorrect combination of the prefix with a positive base form (e.g., nejmladý) is rejected.
r nejmladší
...
27:
              Word_57
     __________|__________
  ADeg_2+               INFL_58
 ______|______             +í
SUPERL_3+  ADeg_4+      +í(jarní)
  nej+        |
SUPERL+     A_5+
           mlad+š
         A(mladý)+COMP

Word:
[ cat:Word
  pos:Adj
  pat:jarní
  deg:sup
  gen:fem
  num:sg
  case:dat ]

27 parses found

r nejmladý
*** NONE ***
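The idea behind such a check can be sketched in Python. This is not the rule from cs.grm: the function name and the deg values on the input are assumptions, chosen so that the result matches the deg:sup feature in the output above.

```python
from typing import Dict, Optional

FS = Dict[str, str]


def attach_superlative_prefix(adjective: FS) -> Optional[FS]:
    """Hypothetical rule: the prefix nej+ may only attach to a comparative.

    Returns the feature structure of the superlative, or None if the
    constraint on the deg feature is violated.
    """
    if adjective.get("deg") != "comp":
        return None  # positive form: the combination is rejected
    result = dict(adjective)
    result["deg"] = "sup"  # prefix + comparative yields a superlative
    return result


# Assumed feature structures for the two stems in the session above.
comparative = {"pos": "Adj", "deg": "comp"}  # mlad+š
positive = {"pos": "Adj", "deg": "pos"}      # mlad

print(attach_superlative_prefix(comparative))
print(attach_superlative_prefix(positive))
```

The prefix and the suffix are not adjacent in the word, but the constraint is still local in the derivation tree: the suffix records its contribution in deg, and the rule that attaches the prefix only inspects that feature.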
Another long-distance dependency problem is the vowel change (“umlaut”) inside German noun stems in certain classes of plural inflection. Unfortunately, the word grammar in PCKIMMO has no access to information about whether and where a finite-state transducer approved a vowel change. So we cannot just try both options and then have the grammar check them, as we did with Czech superlatives.
Download a set of German files for PCKIMMO using the link below. There is a sample of German nouns with varying plural formation patterns:
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-ug/pckimmo-de-gram.zip
unzip pckimmo-de-gram.zip
cd de
../bin/pckimmo -t de.tak
Here the word grammar fills out the number feature in the feature structure, and it also makes sure that stems are not combined with plural suffixes that don't belong to them:
r Fenster
Fenster+λ    N(Fenster)+PL

1:
        Word
     ____|_____
     N       NSUF
  Fenster     +λ
N(Fenster)   +PL

Word:
[ cat:Word
  pos:N
  number:plur ]

1 parse found

Fenster    N(Fenster)

1:
   Word
    |
    N
 Fenster
N(Fenster)

Word:
[ cat:Word
  pos:N
  number:sing ]

1 parse found

r Fenstern
*** NONE ***
However, not even the word grammar can enforce that vowels are umlauted only where required. The finite-state transducer at least tries to make sure that an umlaut is accompanied by a suffix that might (but does not necessarily) signal plural. As a result, the analyzer accepts both Bücher (correct) and Bucher (wrong), as well as Kuchen (correct) and Küchen (wrong).
Task: Redesign the source files so that they also check correct umlauting. Hint: the word grammar does not see anything more fine-grained than the “morpheme”. Hint 2: Don't expect the solution to look extremely elegant.
You can use the script test.tak to test your output against the gold standard:
../bin/pckimmo -t test.tak ; git diff test-out.txt test-gold.txt
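If git is not available, the comparison can also be done with Python's standard difflib module. The sketch below assumes that both files are plain text encoded in UTF-8, which is an assumption and not something the package states. The function names are hypothetical.

```python
import difflib
from typing import List


def diff_lines(output: List[str], gold: List[str]) -> List[str]:
    """Return a unified diff from gold to output; empty if they are identical."""
    return list(
        difflib.unified_diff(
            gold, output, fromfile="test-gold.txt", tofile="test-out.txt", lineterm=""
        )
    )


def diff_files(out_path: str = "test-out.txt", gold_path: str = "test-gold.txt") -> List[str]:
    """Read both files (UTF-8 assumed) and compare them line by line."""
    with open(out_path, encoding="utf-8") as f:
        output = f.read().splitlines()
    with open(gold_path, encoding="utf-8") as f:
        gold = f.read().splitlines()
    return diff_lines(output, gold)


# Usage, after running ../bin/pckimmo -t test.tak in the same directory:
#     for line in diff_files():
#         print(line)
```

An empty result means that the analyzer output matches the gold standard. Lines starting with + appear only in your output, and lines starting with - appear only in the gold file.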