Morfessor (lab)

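Morfessor is a popular tool for unsupervised morphemic segmentation. The following commands download the lab package, train a model on the bundled sample data, and print the resulting segmentation: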

wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-morfessor/morfessor.zip
unzip morfessor.zip
cd morfessor/train
make
zcat segmentation.final.gz

Patch note: In the file morfessor/train/Makefile, on line 298, I had to insert a minus sign ("-") before the first command. grep exits with status 1 when no matching line is found; this is not an error, but make treats any non-zero exit status as a failure and stops. With the minus sign, make ignores the exit status of that command and continues.
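
The exit-status behavior behind this patch can be reproduced directly in the shell. This is only a minimal sketch: the sample file and the patterns are invented for illustration and are not taken from the Morfessor Makefile.

```shell
# Hypothetical sample file; not part of the Morfessor distribution.
tmp=$(mktemp)
printf 'alpha\nbeta\n' > "$tmp"

# grep exits with 0 when at least one line matches.
status_match=0
grep -q 'alpha' "$tmp" || status_match=$?

# grep exits with 1 when no line matches (a status greater than 1 signals a real error).
status_nomatch=0
grep -q 'gamma' "$tmp" || status_nomatch=$?

echo "match: $status_match, no match: $status_nomatch"
rm -f "$tmp"
```

A recipe line in a Makefile that runs the second grep would stop the build unless it is prefixed with the minus sign.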

To try Morfessor on your own data, replace the file mydata.gz and re-run make. The input format is one word per line: frequency, space, word form, newline. The commands for the English and German data below unpack a fresh morfessor folder for each experiment, to avoid prompts about overwriting intermediate files.
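
Before running make on your own data, it may help to check that the file really follows this format. The helper below is a hypothetical sketch, not part of Morfessor; it assumes that the frequency is a non-negative integer and that the word form contains no whitespace.

```shell
# Hypothetical helper: list every line of a gzip-compressed file that is not
# "<frequency> <word form>" and return a non-zero status if any was found.
check_format() {
    gzip -dc "$1" | awk '
        !/^[0-9]+ [^ ]+$/ || NF != 2 { printf "line %d: %s\n", NR, $0; bad = 1 }
        END { exit bad + 0 }'
}

# Small demonstration on invented data.
demo=$(mktemp -d)
printf '3 hrad\n1 hradu\n' | gzip -c > "$demo/good.gz"   # frequency, space, word form
printf 'hrad\t3\n' | gzip -c > "$demo/bad.gz"            # columns swapped, tab-separated
check_format "$demo/good.gz" && echo "good.gz: format OK"
check_format "$demo/bad.gz" || echo "bad.gz: needs fixing"
rm -rf "$demo"
```

In the training folder the call would be check_format mydata.gz.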

The first experiment above finished instantly because the data consisted only of two Czech nominal paradigms and nothing else. Processing the WSJ English data takes about 8 minutes on my computer. Processing the German data linked below takes about 14 minutes.

# Return to the top folder, set the first experiment aside, and unpack a fresh copy.
cd ../..
mv morfessor morfessor1
unzip morfessor.zip
cd morfessor/train
# Take the filtered and lowercased English word list from the last lab exercise.
# Either copy your version from a neighboring folder, or download my version from the web.
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-morfessor/wsj-04.txt
# wsj-04.txt has word, tab, frequency on each line; Morfessor expects frequency, space, word.
# We need to swap the columns (words and frequencies).
cat wsj-04.txt | perl -pe 'chomp; @f=split(/\t/); $_="$f[1] $f[0]\n"' | gzip -c > mydata.gz
make
zless segmentation.final.gz
cd ../..
mv morfessor morfessor2
unzip morfessor.zip
cd morfessor/train
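# The German frequency list is already in the frequency-space-word format, so it only needs to be compressed.
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-morfessor/ud-de-gsd-freqlist.txt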
cat ud-de-gsd-freqlist.txt | gzip -c > mydata.gz
make
zless segmentation.final.gz

How to get a word frequency list from a CoNLL-U file (Universal Dependencies):

cat *.conllu | perl -CDS -e 'while(<>){if(m/^\d/){@f=split(/\t/); next if($f[1]=~m/(\pP|\d)/); $h{lc($f[1])}++}} @k=sort{$h{$b}<=>$h{$a}}(keys(%h)); foreach my $k (@k) {print("$h{$k} $k\n");}' | less