Morfessor is one of the popular tools for unsupervised morphemic segmentation.
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-morfessor/morfessor.zip
unzip morfessor.zip
cd morfessor/train
make
zcat segmentation.final.gz
Patch note: In the file morfessor/train/Makefile,
on line 298, I had to insert a minus (“-”) before the first command. The reason is that grep
returns 1 when no matching line is found; this is not an error, but make
interprets the nonzero exit status as a failure and stops. With the minus sign, make
ignores the exit status of that line.
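The grep behavior behind the patch note can be checked directly in the shell. The following is only a minimal sketch: the temporary file and the sample words are made up for illustration and have nothing to do with the actual command on line 298 of the Makefile.

```shell
# Create a tiny sample file (hypothetical content, for illustration only).
tmpfile=$(mktemp)
printf 'kocka\nkocky\n' > "$tmpfile"

# grep exits with 0 when at least one line matches ...
status_match=0
grep -q 'kocka' "$tmpfile" || status_match=$?

# ... and with 1 when no line matches, although nothing went wrong.
status_nomatch=0
grep -q 'pes' "$tmpfile" || status_nomatch=$?

echo "match: $status_match, no match: $status_nomatch"
rm -f "$tmpfile"
```

In a Makefile recipe, a line that starts with a minus tells make to carry on even if the command on that line returns a nonzero status, which is what the patch relies on.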
If you want to try Morfessor on your own data, replace the file mydata.gz and re-run make. Input format: one word per line, given as frequency, space, word form, newline. Note that the following example starts from a fresh morfessor folder in order to avoid prompts about overwriting intermediate files.
The first experiment above finished instantly because the data consisted only of two Czech nominal paradigms and nothing else. Processing the WSJ English data takes about 8 minutes on my computer. Processing the German data linked below takes about 14 minutes.
cd ../..
mv morfessor morfessor1
unzip morfessor.zip
cd morfessor/train
# Take the filtered and lowercased English word list from the last lab exercise.
# Either copy your version from a neighboring folder, or download my version from the web.
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-morfessor/wsj-04.txt
# We need to swap the columns (words and frequencies).
cat wsj-04.txt | perl -pe 'chomp; @f=split(/\t/); $_="$f[1] $f[0]\n"' | gzip -c > mydata.gz
make
zless segmentation.final.gz
cd ../..
mv morfessor morfessor2
unzip morfessor.zip
cd morfessor/train
wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-morfessor/ud-de-gsd-freqlist.txt
cat ud-de-gsd-freqlist.txt | gzip -c > mydata.gz
make
zless segmentation.final.gz

How to get a word frequency list from a CoNLL-U file (Universal Dependencies):
cat *.conllu | perl -CDS -e 'while(<>){if(m/^\d/){@f=split(/\t/); next if($f[1]=~m/(\pP|\d)/); $h{lc($f[1])}++}} @k=sort{$h{$b}<=>$h{$a}}(keys(%h)); foreach my $k (@k) {print("$h{$k} $k\n");}' | less
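Before such a list is saved as mydata.gz, it may be worth checking that every line follows the input format described earlier (frequency, space, word form). The following is a minimal sketch under assumptions: the helper count_bad_lines, the file name freqlist.txt and the three sample lines are hypothetical, and the check only tests the shape of each line, not whether Morfessor will accept the words themselves.

```shell
# Count the lines that do NOT look like "<frequency> <word form>".
# (count_bad_lines is a hypothetical helper, not part of Morfessor.)
count_bad_lines() {
    # grep -c prints the count; it exits with 1 when the count is 0,
    # so "|| true" keeps that harmless status from being treated as an error.
    grep -c -v -E '^[0-9]+ [^[:space:]]+$' "$1" || true
}

# Hypothetical sample list; in practice this would be the saved output
# of a frequency-list command such as the one above.
workdir=$(mktemp -d)
printf '12 walk\n7 walked\n3 walking\n' > "$workdir/freqlist.txt"

bad=$(count_bad_lines "$workdir/freqlist.txt")
if [ "$bad" -eq 0 ]; then
    # Only compress the list once every line has passed the check.
    gzip -c "$workdir/freqlist.txt" > "$workdir/mydata.gz"
    echo "OK: wrote $workdir/mydata.gz"
else
    echo "Found $bad malformed line(s), nothing written" >&2
fi
```

A typical mistake that this catches is a list whose columns are still in the order word, tab, frequency, as in the English file before the columns were swapped.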