Morfessor (lab)

Morfessor is a popular tool for unsupervised morphemic segmentation.

wget http://ufal.mff.cuni.cz/~zeman/vyuka/morfosynt/lab-morfessor/morfessor.zip
unzip morfessor.zip
cd morfessor/train
make
zcat segmentation.final.gz

To try Morfessor on your own data, replace the file mydata.gz and re-run make. The input format is one word form per line: frequency, space, word form, newline. The WSJ example below unpacks a fresh morfessor folder so that make does not ask whether to overwrite the intermediate files left by the first run.
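As an illustration of the input format, here is a minimal sketch that builds such a file from whitespace-tokenized plain text. The file corpus.txt and its toy content are hypothetical and only stand in for your own data; the sketch writes mydata.gz to the current directory, so run it in morfessor/train or move the result there.

```shell
# Hypothetical toy corpus; substitute your own tokenized text.
printf 'hrad hradu hrad\nhradem hrad hradu\n' > corpus.txt

# Put one token on each line, count identical tokens, sort by descending
# frequency (ties alphabetically), and print "frequency word".
tr -s ' ' '\n' < corpus.txt \
  | sort \
  | uniq -c \
  | sort -k1,1nr -k2,2 \
  | awk '{print $1, $2}' \
  | gzip -c > mydata.gz

# Check the result: each line should read "frequency word".
gzip -dc mydata.gz
```

The awk step is there only to drop the padding that uniq -c puts in front of the counts, so that each line starts directly with the frequency.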

The first experiment above finished almost instantly because its data consisted of just two Czech nominal paradigms. Processing the English WSJ data takes about 8 minutes on my computer.

cd ../..
mv morfessor morfessor1
unzip morfessor.zip
cd morfessor/train
cat ../../../lab-lexicon/wsj-04.txt | perl -pe 'chomp; @f=split(/\t/); $_="$f[1] $f[0]\n"' | gzip -c > mydata.gz
make
zless segmentation.final.gz

How to get a word frequency list from a CoNLL-U file (Universal Dependencies):

cat *.conllu | perl -CDS -e 'while(<>){if(m/^\d/){@f=split(/\t/); next if($f[1]=~m/(\pP|\d)/); $h{lc($f[1])}++}} @k=sort{$h{$b}<=>$h{$a}}(keys(%h)); foreach my $k (@k) {print("$h{$k}\t$k\n");}' | less
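
The list printed by this command separates frequency and word form with a tab, whereas the Morfessor input described above uses a space. A minimal sketch of the conversion follows; the file freq.tsv and its three entries are hypothetical and stand in for the saved output of the command above (redirected to a file instead of piped to less). It assumes that no word form contains a tab or a space of its own.

```shell
# Hypothetical sample standing in for the saved frequency list
# (frequency, TAB, word form).
printf '12\tthe\n7\tof\n3\thouses\n' > freq.tsv

# Turn the tab into a single space and compress the result under the
# file name that the lab expects.
tr '\t' ' ' < freq.tsv | gzip -c > mydata.gz

# Check the result: each line should read "frequency word".
gzip -dc mydata.gz
```

As before, the resulting mydata.gz belongs in morfessor/train before make is re-run.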