1. Install Eman (and within Eman, we will install Moses and GIZA++)

    git clone https://redmine.ms.mff.cuni.cz/ufal-smt/eman.git

2. Make sure eman is in your PATH:

    export PATH=$HOME/eman/bin/:$PATH
    # single quotes, so that the variables are expanded at login, not now
    echo 'export PATH=$HOME/eman/bin/:$PATH' >> ~/.bashrc

3. Get SMT Playground

    git clone https://redmine.ms.mff.cuni.cz/ufal-smt/playground.git

4. Test whether eman runs

    eman --man

4a. Possibly fix Perl dependencies

  4a1. Set up a local Perl repository.
       See http://stackoverflow.com/questions/2980297
       (you just need to use .bashrc, not .profile)

  4a2. Install the required packages

    cpanm YAML::XS

5. Start compiling Moses and GIZA

    cd playground/playground
    BJAMARGS=" --no-xmlrpc-c --max-kenlm-order=12 link=shared " \
      eman init mosesgiza --start

6. Find and watch the log in the new s.mosesgiza.* directory.

7. Get a parallel corpus from OPUS (http://opus.nlpl.eu/)
   - Pick a language pair, pick a corpus, download the "moses" format.
   - The corpus should have ~0.1M sentence pairs.
   - Amharic (am) -- English (en) Tanzil corpus is a good choice.

8. Get my gizawrapper to run giza:

    wget https://raw.githubusercontent.com/ufal/qtleap/master/cuni_train/bin/gizawrapper.pl
    chmod 755 gizawrapper.pl

9. Run gizawrapper + symal (takes ~20 min on Am-En):

    ./gizawrapper.pl \
      --bindir=/FULL/PATH/TO/playground/s.mosesgiza.????????.2018????-????/bin/ \
      am-en/Tanzil.am-en.am \
      am-en/Tanzil.am-en.en \
      --dirsym=left,right,int,union | gzip > am-en.ali.gz

10. Get my alignment viewer alitextview:

    wget http://ufal.mff.cuni.cz/~zeman/langtech/npfl120/alitextview.pl
    chmod 755 alitextview.pl

11. Observe the Alignment
    (columns 1 and 2 are the two sentences, column 5 is the intersection
    alignment, i.e. the third of left,right,int,union)

    paste am-en/Tanzil.am-en.am \
          am-en/Tanzil.am-en.en \
          <(zcat am-en.ali.gz) \
      | cut -f 1,2,5 | ./alitextview.pl | less

12. Improve the Alignment by Coarsening Tokens
    - e.g. lowercase and chop ("stem") words to 4 characters

    for f in Tanzil.am-en.??; do
      cat "$f" | ../playground/scripts/lowercase.pl \
        | ../playground/scripts/stem_factor.pl \
        > "$f.lcstem4"
    done

    ... and run the alignment again on the *.lcstem4 files.

13. Have a Look at the Moses Training Tutorial:
    http://www.statmt.org/moses/?n=Moses.Baseline

14. Construct the Target-Language Language Model
    (we will be translating from Amharic to English)

    playground/s.mosesgiza.????????.2018????-????/moses/bin/lmplz \
      -o 3 < am-en/Tanzil.am-en.en > en.Tanzil.lm

15. Construct the Phrase Table and Moses Model

    # The following command will run GIZA needlessly again (but we have not
    # saved the gdfa symmetrization, so we would have to construct it
    # ourselves from left and right; let's waste CPUs a little).
    PL=/FULL/PATH/TO/playground/playground/

    # First remove dangerous pairs from the corpus
    # (keep only sentence pairs with 1 to 80 tokens):
    $PL/s.mosesgiza.????????.????????-????/moses/scripts/training/clean-corpus-n.perl \
      am-en/Tanzil.am-en am en \
      am-en/Tanzil.am-en.cleaned 1 80
    # ... this will create two files: am-en/Tanzil.am-en.cleaned.{en,am}

    $PL/s.mosesgiza.????????.????????-????/moses/scripts/training/train-model.perl \
      --external-bin-dir=$PL/s.mosesgiza.????????.????????-????/bin/ \
      -corpus=am-en/Tanzil.am-en.cleaned \
      -f am -e en \
      -reordering msd-bidirectional-fe \
      -lm 0:3:$(pwd)/en.Tanzil.lm \
      -root-dir=train

    # In the first attempt, train-model.perl actually crashed for Ondrej
    # because he had not followed the training tutorial and had not selected
    # only the nice and clean sentences.

16. Have a look at the obtained phrase table
    ... it should be located in train/model/ttable*gz

17. *If* you had a development corpus, you also *should* do the tuning
    (running mert-moses.pl) according to the Moses Baseline tutorial.

18. Homework:
    - Visually compare the left, right and intersection alignments
      ... check in how many sentences you see the 'garbage alignments'
      that all fall onto one word.
    - Compare the intersection alignment for the baseline and improved
      alignments.
    - Write a small script that reads:
        1. source tokens
        2. target tokens
        3. alignment
      and emits all pairs of aligned words.
      If run through 'sort | uniq -c | sort -n', this would be a translation
      dictionary.
    - Continue the Moses tutorial to train a phrase-based model
      (apply mert-moses.pl).
    - Apply the trained model.
    - Compare the translations from the default run and from the run with
      these model flags: -dl=0 -max-phrase-length 1
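
For the homework script that emits aligned word pairs, a minimal sketch in Python might look as follows. It rests on assumptions that should be checked against your own data: the input is the three-column, tab-separated stream produced by the paste ... | cut -f 1,2,5 command from the "Observe the Alignment" step, and each alignment token has the form i-j with 0-based indices, source first. The file name alipairs.py and the function names are hypothetical, not part of Eman, Playground or Moses.

```python
#!/usr/bin/env python3
"""Emit aligned word pairs from lines of: source TAB target TAB alignment.

Assumption: alignment tokens look like "i-j" with 0-based token indices,
source index first. Verify this against the actual gizawrapper output.
"""
import sys


def aligned_pairs(source, target, alignment):
    """Yield (source_word, target_word) for every link in the alignment."""
    src = source.split()
    tgt = target.split()
    for link in alignment.split():
        i_str, sep, j_str = link.partition("-")
        # Skip tokens that are not of the assumed "i-j" form.
        if not sep or not i_str.isdigit() or not j_str.isdigit():
            continue
        i, j = int(i_str), int(j_str)
        # Skip links that point outside either sentence.
        if i < len(src) and j < len(tgt):
            yield src[i], tgt[j]


def main(paths):
    """Read each named file and print one aligned pair per line."""
    for path in paths:
        with open(path, encoding="utf-8") as stream:
            for line in stream:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 3:
                    continue  # malformed line, ignore it
                for s, t in aligned_pairs(fields[0], fields[1], fields[2]):
                    print(f"{s}\t{t}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

One possible way to run it, reusing the process substitution already used above:

    python3 alipairs.py <(paste am-en/Tanzil.am-en.am am-en/Tanzil.am-en.en \
        <(zcat am-en.ali.gz) | cut -f 1,2,5) | sort | uniq -c | sort -n

Reading from a named file rather than standard input is only a design choice of this sketch; a version that reads standard input would work equally well in a pipe.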