1. Install Eman (and within Eman, we will install Moses and GIZA++)

git clone https://redmine.ms.mff.cuni.cz/ufal-smt/eman.git

2. Make sure eman is in your PATH:

export PATH=$HOME/eman/bin/:$PATH
echo 'export PATH=$HOME/eman/bin/:$PATH' >> ~/.bashrc

3. Get SMT Playground

git clone https://redmine.ms.mff.cuni.cz/ufal-smt/playground.git

4. Test whether eman runs

eman --man

4a. Possibly fix Perl dependencies

4a1. Set up a local Perl repository.
     see http://stackoverflow.com/questions/2980297
     (you just need to use .bashrc, not .profile)
4a2. Install the required packages
     cpanm YAML::XS
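
A quick sanity check for steps 2 to 4a could look like the sketch below. The helper names have_tool and have_perl_module are made up for this illustration; they only wrap the standard "command -v" and "perl -MModule -e 1" idioms.

```shell
# Succeeds if the named program can be found on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Succeeds if Perl can load the named module.
have_perl_module() {
  perl -M"$1" -e 1 >/dev/null 2>&1
}

have_tool eman || echo "eman is not on PATH; revisit step 2" >&2
have_perl_module YAML::XS || echo "YAML::XS is missing; revisit step 4a" >&2
```

If neither message appears, eman is on PATH and YAML::XS loads.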

5. Start compiling Moses and GIZA

cd playground/playground
BJAMARGS=" --no-xmlrpc-c --max-kenlm-order=12 link=shared " \
  eman init mosesgiza --start

6. Find and watch the log in the new s.mosesgiza.* directory.
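
To locate the step directory without typing its timestamped name, something like the following may help. The helper newest_step_dir is hypothetical (not part of eman), and the log file pattern "log*" is an assumption about how eman names its logs; list the directory if nothing matches.

```shell
# Print the most recently modified directory whose name starts with the
# given prefix, e.g. "s.mosesgiza." (hypothetical helper, not part of eman).
newest_step_dir() {
  ls -dt "$1"*/ 2>/dev/null | head -n 1
}

step=$(newest_step_dir s.mosesgiza.)
if [ -n "$step" ]; then
  echo "Newest mosesgiza step: $step"
  # Follow the log as it grows (Ctrl-C to stop); uncomment to use.
  # The file name pattern log* is an assumption.
  # tail -f "$step"log*
else
  echo "No s.mosesgiza.* directory here; run this inside playground/playground" >&2
fi
```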

<moses and giza will compile for about 8 minutes>

7. Get a parallel corpus from OPUS (http://opus.nlpl.eu/)

- Pick a language pair, pick a corpus, download the "moses" format.
  - The corpus should have ~0.1M sentence pairs.
  - Amharic (am) -- English (en) Tanzil corpus is a good choice.
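
Before aligning, it is worth checking that both sides of the corpus have the same number of lines and that the size is in the suggested range. A minimal sketch, where count_pairs is a hypothetical helper and the file names are the ones used in the later steps:

```shell
# Print the number of sentence pairs, or fail if the two sides differ in length.
count_pairs() {
  # The arithmetic expansion strips the padding some wc implementations add.
  src_lines=$(($(wc -l < "$1")))
  tgt_lines=$(($(wc -l < "$2")))
  if [ "$src_lines" -ne "$tgt_lines" ]; then
    echo "line counts differ: $src_lines vs $tgt_lines" >&2
    return 1
  fi
  echo "$src_lines"
}

# Example call:
# count_pairs am-en/Tanzil.am-en.am am-en/Tanzil.am-en.en
```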

8. Get my gizawrapper to run giza:

wget https://raw.githubusercontent.com/ufal/qtleap/master/cuni_train/bin/gizawrapper.pl
chmod 755 gizawrapper.pl

9. Run gizawrapper + symal (takes ~20 min on Am-En):

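# The shell does not expand wildcards inside --bindir=..., so resolve the
# directory first (this assumes there is exactly one mosesgiza step).
MOSESGIZA=$(echo /FULL/PATH/TO/playground/s.mosesgiza.????????.????????-????)
./gizawrapper.pl \
  --bindir=$MOSESGIZA/bin/ \
  am-en/Tanzil.am-en.am \
  am-en/Tanzil.am-en.en \
  --dirsym=left,right,int,union | gzip > am-en.ali.gz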


10. Get my alignment viewer alitextview:

wget http://ufal.mff.cuni.cz/~zeman/langtech/npfl120/alitextview.pl
chmod 755 alitextview.pl

11. Observe the Alignment

paste am-en/Tanzil.am-en.am \
      am-en/Tanzil.am-en.en \
      <(zcat am-en.ali.gz ) \
| cut -f 1,2,5 | ./alitextview.pl | less
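
The "cut -f 1,2,5" above keeps the source, the target and one alignment column. Assuming gizawrapper writes one column per --dirsym entry in the order given (left,right,int,union), these land in columns 3 to 6 after paste, so column 5 would be the intersection. The hypothetical helper below maps a symmetrization name to its column under that assumption:

```shell
# Map a symmetrization name to its column in the pasted stream.
# Assumes the column order follows --dirsym=left,right,int,union.
ali_column() {
  case "$1" in
    left)  echo 3 ;;
    right) echo 4 ;;
    int)   echo 5 ;;
    union) echo 6 ;;
    *) echo "unknown symmetrization: $1" >&2; return 1 ;;
  esac
}

# Example (bash, because of the <( ) process substitution):
# paste am-en/Tanzil.am-en.am am-en/Tanzil.am-en.en <(zcat am-en.ali.gz) \
#   | cut -f "1,2,$(ali_column left)" | ./alitextview.pl | less
```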

12. Improve the Alignment by Coarsening Tokens

- e.g. lowercase and chop ("stem") words to 4 characters

# run this inside the am-en/ directory
for f in Tanzil.am-en.??; do
  cat "$f" \
  | ../playground/scripts/lowercase.pl \
  | ../playground/scripts/stem_factor.pl \
  > "$f.lcstem4"
done

... and run alignment again
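
To see what this kind of coarsening does to a sentence, here is a rough stand-in for the two scripts. It is an illustration only, not their actual implementation: the function name coarsen is made up, and tr and awk may count bytes rather than characters, so it is only reliable on ASCII text (use the playground scripts for Amharic).

```shell
# Lowercase every token and keep at most its first 4 characters.
# ASCII-only illustration; not a replacement for lowercase.pl / stem_factor.pl.
coarsen() {
  tr '[:upper:]' '[:lower:]' \
  | awk '{ for (i = 1; i <= NF; i++) $i = substr($i, 1, 4); print }'
}

echo "The Merciful Compassionate" | coarsen
# prints: the merc comp
```

Coarser tokens are more frequent, which gives the aligner more evidence per word type.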


13. Have a Look at the Moses Training Tutorial:

  http://www.statmt.org/moses/?n=Moses.Baseline

14. Construct Target-Language Language Model
(we will be translating from Amharic to English)

playground/s.mosesgiza.????????.????????-????/moses/bin/lmplz \
  -o 3 < am-en/Tanzil.am-en.en > en.Tanzil.lm

15. Construct Phrase Table and Moses Model

# The following command needlessly runs GIZA again. We have not saved the gdfa
# symmetrization, so we would otherwise have to construct it ourselves from
# left and right; let's waste a little CPU time instead.
PL=/FULL/PATH/TO/playground/playground
# Resolve the step directory once: the shell does not expand wildcards inside
# --option=... arguments (this assumes there is exactly one mosesgiza step).
MG=$(echo $PL/s.mosesgiza.????????.????????-????)

# first remove dangerous pairs from the corpus
# (pairs with an empty side or a side longer than 80 tokens)
$MG/moses/scripts/training/clean-corpus-n.perl \
  am-en/Tanzil.am-en am en \
  am-en/Tanzil.am-en.cleaned 1 80
# ... this will create two files: am-en/Tanzil.am-en.cleaned.{en,am}

$MG/moses/scripts/training/train-model.perl \
  --external-bin-dir=$MG/bin/ \
  -corpus=am-en/Tanzil.am-en.cleaned \
  -f am -e en \
  -reordering msd-bidirectional-fe \
  -lm 0:3:$(pwd)/en.Tanzil.lm \
  -root-dir=train

# In the first attempt, train-model.perl actually crashed for Ondrej because he
# had not followed the training tutorial and had not selected only the nice
# and clean sentences.

16. Have a look at the obtained phrase table

...it should be located in train/model/ (look for phrase-table*gz or ttable*gz)
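
A Moses phrase table is a gzipped text file with one phrase pair per line and fields separated by " ||| " (source, target, scores, and further fields). A minimal sketch for looking up one source phrase follows; lookup_phrase is a hypothetical helper, and the path in the example call is a guess that you should adjust to whatever train/model/ actually contains.

```shell
# Print all phrase-table entries whose source side equals the given phrase.
# $1: gzipped phrase table, $2: source phrase (tokens separated by single spaces)
lookup_phrase() {
  # Appending "" forces a string comparison even for numeric-looking phrases.
  gzip -dc "$1" | awk -F' [|][|][|] ' -v p="$2" '($1 "") == (p "")'
}

# Example call (file name is an assumption):
# lookup_phrase train/model/phrase-table.gz 'SOME SOURCE PHRASE' | head
```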

17. *If* you had a development corpus, you *should* also do the tuning (running
mert-moses.pl) according to the Moses Baseline tutorial.

18. Homework:

- Visually compare the left, right and intersection alignments
  ... check in how many sentences you see the 'garbage alignments' that all
      fall onto one word
- Compare the intersection alignment for the baseline and improved alignments.
- Write a small script that reads:
  1. source tokens
  2. target tokens
  3. alignment
  and emits all pairs of aligned words.
  If run through 'sort | uniq -c | sort -n', this would be a translation
  dictionary.
- Continue the Moses tutorial to tune the phrase-based model (apply
  mert-moses.pl).
- Apply the trained model.
- Compare the translations from the default run and from the run with these
  model flags:
      -dl=0 -max-phrase-length 1
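
As a starting point for the script that emits aligned word pairs, here is one possible sketch. It assumes that each input line holds source tokens, target tokens and the alignment separated by tabs, and that the alignment is a list of zero-based "i-j" pairs (source index, then target index); check this against the actual gizawrapper output. The function name aligned_pairs is made up.

```shell
# Read "source<TAB>target<TAB>alignment" lines and print one
# "source_word<TAB>target_word" line per alignment point.
# Assumes zero-based i-j points, source index first.
aligned_pairs() {
  awk -F'\t' '{
    ns = split($1, src, " ")
    nt = split($2, tgt, " ")
    na = split($3, ali, " ")
    for (k = 1; k <= na; k++) {
      split(ali[k], ij, "-")
      i = ij[1] + 1   # awk arrays from split() start at 1
      j = ij[2] + 1
      if (i <= ns && j <= nt) print src[i] "\t" tgt[j]
    }
  }'
}

# Example (bash, because of the <( ) process substitution):
# paste am-en/Tanzil.am-en.am am-en/Tanzil.am-en.en <(zcat am-en.ali.gz) \
#   | cut -f 1,2,5 | aligned_pairs | sort | uniq -c | sort -n
```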