Workshop Shared Task: Statistical Machine Translation

ACL 2007 SECOND WORKSHOP
ON STATISTICAL MACHINE TRANSLATION

Shared Task: Baseline System

June 23, in conjunction with ACL 2007 in Prague, Czech Republic

Using the open-source Moses toolkit, it is possible to build a baseline system that is on par with the best submission from last year's workshop. What follows are step-by-step instructions. This may look like a long list at first glance, but it should make the process of building a machine translation system and all its components, as well as its tuning, testing, and evaluation, transparent.

Install Moses Support Libraries

Create some workspace directory for all this work and enter it.
Download SRILM and install it.
Download GIZA++ and mkcls. The original versions don't compile on modern systems; try Chris Dyer's patched versions instead.
cd GIZA++-v2/
make
make snt2cooc.out
cd ../mkcls-v2/
make
cd ..
Copy GIZA++ and mkcls to a bin location for Moses Scripts
mkdir -p bin
cp GIZA++-v2/GIZA++ bin/
cp GIZA++-v2/snt2cooc.out bin/
cp mkcls-v2/mkcls bin/
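Before moving on, it may be worth confirming that the three binaries actually landed in bin/ and are executable. The following is a minimal sketch; the function name missing_tools is hypothetical and not part of Moses or GIZA++.

```shell
#!/bin/sh
# Hypothetical helper (not part of Moses): print the name of every
# alignment tool that is missing or not executable in the given directory.
missing_tools() {
    bin_dir=$1
    for tool in GIZA++ snt2cooc.out mkcls; do
        [ -x "$bin_dir/$tool" ] || echo "$tool"
    done
}
```

Running missing_tools bin from the workspace directory should print nothing if all three copies succeeded.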

Install Moses

Check out the Moses code via Subversion:
mkdir -p moses
svn co https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk moses
Compile Moses
cd moses
./regenerate-makefiles.sh
./configure --with-srilm=/path-to-srilm
make -j 4
cd ..

Install Moses Scripts

Compile Moses Scripts
The support scripts used by Moses are "released" by a Makefile which edits their paths to match your local environment. First, you need to edit the Makefile definition of two variables:
mkdir -p bin/moses-scripts
### Edit moses/scripts/Makefile
TARGETDIR=/full-path-to-workspace/bin/moses-scripts
BINDIR=/full-path-to-workspace/bin
###
cd moses/scripts/
make release
cd ../..
This will create a folder named bin/moses-scripts/scripts-YYYYMMDD-HHMM with released versions of all the scripts. You will call these versions when training/tuning Moses.
Moses scripts also require a SCRIPTS_ROOTDIR environment variable to be set. The output of make release should indicate this.
export SCRIPTS_ROOTDIR=/full-path-to-workspace/bin/moses-scripts/scripts-YYYYMMDD-HHMM
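Because each run of make release creates a new timestamped folder, it can be convenient to look up the newest one rather than typing the timestamp by hand. The sketch below assumes the scripts-YYYYMMDD-HHMM naming described above; the function name latest_scripts_dir is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of Moses): print the newest
# scripts-YYYYMMDD-HHMM directory under the given release folder.
# The timestamp format sorts lexicographically, so the last entry
# in sorted order is the most recent release.
latest_scripts_dir() {
    release_root=$1
    ls -d "$release_root"/scripts-* 2>/dev/null | sort | tail -n 1
}
```

It could then be used as export SCRIPTS_ROOTDIR="$(latest_scripts_dir /full-path-to-workspace/bin/moses-scripts)".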

Install Additional Scripts

Download scripts.tgz and extract them:
tar xzf scripts.tgz
These scripts include:
- Tokenizer scripts/tokenizer.perl
- Lowercaser scripts/lowercase.perl
- SGML-Wrapper scripts/wrap-xml.perl
Download the NIST BLEU scoring tool:
wget ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl

Prepare Data

Tokenize training data
mkdir -p working-dir/corpus
scripts/tokenizer.perl -l fr < wmt07/training/europarl-v3.fr-en.fr > working-dir/corpus/europarl.tok.fr
scripts/tokenizer.perl -l en < wmt07/training/europarl-v3.fr-en.en > working-dir/corpus/europarl.tok.en
Filter out long sentences
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/clean-corpus-n.perl working-dir/corpus/europarl.tok fr en working-dir/corpus/europarl.clean 1 40
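In the command above, the trailing 1 and 40 are the minimum and maximum sentence length in tokens. To illustrate the idea only, here is a simplified sketch that keeps a sentence pair when both sides fall inside that range. It is not a replacement for clean-corpus-n.perl, which may apply further checks not reproduced here; the function name filter_by_length is hypothetical, and the sketch assumes sentences contain no tab characters.

```shell
#!/bin/sh
# Simplified illustration of length-based corpus filtering.
# Usage: filter_by_length MIN MAX SRC TGT OUT_SRC OUT_TGT
# Keeps line i of SRC and TGT only if both have MIN..MAX tokens.
filter_by_length() {
    min=$1; max=$2; src=$3; tgt=$4; out_src=$5; out_tgt=$6
    # Make sure both output files exist even if no pair survives.
    : > "$out_src"
    : > "$out_tgt"
    # paste joins line i of both files with a tab between them.
    paste "$src" "$tgt" | awk -F '\t' \
        -v min="$min" -v max="$max" -v os="$out_src" -v ot="$out_tgt" '
        {
            ns = split($1, a, " ")   # token count, source side
            nt = split($2, b, " ")   # token count, target side
            if (ns >= min && ns <= max && nt >= min && nt <= max) {
                print $1 > os
                print $2 > ot
            }
        }'
}
```

Filtering both files in one pass matters: dropping a line from only one side would break the sentence alignment that training relies on.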
Lowercase training data
scripts/lowercase.perl < working-dir/corpus/europarl.clean.fr > working-dir/corpus/europarl.lowercased.fr
scripts/lowercase.perl < working-dir/corpus/europarl.clean.en > working-dir/corpus/europarl.lowercased.en
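The training step expects the two sides of the corpus to be sentence-aligned, so a quick sanity check after preprocessing is to confirm that both files have the same number of lines. A minimal sketch, with the hypothetical function name same_line_count:

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) only if both files
# contain the same number of lines.
same_line_count() {
    # tr strips the padding that some wc implementations add.
    left=$(wc -l < "$1" | tr -d ' ')
    right=$(wc -l < "$2" | tr -d ' ')
    [ "$left" -eq "$right" ]
}
```

For example: same_line_count working-dir/corpus/europarl.lowercased.fr working-dir/corpus/europarl.lowercased.en || echo "line counts differ".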

Build Language Model

Tokenize English language model data
mkdir -p working-dir/lm
scripts/tokenizer.perl -l en < wmt07/training/europarl-v3.en > working-dir/lm/europarl.tok
Lowercase language model data
scripts/lowercase.perl < working-dir/lm/europarl.tok > working-dir/lm/europarl.lowercased
Use SRILM to build language model
SRILM creates a platform-specific folder within its bin directory; the command below assumes i686.
/path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount -text working-dir/lm/europarl.lowercased -lm working-dir/lm/europarl.lm
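The resulting file is written in the ARPA text format, whose header (after a \data\ line) lists the number of n-grams for each order. One way to confirm that a 5-gram model was really produced is to read the highest order from that header. The sketch below assumes an uncompressed ARPA file; the function name arpa_max_order is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the highest n-gram order declared in the
# header of an uncompressed ARPA language model file (0 if none found).
arpa_max_order() {
    awk '
        /^\\data\\/ { in_header = 1; next }
        in_header && /^ngram [0-9]+=/ {
            # Header lines look like "ngram 3=12345".
            split($2, parts, "=")
            if (parts[1] + 0 > max) max = parts[1] + 0
            next
        }
        # Stop at the first n-gram section, e.g. "\1-grams:".
        in_header && /^\\[0-9]+-grams:/ { exit }
        END { print max + 0 }
    ' "$1"
}
```

With the command above, arpa_max_order working-dir/lm/europarl.lm would be expected to report the order passed to -order.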

Train Model

Run training script:
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-factored-phrase-model.perl -scripts-root-dir bin/moses-scripts/scripts-YYYYMMDD-HHMM -root-dir working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:5:working-dir/lm/europarl.lm:0
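The -lm argument packs several settings into one colon-separated string. As far as the Moses training documentation describes it, the fields are factor, n-gram order, file name, and language model type (with 0 denoting SRILM); treat that reading as an assumption to verify against your Moses version. The sketch below simply splits such a string for inspection; the function name describe_lm_spec is hypothetical, and file names containing a colon are not handled.

```shell
#!/bin/sh
# Hypothetical helper: split a colon-separated -lm specification
# into its fields and print them with labels.
describe_lm_spec() {
    IFS=: read -r factor order file type <<EOF
$1
EOF
    echo "factor=$factor order=$order file=$file type=${type:-unspecified}"
}
```

Calling describe_lm_spec 0:5:working-dir/lm/europarl.lm:0 makes it easy to see that the order field should match the -order value used with ngram-count.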

Tuning (i.e., Optimize System Component Weights, a.k.a. Minimum Error Rate Training)

Tokenize tuning sets
mkdir -p working-dir/tuning
scripts/tokenizer.perl -l fr < wmt07/dev/dev2006.fr > working-dir/tuning/input.tok
scripts/tokenizer.perl -l en < wmt07/dev/dev2006.en > working-dir/tuning/reference.tok
Lowercase tuning sets
scripts/lowercase.perl < working-dir/tuning/input.tok > working-dir/tuning/input
scripts/lowercase.perl < working-dir/tuning/reference.tok > working-dir/tuning/reference
Run tuning script
Note that this step can take many hours, even days, to run.
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/mert-moses.pl working-dir/tuning/input working-dir/tuning/reference moses/moses-cmd/src/moses working-dir/model/moses.ini --working-dir working-dir/tuning --rootdir bin/moses-scripts/scripts-YYYYMMDD-HHMM
Insert weights into configuration file
scripts/reuse-weights.perl working-dir/tuning/moses.ini < working-dir/model/moses.ini > working-dir/tuning/moses.weight-reused.ini

Run System on Development Test Set

Tokenize test set
mkdir -p working-dir/evaluation
scripts/tokenizer.perl -l fr < wmt07/devtest/devtest2006.fr > working-dir/evaluation/devtest2006.input.tok
scripts/tokenizer.perl -l en < wmt07/devtest/devtest2006.en > working-dir/evaluation/devtest2006.reference.tok
Lowercase test set
scripts/lowercase.perl < working-dir/evaluation/devtest2006.input.tok > working-dir/evaluation/devtest2006.input
scripts/lowercase.perl < working-dir/evaluation/devtest2006.reference.tok > working-dir/evaluation/devtest2006.reference
Filter the model to fit into memory
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/filter-model-given-input.pl working-dir/evaluation/filtered.devtest2006 working-dir/tuning/moses.weight-reused.ini working-dir/evaluation/devtest2006.input
Decode with Moses
moses/moses-cmd/src/moses -config working-dir/evaluation/filtered.devtest2006/moses.ini -input-file working-dir/evaluation/devtest2006.input > working-dir/evaluation/devtest2006.output
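The filtering step in this section shrinks the model by keeping only the entries that could be used on the given input. The sketch below illustrates that idea on a phrase table and is not the logic of filter-model-given-input.pl itself. It assumes a plain-text table with lines of the form "source ||| target ||| scores" and keeps a line when its source phrase occurs as a word sequence in some input sentence. The function name filter_phrase_table is hypothetical, and the nested loop is far too slow for a real table.

```shell
#!/bin/sh
# Simplified illustration of input-based phrase table filtering.
# Usage: filter_phrase_table INPUT_FILE PHRASE_TABLE
filter_phrase_table() {
    input=$1; table=$2
    awk '
        # First file: remember every input sentence, padded with spaces
        # so that phrases only match on whole-word boundaries.
        NR == FNR { sentences[NR] = " " $0 " "; n = NR; next }
        # Second file: keep entries whose source phrase occurs in the input.
        {
            split($0, fields, " [|][|][|] ")
            needle = " " fields[1] " "
            for (i = 1; i <= n; i++) {
                if (index(sentences[i], needle) > 0) { print; break }
            }
        }
    ' "$input" "$table"
}
```

Because the filtered model only covers one input file, a new filtered copy is needed for every test set that is decoded.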

Evaluation

Train recaser
bin/moses-scripts/scripts-YYYYMMDD-HHMM/recaser/train-recaser.perl -train-script bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-factored-phrase-model.perl -ngram-count /path-to-srilm/bin/i686/ngram-count -corpus working-dir/lm/europarl.tok -dir recaser
Recase the output
bin/moses-scripts/scripts-YYYYMMDD-HHMM/recaser/recase.perl -model recaser/moses.ini -in working-dir/evaluation/devtest2006.output -moses moses/moses-cmd/src/moses > working-dir/evaluation/devtest2006.output.recased
Detokenize the output
scripts/detokenizer.perl -l en < working-dir/evaluation/devtest2006.output.recased > working-dir/evaluation/devtest2006.output.detokenized
Wrap the output in SGML
scripts/wrap-xml.perl wmt07/devtest/test2006-ref.en.sgm en < working-dir/evaluation/devtest2006.output.detokenized > working-dir/evaluation/devtest2006.output.sgm
Score with NIST BLEU scoring tool
mteval-v11b.pl -r wmt07/devtest/test2006-ref.en.sgm -t working-dir/evaluation/devtest2006.output.sgm -s wmt07/devtest/test2006-src.fr.sgm -c
