Shared Task: Baseline System
June 23, in conjunction with ACL 2007 in Prague, Czech Republic
Using the open source Moses toolkit, it is possible to build a baseline system that is on par with the best submission from last year's workshop.
What follows are step-by-step instructions. The list may look long at first glance, but it should make the process of building a machine translation system, including all its components and its tuning, testing, and evaluation, transparent.
Install Moses Support Libraries
- Create some workspace directory for all this work and enter it.
- Download SRILM and install it.
- Download GIZA++ and mkcls. The original versions don't compile on modern systems. Try Chris Dyer's patched versions instead.
cd GIZA++-v2/
make
make snt2cooc.out
cd ../mkcls-v2/
make
- Copy GIZA++ and mkcls to a bin location for Moses Scripts
mkdir -p bin
cp GIZA++-v2/GIZA++ bin/
cp GIZA++-v2/snt2cooc.out bin/
cp mkcls-v2/mkcls bin/
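Before moving on, it can help to confirm that all three binaries actually landed in bin/. The following is a minimal sketch, not part of the Moses distribution; the function name check_aligner_bins is made up for illustration.

```shell
# Hypothetical helper (not shipped with Moses or GIZA++):
# report every aligner binary that is missing from a bin directory.
check_aligner_bins() {
    bin_dir="$1"
    missing=0
    for tool in GIZA++ snt2cooc.out mkcls; do
        if [ ! -x "$bin_dir/$tool" ]; then
            echo "missing or not executable: $bin_dir/$tool" >&2
            missing=1
        fi
    done
    # Exit status 0 only if all three binaries were found.
    return "$missing"
}
```

Called from the workspace directory as check_aligner_bins bin, it prints nothing and succeeds when the copy step above worked.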
Install Moses
- Check out the Moses code via Subversion:
mkdir -p moses
svn co https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk moses
- Compile Moses
cd moses
./regenerate-makefiles.sh
./configure --with-srilm=/path-to-srilm
make -j 4
Install Moses Scripts
- Compile Moses Scripts
The support scripts used by Moses are "released" by a Makefile which edits their paths to match your local environment. First, you need to edit the Makefile definition of two variables:
mkdir -p bin/moses-scripts
###Edit moses/scripts/Makefile
TARGETDIR=/full-path-to-workspace/bin/moses-scripts
BINDIR=/full-path-to-workspace/bin
###
cd moses/scripts/
make release
This will create a folder named bin/moses-scripts/scripts-YYYYMMDD-HHMM with released versions of all the scripts. You will call these versions when training/tuning Moses.
Moses scripts also require a SCRIPTS_ROOTDIR environment variable to be set. The output of make release should indicate this.
export SCRIPTS_ROOTDIR=/full-path-to-workspace/bin/moses-scripts/scripts-YYYYMMDD-HHMM
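If make release is run more than once, several scripts-YYYYMMDD-HHMM folders accumulate. The sketch below picks the most recent one, assuming that the timestamp in the folder name sorts in chronological order. The function name latest_scripts_dir is hypothetical and not part of Moses.

```shell
# Hypothetical helper: print the newest scripts-YYYYMMDD-HHMM folder
# under the given release directory. Relies on the shell expanding the
# glob in sorted order, so the last match carries the latest timestamp.
latest_scripts_dir() {
    base="$1"
    latest=""
    for dir in "$base"/scripts-*/; do
        [ -d "$dir" ] || continue   # skip the unexpanded pattern if nothing matches
        latest="${dir%/}"           # drop the trailing slash
    done
    if [ -z "$latest" ]; then
        echo "no scripts-* release found under $base" >&2
        return 1
    fi
    printf '%s\n' "$latest"
}
```

It could then replace the hand-typed path: export SCRIPTS_ROOTDIR=$(latest_scripts_dir /full-path-to-workspace/bin/moses-scripts)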
Install Additional Scripts
- Download scripts.tgz and extract them:
tar xzf scripts.tgz
These scripts include:
- Tokenizer
scripts/tokenizer.perl
- Lowercaser
scripts/lowercase.perl
- SGML-Wrapper
scripts/wrap-xml.perl
- Download the NIST BLEU scoring tool:
wget ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
Prepare Data
- Tokenize training data
mkdir -p working-dir/corpus
scripts/tokenizer.perl -l fr < wmt07/training/europarl-v3.fr-en.fr > working-dir/corpus/europarl.tok.fr
scripts/tokenizer.perl -l en < wmt07/training/europarl-v3.fr-en.en > working-dir/corpus/europarl.tok.en
- Filter out long sentences
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/clean-corpus-n.perl working-dir/corpus/europarl.tok fr en working-dir/corpus/europarl.clean 1 40
- Lowercase training data
scripts/lowercase.perl < working-dir/corpus/europarl.clean.fr > working-dir/corpus/europarl.lowercased.fr
scripts/lowercase.perl < working-dir/corpus/europarl.clean.en > working-dir/corpus/europarl.lowercased.en
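A parallel corpus is only usable if both sides stay aligned line by line, so a quick check after the steps above can save a failed training run. This is a minimal sketch; the function names same_line_count and max_tokens are made up here, and tokens are counted as whitespace-separated fields, which may differ slightly from what clean-corpus-n.perl counts.

```shell
# Hypothetical helper: succeed only if both files have the same number of lines.
same_line_count() {
    n_first=$(($(wc -l < "$1")))
    n_second=$(($(wc -l < "$2")))
    [ "$n_first" -eq "$n_second" ]
}

# Hypothetical helper: print the length, in whitespace-separated tokens,
# of the longest line in a file (0 for an empty file).
max_tokens() {
    awk 'NF > max { max = NF } END { print max + 0 }' "$1"
}
```

For example, same_line_count working-dir/corpus/europarl.lowercased.fr working-dir/corpus/europarl.lowercased.en should succeed, and max_tokens on either file should report no more than the upper limit passed to the filtering step.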
Build Language Model
- Tokenize English language model data
mkdir -p working-dir/lm
scripts/tokenizer.perl -l en < wmt07/training/europarl-v3.en > working-dir/lm/europarl.tok
- Lowercase language model data
scripts/lowercase.perl < working-dir/lm/europarl.tok > working-dir/lm/europarl.lowercased
- Use SRILM to build language model
SRILM creates a platform-specific folder within its bin directory; this instruction assumes i686.
/path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount -text working-dir/lm/europarl.lowercased -lm working-dir/lm/europarl.lm
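On machines where the platform folder is not i686, the ngram-count binary can be located by searching the subfolders of SRILM's bin directory. The following is a sketch under that assumption; find_ngram_count is a hypothetical name, not an SRILM tool.

```shell
# Hypothetical helper: print the path of the first executable ngram-count
# found in any platform-specific subfolder of <srilm-root>/bin.
find_ngram_count() {
    srilm_root="$1"
    for candidate in "$srilm_root"/bin/*/ngram-count; do
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no ngram-count found under $srilm_root/bin" >&2
    return 1
}
```

The result can stand in for the hard-coded path, for instance NGRAM_COUNT=$(find_ngram_count /path-to-srilm), followed by the same options as in the command above.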
Train Model
- Run training script:
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-factored-phrase-model.perl -scripts-root-dir bin/moses-scripts/scripts-YYYYMMDD-HHMM -root-dir working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:5:working-dir/lm/europarl.lm:0
Tuning (i.e., Optimize System Component Weights, a.k.a. Minimum Error Rate Training)
- Tokenize tuning sets
mkdir -p working-dir/tuning
scripts/tokenizer.perl -l fr < wmt07/dev/dev2006.fr > working-dir/tuning/input.tok
scripts/tokenizer.perl -l en < wmt07/dev/dev2006.en > working-dir/tuning/reference.tok
- Lowercase tuning sets
scripts/lowercase.perl < working-dir/tuning/input.tok > working-dir/tuning/input
scripts/lowercase.perl < working-dir/tuning/reference.tok > working-dir/tuning/reference
- Run tuning script
Note that this step can take many hours, even days, to run.
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/mert-moses.pl working-dir/tuning/input working-dir/tuning/reference moses/moses-cmd/src/moses working-dir/model/moses.ini --working-dir working-dir/tuning --rootdir bin/moses-scripts/scripts-YYYYMMDD-HHMM
- Insert weights into configuration file
scripts/reuse-weights.perl working-dir/tuning/moses.ini < working-dir/model/moses.ini > working-dir/tuning/moses.weight-reused.ini
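To see what tuning changed, the weights in the untuned and the tuned configuration files can be compared section by section. The sketch below assumes that moses.ini stores each group of weights under a bracketed header such as [weight-l], one value per line; check this against your own file. The function name ini_section is made up for illustration.

```shell
# Hypothetical helper: print the non-blank lines that follow the header
# [section] in an ini-style file, up to the next bracketed header.
ini_section() {
    awk -v want="[$2]" '
        $0 == want { inside = 1; next }   # start after the requested header
        /^\[/      { inside = 0 }         # any other header ends the section
        inside && NF { print }            # print values, skip blank lines
    ' "$1"
}
```

For example, compare the output of ini_section working-dir/model/moses.ini weight-l with that of ini_section working-dir/tuning/moses.weight-reused.ini weight-l.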
Run System on Development Test Set
- Tokenize test set
mkdir -p working-dir/evaluation
scripts/tokenizer.perl -l fr < wmt07/devtest/devtest2006.fr > working-dir/evaluation/devtest2006.input.tok
scripts/tokenizer.perl -l en < wmt07/devtest/devtest2006.en > working-dir/evaluation/devtest2006.reference.tok
- Lowercase test set
scripts/lowercase.perl < working-dir/evaluation/devtest2006.input.tok > working-dir/evaluation/devtest2006.input
scripts/lowercase.perl < working-dir/evaluation/devtest2006.reference.tok > working-dir/evaluation/devtest2006.reference
- Filter the model to fit into memory
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/filter-model-given-input.pl working-dir/evaluation/filtered.devtest2006 working-dir/tuning/moses.weight-reused.ini working-dir/evaluation/devtest2006.input
- Decode with Moses
moses/moses-cmd/src/moses -config working-dir/evaluation/filtered.devtest2006/moses.ini -input-file working-dir/evaluation/devtest2006.input > working-dir/evaluation/devtest2006.output
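The later evaluation steps expect one output line per input sentence, so it is worth checking the decoder output before scoring it. This is a minimal sketch; the function names count_blank_lines and check_decoder_output are hypothetical.

```shell
# Hypothetical helper: print how many lines of a file are empty
# or contain only whitespace.
count_blank_lines() {
    awk 'NF == 0 { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: succeed only if the output has as many lines as
# the input; also report how many output lines are blank.
check_decoder_output() {
    n_in=$(($(wc -l < "$1")))
    n_out=$(($(wc -l < "$2")))
    echo "input lines: $n_in, output lines: $n_out, blank output lines: $(count_blank_lines "$2")"
    [ "$n_in" -eq "$n_out" ]
}
```

Usage: check_decoder_output working-dir/evaluation/devtest2006.input working-dir/evaluation/devtest2006.output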
Evaluation
- Train recaser
bin/moses-scripts/scripts-YYYYMMDD-HHMM/recaser/train-recaser.perl -train-script bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-factored-phrase-model.perl -ngram-count /path-to-srilm/bin/i686/ngram-count -corpus working-dir/lm/europarl.tok -dir recaser
- Recase the output
bin/moses-scripts/scripts-YYYYMMDD-HHMM/recaser/recase.perl -model recaser/moses.ini -in working-dir/evaluation/devtest2006.output -moses moses/moses-cmd/src/moses > working-dir/evaluation/devtest2006.output.recased
- Detokenize the output
scripts/detokenizer.perl -l en < working-dir/evaluation/devtest2006.output.recased > working-dir/evaluation/devtest2006.output.detokenized
- Wrap the output in SGML
scripts/wrap-xml.perl wmt07/devtest/test2006-ref.en.sgm en < working-dir/evaluation/devtest2006.output.detokenized > working-dir/evaluation/devtest2006.output.sgm
- Score with NIST BLEU scoring tool
perl mteval-v11b.pl -r wmt07/devtest/test2006-ref.en.sgm -t working-dir/evaluation/devtest2006.output.sgm -s wmt07/devtest/test2006-src.fr.sgm -c