Prospector
- Automatic search in experiment configurations.
- Several available search algorithms.
- Uses eman to create/clone/manage individual experiments.
- Independent of the machine translation playground.
- Already usable but still under development.
Configuration
In order to run Prospector, the user has to create a configuration directory containing the following files:
- rules
- vars
- traceback
- score
The file score is an executable that produces a number corresponding to the score or "fitness" of a given final eman step.
Apart from these files, the directory also has to contain a subdirectory chunks with parts of eman tracebacks.
Traceback
This file defines how the individual chunks should be combined:
eval mert tm align lm
The lines correspond to names of files in the directory chunks, and
indentation defines which steps depend on each other. Multiple eman steps can be
defined in one chunk file. The traceback file exists to make the configuration
more flexible. In the future, we would like to extend it to allow
alternative/conditional definitions.
The chunks can contain variable slots, which are names surrounded by '#' signs:
+- s.align.e357fb70.20120221-1115
| | ALILABEL=en-#SRCFACTOR#-cs-#TGTFACTOR#
| | ALISYMS=gdfa
| | CORPUS=czeng-news
| | GIZASTEP=s.mosesgiza.fcfbe812.20120221-1114
| | SRCALIAUG=en+#SRCFACTOR#
| | TGTALIAUG=cs+#TGTFACTOR#
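The slot mechanism can be illustrated with a minimal Python sketch. It is not Prospector's own code: the function name fill_slots, the assumption that slot names consist of letters, digits and underscores, and the values "lemma" and "form" are all hypothetical and chosen only for illustration.

```python
import re

# Assumption: a slot is a name made of letters, digits or underscores
# enclosed in '#' signs, e.g. #SRCFACTOR#.
SLOT = re.compile(r"#([A-Za-z0-9_]+)#")


def fill_slots(chunk: str, values: dict) -> str:
    """Return the chunk text with every #NAME# slot replaced by values[NAME]."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value given for slot {name}")
        return values[name]

    return SLOT.sub(substitute, chunk)


# One line of the chunk shown above, filled with hypothetical values.
line = "ALILABEL=en-#SRCFACTOR#-cs-#TGTFACTOR#"
print(fill_slots(line, {"SRCFACTOR": "lemma", "TGTFACTOR": "form"}))
# prints: ALILABEL=en-lemma-cs-form
```

Raising an error for a slot without a value is a design choice of this sketch; it makes a mismatch between slot names and variable names visible early.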
Variables
Each line of this file defines a variable and its values. The first column is the variable name (must match the slot name in the traceback) and the second column (separated by any number of spaces) contains possible values for that variable, separated by commas.
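As a rough illustration of this format, the following Python sketch parses such a file and enumerates every combination of values, which is what an exhaustive search would have to cover. The function names and the example values (form, lemma, tag) are assumptions made for this example; only the slot names SRCFACTOR and TGTFACTOR come from the chunk shown earlier.

```python
import itertools


def parse_vars(text: str) -> dict:
    """Parse lines of the form 'NAME  value1,value2,...' into a dict of lists."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The first column is the name; the rest (after any run of
        # whitespace) is the comma-separated list of values.
        name, values = line.split(None, 1)
        variables[name] = [v.strip() for v in values.split(",")]
    return variables


def all_configurations(variables: dict):
    """Yield every assignment of one value to each variable."""
    names = list(variables)
    for combo in itertools.product(*(variables[n] for n in names)):
        yield dict(zip(names, combo))


# Hypothetical vars file content.
example = """\
SRCFACTOR   form,lemma
TGTFACTOR   form,lemma,tag
"""
variables = parse_vars(example)
configs = list(all_configurations(variables))
print(len(configs))  # 2 * 3 = 6 configurations
```

The number of configurations is the product of the value counts, which is why the other search types and the rules file become useful as the number of variables grows.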
Rules
This file places restrictions on possible combinations of variable values. The user can optionally define these to avoid evaluating nonsensical configurations or to direct the search. It contains lines with 4 space-delimited columns, for example:
TMSRCAUG /\+/ STEPS t0a1-0
TMSRCAUG /^[^+]*$/ STEPS t0-0
Each line can be viewed as an if-statement: if the value of the variable named in column 1 matches the pattern in column 2, then the variable named in column 3 must equal the value in column 4. The second column is evaluated as a Perl regular expression. The fourth column must contain an exact value.
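The if-statement reading of a rule can be sketched in Python as follows. This is an illustration under assumptions, not Prospector's implementation: it uses Python's re module in place of Perl regular expressions (the two example patterns behave the same in both), it assumes the pattern is written between slashes, and the configuration values "en+lemma" and "en" are hypothetical.

```python
import re


def parse_rules(text: str) -> list:
    """Parse 4-column rule lines into (cond_var, regex, target_var, required)."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        cond_var, pattern, target_var, required = line.split()
        # Assumption: the pattern is delimited by slashes, as in the examples.
        if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
            pattern = pattern[1:-1]
        rules.append((cond_var, re.compile(pattern), target_var, required))
    return rules


def satisfies(config: dict, rules: list) -> bool:
    """Return True if the configuration violates none of the rules."""
    for cond_var, regex, target_var, required in rules:
        # "if column 1 matches column 2 then column 3 must equal column 4"
        if regex.search(config[cond_var]) and config[target_var] != required:
            return False
    return True


RULES = r"""
TMSRCAUG /\+/ STEPS t0a1-0
TMSRCAUG /^[^+]*$/ STEPS t0-0
"""
rules = parse_rules(RULES)
print(satisfies({"TMSRCAUG": "en+lemma", "STEPS": "t0a1-0"}, rules))  # True
print(satisfies({"TMSRCAUG": "en", "STEPS": "t0a1-0"}, rules))        # False
```

The second configuration is rejected because its TMSRCAUG value contains no '+', so the second rule applies and requires STEPS to be t0-0.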
Prediction
The user can optionally also include a file predict. It has to be an
executable that, given the path to the final step, outputs the complexity or
cost estimate for such an experiment. If the user also specifies a limit,
Prospector queries this program before running each experiment and
discards experiments whose estimated cost exceeds the threshold.
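The filtering step described above might look roughly like the following Python sketch. The calling convention is an assumption: the sketch passes the step path as the only command-line argument and expects a single number on standard output, which the description does not spell out. The function names, the stand-in predict command and the step path are hypothetical.

```python
import subprocess
import sys


def predicted_cost(predict_cmd: list, step_path: str) -> float:
    """Run the predict executable on a step path and parse its numeric output."""
    result = subprocess.run(
        predict_cmd + [step_path],  # assumption: path is passed as an argument
        capture_output=True,
        text=True,
        check=True,
    )
    return float(result.stdout.strip())


def within_limit(predict_cmd: list, step_path: str, limit) -> bool:
    """Return True if the experiment should be run (no limit, or cost <= limit)."""
    if limit is None:
        return True
    return predicted_cost(predict_cmd, step_path) <= limit


# Stand-in for a real predict executable: it ignores its argument
# and always reports a cost of 5.0.
fake_predict = [sys.executable, "-c", "print(5.0)"]
print(within_limit(fake_predict, "hypothetical/final/step", limit=10.0))  # True
print(within_limit(fake_predict, "hypothetical/final/step", limit=1.0))   # False
```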
Running Prospector
Usage:
prospector [options] config-directory
Command-Line Arguments
- max-running: Maximum number of experiments running in parallel.
- search: Search type. Possible values:
  - genetic
  - exhaustive
  - line
  - random
- genetic-population-size: Size of one generation in genetic search.
- genetic-nbest: The number of best configurations considered as parents for the next generation.
- genetic-mutation-prob: Probability of mutation in genetic search.
- random-limit: The total number of experiments created in random search.
- max-allowed-prediction: Threshold value of prediction.
- verbose: Be verbose.
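To show how the three genetic-search parameters could interact, here is a generic sketch of one generation step in Python. It is a textbook-style illustration, not a description of Prospector's actual algorithm: uniform crossover, per-variable mutation, the assumption that a higher score is better, and all names and example values are choices made for this example.

```python
import random


def next_generation(scored, variables, population_size, nbest, mutation_prob,
                    rng=random):
    """Build a new generation from (config, score) pairs.

    scored          -- list of (config dict, score); higher is assumed better
    variables       -- dict mapping variable name to its list of possible values
    population_size -- number of configurations to create
    nbest           -- number of top configurations used as parents
    mutation_prob   -- chance that a variable is reset to a random value
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    parents = [config for config, _ in ranked[:nbest]]
    children = []
    for _ in range(population_size):
        mother, father = rng.choice(parents), rng.choice(parents)
        child = {}
        for name, values in variables.items():
            # Uniform crossover: take each variable from either parent.
            child[name] = rng.choice((mother[name], father[name]))
            # Mutation: occasionally replace the value with a random one.
            if rng.random() < mutation_prob:
                child[name] = rng.choice(values)
        children.append(child)
    return children


# Hypothetical variables and scores.
variables = {"SRCFACTOR": ["form", "lemma"], "TGTFACTOR": ["form", "lemma", "tag"]}
scored = [
    ({"SRCFACTOR": "form", "TGTFACTOR": "form"}, 0.21),
    ({"SRCFACTOR": "lemma", "TGTFACTOR": "tag"}, 0.25),
    ({"SRCFACTOR": "form", "TGTFACTOR": "tag"}, 0.19),
]
generation = next_generation(scored, variables, population_size=4, nbest=2,
                             mutation_prob=0.1, rng=random.Random(0))
print(len(generation))  # 4
```

A real search would additionally filter the children through the rules file and skip configurations that have already been evaluated.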