eman
...a feature-packed experiment manager.
Last update: 2018-02-07
- eman
- SYNOPSIS
- DESCRIPTION
- Why Should You Use eman?
- Structure of a Step Directory
- Special Files in Directory of Steps
- Life Cycle of Individual Steps
- STEP AND EXPERIMENT CLONING
- FINDING STEPS
- USAGE PATTERNS
- COMMON PROBLEMS
- WRITING SEEDS: CORE CONVENTIONS
- WRITING SEEDS: TIPS
- FUTURE EXTENSIONS
- SEE ALSO
- AUTHOR
eman
eman, experiment manager
SYNOPSIS
VAR=val eman init STEPTYPE   # create new step of the given type
VAR=val eman clone SPEC      # create new step based on SPEC
eman clone < traceback       # create step by cloning incl. predecessors
eman redo EXPSPEC            # equals 'eman tb --vars SPEC | eman clone'
                             # good predecessors are reused by default
  --reuse=SPEC               # reuse the given step (incl. predecs.)
  --ignore=STEPTYPE          # reuse steps of the given type
  --avoid=SPEC               # don't reuse the given step in the clone
  --redo=STEPTYPE            # don't reuse any steps of the given type
  --all-avoid                # avoid all input steps
  --start                    # after init/clone/redo, submit exp to queue
  --prepare                  # after init/clone/redo, prepare the step
eman prepare SPEC            # prepare inited step
eman run SPEC                # run prepared step
eman continue SPEC           # continue a single step that failed
eman clean SPEC              # erase files other than eman* and log* to make
                             # the step reentrant and 'eman continue' safe
eman start EXPSPEC           # prepare and run, incl. all predecessors
  --priority -200            # cluster scheduling priority (default: -100)
  --mem 15g                  # required memory (default: 6g)
  --disk 20g                 # required space in /mnt/h/tmp
  --cores 4                  # number of cores to use
eman guess SPEC              # guess a *single* step based on the jobid
                             # or a substring of the hash, the tag, the
                             # date or final score
eman qstat                   # call SGE's qstat and guess steps for eman jobs
eman list STEPTYPE ...       # list all steps of the given type
eman status SPEC/STEPTYPE    # like 'list --status', abbr. 'stat'
eman vars SPEC/STEPTYPE      # like 'list --vars'
eman tag SPEC/STEPTYPE       # like 'list --tag'
eman annotate <txt >txt      # adds --stat, --tag, ... after stepname
eman users SPEC ...          # list all steps that use the given step
eman traceback EXPSPEC ...   # show tree of the steps and predecessors
  --tag --vars --status      # include extra information
                             # tracebacks with --vars fully specify
                             # the experiment
  -s /foo/bar/               # modify vars; can be repeated
                             # implies --vars
                             # highlights diff if to terminal
  -sv VAR=newvalue           # use -s to set the given VAR; can be repeated
  --ignore STEPTYPE          # do not include STEPTYPE in traceback
eman traceforward EXPSPEC    # show tree of the steps and users (successors)
eman duplicates              # show groups of 2+ steps having same vars
  --tag --vars --status      # include relevant information
  --log --jobid              # and the tail of the log or SGE job ID
eman abolish SPEC ...        # destroy all step files except
                             # metadata => can still be cloned
eman collect                 # collect results of all experiments
eman reindex                 # re-create index of steps
eman wait SPEC ...           # block until the jobs are FAILED or DONE
                             # die if FAILED or DONE is not reachable
eman select QUERY            # output steps matching the QUERY;
                             # query syntax is documented below
eman addremote DIR ALIAS     # add a remote eman playground as ALIAS
                             # you can use remote steps, clone them etc.
eman adddir DIR              # add a subdirectory for steps
SPEC is a step specifier, i.e. a text snippet capable of identifying a step uniquely. This can be a full stepdir name such as 's.align.12345678.20121010-1010' or any sufficient portion of it such as '678.2012'. You can even use '.' and '`pwd`' as SPEC. Once results are collected, any unambiguous result number can also be used to refer to the respective step directory.
EXPSPEC is an experiment specifier. Formally, EXPSPEC and SPEC are identical: both refer to a particular unambiguous step directory. The difference lies in the command itself: SPEC-commands operate on the single step directory while EXPSPEC-commands operate on the whole structure of the step and its predecessors.
STEPTYPE specifies the seed to be used, e.g. 'align'.
DESCRIPTION
Command aliases:
abolish       rm
clone         cl
continue      cont
duplicates    dups
list          ls
prepare       pr
prepare       prep
retag         retag
select        sel
start         st
status        stat
tabulate      tab
traceback     tb
traceback     tr
traceforward  tf
traceforward  traceusers
traceforward  tu
Eman is an experiment manager, useful mainly for deriving steps and step chains, i.e. complex experiment scenarios.
In the following:
a step        ... is a single unit of work
an experiment ... is a directed acyclic graph (DAG) of depending steps.
                  an experiment can also be called a workflow.
                  eman currently displays DAGs as trees, repeating shared steps
a step seed   ... is a recipe to build individual steps
Why Should You Use eman?
Eman is designed to speed up your 'experimental loop' and broaden the range of explored experiment configurations while maintaining the reproducibility of all the various experiment runs. The specific subject of your experiments is not important for eman: all commands to run etc. are encoded in your custom 'seeds'.
Structure of a Step Directory
Each step is represented as a single directory s.STEPTYPE.HASH.TIMESTAMP. Apart from any files needed or produced by the step, the following files are always present in the step directory:
 *eman.tag          ... one-line "readable" summary of vars
                        Often manually edited to contain special flags.
 #eman.vars         ... the variables configuring the step
 *eman.deps         ... list of prerequisites of this step
!*eman.status       ... the status of the step
  eman.jobid        ... the jobid of the most recent (re)run
  eman.seed         ... the script used to init and prepare the step
 #eman.command      ... the script used to run the step
  eman.derived_from ... the name of the step used when deriving
  eman.init_env     ... all environment variables at init time
Files marked with '#' have to be provided by your 'seed' scripts. Files marked with '*' can be provided by your 'seed' scripts. Other files are created by eman. The only file marked with '!', i.e. 'eman.status', is used to check if a directory is a valid step. To manually fake a step, create a directory called s.ANYTHING.TIMESTAMP and write 'DONE' into eman.status in there. Forged steps can only be depended upon; cloning or other advanced operations result in undefined behaviour.
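The manual forging described above can be sketched in shell as follows. The step type 'mydata' and the throwaway temporary playground are assumptions made for illustration; in practice you would run the two middle commands inside your real playground.

```shell
# Sketch: forge a step by hand (hypothetical names, scratch playground).
playground=$(mktemp -d)
cd "$playground"

# s.ANYTHING.TIMESTAMP; 'mydata' is illustrative only.
fake_step="s.mydata.$(date +%Y%m%d-%H%M)"
mkdir "$fake_step"

# eman.status is the file eman checks to recognize a valid step.
echo DONE > "$fake_step/eman.status"

cat "$fake_step/eman.status"
```

Such a directory is only meant to be listed as a dependence of other steps.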
Special Files in Directory of Steps
In the directory containing all your steps, eman uses the following files:
#eman.seeds        ... the directory of all step seeds
 eman.index        ... index of steps for quick check for identities
#eman.results.conf ... name wildcard pattern and regex to extract result
 eman.results      ... collected results from all steps
Again, you are responsible for providing the items marked '#'.
Life Cycle of Individual Steps
Each (successful) step goes through these core phases:
init    ... become part of structure of experiments, depend on other
            steps and allow other steps to depend on me
prepare ... quickly check that all input files exist
run     ... long computation, submitted to cluster
Other than that, steps are considered immutable. You can modify existing steps as you like (changing status, variables, contents) but you are sacrificing the reproducibility of your experiments. Our best practice involves a lot of hacking of existing steps in early stages of implementation interleaved with frequent cloning and rerunning of the experiments from scratch. Later, once all new tweaks are exposed as variables of the respective steps, we absolutely avoid modifying existing steps and use cloning only.
Eman advances a step through these phases using the following procedure:
1. The commands 'eman init STEPTYPE' and 'eman clone s.STEPTYPE....' create a timestamped step directory, e.g. s.test.hash1234.20101115-1213.
2. The file eman.seeds/STEPTYPE is copied there as eman.seed.
3. The seed is run and expected to 'init' (i.e. produce the file eman.vars and optionally eman.deps). The seed may also produce eman.status with the content 'DONE' or 'PREPARED' to skip some of the following phases (XXX unimplemented). By default, the status becomes 'INITED'.
4. The seed is run and expected to 'prepare' (i.e. produce the file eman.command). The status becomes 'PREPARED'.
5. eman.command is run and expected to write 'DONE' or 'FAILED' to eman.status.
The following statuses are recognized:
NONEXISTENT  ... not created yet / irreversibly deleted
INITED       ... the step was just created
INITFAILED   ... the initialization failed
PREPARED     ... prepared using 'eman prepare'
PREPFAILED   ... 'eman prepare' failed
WAITING: ... ... submitted by 'eman start', prereqs still run
                 (not used; waiting jobs are marked 'running')
STARTING     ... just before 'running'
RUNNING      ... running
FAILED       ... the run failed
DONE         ... the run succeeded
ABOLISHED    ... has just vars, deps but no more content
OUTDATED     ... you can manually set this to prevent reuse
STEP AND EXPERIMENT CLONING
The support for cloning steps and whole experiments (sequences of steps) is a key feature of eman. Cloning could also be called 'deriving', because we allow the clone to bear different variable values.
Cloning a step (the command 'eman clone SPEC') means creating a completely new step and providing it with variables from the source step and possibly adding or modifying some.
Cloning a sequence of steps (the commands 'eman clone < traceback' and 'eman redo') is slightly trickier: imagine we change a variable in an early step in the sequence. All the following steps in the experiment then have to be instructed to use this modified step. Eman achieves this by explicitly replacing the original step name with the name of the new step in variables of subsequent steps. The immutability of steps naturally requires cloning the subsequent steps as well.
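The name replacement can be pictured with a small shell sketch. This is an illustration of the idea only, not eman's own code; the step names, the ORDER variable and the scratch directory are hypothetical.

```shell
# Illustration only: mimic the step-name replacement on a copy of eman.vars.
work=$(mktemp -d)
old_step="s.align.DEF.20101127-1856"   # step being replaced
new_step="s.align.XYZ.20101201-0900"   # hypothetical clone with a changed variable

# Variables of a subsequent step that refers to the old step.
printf 'ALIGN=%s\nORDER=5\n' "$old_step" > "$work/eman.vars"

# Escape the dots so they match literally, then substitute the step name.
old_re=$(printf '%s' "$old_step" | sed 's/\./\\./g')
sed "s/$old_re/$new_step/g" "$work/eman.vars" > "$work/eman.vars.cloned"

cat "$work/eman.vars.cloned"
```

The subsequent step keeps all its other variables and differs from the original only in which predecessor it names.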
FINDING STEPS
Once you have many complex experiments in your playground, using 'ls' becomes less friendly. Eman provides a few tools to locate experiments matching your criteria.
While direct access of the filesystem (ls, grep s.*/eman.vars, ...) is perhaps faster, it does not search in (remote) subdirs of your playground.
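For comparison, a direct filesystem search might look like the sketch below. The two forged steps and the ORDER variable are made up so that the example is self-contained; only the local directory is searched.

```shell
# Sketch: find local steps by looking into eman.vars directly.
playground=$(mktemp -d)
cd "$playground"

# Two forged steps with hypothetical variables, just to have something to search.
mkdir s.lm.aaa.20101127-1856 s.lm.bbb.20101128-0900
echo "ORDER=3" > s.lm.aaa.20101127-1856/eman.vars
echo "ORDER=5" > s.lm.bbb.20101128-0900/eman.vars

# List step directories whose eman.vars contains the exact line ORDER=5.
matches=$(grep -lx 'ORDER=5' s.*/eman.vars | xargs -n1 dirname)
echo "$matches"
```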
eman ls STEPTYPE ... just lists existing steps
eman select QUERY-ARGUMENTS
Eman 'select' Query Syntax
You can use the following filters in eman select:
t <type>         ... only steps of type <type>
d                ... only DONE steps
f                ... only FAILED steps
i                ... only INITED steps
p                ... only PREPARED steps
o                ... only OUTDATED steps
r                ... only RUNNING steps
s <status>       ... only steps with status <status>
v <VAR>=<value>  ... only steps with variable <VAR> that has value <value>
vre <regex>      ... only steps where the variable matches <regex>
                     (the expression can contain the name of the variable)
tre <regex>      ... only steps where tags (incl. autotags) match <regex>
l <count>        ... latest <count> steps (according to timestamp, i.e.
                     order of steps created within the same minute is undefined)
lh               ... latest 10 steps
nq               ... only steps not currently known to qstat
                     (current user, any cluster job status)
u <criteria>     ... only steps that have at least one user with given
                     properties; rest of query defines the user
br <criteria>    ... only steps that have at least one (transitive) predecessor
                     with given properties (backward recursion)
fr <criteria>    ... only steps that have at least one (transitive) successor
                     with given properties (forward recursion)
not              ... negation of the remaining filters
                     e.g. eman select not d
remote           ... only remote steps, implies --remote
date <date>      ... only steps created on <date> (yyyymmdd)
today            ... only steps created today
Examples:
eman sel t mert d                         # MERT steps that are DONE
eman sel lh                               # last 10 steps
eman sel t tm v DECODINGSTEPS=t0-0        # tm steps with one 1-factor t-step
eman sel t tm vre '^DECODINGSTEPS=t0-0$'  # equivalent to the above
eman sel t mert br vre 'ALIAUG.*lcstem4'  # MERT steps where word alignment
                                          # was done on lcstem4
# The result (and speed) depends on the order of filters, e.g.:
eman select t mert l 10   # output last 10 merts
eman select l 10 t mert   # output merts that were among 10 latest steps
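The effect of filter order can be reproduced with an ordinary pipeline. The sketch below is only an analogy: 'grep' stands in for the 't mert' filter and 'tail -n 2' for 'l 2', applied to a made-up list of step names.

```shell
# Analogy only: grep plays the role of 't mert', tail -n 2 the role of 'l 2'.
steps="s.mert.a.20101101-1000
s.tm.b.20101102-1000
s.mert.c.20101103-1000
s.tm.d.20101104-1000"

# Filter by type first, then take the latest two: the two most recent merts.
type_then_latest=$(echo "$steps" | grep '^s\.mert\.' | tail -n 2)

# Take the latest two first, then filter by type: only merts among them.
latest_then_type=$(echo "$steps" | tail -n 2 | grep '^s\.mert\.')

echo "$type_then_latest"
echo "---"
echo "$latest_then_type"
```

The first pipeline yields two steps, the second only one, because the limit was applied before the type filter.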
USAGE PATTERNS
eman traceback SPEC -s '/.../.../'
  # preview the experiment with some vars replaced
  # append "--colorize | less" to preview

eman traceback SPEC -s '/.../.../' | eman clone
  # clone the whole subtree of steps replacing some vars

VAR=x eman redo SPEC
  # clone the top step (and unusable predecessors) replacing VAR
  # with some new value x

eman redo SPEC --start --outdate
  # redo all failed/outdated steps, mark them as outdated

eman td SPEC --stat
  # see what all was derived (by redo or clone) from SPEC
  # useful e.g. for chasing redone experiments and finding the last one

eman annotate --stat < my-notes.txt > new-notes.txt
  # change "s.anything.123 (anything)" into "s.anything.123 (RUNNING)"
  # useful for making your notes up-to-date

eman abolish `eman select t STEPTYPE f`
  # remove all content files of all failing steps of a STEPTYPE

eman stat `qstat | cut -c 1-8 | skip 2`
  # show status of all running/held jobs in SGE
  # WARNING: This only works if there are no non-eman cluster jobs.
  # Otherwise, eman will just say "Failed to guess step from: $job_id".
  # the skip command is available here:
  # http://www.cuni.cz/~obo/textutils/#skip

eman sel t MERT fr t EVAL tre FOO
  # list all MERT steps that were used to construct EVAL steps tagged
  # with the tag FOO

eman sel t lm vre tag --stat --vars | grep 'ORDER\|s.lm'
  # what all orders of LM do I have over morphological 'tag's

eman select r nq
  # which failed jobs failed to fail correctly?
  # eman believes they are still running but cluster does not know them

eman sel t evaluator br t lm vre 'CORP.*=MYCORPUS'
  # all 'evaluator' steps which use a language model trained on MYCORPUS

eman sel t tm not fr t evaluator
  # all 'tm' steps which have not been evaluated (i.e. no step of type
  # 'evaluator' transitively depends on them)

eman sel f br vre 'ALICORP=.*un.*'
  # list all failed steps depending (transitively) on ALICORP=*un*
  # (e.g. un.fr-en or un.es-en)

eman select t translate not u t evaluator
  # list all translate steps whose users (intransitive) do not include any
  # evaluator step
eman ls > steps-all.txt
eman traceback --notree \
  `qstat | grep $USER | grep -P '\ss\.' | cut -d' ' -f1 | grep -v 4296093` \
  | sort -u > steps-tabu.txt
perl -e '
  open(TABU, "steps-tabu.txt") or die;
  while(<TABU>) { $tabu{$_}++ }
  open(ALL, "steps-all.txt") or die;
  while(<ALL>) { print unless($tabu{$_}) }' > steps-free.txt
for i in `cat steps-free.txt` ; do
  echo $i ; cp -r $i /net/cluster/TMP/$USER/new_playground
done
for i in `cat steps-free.txt` ; do echo $i ; rm -rf $i ; done
  # I want to move the playground to another disk because I am hitting the
  # quota. But I need to know first which steps are running or prerequisites
  # of something running (immovable at the moment). Note the "grep -v 4296093"
  # above. Eman will die if any of the running jobs don't match a known step.
  # (BEWARE: Copying steps will also separate numerous hardlinks we have there.
  # The target disk usage will be much higher until we rejoin the hardlinks
  # again or erase them.)
If a Cluster Node Completely Dies
A cluster node may die completely when (your) job takes too much RAM. Assuming 2472256 is the SGE job ID of the failed (but still allegedly running) job, this is probably what you want to do:
qdel 2472256                         # remove it from the cluster
eman fail 2472256                    # mark it FAILED for eman
eman redo 2472256 --start --mem 20g  # clone and re-run it with some more memory
# or:
eman continue 2472256 --mem 20g      # re-run it with more memory
  # eman redo will clone the failed step (and reuse steps it depends on)
  # eman continue will rerun the original step instead (run eman.command
  # without generating it anew); note that running users of the step cannot
  # be saved like this (run eman redo on the last step afterwards)
COMMON PROBLEMS
Outdated Index File
If you run e.g. 'eman users ...', get a step directory, but subsequent eman commands fail to find it ("Failed to guess step from:..."), try running:
eman reindex
The step may indeed be a zombie, a removed directory.
Multiple Step Instances, Some Failed, Some OK
By cloning, you can easily end up with several instances of the same step (i.e. two distinct step directories with identical variables). Sometimes, some of the instances may even have failed while others have finished successfully.
When such a step is further used in an experiment, and you clone the experiment, eman will automatically use the oldest plausible instance (FAILED, OUTDATED and ABOLISHED instances are not considered plausible).
To pick a specific instance of a step manually (including implausible instances), use --reuse. To avoid some instances, use --avoid.
You may wish to use 'eman dups' every now and then to get rid of (or abolish) some of the unused instances.
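The default selection rule described above (oldest plausible instance) can be sketched as follows. This only illustrates the rule, it is not how eman implements it; the three instances and the scratch playground are hypothetical, and the instances are assumed to have identical variables.

```shell
# Sketch of the rule only: pick the oldest instance whose status is
# not FAILED, OUTDATED or ABOLISHED.
playground=$(mktemp -d)
cd "$playground"

# Three hypothetical instances of the same step.
for spec in "s.lm.aaa.20101101-1000 FAILED" \
            "s.lm.bbb.20101105-1000 DONE" \
            "s.lm.ccc.20101109-1000 DONE"; do
  set -- $spec
  mkdir "$1" && echo "$2" > "$1/eman.status"
done

chosen=""
# Sort by the timestamp, i.e. the 4th dot-separated field of the directory name.
for dir in $(ls -d s.lm.* | sort -t. -k4,4); do
  case "$(cat "$dir/eman.status")" in
    FAILED|OUTDATED|ABOLISHED) continue ;;   # implausible, skip
  esac
  chosen=$dir
  break
done
echo "$chosen"
```

Here the oldest instance is FAILED and therefore skipped, so the second oldest one is chosen.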
WRITING SEEDS: CORE CONVENTIONS
Seeds (in eman.seeds) have to follow some conventions.
- executable
- respond to environment variables
- exit code 0 for success, other for failure
- init and prepare by default
- init only if $INIT_ONLY==yes
When Initing
- create the file: eman.vars
- optionally also create: eman.tag, eman.deps
Note that for reliable cloning, deps must be directly determined from the vars. It is actually best to include the full name of the dependence in one of the variables.
When Preparing
- create the file: eman.command
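A minimal seed following these conventions might look like the sketch below. It is an illustration under assumptions, not a seed shipped with eman: the CORPUS variable, the line-counting job and the scratch directories are all hypothetical. The snippet writes the seed to a file and then tries it by hand, outside eman, in init-only mode.

```shell
# Sketch of a seed; CORPUS is a hypothetical variable, not something eman defines.
work=$(mktemp -d)
cat > "$work/seed" <<'EOF'
#!/bin/sh
# init: record the variables configuring the step (required) and a tag (optional)
[ -n "$CORPUS" ] || { echo "CORPUS not set" >&2; exit 1; }   # non-zero exit = failure
echo "CORPUS=$CORPUS" > eman.vars
echo "$CORPUS" > eman.tag

# stop here when asked to init only
[ "$INIT_ONLY" = "yes" ] && exit 0

# prepare: write the script to be run later; the value of CORPUS is baked in now
cat > eman.command <<CMD
#!/bin/sh
# per the life cycle described earlier, the command reports DONE or FAILED
wc -l < "$CORPUS" > linecount && echo DONE > eman.status || echo FAILED > eman.status
CMD
exit 0
EOF
chmod +x "$work/seed"

# Try it by hand: init only, in a scratch "step directory".
mkdir "$work/step" && cd "$work/step"
CORPUS=/tmp/corpus.txt INIT_ONLY=yes "$work/seed"
ls
```

Running the same seed without INIT_ONLY additionally produces eman.command.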
WRITING SEEDS: TIPS
Inheriting Variables
It is often useful to propagate a value of a variable from a dependence to the current step. This can be easily achieved:
# anchor the match so that e.g. NOT_TO_INHERIT is not picked up by accident
INHERITED=`grep '^TO_INHERIT=' ../$DEPENDENCE/eman.vars | cut -d= -f2`
echo INHERITED=$INHERITED >> eman.vars   # save as our variable
Note that your seed can be asked to perform the init before the $DEPENDENCE was prepared and likewise, it can be asked to prepare before $DEPENDENCE was run. So avoid asking for files in $DEPENDENCE too early.
Note DZ: The complicated code above is probably outdated. There is now support for variable inheriting directly in eman. This is an example of what to include in the seed of the inheriting step:
eman defvar INHERITED inherit=PREVIOUSSTEP
eman defvar INHERITED2 inherit=PREVIOUSSTEP:VARIABLE
Inheriting Dependencies
Consider the following traceback:
+- s.tm.ABC.20101127-1856
|  | ALIGN=s.align.DEF.20101127-1856
|  | BINARIES=s.binaries.GHI.20101127-1856
|  +- s.align.DEF.20101127-1856
|  |  | BINARIES=s.binaries.GHI.20101127-1856
|  |  +- s.binaries.GHI.20101127-1856
|  +- s.binaries.GHI.20101127-1856
The step 'binaries' is used by 'tm' directly but could be 'inherited' from 'align', so that we don't have to specify it when initing 'tm' and also the traceback is simpler:
+- s.tm.ABC.20101127-1856
|  +- s.align.DEF.20101127-1856
|  |  +- s.binaries.GHI.20101127-1856
The best technique to achieve this simplification is:
1. 'align' should have BINARIES as a variable as well as a dependence.
2. 'tm' should have BINARIES only as a variable and not as a dependence.
3. The seed of 'tm' should use the given ALIGN to copy BINARIES from there.
4. For best flexibility, 'tm' should allow for using a different 'binaries' step. If and only if this happens, 'tm' should add the extra dependence:
+- s.tm.ABC.20101127-1856
|  | ALIGN=s.align.DEF.20101127-1856
|  | BINARIES=s.binaries.JKL.20101130-1100
|  +- s.align.DEF.20101127-1856
|  |  | BINARIES=s.binaries.GHI.20101127-1856
|  |  +- s.binaries.GHI.20101127-1856
|  +- s.binaries.JKL.20101130-1100
Here is a proposed bash solution for the 'tm' seed:
# anchor the match so that only the BINARIES variable itself is read
INHERITED=`grep '^BINARIES=' ../$ALIGN/eman.vars | cut -d= -f2`
if [ -z "$BINARIES" ] || [ "$BINARIES" == "$INHERITED" ]; then
  # inheriting
  BINARIES=$INHERITED
else
  # using our own
  echo "$BINARIES" >> eman.deps
fi
# surely store the var
echo "BINARIES=$BINARIES" >> eman.vars
The topic of inherited dependencies is related to the question whether we see the experiment as a tree or a directed acyclic graph.
FUTURE EXTENSIONS
eman.requests
A script in the step directory that would, based on the size of the step's data etc., predict the disk and memory requirements and output a string such as:
--mem 30g --disk 100g
The output would be used as additional arguments for eman start/continue.
The script would be optional and command-line specified parameters would override its output.
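Since this extension does not exist yet, the following is purely a hypothetical sketch of what such a script could look like. The sizing rule (memory = data size + 1g, disk = 3x data size) is invented for illustration and has no basis in eman itself.

```shell
# Hypothetical eman.requests sketch; the sizing rule below is invented.
step_dir=${1:-.}

# Size of the step's data in kilobytes, rounded up to whole gigabytes.
kb=$(du -sk "$step_dir" | cut -f1)
gb=$(( (kb + 1048575) / 1048576 ))

# Emit the extra arguments in the format shown above.
requests="--mem $((gb + 1))g --disk $((gb * 3))g"
echo "$requests"
```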
SEE ALSO
Eman is somewhat similar to, but also different from, other experiment management systems.
While other experiment management systems treat the whole experiment as the main goal (allowing one to represent variations within the experiment and to reuse parts of previous experiment runs), eman works primarily with the individual steps. The complete experiments emerge rather as side-effects. Later, they can be easily displayed using 'eman traceback' as well as reused or modified using 'eman clone|redo'. We like to say that this makes eman more flexible.
Essentially, an eman traceback can be seen as a sample workflow and 'eman clone < traceback' can be used to instantiate the workflow.
A feature very natural for eman but still unique compared to other systems is the command-line interface to construct variations of steps or experiments. The exploration of the space of configurations can thus be quickly automated.
Related experiment or workflow management systems:
LoonyBin
http://www.cs.cmu.edu/~jhclark/loonybin/
LoonyBin is a clickable Java tool. Its strong point is the support for multiple clusters and schedulers.
Moses Experiment Management System
http://www.statmt.org/moses/?n=FactoredTraining.EMS
Moses EMS (experiment.perl) is centered around a single (customizable) experiment which consists of steps.
Other
There are also the following workflow management systems: DAGMan, Pegasus, Dryad.
AUTHOR
Ondřej Bojar <obo@matfyz.cz>
Contributions by Aleš Tamchyna and Dan Zeman.
Copyright 2010-2013, the respective authors. All rights reserved.