eman

...a feature-packed experiment manager.

Last update: 2018-02-07

eman
SYNOPSIS
DESCRIPTION

Why Should You Use eman?
Structure of a Step Directory
Special Files in Directory of Steps
Life Cycle of Individual Steps

STEP AND EXPERIMENT CLONING
FINDING STEPS

Eman `select' Query Syntax

USAGE PATTERNS

If a Cluster Node Completely Dies

COMMON PROBLEMS

Outdated Index File
Multiple Step Instances, Some Failed, Some OK

WRITING SEEDS: CORE CONVENTIONS

When Initing
When Preparing

WRITING SEEDS: TIPS

Inheriting Variables
Inheriting Dependencies

FUTURE EXTENSIONS

eman.requests

SEE ALSO

LoonyBin
Moses Experiment Management System
Other

AUTHOR

eman

eman, experiment manager

SYNOPSIS

  VAR=val eman init STEPTYPE # create new step of the given type
  VAR=val eman clone SPEC    # create new step based on SPEC

  eman clone < traceback     # create step by cloning incl. predecessors
  eman redo EXPSPEC          # equals 'eman tb --vars SPEC | eman clone'
                             # good predecessors are reused by default
    --reuse=SPEC             #   reuse the given step (incl. predecs.)
    --ignore=STEPTYPE        #   reuse steps of the given type
    --avoid=SPEC             #   don't reuse the given step in the clone
    --redo=STEPTYPE          #   don't reuse any steps of the given type
    --all-avoid              #   avoid all input steps
    --start                  # after init/clone/redo, submit exp to queue
    --prepare                # after init/clone/redo, prepare the step

  eman prepare SPEC          # prepare inited step
  eman run SPEC              # run prepared step
  eman continue SPEC         # continue a single step that failed
  eman clean SPEC            # erase files other than eman* and log* to make
                             # the step reentrant and 'eman continue' safe
  eman start EXPSPEC         # prepare and run, incl. all predecessors
    --priority -200          #   cluster scheduling priority (default: -100)
    --mem 15g                #   required memory (default: 6g)
    --disk 20g               #   required space in /mnt/h/tmp
    --cores 4                #   number of cores to use

  eman guess SPEC            # guess a *single* step based on the jobid
                             # or a substring of the hash, the tag, the
                             # date or final score
  eman qstat                 # call SGE's qstat and guess steps for eman jobs

  eman list STEPTYPE ...     # list all steps of the given type
  eman status SPEC/STEPTYPE  # like 'list --status', abbr. 'stat'
  eman vars SPEC/STEPTYPE    # like 'list --vars'
  eman tag SPEC/STEPTYPE     # like 'list --tag'
  eman annotate <txt >txt    # adds --stat, --tag, ... after stepname
  eman users SPEC ...        # list all steps that use the given step
  eman traceback EXPSPEC ... # show tree of the steps and predecessors
    --tag --vars --status    #   include extra information
                             #   tracebacks with --vars fully specify
                             #   the experiment
    -s /foo/bar/             #   modify vars; can be repeated
                             #     implies --vars
                             #     highlights diff if to terminal
    -sv VAR=newvalue         #   use -s to set the given VAR; can be repeated
    --ignore STEPTYPE        #   do not include STEPTYPE in traceback
  eman traceforward EXPSPEC  # show tree of the steps and users (successors)
  eman duplicates            # show groups of 2+ steps having same vars
    --tag --vars --status    #   include relevant information
    --log --jobid            #   and the tail of the log or SGE job ID

  eman abolish SPEC ...      # destroy all step files except
                             #   metadata => can still be cloned
  eman collect               # collect results of all experiments
  eman reindex               # re-create index of steps

  eman wait SPEC ...         # block until the jobs are FAILED or DONE
                             # die if FAILED or DONE is not reachable

  eman select QUERY          # output steps matching the QUERY;
                             #   query syntax is documented below

  eman addremote DIR ALIAS   # add a remote eman playground as ALIAS
                             # you can use remote steps, clone them etc.
  eman adddir DIR            # add a subdirectory for steps

SPEC is a step specifier, i.e. a text snippet capable of identifying a step uniquely. This can be a full stepdir name such as 's.align.12345678.20121010-1010' or any sufficient portion of it such as '678.2012'. You can even use '.' and '`pwd`' as SPEC. Once results are collected, any unambiguous result number can also be used to refer to the respective step directory.
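The substring part of SPEC matching can be pictured with the following shell sketch. It is an illustration only, not eman's own code: resolve_spec is a hypothetical helper, the step names are made up, and the real 'eman guess' also understands job IDs, tags, dates and scores.

```shell
#!/bin/sh
# Illustration only: resolve a SPEC substring to exactly one step directory.
# resolve_spec is a hypothetical helper, not an eman command.
resolve_spec() {
  spec=$1
  playground=$2
  # List step directories (names start with "s.") and keep those containing SPEC.
  matches=$(cd "$playground" && ls -d s.* 2>/dev/null | grep -F -- "$spec" || true)
  count=$(printf '%s\n' "$matches" | grep -c . || true)
  if [ "$count" -eq 1 ]; then
    printf '%s\n' "$matches"
  else
    echo "SPEC '$spec' matches $count steps, expected exactly 1" >&2
    return 1
  fi
}

# Demo in a throwaway playground with two made-up steps.
demo=$(mktemp -d)
mkdir "$demo/s.align.12345678.20121010-1010" "$demo/s.tm.87654321.20121011-0900"
unique=$(resolve_spec 678.2012 "$demo")
if resolve_spec 2012 "$demo" 2>/dev/null; then ambiguous=no; else ambiguous=yes; fi
echo "$unique"
```

Here '678.2012' identifies one step, while '2012' occurs in both names and is therefore rejected as ambiguous.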

EXPSPEC is an experiment specifier. Formally, EXPSPEC and SPEC are identical, they refer to a particular unambiguous step directory. The difference lies in the command itself: SPEC-commands operate on the single step directory while EXPSPEC-commands operate on the whole structure of the step and its predecessors.

STEPTYPE specifies the seed to be used, e.g. 'align'.

DESCRIPTION

Command aliases:

  abolish      rm
  clone        cl
  continue     cont
  duplicates   dups
  list         ls
  prepare      pr
  prepare      prep
  retag        retag
  select       sel
  start        st
  status       stat
  tabulate     tab
  traceback    tb
  traceback    tr
  traceforward tf
  traceforward traceusers
  traceforward tu

Eman is an experiment manager, useful mainly for deriving steps and step chains, i.e. complex experiment scenarios.

In the following:

 a step        ... is a single unit of work
 an experiment ... is a directed acyclic graph (DAG) of dependent steps.
                   an experiment can also be called a workflow.
                   eman currently displays DAGs as trees, repeating
                   shared steps
 a step seed   ... is a recipe to build individual steps

Why Should You Use eman?

Eman is designed to speed up your 'experimental loop' and broaden the range of explored experiment configurations while maintaining the reproducibility of all the various experiment runs. The specific subject of your experiments is not important for eman---all commands to run etc. are encoded in your custom 'seeds'.

Structure of a Step Directory

Each step is represented as a single directory s.STEPTYPE.HASH.TIMESTAMP. Apart from any files needed or produced by the step, the following files are always present in the step directory:

 *eman.tag           ... one-line "readable" summary of vars
                         Often manually edited to contain special flags.
 #eman.vars          ... the variables configuring the step
 *eman.deps          ... list of prerequisites of this step
!*eman.status        ... the status of the step
  eman.jobid         ... the jobid of the most recent (re)run
  eman.seed          ... the script used to init and prepare the step
 #eman.command       ... the script used to run the step
  eman.derived_from  ... the name of the step used when deriving
  eman.init_env      ... all environment variables at init time

Files marked with '#' have to be provided by your 'seed' scripts. Files marked with '*' can be provided by your 'seed' scripts. Other files are created by eman. The only file marked with '!', i.e. 'eman.status', is used to check if a directory is a valid step. To manually fake a step, create a directory called s.ANYTHING.TIMESTAMP and write 'DONE' into eman.status in there. Forged steps can only be depended upon; cloning or other advanced operations result in undefined behaviour.
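Forging a step by hand might look like the sketch below. The step name 'mydata' and the file corpus.txt are made-up examples, and a temporary directory stands in for the real playground; the timestamp format is copied from the step names shown in this document.

```shell
#!/bin/sh
# Sketch: hand-forge a finished step so that other steps can depend on it.
# The step name and the data file are made-up examples.
playground=$(mktemp -d)                       # stand-in for your real playground
step="s.mydata.$(date +%Y%m%d-%H%M)"          # s.ANYTHING.TIMESTAMP
mkdir "$playground/$step"
echo DONE > "$playground/$step/eman.status"   # the file eman checks for
cp /dev/null "$playground/$step/corpus.txt"   # whatever the step "produced"
cat "$playground/$step/eman.status"
```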

Special Files in Directory of Steps

In the directory containing all your steps, eman uses the following files:

 #eman.seeds         ... the directory of all step seeds
  eman.index         ... index of steps for quick check for identities
 #eman.results.conf  ... name wildcard pattern and regex to extract result
  eman.results       ... collected results from all steps

Again, you are responsible for providing the items marked '#'.

Life Cycle of Individual Steps

Each (successful) step goes through these core phases:

  init    ... become part of structure of experiments, depend on other
              steps and allow other steps to depend on me
  prepare ... quickly check that all input files exist
  run     ... long computation, submitted to cluster

Other than that, steps are considered immutable. You can modify existing steps as you like (changing status, variables, contents) but you are sacrificing the reproducibility of your experiments. Our best practice involves a lot of hacking of existing steps in early stages of implementation interleaved with frequent cloning and rerunning of the experiments from scratch. Later, once all new tweaks are exposed as variables of the respective steps, we absolutely avoid modifying existing steps and use cloning only.

The progress of a step is achieved by eman following this procedure:

1. The commands 'eman init STEPTYPE' and 'eman clone s.STEPTYPE....' create a timestamped step directory, e.g. s.test.hash1234.20101115-1213.

2. The file eman.seeds/STEPTYPE is copied there as eman.seed.

3. The seed is run and expected to 'init' (i.e. produce the file eman.vars and optionally eman.deps). The seed may also produce eman.status with the content 'DONE' or 'PREPARED' to skip some of the following phases (XXX unimplemented). By default, the status becomes 'INITED'.

4. The seed is run and expected to 'prepare' (i.e. produce the file eman.command). The status becomes 'PREPARED'.

5. eman.command is run and expected to write 'DONE' or 'FAILED' to eman.status.

The following statuses are recognized:

  NONEXISTENT        ... not created yet / irreversibly deleted
  INITED             ... the step was just created
  INITFAILED         ... the initialization failed
  PREPARED           ... prepared using 'eman prepare'
  PREPFAILED         ... 'eman prepare' failed
  WAITING: ...       ... submitted by 'eman start', prereqs still run
                         (not used; waiting jobs are marked 'running')
  STARTING           ... just before 'running'
  RUNNING            ... running
  FAILED             ... the run failed
  DONE               ... the run succeeded
  ABOLISHED          ... has just vars, deps but no more content
  OUTDATED           ... you can manually set this to prevent reuse
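Since the status is kept in each step's eman.status, a quick tally of local steps per status can be sketched in plain shell. tally_statuses is a hypothetical helper and the demo steps are made up; unlike eman's own commands, it only sees step directories directly under the given directory.

```shell
#!/bin/sh
# Sketch: count local steps per status by reading eman.status directly.
# tally_statuses is a hypothetical helper; it does not look into remote
# playgrounds or extra step subdirectories.
tally_statuses() {
  for f in "$1"/s.*/eman.status; do
    [ -f "$f" ] || continue
    # Keep only the first word, dropping any trailing detail and a colon.
    head -n 1 "$f" | cut -d' ' -f1 | tr -d ':'
  done | sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Demo playground with three made-up steps.
demo=$(mktemp -d)
for s in s.lm.aaa.20101115-1213 s.tm.bbb.20101115-1214 s.mert.ccc.20101115-1215; do
  mkdir "$demo/$s"
done
echo DONE   > "$demo/s.lm.aaa.20101115-1213/eman.status"
echo DONE   > "$demo/s.tm.bbb.20101115-1214/eman.status"
echo FAILED > "$demo/s.mert.ccc.20101115-1215/eman.status"
tally=$(tally_statuses "$demo")
echo "$tally"
```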

STEP AND EXPERIMENT CLONING

The support for cloning steps and whole experiments (sequences of steps) is a key feature of eman. Cloning could also be called 'deriving', because we allow the clone to bear different variable values.

Cloning a step (the command 'eman clone SPEC') means creating a completely new step and providing it with variables from the source step and possibly adding or modifying some.

Cloning a sequence of steps (the commands 'eman clone < traceback' and 'eman redo') is slightly trickier: imagine we change a variable in an early step in the sequence. All the following steps in the experiment then have to be instructed to use this modified step. Eman achieves this by explicitly replacing the original step name with the name of the new step in the variables of subsequent steps. The immutability of steps naturally requires cloning the subsequent steps as well.
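The name replacement can be pictured as a plain text substitution over the variables of a later step. The sketch below is an illustration, not eman's implementation; replace_step_name is a hypothetical helper and all step names are made up.

```shell
#!/bin/sh
# Illustration only (not eman's code): point a downstream step's variables
# at a re-cloned predecessor. All step names are made up.
replace_step_name() {
  old=$1
  new=$2
  # Escape the dots so that sed treats the old name literally.
  old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
  sed "s/$old_re/$new/g"
}

tm_vars='ALIGN=s.align.DEF.20101127-1856
BINARIES=s.binaries.GHI.20101127-1856'
new_vars=$(printf '%s\n' "$tm_vars" |
  replace_step_name s.align.DEF.20101127-1856 s.align.XYZ.20101201-0900)
echo "$new_vars"
```

Only ALIGN changes; the clone of the later step is then initialized with the rewritten variables.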

FINDING STEPS

Once you have many complex experiments in your playground, using 'ls' becomes less friendly. Eman provides a few tools to locate experiments matching your criteria.

While direct access to the filesystem (ls, grep s.*/eman.vars, ...) is perhaps faster, it does not search in (remote) subdirs of your playground.

  eman ls STEPTYPE  ... just lists existing steps
  eman select QUERY-ARGUMENTS

Eman `select' Query Syntax

You can use the following filters in eman select:

  t <type>         ... only steps of type <type>
  d                ... only DONE steps
  f                ... only FAILED steps
  i                ... only INITED steps
  p                ... only PREPARED steps
  o                ... only OUTDATED steps
  r                ... only RUNNING steps
  s <status>       ... only steps with status <status>
  v <VAR>=<value>  ... only steps with variable <VAR> that
                       has value <value>
  vre <regex>      ... only steps where the variable matches <regex>
                       (the expression can contain the name of the variable)
  tre <regex>      ... only steps where tags (incl. autotags) match <regex>
  l <count>        ... latest <count> steps (according to timestamp,
                       i.e. the order of steps created within the same
                       minute is undefined)
  lh               ... latest 10 steps
  nq               ... only steps not currently known to qstat
                       (current user, any cluster job status)
  u <criteria>     ... only steps that have at least one user with given
                       properties; rest of query defines the user
  br <criteria>    ... only steps that have at least one (transitive) predecessor
                       with given properties (backward recursion)
  fr <criteria>    ... only steps that have at least one (transitive) successor
                       with given properties (forward recursion)
  not              ... negation of the remaining filters
                       e.g. eman select not d
  remote           ... only remote steps, implies --remote
  date <date>      ... only steps created on <date> (yyyymmdd)
  today            ... only steps created today

Examples:

  eman sel t mert d                        # MERT steps that are DONE
  eman sel lh                              # last 10 steps
  eman sel t tm v DECODINGSTEPS=t0-0       # tm steps with one 1-factor t-step
  eman sel t tm vre '^DECODINGSTEPS=t0-0$' # equivalent to the above
  eman sel t mert br vre 'ALIAUG.*lcstem4' # MERT steps where word alignment
                                           # was done on lcstem4

  # The result (and speed) depends on the order of filters, e.g.:
  eman select t mert l 10 # output last 10 merts
  eman select l 10 t mert # output merts that were among 10 latest steps
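Why the order of filters matters can be seen from a plain-shell analogy, in which each filter is one stage of a pipeline. This is not how eman is implemented; the helper functions and the step names are made up for illustration.

```shell
#!/bin/sh
# Analogy in plain shell (not eman itself): filters apply left to right,
# like the stages of a pipeline. Step names are made up.
steps='s.mert.aaa.20101101-1000
s.tm.bbb.20101102-1000
s.mert.ccc.20101103-1000
s.lm.ddd.20101104-1000
s.tm.eee.20101105-1000'

by_time()   { sort -t. -k4,4; }              # order by the timestamp field
only_type() { grep "^s\.$1\." || true; }     # like 't <type>'
latest()    { tail -n "$1"; }                # like 'l <count>'

# Like 'eman select t mert l 3': the (up to) 3 latest mert steps.
type_first=$(printf '%s\n' "$steps" | only_type mert | by_time | latest 3)
# Like 'eman select l 3 t mert': merts among the 3 latest steps of any type.
latest_first=$(printf '%s\n' "$steps" | by_time | latest 3 | only_type mert)

echo "type first:   $type_first"
echo "latest first: $latest_first"
```

The first query yields both mert steps, the second only the one that is among the three most recent steps overall.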

USAGE PATTERNS

  eman traceback SPEC -s '/.../.../'
  # preview the experiment with some vars replaced
  # append "--colorize | less" to preview

  eman traceback SPEC -s '/.../.../' | eman clone
  # clone the whole subtree of steps replacing some vars

  VAR=x eman redo SPEC
  # clone the top step (and unusable predecessors) replacing VAR
  # with some new value x

  eman redo SPEC --start --outdate
  # redo all failed/outdated steps, mark them as outdated

  eman td SPEC --stat
  # see everything that was derived (by redo or clone) from SPEC
  # useful e.g. for chasing redone experiments and finding the last one

  eman annotate --stat < my-notes.txt > new-notes.txt
  # change "s.anything.123 (anything)" into "s.anything.123 (RUNNING)"
  # useful for making your notes up-to-date

  eman abolish `eman select t STEPTYPE f`
  # remove all content files of all failing steps of a STEPTYPE

  eman stat `qstat | cut -c 1-8 | skip 2`
  # show status of all running/held jobs in SGE
  # WARNING: This only works if there are no non-eman cluster jobs.
  # Otherwise, eman will just say "Failed to guess step from: $job_id".
  # the skip command is available here:
  #   http://www.cuni.cz/~obo/textutils/#skip

  eman sel t MERT fr t EVAL tre FOO
  # list all MERT steps that were used to construct EVAL steps tagged
  # with the tag FOO

  eman sel t lm vre tag --stat --vars | grep 'ORDER\|s.lm'
  # which LM orders do I have over morphological 'tag's?

  eman select r nq
  # which failed jobs failed to fail correctly?
  # eman believes they are still running but cluster does not know them

  eman sel t evaluator br t lm vre 'CORP.*=MYCORPUS'
  # all 'evaluator' steps which use a language model trained on MYCORPUS

  eman sel t tm not fr t evaluator
  # all 'tm' steps which have not been evaluated (i.e. no step of type
  # 'evaluator' transitively depends on them)

  eman sel f br vre 'ALICORP=.*un.*'
  # list all failed steps depending (transitively) on ALICORP=*un*
  # (e.g. un.fr-en or un.es-en)

  eman select t translate not u t evaluator
  # list all translate steps whose users (intransitive) do not include any
  # evaluator step

  eman ls > steps-all.txt
  eman traceback --notree \
    `qstat | grep $USER | grep -P '\ss\.' | cut -d' ' -f1 | grep -v 4296093` \
    | sort -u > steps-tabu.txt
  perl -e '
    open(TABU, "steps-tabu.txt") or die;
    while(<TABU>) { $tabu{$_}++ }
    open(ALL, "steps-all.txt") or die;
    while(<ALL>) { print unless($tabu{$_}) }' > steps-free.txt
  for i in `cat steps-free.txt` ; do
    echo $i ; cp -r $i /net/cluster/TMP/$USER/new_playground
  done
  for i in `cat steps-free.txt` ; do echo $i ; rm -rf $i ; done
  # I want to move the playground to another disk because I am hitting the
  # quota. But I need to know first which steps are running or prerequisites
  # of something running (immovable at the moment). Note the "grep -v 4296093"
  # above. Eman will die if any of the running jobs don't match a known step.
  # (BEWARE: Copying steps will also separate numerous hardlinks we have there.
  # The target disk usage will be much higher until we rejoin the hardlinks
  # again or erase them.)

If a Cluster Node Completely Dies

A cluster node can die completely when a job (possibly yours) takes too much RAM. Assuming 2472256 is the SGE job ID of the failed (but still allegedly running) job, this is probably what you want to do:

  qdel 2472256                         # remove it from the cluster
  eman fail 2472256                    # mark it FAILED for eman
  eman redo 2472256 --start --mem 20g  # clone and re-run it with some more memory
  # or:
  eman continue 2472256 --mem 20g      # re-run it with more memory
    # eman redo will clone the failed step (and reuse steps it depends on)
    # eman continue will rerun the original step instead (run eman.command
    # without generating it anew); note that running users of the step cannot
    # be saved like this (run eman redo on the last step afterwards)

COMMON PROBLEMS

Outdated Index File

If you run e.g. 'eman users ...', get a step directory, but subsequent eman commands fail to find it ("Failed to guess step from:..."), try running:

  eman reindex

The step may indeed be a zombie, a removed directory.

Multiple Step Instances, Some Failed, Some OK

By cloning, you can easily end up with several instances of the same step (i.e. two or more distinct step directories with identical variables). Sometimes, some of the instances may have failed while others finished successfully.

When such a step is further used in an experiment, and you clone the experiment, eman will automatically use the oldest plausible instance (FAILED, OUTDATED and ABOLISHED instances are not considered plausible).
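The selection rule can be illustrated with a short shell sketch. It is not eman's own code: oldest_plausible is a hypothetical helper, the instances are made up, and the input format (step directory and status on one line) is an assumption.

```shell
#!/bin/sh
# Illustration of the selection rule, not eman's own code.
# Input: lines of "STEPDIR STATUS" for instances with identical variables.
oldest_plausible() {
  awk '$2 != "FAILED" && $2 != "OUTDATED" && $2 != "ABOLISHED" { print $1 }' |
    sort -t. -k4,4 |   # the timestamp is the fourth dot-separated field
    head -n 1
}

instances='s.lm.aaa.20101120-0900 FAILED
s.lm.bbb.20101125-1030 DONE
s.lm.ccc.20101122-1745 DONE
s.lm.ddd.20101121-0800 OUTDATED'
chosen=$(printf '%s\n' "$instances" | oldest_plausible)
echo "$chosen"
```

The two oldest instances are skipped because they are FAILED and OUTDATED, so the older of the two DONE instances is picked.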

To pick a specific instance of a step manually (including implausible instances), use --reuse. To avoid some instances, use --avoid.

You may wish to use 'eman dups' every now and then to get rid of (or abolish) some of the unused instances.

WRITING SEEDS: CORE CONVENTIONS

Seeds (in eman.seeds) have to follow some conventions.

- executable

- respond to environment variables

- exit code 0 for success, other for failure

- init and prepare by default

- init only if $INIT_ONLY==yes

When Initing

- create the file: eman.vars

- optionally also create: eman.tag, eman.deps

Note that for reliable cloning, deps must be directly determined from the vars. It is actually best to include the full name of the dependence in one of the variables.
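One way to keep deps strictly determined by vars is to generate eman.deps from eman.vars during init. The sketch below assumes that every variable whose value looks like a full step name is a dependence; the helper name, the variable names and the step-name pattern are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch for the init phase of a seed: derive eman.deps purely from the
# variables. The variable names and the step-name pattern are assumptions.
write_deps_from_vars() {
  # Take every variable value that looks like s.TYPE.HASH.YYYYMMDD-HHMM.
  cut -d= -f2- eman.vars |
    grep -E '^s\.[^.]+\.[^.]+\.[0-9]{8}-[0-9]{4}$' > eman.deps || true
}

# Demo in a throwaway directory standing in for a step directory.
stepdir=$(mktemp -d)
(
  cd "$stepdir" || exit 1
  printf '%s\n' 'ALIGN=s.align.DEF.20101127-1856' 'ORDER=5' > eman.vars
  write_deps_from_vars
)
deps=$(cat "$stepdir/eman.deps")
echo "$deps"
```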

When Preparing

- create the file: eman.command
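Put together, the conventions suggest a seed shaped roughly like the sketch below. It is a minimal illustration under stated assumptions, not a seed shipped with eman: the variable CORPUS and the placeholder computation are made up, and the seed body is wrapped in a function (demo_seed) only so that the demo can call it in a throwaway directory. A real seed would be an executable file in eman.seeds and would use exit instead of return.

```shell
#!/bin/sh
# Hypothetical seed sketch. CORPUS and the placeholder computation are
# illustrative assumptions; a real seed is a standalone executable.
demo_seed() {
  # Respond to environment variables; non-zero exit code signals failure.
  if [ -z "${CORPUS:-}" ]; then
    echo "CORPUS must be set" >&2
    return 1
  fi
  # Init: record the variables configuring this step.
  echo "CORPUS=$CORPUS" > eman.vars
  # Stop here if only the init was requested.
  if [ "${INIT_ONLY:-}" = "yes" ]; then
    return 0
  fi
  # Prepare: write the command that does the real work and reports
  # DONE or FAILED in eman.status.
  cat > eman.command <<EOF
#!/bin/sh
# Placeholder for the long computation.
if echo "processed $CORPUS" > output.txt; then
  echo DONE > eman.status
else
  echo FAILED > eman.status
fi
EOF
  chmod +x eman.command
}

# Demo: init only, then init + prepare, then run the generated command.
stepdir=$(mktemp -d)
( cd "$stepdir" && CORPUS=news INIT_ONLY=yes demo_seed )
if [ -f "$stepdir/eman.command" ]; then init_only_has_command=yes; else init_only_has_command=no; fi
( cd "$stepdir" && CORPUS=news demo_seed )
( cd "$stepdir" && sh ./eman.command )
status=$(cat "$stepdir/eman.status")
echo "$status"
```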

WRITING SEEDS: TIPS

Inheriting Variables

It is often useful to propagate a value of a variable from a dependence to the current step. This can be easily achieved:

  INHERITED=`grep '^TO_INHERIT=' ../$DEPENDENCE/eman.vars | cut -d= -f2-`
  echo INHERITED=$INHERITED >> eman.vars # save as our variable

Note that your seed can be asked to perform the init before the $DEPENDENCE was prepared and likewise, it can be asked to prepare before $DEPENDENCE was run. So avoid asking for files in $DEPENDENCE too early.

Note DZ: The complicated code above is probably outdated. There is now support for variable inheriting directly in eman. This is an example of what to include in the seed of the inheriting step:

  eman defvar INHERITED inherit=PREVIOUSSTEP
  eman defvar INHERITED2 inherit=PREVIOUSSTEP:VARIABLE

Inheriting Dependencies

Consider the following traceback:

  +- s.tm.ABC.20101127-1856
  |  | ALIGN=s.align.DEF.20101127-1856
  |  | BINARIES=s.binaries.GHI.20101127-1856
  |  +- s.align.DEF.20101127-1856
  |  |  | BINARIES=s.binaries.GHI.20101127-1856
  |  |  +- s.binaries.GHI.20101127-1856
  |  +- s.binaries.GHI.20101127-1856

The step 'binaries' is used by 'tm' directly but could be 'inherited' from 'align', so that we don't have to specify it when initing 'tm' and also the traceback is simpler:

  +- s.tm.ABC.20101127-1856
  |  +- s.align.DEF.20101127-1856
  |  |  +- s.binaries.GHI.20101127-1856

The best technique to achieve this simplification is:

1. 'align' should have BINARIES as a variable as well as a dependence.

2. 'tm' should have BINARIES only as a variable and not as a dependence.

3. The seed of 'tm' should use the given ALIGN to copy BINARIES from there.

4. For best flexibility, 'tm' should allow for using a different 'binaries' step. If and only if this happens, 'tm' should add the extra dependence:

  +- s.tm.ABC.20101127-1856
  |  | ALIGN=s.align.DEF.20101127-1856
  |  | BINARIES=s.binaries.JKL.20101130-1100
  |  +- s.align.DEF.20101127-1856
  |  |  | BINARIES=s.binaries.GHI.20101127-1856
  |  |  +- s.binaries.GHI.20101127-1856
  |  +- s.binaries.JKL.20101130-1100

Here is a proposed bash solution for the 'tm' seed:

  # read the BINARIES variable of the 'align' step we depend on
  INHERITED=`grep '^BINARIES=' ../$ALIGN/eman.vars | cut -d= -f2-`
  if [ -z "$BINARIES" ] || [ "$BINARIES" = "$INHERITED" ]; then
    # inheriting: no extra dependence needed
    BINARIES=$INHERITED
  else
    # using our own 'binaries' step: depend on it explicitly
    echo "$BINARIES" >> eman.deps
  fi
  # always store the var
  echo "BINARIES=$BINARIES" >> eman.vars

The topic of inherited dependencies is related to the question whether we see the experiment as a tree or a directed acyclic graph.

FUTURE EXTENSIONS

eman.requests

A script in the step directory that would, based on the size of the step's data etc., predict the requirements on disk and memory and output a string such as:

  --mem 30g --disk 100g

The output would be used as additional arguments for eman start/continue.

The script would be optional and command-line specified parameters would override its output.
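Such a script could be as small as the sketch below. Everything in it is hypothetical, since the feature is only proposed: the function name and the scaling factors are invented placeholders, and only the output format follows the description above.

```shell
#!/bin/sh
# Hypothetical eman.requests sketch. The scaling factors are invented
# placeholders; only the output format follows the description above.
suggest_requests() {
  data_gb=$1                  # size of the step's input data in GB
  mem=$((data_gb * 3))        # assumed: memory is about 3x the input size
  if [ "$mem" -lt 6 ]; then
    mem=6                     # never ask for less than the 6g default
  fi
  disk=$((data_gb * 10))      # assumed: scratch space is about 10x the input
  echo "--mem ${mem}g --disk ${disk}g"
}

suggestion=$(suggest_requests 10)
small=$(suggest_requests 1)
echo "$suggestion"
```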

AUTHOR

Ondřej Bojar <obo@matfyz.cz>

Contributions by Aleš Tamchyna and Dan Zeman.