CoNLL-2009 Shared Task:
Syntactic and Semantic Dependencies in Multiple Languages



CoNLL-2009 Shared Task Scorer

The scorer is based on the 2008 scorer (see CoNLL 2008 Shared Task Wiki). The description of the script is therefore also derived from the 2008 description. The main differences between the current and previous version are marked by the word NEW in red. Some minor changes (mostly English-specific features of the script removed to allow for processing of various languages) are not mentioned here, just commented in the code.

Help message

  CoNLL-09 evaluation script:

   [perl] [OPTIONS] -g <gold standard> -s <system output>

  This script evaluates a system output with respect to a gold standard.
  Both files should be in UTF-8 encoded CoNLL-09 tabular format.

  The output breaks down the errors according to their type and context.

  Optional parameters:
     -o FILE : output: print output to FILE (default is standard output)
     -q : quiet:       only print overall performance, without the details
     -b : evalb:       produce output in a format similar to evalb
                       (use together with -q)
     -p : punctuation: do not score punctuation (default is to score)
     -v : version:     show the version number
     -h : help:        print this help text and exit

NEW: The option -u (SU preds/args: score SU predicates and arguments) has been removed.

Note: both the gold standard and the system output must follow the format used in this shared task for the training and development corpora. In particular, every sentence, including the last one in a file, must be followed by exactly one blank line. White-space characters are used only to separate columns; no leading or trailing white-space is allowed on a data line.
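The formatting rules above can be checked mechanically before scoring. The sketch below is a hypothetical helper, not part of the official scorer; it flags the violations the script would reject:

```python
def check_conll09(lines):
    """Return a list of (line_number, message) format problems.

    Checks the rules stated above: no leading/trailing white-space on
    data lines, no consecutive blank lines, and the last sentence must
    be followed by exactly one blank line.
    """
    problems = []
    prev_blank = False
    for i, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line == "":
            if prev_blank:
                problems.append((i, "more than one consecutive blank line"))
            prev_blank = True
        else:
            prev_blank = False
            if line != line.strip():
                problems.append((i, "leading or trailing white-space on a data line"))
    if not lines or lines[-1].rstrip("\n") != "":
        problems.append((len(lines), "last sentence must be followed by one blank line"))
    return problems
```

A file that ends without the final blank line, or with extra blank lines between sentences, would be reported rather than silently mis-scored.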

How it works

The scorer consists of three main parts: a) evaluation of syntactic dependencies, b) evaluation of semantic frames, and c) combination of the two tasks into a single overall score.

The syntactic dependencies are evaluated using exactly the same algorithm and measures as the 2007 shared task scorer. We report the same three scores: labeled attachment, unlabeled attachment, and label accuracy.

The semantic frames are evaluated by reducing them to semantic dependencies, similarly to 2008: we create a semantic dependency from every predicate to each of its individual arguments. These dependencies are labeled with the labels of the corresponding arguments. Additionally, we create a semantic dependency from each predicate to a virtual ROOT node. The latter dependencies are labeled with the predicate senses. This approach guarantees that the semantic dependency structure conceptually forms a single-rooted, connected (but not necessarily acyclic) graph. More importantly, this scoring strategy implies that a system that assigns an incorrect predicate sense still receives some points for the correctly assigned arguments. For example, for the correct proposition:

verb.01: ARG0, ARG1, ARGM-TMP

the system that generates the following output for the same argument tokens:

verb.02: ARG0, ARG1, ARGM-LOC

receives a labeled precision score of 2/4 because two out of four semantic dependencies are incorrect: the ROOT dependency is labeled “02” instead of “01” and the dependency to the “ARGM-TMP” is incorrectly labeled “ARGM-LOC”. For both labeled and unlabeled dependencies we report precision (P), recall (R), and F1 scores.
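The reduction and scoring described above can be sketched as follows. This is a simplified illustration, not the official scorer; `to_dependencies` and `labeled_prf` are hypothetical helpers that mirror its logic on already-parsed propositions:

```python
def to_dependencies(propositions):
    """Reduce frames to semantic dependencies.

    Each proposition is (pred_token, sense, {arg_token: label}).
    Every predicate yields one dependency to the virtual ROOT, labeled
    with its sense, plus one dependency per argument, labeled with the
    argument label.
    """
    deps = set()
    for pred, sense, args in propositions:
        deps.add((pred, "ROOT", sense))
        for arg, label in args.items():
            deps.add((pred, arg, label))
    return deps

def labeled_prf(gold, system):
    """Standard precision/recall/F1 over two dependency sets."""
    correct = len(gold & system)
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

On the example above (token positions chosen arbitrarily), the sense mismatch and the ARGM-TMP/ARGM-LOC mismatch leave two of four dependencies correct, giving precision and recall of 0.5.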

NEW: If the PRED column values are of the form lemma.frame (dot separated, e.g. in English), the script compares only the frame part (i.e. the 2nd part) of the predicate.
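That normalization can be sketched as a one-liner (a hypothetical helper, assuming at most one meaningful dot in the PRED value):

```python
def pred_frame(pred):
    """Keep only the frame part (the 2nd part) of a dot-separated
    lemma.frame PRED value; values without a dot are left unchanged."""
    return pred.split(".", 1)[1] if "." in pred else pred
```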

NEW: In the Czech and Japanese data, multiple semantic functions can appear in the APREDn columns. They are represented as a vertical-bar-separated list of functions, and each function is scored as one point, both for the labeled and the unlabeled scores. For example, for a correct proposition such as (the functors below are an illustrative reconstruction):

pred.f1: ACT|EFF, ADDR

the system that generates the following output for the same argument tokens:

pred.f1: ACT|PAT, ADDR

receives a labeled precision score of 3/4 because the PAT is incorrect, and a labeled recall score of 3/4 because the EFF is missing. The unlabeled scores are counted in the same way, one point per function. In other words, the semantic structure can be represented by a multigraph (multitree).
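Because the same predicate-argument pair can carry several functions, the counting is over multisets of edges rather than sets. A sketch of that bookkeeping (hypothetical helpers, not the scorer's own code):

```python
from collections import Counter

def cell_to_edges(pred, arg, cell):
    """Split a bar-separated APRED cell like "ACT|EFF" into one labeled
    edge per function; each edge is worth one point."""
    return [(pred, arg, label) for label in cell.split("|")]

def labeled_counts(gold_edges, sys_edges):
    """Return (correct, system_total, gold_total) using multiset
    intersection, so duplicate edges are matched at most pairwise."""
    g, s = Counter(gold_edges), Counter(sys_edges)
    correct = sum((g & s).values())
    return correct, sum(s.values()), sum(g.values())
```

With a gold cell "ACT|EFF" against a system cell "ACT|PAT" (plus a matching ROOT edge and a matching ADDR edge), this yields 3 correct out of 4 on each side, i.e. the 3/4 precision and recall above.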

To help system developers, we also report precision and recall scores for several sub-tasks. First, we report labeled and unlabeled scores for the semantic dependencies to ROOT. The unlabeled scores for the ROOT dependencies measure the performance of the predicate identification sub-task; the labeled scores for the ROOT dependencies measure the performance of the predicate identification and classification sub-tasks combined. These scores are reported for each predicate PPOS tag. Additionally, we report precision and recall scores for non-ROOT semantic dependencies. These scores measure the performance of the argument identification and argument classification sub-tasks, and are reported for each combination of predicate PPOS tag and argument label, indicating how the corresponding system performs for a given argument label in a given corpus 1). For example, the label “N + ACT” for the Czech data indicates the performance on ACT arguments of nouns (N). All these statistics are displayed if the scorer runs in verbose mode (i.e., without the -q option).

Global scores: the scorer computes two types of global scores, using macro and micro combination strategies. The macro strategy computes macro precision and recall scores by averaging the precision/recall for semantic dependencies with the attachment scores for syntactic dependencies 2). For example:

LMP = Wsem * LPsem + (1 - Wsem) * LAsyn
LMR = Wsem * LRsem + (1 - Wsem) * LAsyn

where LMP is the labeled macro precision, LPsem is the labeled precision for semantic dependencies, and LAsyn is the labeled attachment for syntactic dependencies. Similarly, LMR is the labeled macro recall and LRsem is the labeled recall for semantic dependencies. Wsem is the weight assigned to the semantic task. We assign equal weight to the two tasks, i.e., Wsem = 0.5. The macro-level F1 score is computed using the standard formula applied to the macro-level precision and recall scores.
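The macro combination above is a straightforward weighted average; a minimal sketch with Wsem = 0.5 as in the shared task (the function name is ours, not the scorer's):

```python
def macro_scores(lp_sem, lr_sem, la_syn, w_sem=0.5):
    """Combine semantic labeled P/R with the syntactic labeled
    attachment score, then derive macro F1 from the macro P and R."""
    lmp = w_sem * lp_sem + (1 - w_sem) * la_syn   # labeled macro precision
    lmr = w_sem * lr_sem + (1 - w_sem) * la_syn   # labeled macro recall
    lmf = 2 * lmp * lmr / (lmp + lmr) if lmp + lmr else 0.0
    return lmp, lmr, lmf
```

For instance, semantic P/R of 0.80/0.60 with a labeled attachment score of 0.90 gives macro P = 0.85 and macro R = 0.75.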

The micro strategy puts all syntactic and semantic dependencies in the same bag, and then computes standard precision, recall, and F1 scores.
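The micro combination can be sketched from pooled counts. Since the syntactic attachment score is a special case of precision and recall where the predicted and gold counts coincide (see footnote 2), the syntactic total appears in both denominators (again a hypothetical helper):

```python
def micro_prf(correct_syn, total_syn, correct_sem, sys_sem, gold_sem):
    """Pool syntactic and semantic dependencies into one bag.

    correct_syn/total_syn: correct and total syntactic dependencies
    (predicted count == gold count for attachment);
    correct_sem, sys_sem, gold_sem: semantic counts.
    """
    p = (correct_syn + correct_sem) / (total_syn + sys_sem)
    r = (correct_syn + correct_sem) / (total_syn + gold_sem)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Because total_syn is typically much larger than the semantic counts, the syntactic term dominates both ratios, which is exactly why the macro combination is preferred for ranking.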

For the final scores, we will sort systems in the Joint task based on the labeled macro F1 scores. We prefer macro-level scores because in the macro scenario we can assign equal weight to the two tasks. In the micro setup, the syntactic task dominates the overall score because there are many more syntactic dependencies than semantic ones. For the SRL-only task, the systems will be sorted according to their semantic F1 score.


This software is based on the script used in the 2008 CoNLL shared task; in fact, the evaluation is almost identical to last year's. We gratefully acknowledge the original authors: Yuval Krymolowski, Sabine Buchholz, Prokopis Prokopidis, Deniz Yuret, Mihai Surdeanu, and James Henderson. Adaptation to 2009: Mihai Surdeanu, Massi Ciaramita, Jan Stepanek, Pavel Stranak.


1) We make the reasonable assumption here that the PPOS tags are generally correct.
2) Note that the attachment scores for syntactic dependencies are a special case of precision and recall, where the predicted number of dependencies is equal to the number of gold dependencies.