UMR Parsing Shared Task

To be held as part of the DMR 2026 workshop, co-located with LREC (Palma de Mallorca, Spain).

Interested? Register here.

Timeline

We are tentatively working with the following schedule:

  • 1 February 2026: Training data available
    • Latin data updated on 4 February 2026, 9:00 AM UTC.
  • 8 February 2026: Evaluation script available
  • 16 February 2026: Blind test data available. Test phase starts.
  • 27 February 2026: Submission of system outputs. Test phase ends.
  • 1 March 2026: Announcement of the results.
  • 15 March 2026: System description papers due.
  • 22 March 2026: Reviews of system description papers due.
  • 30 March 2026: Camera-ready papers due.
  • 11 May 2026: DMR Workshop (co-located with LREC as W5), Palma de Mallorca, Spain
  • 12–16 May 2026: Rest of LREC, Palma de Mallorca, Spain

Data

Training data is largely based on UMR 2.1, but it is not identical to it. There is training data for six languages: English, Czech, Latin, Chinese, Arapaho, and Navajo. Test data may contain additional languages, leading to zero-shot scenarios. Note that we distinguish two types of data available for training: “clean” data (which should be reasonably similar to the gold-standard test data, but is typically very small, and in the case of Latin non-existent) and “dirty” data (which is much larger, especially for Czech and English, but is imperfect or incomplete in various respects – use at your own risk!)

All training/development data is freely available, with no registration or contract signing required. A temporary download URL is active during the shared task; after the shared task, the data will be published at a permanent location.

We have published a specification of the UMR file format. Participants will be expected to submit valid system outputs in the same format. The blind data provided as system input will be tokenized and segmented into sentences; to facilitate evaluation, system outputs must preserve this tokenization and segmentation.
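
For orientation only (the published specification is authoritative, and everything below is invented), a UMR document file groups the annotation of each sentence into blocks roughly along these lines:

# :: snt1
Index: 1 2 3
Words: John left early

# sentence level graph:
(s1l / leave-01
    :ARG0 (s1p / person
        :name (s1n / name :op1 "John"))
    :aspect performance)

# alignment:
s1l: 2-2
s1p: 1-1

# document level annotation:
(s1s0 / sentence
    :temporal ((document-creation-time :before s1l)))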

Evaluation

At the beginning of the test phase, participants will receive blind test data, tokenized and segmented into sentences. Each document will be in a separate text file, one sentence per line, with tokens separated by a space. For each input file, the participating system must generate a corresponding valid UMR file with exactly the same sentences and tokens. Each sentence must have all four annotation blocks (tokens, sentence-level graph, alignment, document-level graph); if the system cannot predict certain types of annotation (e.g. document-level relations), the corresponding block must still be present, even if empty. Such omissions will naturally be penalized by a lower score. We specifically point out that the token-node alignment should not be omitted, as it affects the mapping between system and gold nodes and, consequently, the evaluation of all relations and attributes.
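
As a minimal sketch of reading this input (the file name is hypothetical and UTF-8 encoding is assumed), the blind data can be loaded into per-sentence token lists, which should then be carried over unchanged into the output UMR file:

# Minimal sketch: read one blind input document, one sentence per line,
# tokens separated by spaces. The file name below is hypothetical.
def read_blind_document(path):
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                sentences.append(line.split(" "))
    return sentences

tokens = read_blind_document("doc_001.txt")   # e.g. [["John", "left", "early"], ...]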

Submitted UMR files will first be checked by the validation script. A file that does not pass validation will not be processed by the scoring script, and its score will be set to 0. Not all tests available in the validation script have to pass: it is sufficient if the file passes validation with the following options (replace myfile.umr with the path to the file being validated):

python validate.py --level 2 myfile.umr
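
Before packaging a submission, it may be convenient to run the same check over every output file. A minimal sketch in Python, assuming the (hypothetical) outputs sit in an output/ directory with the .umr extension:

import glob
import subprocess
import sys

# Sketch: run the validation with the same options over every .umr file in a
# hypothetical output/ directory and report the files that fail.
failed = []
for path in sorted(glob.glob("output/*.umr")):
    result = subprocess.run(["python", "validate.py", "--level", "2", path])
    if result.returncode != 0:
        failed.append(path)
if failed:
    sys.exit("Validation failed for: " + ", ".join(failed))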

If the file passes validation, it will be compared with the gold-standard file and scored. As is usual in the evaluation of graph-based semantic representations, the main score is the F₁ over triples (node0 :relation node1), (node :attribute value), and (node :concept concept). Node identifiers (variables) in the system-produced file do not have to match the ids in the gold-standard file. The algorithm that maps system nodes to gold nodes is tailored to the specifics of Uniform Meaning Representation (in particular, the availability of node-token alignment); this contrasts with the Smatch score that is often used to evaluate AMR. The evaluation script can be invoked as follows:

perl compare_umr.pl GOLD goldfile.umr SYS myfile.umr --quiet

By default, the script runs in verbose mode and prints extensive diagnostic information comparing the two files. With the --quiet option, it prints only the final score.
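
Conceptually (leaving aside the node-mapping step that compare_umr.pl performs using the alignments), the triple-based F₁ reduces to precision and recall over two sets of triples. A minimal sketch in Python, with invented triples:

# Conceptual sketch of the triple-based F1: both annotations are viewed as
# sets of triples after system nodes have been mapped to gold nodes.
def triple_f1(system_triples, gold_triples):
    system_triples, gold_triples = set(system_triples), set(gold_triples)
    matched = len(system_triples & gold_triples)
    precision = matched / len(system_triples) if system_triples else 0.0
    recall = matched / len(gold_triples) if gold_triples else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Invented example: the system gets both concepts right but one relation wrong.
gold = [("g1", ":concept", "leave-01"), ("g2", ":concept", "person"), ("g1", ":ARG0", "g2")]
syst = [("g1", ":concept", "leave-01"), ("g2", ":concept", "person"), ("g1", ":ARG1", "g2")]
print(triple_f1(syst, gold))   # 0.666...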

Besides the main metric used for ranking the participating systems, we also plan to compute various additional metrics (such as a separate F₁ score for concepts, or a score for sentence-level graphs that disregards document-level relations).

The shared task is not divided into any tracks. System outputs will be submitted to the task as a whole, and every submission will be evaluated along the same set of metrics.

Participation

Individuals and teams considering participation should register via a simple Google form (https://forms.gle/pc2c7A27TxeHjRKZ7). There is no registration deadline, but the sooner the better, as we intend to send important information to registered participants by e-mail.

There are no restrictions on who can participate. (The two main organizers will not participate.)

The link to the submission form will be posted here before the test phase starts. Participants will submit system outputs (parsed data), not the systems themselves. Each submission will be automatically checked for validity, so that participants know whether their submission can be evaluated.

Contact

Questions? Contact the organizers: