UMR File Format

This page provides a low-level specification of how UMR annotation is encoded in a file. It does not cover the UMR annotation guidelines, which are documented separately.

UMR (Uniform Meaning Representation) is stored in a text file with Unix-style line breaks (LF, not CR LF), encoded in UTF-8, with Unicode normalized to NFC. It is structured into sentences, each sentence starting with a line containing exactly 80 hashtags ('#') and ending with two empty lines.

A sentence annotation consists of four annotation blocks in fixed order:

tokens
sentence-level graph
alignment
document-level graph

Each block ends with one empty line, except the last block, which ends with two empty lines (end of sentence). The first block starts with the 80 hashtags (beginning of sentence), while each subsequent block starts with a prescribed comment line:

# sentence level graph:
# alignment:
# document level annotation:

This block-initial line may be followed by further comment lines (each starting with a hashtag, followed by anything).

If a certain type of annotation (e.g., the document-level graph) is not available, the block may be empty, but it must still be present: there must be at least the initial comment line followed by the terminating empty line(s).
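
The block layout described above can be sketched in Python as follows. This is only a minimal illustration, not part of any official UMR tooling: the names split_sentences, split_blocks and has_expected_headers are hypothetical, and the sketch assumes a well-formed file in which no block contains an empty line and the prescribed comment lines carry no trailing whitespace.

```python
SENTENCE_START = "#" * 80  # the 80-hashtag line that opens every sentence

# Prescribed first lines of blocks 2-4, as listed in the specification.
BLOCK_HEADERS = [
    "# sentence level graph:",
    "# alignment:",
    "# document level annotation:",
]


def split_blocks(lines):
    """Group the lines of one sentence into blocks separated by empty lines."""
    blocks, block = [], []
    for line in lines:
        if line.strip() == "":
            if block:
                blocks.append("\n".join(block))
                block = []
        else:
            block.append(line)
    if block:
        blocks.append("\n".join(block))
    return blocks


def split_sentences(text):
    """Return one list of block strings per sentence.

    The 80-hashtag line itself is dropped, so the first block starts
    with the header comments of the token block.
    """
    sentences, current = [], None
    for line in text.split("\n"):
        if line == SENTENCE_START:
            if current is not None:
                sentences.append(current)
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        sentences.append(current)
    return [split_blocks(lines) for lines in sentences]


def has_expected_headers(blocks):
    """Check that there are four blocks and blocks 2-4 start as prescribed."""
    return len(blocks) == 4 and all(
        block.split("\n")[0] == header
        for block, header in zip(blocks[1:], BLOCK_HEADERS)
    )
```

An empty block still consists of its initial comment line, so it is counted like any other block and a sentence without document-level annotation still yields four blocks.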

The Token Block

This is the header of the sentence. It may contain various kinds of information but the most important (and mandatory) part is the 'Words:' line, which lists the tokens (words + punctuation) in the sentence.

################################################################################
# meta-info :: sent_id = u_tree-cs-s1-root
# :: snt1
Index: 1   2    3 4     5      6       7  8           9         10
Words: 200 dead , 1,500 feared missing in Philippines landslide .

The comment line # :: snt1 gives the index of the sentence within the current document (file). In contrast, the sentence id in meta-info is optional and may refer to the original data source. The 'Index:' line is meant as an aid for human readers, who can use it to quickly look up token indices when reading or editing the alignment block.
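
A program does not need the 'Index:' line; it can derive the indices from the 'Words:' line directly. The following sketch assumes that tokens are separated by whitespace and numbered from 1, as the example above suggests. The function name parse_words is hypothetical.

```python
def parse_words(token_block):
    """Map 1-based token indices to tokens, read from the 'Words:' line."""
    for line in token_block.splitlines():
        if line.startswith("Words:"):
            tokens = line[len("Words:"):].split()
            return {index: token for index, token in enumerate(tokens, start=1)}
    # The 'Words:' line is mandatory, so its absence is treated as an error.
    raise ValueError("token block has no 'Words:' line")
```

For the example sentence, the sketch maps index 8 to the token Philippines, which matches the 'Index:' line.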

Sentence Level Graph

# sentence level graph:
(s1p / publication-91
    :ARG1 (s1l / landslide-01
        :ARG3 (s1a / and
            :op1 (s1d / die-01
                :ARG1 (s1p3 / person :quant 200)
                :aspect state)
            :op2 (s1f / fear-01
                :ARG1 (s1m / miss-01
                    :ARG1 (s1p2 / person :quant 1500)
                    :aspect state)
                :aspect state)
        :aspect process)
    :place (s1c / country :wiki "Philippines"
        :name (s1n / name :op1 "Philippines"))))

The sentence level graph is a hierarchical structure with branching represented using brackets. Indentation is a visual enhancement for human readers but it bears no significance for interpretation of the structure. The structure resembles a tree and in many cases it actually is a tree, but this is not a requirement. There may be reentrancies (a child node has multiple parent nodes) and occasionally even cycles.

A description of a node starts with an opening bracket, followed by the node id (also called “variable”, e.g., s1p), a slash ('/'), and the semantic concept to which the node corresponds (e.g., publication-91). The description ends with the matching closing bracket.

The node may contain a number of relations (also called “roles”) leading to child nodes. The relations are denoted by keywords starting with a colon (e.g., :ARG1). Some of these keywords denote node “attributes” rather than relations. Attributes are followed by a value rather than by a child node. The value can be a string (:wiki "Philippines"), an atomic keyword (:aspect state) or a number (:quant 1500). Note that the difference between relations and attributes is blurry and some keywords (e.g., :quant) can be used as both, depending on context.

Reentrancies: When there is a relation from the current node to a node that has already been defined, only its node id (variable) is given after the relation. For example, in the graph of the sentence “The boy wants to go”, boy (s1b) is an :ARG0 of both the wanting event and the going event:

(s1w / want-01
    :ARG0 (s1b / boy)
    :ARG1 (s1g / go-01
        :ARG0 s1b))
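
The bracket notation can be read with a small recursive-descent parser. The sketch below is a simplified illustration rather than a reference implementation: the name parse_graph is hypothetical, strings are assumed not to contain escaped quotes, and malformed input is not reported gracefully. Whitespace, including indentation, is skipped by the tokenizer, in line with the rule that indentation carries no meaning.

```python
import re

# Tokens: brackets, the slash, quoted strings, and any other run of
# characters that are not whitespace, brackets or slashes.
TOKEN_RE = re.compile(r'\(|\)|/|"[^"]*"|[^\s()/]+')


def parse_graph(text):
    """Return (top variable, {variable: concept}, list of triples).

    Each triple is (source variable, role, target), where the target is
    a variable (child node or reentrancy) or an attribute value.
    """
    tokens = TOKEN_RE.findall(text)
    pos = 0
    concepts = {}
    triples = []

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(" or tokens[pos + 2] != "/":
            raise ValueError("expected '(variable / concept'")
        var = tokens[pos + 1]
        concepts[var] = tokens[pos + 3]
        pos += 4
        while tokens[pos] != ")":
            role = tokens[pos]
            pos += 1
            if tokens[pos] == "(":
                target = parse_node()   # nested child node
            else:
                target = tokens[pos]    # reentrancy or attribute value
                pos += 1
            triples.append((var, role, target))
        pos += 1                        # consume the closing bracket
        return var

    top = parse_node()
    return top, concepts, triples
```

Applied to the graph of “The boy wants to go”, the sketch yields three triples, two of which have s1b as the target of :ARG0. A bare target is a reentrancy if it occurs as a key in the returned concept dictionary; otherwise it is an attribute value such as state or 1500.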

For most relations, an inverted relation can be created using the '-of' suffix. For example, (s1a / a :ARG0-of (s1b / b)) has the same meaning as (s1b / b :ARG0 (s1a / a)). However, the two representations differ in information packaging (distinguishing main predication from modification). Inverted relations also help reduce the number of cycles in UMR graphs.
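
When only the underlying relations matter, inverted relations can be rewritten into their direct form. The sketch below works on (source, role, target) triples. The name normalize_inverse and the exceptions parameter are hypothetical; the parameter is there because the rule holds for most relations, not all, so a role name that ends in '-of' without being an inversion can be listed explicitly. Note that this rewriting discards the information packaging mentioned above.

```python
def normalize_inverse(triples, exceptions=frozenset()):
    """Rewrite (a, ':X-of', b) as (b, ':X', a); keep other triples as they are.

    Roles listed in `exceptions` are never rewritten, even if they end
    in '-of'.
    """
    result = []
    for source, role, target in triples:
        if role.endswith("-of") and role not in exceptions:
            result.append((target, role[: -len("-of")], source))
        else:
            result.append((source, role, target))
    return result
```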

Node IDs (Variables)

By convention, a node id (variable) starts with the letter “s”, followed by the index of the current sentence, then a lowercase letter, and finally an optional number that makes the id unique. The letter is, if possible, identical to the first letter of the node's concept; most files use only English letters here, but accented letters from other alphabets may be encountered, too. The validation script may reject node ids that do not follow this convention.
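
The convention can be approximated with a regular expression. The following sketch is an assumption about how such a check might look, not the logic of the actual validation script; the names NODE_ID_RE and check_node_id are hypothetical. It does not check whether the letter matches the first letter of the concept, since that part of the convention is only a preference.

```python
import re

# "s", sentence index, one letter of any alphabet, optional number.
# [^\W\d_] matches a single Unicode letter.
NODE_ID_RE = re.compile(r"s([0-9]+)([^\W\d_])([0-9]*)")


def check_node_id(node_id, sentence_index):
    """Return True if node_id follows the naming convention for this sentence."""
    match = NODE_ID_RE.fullmatch(node_id)
    if match is None:
        return False
    index, letter, _number = match.groups()
    return int(index) == sentence_index and letter.islower()
```

The check applies to ids of nodes defined in the given sentence; references to nodes of other sentences naturally carry a different sentence index.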

Concepts

The concept strings consist of lowercase letters, hyphens and digits; the first character is a letter. Note that the letters in concepts are not restricted to the English alphabet!

Relations and Attributes

Relations (attributes, roles) start with a colon (“:”), followed by English letters, hyphens and digits.
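
The character rules for concepts and relations can be checked as sketched below. The function names is_valid_concept and is_valid_relation are hypothetical, and the sketch checks only the character inventory stated here, not whether a concept or relation is actually defined in UMR.

```python
import re

ASCII_DIGITS = "0123456789"

# A colon followed by English letters, hyphens and digits.
RELATION_RE = re.compile(r":[A-Za-z0-9-]+")


def is_valid_concept(concept):
    """Lowercase letters of any alphabet, hyphens and digits; letter first."""
    if not concept or not concept[0].isalpha():
        return False
    return all(
        ch == "-" or ch in ASCII_DIGITS or (ch.isalpha() and ch.islower())
        for ch in concept
    )


def is_valid_relation(relation):
    """Check the surface form of a relation or attribute keyword."""
    return RELATION_RE.fullmatch(relation) is not None
```

The two checks differ deliberately: concepts may contain letters from any alphabet, while relations are limited to English letters.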

The Alignment Block

This block specifies alignment between nodes of the sentence level graph and the tokens in the original sentence.

# alignment:
s1p: 0-0
s1l: 9-9
s1a: 0-0
s1d: 2-2
s1p3: 1-1
s1f: 5-5
s1m: 6-6
s1p2: 4-4
s1c: 8-8
s1n: 0-0

The ranges refer to token indices in the token block. Some nodes may remain unaligned but they should still appear here, listing their alignment as 0-0. On the other hand, it is not required that every token of the sentence is explicitly linked to a node.

A node may be aligned to multiple tokens. The tokens need not even be consecutive, in which case multiple ranges are given, separated by commas:

# alignment:
s2e1: 6-6,8-8
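
Alignment lines of this form can be read as sketched below. The sketch assumes that both ends of a range are inclusive, which is what the single-token ranges in the example suggest (9-9 for token 9), and it treats 0-0 as "unaligned". The function names parse_alignment_line and aligned_tokens are hypothetical.

```python
def parse_alignment_line(line):
    """Split 'node: a-b,c-d' into the node id and a list of (start, end) pairs."""
    node_id, _, spans = line.partition(":")
    ranges = []
    for span in spans.split(","):
        start, _, end = span.strip().partition("-")
        ranges.append((int(start), int(end)))
    return node_id.strip(), ranges


def aligned_tokens(ranges):
    """Expand ranges into token indices; the unaligned marker 0-0 yields nothing."""
    return [
        index
        for start, end in ranges
        if (start, end) != (0, 0)
        for index in range(start, end + 1)
    ]
```

Comment lines, such as the block-initial # alignment: line, should be skipped before calling parse_alignment_line.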

While it is not forbidden for a token to appear in the alignments of multiple nodes, gold standard data usually avoid such alignments. This is most prominently visible in named entity annotation. In our example, we have

:place (s1c / country :wiki "Philippines"
        :name (s1n / name :op1 "Philippines"))

and one could claim that both nodes (s1c and s1n) correspond to token 8 (“Philippines”). However, only s1c is aligned to that token.

Document Level Graph

# document level annotation:
(s1s0 / sentence
    :temporal  ((document-creation-time :before s1l)
            (s1l :overlap s1d)
            (document-creation-time :overlap s1f)
            (s1l :overlap s1m))
    :modal ((root :modal author)
            (author :full-affirmative s1l)
            (author :full-affirmative s1d)
            (author :full-affirmative s1f)
            (author :partial-affirmative s1m)))

The document level graph can be seen as a graph spanning the entire document (file), connecting nodes of individual sentence graphs and a few special nodes with additional relations. Document level annotation of a sentence contributes a new batch of relations to the document graph. These relations involve nodes of the current sentence, nodes of previous sentences, and special nodes specified by predefined keywords (e.g., root, author, document-creation-time).

Document level annotation of a sentence is enclosed in brackets like a sentence level node, with a unique node id (variable) and the concept “sentence”. It has up to three “relations” (:temporal, :modal and :coref, which can occur in any order), but in fact each of them is just a label for a group of document level relations of the given type. The label is followed by the set of relations enclosed in brackets. Each of these relations is also enclosed in its own pair of brackets and it is a triple (node1 :relation-type node2).
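
Because the structure is shallow (a label, a bracketed group, and flat triples inside it), the document level annotation of a sentence can be read with two regular expressions. This is a minimal sketch under the assumption that triples are never nested; the names GROUP_RE, TRIPLE_RE and parse_document_level are hypothetical.

```python
import re

# A group label followed by a bracketed sequence of flat triples.
GROUP_RE = re.compile(r"(:temporal|:modal|:coref)\s*\(((?:\s*\([^()]*\))*)\s*\)")

# One triple: (node1 :relation-type node2)
TRIPLE_RE = re.compile(r"\(\s*([^\s()]+)\s+(:[^\s()]+)\s+([^\s()]+)\s*\)")


def parse_document_level(text):
    """Map each group label that is present to its list of triples."""
    groups = {}
    for label, body in GROUP_RE.findall(text):
        groups[label] = TRIPLE_RE.findall(body)
    return groups
```

A group that is absent from the annotation is simply missing from the result, and an empty block (only the initial comment line) gives an empty dictionary.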
