FS file format description

Part One - Syntax common to all FS trees (not necessarily PDT)

FS files (.fs) encode the structure of natural-language sentences. Each file contains a sequence of trees whose nodes correspond to the words of a sentence. Each node (word) is described by a set of attributes.

The names and data types of particular attributes are not part of the FS format. Rather, each FS file has a header which defines the attributes of its tree nodes locally. To understand the Prague Dependency Treebank FS files, you need to read both this document (general FS syntax) and the definition of the FS attributes used in PDT.

Notes on metasyntax

Nonterminal symbols are enclosed in < > characters; terminal symbols or strings of terminal symbols are enclosed in double quotes. A C-like notation is used inside the quotes, so "\t" means the character with code 9, i.e. HTAB. The character "\n" represents the end of line regardless of the platform, i.e. it matches not only a real "\n" in the C sense but also "\r\n" (DOS/Windows EOL) or even "\r".

Any end of line escaped by a backslash (\\\n) has a special meaning: it exists only to make the file easier for humans to read. When the file is processed, such an escaped end of line is discarded immediately and its surroundings are parsed as if it were not present. Because it can appear almost anywhere, the syntax description does not mention it. It can even appear within an identifier, but unlike the other backslash-escaped function characters it does not become part of the identifier.
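
The preprocessing described above can be sketched in Python. This is only an illustration, not the original implementation: the function name strip_escaped_eols is made up here, and the sketch assumes that a backslash which is itself escaped by a preceding backslash does not escape a following end of line.

```python
def strip_escaped_eols(text: str) -> str:
    """Normalize line ends to "\n" and drop every backslash-escaped end of line."""
    # Treat "\r\n" and a lone "\r" as a plain end of line.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            if text[i + 1] == "\n":
                i += 2  # escaped end of line: discard both characters
                continue
            # Any other escape pair is kept untouched for the later parsing steps.
            out.append(text[i:i + 2])
            i += 2
            continue
        out.append(ch)
        i += 1
    return "".join(out)
```

For instance, an identifier split across two lines by a backslash is joined back into one piece, while an escaped backslash followed by a real end of line is left alone.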

The unary postfix operators "*", "+" and "?" mean that the operand appears n times in a row, where n>=0 for *, n>0 for +, and n is 0 or 1 for ?.

In contexts where a nonterminal can be interpreted as a set, the binary operator "-" can be used. It denotes a difference of two sets.

File structure

The file contains a header with node attribute definitions, and a sequence of trees.



<fs-file> ::=
<definition-line>+ "\n"+ (<tree> "\n")+
<editor-configuration>?

<editor-configuration> ::=
"(" <number> ("," <number>)* ")"

Note: The numbers in the editor configuration are indices of the attributes that are displayed by default. (The editor lets the user turn on the display of the remaining attributes.) The indices must be in ascending order, otherwise the program crashes. It is therefore impossible to enforce a different ordering of attributes when displaying the tree.
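
A reader of the editor configuration line could check the ordering before using it. The following Python sketch is hypothetical: the name parse_editor_configuration is invented, <number> is assumed to be a string of decimal digits (the grammar does not define it), and "ascending" is interpreted as strictly ascending.

```python
import re


def parse_editor_configuration(line: str) -> list[int]:
    """Parse a line such as "(1,2,5)" into a list of attribute indices."""
    match = re.fullmatch(r"\((\d+(?:,\d+)*)\)", line.strip())
    if match is None:
        raise ValueError(f"not an editor configuration: {line!r}")
    indices = [int(number) for number in match.group(1).split(",")]
    # Assumption: each index must be greater than the one before it.
    for previous, current in zip(indices, indices[1:]):
        if current <= previous:
            raise ValueError(f"indices are not ascending: {indices}")
    return indices
```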

Identifier, attribute name and value

An identifier is one of the main elements of the FS file syntax. It is a string of arbitrary characters that starts at the first character read and ends before the first function character (which itself is not part of the identifier). Function characters can be part of an identifier if they are escaped by a backslash; the backslash used for escaping is not part of the identifier.

Note: The length of an identifier is limited, and the limit depends on its use: 20 characters for an attribute name, 120 characters for an attribute value.



<attribute-name> ::=
<identifier>

<attribute-value> ::=
<identifier>

<identifier> ::=
<identifier-character>+

<identifier-character> ::=
<normal-character> | <escaped-character>

<function-character> ::=
"\\" | "=" | "," | "[" | "]" | "|"

<normal-character> ::=
<any-character>-<function-character>-"\n"

<escaped-character> ::=
"\\" (<any-character>-"\n")

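The identifier grammar above can be read as a small scanning routine. The following Python sketch is an illustration only: read_identifier is an invented name, escaped ends of line are assumed to have been removed beforehand, and the length limits from the note above are not enforced.

```python
FUNCTION_CHARACTERS = set("\\=,[]|")


def read_identifier(text: str, pos: int = 0) -> tuple[str, int]:
    """Read an identifier starting at pos; return it and the position after it."""
    chars = []
    while pos < len(text):
        ch = text[pos]
        if ch == "\\":
            # A backslash escapes the next character unless it is an end of line.
            if pos + 1 >= len(text) or text[pos + 1] == "\n":
                break
            chars.append(text[pos + 1])  # the backslash itself is dropped
            pos += 2
        elif ch in FUNCTION_CHARACTERS or ch == "\n":
            break  # an unescaped function character ends the identifier
        else:
            chars.append(ch)
            pos += 1
    return "".join(chars), pos
```

An empty result means that no identifier starts at the given position, which the grammar does not allow (<identifier> needs at least one character).
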
Node attribute definition

Each file begins with a header that defines the attributes which can appear in tree nodes. Each header line begins with the @ character, followed by a capital letter denoting a property of the attribute, then a space and the attribute name, for example "@P lemma".

Note: In the list of allowed values in the @L definition (<values>), the values cannot be repeated.



<definition-line> ::=
("@" <property> <view>? " " <attribute-name>
"\n") |
("@L" <view>? " " <attribute-name> "|" <values>
"\n")

<property> ::=
"K" | "P" | "O" | "N" | "V" | "W" | "H"

<view> ::=
"1" | "2" | "3"

<values> ::=
<attribute-value> ("|" <values>)?
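
A definition line can be split into its parts with a regular expression. The Python sketch below follows only the grammar above; parse_definition_line and the dictionary layout are invented for illustration, and the sketch assumes that names and values contain no backslash-escaped characters.

```python
import re

# "@", one property letter, an optional view digit, a space, then the rest.
DEFINITION = re.compile(r"@([KPONVWHL])([123])? (.+)")


def parse_definition_line(line: str) -> dict:
    """Parse one header line such as "@P lemma" or "@L2 name|a|b"."""
    match = DEFINITION.fullmatch(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a definition line: {line!r}")
    prop, view, rest = match.groups()
    if prop == "L":
        name, *values = rest.split("|")
        if not name or not values or "" in values:
            raise ValueError("@L needs a name and a non-empty list of values")
        if len(set(values)) != len(values):
            raise ValueError("values in an @L definition cannot be repeated")
    else:
        name, values = rest, []
    return {
        "property": prop,
        "view": int(view) if view else None,
        "name": name,
        "values": values,
    }
```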

Properties

K: Key attribute. The word "key" does not really mean anything except "this has no specific properties".
P: Positional attribute. All other attributes require that their name be written before their value in the data (e.g. ord=7). Positional attributes do not. The name of a positional attribute is inferred from the position of its value relative to the previous values (see the section "Node" below for details).
O: Obligatory attribute. Its value must be non-empty for every node (the empty string is the default value for all attributes). Thus the value must appear in the data.
L: One of a set of predefined values. Such an attribute can only have a value from a predefined list, or be empty.
H: Hiding attribute. Nodes having the string "hide" in this attribute are hidden in the tree viewers when hiding is turned on. Their subtrees are hidden as well.
N: Numeric attribute (the value is a non-negative integer) specifying the ordering of the nodes. Its value affects the x-coordinate of node positions in tree viewers. For backward compatibility, it also specifies the position of the word in the sentence on the status line if no @W attribute is provided. If no @N attribute is present, the tree is centered regardless of whether there is a @W attribute. At most one such attribute can be defined per FS file.
W: Another numeric attribute, denoting the word order. If both @N and @W attributes are defined, the former specifies the ordering of nodes in the tree view while the latter specifies the ordering of words in the linear view on the status line. This allows the user to reorder a non-projective tree into a projective order while the sentence is still displayed in its original order on the status line.
V: Value attribute. In some respect its value represents the whole node. In tree viewers it is used for the linear view of the sentence on the status line. At most one such attribute can be defined per FS file. It has two subtypes, @VH and @VA. @VH is the default (i.e. @V is the same as @VH) and means that the values of hidden nodes (see the @H attribute) are not displayed on the status line either. @VA means that even hidden nodes are shown on the status line.

More than one property can be defined for one attribute. The definition lines with all the properties need not follow each other in the file header. They must however fulfill the following constraints:

Only one @V attribute per file can be defined.
Only one @W attribute per file can be defined.
Only one @N attribute per file can be defined.
The @N property cannot be combined with other properties. Nevertheless, the @N attribute automatically has the properties @P and @O as well.
An attribute cannot be both @V and @L.
@L must be the last property defined for an attribute but it cannot be the only property of that attribute.
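
The constraints above can be checked once the whole header has been read. The following Python sketch is an illustration under assumptions: check_header is an invented name, and the header is assumed to be already reduced to (property letter, attribute name) pairs in the order of the definition lines.

```python
from collections import defaultdict


def check_header(definitions: list[tuple[str, str]]) -> list[str]:
    """Return a list of constraint violations (empty if the header is fine)."""
    errors = []
    props = defaultdict(list)  # attribute name -> property letters in file order
    for prop, name in definitions:
        props[name].append(prop)

    # Only one @V, one @W and one @N attribute per file.
    for unique in ("V", "W", "N"):
        owners = [name for name, letters in props.items() if unique in letters]
        if len(owners) > 1:
            errors.append(f"more than one @{unique} attribute: {', '.join(owners)}")

    for name, letters in props.items():
        if "N" in letters and set(letters) != {"N"}:
            errors.append(f"{name}: @N cannot be combined with other properties")
        if "V" in letters and "L" in letters:
            errors.append(f"{name}: cannot be both @V and @L")
        if "L" in letters:
            if letters[-1] != "L":
                errors.append(f"{name}: @L must be the last property defined")
            if set(letters) == {"L"}:
                errors.append(f"{name}: @L cannot be the only property")
    return errors
```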

View

A view mode can optionally be defined for an attribute. It can require that the value of the attribute always be highlighted in the tree editor.



1: ATTR_SHADOW
2: ATTR_HILITE
3: ATTR_XHILITE

Example

The definition of node attributes in the Prague Dependency Treebank can serve as an example.

Tree

Trees are written in the usual parenthesized notation: the description of an inner node is followed by a parenthesized, comma-separated list of its children (or their subtrees). The children of each node must be ordered by the value of their numeric attribute @N, if any. Breaking this rule can cause the tree editor to display the tree incorrectly (projectivity is involved; the numeric attribute is assumed to hold the index of the word in the sentence word order).



<tree> ::=
<node> ("(" <children> ")")?

<children> ::=
<tree> ("," <children>)?
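
The two rules above map directly onto a recursive-descent parser. The Python sketch below is illustrative only: the names read_node_text and parse_tree are invented, each node is kept as its raw "[...]" text rather than being decoded, and escaped ends of line are assumed to have been removed already.

```python
def read_node_text(text: str, pos: int) -> tuple[str, int]:
    """Return the raw node text ("[...]" or "[...]|[...]") starting at pos."""
    start = pos
    while True:
        if pos >= len(text) or text[pos] != "[":
            raise ValueError(f"expected '[' at offset {pos}")
        pos += 1
        while pos < len(text) and text[pos] != "]":
            pos += 2 if text[pos] == "\\" else 1  # skip escaped characters
        if pos >= len(text):
            raise ValueError("unterminated attribute set")
        pos += 1  # skip the closing "]"
        if pos < len(text) and text[pos] == "|":
            pos += 1  # another attribute set of the same node follows
            continue
        return text[start:pos], pos


def parse_tree(text: str, pos: int = 0):
    """Parse <tree>; return ((node_text, children), position after the tree)."""
    node, pos = read_node_text(text, pos)
    children = []
    if pos < len(text) and text[pos] == "(":
        pos += 1
        while True:
            child, pos = parse_tree(text, pos)
            children.append(child)
            if pos < len(text) and text[pos] == ",":
                pos += 1
                continue
            break
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"expected ')' at offset {pos}")
        pos += 1
    return (node, children), pos
```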

Node

Besides the pure syntax, the relations between the <attributes> element and the definitions of the respective attributes in the file header must also be checked. The constraints that follow from these relations are described below.



<node> ::=
<attribute-set> ("|" <node>)?

<attribute-set> ::=
"[" <attributes>? "]"

<attributes> ::=
<attribute> ("," <attributes>)?

<attribute> ::=
(<attribute-name> "=")? <values>

<values> ::=
<attribute-value> ("|" <values>)?

The element <attributes> must fulfill the following constraints (based on the particular definition of attributes in the file header):

The attribute name is required for non-positional attributes.
If the attribute name is not present, the attribute whose value is being read must be inferred. It is the first positional attribute whose definition in the header follows the definition of the most recently read attribute (positional or not).
The identifier in the <attribute-name> element must equal the name of an attribute defined in the header.
No attribute can be read more than once. This rule might be broken when the <attribute> element with the same <attribute-name> appears twice or if the attribute name is not mentioned but the last read attribute's definition immediately precedes the definition of an attribute whose value has already been read.
The identifier representing a value of a numeric attribute can contain only digits.
The value of a @L attribute must be one of the predefined values from the definition of the attribute.
The number of attributes in different sets can vary. At the end of the set, however, the program must check that the values of all obligatory attributes have been read.
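
The name-resolution rule and the checks listed above can be sketched in Python. Everything here is an assumption made for illustration: resolve_attributes is an invented name, the header is given as one (name, set of property letters) pair per attribute in definition order, the attributes of one node are given as (name or None, value) pairs, and the first unnamed value of a node is matched against the first positional attribute of the header. The check against the predefined values of @L attributes is left out.

```python
def resolve_attributes(header, items):
    """Assign each value of one attribute set to an attribute name."""
    order = [name for name, _ in header]
    props = dict(header)
    values = {}
    last = -1  # header index of the most recently read attribute
    for name, value in items:
        if name is None:
            # First positional attribute defined after the last read one.
            candidates = [n for n in order[last + 1:] if "P" in props[n]]
            if not candidates:
                raise ValueError(f"no positional attribute left for {value!r}")
            name = candidates[0]
        elif name not in props:
            raise ValueError(f"attribute {name!r} is not defined in the header")
        if name in values:
            raise ValueError(f"attribute {name!r} read more than once")
        if props[name] & {"N", "W"} and not (value.isascii() and value.isdigit()):
            raise ValueError(f"numeric attribute {name!r} has value {value!r}")
        values[name] = value
        last = order.index(name)
    missing = [n for n in order if "O" in props[n] and not values.get(n)]
    if missing:
        raise ValueError("missing obligatory attributes: " + ", ".join(missing))
    return values


# Illustrative header; the attribute names are examples, not a PDT definition.
header = [("lemma", {"P"}), ("tag", {"P"}), ("afun", {"K"}), ("ord", {"N", "P", "O"})]
node = [(None, "dog"), (None, "NN"), ("afun", "Sb"), (None, "3")]
resolved = resolve_attributes(header, node)
```

In this usage example the two leading unnamed values go to lemma and tag, and the unnamed value after afun goes to ord, because ord is the first positional attribute defined after afun.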

Example

Here is an example of a whole FS file with some trees.
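
To show how the pieces of the grammar fit together, the following Python sketch assembles a tiny file. It is a made-up illustration, not a PDT file: the attribute names form and ord, the two-word sentence and all function names are invented, every attribute is written with its name, and children are sorted by the assumed @N attribute ord.

```python
FUNCTION_CHARACTERS = "\\=,[]|"


def escape(text: str) -> str:
    """Put a backslash in front of every function character."""
    return "".join("\\" + ch if ch in FUNCTION_CHARACTERS else ch for ch in text)


def write_tree(tree) -> str:
    """Serialize (attributes, children) in the parenthesized notation."""
    attributes, children = tree
    node = "[" + ",".join(
        f"{escape(name)}={escape(value)}" for name, value in attributes.items()
    ) + "]"
    if not children:
        return node
    # Children must be ordered by the value of the numeric @N attribute.
    ordered = sorted(children, key=lambda child: int(child[0]["ord"]))
    return node + "(" + ",".join(write_tree(child) for child in ordered) + ")"


def write_fs_file(header_lines, trees) -> str:
    """Header lines, one blank line, then one tree per line."""
    return "\n".join(header_lines) + "\n\n" + "".join(
        write_tree(tree) + "\n" for tree in trees
    )


header = ["@V form", "@N ord"]
tree = ({"form": "sleeps", "ord": "2"}, [({"form": "Dog", "ord": "1"}, [])])
fs_text = write_fs_file(header, [tree])
```

Printing fs_text gives the two definition lines, an empty line, and the single tree line [form=sleeps,ord=2]([form=Dog,ord=1]).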

Michal Křen is the author of the FS file format.

Daniel Zeman wrote this description in March 1998, based on Michal's source code. It was translated into English in November 2000.