FS file format description


Part Two - Attributes specific to PDT

The .fs files serve for encoding sentence structures in natural language. Each such file contains a sequence of trees whose nodes correspond to words of the sentence. Each node (word) is described by a set of attributes.

This file describes a standard that is not really part of the FS format. In fact, the node attributes can be defined independently for every FS file. Nevertheless, the PDT files usually share the same node attributes, and these are described here. If you need to learn the general FS syntax, please refer to its own description.

Please note that the root of a tree is an exception to the rule that each node represents one word (token) of the sentence. The root does not correspond to any word but carries information about the whole sentence. Some attributes thus have a special interpretation in the root.

Not every attribute described below appears in every PDT FS file. Some attributes may be defined in the header of a file without appearing in its data. Others may be missing from the headers of some files but defined in other files. This holds especially for attributes bound to the tectogrammatical layer of annotation and for attributes used for technical reasons.

There are several methods to leave the value of an attribute empty or undefined:

Empty string. Can be entered in two ways:

The attribute is present but the value is empty. In the following example the value of the third attribute (tag) is empty: [dělá,dělat,,1]. The attribute name must not be given in this case, so [form=dělá,lemma=dělat,tag=,ord=1] is incorrect!
The attribute is skipped by naming the next attribute explicitly: [dělá,dělat,ord=1].

A sole dash is also interpreted as an empty string unless explicitly stated otherwise. This convention was used predominantly in older versions of the treebank. Even a morphological tag can have the value -, although this is not mentioned in the description of the tag system. Example: [dělá,dělat,-,1].
Attributes with a list of acceptable values cannot have a dash as their value unless the dash is a member of the value list. They can, however, be set to the empty string. Their value lists often contain two values with a special meaning, denoting different kinds of empty values:

The value ??? usually means that the value is unknown. It is not even known whether the attribute applies to the given word. This is the default value for attributes that have it in their value list.
The value NA usually means that the value has been set (is known) but is undefined because the attribute is not relevant for the given word.
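
The conventions above can be sketched in Python. This is a simplified illustration, not a full FS parser: it assumes the attribute order form, lemma, tag, ord used in the examples, assumes that a positional value following a named attribute fills the next attribute in that order, and ignores escaping, value alternatives and tree structure. The names `ATTRS`, `parse_node` and `classify` are hypothetical.

```python
# Attribute order assumed from the examples above (hypothetical; a real file
# defines the order in its header).
ATTRS = ["form", "lemma", "tag", "ord"]


def parse_node(text, attrs=ATTRS, dash_is_empty=True):
    """Parse a simplified node such as '[dělá,dělat,,1]' into a dict."""
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("node must be enclosed in square brackets")
    node = {name: "" for name in attrs}
    position = 0  # index of the attribute the next positional value fills
    for item in text[1:-1].split(","):
        if "=" in item:
            # Named attribute: jump to it, then continue positionally after it.
            name, value = item.split("=", 1)
            if value == "":
                # 'tag=' is described as incorrect: omit the name instead.
                raise ValueError(f"named attribute {name!r} has an empty value")
            position = attrs.index(name)
        else:
            value = item
        if dash_is_empty and value == "-":
            value = ""  # a sole dash is read as the empty string
        node[attrs[position]] = value
        position += 1
    return node


def classify(value):
    """Tell apart the kinds of empty values used in value lists."""
    if value == "???":
        return "unknown"
    if value == "NA":
        return "not applicable"
    if value == "":
        return "empty"
    return "set"
```

With these assumptions, `parse_node("[dělá,dělat,,1]")` and `parse_node("[dělá,dělat,ord=1]")` yield the same dictionary, with an empty tag and ord set to "1". For attributes with a value list that does not contain the dash, a caller would pass `dash_is_empty=False` and validate the value against the list instead.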

Morphology

`form`

@P form
@O form

Corresponds to the CSTS elements <f>, <d>, and <fadd>.

In most cases the value of this attribute is identical to the word form as it appeared in the original text, including the upper/lowercase distinction. It differs only when a normalization step has been performed:

A non-integer number with decimal comma is changed to use decimal point.
Forms of the words aby and kdyby are split into two nodes. One of them remains a form of the word aby or kdyby respectively, the other is a conditional form of the verb být in the corresponding form (e.g. by, bychom).
A joint form of a preposition and a pronoun (e.g. naň = na něj, oč = o co) is split into two nodes. One of them bears the preposition as its form, the other bears the pronoun.
A word ending in the morpheme -s, which abbreviates the 2nd person singular of the verb být (to be), is transformed into two nodes: one has the word without -s as its form, the other has the form jsi (you are).
A word ending in the morpheme -ť, which abbreviates the conjunction neboť (because), is transformed into two nodes: one has the word without -ť as its form, the other has the form neboť. Joint forms of this type are rather archaic.
Misspelled words are transformed to the correct spelling. Joint forms are also split according to the above rules.
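
Two of these normalization steps can be sketched in Python. This is only an illustration under stated assumptions: the regular expressions, the function name `normalize`, the order of the resulting nodes and the exact form kept by the conjunction node ("aby" or "kdyby") are assumptions, not something the treebank tools are known to implement this way.

```python
import re

# A number written with a decimal comma, e.g. "3,14" (assumed pattern).
DECIMAL_COMMA = re.compile(r"^(\d+),(\d+)$")
# A form of "aby" or "kdyby": the conjunction stem plus a conditional form
# of the verb "být" (by, bych, bys, bychom, byste).
CONDITIONAL = re.compile(r"^(a|kdy)(by(?:ch|s|chom|ste)?)$", re.IGNORECASE)


def normalize(origf):
    """Return the list of form values derived from one original word form."""
    match = DECIMAL_COMMA.match(origf)
    if match:
        # Decimal comma becomes decimal point; still a single node.
        return [f"{match.group(1)}.{match.group(2)}"]
    match = CONDITIONAL.match(origf)
    if match:
        # "abychom" -> "aby" + "bychom": two nodes instead of one.
        return [match.group(1) + "by", match.group(2)]
    return [origf]
```

The other rules (joint preposition and pronoun forms, the endings -s and -ť, spelling correction) need lexical knowledge and are left out of the sketch.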

The root has form=#n where n is the number of the sentence in this file. Sometimes this value can be non-numeric (e.g. form=#22A) if a sentence has been split into two or more sentences.
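
A minimal Python sketch of reading the sentence number from the root form follows. It assumes, based only on the example #22A, that a non-numeric value still starts with digits and carries a trailing suffix; the name `sentence_label` is hypothetical.

```python
import re

# '#', a run of digits, then an optional suffix such as 'A' (assumed shape).
ROOT_FORM = re.compile(r"^#(\d+)(.*)$")


def sentence_label(root_form):
    """Split a root form like '#22A' into (22, 'A')."""
    match = ROOT_FORM.match(root_form)
    if match is None:
        raise ValueError(f"not a root form: {root_form!r}")
    return int(match.group(1)), match.group(2)
```

Returning the number and the suffix separately lets the parts of a split sentence sort next to each other.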

`origf`

@V origf
@P origf

Corresponds to the CSTS element <w>.

The original word form as it appeared in the sentence, before normalization if any. If the word was misspelled, it remains misspelled in this attribute but is corrected in form.

The root has origf=#n where n is the number of the sentence in this file. Sometimes this value can be non-numeric (e.g. origf=#22A) if a sentence has been split into two or more sentences.

`lemma`

@P lemma
@O lemma

Corresponds to the CSTS element <l>.

The lemma uniquely identifies a word as a lexical unit. It is represented as a string of letters and other characters which in most cases corresponds to the base form of the word, which is also used as the dictionary entry. The following forms are considered base forms:

Part of speech	Base form
noun	nominative singular (if singular does not exist, plural)
adjective	nominative masculine singular, affirmative, positive
pronoun	nominative masculine singular (if case, gender and number are relevant); e.g. there are only three personal pronouns: já (I), ty (you), on (he).
numeral	nominative masculine singular (if case, gender and number are relevant)
verb	infinitive
adverb	positive affirmative (if relevant)
preposition	without vocalization (e.g. v, not ve)
other	original word form

Orthographic variants are merged under one lemma if they differ in orthography only and not in sense as well.

A lemma is case-sensitive, so proper names can be identified even if they are identical to common nouns. However, the case of the lemma does not reflect the case of the word form in the text: if a word is capitalized only because it appears at the beginning of a sentence or a heading, its lemma is all lowercase.

A sense identifier in the form of a dash followed by one or more decimal digits (e.g. -2) can be appended to the lemma string. It distinguishes lexical units that would otherwise be indistinguishable (e.g. stát-1 = state, country, stát-2 = to become, to happen, stát-3 = to stand, stát-4 = to cost). The sense distinction is shallow, as it is motivated mostly by different morphological or syntactic properties of the distinguished lemmas.

The string described up to this point can be enriched by comments, which are attached to the lemma by the underscore character. A parenthesized comment preceded by a circumflex contains a short description of the meaning (in Czech). It often accompanies lemmas with distinguished senses (e.g. stát-1_^(státní_útvar)). A comment preceded by a semicolon encodes lexical and stylistic categories, e.g. G in Grónsko_;G means that Grónsko (Greenland) is a geographical name.
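
The lemma structure described above (base string, optional sense number, optional comments) can be taken apart with a short Python sketch. The regular expressions and the name `parse_lemma` are assumptions made for illustration. The sketch assumes that the base string itself contains no underscore and that descriptions contain no closing parenthesis.

```python
import re

# Shortest possible base, optional '-<digits>' sense, then '_'-attached comments.
LEMMA = re.compile(r"^(?P<base>.+?)(?:-(?P<sense>\d+))?(?P<comments>(?:_.*)?)$")
DESCRIPTION = re.compile(r"_\^\(([^)]*)\)")  # _^(short description in Czech)
CATEGORY = re.compile(r"_;([^_]+)")          # _;G and similar category codes


def parse_lemma(lemma):
    """Split a lemma string into base, sense number and comments."""
    match = LEMMA.match(lemma)
    if match is None:
        raise ValueError(f"cannot parse lemma: {lemma!r}")
    sense = match.group("sense")
    comments = match.group("comments")
    return {
        "base": match.group("base"),
        "sense": int(sense) if sense is not None else None,
        "descriptions": DESCRIPTION.findall(comments),
        "categories": CATEGORY.findall(comments),
    }
```

Note that a description may itself contain underscores (státní_útvar), which is why the sketch does not simply split the lemma on every underscore.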

The lemma of the root is #.

`lemmaMM_source`



@P lemmaMM_source

Corresponds to the CSTS elements <MMl src="source">.

This set of attributes is automatically created during conversion
from CSTS to FS.

`lemmaMD_source`


@P lemmaMD_source

Corresponds to the CSTS elements <MDl src="source">.

This set of attributes is automatically created during conversion
from CSTS to FS.

`tag`

@P tag
@O tag

Corresponds to the CSTS element <t>.

The attribute tag contains the part of speech and the
morphological tag. The Czech tag system has roughly 3000
theoretically possible tags; one to two thousand of them actually
appear in the PDT. There are two possible views of each tag:
compact and positional. There is a one-to-one mapping
between the two systems, so it is up to the user which one to
use. A tag is positional if and only if it is a string of 15
characters (English letters, digits, dashes and other special
characters such as dots or exclamation marks). A compact tag has
variable length but is always shorter than 15 characters. It contains
only uppercase English letters, digits, and sometimes a dash. The
compact system is older. Its tags may be more legible for an
experienced user because they encode only the properties relevant for
the given part of speech. Nevertheless, they are difficult to parse
automatically because the meaning of a character depends on what
precedes it: there are many rules of the kind "if the characters read
so far are such and such, the next character encodes the gender,
otherwise it encodes the tense". In a positional tag, on the other
hand, the index (position) of a character determines which
morphological property it encodes. The price is that the tags are long
and contain long sequences of dashes for categories not relevant for
the given word.
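
The length criterion can be expressed directly in code. The Python sketch below is an illustration only: the function names are hypothetical, and the tag NNFS1-----A---- used in the usage note is an illustrative 15-character value, not taken from this description.

```python
POSITIONAL_LENGTH = 15  # a tag is positional iff it has exactly 15 characters


def is_positional(tag):
    """Distinguish positional tags from (shorter) compact tags by length."""
    return len(tag) == POSITIONAL_LENGTH


def relevant_positions(tag):
    """Map 1-based positions to their characters, skipping dashes."""
    if not is_positional(tag):
        raise ValueError(f"not a positional tag: {tag!r}")
    return {index: char for index, char in enumerate(tag, start=1) if char != "-"}
```

For the illustrative tag, `relevant_positions("NNFS1-----A----")` keeps positions 1 to 5 and 11 and drops the dashes. Which morphological category each position encodes is defined in the description of the positional tag system and is deliberately not hard-coded here.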

See the description of the compact tag system (available as a PDF
file and a PostScript file) and the description of the positional tag
system (detailed description available as a PDF file and a PostScript
file; quick reference available as an HTML file and a PDF file).

As with any FS attribute, the value can be a set of values (tags)
separated by the vertical bar character (|). If the
lemma of this node contains several lemma alternatives,
the tag set must use the special tag -- to separate the tag
set for lemma i from the tag set for lemma i+1.
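
A minimal Python sketch of pairing lemma alternatives with their tag sets follows. It assumes that lemma alternatives are also separated by the vertical bar and that no escaping is involved. The function name and the tag values T1, T2, T3 are hypothetical placeholders.

```python
def group_tags_by_lemma(lemma_value, tag_value):
    """Pair each lemma alternative with the tags listed for it."""
    lemmas = lemma_value.split("|")
    groups = [[]]
    for tag in tag_value.split("|"):
        if tag == "--":
            groups.append([])  # separator: the following tags belong to the next lemma
        else:
            groups[-1].append(tag)
    if len(groups) != len(lemmas):
        raise ValueError("number of tag groups does not match number of lemmas")
    return list(zip(lemmas, groups))
```

For example, the lemma value `stát-1|stát-2` with the tag value `T1|T2|--|T3` gives T1 and T2 to stát-1 and T3 to stát-2.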

The root has the tag ZSB.

`wt`

@P wt

Corresponds to the attribute w of the
CSTS element <t>.

`tagMM_source`


@P tagMM_source

Corresponds to the CSTS element <MMt src="source">.

This set of attributes is automatically created during conversion
from CSTS to FS.

`tagMD_source`


@P tagMD_source

Corresponds to the CSTS element <MDt src="source">.

This set of attributes is automatically created during conversion
from CSTS to FS.

`wMDl_source`, `wMDt_source`

@P wMDl_source
@P wMDt_source

Correspond to the attribute w of the
CSTS elements <MDl src="source"> and
<MDt src="source">.

This set of attributes is automatically created during conversion
from CSTS to FS.

Surface syntax

A comprehensive view of this part of annotation can be found in the
manual for the analytical layer
annotators (in Czech).

`afun`

@P afun
@O afun
@L2 afun|---|Pred|Pnom|AuxV|Sb|Obj|Atr|Adv|AtrAdv|AdvAtr\
|Coord|AtrObj|ObjAtr|AtrAtr|AuxT|AuxR|AuxP|Apos|ExD|AuxC|Atv|AtvV\
|AuxO|AuxZ|AuxY|AuxG|AuxK|AuxX|AuxS|Pred_Co|Pnom_Co|AuxV_Co|Sb_Co\
|Obj_Co|Atr_Co|Adv_Co|AtrAdv_Co|AdvAtr_Co|Coord_Co|AtrObj_Co\
|ObjAtr_Co|AtrAtr_Co|AuxT_Co|AuxR_Co|AuxP_Co|Apos_Co|ExD_Co|AuxC_Co\
|Atv_Co|AtvV_Co|AuxO_Co|AuxZ_Co|AuxY_Co|Pred_Ap|Pnom_Ap|AuxV_Ap|Sb_Ap\
|Obj_Ap|Atr_Ap|Adv_Ap|AtrAdv_Ap|AdvAtr_Ap|Coord_Ap|AtrObj_Ap\
|ObjAtr_Ap|AtrAtr_Ap|AuxT_Ap|AuxR_Ap|AuxP_Ap|Apos_Ap|ExD_Ap|AuxC_Ap\
|Atv_Ap|AtvV_Ap|AuxO_Ap|AuxZ_Ap|AuxY_Ap|Pred_Pa|Pnom_Pa|AuxV_Pa|Sb_Pa\
|Obj_Pa|Atr_Pa|Adv_Pa|AtrAdv_Pa|AdvAtr_Pa|Coord_Pa|AtrObj_Pa\
|ObjAtr_Pa|AtrAtr_Pa|AuxT_Pa|AuxR_Pa|AuxP_Pa|Apos_Pa|ExD_Pa|AuxC_Pa\
|Atv_Pa|AtvV_Pa|AuxO_Pa|AuxZ_Pa|AuxY_Pa|???

Corresponds to the CSTS element <A>.

Analytical function (surface-syntactic tag). It denotes the type of
dependency between the governing and the dependent node. Besides typical
syntactic categories such as subject, predicate, object, attribute or
adverbial, it also contains many auxiliary relations and distinguishes
coordinations and appositive modifiers from real dependencies.
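
Many values in the list above consist of a base function and one of the suffixes _Co, _Ap or _Pa. The Python sketch below separates the two parts. The name `split_afun` is hypothetical. Reading _Co as a member of a coordination and _Ap as a member of an apposition follows the paragraph above, whereas reading _Pa as a parenthesis is an assumption not stated here.

```python
# Suffixes seen in the value list; their interpretation is partly assumed.
SUFFIXES = ("Co", "Ap", "Pa")


def split_afun(afun):
    """Split e.g. 'Sb_Co' into ('Sb', 'Co'); values without suffix get None."""
    base, separator, suffix = afun.rpartition("_")
    if separator and suffix in SUFFIXES:
        return base, suffix
    return afun, None
```

The special values --- and ??? contain no underscore and are returned unchanged.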

See the description of the analytical function system
(available as a PDF file and a PostScript file).

The root has the afun AuxS.

`afunMD_source`


@P afunMD_source

Corresponds to the CSTS element <MDA src="source">.

This set of attributes is automatically created during conversion
from CSTS to FS.


`ord`

@N ord

Corresponds to the CSTS element <r>.

Index of the word in the sentence (original word order). The root
has the index of 0.

`govMD_source`


@P govMD_source

Corresponds to the CSTS element <MDg src="source">.

This set of attributes is automatically created during conversion
from CSTS to FS.


Deep syntax

A number of attributes have been added to the FS files on the
tectogrammatical layer (see the header
below). For their description, please refer to the PostScript description or directly to the manual for the tectogrammatical
annotators.



Technical stuff

`ID1`

@P ID1

Corresponds to the attribute id of the CSTS element
<s>.

This attribute is non-empty only
for root nodes. Its value is then the sentence identification within
the Czech National Corpus.



`ID2`

@P ID2

Corresponds to a part of the attribute id of the CSTS
element <s>.

This attribute appears only in older files and is non-empty only
for root nodes. Its value is then the name of the file the tree
appears in.

`nospace`

@P nospace

Corresponds to the CSTS element <D>.

If the value of this attribute is 1, no space followed the
original form in the original data.
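
Together with ord, this attribute allows an approximate reconstruction of the sentence text. The Python sketch below assumes that the nodes of one tree are available as dictionaries of string values. The function name `detokenize` and the choice of the form attribute as the default are assumptions, and nodes created by splitting one original word are not treated specially.

```python
def detokenize(nodes, attr="form"):
    """Join the words of one tree in surface order, honouring nospace."""
    # The root has ord 0 and does not correspond to any word, so skip it.
    words = sorted(
        (node for node in nodes if int(node["ord"]) > 0),
        key=lambda node: int(node["ord"]),
    )
    parts = []
    for node in words:
        parts.append(node[attr])
        if node.get("nospace") != "1":
            parts.append(" ")  # a space followed this word in the original
    return "".join(parts).rstrip()
```

Passing `attr="origf"` would rebuild the text from the original word forms instead, in files that define that attribute.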

`origfkind`

@P origfkind

Corresponds to the attribute kind of the CSTS element <w>.

`formtype`

@P formtype

Corresponds to the attribute case of the CSTS element
<f> or to the attribute type of the
CSTS element <d>.

`cstslang`

@P cstslang

Corresponds to the attribute lang of the CSTS
top-level element <csts>.

`cstssource`

@P cstssource

Corresponds to the CSTS element <source>.

`cstsmarkup`

@P cstsmarkup

Corresponds to the CSTS element <markup> in
case it is a subelement of CSTS element <h>.

This attribute is non-empty only for the root node of the first
tree in a file. It stores the original SGML form of all subelements of CSTS element
<markup>, as stored in the CSTS header <h>.


`chap`

@P chap

Corresponds to the CSTS element <c>.

This attribute is non-empty only
for root nodes. If its value is 1, the sentence represented in the
tree is the first sentence of a chapter or section.

`doc`

@P doc

Corresponds to the attribute file of the CSTS element <doc>.

This attribute is non-empty only for root nodes. If its value is
non-empty, the sentence represented in the tree
is the first sentence of the document and the value is the original
file name of the document.

`docid`

@P docid

Corresponds to the attribute id of the CSTS element <doc>.

`docmarkup`

@P docmarkup

Corresponds to the CSTS element <markup> in
case the element appears in the document header <a>.

This attribute is non-empty only for root nodes and only if the
doc attribute is also non-empty. It stores the original
SGML form of all subelements of CSTS element
<markup> of the document header <a>.


`docprolog`

@P docprolog

Corresponds to the CSTS element <a>.

This attribute is non-empty only for root nodes and only if the doc
attribute is also non-empty. It stores the original SGML form of all
subelements of CSTS element <a> except <markup>.

`gappre`, `gappost`

@P gappre
@P gappost

These attributes correspond to the CSTS elements <i>,
<idioms>,
<idiom> and <iref>.

These attributes store the original SGML form of the above-named
CSTS elements appearing just before (in the case of gappre) or just
after (in the case of gappost) all other elements which form the node.



A typical header of the Prague Dependency Treebank's FS files

@P lemma
@O lemma
@P tag
@O tag
@P form
@O form
@P afun
@O afun
@L1 afun|---|Pred|Pnom|AuxV|Sb|Obj|Atr|Adv|AtrAdv|AdvAtr|Coord|AtrObj|ObjAtr|AtrAtr|AuxT|AuxR|AuxP|Apos|ExD|AuxC|Atv|AtvV|AuxO|AuxZ|AuxY|AuxG|AuxK|AuxX|AuxS|Pred_Co|Pnom_Co|AuxV_Co|Sb_Co|Obj_Co|Atr_Co|Adv_Co|AtrAdv_Co|AdvAtr_Co|Coord_Co|AtrObj_Co|ObjAtr_Co|AtrAtr_Co|AuxT_Co|AuxR_Co|AuxP_Co|Apos_Co|ExD_Co|AuxC_Co|Atv_Co|AtvV_Co|AuxO_Co|AuxZ_Co|AuxY_Co|AuxG_Co|AuxK_Co|AuxX_Co|Pred_Ap|Pnom_Ap|AuxV_Ap|Sb_Ap|Obj_Ap|Atr_Ap|Adv_Ap|AtrAdv_Ap|AdvAtr_Ap|Coord_Ap|AtrObj_Ap|ObjAtr_Ap|AtrAtr_Ap|AuxT_Ap|AuxR_Ap|AuxP_Ap|Apos_Ap|ExD_Ap|AuxC_Ap|Atv_Ap|AtvV_Ap|AuxO_Ap|AuxZ_Ap|AuxY_Ap|AuxG_Ap|AuxK_Ap|AuxX_Ap|Pred_Pa|Pnom_Pa|AuxV_Pa|Sb_Pa|Obj_Pa|Atr_Pa|Adv_Pa|AtrAdv_Pa|AdvAtr_Pa|Coord_Pa|AtrObj_Pa|ObjAtr_Pa|AtrAtr_Pa|AuxT_Pa|AuxR_Pa|AuxP_Pa|Apos_Pa|ExD_Pa|AuxC_Pa|Atv_Pa|AtvV_Pa|AuxO_Pa|AuxZ_Pa|AuxY_Pa|AuxG_Pa|AuxK_Pa|AuxX_Pa|Generated|NA|???
@P ID1
@P ID2
@VA origf
@P origf
@P afunprev
@P semPOS
@P tagauto
@P lemauto
@N ord
@P dord
@W sentord
@P govTR
@P nospace
@P root
@P ending
@P punct
@P alltags
@P wt
@P origfkind
@P formtype
@P gappost
@P gappre
@P cstslang
@P cstssource
@P cstsmarkup
@P chap
@P doc
@P docid
@P docmarkup
@P docprolog
@P1 warning
@P3 err1
@P3 err2
@P reserve1
@P reserve2
@P reserve3
@P reserve4
@P reserve5
@P wMDt_a
@P wMDl_a
@P wMDt_b
@P wMDl_b
@P tagMD_a
@P lemmaMD_a
@P tagMD_b
@P lemmaMD_b
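
A header like the one above can be read into a simple lookup table. The Python sketch below only collects the declarations and value lists per attribute and does not interpret the declaration letters, whose meaning belongs to the general FS syntax. The name `parse_header` is hypothetical, and treating a trailing backslash as a line continuation is an assumption based on the afun value list shown earlier in this description.

```python
import re

# '@', declaration letters, optional digits, whitespace, then the rest.
HEADER_LINE = re.compile(r"^@([A-Z]+)(\d*)\s+(.+)$")


def parse_header(lines):
    """Collect declarations and listed values for each attribute name."""
    attributes = {}
    pending = ""
    for raw in lines:
        line = pending + raw.rstrip("\n")
        if line.endswith("\\"):
            pending = line[:-1]  # continued on the next line
            continue
        pending = ""
        match = HEADER_LINE.match(line)
        if match is None:
            continue  # not a header declaration
        kind, number, rest = match.groups()
        name, *values = rest.strip().split("|")
        entry = attributes.setdefault(name, {"declarations": [], "values": []})
        entry["declarations"].append("@" + kind + number)
        entry["values"].extend(values)
    return attributes
```

Applied to the header above, the entry for afun would hold the declarations @P, @O and @L1 together with the full list of acceptable values, which can then be used to validate afun values in the data.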



Daniel Zeman
