FS file format description


Part Two - Attributes specific to PDT

The .fs files serve for encoding sentence structures in natural language. Each such file contains a sequence of trees whose nodes correspond to words of the sentence. Each node (word) is described by a set of attributes.

This document describes a convention that is not strictly part of the FS format: node attributes can be defined independently for every FS file. Nevertheless, PDT files usually share the same node attributes, and these are described here. For the general FS syntax, please refer to its own description.

Please note that the root of a tree is an exception to the rule that each node represents one word/token of the sentence. The root does not correspond to any word but carries information about the whole sentence. Some attributes thus have a special interpretation in the root.

Not every attribute described below appears in every PDT FS file. Some may be defined in the header of a file but never appear in its data. Others may be missing from the headers of some files while being defined in others. This holds especially for attributes bound to the tectogrammatical layer of annotation and for attributes used for technical reasons.

There are several methods to leave the value of an attribute empty or undefined:

  1. Empty string. Can be entered in two ways:
  2. A sole dash is also interpreted as an empty string unless explicitly stated otherwise. This was used predominantly in older versions of the treebank. Even a morphological tag can have the value -, although this is not mentioned in the description of the tag system. Example: [dělat,dělá,-,1].
  3. Attributes with a list of acceptable values cannot have a dash as their value unless it is a member of the value list. They can, however, be set to the empty string. Their value lists often contain two values with a special meaning, each denoting a different kind of empty value:

Morphology

form

@P form
@O form

Corresponds to the CSTS elements <f>, <d>, and <fadd>.

In most cases the value of this attribute is identical to the word form as it appeared in the original text, including the upper/lowercase distinction. It differs only when a normalization step has been performed:

The root has form=#n, where n is the number of the sentence in this file. The value can occasionally be non-numeric (e.g. form=#22A) if a sentence has been split into two or more sentences.
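The root convention above could be decoded as follows. This is a minimal Python sketch: the function name parse_root_form is hypothetical, and the assumption that the non-numeric part always follows the digits is drawn only from the #22A example.

```python
import re

def parse_root_form(form):
    """Split a root form such as '#22' or '#22A' into (number, suffix)."""
    match = re.fullmatch(r"#(\d+)(.*)", form)
    if match is None:
        raise ValueError(f"not a root form: {form!r}")
    # The suffix is empty for an ordinary sentence number.
    return int(match.group(1)), match.group(2)
```

The same function would apply to the origf attribute of the root, which follows the same convention.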

origf

@V origf
@P origf

Corresponds to the CSTS element <w>.

The original word form as it appeared in the sentence, before normalization if any. If the word was misspelled, it remains misspelled in this attribute but is corrected in form.

The root has origf=#n, where n is the number of the sentence in this file. The value can occasionally be non-numeric (e.g. origf=#22A) if a sentence has been split into two or more sentences.

lemma

@P lemma
@O lemma

Corresponds to the CSTS element <l>.

The lemma uniquely identifies a word as a lexical unit. It is represented as a string of letters and other characters which in most cases corresponds to the base form of the word, as used for dictionary entries. The following forms are considered base forms:

Part of speech   Base form

noun             nominative singular (if singular does not exist, plural)
adjective        nominative masculine singular, affirmative, positive
pronoun          nominative masculine singular (if case, gender and number are relevant); e.g. there are only three personal pronouns: já (I), ty (you), on (he)
numeral          nominative masculine singular (if case, gender and number are relevant)
verb             infinitive
adverb           positive affirmative (if relevant)
preposition      without vocalization (e.g. v, not ve)
other            original word form

Orthographic variants are united if they differ in orthography only and not in sense as well.

Lemmas are case sensitive, so proper names can be identified even when they are identical to common nouns. The case of the lemma does not, however, reflect the case of the word form in the text: if a word is capitalized only because it appears at the beginning of a sentence or in a heading, its lemma is all lowercase.

A sense identifier in the form of a dash and one or more decimal digits (e.g. -2) can be added to the lemma string. It distinguishes lexical units that would otherwise be indistinguishable (e.g. stát-1 = state, country, stát-2 = to become, to happen, stát-3 = to stand, stát-4 = to cost). The sense distinction is shallow, as it is motivated mostly by different morphological or syntactic properties of the distinguished lemmas.

The string described up to this point can be enriched by comments. Comments are connected to the lemma by the underscore character. A parenthesized comment preceded by a circumflex contains a short description of the meaning (in Czech). It often accompanies lemmas with distinguished senses (e.g. stát-1_^(státní_útvar)). A comment preceded by a semicolon encodes lexical and stylistic categories, e.g. G in Grónsko_;G means that Grónsko (Greenland) is a geographical name.
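The lemma structure described above (base form, optional sense number, optional comments) can be taken apart with a short Python sketch. The function name parse_lemma is hypothetical, and the sketch assumes that every comment starts with an underscore immediately followed by a circumflex or a semicolon, which is what the two examples above show.

```python
import re

def parse_lemma(lemma):
    """Return (base, sense, comments) for a PDT-style lemma string."""
    # Comments are attached with '_' and start with '^' or ';'.
    # Underscores inside a comment (as in 'státní_útvar') are left alone.
    head, *comments = re.split(r"_(?=[;^])", lemma)
    # An optional sense number is a dash plus decimal digits at the end.
    match = re.fullmatch(r"(.+)-(\d+)", head)
    if match:
        return match.group(1), int(match.group(2)), comments
    return head, None, comments
```

With the examples from the text, parse_lemma("stát-1_^(státní_útvar)") yields the base stát, the sense 1 and one comment, while parse_lemma("Grónsko_;G") yields the base Grónsko, no sense number and the comment ;G.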

The lemma of the root is #.

lemmaMM_source

@P lemmaMM_source

Corresponds to the CSTS element <MMl src="source">.

This set of attributes is automatically created during conversion from CSTS to FS.

lemmaMD_source

@P lemmaMD_source

Corresponds to the CSTS element <MDl src="source">.

This set of attributes is automatically created during conversion from CSTS to FS.

tag

@P tag
@O tag

Corresponds to the CSTS element <t>.

The attribute tag contains the part of speech and the morphological tag. The Czech tag system has roughly 3000 theoretically possible tags, of which one to two thousand actually appear in the PDT. Each tag can be viewed in two ways: compact and positional. There is a one-to-one mapping between the two systems, so users may choose whichever they prefer.

A tag is positional if and only if it is a string of 15 characters (English letters, digits, dashes and other special characters such as dots or exclamation marks). A compact tag has variable length but is always shorter than 15 characters. It contains only uppercase English letters, digits, and sometimes a dash.

The compact system is older. Its tags may be more legible for an experienced user because they encode only the properties relevant for the given part of speech. They are, however, difficult to parse automatically, because there are many rules of the kind "if the characters read so far were such and such, the next character encodes the gender, otherwise it encodes the tense". In a positional tag, by contrast, the index (position) of a character already determines which morphological property it encodes. The price is that the tags are long and contain long sequences of dashes for categories not relevant for the given word.
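The length criterion above is enough to tell the two tag views apart. The following Python sketch illustrates it; the function name tag_style is hypothetical, the 15-character string in the usage note is only an illustrative example, and treating the empty string and a sole dash as "empty" follows the general rule on empty values stated earlier.

```python
POSITIONAL_LENGTH = 15

def tag_style(tag):
    """Classify a tag value as 'empty', 'positional' or 'compact'."""
    if tag in ("", "-"):
        return "empty"  # empty string or sole dash
    if len(tag) == POSITIONAL_LENGTH:
        return "positional"
    if len(tag) < POSITIONAL_LENGTH:
        return "compact"
    raise ValueError(f"tag longer than {POSITIONAL_LENGTH} characters: {tag!r}")
```

For instance, tag_style("NNFS1-----A----") gives "positional", and tag_style("ZSB"), the tag of the root, gives "compact".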

See the description of the compact tag system (available in: pdffile, psfile) and the description of the positional tag system (detailed description available in: pdffile, psfile; quick reference available in: htmlfile, pdffile ).

As with any FS attribute, the value can be a set of alternatives (tags) separated by the vertical bar character (|). If the lemma of the node contains several alternatives, the tag set must use the special tag -- to separate the tags for lemma i from the tags for lemma i+1.
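One possible reading of this rule is sketched below in Python. It assumes that the separator -- appears as an alternative of its own between two vertical bars, and it ignores any escaping of the bar character that the general FS syntax may define. The function name and the placeholder tags T1, T2, T3 are hypothetical.

```python
def group_tags_by_lemma(tag_value, separator="--"):
    """Split 'T1|T2|--|T3' into one list of tags per lemma alternative."""
    groups = [[]]
    for tag in tag_value.split("|"):
        if tag == separator:
            groups.append([])  # start the tags of the next lemma
        else:
            groups[-1].append(tag)
    return groups
```

Under these assumptions, group_tags_by_lemma("T1|T2|--|T3") assigns T1 and T2 to the first lemma alternative and T3 to the second.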

The root has the tag ZSB.

wt

@P wt

Corresponds to the attribute w of the CSTS element <t>.

tagMM_source

@P tagMM_source

Corresponds to the CSTS element <MMt src="source">.

This set of attributes is automatically created during conversion from CSTS to FS.

tagMD_source

@P tagMD_source

Corresponds to the CSTS element <MDt src="source">.

This set of attributes is automatically created during conversion from CSTS to FS.

wMDl_source, wMDt_source

@P wMDl_source
@P wMDt_source

Correspond to the attribute w of the CSTS elements <MDl src="source"> and <MDt src="source">.

This set of attributes is automatically created during conversion from CSTS to FS.

Surface syntax

A comprehensive view of this part of annotation can be found in the manual for the analytical layer annotators (in Czech).

afun

@P afun
@O afun
@L2 afun|---|Pred|Pnom|AuxV|Sb|Obj|Atr|Adv|AtrAdv|AdvAtr\
|Coord|AtrObj|ObjAtr|AtrAtr|AuxT|AuxR|AuxP|Apos|ExD|AuxC|Atv|AtvV\
|AuxO|AuxZ|AuxY|AuxG|AuxK|AuxX|AuxS|Pred_Co|Pnom_Co|AuxV_Co|Sb_Co\
|Obj_Co|Atr_Co|Adv_Co|AtrAdv_Co|AdvAtr_Co|Coord_Co|AtrObj_Co\
|ObjAtr_Co|AtrAtr_Co|AuxT_Co|AuxR_Co|AuxP_Co|Apos_Co|ExD_Co|AuxC_Co\
|Atv_Co|AtvV_Co|AuxO_Co|AuxZ_Co|AuxY_Co|Pred_Ap|Pnom_Ap|AuxV_Ap|Sb_Ap\
|Obj_Ap|Atr_Ap|Adv_Ap|AtrAdv_Ap|AdvAtr_Ap|Coord_Ap|AtrObj_Ap\
|ObjAtr_Ap|AtrAtr_Ap|AuxT_Ap|AuxR_Ap|AuxP_Ap|Apos_Ap|ExD_Ap|AuxC_Ap\
|Atv_Ap|AtvV_Ap|AuxO_Ap|AuxZ_Ap|AuxY_Ap|Pred_Pa|Pnom_Pa|AuxV_Pa|Sb_Pa\
|Obj_Pa|Atr_Pa|Adv_Pa|AtrAdv_Pa|AdvAtr_Pa|Coord_Pa|AtrObj_Pa\
|ObjAtr_Pa|AtrAtr_Pa|AuxT_Pa|AuxR_Pa|AuxP_Pa|Apos_Pa|ExD_Pa|AuxC_Pa\
|Atv_Pa|AtvV_Pa|AuxO_Pa|AuxZ_Pa|AuxY_Pa|???

Corresponds to the CSTS element <A>.

Analytical function (surface-syntactic tag). Denotes the type of dependency between the governing and the dependent node. Besides typical syntactic categories such as subject, predicate, object, attribute or adverbial, it also includes many auxiliary relations and distinguishes coordinations and appositive modifiers from real dependencies.

See the description of the analytical function system (available in: pdffile, psfile).

The root has the afun AuxS.
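Most entries in the value list above combine a base function with one of the suffixes _Co, _Ap or _Pa. The Python sketch below separates the two parts; the function name split_afun is hypothetical, and the interpretation of the suffixes is left to the analytical function description referenced above.

```python
AFUN_SUFFIXES = ("Co", "Ap", "Pa")

def split_afun(afun):
    """Return (base, suffix); suffix is None when no known suffix is present."""
    base, sep, suffix = afun.rpartition("_")
    if sep and suffix in AFUN_SUFFIXES:
        return base, suffix
    return afun, None
```

For example, split_afun("Sb_Co") returns ("Sb", "Co"), while split_afun("AuxS") returns ("AuxS", None).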

afunMD_source

@P afunMD_source

Corresponds to the CSTS element <MDA src="source">.

This set of attributes is automatically created during conversion from CSTS to FS.

ord

@N ord

Corresponds to the CSTS element <r>.

Index of the word in the sentence (original word order). The root has the index of 0.

govMD_source

@P govMD_source

Corresponds to the CSTS element <MDg src="source">.

This set of attributes is automatically created during conversion from CSTS to FS.

Deep syntax

A number of attributes have been added to the FS files on the tectogrammatical layer (see the header below). For their description, please refer to this PostScript file or directly to the manual for the tectogrammatical annotators.

Technical stuff

ID1

@P ID1

Corresponds to the attribute id of the CSTS element <s>.

This attribute is non-empty only for root nodes. Its value is then the sentence identification within the Czech National Corpus.

ID2

@P ID2

Corresponds to a part of the attribute id of the CSTS element <s>.

This attribute appears only in older files and is non-empty only for root nodes. Its value is then the name of the file the tree appears in.

nospace

@P nospace

Corresponds to the CSTS element <D>.

If the value of this attribute is 1, no space followed the original form in the original data.
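Together with ord and origf, this attribute allows the original text of a sentence to be approximated. The Python sketch below assumes a hypothetical in-memory representation in which each node is a dictionary mapping attribute names to string values; reading FS files into such dictionaries is not shown.

```python
def detokenize(nodes):
    """Rebuild the sentence text from the ord, origf and nospace attributes."""
    # The root has ord 0 and does not correspond to any word, so skip it.
    words = sorted(
        (node for node in nodes if int(node["ord"]) > 0),
        key=lambda node: int(node["ord"]),
    )
    parts = []
    for node in words:
        parts.append(node["origf"])
        if node.get("nospace") != "1":
            parts.append(" ")  # a space followed this form in the original
    return "".join(parts).rstrip()
```

The trailing space after the last word is stripped, since the sketch has no way of knowing what followed the sentence.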

origfkind

@P origfkind

Corresponds to the attribute kind of the CSTS element <w>.

formtype

@P formtype

Corresponds to the attribute case of the CSTS element <f> or to the attribute type of the CSTS element <d>.

cstslang

@P cstslang

Corresponds to the attribute lang of the CSTS top-level element <csts>.

cstssource

@P cstssource

Corresponds to the CSTS element <source>.

cstsmarkup

@P cstsmarkup

Corresponds to the CSTS element <markup> in case it is a subelement of CSTS element <h>.

This attribute is non-empty only for the root node of the first tree in a file. It stores the original SGML form of all subelements of CSTS element <markup>, as stored in the CSTS header <h>.

chap

@P chap

Corresponds to the CSTS element <c>.

This attribute is non-empty only for root nodes. If its value is 1, the sentence represented in the tree is the first sentence of a chapter or section.

doc

@P doc

Corresponds to the attribute file of the CSTS element <doc>.

This attribute is non-empty only for root nodes. A non-empty value means that the sentence represented in the tree is the first sentence of a document; the value is the original file name of that document.
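Because doc marks the first sentence of each document, a sequence of trees can be regrouped into documents by watching this attribute on the roots. The Python sketch below assumes a hypothetical representation of each root as a dictionary of attribute values, and treats both the empty string and a sole dash as empty, following the general rule on empty values.

```python
def is_empty(value):
    """True for a missing value, an empty string, or a sole dash."""
    return value is None or value in ("", "-")

def split_into_documents(roots):
    """Group root nodes into documents; a non-empty doc starts a new one."""
    documents = []
    for root in roots:
        if not is_empty(root.get("doc")) or not documents:
            documents.append([])
        documents[-1].append(root)
    return documents
```

Trees that precede the first non-empty doc value are collected into an initial group of their own rather than being dropped.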

docid

@P docid

Corresponds to the attribute id of the CSTS element <doc>.

docmarkup

@P docmarkup

Corresponds to the CSTS element <markup> in case the element appears in the document header <a>.

This attribute is non-empty only for root nodes and only if the doc attribute is also non-empty. It stores the original SGML form of all subelements of CSTS element <markup> of the document header <a>.

docprolog

@P docprolog

Corresponds to the CSTS element <a>.

This attribute is non-empty only for root nodes and only if the doc attribute is also non-empty. It stores the original SGML form of all subelements of CSTS element <a> except <markup>.

gappre, gappost

@P gappre
@P gappost

These attributes correspond to the CSTS elements <i>, <idioms>, <idiom> and <iref>.

These attributes store the original SGML form of the above-named CSTS elements appearing just before (gappre) or just after (gappost) all other elements which form the node.

A typical header of the Prague Dependency Treebank's FS files

@P lemma
@O lemma
@P tag
@O tag
@P form
@O form
@P afun
@O afun
@L1 afun|---|Pred|Pnom|AuxV|Sb|Obj|Atr|Adv|AtrAdv|AdvAtr|Coord|AtrObj|ObjAtr|AtrAtr|AuxT|AuxR|AuxP|Apos|ExD|AuxC|Atv|AtvV|AuxO|AuxZ|AuxY|AuxG|AuxK|AuxX|AuxS|Pred_Co|Pnom_Co|AuxV_Co|Sb_Co|Obj_Co|Atr_Co|Adv_Co|AtrAdv_Co|AdvAtr_Co|Coord_Co|AtrObj_Co|ObjAtr_Co|AtrAtr_Co|AuxT_Co|AuxR_Co|AuxP_Co|Apos_Co|ExD_Co|AuxC_Co|Atv_Co|AtvV_Co|AuxO_Co|AuxZ_Co|AuxY_Co|AuxG_Co|AuxK_Co|AuxX_Co|Pred_Ap|Pnom_Ap|AuxV_Ap|Sb_Ap|Obj_Ap|Atr_Ap|Adv_Ap|AtrAdv_Ap|AdvAtr_Ap|Coord_Ap|AtrObj_Ap|ObjAtr_Ap|AtrAtr_Ap|AuxT_Ap|AuxR_Ap|AuxP_Ap|Apos_Ap|ExD_Ap|AuxC_Ap|Atv_Ap|AtvV_Ap|AuxO_Ap|AuxZ_Ap|AuxY_Ap|AuxG_Ap|AuxK_Ap|AuxX_Ap|Pred_Pa|Pnom_Pa|AuxV_Pa|Sb_Pa|Obj_Pa|Atr_Pa|Adv_Pa|AtrAdv_Pa|AdvAtr_Pa|Coord_Pa|AtrObj_Pa|ObjAtr_Pa|AtrAtr_Pa|AuxT_Pa|AuxR_Pa|AuxP_Pa|Apos_Pa|ExD_Pa|AuxC_Pa|Atv_Pa|AtvV_Pa|AuxO_Pa|AuxZ_Pa|AuxY_Pa|AuxG_Pa|AuxK_Pa|AuxX_Pa|Generated|NA|???
@P ID1
@P ID2
@VA origf
@P origf
@P afunprev
@P semPOS
@P tagauto
@P lemauto
@N ord
@P dord
@W sentord
@P govTR
@P nospace
@P root
@P ending
@P punct
@P alltags
@P wt
@P origfkind
@P formtype
@P gappost
@P gappre
@P cstslang
@P cstssource
@P cstsmarkup
@P chap
@P doc
@P docid
@P docmarkup
@P docprolog
@P1 warning
@P3 err1
@P3 err2
@P reserve1
@P reserve2
@P reserve3
@P reserve4
@P reserve5
@P wMDt_a
@P wMDl_a
@P wMDt_b
@P wMDl_b
@P tagMD_a
@P lemmaMD_a
@P tagMD_b
@P lemmaMD_b
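A header such as the one above can be summarized per attribute with a short Python sketch. It only collects the declaration codes (the letters and optional digit after @) and any value list; what the individual codes mean is defined in the general FS description, not here. The function name parse_header, the returned dictionary layout, and the handling of a trailing backslash as a line continuation (as seen in the afun value list earlier in this document) are assumptions of this sketch.

```python
import re
from collections import defaultdict

HEADER_LINE = re.compile(r"@([A-Z]+)(\d*)\s+(.+)")

def parse_header(lines):
    """Map each attribute name to its declaration codes and listed values."""
    attributes = defaultdict(lambda: {"codes": [], "values": []})
    # Join physical lines that end in a backslash continuation.
    text = "\n".join(lines).replace("\\\n", "")
    for line in text.splitlines():
        match = HEADER_LINE.fullmatch(line.strip())
        if match is None:
            continue  # not a header declaration
        code, digit, rest = match.groups()
        # For value-list declarations the name is followed by '|'-separated values.
        name, *values = rest.split("|")
        entry = attributes[name.strip()]
        entry["codes"].append(code + digit)
        entry["values"].extend(values)
    return dict(attributes)
```

Applied to the header above, the entry for afun would carry the codes P, O and L1 together with the full value list, while most other attributes would carry a single code and no values.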