This Document Type Definition specifies the Czech National Corpus SGML markup scheme, as used (with extensions) in various derived corpora, most notably in the Prague Dependency Treebank. The pure DTD file is called csts.dtd and the declaration file is called csts.dcl; there is also a version for direct use with the nsgmls software (csts.doctype), and a description file which can be fed into the dtd2html program, csts.desc, which produced these HTML documentation pages.

The SGML document name. It contains an (optional) <h>eader, with bibliographic (source-of-text) and annotation/annotator information, and a number of <doc>uments (sometimes just one, sometimes hundreds) which share the same header and which contain the text proper. It cannot be omitted, not even at the end of the SGML document (as most other embedded tags can).

It has only one attribute (lang), for the specification of the default language of the contents; the language can be specified in the <a> tag within each <doc>ument differently.

The default language for the whole SGML document. Can be redefined in the <a> tag within each <doc> element.

An optional file header, as the first element of the <csts> element. It contains information about the source of the file, and about the markup performed on the file so far.

Short identification of the source of the data. Usually contains a publisher's common name, such as "Lidove noviny". Please note that this element's closing tag cannot be omitted.

Default markup information for the file. Might be (logically) superseded by the <doc> / <a>'s element of the same name. It is meant to contain the markup (main) author's name, date, and description of the markup performed on the file, whether automatically or manually. Latest markup information comes first in the sequence of markup elements.

Author of markup. Human-readable full name(s) of the author(s) (or of the main author if too many people were involved), or of person(s) who can provide further information, documentation and/or software concerning the markup.

Free-format specification of the date/time the markup was performed. Any human-readable format of the specification is acceptable, such as "01-Dec-1997", "1998-11-30", "Fri Oct 1 10:41:18 1999", or even "spring 1997". Regionally-specific (and thus globally-ambiguous) date formats should be avoided (such as "7/4/2001").

Free-text description of what has been done to the data. English is preferred as the language of the description. Several mdesc elements might be present within a <markup> element. Even though everything is optional, useful information should be recorded here, such as additional people involved, software used (with version id, parameter settings, etc.), and/or the environment in which the processing has occurred.

A document. Each file contains one or more documents (typically, there is one document per file for books, ephemerals, poems, etc., but possibly hundreds of documents per file for a newspaper, where one file contains the whole daily issue, and each document corresponds to an article.)

A document is identified by a (numerical) id attribute (documents are simply numbered within a file, starting at 1). For ease of local reference, the filename in which the document resides is repeated at every document in the file in the file attribute. Full path to the archive is used for the file reference, even though care has been taken to uniquely identify all files in the CNC (and thus, in the PDT as well) just by the filename.

The document contains one header (<a>) and its contents (<c>). The header contains information about the genre, time period, and other bibliographical and classification information as well as additional markup processing information (if any). The contents then contains a sequence of paragraphs (<p>) and sentences (<s>) within the paragraphs containing the linguistic material proper.

A filename in which the document resides (in the case of texts from the CNC, it is the filename under which the document is stored in the "Bank of the CNC"). The full path from the archive root should be included (and again, it is included for files coming from the CNC). If the filename is unique, the filename alone should be sufficient. In the case of the CNC, however, even though the filenames are unique, the full path is given nevertheless, and it has the following structure:

A typical value of the file attribute is thus

s/inf/nws/1994/ln94164

A document identification (a decimal number). Documents are identified uniquely within a file.

Documents are initially numbered continuously and densely (starting at 1 in a file), but some of them might disappear or be added during subsequent markup and processing. Therefore, nothing is guaranteed but the fact that the attribute's value is a decimal number.

Document header. The <markup> subelement has the same semantics as the one in the file header (<h>), and it is considered to contain markup information additional to the file header's <markup> subelement.

Corpus type:

s current-epoch ("modern") language written corpus
d diachronic ("old") language written corpus
o oral (spoken) corpus (transcribed speech)

Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.

Text type:

ima imaginative
inf informative
mix mixture

Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.

Text genre:

son song lyrics
ver verse
nov novels and stories
crm crime fiction
sci science fiction
adv romance (adventure)
ero erotica and pornography
bio (auto)biography, memoirs, letters, diaries
tra travels (by non-experts)
tab yellow, fallen lit. (bulvar)
fab myths, legends, fables, tales
hum humor, satire, parody, jokes
jun literature for children and youth
ess essays, sketches, columns
chr chronicles, annals, yearbooks
exc eccentric literature
dra dramas, sets, tv series, radio
mus music
tvf television, movies
jur justice, criminology
his history, expert biographies
psy psychology
edu education, training, teaching, edification
soc sociology
mil military
phi philosophy
art visual arts, architecture, applied arts
the theater, ballet
pol politology
lit literature
lin linguistics
eth ethnography, anthropology
agr agriculture, forestry, breeding, raising
med medicine
bio biology, botany, zoology
che chemistry
mat mathematics, logic
ggr geography, travels (by experts)
phy physics, astronomy
geo geology, meteorology, hydrology
ind industry, technology, building, energy, transportation, crafts
inf information, computer science
eco economy, business, banking
adm administration, government, management, parliament
rel religion, theology
hou household economy (boarding, lodging, clothing)
spo sports
sct society (manners, gossips)
amu amusement, games

In the texts in the Prague Dependency Treebank, very few genres are actually marked, even though theoretically they can be identified (such as sports).

Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.

Verse type:

txb textbook
enc dictionary, encyclopedia
pop popular style
cri critique
adv advertisement

Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.

Medium:

b book
nws newspaper
j journal (periodic)
scr screenplay
net Internet
oc occasional

Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.

Author's sex:

y default; not known
m male
f female

Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.

Text language:

cze Czech
eng English
slo Slovak
ger German
fre French (France)
spa Spanish
rus Russian

Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.

Translator's sex:

na default if not a translation (srclang=no)
x default if translation; not known
m male
f female

Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.

Source text language (for translations only):

no Default
eng English
slo Slovak
ger German
fre French (France)
spa Spanish
rus Russian

Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.

Year of publication. Full four digits required (1898, 1960, 2001).

Year of first edition publication. Full four digits required (1898, 1960, 2001). If first edition publication date is not known, the value x is used (the default).

Author's identification (not a full name). Every author included in the corpus is assigned a max. 8-character identification (see ucnk.ff.cuni.cz web page for current documentation on the full name <-> author ID correspondence table).

Translator's identification (not a full name). Every author/translator included in the corpus is assigned a max. 8-character identification (see ucnk.ff.cuni.cz web page for current documentation on the full name <-> author ID correspondence table).

The filename. Every file within a corpus is assigned a max. 8-character name which is also a valid filename (a least common denominator for filename convention across file systems is used here for a definition of a valid filename). A copy of the file attribute of the doc element.

Various additional conventions are used within the CNC. For example, all Lidove Noviny newspaper filenames (one file corresponds to one day, with documents (<doc>) corresponding to articles) are formed using the following template: lndyyxxx, where lnd identifies the main ("daily") portion of the newspaper, yy is the year, and xxx is the day-of-the-year number. Similarly, Mlada Fronta daily newspaper files are assigned names according to the following scheme: mfyymmdd, using the full date (instead of the day-of-the-year shortcut) for identifying the day on which it was published. Novels and other non-periodic text usually get a mnemonic filename reminiscent of the original title or author's name.
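The two newspaper templates above can be decoded mechanically. The following is a minimal sketch, not part of the CNC tooling: the function name decode_cnc_filename and the century parameter are assumptions (the templates carry only a two-digit year), and any filename that does not match one of the two templates exactly as written returns None.

```python
import re
from datetime import date, timedelta


def decode_cnc_filename(name, century=1900):
    """Decode the lndyyxxx and mfyymmdd templates described in the text.

    `century` is an assumption: the templates only store a two-digit year.
    Returns (newspaper, publication date) or None if the name does not match.
    """
    m = re.fullmatch(r"lnd(\d{2})(\d{3})", name)
    if m:
        year = century + int(m.group(1))
        day_of_year = int(m.group(2))
        if day_of_year < 1:
            return None
        published = date(year, 1, 1) + timedelta(days=day_of_year - 1)
        # Reject day numbers that run past the end of the year.
        if published.year != year:
            return None
        return "Lidove Noviny", published

    m = re.fullmatch(r"mf(\d{2})(\d{2})(\d{2})", name)
    if m:
        yy, mm, dd = (int(g) for g in m.groups())
        try:
            return "Mlada Fronta", date(century + yy, mm, dd)
        except ValueError:  # impossible month/day combination
            return None

    return None
```

Mnemonic filenames for novels and other non-periodic texts carry no date, so they fall through to None in this sketch.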

Document identification (a 3-digit number, with leading zeros if necessary) within a file. A copy of the id attribute of the doc element.

The contents element, which follows the document header (<a>). It can also be interpreted as a "chapter" or "division" or any "super-paragraph" unit, if you like, since it can be repeated. However, in the CNC (and therefore in the PDT, too), it is never repeated, i.e. there is always one <c> per document (<doc>).

The paragraph element. It is marked only if it was apparent in the source data where the paragraph breaks are. Therefore it is possible that there are files with no paragraph breaks whatsoever (except one which is compulsory within the <c> element). This can happen even in data from one data source, acquired over a longer period of time.

The paragraph number. Paragraphs are numbered starting typically from 1, even though it is not compulsory.

Paragraphs are initially numbered continuously and densely, but during subsequent markup and processing, some of them might disappear or be added. Therefore, nothing is guaranteed but the fact that the attribute's value is a decimal number.

A sentence. Sentence boundaries are identified at tokenization time, unless they are marked in the source, which is almost never the case. The algorithm for sentence boundary identification used in the CNC is very rudimentary: it is correct only about 95-98% of the time for general texts, and its accuracy depends very heavily on the type of the text.

Sentences are identified uniquely within the CNC corpus (as they should be in any corpus). The identification consists of the

The full sentence identification is typically recorded in full at each sentence in the data in the id attribute.

A sentence identification. Sentences are identified uniquely within the CNC corpus (as they should be in any corpus). The identification consists of the

The format of the id string:

filename:docid-pXXXsYYY

where docid is the value of the id attribute of the <doc> element, XXX is the paragraph number (from the n attribute of the <p> element), and YYY is the sentence id number (from the id attribute of the <s> element). Thus a typical sentence id attribute is

id="s/inf/nws/1994/ln94164:001-p2s3"
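The id format above is regular enough to be split with a single regular expression. The sketch below is illustrative only; the names SentenceId and parse_sentence_id are assumptions, and the document id is kept as a string so that leading zeros are preserved.

```python
import re
from typing import NamedTuple, Optional


class SentenceId(NamedTuple):
    file: str       # value of the file attribute of <doc>
    doc: str        # value of the id attribute of <doc> (kept as text)
    paragraph: int  # n attribute of <p>
    sentence: int   # sentence number within the paragraph


# filename:docid-pXXXsYYY, as described in the text
_ID_RE = re.compile(r"^(?P<file>.+):(?P<doc>\d+)-p(?P<p>\d+)s(?P<s>\d+)$")


def parse_sentence_id(value: str) -> Optional[SentenceId]:
    """Split a sentence id into its parts; return None if it does not match."""
    m = _ID_RE.match(value)
    if m is None:
        return None
    return SentenceId(m.group("file"), m.group("doc"),
                      int(m.group("p")), int(m.group("s")))
```

For the typical value shown above, parse_sentence_id("s/inf/nws/1994/ln94164:001-p2s3") yields the file path, the document id "001", paragraph 2 and sentence 3.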

Alternate markup for alternate annotation of a sentence. Contains a copy of the original sentence with different markup. NB: this element is embedded within the (original) <s> element; it is not parallel to it. It has identical contents, except that another salt element is not allowed within it; rather, all salt elements are listed next to each other.

Source "formatting" information. Most programs (tools), such as the morphological analyzer, preserve this information on the output. It could thus be used for transferring information from the input of such tools to the output, untouched by the processing.

The "main" word (token) element. Contains the word form from text, and then elements associated with the word form, such as lemma and tag (manual, dictionary possibilities, machine generated by various taggers), or governing node and analytical function (again, manual and/or automatic) on the analytical level, and governing node, functor and grammateme(s) on the tectogrammatical annotation level (yet again, possibilities exist to encode both manual and automatically assigned values; see also the description of the <fadd> element).

The attribute case contains an indication of the token's capitalization pattern, even though the actual capitalization from the original text is preserved, too. Only five types of capitalization are recognized and marked:

The string abbr is appended to the capitalization pattern names above if the word form has been identified as an abbreviation followed by a dot (period/fullstop) by the tokenizer. (NB: other abbreviations (even such that are not followed by a fullstop) are recognized at dictionary look-up time, but the value of the attribute case is then never ever modified again, i.e. for such abbreviations the abbr string is not added, and the fact that the token is (possibly) an abbreviation is marked elsewhere - see the elements <t>, <MMt> and <MDt>.)

The <f> element is in most cases identical to the appearance of the word form in the original text. In case of any discrepancy (such as an obvious spelling error, multiword or split phrases detected at tokenization time), the <w> element(s) is(are) used, preceding the <f> element(s); in such cases, the attribute case containing the substring gen is present in the <f> tag. Obviously, some of those discrepancies could have been discovered only in the manually annotated data; therefore, it is not guaranteed that e.g. spelling errors are marked in all data.

The form of the original token as found in the original source of the text. Its text (#PCDATA) is in most cases identical to the initial text (#PCDATA) of the <f> element, in which case it can be completely omitted. Otherwise it must immediately precede the corresponding "normalized" <f> element(s).

It is used in the following cases:

In the PDT data, the default value same of the kind attribute is never used explicitly; in fact, the whole <w> element, although theoretically correct, is never present in such a case.

In the following description, "immediately following ... element" means a following <f> and/or <d> element without an intervening SGML tag except (possibly) for a <D> and/or <i> element(s).

Punctuation markup. Used throughout the corpus, including the Czech National Corpus. Counted as one token, but not as a word proper. For all uses and purposes, it behaves like a word (see the <f> element). I.e., it contains the same subelements as the <f> element does, including elements for lemmas, morphological and syntactic tags, etc., to make further processing simple. The only difference is in the set of possible type attribute values - the set for this punctuation markup is much smaller than that for the <f> element.

Type of the <d> element. Usually empty (corresponds to the std value). The value gen is used for added <d> elements not present in the original input data (e.g., for typos in the part of the corpus that has been manually corrected).

Used to signal "no space present" between two tokens (usually between an <f> element and a <d> element, or between two <d> elements). However, in the manually annotated data, it can also appear between two <f> elements. In any case, it signals that between the immediately preceding element and the immediately following element there was no space in the original source text as received from the text supplier. (This might be important, for example, for more reliable abbreviation or hyphenated compound identification.) Conversely, if two input tokens (i.e., two <d>/<f> elements) have been separated by one or more white-space characters in the original textual material, there is no explicit markup in the data (since for most languages, this is the default).

For most applications, all the <D> tags can safely be ignored. (However, the standard morphological processing - such as in the PDT - does take this tag into account.)
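The spacing convention of the <D> element can be illustrated by rebuilding the surface text from a token sequence. This is a minimal sketch under stated assumptions: the input is a hypothetical, already-parsed list of (element name, text) pairs, and the function name rebuild_text is not part of any PDT tool.

```python
def rebuild_text(elements):
    """Join <f>/<d> tokens with single spaces, except where a <D> marker
    sits between two tokens (meaning: no space in the original source).

    `elements` is a hypothetical list of (element_name, text) pairs;
    the text of a "D" entry is ignored.
    """
    parts = []
    no_space = True  # nothing precedes the first token
    for kind, text in elements:
        if kind == "D":
            no_space = True
        elif kind in ("f", "d"):
            if parts and not no_space:
                parts.append(" ")
            parts.append(text)
            no_space = False
        # any other element (e.g. <i>) is skipped in this sketch
    return "".join(parts)


tokens = [("f", "Ano"), ("D", None), ("d", ","), ("f", "ano"),
          ("D", None), ("d", ".")]
text = rebuild_text(tokens)
```

The original amount of white space is not recoverable from the markup; the sketch simply emits one space wherever no <D> is present.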

Optional element for all-uppercase "speech-like transcription" of the original data tokens (i.e., of the <f> elements) into "pronounced" form. It is performed for the following token types:

All other tokens are only transformed to all-uppercase and otherwise left as they were in the original data. The element is never used for punctuation unless the punctuation is commonly pronounced in some way.

In all cases, regular pronunciation is not transcribed in this way; it is considered to be recorded in a pronunciation dictionary as usual. This tag is meant to unify the actual pronunciation if, e.g., the text is read aloud.

OBSOLETE. Formerly, a list of all tags assigned by morphology, regardless of lemma, i.e. the union of all sets of tags of all lemmas, separated by slashes (i.e. a slash could not have been part of a tag).

Lemma as defined by a morphological dictionary, manually disambiguated. It has the same format also in the <MMl> and <MDl> text fields.

In the Czech data provided in the Prague Dependency Treebank, and partially also in the Czech National Corpus (CNC), the formal structure of the lemma is described in the remaining part of this section.

The lemma string includes an optional sense ID (a decimal number separated from the lemma by a single dash symbol, such as -3), followed optionally by syntactic, semantic and style tags, derivational information and a comment (or any combination of those) marked in a non-SGML way: each tag is only one-letter long, it is attached to the lemma by an underscore and a single markup symbol:

Syntactic tags

Syntactic tags have been used formerly for alternate part of speech for some words, but are not used today except for verb aspect distinction for regular verbs (T, W symbols). Part of speech symbols can be always found in the associated morphological tag (<t>, <MMt>, <MDt>), and the abbreviation information from the tag (8 in its VAR (last) column) has precedence over the B designation here.
N Noun
J Adjective
A Adjective
Z Pronoun
T Imperfective verb
W Perfective verb
V Verb (aspect not specified)
M Numeral
C Conjunction
D Adverb
P Preposition
F Interjection
I Particle
B Abbreviation
Q Unused
X Unused

Semantic tags

G Geographical name
Y Person's first (given) name
S Person's family name
E Names of members of nations, cities, ethnic groups etc.
R Product name
K Organization name
m Other proper name
H Chemistry
U Medicine
L Natural Sciences
j Law, Legal
g General Technical term
c Electronics, Computers
y DIY, travel, free time
b Economy and Finances
u Culture, Education, Arts, other science
w Sports
p Politics, Government, Military
z Environment
o Colors

Style tags

s Bookish
a Archaic
n Dialect
h Colloquial (not tolerated in the standard)
e Expressive
l Slang
v Vulgar (extremely expressive)
t Foreign-language word
x Parallel spelling/form, do not use for morph. generation

Derivation information, general comment

Derivation information and general comment (introduced by the caret symbol, ^) are furthermore always contained within a set of parentheses.

Within the parentheses, derivation information always starts with a star (*) as a distinguishing symbol (vs. a general comment), optionally preceded by a derivation type formed by the symbol ^ (caret) and a two-letter code. After the star symbol, a "rule" follows which describes how to get the (underlying) lemma which the current lemma has been derived from. The rule has two parts:

If a star (*) is used in the deletion part, the to-be-appended part which follows contains the complete original lemma. Otherwise, the number of symbols (including the sense ID if any) designated by the deletion part should be stripped off before attaching the to-be-appended part to form the original lemma.

Examples:

Eventually, all proper derivations should have the ^XX derivation present; all remaining comments starting with a star (*) and containing the string transformation rule described above will be considered synonyms, not derivations.
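The deletion/append rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the dictionary's own code: the rule string is taken to be the text after the opening parenthesis (for example *3it), the deletion part is assumed to be a run of decimal digits, and the function name apply_derivation_rule is hypothetical.

```python
import re

# "*" + deletion part (a number, or "*" for "replace everything")
# + to-be-appended part
_RULE_RE = re.compile(r"^\*(\*|\d+)(.*)$")


def apply_derivation_rule(lemma, rule):
    """Return the underlying lemma obtained by applying `rule` to `lemma`.

    `lemma` is the lemma string proper, including the sense ID if any,
    because the deletion count includes the sense ID.
    """
    m = _RULE_RE.match(rule)
    if m is None:
        raise ValueError("not a derivation rule: %r" % rule)
    deletion, appended = m.groups()
    if deletion == "*":
        # The appended part is the complete original lemma.
        return appended
    count = int(deletion)
    if count > len(lemma):
        raise ValueError("rule deletes more symbols than the lemma has")
    return lemma[:len(lemma) - count] + appended
```

Applied to a lemma ending in -ený with the rule *3it, the sketch strips three symbols and appends it, which is the verb-from-participle pattern seen in dictionary output.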

Lemma as defined by a morphological dictionary, all possibilities (context insensitive). Its text (#PCDATA) part has the same format and contents as the <l> text field.

Optional source of dictionary information (dictionary identification). In the Prague Dependency Treebank, the dictionary information can have three sources:

The src information in the <MMt> tag is just a copy of the current identification.

Lemma as defined by a morphological dictionary, automatically disambiguated by a tagger. Its text (#PCDATA) part has the same format and contents as the <l> text field.

The weight of the selected (disambiguated) lemma as defined by the tagger used for disambiguation, as a number between 0 and 1. Currently unused.

Source of disambiguation information (tagger ID). In the Prague Dependency Treebank (PDT), there are two taggers:

See the documentation to the PDT for more information about the taggers.

Pattern (morphological paradigm) name, optionally output by the morphological analyzer. It contains two parts, separated by a colon:

For example, if the dictionary pattern name is s2, and the actual ending pattern is y, the markup is <pn>s2:y.

The source dictionary pattern name (s2 in the above example) is the one which is manually inserted to the source main dictionary when adding a new word (dictionary entry). However, this pattern in general does not correspond directly to the actual ending database pattern name, since it is first processed by the [regular] derivation module (which is part of the morphological processor), and replaced by another "virtual" entry (or even expanded to a set of virtual dictionary entries), with possibly different root and lemma, which then uses a matching pattern name from the set of patterns from the ending database. This whole process is done on-the-fly and does not entail any special preprocessing steps other than regular dictionary compilation for the morphological analyzer.

The pattern name(s) have no importance for further linguistic processing, but they might be interesting for other reasons. No set of pattern names is generally provided with the data and this description (as opposed to, e.g., the morphological tags), but it might be made available on request.

Complete example:

<f>koupenou<MMl src="ad">koupený_^(něco_sobě/někomu)_(*3it)<pn>s2:y<MMt src="ad">AAFS4----1A----<pn>s2:y<MMt src="ad">AAFS7----1A----
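Because most closing tags are omitted in this markup, a line such as the one above can be split into (element, attributes, text) triples with a simple regular expression. The sketch below is a convenience for inspecting single lines, not a substitute for a real SGML parser such as nsgmls; the function name split_elements is an assumption, and entity references and closing tags are not handled.

```python
import re

# An opening tag with optional attributes, followed by its text up to the next "<".
_ELEMENT_RE = re.compile(r"<(\w+)((?:\s+[^>]*)?)>([^<]*)")
_ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def split_elements(line):
    """Return a list of (element_name, attribute_dict, text) triples."""
    result = []
    for name, raw_attrs, text in _ELEMENT_RE.findall(line):
        result.append((name, dict(_ATTR_RE.findall(raw_attrs)), text))
    return result


def analyses(line):
    """Group the <MMt> tags under the preceding <MMl> lemma."""
    grouped = []
    for name, _attrs, text in split_elements(line):
        if name == "MMl":
            grouped.append((text, []))
        elif name == "MMt" and grouped:
            grouped[-1][1].append(text)
    return grouped
```

For the example line, analyses returns one lemma with two morphological tags, one per <MMt> element.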

Word-form root as internally defined by the morphological analyzer. Since the Czech morphological processor we use does not contain a phonological element, no root changes are possible during processing, and therefore the "root" as found here is a rather technical root, defined as the part of the word string which does not change within a paradigm (defined as the full or partial paradigm in the morphological dictionary).

Only one <R>/<E> pair is generated for every word form (<f>), i.e. for every set of <MMl>s. If there are multiple analyses of the input word form, resulting in different segmentation to root and ending (<E>), the shortest root wins (i.e., the longest ending wins).

It is unused in the PDT; it has some importance for checking purposes only and/or for experiments with language modeling in speech recognition for reducing the size of a dictionary.

Word-form ending as defined by the morphological analyzer. Together with the <R> (by concatenation) gives the original word form. See the description at the <R> SGML tag.
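The <R>/<E> convention (root plus ending gives the word form; with several possible segmentations the shortest root wins) can be sketched as below. The candidate roots are hypothetical inputs standing in for the analyzer's alternative analyses, and the function name pick_root_ending is an assumption.

```python
def pick_root_ending(form, candidate_roots):
    """Choose the (root, ending) pair for `form`.

    `candidate_roots` is a hypothetical list of roots proposed by different
    analyses; only roots that are prefixes of the form are considered.
    The shortest root wins, which is the same as the longest ending winning.
    """
    valid = [root for root in candidate_roots if form.startswith(root)]
    if not valid:
        return None
    root = min(valid, key=len)
    return root, form[len(root):]


pair = pick_root_ending("koupenou", ["koupeno", "koupen"])
```

By construction, concatenating the two parts of the result always gives back the original word form.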

The morphological tag of the current token (which can be found in the text part of <f> or <d>), manually disambiguated. The tagset is defined by the morphological dictionary used for preprocessing the data.

In the Prague Dependency Treebank (PDT), the following tagset system is currently in use. For more information, please refer to the PDT documentation.

Each tag is a 15-tuple of symbols (mostly uppercase letters and digits, but many lowercase and special symbols are used as well). Each single-character position contains a value from one morphological category. 13 categories are in fact fully used:
Position  Category name  Description
1         POS            Part of Speech
2         SUBPOS         Detailed Part of Speech
3         GENDER         Grammatical Gender (for agreement)
4         NUMBER         Grammatical Number (for agreement)
5         CASE           Morphological Case
6         POSSGENDER     Gender of Possessor
7         POSSNUMBER     Number of Possessor
8         PERSON         Person
9         TENSE          Tense
10        GRADE          Degree of Comparison
11        NEGATION       Negation
12        VOICE          Voice
13        RESERVE1       Reserved
14        RESERVE2       Reserved
15        VAR            Variant, Style, Register

For more information on the individual categories, especially the sets of possible values, please see the full Tagset documentation (psfile, pdffile) or the quick tagset reference (htmlfile, pdffile).
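Since every tag is a fixed-length string with one symbol per category, it can be unpacked by position. The sketch below uses the category names from the table above; the function names are assumptions, and the symbols are returned as-is because their meaning is defined only in the full tagset documentation.

```python
CATEGORIES = (
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE", "GRADE",
    "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
)


def decode_tag(tag):
    """Map each of the 15 positions of a tag to its category name."""
    if len(tag) != len(CATEGORIES):
        raise ValueError("expected a 15-symbol tag, got %r" % tag)
    return dict(zip(CATEGORIES, tag))


def differing_categories(tag_a, tag_b):
    """List the categories in which two tags differ."""
    a, b = decode_tag(tag_a), decode_tag(tag_b)
    return [name for name in CATEGORIES if a[name] != b[name]]


fields = decode_tag("AAFS4----1A----")
diff = differing_categories("AAFS4----1A----", "AAFS7----1A----")
```

Comparing two candidate tags of the same word form in this way shows quickly which category a tagger still has to disambiguate.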

Weight of manual disambiguation, as a number between 0 and 1. Meant for rare cases only, unused today.

A morphological tag as generated by the morphological analyzer. For a description of the tagset, see the <t> element.

Optional source of dictionary information (dictionary identification). In the Prague Dependency Treebank, the dictionary information can have three sources:

The src information from the <MMl src=...> attribute is just copied here.

A morphological tag as disambiguated (i.e., automatically selected from the list of <MMt>s) by the tagger. For a description of the tagset, see the <t> element.

The weight of the selected (disambiguated) tag as defined by the tagger used for disambiguation, as a number between 0 and 1. Currently unused.

Source of disambiguation information (tagger ID). In the Prague Dependency Treebank (PDT), there are two taggers:

See the documentation to the PDT for more information about the taggers.

Analytical (surface-syntactic) function, i.e., the type of dependency relation to the governing node of the current node (i.e., the <f> or <d> element). The list of those functions follows:

Function   Short Description
- - -      Not assigned
Pred       Predicate
Pnom       Nominal part of a predicate
AuxV       Auxiliary Verb
Sb         Subject
Obj        Object
Atr        Attribute
Adv        Adverbial
AtrAdv     Attribute and Adverbial (at Atr dep. position)
AdvAtr     Attribute and Adverbial (at Atr dep. position)
Coord      Coordination
AtrObj     Attribute and Object (at Atr dep. position)
ObjAtr     Attribute and Object (at Atr dep. position)
AtrAtr     Attribute and Attribute (at deeper Atr dep. position)
AuxT       Reflexive se as part of the lexeme
AuxR       Reflexive se
AuxP       Preposition
Apos       Apposition
ExD        Extra Dependency / Ellipsis above
AuxC       Subordinate conjunction
Atv        Attribute Verbal (Complement)
AtvV       Attribute Verbal, missing nominal governor
AuxO       Aux. referring pronoun
AuxZ       Rhematizer
AuxY       All the rest
AuxG       Graphical symbol (punctuation except commas)
AuxK       Final Punctuation of sentence
AuxX       Comma
AuxS       Sentence root (artificial extra node)
Generated  Added dependency, technical value
NA         Not Applicable
???        Unannotatable (extremely unclear case)

In addition to the above list, most of the functions can have the form function_suffix, where the suffix is one of the following:

For example, Obj_Co is an analytical function for coordinated object.
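The function_suffix form can be taken apart at the underscore. A minimal sketch, assuming that the base function names never contain an underscore themselves (true of the list above); the function name split_afun is hypothetical, and the suffix is not validated against any list.

```python
def split_afun(value):
    """Split an analytical function into (function, suffix or None)."""
    base, sep, suffix = value.partition("_")
    return base, (suffix if sep else None)
```

With this helper, Obj_Co separates into the function Obj and the suffix Co, while a plain Sb comes back with no suffix.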

For more information on the Prague Dependency Treebank, and its analytical level of annotation, see the Manual for Analytic Layer Tagging of the Prague Dependency Treebank (English translation) (or the Czech original).

Analytical (surface-syntactic) function, i.e., the type of dependency relation to the governing node of the current node (i.e., the <f> or <d> element), as assigned automatically by the machine. For a description of the functions used, see the <A> element.

Weight of the automatically assigned analytical function, as a number between 0 and 1.

Source of the automatically assigned analytical function (i.e., parser's or automatic function analyzer's ID code).

This is an "added" token, used for insertions of tokens well after tokenization is made, such as for tectogrammatical annotation "nodes" (e.g., restored ellipsis, fillers for valency slots, etc.). It is identical to the <f> element, except that many subelements do not make sense anymore and thus are not defined; of those that are still here (see below for a list), the semantics is identical to those at the <f> element.

Type of surface deletion (i.e., reason for adding this node to the tectogrammatical tree). See Manual for Tectogrammatic Tagging of the Prague Dependency Treebank for more information.

An enclosing element for markup of automatic tectogrammatical annotation. For n-best-output-producing parsers, it might be repeated as a whole. For the description of the trlemma itself, and for more information about the dependency linking, see the <TRl> element.

An enclosing element for markup of manual tectogrammatical annotation (for automatically assigned dependency structure and additional information, the <MTRl> element is used). The only markup which crosses the boundaries of this enclosure is the <TRg> element, which "points" to the <r> element, which is a common identification of the node ID for the analytical and tectogrammatical level.

The #PCDATA of this element is the so-called trlemma (for more info, see the Manual for Tectogrammatic Tagging of the Prague Dependency Treebank), or "tectogrammatical lemma". It might differ from the morphological lemma in several minor respects, such as added reflexive particle se for some verbs, more or less fixed phrases put together, etc.

Direct speech and/or quotation indication. Applies to the whole subtree.

A "pointer" to the governing node (head) for dependency relation at the tectogrammatical level (for more info on the tectogrammatical level, see Manual for Tectogrammatic Tagging of the Prague Dependency Treebank.) It "points" to the governing node with the corresponding <r> element (textual parts - #PCDATA - of the markup must match).

A pointer to 0 (zero) is allowed, and in fact required somewhere in every sentence, even though it is never present in any <r> element in the sentence; it denotes the "virtual" (or technical) extra root node of the whole sentence structure. I.e., the nodes pointing to 0 are the "linguistic" roots.

For a dependency link at the analytical level, see the <g> element.
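The pointer convention of <TRg> (each node names its governor by that node's <r> text, with 0 standing for the technical root) can be sketched on already-extracted values. The input mapping and the function names below are hypothetical; the values are kept as strings because the matching is defined on the textual parts of the markup.

```python
def linguistic_roots(governors):
    """Return the node ids whose governor pointer is "0".

    `governors` is a hypothetical dict mapping each node's <r> text to the
    text of its <TRg> element. Every non-zero pointer must name an existing
    node, and at least one node must point to "0".
    """
    for node, head in governors.items():
        if head != "0" and head not in governors:
            raise ValueError("node %s points to unknown node %s" % (node, head))
    roots = [node for node, head in governors.items() if head == "0"]
    if not roots:
        raise ValueError("no node points to 0")
    return roots


def children_of(governors):
    """Invert the pointers: governor id -> list of dependent ids."""
    tree = {}
    for node, head in governors.items():
        tree.setdefault(head, []).append(node)
    return tree


sentence = {"1": "2", "2": "0", "3": "2"}
roots = linguistic_roots(sentence)
tree = children_of(sentence)
```

The same two helpers would apply to analytical-level links if the pointers were taken from the <g> element instead.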

Tectogrammatical functor, a markup for the type ("function") of the dependency relation at the tectogrammatical level. (For the so-called analytical function, or the type of dependency link at the analytical level, see the <A> element.)

A table with short descriptions of the functors follows; for a more detailed and more up-to-date account, see the Manual for Tectogrammatic Tagging of the Prague Dependency Treebank.

Functor  Short description
ACT      Actor/Bearer (deep subject, first argument)
PAT      Patient (deep object, second argument)
ADDR     Addressee (third argument)
EFF      Effect (fourth argument)
ORIG     Origin (fifth argument)
ACMP     Accompaniment
ADVS     Adversative (coordination)
AIM      Aim
APP      Appurtenance
APPS     Apposition
ATT      Attitude (of speaker/writer)
BEN      Benefactive
CAUS     Cause
CNCS     Concession
COMPL    Complement
COND     Conditional
CONFR    Confrontation (coordination)
CONJ     Conjunction (coordination)
CPR      Comparison
CRIT     Criterion
CSQ      Consequence (coordination)
CTERF    Counterfactual (unreal condition)
DENOM    Denomination (no predicate in clause)
DES      Descriptive attribute
DIFF     Difference
DIR1     Direction 1 - from where
DIR2     Direction 2 - through (which way)
DIR3     Direction 3 - to where
DISJ     Disjunction (coordination)
ETHD     Ethical "dative"
EXT      Extent
EV       Empty verb
GRAD     Gradation (coordination)
HER      Heritage
INTF     Intensification
INTT     Intent
ID       Identity
LOC      Location (where)
MANN     Manner (general)
MAT      Material attribute
MEANS    Means
MOD      Modality
NORM     Norm
PAR      Parenthesis (w/o clausal function)
PREC     Reference to preceding text
PRED     Predicate (null dependency)
REAS     Reason (coordination)
REG      Regard
RESL     Result
RESTR    Restriction
RHEM     Rhematizer
RSTR     Restrictive attribute
SUBS     Substitution
TFHL     Time - for how long
TFRWH    Time - from when
THL      Time - how long
THO      Time - how often
TOWH     Time - to when
TPAR     Time - parallel events (contemporaneous)
TSIN     Time - since when
TTILL    Time - till when
TWHEN    Time - when (general)
VOC      Vocative sentence
VOCAT    Vocative, in apposition within a clause
NA       Not Applicable
SENT     Sentence root (invisible in the SGML markup)

As opposed to the analytical level functions (<A> element), the "member of (coordination, apposition, parenthesis)" suffix is not part of the functor; it is recorded separately as the <Tmo> element (even though in the graphical tools provided with the Prague Dependency Treebank, the information may be presented to the user in a form similar to the analytical level).

Tectogrammatical grammateme(s). Not yet specified - reserved for future use on the tectogrammatical level. For up-to-date information, see the Manual for Tectogrammatic Tagging of the Prague Dependency Treebank.

Member of (coordination, apposition, parenthesis). The following values are used:

Value  Short description
CO     Member of coordination
AP     Member of apposition
PA     Member of parenthesis
NIL    Not a member of anything of the above
NA     Not Applicable

Morphosyntactic tag as provided at the tectogrammatical level. In general, it differs from the morphological tag (<t>, <MMt> and <MDt>), since it does not contain any information which can be restored from other nodes and from other attributes using, for example, grammatical rules for grammatical agreement, or surface information from the valency dictionary (such as the preposition/case combination).

The tags use a positional system to pack several single-category morphosyntactic tags together. Each position is one symbol long, so the morphological categories are identified by position.

Position  Category              Values       Value abbreviations (if different and not from morphology)
1         Gender                [-MIFNX]     (identical, from morph.)
2         Number                [-SPX]       (identical, from morph.)
3         Degree of comparison  [-123X]      (identical, from morph.)
4         Tense                 [-SPAX]      SIM POST ANT
5         Aspect                [-PCRX]      PROC CPL CDN
6         Iterativeness         [-01X]       0 - NO, 1 - Iterative
7         Manner                [-IMCX]      IND IMP CDN
8         Deontmod              [-DBHVSPFX]  DECL DEB HRT VOL POSS PERM FAC
9         Sentmod               [-.!DM?]     ENUNC EXCL DESID IMPER INTER

In all positions, a dash ('-') means not applicable, and capital X means "not used" (usage differs depending on part of speech category, dependency type, etc.: for example, nouns do have information about gender, but verbs and adjectives do not; in fact, adjectives contain only degree of comparison information, since everything else can be generated automatically from the head).

For more and up-to-date information, see the Manual for Tectogrammatic Tagging of the Prague Dependency Treebank.
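
As an illustration of the positional scheme, the following Python sketch splits a nine-symbol tag into its categories and checks each symbol against the value sets from the table above. The names POSITIONS and decode_tag, and the sample tag, are hypothetical; this is not the validation performed by any Prague Dependency Treebank tool.

```python
# (category, allowed symbols) per position, copied from the table above.
POSITIONS = [
    ("Gender", "-MIFNX"),
    ("Number", "-SPX"),
    ("Degree of comparison", "-123X"),
    ("Tense", "-SPAX"),
    ("Aspect", "-PCRX"),
    ("Iterativeness", "-01X"),
    ("Manner", "-IMCX"),
    ("Deontmod", "-DBHVSPFX"),
    ("Sentmod", "-.!DM?"),
]


def decode_tag(tag):
    """Return a {category: symbol} dict for a nine-position tag string."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} symbols, got {len(tag)}")
    decoded = {}
    for position, (symbol, (category, allowed)) in enumerate(
        zip(tag, POSITIONS), start=1
    ):
        if symbol not in allowed:
            raise ValueError(
                f"position {position} ({category}): {symbol!r} not in [{allowed}]"
            )
        decoded[category] = symbol
    return decoded


# Made-up sample tag: feminine, singular, first degree, the rest not applicable.
decoded = decode_tag("FS1------")
```

A dash in the result means "not applicable" for that category, as described above.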

Topic-focus articulation. Possible contents:

Value  Short description
t      Topic
c      Contrastive topic
f      Focus
-      Not Applicable
X      Not assigned (default)

It should be noted that in addition to these values, the underlying word order based on so-called communicative dynamism, which is closely related to topic and focus, is represented by the "deep" node order attribute (see the <tfr> element).

Markup for deep (tectogrammatical) word order, which is based on communicative dynamism and related to topic and focus (see the <tfa> element). Every node (as represented by the <f>, <fadd> and (exceptionally) <d> elements) gets a numerical ID. Since it is numerical (even though not necessarily an integer), an order is well defined; this order is then used for the left-to-right ordering of nodes on every level in the tree. It is a partial ordering in the sense that a total ordering is not required (although it can always be extended to a total one, in many ways). In practice, the ordering is total.
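
A minimal Python sketch of such an ordering follows. It assumes the node IDs (<r>) and deep order values (<tfr>) of the nodes to be ordered are available as (r, tfr) string pairs; the function name deep_order and the sample values are hypothetical.

```python
from decimal import Decimal


def deep_order(nodes):
    """Return node IDs sorted left to right by their deep order value.

    `nodes` is a list of (r, tfr) string pairs (hypothetical input). Decimal
    is used because the values need not be integers. Python's sort is
    stable, so nodes with equal values keep their input order, which is one
    possible way of extending a partial order to a total one.
    """
    return [r for r, _ in sorted(nodes, key=lambda node: Decimal(node[1]))]


# Made-up values: node 3 was placed between nodes 2 and 1.
ordered = deep_order([("1", "2"), ("2", "1"), ("3", "1.5")])
```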

"Functional word" markup. Usually prepositions are stored here. Theoretically, they are not really part of the tectogrammatical markup, but they are provided for comparison purposes and for easier handling of certain aspects of machine learning.

Reserved for phraseme identification.

Lemma (<trlemma>) of the node that is in the coreference relation with the current node. Redundant.

Functor (<T>) of the node that is in the coreference relation with the current node. Redundant.

Node identifier (<r>) of the node that is in the coreference relation with the current node. Identifies the node uniquely and sufficiently within the sentence identified by the <cors> element.

The <cors> element should be empty or contain a sentence ID; it is a coreference pointer to the sentence with that ID. If it is empty, the current sentence is assumed. The rel attribute can be used to specify a relative "distance" of the coreferenced sentence from the sentence referenced by the ID.

Relative sentence distance (non-negative, possibly non-integer number) for coreference identification. Counts backwards from the current sentence. Default is 0 (current sentence).
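
One possible reading of the sentence lookup is sketched below in Python. It assumes that sentence IDs are available as a list in document order, that the distance counts backwards from the sentence named in <cors> (or from the current sentence if <cors> is empty), and that only whole-number distances need handling; the meaning of a non-integer distance is not defined here, so the sketch rejects it. The function name resolve_sentence and the sample IDs are hypothetical.

```python
def resolve_sentence(sentence_ids, current_id, cors="", rel="0"):
    """Return the ID of the sentence that holds the coreferred node.

    sentence_ids: sentence IDs in document order (hypothetical input).
    current_id:   ID of the sentence containing the referring node.
    cors:         content of <cors>; empty means the current sentence.
    rel:          relative distance, counted backwards (assumed reading).
    """
    base_id = cors if cors else current_id
    base_index = sentence_ids.index(base_id)  # ValueError if the ID is unknown
    distance = float(rel)
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if not distance.is_integer():
        raise ValueError("this sketch handles whole-number distances only")
    target_index = base_index - int(distance)
    if target_index < 0:
        raise ValueError("distance reaches before the first sentence")
    return sentence_ids[target_index]


ids = ["s1", "s2", "s3"]  # made-up sentence IDs
same = resolve_sentence(ids, "s3")
two_back = resolve_sentence(ids, "s3", rel="2")
from_named = resolve_sentence(ids, "s3", cors="s2", rel="1")
```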

Unique numerical token ID within a sentence (<s>). Its numerical value order corresponds to the original surface word order in the sentence. It is also used as a destination of the "governing node pointer" (<g> and <TRg>, for analytical and tectogrammatical levels, respectively). Each token (<f>, <fadd> and <d>) has exactly one token ID (on analytical and tectogrammatical levels, but not necessarily on the lower levels).

A "pointer" to the governing node (head) for dependency relation at the analytical level (for more info on the analytical level, see Manual for Analytic Layer Tagging of the Prague Dependency Treebank (English translation) (or the Czech original)). It "points" to the governing node with the corresponding <r> element (textual parts - #PCDATA - of the markup must match).

A pointer to 0 (zero) is allowed, and in fact required somewhere in every sentence, even though 0 never appears in any <r> element in the sentence; it denotes the "virtual" (or technical) extra root node of the whole sentence structure. That is, the nodes pointing to 0 are the "linguistic" roots.

For a dependency link at the tectogrammatical level, see the <TRg> element.
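
Following the pointers upward from any node should eventually reach the technical root 0. The Python sketch below walks that chain for the analytical level, assuming the <r> and <g> values of one sentence have been collected into a dictionary mapping each node ID to its governor; the name path_to_root and the sample data are hypothetical.

```python
def path_to_root(heads, node):
    """Follow <g> pointers from `node` up to the technical root "0".

    `heads` maps each <r> value to its <g> value, both as strings
    (hypothetical input). Raises ValueError on a dangling pointer or a cycle.
    """
    path = [node]
    seen = {node}
    while heads[path[-1]] != "0":
        parent = heads[path[-1]]
        if parent not in heads:
            raise ValueError(f"node {path[-1]} points to unknown node {parent}")
        if parent in seen:
            raise ValueError(f"cycle detected at node {parent}")
        path.append(parent)
        seen.add(parent)
    path.append("0")
    return path


# Made-up sentence: node 2 is the linguistic root.
heads = {"1": "2", "2": "0", "3": "2", "4": "3"}
chain = path_to_root(heads, "4")
```

A chain that never reaches 0 indicates markup that does not form a tree.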

Governing node pointer, machine generated, e.g., by an analytical-level parser or by some other automatic means. Used at the analytical level. For description of analytical dependency markup, see its manually assigned counterpart (the <g> element).

The weight of the selected governing node pointer as defined by the parser, as a number between 0 and 1. Currently unused.

Source of parser (parser ID). Currently unused.

List of idioms by reference. Unused so far.

An idiom, as part of the <idioms> element. Unused so far.

Reference to part of an <idiom>. Unused so far.