This Document Type Definition specifies the Czech National Corpus SGML markup scheme, as used (with extensions) in various derived corpora, most notably in the Prague Dependency Treebank. The pure DTD file is called csts.dtd and the declaration file is called csts.dcl; there is also a version for direct use with the nsgmls software (csts.doctype), and a description file which can be fed into the dtd2html program, csts.desc, which produced these HTML documentation pages.
The SGML document name. It contains an (optional) <h>eader, with bibliographic (source-of-text) and annotation/annotator information, and a number of <doc>uments (sometimes just one, sometimes hundreds) which share the same header and which contain the text proper. It cannot be omitted, not even at the end of the SGML document (as most other embedded tags can).
It has only one attribute (lang), for the specification of the default language of the contents; the language can be specified in the <a> tag within each <doc>ument differently.
The default language for the whole SGML document. Can be redefined in the <a> tag within each <doc> element.
An optional file header, as the first element of the <csts> element. It contains information about the source of the file, and about the markup performed on the file so far.
Short identification of the source of the data. Usually contains a publisher's common name, such as "Lidove noviny". Please note that this element's closing tag cannot be omitted.
Default markup information for the file. Might be (logically) superseded by the element of the same name within <doc> / <a>. It is meant to contain the name of the (main) author of the markup, the date, and a description of the markup performed on the file, whether automatically or manually. The latest markup information comes first in the sequence of markup elements.
Author of markup. Human-readable full name(s) of the author(s) (or of the main author if too many people were involved), or of person(s) who can provide further information, documentation and/or software concerning the markup.
Free-format specification of the date/time the markup was performed. Any human-readable format of the specification is acceptable, such as "01-Dec-1997", "1998-11-30", "Fri Oct 1 10:41:18 1999", or even "spring 1997". Regionally-specific (and thus globally-ambiguous) date formats should be avoided (such as "7/4/2001").
Free-text description of what has been done to the data. English is preferred as the language of the description. Several mdesc elements might be present within a <markup> element. Even though everything is optional, useful information should be recorded here, such as additional people involved, software used (with version id, parameter settings, etc.), and/or the environment in which the processing has occurred.
A document. Each file contains one or more documents (typically, there is one document per file for books, ephemerals, poems, etc., but possibly hundreds of documents per file for a newspaper, where one file contains the whole daily issue, and each document corresponds to an article.)
A document is identified by a (numerical) id attribute (documents are simply numbered within a file, starting at 1). For ease of local reference, the filename in which the document resides is repeated at every document in the file in the file attribute. Full path to the archive is used for the file reference, even though care has been taken to uniquely identify all files in the CNC (and thus, in the PDT as well) just by the filename.
The document contains one header (<a>) and its contents (<c>). The header contains information about the genre, time period, and other bibliographical and classification information as well as additional markup processing information (if any). The contents then contains a sequence of paragraphs (<p>) and sentences (<s>) within the paragraphs containing the linguistic material proper.
A filename in which the document resides (in the case of texts from the CNC, it is the filename under which the document is stored in the "Bank of the CNC"). The full path from the archive root should be included (and again, it is included for files coming from the CNC). If the filename is unique, the filename alone should be sufficient; in the case of the CNC, however, even though the filenames are unique, the full path is given nevertheless, and it has the following structure:
s/inf/nws/1994/ln94164
A document identification (a decimal number). Documents are identified uniquely within a file.
Documents are initially numbered continuously and densely (starting at 1 in a file), but some of them might disappear or be added during subsequent markup and processing. Therefore, nothing is guaranteed but the fact that the attribute's value is a decimal number.
Document header. The <markup> subelement has the same semantics as the one in the file header (<h>), and it is considered to contain markup information additional to that in the file header's <markup> subelement.
Corpus type:
s | current-epoch ("modern") language written corpus |
d | diachronic ("old") language written corpus |
o | oral (spoken) corpus (transcribed speech) |
Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.
Text type:
ima | imaginative |
inf | informative |
mix | mixture |
Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.
Text genre:
son | song lyrics |
ver | verse |
nov | novels and stories |
crm | crime fiction |
sci | science fiction |
adv | romance (adventure) |
ero | erotica and pornography |
bio | (auto)biography, memories, letters, diaries |
tra | travels (by non-experts) |
tab | yellow, fallen lit. (bulvar) |
fab | myths, legends, fables, tales |
hum | humor, satire, parody, jokes |
jun | literature for children and youth |
ess | essays, sketches, columns |
chr | chronicles, annals, yearbooks |
exc | eccentric literature |
dra | dramas, sets, tv series, radio |
mus | music |
tvf | television, movies |
jur | justice, criminology |
his | history, expert biographies |
psy | psychology |
edu | education, training, teaching, edification |
soc | sociology |
mil | military |
phi | philosophy |
art | visual arts, architecture, applied arts |
the | theater, ballet |
pol | politology |
lit | literature |
lin | linguistics |
eth | ethnography, anthropology |
agr | agriculture, forestry, breeding, raising |
med | medicine |
bio | biology, botany, zoology |
che | chemistry |
mat | mathematics, logic |
ggr | geography, travels (by experts) |
phy | physics, astronomy |
geo | geology, meteorology, hydrology |
ind | industry, technology, building, energy, transportation, crafts |
inf | information, computer science |
eco | economy, business, banking |
adm | administration, government, management, parliament |
rel | religion, theology |
hou | household economy (boarding, lodging, clothing) |
spo | sports |
sct | society (manners, gossips) |
amu | amusement, games |
In the texts in the Prague Dependency Treebank, very few genres are actually marked, even though theoretically they can be identified (such as sports).
Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.
Verse type:
txb | textbook |
enc | dictionary, encyclopedia |
pop | popular style |
cri | critique |
adv | advertisement |
Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.
Medium:
b | book |
nws | newspaper |
j | journal (periodic) |
scr | screenplay |
net | Internet |
oc | occasional |
Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.
Author's sex:
y | default; not known |
m | male |
f | female |
Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.
Text language:
cze | Czech |
eng | English |
slo | Slovak |
ger | German |
fre | French (France) |
spa | Spanish |
rus | Russian |
Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.
Translator's sex:
na | default if not a translation (srclang=no) |
x | default if translation; not known |
m | male |
f | female |
Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.
Source text language (for translations only):
no | Default |
eng | English |
slo | Slovak |
ger | German |
fre | French (France) |
spa | Spanish |
rus | Russian |
Please refer to the ucnk.ff.cuni.cz web page for current documentation on this element's contents.
Year of publication. Full four digits required (1898, 1960, 2001).
Year of first edition publication. Full four digits required (1898, 1960, 2001). If first edition publication date is not known, the value x is used (the default).
Author's identification (not a full name). Every author included in the corpus is assigned a max. 8-character identification (see ucnk.ff.cuni.cz web page for current documentation on the full name <-> author ID correspondence table).
Translator's identification (not a full name). Every author/translator included in the corpus is assigned a max. 8-character identification (see ucnk.ff.cuni.cz web page for current documentation on the full name <-> author ID correspondence table).
The filename. Every file within a corpus is assigned a max. 8-character name which is also a valid filename (a least common denominator for filename convention across file systems is used here for a definition of a valid filename). A copy of the file attribute of the doc element.
Various additional conventions are used within the CNC. For example, all Lidove Noviny newspaper filenames (one file corresponds to one day, with documents (<doc>) corresponding to articles) are formed using the following template: lndyyxxx, where lnd identifies the main ("daily") portion of the newspaper, yy is the year, and xxx is the day-of-the-year number. Similarly, Mlada Fronta daily newspaper files are assigned names according to the following scheme: mfyymmdd, using the full date (instead of the day-of-the-year shortcut) for identifying the day on which it was published. Novels and other non-periodic text usually get a mnemonic filename reminiscent of the original title or author's name.
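The two newspaper templates above can be decoded mechanically. The following is a minimal Python sketch, not part of the CNC tooling: the function name issue_date is hypothetical, and the century argument (default 1900) is an assumption, since the two-digit year in the filename does not state the century.

```python
import re
from datetime import date, timedelta

def issue_date(filename, century=1900):
    """Decode the publication date from an lndyyxxx or mfyymmdd filename.

    The century is NOT encoded in the filename; 1900 is an assumed default.
    """
    m = re.fullmatch(r"lnd(\d{2})(\d{3})", filename)
    if m:
        year = century + int(m.group(1))
        day_of_year = int(m.group(2))
        if day_of_year < 1:
            raise ValueError(f"day-of-year must be at least 1: {filename!r}")
        # Day 1 is January 1, so add (day_of_year - 1) days.
        return date(year, 1, 1) + timedelta(days=day_of_year - 1)
    m = re.fullmatch(r"mf(\d{2})(\d{2})(\d{2})", filename)
    if m:
        yy, mm, dd = (int(g) for g in m.groups())
        return date(century + yy, mm, dd)
    raise ValueError(f"unrecognized filename template: {filename!r}")

print(issue_date("lnd94164"))
print(issue_date("mf940613"))
```

Both calls refer to the same day, which illustrates how the day-of-the-year shortcut and the full-date scheme relate to each other.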
Document identification (a 3-digit number, with leading zeros if necessary) within a file. A copy of the id attribute of the doc element.
The contents element, which follows the document header (<a>). It can also be interpreted as a "chapter" or "division" or any "super-paragraph" unit, if you like, since it can be repeated. However, in the CNC (and therefore in the PDT, too), it is never repeated, i.e. there is always one <c> per document (<doc>).
The paragraph element. It is marked only if it was apparent in the source data where the paragraph breaks are. Therefore it is possible that there are files with no paragraph breaks whatsoever (except one which is compulsory within the <c> element). This can happen even in data from one data source, acquired over a longer period of time.
The paragraph number. Paragraphs are numbered starting typically from 1, even though it is not compulsory.
Paragraphs are initially numbered continuously and densely, but during subsequent markup and processing, some of them might disappear or be added. Therefore, nothing is guaranteed but the fact that the attribute's value is a decimal number.
A sentence. Sentence boundaries are identified at tokenization time, unless they are marked in the source, which is almost never the case. The algorithm for sentence boundary identification used in the CNC is very rudimentary: it is correct only about 95-98% of the time for general texts, and its accuracy depends very heavily on the type of the text.
A sentence identification. Sentences are identified uniquely within the CNC corpus (as they should be in any corpus). The identification has the following form:
filename:docid-pXXXsYYY
where docid is the value of the id attribute of the <doc> element, XXX is the paragraph number (from the n attribute of the <p> element), and YYY is the sentence id number (from the id attribute of the <s> element). Thus a typical sentence id attribute is
id="s/inf/nws/1994/ln94164:001-p2s3"
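The identifier layout above can be unpacked with a regular expression. This is a minimal Python sketch; the function name parse_sentence_id and the returned field names are hypothetical, and it assumes that the filename part itself never contains a colon.

```python
import re

# Pattern for "filename:docid-pXXXsYYY"; assumes the filename contains no colon.
SENTENCE_ID = re.compile(
    r"^(?P<file>[^:]+):(?P<doc>\d+)-p(?P<par>\d+)s(?P<sent>\d+)$"
)

def parse_sentence_id(sid):
    """Split a sentence id into its file, document, paragraph and sentence parts."""
    m = SENTENCE_ID.match(sid)
    if m is None:
        raise ValueError(f"not a sentence id: {sid!r}")
    return {
        "file": m.group("file"),
        "doc": m.group("doc"),  # kept as a string to preserve leading zeros
        "paragraph": int(m.group("par")),
        "sentence": int(m.group("sent")),
    }

print(parse_sentence_id("s/inf/nws/1994/ln94164:001-p2s3"))
```

The document id is left as text because the header copy of it is written with leading zeros, while the paragraph and sentence numbers are converted to integers for convenience.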
Alternate markup for alternate annotation of a sentence. Contains a copy of the original sentence with different markup. NB: this element is embedded within the (original) <s> element; it is not parallel to it. It has identical contents, except that another salt element is not allowed within it; instead, all salt elements are listed next to each other.
Source "formatting" information. Most programs (tools), such as the morphological analyzer, preserve this information on the output. It can thus be used for transferring information from the input of such tools to the output, untouched by the processing.
The "main" word (token) element. Contains the word form from text, and then elements associated with the word form, such as lemma and tag (manual, dictionary possibilities, machine generated by various taggers), or governing node and analytical function (again, manual and/or automatic) on the analytical level, and governing node, functor and grammateme(s) on the tectogrammatical annotation level (yet again, possibilities exist to encode both manual and automatically assigned values; see also the description of the <fadd> element).
The attribute case contains an indication of the token's capitalization pattern, even though the actual capitalization from the original text is preserved, too. Only five types of capitalization are recognized and marked:
The <f> element is in most cases identical to the appearance of the word form in the original text. In case of any discrepancy (such as an obvious spelling error, multiword or split phrases detected at tokenization time), the <w> element(s) is(are) used, preceding the <f> element(s); in such cases, the attribute case containing the substring gen is present in the <f> tag. Obviously, some of those discrepancies could have been discovered only in the manually annotated data; therefore, it is not guaranteed that e.g. spelling errors are marked in all data.
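Since the subelements simply follow the word form, usually without closing tags, a single token record can be read with a small regular expression. The sketch below is only an illustration in Python, not a real SGML parser: the function name split_elements is hypothetical, it assumes the text never contains a literal "<", and it reuses the koupenou record shown in the pattern-name description of this document.

```python
import re

# One element = an opening tag, optional attributes, and the text up to the next tag.
ELEMENT = re.compile(r"<(\w+)([^>]*)>([^<]*)")

def split_elements(line):
    """Return (element name, text) pairs in document order; attributes are dropped."""
    return [(name, text) for name, _attrs, text in ELEMENT.findall(line)]

line = ('<f>koupenou<MMl src="ad">koupený_^(něco_sobě/někomu)_(*3it)'
        '<pn>s2:y<MMt src="ad">AAFS4----1A----'
        '<pn>s2:y<MMt src="ad">AAFS7----1A----')

for name, text in split_elements(line):
    print(name, text)
```

For real data, a validating SGML parser driven by the DTD (such as nsgmls, mentioned above) remains the safer choice, because it resolves omitted tags according to the declaration rather than by pattern matching.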
The form of the original token as found in the original source of the text. Its text (#PCDATA) is in most cases identical to the initial text (#PCDATA) of the <f> element, in which case it can be completely omitted. Otherwise it must immediately precede the corresponding "normalized" <f> element(s).
It is used in the following cases:
In the following description, "immediately following ... element" means a following <f> and/or <d> element without an intervening SGML tag except (possibly) for a <D> and/or <i> element(s).
Original string (text token) is a contracted form, such as isn't in English. In Czech, it only appears for the following forms and/or tags:
Original form of a misspelling, including erroneously split or joined forms. It can only be used for truly corrected forms (manually or automatically; currently, the texts of PDT are only manually corrected).
The immediately following <f> and/or <d> element(s) always contain the correct spelling(s) in its #PCDATA and the (sub)string gen in the case (or type) attribute(s).
Original form which is for some technical reason superfluous, but which could not be removed by the tokenizer without too much specific processing. Neither <f> nor <d> element(s) are present in the following text.
This <w> element has always empty text (#PCDATA), since it signifies a token which is for some technical reason missing in the original text. It is used solely for the purpose of easy and consistent identification of the "artificially" generated following <f> and/or <d> element(s).
Original form of a single fixed phrase (in the linguistic sense). There is always more than one element <w phrpart> immediately preceding a single <f> element, which then always has the gen.phrase (sub)string in its case attribute and contains the complete phrase in its #PCDATA (usually, spaces in the original text are replaced by the "equal sign" characters).
Original form of an automatically "normalized" number. Two phenomena can be normalized:
In other words, numbers are always in their mathematical notation at the <f num.gen> element (unless spelled out as numerals). As usual, the normalized number always immediately follows this <w num.orig> element.
Punctuation markup. Used throughout the corpus, including the Czech National Corpus. Counted as one token, but not as a word proper. For all intents and purposes, it behaves like a word (see the <f> element): it contains the same subelements as the <f> element does, including elements for lemmas, morphological and syntactic tags, etc., to make further processing simple. The only difference is in the set of possible type attribute values: the set for this punctuation markup is much smaller than that for the <f> element.
Type of the <d> element. Usually empty (corresponds to the std value). The value gen is used for added <d> elements not present in the original input data (e.g., for typos in the part of the corpus that has been manually corrected).
Used to signal "no space present" between two tokens (usually between an <f> element and a <d> element, or between two <d> elements). However, in the manually annotated data, it can also appear between two <f> elements. In any case, it signals that between the immediately preceding element and the immediately following element there was no space in the original source text as received from the text supplier. (This might be important, for example, for more reliable abbreviation or hyphenated compound identification.) Conversely, if two input tokens (i.e., two <d>/<f> elements) have been separated by one or more white-space characters in the original textual material, there is no explicit markup in the data (since for most languages, this is the default).
For most applications, all the <D> tags can safely be ignored. (However, the standard morphological processing - such as in the PDT - does take this tag into account.)
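One application that cannot ignore the <D> tags is the reconstruction of the original spacing. The following Python sketch assumes the token stream has already been extracted into a list in which a hypothetical sentinel, NO_SPACE, stands for each <D> element; the function name detokenize is also hypothetical. It writes a single space wherever no marker is present, so runs of several white-space characters in the source are not recovered.

```python
NO_SPACE = object()  # hypothetical stand-in for a <D> element in the token stream

def detokenize(items):
    """Rebuild surface text: tokens are separated by one space
    unless a NO_SPACE marker sits between them."""
    pieces = []
    glue = True  # True: attach the next token without a leading space
    for item in items:
        if item is NO_SPACE:
            glue = True
            continue
        if not glue:
            pieces.append(" ")
        pieces.append(item)
        glue = False
    return "".join(pieces)

print(detokenize(["Praha", NO_SPACE, ",", "1994", NO_SPACE, "."]))
```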
Optional element for all-uppercase "speech-like transcription" of the original data tokens (i.e., of the <f> elements) into "pronounced" form. It is performed for the following token types:
All other tokens are only transformed to all-uppercase and otherwise left as they were in the original data. This element is never used for punctuation unless the punctuation is commonly pronounced in some way.
In all cases, regular pronunciation is not transcribed in this way; it is considered to be recorded in a pronunciation dictionary as usual. This tag is meant to unify the actual pronunciation if, e.g., the text is read aloud.
OBSOLETE. Formerly, a list of all tags assigned by morphology, regardless of lemma, i.e. the union of the sets of tags of all lemmas, separated by slashes (i.e. a slash could not have been part of a tag).
Lemma as defined by a morphological dictionary, manually disambiguated. It has the same format also in the <MMl> and <MDl> text fields.
In the Czech data provided in the Prague Dependency Treebank, and partially also in the Czech National Corpus (CNC), the formal structure of the lemma is described in the remaining part of this section.
The lemma string includes an optional sense ID (a decimal number separated from the lemma by a single dash symbol, such as -3), followed optionally by syntactic, semantic and style tags, derivational information and a comment (or any combination of those) marked in a non-SGML way: each tag is only one letter long, and it is attached to the lemma by an underscore and a single markup symbol:
Syntactic tags
Syntactic tags have been used formerly for alternate part of speech for some words, but are not used today except for verb aspect distinction for regular verbs (T, W symbols). Part of speech symbols can be always found in the associated morphological tag (<t>, <MMt>, <MDt>), and the abbreviation information from the tag (8 in its VAR (last) column) has precedence over the B designation here.
N | Noun |
J | Adjective |
A | Adjective |
Z | Pronoun |
T | Imperfective verb |
W | Perfective verb |
V | Verb (aspect not specified) |
M | Numeral |
C | Conjunction |
D | Adverb |
P | Preposition |
F | Interjection |
I | Particle |
B | Abbreviation |
Q | Unused |
X | Unused |
Semantic tags
G | Geographical name |
Y | Person's first (given) name |
S | Person's family name |
E | Names of members of nations, cities, ethnic groups etc. |
R | Product name |
K | Organization name |
m | Other proper name |
H | Chemistry |
U | Medicine |
L | Natural Sciences |
j | Law, Legal |
g | General Technical term |
c | Electronics, Computers |
y | DIY, travel, free time |
b | Economy and Finances |
u | Culture, Education, Arts, other science |
w | Sports |
p | Politics, Government, Military |
z | Environment |
o | Colors |
Style tags
s | Bookish |
a | Archaic |
n | Dialect |
h | Colloquial (not tolerated in the standard) |
e | Expressive |
l | Slang |
v | Vulgar (extremely expressive) |
t | Foreign-language word |
x | Parallel spelling/form, do not use for morph. generation |
Derivation information, general comment
Derivation information and general comment (introduced by the caret symbol, ^) are furthermore always contained within a set of parentheses.
Within the parentheses, derivation information always starts with a star (*) as a distinguishing symbol (vs. a general comment), optionally preceded by a derivation type formed by the symbol ^ (caret) and a two-letter code. After the star symbol, a "rule" follows which describes how to get the (underlying) lemma which the current lemma has been derived from. The rule has two parts:
If a star (*) is used in the deletion part, the to-be-appended part which follows contains the complete original lemma. Otherwise, the number of symbols (including the sense ID if any) designated by the deletion part should be stripped off before attaching the to-be-appended part to form the original lemma.
Examples:
Eventually, all proper derivations should have the ^XX derivation present; all remaining comments starting with a star (*) and containing the string transformation rule described above will be considered synonyms, not derivations.
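The deletion-and-append rule described above can be sketched in a few lines of Python. The function name apply_derivation_rule is hypothetical, and the sketch assumes that the deletion part is written as decimal digits immediately followed by the to-be-appended part (as in the rule *3it that accompanies the lemma koupený in this document), so an appended part that itself starts with a digit would be misread.

```python
import re

def apply_derivation_rule(lemma, rule):
    """Return the underlying lemma, given the rule text that follows the leading star.

    `rule` is the deletion part (a decimal count, or "*") immediately followed
    by the to-be-appended part, e.g. "3it". `lemma` includes the sense ID, if any,
    because the count covers it as well.
    """
    if rule.startswith("*"):
        # Star in the deletion part: the rest is the complete original lemma.
        return rule[1:]
    m = re.match(r"(\d+)(.*)$", rule, re.DOTALL)
    if m is None:
        raise ValueError(f"unrecognized rule: {rule!r}")
    count, suffix = int(m.group(1)), m.group(2)
    if count > len(lemma):
        raise ValueError("deletion count exceeds lemma length")
    # Strip `count` symbols from the end, then attach the appended part.
    return lemma[: len(lemma) - count] + suffix

print(apply_derivation_rule("koupený", "3it"))
```

The count is applied to characters, so the lemma should be in a composed Unicode form in which each accented letter is a single code point.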
Lemma as defined by a morphological dictionary, all possibilities (context insensitive). Its text (#PCDATA) part has the same format and contents as the <l> text field.
Optional source of dictionary information (dictionary identification). In the Prague Dependency Treebank, the dictionary information can have three sources:
Lemma as defined by a morphological dictionary, automatically disambiguated by a tagger. Its text (#PCDATA) part has the same format and contents as the <l> text field.
The weight of the selected (disambiguated) lemma as defined by the tagger used for disambiguation, as a number between 0 and 1. Currently unused.
Source of disambiguation information (tagger ID). In the Prague Dependency Treebank (PDT), there are two taggers:
Pattern (morphological paradigm) name, optionally output by the morphological analyzer. It contains two parts, separated by a colon:
For example, if the dictionary pattern name is s2, and the actual ending pattern is y, the markup is <pn>s2:y.
The source dictionary pattern name (s2 in the above example) is the one which is manually inserted to the source main dictionary when adding a new word (dictionary entry). However, this pattern in general does not correspond directly to the actual ending database pattern name, since it is first processed by the [regular] derivation module (which is part of the morphological processor), and replaced by another "virtual" entry (or even expanded to a set of virtual dictionary entries), with possibly different root and lemma, which then uses a matching pattern name from the set of patterns from the ending database. This whole process is done on-the-fly and does not entail any special preprocessing steps other than regular dictionary compilation for the morphological analyzer.
The pattern name(s) have no importance for further linguistic processing, but they might be interesting for other reasons. No set of pattern names is generally provided with the data and this description (as opposed to, e.g., the morphological tags), but it might be made available on request.
Complete example:
<f>koupenou<MMl src="ad">koupený_^(něco_sobě/někomu)_(*3it)<pn>s2:y<MMt src="ad">AAFS4----1A----<pn>s2:y<MMt src="ad">AAFS7----1A----
Word-form root as internally defined by the morphological analyzer. Since the Czech morphological processor we use does not contain a phonological element, no root changes are possible during processing, and therefore the "root" as found here is a rather technical root, defined as the part of the word string which does not change within a paradigm (defined as the full or partial paradigm in the morphological dictionary).
Only one <R>/<E> pair is generated for every word form (<f>), i.e. for every set of <MMl>s. If there are multiple analyses of the input word form, resulting in different segmentation to root and ending (<E>), the shortest root wins (i.e., the longest ending wins).
It is unused in the PDT; it has some importance for checking purposes only and/or for experiments with language modeling in speech recognition for reducing the size of a dictionary.
Word-form ending as defined by the morphological analyzer. Concatenated with the <R> element, it gives the original word form. See the description at the <R> SGML tag.
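The "shortest root wins" rule stated for the <R>/<E> pair can be illustrated as follows. This is a Python sketch only: the function name pick_segmentation is hypothetical, and the two candidate segmentations of koupenou are invented for the example rather than taken from the analyzer's output.

```python
def pick_segmentation(form, candidates):
    """Choose the (root, ending) pair with the shortest root.

    Only pairs whose concatenation gives back the word form are considered.
    """
    valid = [(root, ending) for root, ending in candidates if root + ending == form]
    if not valid:
        raise ValueError(f"no candidate reproduces {form!r}")
    return min(valid, key=lambda pair: len(pair[0]))

# Hypothetical competing analyses of one word form.
candidates = [("koupeno", "u"), ("koupen", "ou")]
print(pick_segmentation("koupenou", candidates))
```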
The morphological tag of the current token (which can be found in the text part of <f> or <d>), manually disambiguated. The tagset is defined by the morphological dictionary used for preprocessing the data.
In the Prague Dependency Treebank (PDT), the following tagset system is currently in use. For more information, please refer to the PDT documentation.
Each tag is a 15-tuple of symbols (mostly uppercase letters and digits, but many lowercase and special symbols are used as well). Each single-character position contains a value from one morphological category. 13 categories are in fact fully used:
Position | Category name | Description |
---|---|---|
1 | POS | Part of Speech |
2 | SUBPOS | Detailed Part of Speech |
3 | GENDER | Grammatical Gender (for agreement) |
4 | NUMBER | Grammatical Number (for agreement) |
5 | CASE | Morphological Case |
6 | POSSGENDER | Gender of Possessor |
7 | POSSNUMBER | Number of Possessor |
8 | PERSON | Person |
9 | TENSE | Tense |
10 | GRADE | Degree of Comparison |
11 | NEGATION | Negation |
12 | VOICE | Voice |
13 | RESERVE1 | Reserved |
14 | RESERVE2 | Reserved |
15 | VAR | Variant, Style, Register |
For more information on the individual categories, especially the sets of possible values, please see the full Tagset documentation (psfile, pdffile) or the quick tagset reference (htmlfile, pdffile).
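Because every position of the tag holds exactly one category, a tag can be unpacked by pairing it with the category names from the table above. The Python sketch below uses a hypothetical function name, decode_tag, and one of the tags from the koupenou example in this document; it does not interpret the individual values, for which the tagset documentation should be consulted.

```python
# Category names in positional order, as listed in the table above.
CATEGORIES = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE", "GRADE",
    "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
]

def decode_tag(tag):
    """Map each category name to the single symbol at its position."""
    if len(tag) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} symbols, got {len(tag)}")
    return dict(zip(CATEGORIES, tag))

decoded = decode_tag("AAFS4----1A----")
print(decoded["POS"], decoded["GENDER"], decoded["NUMBER"], decoded["CASE"])
```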
Weight of manual disambiguation, as a number between 0 and 1. Meant for rare cases only, unused today.
A morphological tag as generated by the morphological analyzer. For a description of the tagset, see the <t> element.
Optional source of dictionary information (dictionary identification). In the Prague Dependency Treebank, the dictionary information can have three sources:
A morphological tag as disambiguated (i.e., automatically selected from the list of <MMt>s) by the tagger. For a description of the tagset, see the <t> element.
The weight of the selected (disambiguated) tag as defined by the tagger used for disambiguation, as a number between 0 and 1. Currently unused.
Source of disambiguation information (tagger ID). In the Prague Dependency Treebank (PDT), there are two taggers:
Analytical (surface-syntactic) function, i.e., the type of dependency relation to the governing node of the current node (i.e., the <f> or <d> element). The list of those functions follows:
Function | Short Description |
---|---|
- - - | Not assigned |
Pred | Predicate |
Pnom | Nominal part of a predicate |
AuxV | Auxiliary Verb |
Sb | Subject |
Obj | Object |
Atr | Attribute |
Adv | Adverbial |
AtrAdv | Attribute and Adverbial (at Atr dep. position) |
AdvAtr | Attribute and Adverbial (at Atr dep. position) |
Coord | Coordination |
AtrObj | Attribute and Object (at Atr dep. position) |
ObjAtr | Attribute and Object (at Atr dep. position) |
AtrAtr | Attribute and Attribute (at deeper Atr dep. position) |
AuxT | Reflexive se as part of the lexeme |
AuxR | Reflexive se |
AuxP | Preposition |
Apos | Apposition |
ExD | Extra Dependency / Ellipsis above |
AuxC | Subordinate conjunction |
Atv | Attribute Verbal (Complement) |
AtvV | Attribute Verbal, missing nominal governor |
AuxO | Aux. referring pronoun |
AuxZ | Rhematizer |
AuxY | All the rest |
AuxG | Graphical symbol (punctuation except commas) |
AuxK | Final Punctuation of sentence |
AuxX | Comma |
AuxS | Sentence root (artificial extra node) |
Generated | Added dependency, technical value |
NA | Not Applicable |
??? | Unannotatable (extremely unclear case) |
In addition to the above list, most of the functions can have the form function_suffix, where the suffix is one of the following:
For example, Obj_Co is an analytical function for coordinated object.
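A value of the form function_suffix can be separated at the underscore. The Python sketch below uses a hypothetical function name, split_afun, and does not validate the suffix against the permitted set.

```python
def split_afun(value):
    """Split an analytical function into (function, suffix); suffix is None if absent."""
    base, sep, suffix = value.partition("_")
    return base, (suffix if sep else None)

print(split_afun("Obj_Co"))
print(split_afun("Sb"))
```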
For more information on the Prague Dependency Treebank, and its analytical level of annotation, see the Manual for Analytic Layer Tagging of the Prague Dependency Treebank (English translation) (or the Czech original).
Analytical (surface-syntactic) function, i.e., the type of dependency relation to the governing node of the current node (i.e., the <f> or <d> element), as assigned automatically by the machine. For a description of the functions used, see the <A> element.
Weight of the automatically assigned analytical function, as a number between 0 and 1.
Source of the automatically assigned analytical function (i.e., parser's or automatic function analyzer's ID code).
This is an "added" token, used for insertions of tokens well after tokenization is made, such as for tectogrammatical annotation "nodes" (e.g., restored ellipsis, fillers for valency slots, etc.). It is identical to the <f> element, except that many subelements do not make sense anymore and thus are not defined; of those that are still here (see below for a list), the semantics is identical to those at the <f> element.
Type of surface deletion (i.e., reason for adding this node to the tectogrammatical tree). See Manual for Tectogrammatic Tagging of the Prague Dependency Treebank for more information.
An enclosing element for markup of automatic tectogrammatical annotation. For n-best-output-producing parsers, it might be repeated as a whole. For the description of the trlemma itself, and for more information about the dependency linking, see the <TRl> element.
An enclosing element for markup of manual tectogrammatical annotation (for automatically assigned dependency structure and additional information, the <MTRl> element is used). The only markup which crosses the boundaries of this enclosure is the <TRg> element, which "points" to the <r> element, which is a common identification of the node ID for the analytical and tectogrammatical level.
The #PCDATA of this element is the so-called trlemma (for more info, see the Manual for Tectogrammatic Tagging of the Prague Dependency Treebank), or "tectogrammatical lemma". It might differ from the morphological lemma in several minor respects, such as added reflexive particle se for some verbs, more or less fixed phrases put together, etc.
Direct speech and/or quotation indication. Applies to the whole subtree.
A "pointer" to the governing node (head) for dependency relation at the tectogrammatical level (for more info on the tectogrammatical level, see Manual for Tectogrammatic Tagging of the Prague Dependency Treebank.) It "points" to the governing node with the corresponding <r> element (textual parts - #PCDATA - of the markup must match).
Pointer to 0 (zero) is allowed, and in fact required somewhere in every sentence, even though it is never present in any r element in the sentence; it denotes the "virtual" (or technical) extra root node of the whole sentence structure. I.e. the nodes pointing to 0 are the "linguistic" roots.
For a dependency link at the analytical level, see the <g> element.
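The pointer convention described above can be sketched in Python. This is only an illustration, not part of the CSTS tools: it assumes the <r> and <TRg> contents of one sentence have already been extracted into (r, TRg) string pairs, and the function name `build_tree` and the sample data are hypothetical.

```python
from collections import defaultdict

def build_tree(nodes):
    """Group node IDs by governor; nodes is a list of (r, trg) string pairs."""
    ids = {r for r, _ in nodes}
    children = defaultdict(list)
    for r, trg in nodes:
        # "0" is the virtual root; it never occurs as an <r> value itself.
        # All other pointers must textually match some <r> in the sentence.
        if trg != "0" and trg not in ids:
            raise ValueError(f"node {r} points to unknown governor {trg}")
        children[trg].append(r)
    if not children["0"]:
        raise ValueError("no node points to 0: sentence has no linguistic root")
    return dict(children)

# Made-up sentence with three nodes; node "2" is the linguistic root.
tree = build_tree([("1", "2"), ("2", "0"), ("3", "2")])
print(tree["0"])  # linguistic roots
print(tree["2"])  # dependents of node "2"
```

IDs are compared as strings here because the description requires the textual parts of the markup to match. The same sketch would apply to analytical-level pointers if the pairs were taken from <r> and <g> instead.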
Tectogrammatical functor, a markup for the type ("function") of the dependency relation at the tectogrammatical level. (For the so-called analytical function, or the type of dependency link at the analytical level, see the <A> element.)
A table with short descriptions of the functors follows; for a more detailed and more up-to-date account, see the Manual for Tectogrammatic Tagging of the Prague Dependency Treebank.
Functor | Short description |
---|---|
ACT | Actor/Bearer (deep subject, first argument) |
PAT | Patient (deep object, second argument) |
ADDR | Addressee (third argument) |
EFF | Effect (fourth argument) |
ORIG | Origin (fifth argument) |
ACMP | Accompaniment |
ADVS | Adversative (coordination) |
AIM | Aim |
APP | Appurtenance |
APPS | Apposition |
ATT | Attitude (of speaker/writer) |
BEN | Benefactive |
CAUS | Cause |
CNCS | Concession |
COMPL | Complement |
COND | Conditional |
CONFR | Confrontation (coordination) |
CONJ | Conjunction (coordination) |
CPR | Comparison |
CRIT | Criterion |
CSQ | Consequence (coordination) |
CTERF | Counterfactual (unreal condition) |
DENOM | Denomination (no predicate in clause) |
DES | Descriptive attribute |
DIFF | Difference |
DIR1 | Direction 1 - from where |
DIR2 | Direction 2 - through (which way) |
DIR3 | Direction 3 - to where |
DISJ | Disjunction (coordination) |
ETHD | Ethical "dative" |
EXT | Extent |
EV | Empty verb |
GRAD | Gradation (coordination) |
HER | Heritage |
INTF | Intensification |
INTT | Intent |
ID | Identity |
LOC | Location (where) |
MANN | Manner (general) |
MAT | Material attribute |
MEANS | Means |
MOD | Modality |
NORM | Norm |
PAR | Parenthesis (w/o clausal function) |
PREC | Reference to preceding text |
PRED | Predicate (null dependency) |
REAS | Reason (coordination) |
REG | Regard |
RESL | Result |
RESTR | Restriction |
RHEM | Rhematizer |
RSTR | Restrictive attribute |
SUBS | Substitution |
TFHL | Time - for how long |
TFRWH | Time - from when |
THL | Time - how long |
THO | Time - how often |
TOWH | Time - to when |
TPAR | Time - parallel events (contemporaneous) |
TSIN | Time - since when |
TTILL | Time - till when |
TWHEN | Time - when (general) |
VOC | Vocative sentence |
VOCAT | Vocative, in apposition within a clause |
NA | Not Applicable |
SENT | Sentence root (invisible in the SGML markup) |
As opposed to the analytical level functions (<A> element), the "member of (coordination, apposition, parenthesis)" suffix is not part of the functor, but is recorded separately as the <Tmo> element (even though in the graphical tools provided with the Prague Dependency Treebank, the information may be presented to the user in a form similar to the analytical level).
Tectogrammatical grammateme(s). Not yet specified - reserved for future use on the tectogrammatical level. For up-to-date information, see the Manual for Tectogrammatic Tagging of the Prague Dependency Treebank.
Member of (coordination, apposition, parenthesis). The following values are used:
Value | Short description |
---|---|
CO | Member of coordination |
AP | Member of apposition |
PA | Member of parenthesis |
NIL | Not a member of anything of the above |
NA | Not Applicable |
Morphosyntactic tag as provided at the tectogrammatical level. In general, it differs from the morphological tag (<t>, <MMt> and <MDt>), since it does not contain any information which can be restored from other nodes and from other attributes using, for example, grammatical rules for grammatical agreement, or surface information from the valency dictionary (such as the preposition/case combination).
The tags use a positional tag system to pack together several single-category morphosyntactic tags. Each position is exactly one symbol long, so the morphological categories are identified by position.
Position | Category | Values | Value abbreviations (if different and not from morphology) |
---|---|---|---|
1 | Gender | [-MIFNX] | (identical, from morph.) |
2 | Number | [-SPX] | (identical, from morph.) |
3 | Degree of comparison | [-123X] | (identical, from morph.) |
4 | Tense | [-SPAX] | SIM POST ANT |
5 | Aspect | [-PCRX] | PROC CPL CDN |
6 | Iterativeness | [-01X] | 0 - NO, 1 - Iterative |
7 | Manner | [-IMCX] | IND IMP CDN |
8 | Deontmod | [-DBHVSPFX] | DECL DEB HRT VOL POSS PERM FAC |
9 | Sentmod | [-.!DM?] | ENUNC EXCL DESID IMPER INTER |
In all positions, a dash ('-') means not applicable, and capital X means "not used" (usage differs depending on part of speech category, dependency type, etc.: for example, nouns do have information about gender, but verbs and adjectives do not; in fact, adjectives contain only degree of comparison information, since everything else can be generated automatically from the head).
For more and up-to-date information, see the Manual for Tectogrammatic Tagging of the Prague Dependency Treebank.
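As a hedged illustration of the positional scheme, the following Python sketch splits a nine-symbol tag into its categories and checks each symbol against the value sets listed in the table above. The function name `decode_tag` and the sample tag are made up for this example; real tags should be checked against the Manual.

```python
# Category names and allowed symbols, copied from the position table.
POSITIONS = [
    ("Gender", "-MIFNX"),
    ("Number", "-SPX"),
    ("Degree of comparison", "-123X"),
    ("Tense", "-SPAX"),
    ("Aspect", "-PCRX"),
    ("Iterativeness", "-01X"),
    ("Manner", "-IMCX"),
    ("Deontmod", "-DBHVSPFX"),
    ("Sentmod", "-.!DM?"),
]

def decode_tag(tag):
    """Return a {category: symbol} mapping for a 9-position tag."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} symbols, got {len(tag)}")
    decoded = {}
    for pos, (symbol, (category, allowed)) in enumerate(zip(tag, POSITIONS), start=1):
        if symbol not in allowed:
            raise ValueError(f"position {pos} ({category}): invalid symbol {symbol!r}")
        decoded[category] = symbol
    return decoded

# Made-up tag: feminine, singular, everything else "not applicable".
decoded = decode_tag("FS-------")
print(decoded["Gender"], decoded["Number"])
```

The sketch keeps the raw one-symbol values rather than expanding them to the abbreviations, so it does not depend on how each abbreviation maps to a symbol.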
Topic-focus articulation. Possible contents:
Value | Short Description |
---|---|
t | Topic |
c | Contrastive topic |
f | Focus |
- | Not Applicable |
X | Not assigned (default) |
It should be noted that in addition to these values, the underlying word order based on so-called communicative dynamism, which is closely related to topic and focus, is represented by the "deep" node order attribute (see the <tfr> element).
Markup for deep (tectogrammatical) word order, which is based on communicative dynamism and related to topic and focus (see the <tfa> element). Every node (as represented by the <f>, <fadd> and (exceptionally) <d> elements) gets a numerical ID. Since it is numerical (even though not necessarily an integer), an order is well defined; this order is then used for left-to-right ordering of nodes on every level in the tree. It is a partial ordering in the sense that a total ordering is not required (although it can always be extended to a total one, in many ways). In practice, the ordering is total.
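A minimal Python sketch of this ordering follows. It assumes the <r> and <tfr> contents have been extracted into (r, tfr) string pairs; the function name `order_by_tfr` and the sample values are hypothetical. `Decimal` is used so that non-integer values such as "1.5" are compared exactly.

```python
from decimal import Decimal

def order_by_tfr(nodes):
    """Return node IDs in deep order; nodes is a list of (r, tfr) string pairs."""
    # sorted() is stable, so nodes with equal tfr values keep their input order.
    return [r for r, tfr in sorted(nodes, key=lambda n: Decimal(n[1]))]

# Made-up sibling nodes: surface order 1, 2, 3; deep order differs.
ordered = order_by_tfr([("1", "2"), ("2", "1"), ("3", "1.5")])
print(ordered)
```

A non-integer value makes it possible to place a later-added node between two existing ones without renumbering them.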
"Functional word" markup. Usually prepositions are stored here. Theoretically, they are not part of the tectogrammatical markup, but they are provided for comparison purposes and for easier handling of certain aspects of machine learning.
Reserved for phraseme identification.
Lemma (<trlemma>) of the node that is in the coreference relation with the current node. Redundant.
Functor (<T>) of the node that is in the coreference relation with the current node. Redundant.
Node identifier (<r>) of the node that is in the coreference relation with the current node. Identifies the node uniquely and sufficiently within the sentence identified by the <cors> element.
<cors> should be empty or contain a sentence ID; it is a coreference to that sentence. If empty, the current sentence is assumed. Attribute <rel> can be used to specify a relative "distance" of the coreferenced sentence from the sentence referenced by the ID.
Relative sentence distance (non-negative, possibly non-integer number) for coreference identification. Counts backwards from the current sentence. Default is 0 (current sentence).
Unique numerical token ID within a sentence (<s>). Its numerical value order corresponds to the original surface word order in the sentence. It is also used as a destination of the "governing node pointer" (<g> and <TRg>, for analytical and tectogrammatical levels, respectively). Each token (<f>, <fadd> and <d>) has exactly one token ID (on analytical and tectogrammatical levels, but not necessarily on the lower levels).
A "pointer" to the governing node (head) for dependency relation at the analytical level (for more info on the analytical level, see the Manual for Analytic Layer Tagging of the Prague Dependency Treebank, available in English translation or in the Czech original). It "points" to the governing node with the corresponding <r> element (textual parts - #PCDATA - of the markup must match).
Pointer to 0 (zero) is allowed, and in fact required somewhere in every sentence, even though it is never present in any <r> element in the sentence; it denotes the "virtual" (or technical) extra root node of the whole sentence structure. I.e., the nodes pointing to 0 are the "linguistic" roots.
For a dependency link at the tectogrammatical level, see the <TRg> element.
Governing node pointer, machine generated, e.g., by an analytical-level parser or by some other automatic means. Used at the analytical level. For a description of analytical dependency markup, see its manually assigned counterpart (the <g> element).
The weight of the selected governing node pointer as defined by the parser, as a number between 0 and 1. Currently unused.
Source of parser (parser ID). Currently unused.
List of idioms by reference. Unused so far.
An idiom, as part of the <idioms> element. Unused so far.
Reference to part of an <idiom>. Unused so far.