Structures in the framework of the project

The project of annotation of Czech texts in the PDT covers three levels: morphological, analytical and tectogrammatical.

Annotation at all levels is based on Czech texts in an SGML format (CSTS DTD), which is the basic format of the Czech National Corpus (ČNK). Most of the texts have been taken directly from ČNK. In this format the Czech texts are already divided into separate words (word-forms), sentences and paragraphs. Punctuation is explicitly marked, and graphic information from the original text has been preserved wherever possible. Numbers in numerical form are also marked, and decimals are normalized.

Texts chosen at random (in continuous samples) from the texts of ČNK are used for annotations at all levels.

Morphological level

The annotation (tagging) at the morphological level is linear. To each original word-form in the text (attribute name: origf, SGML attribute <w>), three attributes are assigned: word-form, lemma and tag. Tagging is manual, with the aid of a full-screen program, sgd, that runs under Linux (it can, however, also be operated remotely, e.g. from DOS). We also use an MS Windows program called DA, which is compatible with sgd at the data level and very similar to it at the GUI level. Both programs require preliminary morphological processing of the original text, i.e., each word-form is supposed to be accompanied by a list of all its possible lemmas and their possible morphological categories. This assignment is done automatically on the basis of an electronic dictionary (at present the vocabulary covers some 98-99% of current newspaper or magazine texts, including names). The remaining word-forms are tagged manually. Typing errors are kept in the attribute origf; they are, however, corrected manually and handled in the attribute form.

Morphological tagging with the aid of the program sgd (or DA) can be performed prior to annotation at the analytical level, but also after this has been accomplished, or in parallel. Both input and output data for the morphological annotation programs are in the SGML format according to the CSTS DTD. The volume of texts tagged at the morphological level is about 1.8 million tokens.

Word-form (attribute form, SGML attribute <f>)

In most cases the word-form is identical with the original word-form as found in the original text, including the use of lowercase and uppercase letters. Exceptions occur only when the original word-form is:

  • a number containing a decimal point

  • the words aby, kdyby (to, in order to, so that, if) (denoting purpose, condition)

  • a compact (contracted) form of a preposition linked with a pronoun (e.g., naň, proň, zaň, zač)

  • a word with an -s added to indicate 2nd person singular of the verb být (to be) (e.g., tys, ses, udělals)

  • a typing error

In these cases the form (form) is derived from the original word-form (origf) in the following way:

| origf | # of form attr. | 1st or the only form | 2nd form |
|---|---|---|---|
| number with a decimal point | 1 | number with a decimal point | |
| form of the word aby/kdyby | 2 | aby/kdyby | conditional by in the corresponding form (e.g., bychom) |
| preposition with a pronoun | 2 | preposition | pronoun in the corresponding (long) form (e.g., naň -> na + něj) |
| word with an -s | 2 | word without -s | jsi |
| typing error | 1 | corrected form | |
| typing error with contracted forms | 2 | see rows 2-4, corrected form | see rows 2-4, corrected form |
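
The derivation rules in the table above can be sketched in a few lines of Python. This is only an illustration, not the logic of sgd or DA: the function name `split_origf` and the lookup tables are hypothetical, the tables cover only the examples named in the text plus the regular conditional forms of by, and capitalized input and typing errors are not handled.

```python
# Hypothetical lookup tables (assumptions for illustration only).
CONDITIONAL_FORMS = {"bych", "bys", "by", "bychom", "byste"}
CONTRACTED_PREPOSITIONS = {"naň": ("na", "něj")}   # example from the table
SECOND_PERSON_S = {"tys", "ses", "udělals"}        # examples from the list


def split_origf(origf):
    """Return the one or two form values derived from an original word-form."""
    # aby/kdyby: the conjunction plus the conditional "by" in the matching form.
    for prefix in ("kdy", "a"):
        rest = origf[len(prefix):]
        if origf.startswith(prefix) and rest in CONDITIONAL_FORMS:
            return [prefix + "by", rest]
    # Contracted preposition + pronoun: preposition plus the long pronoun form.
    if origf in CONTRACTED_PREPOSITIONS:
        return list(CONTRACTED_PREPOSITIONS[origf])
    # Word with an added -s: the word without -s plus "jsi".
    if origf in SECOND_PERSON_S:
        return [origf[:-1], "jsi"]
    # Default: the form is identical with the original word-form.
    return [origf]


print(split_origf("abychom"))
print(split_origf("naň"))
print(split_origf("udělals"))
```

An explicit word list is used for the -s case because many ordinary Czech words end in -s, so the suffix alone cannot identify the contraction.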

Lemma (lemma, SGML attribute <l>)

The lemma unequivocally identifies a word as a lexical unit. It is represented by a string of letters and signs which in most of the cases corresponds to the so-called dictionary form of the word, or, to put it differently, to the word-form under which the word usually figures in dictionaries.

| Part of speech | Morphological categories of the word-form in the attribute lemma |
|---|---|
| Noun | Nominative, singular, no negation (provided there is a positive form and negation does not change the lexical meaning); pluralia tantum: the same, but in plural |
| Adjective | Masculine animate, nominative, singular, no negation, 1st degree of comparison (positive) |
| Pronoun | If there are such categories: nominative, singular, masculine animate, no negation (in particular, personal pronouns: only já, ty, on (I, you, he)) |
| Numeral | If there are such categories: nominative, singular, masculine animate, no negation |
| Verb | Infinitive |
| Adverb | 1st degree, no negation |
| Preposition | No vocalisation |
| The rest | The original form |

Orthographic variants are to be unified (if, of course, they represent just genuine orthographic variants and not, e.g., a shift in meaning; this concerns the category "rest" as well).

The identification string obtained in this way can be supplemented by one or more additional distinguishing identifiers, each consisting of a hyphen and one or more decimal digits (e.g., -2). An isolated zero is not used. This identifier serves to distinguish grammatical forms belonging to different lexical units (e.g., the noun hnát-2 versus the verb hnát-1, where -2 stands for "shank" and -1 for the verb "drive, pursue"; cf. English bear N and bear V). In exceptional cases it can also be used to distinguish the meanings of a full homonym: e.g., strana-4 (= page) vs. strana-2 (= political party).
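
As a rough illustration of this convention, a lemma can be separated into its base string and its numeric identifier with a regular expression. The function name `parse_lemma` is hypothetical, and the sketch assumes a single numeric suffix at the end of the lemma; a bare zero is rejected because the text says it is not used.

```python
import re

# Base string, a hyphen, then a number that does not start with zero.
_LEMMA_RE = re.compile(r"^(?P<base>.+?)-(?P<num>[1-9][0-9]*)$")


def parse_lemma(lemma):
    """Split a lemma into (base, identifier); identifier is None if absent."""
    match = _LEMMA_RE.match(lemma)
    if match is None:
        return lemma, None
    return match.group("base"), int(match.group("num"))


print(parse_lemma("hnát-2"))
print(parse_lemma("strana-4"))
print(parse_lemma("trnka"))
```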

Uppercase and lowercase letters play their part in distinguishing lexical units; they are used to distinguish common names from otherwise identical proper names (e.g., trnka vs. Trnka: blackthorn vs. Trnka, a professor). The original case of the letters as found in the text (attributes form or origf) is disregarded, i.e., if a common word was originally written with an initial uppercase letter (titles, beginning of a sentence), it is stored in the attribute lemma in lowercase letters only.

Morphological marker (attribute tag, SGML attribute <t>)

The morphological marker consists of a sequence of uppercase and lowercase letters of the English alphabet (and some other allowed symbols) and of digits. There are 15 positions in the tag (13 of them actually used) for the morphological category values.

| Pos. | Category | Description | Czech Term |
|---|---|---|---|
| 1 | POS | Part of Speech | Slovní druh |
| 2 | SUBPOS | Detailed Part of Speech | Slovní poddruh |
| 3 | GENDER | Agreement Gender | Rod |
| 4 | NUMBER | Agreement Number | Číslo |
| 5 | CASE | Case | Pád |
| 6 | POSSGENDER | Possessor's Gender | Rod vlastníka |
| 7 | POSSNUMBER | Possessor's Number | Číslo vlastníka |
| 8 | PERSON | Person | Osoba |
| 9 | TENSE | Tense | Čas |
| 10 | GRADE | Degree of Comparison | Stupeň |
| 11 | NEGATION | Negation (by prefix) | Negace |
| 12 | VOICE | Voice | Slovesný rod |
| 13 | RESERVE1 | Reserved for future use | Rezerva |
| 14 | RESERVE2 | Reserved for future use | Rezerva |
| 15 | VAR | Variant, Style, Register | Varianta, styl |
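
Because the tag is purely positional, it can be unpacked by pairing each character with the category name of its position. The following is a minimal sketch; the names `TAG_POSITIONS` and `decode_tag` are hypothetical, and the values are returned as raw characters without interpretation.

```python
# Category names in positional order, taken from the table above.
TAG_POSITIONS = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE", "GRADE",
    "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
]


def decode_tag(tag):
    """Map each of the 15 tag positions to its category name."""
    if len(tag) != len(TAG_POSITIONS):
        raise ValueError(f"expected {len(TAG_POSITIONS)} positions, got {len(tag)}")
    return dict(zip(TAG_POSITIONS, tag))


# Example input: "Z:" followed by 13 hyphens, i.e. a 15-position tag.
decoded = decode_tag("Z:" + "-" * 13)
print(decoded["POS"], decoded["SUBPOS"], decoded["VAR"])
```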

Distinguishing part-of-speech category according to the first letter of the tag:

| 1st letter of the tag | Part of speech |
|---|---|
| N | noun |
| A | adjective |
| P | pronoun |
| C | numeral |
| V | verb |
| D | adverb |
| R | preposition |
| J | conjunction |
| I | interjection |
| T | particle |
| Z | punctuation, numeral figures, root of the tree |
| X | (unknown, unidentified) |

For sentence boundaries the tag Z#------------- is assigned, while for punctuation it is Z:-------------; however, the sentence boundary is not used explicitly at the morphological level.
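
The first-letter mapping and the two special Z tags can be expressed as a simple lookup. This is a sketch under stated assumptions: the names `POS_BY_LETTER`, `part_of_speech` and `is_sentence_boundary` are hypothetical, and the two tags are assumed to consist of two leading characters followed by 13 hyphens, filling all 15 positions.

```python
# First letter of the tag -> part of speech, as listed in the table above.
POS_BY_LETTER = {
    "N": "noun", "A": "adjective", "P": "pronoun", "C": "numeral",
    "V": "verb", "D": "adverb", "R": "preposition", "J": "conjunction",
    "I": "interjection", "T": "particle",
    "Z": "punctuation, numeral figures, root of the tree",
    "X": "(unknown, unidentified)",
}

# Assumed layout: two leading characters plus 13 hyphens (15 positions).
SENTENCE_BOUNDARY_TAG = "Z#" + "-" * 13
PUNCTUATION_TAG = "Z:" + "-" * 13


def part_of_speech(tag):
    """Return the part-of-speech label for a tag, or None if not listed."""
    return POS_BY_LETTER.get(tag[:1])


def is_sentence_boundary(tag):
    """True only for the dedicated sentence-boundary tag."""
    return tag == SENTENCE_BOUNDARY_TAG


print(part_of_speech(PUNCTUATION_TAG))
print(is_sentence_boundary(SENTENCE_BOUNDARY_TAG))
```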