Structures in the framework of the project

The project of annotation of Czech texts in the PDT covers three levels: morphological, analytical and tectogrammatical.

Annotation at all levels is based on Czech texts in SGML format (CSTS DTD), the basic format of the Czech National Corpus (ČNK). Most of the texts were taken directly from ČNK. In this format the Czech texts are already divided into separate words (word-forms), sentences and paragraphs. Punctuation is explicitly marked, and graphic information from the original text has been preserved wherever possible. Numbers written in digits are also marked, and decimals are normalized.

Continuous samples chosen at random from the texts of ČNK are used for annotation at all levels.

Morphological level

The annotation (tagging) at the morphological level is linear. To each original word-form (attribute origf, SGML attribute <w>) in the text, three attributes are assigned: word-form, lemma and tag. Tagging is done manually with the aid of sgd, a full-screen program running under Linux (it can, however, be operated remotely, e.g. from DOS). We also use an MS Windows program called DA, which is compatible with sgd at the data level and very similar to it at the GUI level. Both programs require preliminary morphological processing of the original text, i.e., each word-form must be accompanied by a list of all its possible lemmas and their possible morphological categories. This assignment is made automatically on the basis of an electronic dictionary (at present the dictionary covers some 98-99% of current newspaper and magazine texts, including names). The remaining word-forms are tagged manually. Typing errors are kept in the attribute origf; they are corrected manually in the attribute form.

Morphological tagging with the aid of the program sgd (or DA) can be performed before annotation at the analytical level, after it, or in parallel with it. Both the input and the output data of the morphological annotation programs are in SGML format according to the CSTS DTD. The volume of texts tagged at the morphological level is about 1.8 million tokens.

Word-form (attribute `form`, SGML attribute <f>)

In most cases the word-form is identical to the original word-form as found in the original text, including the use of lowercase and uppercase letters. Exceptions occur only when the original word-form is:

a number containing a decimal point
a form of the words aby, kdyby (in order to, so that; if), denoting purpose or condition
a compact (contracted) form of a preposition linked with a pronoun (e.g., naň, proň, zaň, oč, zač)
a word with an -s added to indicate the 2nd person singular of the verb být (to be) (e.g., tys, ses, udělals)
a typing error

In these cases the word-form (attribute form) is derived from the original word-form (attribute origf) as follows:

origf	Number of `form` attributes	1st or only form	2nd form
number with a decimal point	1	number with a decimal point
form of the word aby/kdyby	2	aby/kdyby	conditional by in the corresponding form (e.g., bychom)
preposition with a pronoun	2	preposition	pronoun in the corresponding (long) form (e.g., naň -> na + něj)
word with an -s	2	word without -s	jsi
typing error	1	corrected form
typing error in a contracted form	2	as in rows 2-4, corrected	as in rows 2-4, corrected
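The derivation rules in the table can be sketched in code. The following is only a minimal illustration, not the behaviour of sgd or DA: the function name `derive_forms` and the lookup tables are hypothetical, the tables contain only the examples quoted in this section, and the split forms are returned in lowercase. Decimal numbers and typing errors are passed through unchanged, since the text does not specify the normalization and describes error correction as manual.

```python
# Hypothetical lookup tables, limited to the examples quoted in this section.
# A real tool would consult the full morphological dictionary instead.
CONTRACTED_PREP = {
    "naň": ("na", "něj"),
    "proň": ("pro", "něj"),
    "zaň": ("za", "něj"),
    "oč": ("o", "co"),
    "zač": ("za", "co"),
}
WORD_WITH_S = {
    "tys": "ty",
    "ses": "se",
    "udělals": "udělal",
}
# Endings that may follow aby/kdyby; each maps onto the matching form of "by".
CONDITIONAL_ENDINGS = ("chom", "ste", "ch", "s", "")


def derive_forms(origf):
    """Return the list of `form` values (one or two) for an original word-form."""
    low = origf.lower()
    for base in ("aby", "kdyby"):
        ending = low[len(base):]
        if low.startswith(base) and ending in CONDITIONAL_ENDINGS:
            return [base, "by" + ending]        # e.g. abychom -> aby + bychom
    if low in CONTRACTED_PREP:
        return list(CONTRACTED_PREP[low])       # e.g. naň -> na + něj
    if low in WORD_WITH_S:
        return [WORD_WITH_S[low], "jsi"]        # e.g. udělals -> udělal + jsi
    return [origf]                              # default: form equals origf


for word in ("abychom", "naň", "udělals", "Praha"):
    print(word, "->", derive_forms(word))
```

A table lookup is used for the -s words because stripping a final -s mechanically would also split ordinary words ending in -s; deciding that reliably needs the dictionary.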

Lemma (`lemma`, SGML attribute <l>)

The lemma unequivocally identifies a word as a lexical unit. It is represented by a string of letters and signs which in most cases corresponds to the so-called dictionary form of the word, i.e., the word-form under which the word is usually listed in dictionaries.

Part of speech	Morphological categories of the word-form in the attribute `lemma`
Noun	Nominative, singular, no negation (provided there is a positive form and negation does not change the lexical meaning); pluralia tantum: the same, but in plural
Adjective	Masculine animate, nominative, singular, no negation, 1st degree of comparison (positive)
Pronoun	If there are such categories: nominative, singular, masculine animate, no negation (in particular, the only lemmas of personal pronouns are já, ty, on (I, you, he))
Numeral	If there are such categories: nominative, singular, masculine animate, no negation
Verb	Infinitive
Adverb	1st degree, no negation
Preposition	no vocalisation
The rest	the original form

Orthographic variants are to be unified (if, of course, they represent just genuine orthographic variants and not, e.g., a shift in meaning; this concerns the category “rest” as well).

The identification string obtained in this way can be completed by additional distinguishing identification(s), each consisting of a hyphen and one or more decimal digits (e.g., -2). A lone zero is not used. This identification distinguishes word-forms belonging to different lexical units (e.g., the noun hnát-2 versus the verb hnát-1, where -2 stands for shank and -1 for the verb drive, pursue; cf. English bear N and bear V). In exceptional cases it can also be used to distinguish the meanings of a full homonym, e.g., strana-4 (= page) vs. strana-2 (= political party).
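Separating a lemma from its numeric identification can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `split_lemma` is hypothetical, only the last hyphen-plus-number suffix is split off, and a suffix of 0 is treated as part of the lemma itself because a lone zero is not used as an identification.

```python
import re

# Assumption: the identification is a trailing hyphen followed by a number
# that does not start with zero; everything before it is the base string.
_LEMMA_RE = re.compile(r"^(?P<base>.+)-(?P<num>[1-9][0-9]*)$")


def split_lemma(lemma):
    """Return (base, number), or (lemma, None) if there is no identification."""
    match = _LEMMA_RE.match(lemma)
    if match is None:
        return lemma, None
    return match.group("base"), int(match.group("num"))


for lemma in ("hnát-2", "hnát-1", "strana-4", "trnka"):
    print(lemma, "->", split_lemma(lemma))
```

Because the pattern requires at least one character before the hyphen, a lemma that is itself just a hyphen is returned unchanged.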

Uppercase and lowercase letters play a part in distinguishing lexical units: they distinguish common names from otherwise identical proper names (e.g., trnka, blackthorn, vs. Trnka, a professor's surname). The original letter case found in the text (attributes form or origf) is disregarded, i.e., if a common word was originally written with an initial uppercase letter (in titles or at the beginning of a sentence), the attribute lemma contains it in lowercase letters only.

Morphological marker (attribute `tag`, SGML attribute <t>)

The morphological marker consists of a sequence of uppercase and lowercase letters of the English alphabet (and some other allowed symbols) and of digits. There are 15 positions in the tag (13 of them actually used) for the morphological category values.

Pos.	Category	Description	Czech Term
1	`POS`	Part of Speech	Slovní druh
2	`SUBPOS`	Detailed Part of Speech	Slovní poddruh
3	`GENDER`	Agreement Gender	Rod
4	`NUMBER`	Agreement Number	Číslo
5	`CASE`	Case	Pád
6	`POSSGENDER`	Possessor's Gender	Rod vlastníka
7	`POSSNUMBER`	Possessor's Number	Číslo vlastníka
8	`PERSON`	Person	Osoba
9	`TENSE`	Tense	Čas
10	`GRADE`	Degree of Comparison	Stupeň
11	`NEGATION`	Negation (by prefix)	Negace
12	`VOICE`	Voice	Slovesný rod
13	`RESERVE1`	Reserved for future use	Rezerva
14	`RESERVE2`	Reserved for future use	Rezerva
15	`VAR`	Variant, Style, Register	Varianta, styl
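Since the tag is positional, it can be unpacked into named categories by index. The sketch below assumes that every tag is exactly 15 characters long, one character per position; the function name `parse_tag` is hypothetical. It is exercised with the punctuation tag quoted later in this section, in which every position after the second holds a hyphen.

```python
# Category names in positional order, taken from the table above.
TAG_POSITIONS = (
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE", "GRADE",
    "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
)


def parse_tag(tag):
    """Map each of the 15 tag positions to its category name."""
    if len(tag) != len(TAG_POSITIONS):
        raise ValueError(
            "expected %d characters, got %d" % (len(TAG_POSITIONS), len(tag))
        )
    return dict(zip(TAG_POSITIONS, tag))


parsed = parse_tag("Z:-------------")
print(parsed["POS"], parsed["SUBPOS"], parsed["CASE"])
```

Rejecting tags of the wrong length early keeps a truncated or padded tag from silently shifting every category by one position.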

The part-of-speech category is determined by the first letter of the tag:

1st letter of the `tag`	Part of speech
N	noun
A	adjective
P	pronoun
C	numeral
V	verb
D	adverb
R	preposition
J	conjunction
I	interjection
T	particle
Z	punctuation, numeral figures, root of the tree
X	(unknown, unidentified)

For sentence boundaries the tag Z#------------- is assigned, while for punctuation it is Z:-------------; however, the sentence boundary is not used explicitly at the morphological level.
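The first-letter table and the two special tags can be combined into a small lookup. This is a minimal sketch: the names `POS_BY_FIRST_LETTER` and `describe_tag` are hypothetical, and the tag used for the noun example is contrived (only its first letter matters here).

```python
# First letter of the tag -> part of speech, as listed in the table above.
POS_BY_FIRST_LETTER = {
    "N": "noun",
    "A": "adjective",
    "P": "pronoun",
    "C": "numeral",
    "V": "verb",
    "D": "adverb",
    "R": "preposition",
    "J": "conjunction",
    "I": "interjection",
    "T": "particle",
    "Z": "punctuation, numeral figures, root of the tree",
    "X": "(unknown, unidentified)",
}

# The two special tags quoted in the text: two characters plus 13 hyphens.
SENTENCE_BOUNDARY_TAG = "Z#" + "-" * 13
PUNCTUATION_TAG = "Z:" + "-" * 13


def describe_tag(tag):
    """Return a short description based on the special tags or the first letter."""
    if tag == SENTENCE_BOUNDARY_TAG:
        return "sentence boundary"
    if tag == PUNCTUATION_TAG:
        return "punctuation"
    return POS_BY_FIRST_LETTER.get(tag[:1], "(unknown, unidentified)")


for tag in (SENTENCE_BOUNDARY_TAG, PUNCTUATION_TAG, "N" + "-" * 14):
    print(tag, "->", describe_tag(tag))
```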