2.1.1. Base form and number

The Word in LemmaProper is the base form of the respective paradigm. For nouns this is the nominative singular; for adjectives, the nominative singular masculine in the positive degree; pronouns and numerals are treated similarly. Verbs are represented by their infinitive forms.

The Number in LemmaProper helps to distinguish several senses of a homonymous base form. It should neither be zero nor start with zero. The numbers used need not form a continuous sequence. Sometimes a particular number is repeatedly used for a special kind of word (e.g. the lemmas numbered "-99" are almost invariably authors' signatures and their Category/Style part is "_:B_;S"). Conventions of this kind exist solely for the convenience of a human reader; they are not meant to signal anything to a processing program. No conclusions should ever be drawn from the value of the lemma number! There is no guarantee that an observed number "semantics" holds anywhere else. Other sources of information, such as the AddInfo text, should be used instead.
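
As a rough illustration of the Word and Number parts, the following Python sketch splits a lemma string and checks the constraint on the number. It assumes, based only on the examples in this section (X-2, Špaček_;S, "_:B_;S"), that the number follows the word after a hyphen and that AddInfo starts at the first underscore. The names Lemma, split_lemma and number_is_well_formed are hypothetical and do not belong to any PDT tool.

```python
from typing import NamedTuple, Optional


class Lemma(NamedTuple):
    word: str
    number: Optional[str]  # kept as text so that a leading zero stays visible
    add_info: str


def split_lemma(lemma: str) -> Lemma:
    """Split a lemma string into (word, number, add_info).

    Assumption: AddInfo begins at the first underscore that is not the
    very first character, and the number is a run of ASCII digits after
    the last hyphen of the remaining part.
    """
    cut = lemma.find("_", 1)
    if cut == -1:
        proper, add_info = lemma, ""
    else:
        proper, add_info = lemma[:cut], lemma[cut:]

    word, sep, tail = proper.rpartition("-")
    if sep and word and tail.isascii() and tail.isdigit():
        return Lemma(word, tail, add_info)
    return Lemma(proper, None, add_info)


def number_is_well_formed(number: Optional[str]) -> bool:
    """The number should neither be zero nor start with zero."""
    return number is None or not number.startswith("0")


if __name__ == "__main__":
    for text in ("žena-2", "Špaček_;S", "X-99_:B_;S"):
        print(text, "->", split_lemma(text))
```

The sketch deliberately does nothing with the value of the number beyond the well-formedness check, in line with the warning above. A word that itself ends in a hyphen followed by digits, or that contains an underscore, would be split incorrectly under these assumptions.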

The following rules shall hold for each group of lemmas sharing the same base form.

Rule 1: If lemmas use numbers to distinguish lexical items with the same base form, they all have to use them, i.e. if there is the lemma X-2, the unnumbered lemma X should not exist. If more than one lemma shares a base form, all of them must be numbered.
Rule 2: If a lemma is numbered, its AddInfo should not be empty. The AddInfo must help to distinguish the lemma from other lemmas with the same base form but different numbers. Exception: if all but one of the lemmas with the same base form are foreign words, the domestic one need not have a non-empty AddInfo. All the foreign counterparts must have it, though.
Rule 3: Two lemmas with different AddInfo must differ in numbers as well. Exceptions (see below): abbreviations (two lemmas differ in the presence of _:B but not in their numbers).
Rule 4: Two lemmas with different numbers must differ in AddInfo as well.
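
The four rules can be read as a consistency check over each group of lemmas with the same base form. The Python sketch below is one possible reading, not an official validator: it takes already-split lemmas as hypothetical (word, number, add_info) tuples, with number set to None for unnumbered lemmas, and reports which rule each base form appears to break. The exceptions to Rules 2 and 3 (foreign words, abbreviations with _:B) are not modelled, so every hit is only a candidate for manual review.

```python
from collections import defaultdict
from itertools import combinations


def check_group_rules(lemmas):
    """Return a sorted list of (rule_number, base_form) for apparent violations.

    `lemmas` is an iterable of (word, number, add_info) tuples; `number`
    is None for an unnumbered lemma. Grouping is case-sensitive.
    """
    groups = defaultdict(list)
    for word, number, add_info in set(lemmas):  # ignore exact duplicates
        groups[word].append((number, add_info))

    violations = set()
    for word, members in groups.items():
        numbered = [m for m in members if m[0] is not None]

        # Rule 1: if several lemmas share a base form, all must be numbered.
        if len(members) > 1 and len(numbered) < len(members):
            violations.add((1, word))

        # Rule 2: a numbered lemma should have a non-empty AddInfo
        # (the foreign-word exception is NOT handled here).
        if any(add_info == "" for _, add_info in numbered):
            violations.add((2, word))

        for (n1, a1), (n2, a2) in combinations(members, 2):
            # Rule 3: different AddInfo requires different numbers
            # (the abbreviation exception is NOT handled here).
            if a1 != a2 and n1 == n2:
                violations.add((3, word))
            # Rule 4: different numbers require different AddInfo.
            if n1 != n2 and a1 == a2:
                violations.add((4, word))

    return sorted(violations)


if __name__ == "__main__":
    sample = [
        ("X", "1", "_;S"), ("X", "2", "_:B_;S"),  # consistent group
        ("Y", None, ""), ("Y", "2", "_;S"),       # breaks Rule 1
        ("Z", "1", "_;S"), ("Z", "2", "_;S"),     # breaks Rule 4
    ]
    print(check_group_rules(sample))
```

Because the grouping key is the word exactly as written, lemmas that differ only in capitalization fall into separate groups and are never compared with each other.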

Unfortunately, many lemmas are not covered by our automatic morphological analyzer. Such lemmas were created by the annotators, and the administrator of the lexicon should later make their numbers and/or suffixes consistent and conformant to the above rules. In many cases it was not feasible to complete this task for PDT 2.0.

The base form in a lemma is case-sensitive. Naturally, words that must always be capitalized in writing have their lemma capitalized as well. As a consequence, špaček (starling) and Špaček_;S need not be distinguished by numbers (or they can both use the same number). However, although not required, unique numbering of such cases is recommended.

Sometimes the numbering of lemmas reflects that their base form is homonymous with another word, although the other meaning is not a base form. For instance, žena is a noun (meaning woman) but it can also be a transgressive form of the verb hnát. The morphological analyzer may assign different numbers to both meanings of žena, although the latter is not a base form. As a consequence, there may be a lemma žena-2 even if there is no other lemma with the same base form. Such behavior is allowed but not required.