2.1. Lemma structure

A lemma in PDT 1.0 has two parts. The first part, the lemma proper, must be a unique identifier of the lexical item. Usually it is the base form of the word (e.g. the infinitive for a verb), possibly followed by a number that distinguishes different lemmas with the same base form. The second part is optional; it is not part of the identifier and carries additional information about the lemma, e.g. semantic or derivational information.

The formal description of the lemma structure follows. Spaces were inserted between nonterminals to improve readability. Note however that no lemma contains any spaces. Capitalized multi-character symbols are nonterminals. All other symbols are terminals.

Lemma       ::= LemmaProper | LemmaProper AddInfo
LemmaProper ::= Word | Word - Number | Number | SpecialChar
Word        ::= Letter | Letter Word
Letter      ::= A | a | Á | á | Ä | ä | ... | Z | z | Ž | ž | '
Number      ::= NonZero | NonZero Number0
Number0     ::= Digit | Digit Number0
NonZero     ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Digit       ::= 0 | NonZero
SpecialChar ::= ! | " | # | $ | % | & | ' | ( | ) | * | + | , |
                - | . | / | : | ; | < | = | > | ? | @ | [ | \ |
                ] | ^ | _ | ` | { | | | } | ~ | § | °
AddInfo     ::= Reference Category Term Style Comment
Reference   ::= <empty> | ` LemmaProper
Category    ::= <empty> | _: Category1 | _: Category1 Category
Term        ::= <empty> | _; Term1     | _; Term1 Term
Style       ::= <empty> | _, Style1    | _, Style1 Style
Comment     ::= <empty> | _^ Comment1
Category1   ::= N | J | A | Z | M | V | T | W | D | P | C | I | F | B | Q | X
Term1       ::= Y | S | E | G | K | R | m | 
                H | U | L | j | g | c | y | b | u | w | p | z | o
Style1      ::= t | n | a | s | h | e | l | v | x
Comment1    ::= ( Explanation ) | ( Derivation ) |
                ( Explanation )_( Derivation )
Explanation ::= CommentChar | CommentChar Explanation
Derivation  ::= * Number Word | * Word
CommentChar ::= Letter | Digit |
                ! | " | # | $ | % | & | ' | * | + | , | - | . |
                / | : | ; | < | = | > | ? | @ | [ | \ | ] | ^ |
                _ | ` | { | | | } | ~ | § | °
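
The split between LemmaProper and AddInfo implied by the grammar can be sketched in Python. This is a minimal illustration, not part of the PDT tools: the function name split_lemma is hypothetical, and the sketch assumes that AddInfo, when present, always opens with one of the markers `, _:, _;, _, or _^ as the productions above suggest.

```python
import re
from typing import Tuple

# AddInfo opens with a Reference (`), Category (_:), Term (_;),
# Style (_,) or Comment (_^) marker, per the grammar above.
_ADDINFO_START = re.compile(r"`|_[:;,^]")


def split_lemma(lemma: str) -> Tuple[str, str]:
    """Return (lemma_proper, add_info); add_info is '' when absent."""
    if not lemma:
        raise ValueError("a lemma cannot be empty")
    # Search from index 1: the LemmaProper is never empty, so a
    # one-character SpecialChar lemma such as "_" or "`" is not
    # mistaken for the start of AddInfo.
    match = _ADDINFO_START.search(lemma, 1)
    if match is None:
        return lemma, ""
    return lemma[:match.start()], lemma[match.start():]


print(split_lemma("Bonn_;G"))
print(split_lemma("vazba-1_^(obviněného)"))
```

Because a Word or Number never contains ` or _, the first marker found after position 0 is where the AddInfo begins; an underscore inside a comment (as in jídlo_apod.) comes later and does not affect the split.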
      

Notes on characters:

  1. Any character that is a letter in the Unicode standard can appear in place of the Letter nonterminal. In the non-ASCII area this most frequently applies to the Czech accented characters: Á á Č č Ď ď É é Ě ě Í í Ň ň Ó ó Ř ř Š š Ť ť Ú ú Ů ů Ý ý Ž ž. However, other characters occur in names (e.g. German Ä ä Ö ö Ü ü, Serbo-Croatian Ć ć) and in foreign words (e.g. Slovak Ľ ľ Ĺ ĺ Ô ô Ŕ ŕ).

  2. Standard HTML entities (such as &amp; for & or &agrave; for à) are also allowed. PDT 1.0 was encoded in the ISO Latin 2 codepage, so representing any West European characters required using entities. PDT 2.0 will be encoded in UTF-8, so few entities will be needed.

  3. The single quote (') is considered a Letter in some transcriptions of non-Latin alphabets (e.g. in Chinese Mao C'-tung, Hebrew Be'er Sheva'). If it marks deleted parts of words (e.g. English don't, French d'Artagnan), it is considered a SpecialChar and it splits the string into three tokens (d ' Artagnan). Even in these languages there are exceptions (e.g. the surname Preud'homme is one token).
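
The character notes above can be turned into a small LemmaProper check. The following Python sketch is only an illustration under stated assumptions: it treats "letter in the Unicode standard" as any character whose Unicode general category starts with L, it does not handle HTML entities, and all function names are hypothetical.

```python
import unicodedata

# The SpecialChar terminals listed in the grammar.
SPECIAL_CHARS = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~§°")


def is_letter(ch):
    # Letter: any Unicode letter (general category L*), plus the
    # single quote used in some transcriptions (note 3).
    return ch == "'" or unicodedata.category(ch).startswith("L")


def is_word(s):
    return len(s) > 0 and all(is_letter(c) for c in s)


def is_number(s):
    # Number: a nonzero digit followed by any ASCII digits.
    return len(s) > 0 and s[0] in "123456789" and all(c in "0123456789" for c in s)


def is_lemma_proper(s):
    # LemmaProper ::= Word | Word - Number | Number | SpecialChar
    if len(s) == 1 and s in SPECIAL_CHARS:
        return True
    if is_word(s) or is_number(s):
        return True
    word, sep, number = s.rpartition("-")
    return sep == "-" and is_word(word) and is_number(number)


print(is_lemma_proper("Preud'homme"))
print(is_lemma_proper("vazba-1"))
```

Whether an apostrophe acts as a Letter or as a SpecialChar in a given token is a tokenization decision (note 3) and cannot be recovered from the string alone; the sketch only checks that a finished LemmaProper is well formed.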

Table 2.1. Lemma examples

Whole lemma              LemmaProper   AddInfo
chemik                   chemik
maso_^(jídlo_apod.)      maso          _^(jídlo_apod.)
Bonn_;G                  Bonn          _;G
vazba-1_^(obviněného)    vazba-1       _^(obviněného)
vazba-2_^(spojení)       vazba-2       _^(spojení)
Martinův-1_;Y_^(*4-1)    Martinův-1    _;Y_^(*4-1)
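
The AddInfo column of the examples above can be decomposed into its Reference, Category, Term, Style and Comment fields with a single regular expression. This Python sketch is an illustration only: the name parse_add_info is hypothetical, the referenced LemmaProper is assumed not to be a special character, and the comment is returned as raw text without checking it against the Comment1 production.

```python
import re

# One optional Reference, then repeated Category, Term and Style
# codes, then one optional Comment, in the order given by AddInfo.
_ADDINFO = re.compile(
    r"(?:`(?P<reference>[^_`]+))?"
    r"(?P<category>(?:_:[NJAZMVTWDPCIFBQX])*)"
    r"(?P<term>(?:_;[YSEGKRmHULjgcybuwpzo])*)"
    r"(?P<style>(?:_,[tnashelvx])*)"
    r"(?:_\^(?P<comment>.+))?"
)


def _codes(run):
    # "_;Y_;S" -> ["Y", "S"]: every third character, starting at index 2.
    return list(run[2::3])


def parse_add_info(add_info):
    match = _ADDINFO.fullmatch(add_info)
    if match is None:
        raise ValueError("not a well-formed AddInfo: %r" % add_info)
    return {
        "reference": match.group("reference"),
        "categories": _codes(match.group("category")),
        "terms": _codes(match.group("term")),
        "styles": _codes(match.group("style")),
        "comment": match.group("comment"),
    }


print(parse_add_info("_;Y_^(*4-1)"))
```

An empty AddInfo, as in the first row of the table, yields empty lists and None for the reference and the comment.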