A lemma in PDT 1.0 has two parts. The first part, the lemma proper, must be a unique identifier of the lexical item. It is usually the base form of the word (e.g. the infinitive for a verb), possibly followed by a number that distinguishes different lemmas with the same base form. The second, optional part is not part of the identifier; it contains additional information about the lemma, e.g. semantic or derivational information.
The formal description of the lemma structure follows. Spaces were inserted between nonterminals to improve readability. Note however that no lemma contains any spaces. Capitalized multi-character symbols are nonterminals. All other symbols are terminals.
Lemma ::= LemmaProper | LemmaProper AddInfo
LemmaProper ::= Word | Word - Number | Number | SpecialChar
Word ::= Letter | Letter Word
Letter ::= A | a | Á | á | Ä | ä | ... | Z | z | Ž | ž | '
Number ::= NonZero | NonZero Number0
Number0 ::= Digit | Digit Number0
NonZero ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Digit ::= 0 | NonZero
SpecialChar ::= ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / | : | ; | < | = | > | ? | @ | [ | \ | ] | ^ | _ | ` | { | | | } | ~ | § | °
AddInfo ::= Reference Category Term Style Comment
Reference ::= <empty> | ` LemmaProper
Category ::= <empty> | _: Category1 | _: Category1 Category
Term ::= <empty> | _; Term1 | _; Term1 Term
Style ::= <empty> | _, Style1 | _, Style1 Style
Comment ::= <empty> | _^ Comment1
Category1 ::= N | J | A | Z | M | V | T | W | D | P | C | I | F | B | Q | X
Term1 ::= Y | S | E | G | K | R | m | H | U | L | j | g | c | y | b | u | w | p | z | o
Style1 ::= t | n | a | s | h | e | l | v | x
Comment1 ::= ( Explanation ) | ( Derivation ) | ( Explanation )_( Derivation )
Explanation ::= CommentChar | CommentChar Explanation
Derivation ::= * Number Word | * Word
CommentChar ::= Letter | Digit | ! | " | # | $ | % | & | ' | * | + | , | - | . | / | : | ; | < | = | > | ? | @ | [ | \ | ] | ^ | _ | ` | { | | | } | ~ | § | °
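The grammar above is regular, so it can be checked with a single regular expression. The following Python sketch is one possible transcription, not part of PDT itself. The names (parse_lemma, LEMMA and the helper constants) are hypothetical, the class [^\W\d_] only approximates "Unicode letter", and the sample lemmas in the usage note are illustrative strings built to fit the grammar, not entries quoted from the corpus.

```python
import re

# Letter: any Unicode letter (approximated by [^\W\d_]) or the apostrophe.
LETTER = r"(?:[^\W\d_]|')"
WORD = rf"{LETTER}+"
NUMBER = r"[1-9][0-9]*"          # NonZero followed by any digits
SPECIAL = r"""[!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~§°]"""
LEMMA_PROPER = rf"(?:{WORD}-{NUMBER}|{WORD}|{NUMBER}|{SPECIAL})"

# CommentChar: letters, digits and the listed punctuation, but no parentheses.
COMMENT_CHAR = r"""(?:[^\W\d_]|[0-9!"#$%&'*+,\-./:;<=>?@\[\\\]^_`{|}~§°])"""
DERIVATION = rf"\*(?:{NUMBER})?{WORD}"
# Every Derivation string is also a valid Explanation string, so the
# first parenthesised group covers both "( Explanation )" and "( Derivation )".
COMMENT1 = rf"\({COMMENT_CHAR}+\)(?:_\({DERIVATION}\))?"

LEMMA = re.compile(
    rf"(?P<proper>{LEMMA_PROPER})"
    rf"(?:`(?P<reference>{LEMMA_PROPER}))?"
    r"(?P<category>(?:_:[NJAZMVTWDPCIFBQX])*)"
    r"(?P<term>(?:_;[YSEGKRmHULjgcybuwpzo])*)"
    r"(?P<style>(?:_,[tnashelvx])*)"
    rf"(?:_\^(?P<comment>{COMMENT1}))?"
)


def parse_lemma(lemma):
    """Split a lemma into its parts; return None if it does not fit the grammar."""
    m = LEMMA.fullmatch(lemma)
    if m is None:
        return None
    return {
        "proper": m["proper"],
        "reference": m["reference"],
        # Each flag is a two-character marker plus one code letter,
        # so the code letters sit at positions 2, 5, 8, ...
        "categories": list(m["category"][2::3]),
        "terms": list(m["term"][2::3]),
        "styles": list(m["style"][2::3]),
        "comment": m["comment"],
    }
```

Under these assumptions, parse_lemma("Praha_;G") yields the lemma proper "Praha" with the single term flag "G", while a string containing a space, or a number suffix with a leading zero such as "stát-02", is rejected with None.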
Notes on characters:
Any character that is a letter in the Unicode standard can appear in place of the Letter nonterminal. In the non-ASCII area this most frequently applies to the Czech accented characters: Á á Č č Ď ď É é Ě ě Í í Ň ň Ó ó Ř ř Š š Ť ť Ú ú Ů ů Ý ý Ž ž. However, other characters occur in names (e.g. German Ä ä Ö ö Ü ü, Serbo-Croatian Ć ć) and in foreign words (e.g. Slovak Ľ ľ Ĺ ĺ Ô ô Ŕ ŕ).
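As a minimal sketch of this rule, the Letter test can be written against the Unicode general category, which Python exposes through the standard unicodedata module. The function name is_letter is hypothetical, and treating the apostrophe as a letter follows the Letter production of the grammar rather than Unicode.

```python
import unicodedata


def is_letter(ch):
    """True if the single character ch can stand for the Letter nonterminal."""
    if len(ch) != 1:
        return False
    # Unicode letter categories are Lu, Ll, Lt, Lm and Lo; the apostrophe
    # is added because the grammar lists it under Letter.
    return ch == "'" or unicodedata.category(ch).startswith("L")
```

With this definition the Czech, German and Slovak characters listed above all pass, while digits, the underscore and symbols such as § do not.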
Standard HTML entities (such as &amp; for & or &agrave; for à) are also allowed. PDT 1.0 was encoded in the ISO Latin 2 codepage, so representing any West European characters required using entities. PDT 2.0 shall be encoded in UTF-8, so few entities will be needed.
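The relation between entities and the ISO Latin 2 codepage can be illustrated with Python's standard html module. This is only a sketch: the function names decode_entities and encode_for_latin2 are hypothetical, and it is an assumption that a tool would escape exactly the characters missing from Latin 2 plus the ampersand itself.

```python
import html
from html.entities import codepoint2name


def decode_entities(text):
    """Turn entities such as &amp; or &agrave; back into characters."""
    return html.unescape(text)


def encode_for_latin2(text):
    """Replace characters that ISO Latin 2 cannot represent with entities."""
    out = []
    for ch in text:
        if ch == "&":
            # Escape the ampersand so that decoding stays unambiguous.
            out.append("&amp;")
            continue
        try:
            ch.encode("iso-8859-2")
            out.append(ch)
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            # Fall back to a numeric reference if no named entity exists.
            out.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(out)
```

Czech characters survive unchanged because Latin 2 contains them, whereas a West European character such as à becomes &agrave;. Decoding the encoded string restores the original text.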
The single quote (') is considered a Letter in some transcriptions of non-Latin alphabets (e.g. in Chinese Mao C'-tung, Hebrew Be'er Sheva'). If it marks deleted parts of words (e.g. English don't, French d'Artagnan), it is considered a SpecialChar and it splits the string into three tokens (d, ', Artagnan). Even in these languages there are exceptions (e.g. the surname Preud'homme is one token).
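The apostrophe rule can be sketched as follows. This is an illustration under stated assumptions, not the tokenizer used for PDT: the function name split_on_apostrophe, the apostrophe_is_letter flag and the exception set are all hypothetical, and the caller is assumed to know whether the word is a transcription in which the apostrophe counts as a Letter. The exception set contains only the one surname named above.

```python
import re

# Hypothetical exception list; only Preud'homme is named in the text.
ONE_TOKEN_EXCEPTIONS = {"Preud'homme"}


def split_on_apostrophe(word, apostrophe_is_letter=False):
    """Split a word at apostrophes unless the apostrophe counts as a Letter."""
    if apostrophe_is_letter or word in ONE_TOKEN_EXCEPTIONS:
        return [word]
    # The capturing group keeps each apostrophe as a token of its own;
    # empty strings at the edges are dropped.
    return [part for part in re.split(r"(')", word) if part]
```

For example, split_on_apostrophe("d'Artagnan") gives the three tokens d, ' and Artagnan, while split_on_apostrophe("Be'er", apostrophe_is_letter=True) keeps the word whole.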