Physical structure of the PDT-VALLEX


PDT-VALLEX is a valency lexicon of the Czech verbs, nouns, adjectives, and adverbs that occur at least once in the t-layer data of PDT 2.0. The whole lexicon is stored in a single XML file (data/pdt-vallex/vallex.xml on the PDT 2.0 CD-ROM). The structure of the file is formally described by a Document Type Definition (data/pdt-vallex/vallex.dtd) and, equivalently, by a RELAX NG schema (data/pdt-vallex/vallex.rng). The linguistic interpretation of the lexicon is explained in the Manual for tectogrammatical annotation. The following paragraphs give a simplified and informal introduction to the physical structure of the lexicon.

The top-level element of the lexicon is named <valency_lexicon> and consists of three parts: <head>, <body>, and <tail>. The first and the last served only for technical purposes during the annotation, e.g. for capturing the history of changes or for storing the list of annotators. The core of the lexicon is formed by the <body> element.

The <body> element contains a sequence of <word> elements, each of them corresponding to an individual word entry. Each word entry is associated with attributes lemma (corresponding to t-lemma in tectogrammatical trees; e.g. PDT-VALLEX lemma "bát se" corresponds to t-lemma "bát_se"), POS (semantic part of speech), and id (word entry identifier). Besides the attributes, each word entry contains a sequence of frame entries, represented by <frame> elements and embedded in the <valency_frames> element.
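
The nesting described so far can be illustrated with a minimal Python sketch that reads word entries with the standard xml.etree.ElementTree module. The embedded sample is a hypothetical miniature, not an excerpt from vallex.xml: the identifiers "w1", "f1", "f2" and the POS value "V" are assumptions made for illustration, and the helper to_t_lemma assumes that the lemma-to-t-lemma correspondence is a plain replacement of spaces by underscores, as in the "bát se" example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature lexicon (ids and the POS value are invented);
# the real data live in data/pdt-vallex/vallex.xml.
SAMPLE = """\
<valency_lexicon>
  <head/>
  <body>
    <word id="w1" lemma="bát se" POS="V">
      <valency_frames>
        <frame id="f1"/>
        <frame id="f2"/>
      </valency_frames>
    </word>
  </body>
  <tail/>
</valency_lexicon>
"""


def list_words(root):
    """Return (id, lemma, POS, number of frames) for every word entry."""
    rows = []
    for word in root.findall("./body/word"):
        frames = word.findall("./valency_frames/frame")
        rows.append(
            (word.get("id"), word.get("lemma"), word.get("POS"), len(frames))
        )
    return rows


def to_t_lemma(lemma):
    """Assumed mapping to the t-lemma form: spaces become underscores."""
    return lemma.replace(" ", "_")


root = ET.fromstring(SAMPLE)
words = list_words(root)
for word_id, lemma, pos, n_frames in words:
    print(word_id, to_t_lemma(lemma), pos, n_frames)
```

For the real lexicon, ET.parse on the path to vallex.xml followed by getroot() would take the place of ET.fromstring.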

The <frame> element corresponds to one of the valency frames of the lexical unit in question (specified by the attribute lemma). Each valency frame has its identifier stored in the id attribute (this is the identifier which is referred to in the nodes on the t-layer), together with several technical attributes. Each valency frame must be equipped with an example sentence or sentence fragment (<example> element), illustrating the usage of the frame in Czech. The valency frame itself is formally represented as a sequence of valency slots (<element> elements) listed in <frame_elements>.

Each frame slot has a functor (attribute functor, specifying the deep syntactic relation of the slot with respect to its governing lexical unit, such as ACT, ADDR or LOC), a type (attribute type, distinguishing between obligatory and non-obligatory frame slots), and one or more possible surface realizations, represented by a sequence of <form> elements. The same realizations are also recorded in parallel in the attribute form, which uses the so-called compact notation (see the Manual).
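
As a rough illustration of the frame and slot structure, the following Python sketch collects the identifier, the example, and the slots of a single frame. The sample frame is hypothetical: the id "f1", the example text, and the type values "oblig" and "non-oblig" are assumptions, as is the placement of <example> directly under <frame>; the authoritative names and values are those in vallex.dtd. The form attribute with the compact notation is left out of the sample on purpose.

```python
import xml.etree.ElementTree as ET

# Hypothetical frame entry; the id, the example text and the values of the
# type attribute are invented for illustration (check vallex.dtd).
SAMPLE_FRAME = """\
<frame id="f1">
  <example>bát se tmy</example>
  <frame_elements>
    <element functor="ACT" type="oblig"><form/></element>
    <element functor="PAT" type="non-oblig"><form/></element>
  </frame_elements>
</frame>
"""


def summarize_frame(frame):
    """Collect the id, the example and the slots of one <frame> element.

    Each slot is reported as (functor, type, number of <form> children).
    """
    slots = [
        (slot.get("functor"), slot.get("type"), len(slot.findall("form")))
        for slot in frame.findall("./frame_elements/element")
    ]
    return {
        "id": frame.get("id"),
        "example": frame.findtext("example"),
        "slots": slots,
    }


summary = summarize_frame(ET.fromstring(SAMPLE_FRAME))
print(summary)
```

Returning plain tuples and a dictionary keeps the sketch independent of any particular data model; a real application would probably define its own classes for frames and slots.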

There are two types of restrictions on the slot surface form captured in the <form> element: either the restriction corresponds to one of four special cases (typical, elided, recip, or state), or the surface form is expressed using a simplified analytical tree prototype. The tree consists of one or more <node> elements, embedded in each other so that the nesting reflects the tree topology. The constraints on the nodes, such as pos, case, or lemma, are stored as attributes of the respective <node> elements. For instance, a specific prepositional group can be represented as a tree composed of two nodes: the constraint on the upper node is the lemma of the preposition, whereas the constraint on the child node is the case number.
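
The two kinds of restriction can be told apart with a short recursive walk, sketched below in Python. The sketch assumes that the four special cases appear as child elements of <form> named typical, elided, recip, and state; the text above does not say how they are encoded, so this should be checked against vallex.dtd. The sample prepositional group (lemma "na" governing case 4) is likewise a hypothetical illustration.

```python
import xml.etree.ElementTree as ET

# Assumed encoding: special cases are child elements of <form> with these names.
SPECIAL_CASES = ("typical", "elided", "recip", "state")

# Hypothetical prepositional group: the upper node constrains the lemma of
# the preposition, the child node constrains the case.
SAMPLE_FORM = '<form><node lemma="na"><node case="4"/></node></form>'


def describe(node, depth=0):
    """Return one indented line per <node>, listing its attribute constraints."""
    constraints = ", ".join(
        f"{name}={value}" for name, value in sorted(node.attrib.items())
    )
    lines = ["  " * depth + f"node({constraints})"]
    for child in node.findall("node"):
        lines.extend(describe(child, depth + 1))
    return lines


def classify_form(form):
    """Classify a <form> as a special case or as a tree prototype."""
    for child in form:
        if child.tag in SPECIAL_CASES:
            return ("special", child.tag)
    lines = []
    for top_node in form.findall("node"):
        lines.extend(describe(top_node))
    return ("tree", lines)


kind, detail = classify_form(ET.fromstring(SAMPLE_FORM))
print(kind)
print("\n".join(detail))
```

The indentation of the printed lines mirrors the embedding of the <node> elements, so the output can be read directly as the tree prototype.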