VALLEX 1.0 - Logical Structure of the Data


Remark on terminology: The terms used here either belong to the broadly accepted linguistic terminology, or come from the Functional Generative Description (FGD), which we have used as the background theory, or are defined somewhere else in this text.

Warning: The primary goal of this text is to explicitly describe the content of VALLEX 1.0 data from the structural point of view. Linguistic issues requiring a long explanation or discussion are mostly left aside; a full analysis of all issues related to valency goes far beyond the scope of this text. The reader may find many useful references in the papers attached to VALLEX 1.0 (esp. in the Technical Report). VALLEX 1.0 is closely related to the Prague Dependency Treebank (PDT) project, so basic knowledge of the PDT is useful for a full understanding of VALLEX 1.0.

Contents

   1. Word Entries
   2. Lemmas
   3. Lemma Variants
   4. Homonyms
   5. Frame Entries
   6. Valency Frames
   7. Functors
   8. Morphemic Forms
   9. Explicitly Declared Forms
   10. Implicitly Declared Forms
   11. Types of Complementations
   12. Slot Expansion
   13. Frame Attributes
   14. Control
   15. Class
   16. Aspect, Aspectual Counterparts
   17. Idiomatic frames


[Figure: word entry structure in the HTML format]


1. Word Entries

On the topmost level, VALLEX 1.0 is divided into word entries. Each word entry relates to one or more headword lemmas (Sec. 2). The word entry consists of a sequence of frame entries (Sec. 5) relevant for the lemma(s) in question (where each frame entry usually corresponds to one of the lemma's meanings). Information about the aspect (Sec. 16) of the lemma(s) is assigned to each word entry as a whole.

Most of the word entries correspond to lemmas in a simple one-to-one manner, but the following two non-trivial situations (and even combinations of them) appear as well in VALLEX 1.0:

The content of a word entry roughly corresponds to the traditional term of lexeme.

2. Lemmas

By the lemma of a verb we understand the infinitive form of the respective verb, in the case of homonyms (Sec. 4) followed by a Roman numeral in superscript (which is to be considered an inseparable part of the lemma in VALLEX 1.0).

Reflexive particles se or si are parts of the infinitive only if the verb is reflexive tantum, primary (e.g. bát se) as well as derived (e.g. zabít se, šířit se, vrátit se).

3. Lemma Variants

Lemma variants are groups of two (or more) lemmas that are interchangeable in any context without any change of meaning (e.g. dovědět se/dozvědět se). Usually the only difference is a small alternation in the morphological stem, which might be accompanied by a subtle stylistic shift (e.g. myslet/myslit, the latter being bookish). Moreover, although the infinitive forms of the variants differ in spelling, some of their conjugated forms are often identical (mysli (imper.sg.) for both myslet and myslit).

The term 'lemma variants' should not be confused with the term 'synonymy'.

4. Homonyms

There are pairs of word entries in VALLEX 1.0 whose lemmas have the same spelling but differ considerably in meaning (there is no obvious semantic relation between them). They might also differ in their etymology (e.g. nakupovatI - to buy vs. nakupovatII - to heap), aspect (Sec. 16) (e.g. stačitI pf. - to be enough vs. stačitII impf. - to catch up with), or conjugated forms (žilo (past.sg.neut) for žítI - to live vs. žalo (past.sg.neut) for žítII - to mow). Such lemmas (homonyms) are distinguished by Roman numerals in superscript. These numerals should be understood as an inseparable part of the lemma (Sec. 2) in VALLEX 1.0.

Note on terminology: we have adopted the term 'homonyms' from Czech linguistic literature, where it traditionally stands for what was stated above (words identical in the spelling but considerably different in the meaning); in English literature the term 'homographs' is sometimes used to express the same notion.

5. Frame Entries

Each word entry (Sec. 1) consists of a non-empty sequence of frame entries, typically corresponding to the individual meanings (senses) of the headword lemma(s) (from this point of view, VALLEX 1.0 can be classified as a Sense Enumerated Lexicon).

The frame entries are numbered within each word entry; in the VALLEX 1.0 notation, the frame numbers are attached to the lemmas as subscripts.

The ordering of frames is not completely random, but it is not perfectly systematic either. So far it is based only on the following weak intuition: primary and/or the most frequent meanings should go first, whereas rare and/or idiomatic meanings should go last. (We do not guarantee that the ordering of meanings in this version of VALLEX 1.0 exactly matches the frequency of their occurrence in the contemporary language.)

Each frame entry contains a description of the valency frame itself (Sec. 6) and of the frame attributes (Sec. 13).

Note on terminology: The content of 'frame entry' roughly corresponds to the term of lexical unit ('lexie' in Czech terminology).

6. Valency Frames

In VALLEX 1.0, a valency frame is modeled as a sequence of frame slots. Each frame slot corresponds to one (either required or specifically permitted) complementation of the given verb.

Note on terminology: in this text, the term 'complementation' (dependent item) is used in its broad sense, not related to the traditional argument/adjunct (complement/modifier) dichotomy (or, if you want, covering both ends of the dichotomy).

The following attributes are assigned to each slot:

Some slots tend to occur together systematically. In order to capture this type of regularity, we introduced the mechanism of slot expansion (Sec. 12); the full valency frame is obtained after performing these expansions.
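The nesting described in Sections 1, 5 and 6 (word entry, frame entries, frame slots) can be pictured with a minimal Python sketch. All class and field names below are hypothetical and chosen only for illustration; they are not the element or attribute names of the VALLEX 1.0 XML data. The sample entry is invented, and 1-based frame numbering is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FrameSlot:
    functor: str                       # e.g. "ACT", "PAT", "DIR3" (Sec. 7)
    slot_type: str                     # "obl", "opt" or "typ" (Sec. 11)
    forms: Optional[List[str]] = None  # None = implicitly declared forms (Sec. 10)
    abbrev: bool = False               # slot-expansion flag (Sec. 12)


@dataclass
class FrameEntry:
    slots: List[FrameSlot]                          # the valency frame (Sec. 6)
    attributes: dict = field(default_factory=dict)  # frame attributes (Sec. 13)


@dataclass
class WordEntry:
    lemmas: List[str]         # one lemma, or several lemma variants (Sec. 3)
    aspect: str               # assigned to the word entry as a whole (Sec. 16)
    frames: List[FrameEntry]  # non-empty sequence of frame entries (Sec. 5)

    def frame(self, number: int) -> FrameEntry:
        """Return the frame entry with the given frame number (assumed 1-based)."""
        return self.frames[number - 1]


# Invented illustration only; not copied from the VALLEX 1.0 data.
entry = WordEntry(
    lemmas=["myslet", "myslit"],
    aspect="impf.",
    frames=[
        FrameEntry(slots=[
            FrameSlot("ACT", "obl", forms=["1"]),
            FrameSlot("PAT", "obl", forms=["na+4"]),
        ])
    ],
)
```

Keeping `forms` optional mirrors the distinction between explicitly and implicitly declared forms: a slot without a list falls back to whatever its functor implies.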

7. Functors

In VALLEX 1.0, functors (labels of 'deep roles'; similar to theta-roles) are used for expressing types of relations between verbs and their complementations. According to FGD, functors are divided into inner participants (actants) and free modifications (this division roughly corresponds to the argument/adjunct dichotomy). In VALLEX 1.0, we also distinguish an additional group of quasi-valency complementations.

Functors which occur in VALLEX 1.0 are listed in the following tables (for Czech sample sentences see Technical Report, page 43):

Inner participants:
Functor Example sentence
ACT (actor) Peter read a letter.
ADDR (addressee) Peter gave Mary a book.
PAT (patient) I saw him.
EFF (effect) We made her the secretary.
ORIG (origin) She made a cake from apples.

Quasi-valency complementations:
Functor Example sentence
DIFF (difference) The number has swollen by 200.
OBST (obstacle) The boy stumbled over a stump.
INTT (intent) He came there to look for Jane.

Free modifications:
Functor Example sentence
ACMP (accompaniment) Mother came with her children.
AIM (aim) John came to a bakery for a piece of bread.
BEN (benefactive) She made this for her children.
CAUS (cause) She did so since they wanted it.
COMPL (complement) They painted the wall blue.
DIR1 (direction-from) He went from the forest to the village.
DIR2 (direction-through) He went through the forest to the village.
DIR3 (direction-to) He went from the forest to the village.
DPHR (dependent part of a phraseme) Peter talked horse again.
EXT (extent) The temperatures reached an all time high.
HER (heritage) He named the new villa after his wife.
LOC (locative) He was born in Italy.
MANN (manner) They did it quickly.
MEANS (means) He wrote it by hand.
NORM (norm) Peter has to do it exactly according to directions.
RCMP (recompense) She bought a new shirt for 25 $.
REG (regard) With regard to George she asked his teacher for advice.
RESL (result) Mother protects her children from any danger.
SUBS (substitution) He went to the theatre instead of his ill sister.
TFHL (temporal-for-how-long) They interrupted their studies for a year.
TFRWH (temporal-from-when) His bad reminiscences came from this period.
THL (temporal-how-long) We were there for three weeks.
TOWH (temporal-to-when) He put it over to next Tuesday.
TSIN (temporal-since-when) I have not heard about him since that time.
TWHEN (temporal-when) His son was born last year.

Note 1: Besides the functors listed in the tables above, the value DIR also occurs in the VALLEX 1.0 data. It is used only as a special symbol for slot expansion (Sec. 12).

Note 2: The set of functors as introduced in FGD is richer than that shown above; moreover, it is still being elaborated within the Prague Dependency Treebank. We do not use its full (current) set in VALLEX 1.0 for several reasons. Some functors do not occur with a verb at all (e.g. APP - appurtenance, "my.APP dog"). Other functors can occur with a verb, but represent a relation other than dependency (e.g. coordination, "Jim or.CONJ Jack"). Still others can occur with verbs, but their behaviour is entirely independent of the head verb, so they have nothing to do with valency frames (e.g. ATT - attitude, "He did it willingly.ATT").

8. Morphemic Forms

In a sentence, each frame slot can be expressed by a limited set of morphemic means, which we call forms. In VALLEX 1.0, the set of possible forms is defined either explicitly (Sec. 9), or implicitly (Sec. 10). In the former case, the forms are enumerated in a list attached to the given slot. In the latter case, no such list is specified, because the set of possible forms is implied by the functor of the respective slot (in other words, all forms possibly expressing the given functor may appear).

9. Explicitly Declared Forms

The list of forms attached to a frame slot may contain values of the following types:

10. Implicitly Declared Forms

If no forms are listed explicitly for a frame slot, then the list of possible forms implicitly results from the functor of the slot according to the following (yet incomplete) table:
Functor  Forms
LOC      adverb, na+6, v+6, u+2, před+7, za+7, nad+7, pod+7, okolo+2, kolem+2, při+6, vedle+2, mezi+7, mimo+4, naproti+3, podél+2 ...
MANN     adverb, 7, na+4, ...
DIR3     adverb, na+4, v+4, do+2, před+4, za+4, nad+4, pod+4, vedle+2, mezi+4, po+4, okolo+2, kolem+2, k+3, mimo+4, naproti+3 ...
DIR1     adverb, z+2, od+2, zpod+2, zpoza+2, zpřed+2 ...
DIR2     adverb, 7, přes+4, podél+2, mezi+7, ...
TWHEN    adverb, 2, 4, 7, před+7, za+4, po+6, při+6, za+2, o+6, k+3, mezi+7, v+4, na+4, na+6, kolem+2, okolo+2, ...
THL      adverb, 4, 7, po+4, za+4, ...
EXT      adverb, 4, na+4, kolem+2, okolo+2, ...
REG      adverb, 7, na+6, v+6, k+3, při+6, ohledně+2, nad+7, na+4, s+7, u+2, ...
TFRWH    z+2, od+2, ...
AIM      k+3, na+4, do+2, pro+4, proti+3, aby, ať, že, ...
TOWH     na+4 ...
TSIN     od+2 ...
TFHL     na+4, pro+4, ...
NORM     podle+2, v duchu+2, po+6, ...
MEANS    7, v+6, na+6, po+6, z+2, že, s+7, na+4, za+4, pod+7, do+2, ...
CAUS     7, za+4, z+2, kvůli+2, pro+4, k+3, na+4, že, ...
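The fallback from explicit to implicit forms can be sketched as a simple lookup. The following Python fragment is only an illustration under stated assumptions: the function name `possible_forms` and the dictionary `IMPLICIT_FORMS` are hypothetical, and the dictionary holds just a few rows of the table above (which is itself marked as incomplete, so the real lists are open-ended).

```python
from typing import List, Optional

# Excerpt of the (incomplete) table of implicitly declared forms.
IMPLICIT_FORMS = {
    "TFRWH": ["z+2", "od+2"],
    "TOWH": ["na+4"],
    "TSIN": ["od+2"],
    "TFHL": ["na+4", "pro+4"],
}


def possible_forms(functor: str, explicit: Optional[List[str]] = None) -> List[str]:
    """Return the explicit list if one is given, else the forms implied by the functor.

    Both None and an empty list are treated as "no forms listed explicitly".
    """
    if explicit:
        return list(explicit)
    if functor not in IMPLICIT_FORMS:
        raise KeyError(f"no implicit forms known for functor {functor!r}")
    return list(IMPLICIT_FORMS[functor])
```

For instance, `possible_forms("TSIN")` yields the single listed form `od+2`, whereas a slot that carries its own list, as in `possible_forms("PAT", ["4"])`, keeps that list unchanged.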

11. Types of Complementations

Within the FGD framework, valency frames (in a narrow sense) consist only of inner participants (both obligatory and optional, 'obl' and 'opt' for short) and obligatory free modifications; the dialogue test was introduced by Panevová as a criterion for obligatoriness. In VALLEX 1.0, valency frames are enriched with quasi-valency complementations. Moreover, a few non-obligatory free modifications occur in valency frames too, since they are typically ('typ') related to some verbs (or even to whole classes of them) and not to others. (The other free modifications can occur with the given verb too, but are not contained in the valency frame, as mentioned above (Sec. 7).)

The attribute 'type' is attached to each frame slot and can have one of the following values: 'obl' or 'opt' for inner participants and quasi-valency complementations, and 'obl' or 'typ' for free modifications.

Note: It should be emphasized that in this context the term obligatoriness is related to the presence of the given complementation in the deep (tectogrammatical) structure, and not to its (surface) deletability in a sentence (moreover, the relation between deep obligatoriness and surface deletability is not at all straightforward in Czech).
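The rule relating the 'type' attribute to the functor group can be expressed as a small consistency check. The sketch below is hypothetical: the functor groups are copied from the tables in Sec. 7, while the function names `allowed_types` and `is_valid_slot` are assumptions introduced only for this illustration.

```python
INNER_PARTICIPANTS = {"ACT", "ADDR", "PAT", "EFF", "ORIG"}
QUASI_VALENCY = {"DIFF", "OBST", "INTT"}
FREE_MODIFICATIONS = {
    "ACMP", "AIM", "BEN", "CAUS", "COMPL", "DIR1", "DIR2", "DIR3", "DPHR",
    "EXT", "HER", "LOC", "MANN", "MEANS", "NORM", "RCMP", "REG", "RESL",
    "SUBS", "TFHL", "TFRWH", "THL", "TOWH", "TSIN", "TWHEN",
}


def allowed_types(functor: str) -> set:
    """Return the set of 'type' values permitted for a slot with this functor."""
    if functor in INNER_PARTICIPANTS or functor in QUASI_VALENCY:
        return {"obl", "opt"}
    if functor in FREE_MODIFICATIONS:
        return {"obl", "typ"}
    raise ValueError(f"unknown functor: {functor!r}")


def is_valid_slot(functor: str, slot_type: str) -> bool:
    """Check that the slot type is compatible with the functor's group."""
    return slot_type in allowed_types(functor)
```

Under this check, an ACT slot typed 'typ' or a LOC slot typed 'opt' would be flagged as inconsistent.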

12. Slot Expansion

Some slots tend to occur together systematically. For instance, verbs of motion can often be modified by direction-to and/or direction-through and/or direction-from modifiers. We decided to capture this type of regularity by introducing an abbreviation flag for a slot. If this flag is set (in the VALLEX 1.0 notation it is marked with an upward arrow), the full valency frame is obtained after slot expansion.

If one of the frame slots is marked with the upward arrow (in the XML data, attribute 'abbrev' is set to 1), then the full valency frame is obtained by substituting this slot with a sequence of slots as follows:
↑DIR(typ) → DIR1(typ) DIR2(typ) DIR3(typ)
↑DIR1(obl) → DIR1(obl) DIR2(typ) DIR3(typ)
↑DIR2(obl) → DIR1(typ) DIR2(obl) DIR3(typ)
↑DIR3(obl) → DIR1(typ) DIR2(typ) DIR3(obl)
↑TSIN(obl) → TSIN(obl) THL(typ) TTIL(typ)
↑THL(typ) → TSIN(typ) THL(typ) TTIL(typ)
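The substitution rules above amount to a table lookup applied to every flagged slot. The following Python sketch shows one possible way to perform the expansion; the representation of a slot as a (functor, type, abbrev) triple and the names `EXPANSIONS` and `expand_frame` are assumptions made for illustration, not part of the VALLEX 1.0 data format.

```python
from typing import List, Tuple

Slot = Tuple[str, str]  # (functor, type)

# The six expansion rules listed above.
EXPANSIONS = {
    ("DIR", "typ"): [("DIR1", "typ"), ("DIR2", "typ"), ("DIR3", "typ")],
    ("DIR1", "obl"): [("DIR1", "obl"), ("DIR2", "typ"), ("DIR3", "typ")],
    ("DIR2", "obl"): [("DIR1", "typ"), ("DIR2", "obl"), ("DIR3", "typ")],
    ("DIR3", "obl"): [("DIR1", "typ"), ("DIR2", "typ"), ("DIR3", "obl")],
    ("TSIN", "obl"): [("TSIN", "obl"), ("THL", "typ"), ("TTIL", "typ")],
    ("THL", "typ"): [("TSIN", "typ"), ("THL", "typ"), ("TTIL", "typ")],
}


def expand_frame(frame: List[Tuple[str, str, bool]]) -> List[Slot]:
    """Replace every slot whose abbrev flag is set by its expansion sequence."""
    full: List[Slot] = []
    for functor, slot_type, abbrev in frame:
        if abbrev:
            # A flagged slot without a matching rule raises KeyError.
            full.extend(EXPANSIONS[(functor, slot_type)])
        else:
            full.append((functor, slot_type))
    return full
```

For example, a hypothetical motion-verb frame `[("ACT", "obl", False), ("DIR3", "obl", True)]` expands to ACT(obl) DIR1(typ) DIR2(typ) DIR3(obl), while slots without the flag pass through unchanged.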

13. Frame Attributes

In VALLEX 1.0, frame attributes (more exactly, attribute-value pairs) are either obligatory or optional. The former have to be filled in every frame. The latter might be empty, either because they are not applicable (e.g. some verbs have no aspectual counterparts), or because the annotation was not finished (e.g. attribute class (Sec. 15) is filled in only roughly one third of frames).

Obligatory frame attributes:

Optional frame attributes:

14. Control

The term 'control' relates in this context to a certain type of predicates (verbs of control) and two coreferential expressions, a 'controller' and a 'controllee'. In VALLEX 1.0, control is captured in the data only in the situation where a verb has an infinitive modifier (regardless of its functor). Then the controllee is an element that would be a 'subject' of the infinitive (which is structurally excluded on the surface), and the controller is the co-indexed expression. In VALLEX 1.0, the type of control is stored in the frame attribute 'control' as follows:

Examples:

Note on terminology: in English literature the terms 'equi verbs' and 'raising verbs' are used in a similar context.

15. Class

Some frames are assigned semantic classes like 'motion', 'exchange', 'communication', 'perception', etc. However, we admit that this classification is tentative and should be understood merely as an intuitive grouping of frames, rather than a properly defined ontology.

The motivation for introducing such semantic classification in VALLEX 1.0 was the fact that it simplifies systematic checking of consistency and allows for making more general observations about the data.

16. Aspect, Aspectual Counterparts

Czech distinguishes between perfective verbs (in VALLEX 1.0 marked as 'pf.' for short) and imperfective verbs (marked as 'impf.'); this characteristic is called aspect. In VALLEX 1.0, the value of aspect is attached to each word entry as a whole (i.e., it is the same for all its frames and it is shared by the lemma variants, if any).

Some verbs (e.g. informovat, charakterizovat) can be used in different contexts either as perfective or as imperfective (obouvidová slovesa, 'biasp.' for short).

Within imperfective verbs, there is a subclass of iterative verbs (iter.). Czech iterative verbs are derived more or less regularly by affixes such as -va- or -iva-, and express extended and repetitive actions (e.g. čítávat, chodívat). In VALLEX 1.0, iterative verbs containing the double affix -va- (e.g. chodívávat) are completely disregarded, whereas the remaining iterative verbs occur as aspectual counterparts in frame entries of the corresponding non-iterative verbs (but still have no word entries of their own).

A verb in its particular meaning can have aspectual counterpart(s) - a verb the meaning of which is almost the same except for the difference in aspect (that is why the counterparts constitute a single lexical unit on the tectogrammatical level of FGD; however, each of them has its own word entry in VALLEX 1.0, because they have different morphemic forms). The aspectual counterpart(s) need not be the same for all the meanings of the given verb, e.g., odpovědět is a counterpart of odpovídat - to answer, but not of odpovídat - to correspond. Therefore the aspectual counterparts (if any) are listed in frame attribute 'asp. counterparts' in VALLEX 1.0. Moreover, for perfective or imperfective counterparts, not only the lemmas are specified within the list, but (more specifically) also the frame numbers of the counterpart frames (which is of course not the case for the iterative counterparts, for they have no word entries of their own as stated above).

One frame might have more than one counterpart for two reasons. Either there are two counterparts with the same aspect (impf. působit and impf. způsobovat for pf. způsobit), or there are two counterparts with different aspects (impf. scházet, pf. sejít, iter. scházívat).
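The content of the 'asp. counterparts' attribute described in this section can be pictured as a list of small records. The Python sketch below is hypothetical: the class name `Counterpart`, its fields and the method `is_well_formed` are assumptions for illustration, and the frame number in the sample data is invented. It encodes only the rule stated above, namely that perfective and imperfective counterparts carry a frame number while iterative ones do not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Counterpart:
    aspect: str                         # "pf.", "impf." or "iter."
    lemma: str
    frame_number: Optional[int] = None  # frame of the counterpart's word entry

    def is_well_formed(self) -> bool:
        """Iterative counterparts have no word entry, hence no frame number."""
        if self.aspect == "iter.":
            return self.frame_number is None
        if self.aspect in ("pf.", "impf."):
            return self.frame_number is not None
        return False


# Invented illustration: counterparts attached to one frame of pf. sejít.
counterparts = [
    Counterpart("impf.", "scházet", frame_number=1),  # frame number is made up
    Counterpart("iter.", "scházívat"),
]
```

A consistency pass over the data could then simply require `all(c.is_well_formed() for c in counterparts)`.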

17. Idiomatic frames

When building VALLEX 1.0, we focused mainly on primary or usual meanings of verbs. We also noted many frames corresponding to peripheral usages of verbs; however, their coverage in VALLEX is not exhaustive. We call such frames idiomatic and mark them with the label 'idiom'. An idiomatic frame is tentatively characterized either by a substantial shift in meaning (with respect to the primary sense), or by a small and strictly limited set of possible lexical values in one of its complementations, or by the occurrence of other types of irregularity or anomaly.