VALLEX 2.5 – Logical Structure of the Lexicon

The primary goal of the following text is to briefly describe the content of the VALLEX 2.5 data from a structural point of view. Linguistic issues requiring extensive explanation or discussion are mostly left aside. However, a more detailed description (with additional relevant references) can be found in Žabokrtský, 2005. Some theoretical issues concerning valency are summarized in Lopatková, 2003.

As for terminology, the terms used here either belong to broadly accepted linguistic terminology, or come from the Functional Generative Description (FGD, which we have used as the background theory), or are defined elsewhere in this text.


[Figure: sample lexicon entry with annotated labels]

1 Lexemes

At the highest level, VALLEX 2.5 is composed of lexemes. A lexeme is understood as a two-fold abstract entity, see Cruse, 1986: it associates a set of possible lexical forms (by which the presence of the lexeme is manifested in an utterance, Section 2) with a set of lexical units (complexes of syntactic and semantic features, LUs for short, Section 3). In simpler words, lexical forms can be viewed as the conjugated forms of a given verbal lexeme, whereas each LU corresponds roughly to the lexeme used in a specific sense and with a specific syntactic combinatorial potential.

2 Lexical Forms and Lemmas

In dictionaries, the set of all possible lexical forms of a given lexeme is usually represented only by the infinitive form, called the lemma.

Lemma in VALLEX 2.5 should be considered as a complex structure:

In VALLEX 2.5, there are typically two or more lemmas listed at the beginning of the lexeme entry. This follows the FGD principle of treating aspectual counterparts (perfective and imperfective verbs expressing the same lexical meaning, Section 2.2) as manifestations of the same lexeme. Another reason for several lemmas being present in the same lexeme might be the existence of orthographic variants (Section 2.3).

2.1 Reflexive Lemmas

In VALLEX 2.5, two types of reflexive constructions are distinguished:

2.2 Aspectual Counterparts

Imperfective and perfective verb forms are distinguished in Czech (as well as the specific subclasses of iterative verbs and so-called biaspectual verbs); this characteristic is called aspect.

In VALLEX 2.5, the value of aspect is attached to each lemma as a superscript label:

There are three ways in which aspectual counterparts (verbs with the same or very similar lexical meaning that differ in aspect) are formed in Czech (sorted according to productivity):

Aspectual counterparts of the first and third type constitute a single lexeme in VALLEX 2.5, as e.g. in the case of nasedat impf, nasednout pf, nasedávat iter – to get on.

As already mentioned, an LU typically shares all its lemmas with the other LUs in the lexeme in which it is embedded. However, there are exceptions: the aspectual counterpart(s) need not be the same for all LUs of a particular lexeme. For example, odpovědět pf is a counterpart of odpovídat impf in the sense ‘to answer’, but not in the sense ‘to correspond’. In such cases, the set of applicable lemmas is specified directly for the LU; it is introduced by the abbreviation jen (‘only’) and overrides the set of lemmas specified for the whole lexeme.

There might be more than one lemma with the same aspect in a lexeme without these lemmas being lemma variants. In that case, the aspect flags are distinguished by Arabic numerals, e.g. in the lexeme osušovat impf1, osoušet impf2, osušit pf – to dry up, to wipe, or odřezávat impf, odříznout pf1, odřezat pf2 – to cut off (unique aspect flags are necessary because they also serve for co-indexing the lemmas with example sentences illustrating the usage of the lexeme).
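The uniqueness requirement on aspect flags can be checked mechanically. The following is only a minimal sketch: it assumes a lexeme's lemmas are available as (lemma, flag) pairs, with lemma variants already merged into a single entry, and the function name `duplicate_aspect_flags` is hypothetical rather than part of any VALLEX tool.

```python
from collections import Counter
from typing import List, Tuple


def duplicate_aspect_flags(lemmas: List[Tuple[str, str]]) -> List[str]:
    """Return the aspect flags used by more than one lemma, sorted."""
    # Assumption: each pair is (lemma, aspect flag) and lemma variants
    # have already been merged into a single entry.
    counts = Counter(flag for _, flag in lemmas)
    return sorted(flag for flag, n in counts.items() if n > 1)


# Flags numbered as in the text: every flag identifies exactly one lemma.
numbered = [("osušovat", "impf1"), ("osoušet", "impf2"), ("osušit", "pf")]
# The same lexeme type without numbering: "pf" would be ambiguous.
unnumbered = [("odřezávat", "impf"), ("odříznout", "pf"), ("odřezat", "pf")]

print(duplicate_aspect_flags(numbered))    # []
print(duplicate_aspect_flags(unnumbered))  # ['pf']
```

An empty result means that every example sentence co-indexed with a flag points to exactly one lemma.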

Some verbs (e.g. informovat – to inform, charakterizovat – to characterize) can be used in different contexts either as imperfective or as perfective. They are called biaspectual verbs.

Within imperfective verbs, there is a subclass of iterative verbs (iter.). Czech iterative verbs are derived in a more or less regular way by affixes such as -va- or -íva-, and express extended and repetitive actions (e.g. číst – to read → čítávat, chodit – to walk → chodívat). In VALLEX 2.5, iterative verbs containing a double affix -va- (e.g. chodívávat) are completely disregarded, whereas the remaining iterative verbs occur as headword lemmas of the relevant lexeme.

2.3 Lemma Variants

Lemma variants (many of which are just spelling variants, i.e. orthographic variants) are groups of two or more lemmas that are interchangeable in any context without any change of meaning (e.g. dovědět se/dozvědět se – to learn). Usually, the only difference is a small alternation in the morphological stem, which might be accompanied by a subtle stylistic shift (e.g. myslet/myslit – to think, the latter being bookish). Moreover, although the infinitive forms of the variants differ in spelling, some of their conjugated forms might be identical (mysli (imper.sg.) for both myslet and myslit).

There are rare exceptions in which only one of the variants can be used. For example, plavat and plovat – to swim are usually considered to be variants (see, e.g., SSJČ, 1964), although in some contexts only plavat, in the sense ‘to flounder’, can be used (plavat při zkoušce, *plovat při zkoušce). The applicable lemmas must then be listed for the specific LU, as in any other case in which an LU imposes a further limitation on the set of lexical forms.

2.4 Homographs

Homographs are lemmas that are ‘accidentally’ identical in spelling but considerably different in meaning (there is no obvious semantic relation between them). They might also differ in etymology (e.g. nakupovat I – to buy vs. nakupovat II – to heap), aspect (Section 2.2) (e.g. stačit I pf – to be enough vs. stačit II impf – to catch up with), or conjugated forms (žilo (past.sg.fem) for žít I – to live vs. žalo (past.sg.fem) for žít II – to mow). In VALLEX 2.5, such lemmas are distinguished by Roman numerals in the subscript. These numerals should be understood as inseparable parts of VALLEX 2.5 lemmas.

3 Lexical Units

Each lexeme is formed by a set of lexical units that are assigned to the respective lexical forms (represented by their lemmas). Following Cruse, 1986, we understand lexical units (LUs) as “form-meaning complexes with (relatively) stable and discrete semantic properties”. Roughly speaking, an LU can be understood as ‘a given word in the given sense’. In the Czech tradition, this concept of LU corresponds to Filipec’s ‘monosemic lexeme’, see Filipec and Čermák, 1985.

Within each lexeme in VALLEX 2.5, LUs are numbered with Arabic numerals. In the printed and HTML versions of the lexicon, each LU entry starts with its number.

The ordering of lexical units is not completely random, but it is not perfectly systematic either. So far, it is based only on the following weak intuition: the primary and/or the most frequent meanings should go first, whereas rare and/or idiomatic meanings should go last. (We do not guarantee that the ordering of LUs in VALLEX 2.5 exactly matches their frequency in the contemporary language.)

By default, an LU ‘inherits’ all lemmas specified for the lexeme in which it is embedded. However, it might happen that not all the forms specified for the whole lexeme are applicable to a given LU. In such cases, the list of applicable lemmas is specified for the given LU separately.
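The inheritance of lemmas with an LU-level override can be illustrated by a small data model. This is only a sketch under stated assumptions: the class and field names (`Lexeme`, `LexicalUnit`, `only_lemmas`) are hypothetical and do not reflect the actual VALLEX 2.5 data format, and the LU numbers and glosses are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LexicalUnit:
    number: int                               # LUs are numbered within a lexeme
    gloss: str                                # informal sense label (illustrative)
    only_lemmas: Optional[List[str]] = None   # LU-specific lemma list, if any


@dataclass
class Lexeme:
    lemmas: List[str]                         # lemmas given for the whole lexeme
    units: List[LexicalUnit] = field(default_factory=list)

    def applicable_lemmas(self, unit: LexicalUnit) -> List[str]:
        """Return the LU-specific list if present, else the inherited one."""
        if unit.only_lemmas is not None:
            return list(unit.only_lemmas)
        return list(self.lemmas)


# Example from Section 2.2: the perfective counterpart applies only to the
# sense 'to answer', not to the sense 'to correspond'.
lexeme = Lexeme(
    lemmas=["odpovídat", "odpovědět"],
    units=[
        LexicalUnit(1, "to answer"),
        LexicalUnit(2, "to correspond", only_lemmas=["odpovídat"]),
    ],
)

for unit in lexeme.units:
    print(unit.number, unit.gloss, lexeme.applicable_lemmas(unit))
```

Storing the override as an optional field keeps the default case (full inheritance) free of redundant data.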

The information available about each LU entry in VALLEX 2.5 is captured by obligatory and optional attributes. The former have to be filled in for every LU. The latter might be empty, either because they are not applicable (e.g. no control can be applicable for verbs without infinitive complementations), or because the annotation has not been finished yet (e.g. the attribute class; Section 5.4).

Obligatory LU attributes:

Optional LU attributes:

4 Valency Frames

The core valency information is encoded in the valency frame. Within the FGD framework, valency frames (in a narrow sense) consist only of inner participants (both obligatory and optional) and obligatory free modifications, Panevová, 1974; Panevová, 1994. In VALLEX 2.5, valency frames are enriched with quasi-valency complementations. Moreover, a few non-obligatory free modifications occur in valency frames too, since they are typically related to some verbs (or even to whole classes of them) and not to others. (The other free modifications can occur with the given verb too, but they are not contained in the valency frame as their presence in a sentence is not understood as syntactically conditioned in FGD.)

In VALLEX 2.5, a valency frame is modeled as a sequence of frame slots. Each frame slot corresponds to one (either required or specifically permitted) complementation of the given verb.

Note on terminology: in this text, the term ‘complementation’ (dependent item) is used in its broad sense, not related to the traditional argument/adjunct (complement/modifier) dichotomy.

The following attributes are assigned to each slot:

Some slots tend to occur systematically together. In order to capture this type of regularity, we have introduced the mechanism of slot expansion, Section 4.4 (the full valency frame is obtained after performing these expansions).

4.1 Functors

In VALLEX 2.5, functors (labels for ‘deep roles’; similar to theta-roles) are used for expressing types of relations between verbs and their complementations. According to FGD, functors are divided into inner participants (actants) and free modifications (this division roughly corresponds to the argument/adjunct dichotomy), see Panevová, 1974; Panevová, 1994. In VALLEX 2.5, we also distinguish an additional group of quasi-valency complementations, see esp. Lopatková and Panevová, 2005.

Functors that occur in VALLEX 2.5 are listed in the following tables:

Inner participants:

Quasi-valency complementations:

Free modifications:

Note 1: Besides the functors listed in the tables above, the value DIR also occurs in the VALLEX 2.5 data. It is used only as a special symbol for slot expansion (Section 4.4).

Note 2: The set of functors as introduced in FGD and used in the Prague Dependency Treebank is richer than that shown above, see Mikulová et al., 2006. We do not use its full (current) set in VALLEX 2.5 for several reasons. Some functors do not occur with verbs at all (e.g. MAT – material, partitive, as in sklenice piva.MAT – glass of beer). Other functors can occur with verbs but represent relations other than dependency (e.g. coordination, Jim nebo.CONJ Jack – Jim or Jack). Still others can occur with verbs, but their behavior is entirely independent of the head verb, so they have nothing to do with valency frames (e.g. ATT – attitude, udělal to dobrovolně.ATT – he did it willingly).

4.2 Morphemic Forms

In a sentence, each frame slot can be expressed by a limited set of morphemic means, which we call forms. In VALLEX 2.5, the set of possible forms (assuming an active verb form) is defined either explicitly or implicitly.

In the first case (explicitly declared forms), the forms are enumerated in a list attached as a subscript to the given slot (in the case of arguments and quasi-valency complementations, no other forms can be used; in the case of free modifiers, the possible forms are not necessarily limited to those given in the list).

In the second case (implicitly declared forms), no such list is specified because the set of possible forms is implied by the functor of the respective slot (in other words, all forms possibly expressing the given functor may appear).
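The two cases can be sketched as a simple lookup. The sketch rests on several assumptions: the function name `possible_forms`, the functor label `FREE_MOD`, and the form labels `form-a` and `form-b` are placeholders invented for illustration, and the implicit table is passed in as a parameter rather than taken from the lexicon documentation. Only the form label `s+7` appears elsewhere in this text.

```python
from typing import Dict, List, Optional


def possible_forms(
    functor: str,
    explicit_forms: Optional[List[str]],
    implicit_table: Dict[str, List[str]],
) -> List[str]:
    """Return the forms declared for a slot.

    If a list is attached to the slot, it is used (explicit declaration).
    Otherwise the forms follow from the functor (implicit declaration).
    """
    if explicit_forms:
        return list(explicit_forms)
    return list(implicit_table.get(functor, []))


# Placeholder table: functor and form labels here are NOT from VALLEX.
implicit_table = {"FREE_MOD": ["form-a", "form-b"]}

print(possible_forms("ADDR", ["s+7"], implicit_table))   # ['s+7']
print(possible_forms("FREE_MOD", None, implicit_table))  # ['form-a', 'form-b']
```

Note that for free modifiers an explicit list is not necessarily exhaustive, so a consumer of such a function should not treat its result as a closed set in that case.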

4.2.1 Explicitly Declared Forms

The list of forms attached to a frame slot may contain values of the following types:

4.2.2 Implicitly Declared Forms

If no forms are listed explicitly for a frame slot, then the list of possible forms implicitly results from the functor of the slot according to the following (yet incomplete) lists:

4.3 Types of Complementations

Within the FGD framework, valency frames (in a narrow sense) consist only of inner participants (both obligatory and optional) and obligatory free modifications.

As a criterion for obligatoriness, the dialogue test was introduced by Panevová in Panevová, 1974, see also Sgall, Hajičová, and Panevová, 1986. It should be emphasized that in this context the term obligatoriness is related to the presence of the given complementation in the deep (tectogrammatical) structure, and not to its (surface) deletability in a sentence (moreover, the relation between deep obligatoriness and surface deletability is not at all straightforward in Czech).

In VALLEX 2.5, valency frames are enriched with quasi-valency complementations. Moreover, a few non-obligatory free modifications occur in valency frames too, since they are typically related to some verbs (or even to whole classes of them) and not to others.

The attribute type is attached to each frame slot and can have one of the following values: obl or opt for inner participants and quasi-valency complementations, and obl or typ for free modifications.
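The permitted combinations of complementation group and type value can be written down as a small validity check. This is a sketch only: the group labels `inner`, `quasi`, and `free` and the function name `is_valid_type` are hypothetical; the type values `obl`, `opt`, and `typ` are the ones named above.

```python
# Allowed values of the attribute `type`, keyed by complementation group.
# The group labels are assumptions made for this illustration.
ALLOWED_TYPES = {
    "inner": {"obl", "opt"},   # inner participants
    "quasi": {"obl", "opt"},   # quasi-valency complementations
    "free": {"obl", "typ"},    # free modifications
}


def is_valid_type(group: str, slot_type: str) -> bool:
    """Check whether a type value is permitted for a complementation group."""
    return slot_type in ALLOWED_TYPES.get(group, set())


print(is_valid_type("inner", "opt"))  # True
print(is_valid_type("free", "typ"))   # True
print(is_valid_type("free", "opt"))   # False
```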

4.4 Slot Expansion

Some slots tend to occur systematically together. For instance, verbs of motion can often be modified by a direction-to and/or direction-through and/or direction-from modifier. We decided to capture this type of regularity by introducing an abbreviation flag for a slot. If this flag is set (in the VALLEX 2.5 notation it is marked with an upward arrow), the full valency frame is obtained after slot expansion.
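The expansion mechanism itself can be sketched as a substitution over the sequence of slots. The sketch makes several assumptions: the `Slot` class, the `expand` field standing for the abbreviation flag, and the function `expand_frame` are hypothetical, and the rule table below uses placeholder functor names and types. It does not reproduce the actual VALLEX 2.5 expansion rules.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Slot:
    functor: str
    type: str
    expand: bool = False   # the abbreviation flag ("upward arrow")


def expand_frame(frame: List[Slot], rules: Dict[str, List[Slot]]) -> List[Slot]:
    """Replace every flagged slot by the slot sequence given in `rules`."""
    full: List[Slot] = []
    for slot in frame:
        if slot.expand:
            # Raises KeyError if no rule exists for the flagged functor.
            full.extend(rules[slot.functor])
        else:
            full.append(slot)
    return full


# Placeholder rule: names and types are illustrative, not the VALLEX rules.
rules = {
    "DIR": [
        Slot("DIR-from", "typ"),
        Slot("DIR-through", "typ"),
        Slot("DIR-to", "typ"),
    ]
}

frame = [Slot("ACT", "obl"), Slot("DIR", "typ", expand=True)]
full_frame = expand_frame(frame, rules)
print([s.functor for s in full_frame])
```

Keeping the rules in a separate table means that the abbreviated frames stay compact while the expansion can be changed in one place.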

If one of the frame slots is marked with the upward arrow, then the full valency frame will be obtained after substituting this slot with a sequence of slots as follows:

5 Optional LU Attributes

5.1 Control

The term control (abbr. control) relates in this context to a certain type of predicates (verbs of control) and two coreferential expressions, a ‘controller’ and a ‘controllee’, see also Panevová, 1996. In VALLEX 2.5, control is captured in the data only in the situation in which a verb has an infinitive modifier (regardless of its functor). Then the controllee is an element that would be a ‘subject’ of the infinitive (which is structurally excluded on the surface), and controller is the co-indexed expression. In VALLEX 2.5, the type of control is stored in the frame attribute ‘control’ as follows:

Examples:

5.2 Reflexivity

The optional attribute reflexivity (abbr. rfl) indicates possible syntactic functions of the reflexive particles/pronouns se  or si.

The reflexive particles/pronouns se  or si  are used in Czech as formal means expressing the following syntactic constructions:

Note that the attribute reflexivity does not cover reflexive verb forms in which the reflexive particles se or si are part of the infinitive form. This holds both for true reflexives (e.g. bát se – to fear, smát se – to laugh) and for derived reflexives (e.g. odpovídat se – to account, šířit se – to spread, vrátit se – to return), as already discussed in Section 2.1. It does not cover the reciprocal function of the pronouns se or si either (see Section 5.3).

5.3 Reciprocity

Reciprocity is understood as the possibility for (two or more) valency complementations to stand in a relation to each other that may be viewed symmetrically (so that their roles are interchangeable).

In Czech, if the Actor and some other complementation are reciprocal, then the reflexive verb form is used and these two complementations are expressed either as a coordinated nominal group (as in Petr a Marie se hádali – Peter and Mary argued (with one another)), or as a plural noun (přátelé se navštěvují – friends visit each other), possibly with additional adverbs such as spolu, navzájem, ….

If the Actor is not involved, the reciprocity may follow from the plural form or coordination (with no other formal sign), as in seznámil je – he introduced them (to each other).

The possibility of reciprocal usage is indicated in the attribute reciprocity (rcp for short), the value of which is a pair (or triple) of the functors involved, e.g. ACT-ADDR for hádat se – to argue, neustále se spolu hádali – they argued with each other all the time; or ACT-ADDR-PAT for mluvit – to talk, mluví spolu o sobě – they talk with each other about themselves.
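A value of the rcp attribute written in this hyphenated notation can be split into its functors as follows. This is a minimal sketch, assuming that the value is available as a plain string and that functor labels themselves contain no hyphens; the function name `parse_rcp` is hypothetical.

```python
from typing import Tuple


def parse_rcp(value: str) -> Tuple[str, ...]:
    """Split an rcp value such as 'ACT-ADDR' into its functors."""
    functors = tuple(part.strip() for part in value.split("-"))
    # The text describes the value as a pair or a triple of functors.
    if len(functors) not in (2, 3) or not all(functors):
        raise ValueError(f"expected a pair or triple of functors, got {value!r}")
    return functors


print(parse_rcp("ACT-ADDR"))      # ('ACT', 'ADDR')
print(parse_rcp("ACT-ADDR-PAT"))  # ('ACT', 'ADDR', 'PAT')
```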

In the case of derived reflexive lexemes of inherently reciprocal verbs (with an obligatory complementation in the form s+7), the LUs of both the irreflexive and the reflexive lexeme are assigned the attribute rcp.

5.4 Semantic Class

Semantic classes are assigned to a significant part of the lexical units (2,903 LUs out of 6,460, i.e. 45% of all LUs). These classes were built strictly in a ‘bottom-up’ way, by grouping LUs with similar syntactic properties and with respect to their semantics. The following 22 semantic classes were established:

We admit that this classification is tentative and should be understood merely as an intuitive gathering of frames, rather than a properly defined ontology. The motivation for introducing such semantic classification in VALLEX 2.5 was the fact that it simplifies systematic checking of consistency and allows for making more general observations about the data.

5.5 Idioms

When building VALLEX, we have focused mainly on primary or usual meanings of verbs. We also noted many LUs corresponding to peripheral usages of verbs. However, their coverage in VALLEX might not be complete. We call such LUs idiomatic and mark them with the label ‘idiom’. An idiomatic frame is tentatively characterized either by a substantial shift in meaning (with respect to the primary sense), or by a small and strictly limited set of possible lexical values in one of its complementations, or by occurrence of another type of irregularity or anomaly.


References

