Documentation for CzeDLex 1.0

Introduction

CzeDLex is a Lexicon of Czech Discourse Connectives, originally (2015 – 2017) developed within the COST-cz project TextLink-cz, and later (2019 – 2021) within the project Shallow discourse parsing in Czech (GAČR GA19-03490S).

CzeDLex 0.5 (the pilot version) was published on December 24, 2017 in the Lindat/Clarin repository (also available on-line).
CzeDLex 0.6 (the first update) was published on December 19, 2019 in the Lindat/Clarin repository (also available on-line).
CzeDLex 0.7 (the second update) was published on December 24, 2020 in the Lindat/Clarin repository (also available on-line).
CzeDLex 1.0 (the third update) is to be published in December 2021 in the Lindat/Clarin repository (also available on-line).

For further updates, see the web pages of the current development version of CzeDLex.

The lexicon contains connectives extracted partially automatically from two large corpora manually annotated with discourse relations, and from a smaller body of additional material, also manually annotated with discourse relations:

  • the primary resource: the Prague Discourse Treebank 2.0 (PDiT),
  • (since version 0.7) a supplementary resource: the Czech part of the Prague Czech–English Dependency Treebank (PCEDT-cz) with discourse annotation projected from the Penn Discourse Treebank 3.0,
  • and (since version 1.0) a supplementary resource: a thousand sentences selected from various fiction novels (Ludvík Souček: Cesta slepých ptáků; Martin Reiner: Lucka, Maceška a já; Pavel Kohout: Tango mortale) and transcriptions of public speeches (TED Talks 1927, 1971 and 1978; Václav Havel: a speech at the opening ceremony of the conference Forum 2000, 10.10.2010).

The lexicon entries have been manually checked and supplemented with additional information and English translations.

License and Availability

CzeDLex is publicly available under the Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

CzeDLex is available in two formats, PML and HTML:

PML

The Prague Markup Language (PML) is the primary XML format of the lexicon. The lexicon is downloadable from the Lindat/Clarin repository (available versions: 0.5, 0.6, 0.7, 1.0) and can be opened (browsed and edited) in the tree editor TrEd. Installation instructions are part of the Lindat/Clarin package.

HTML

The on-line version of CzeDLex 1.0 in the form of HTML web pages presents the most important properties of the lexicon entries in a graphical, user-friendly way, without a need to install any tools (older on-line versions: 0.5, 0.6, 0.7), with the following filtering, sorting and presentation possibilities:

  • Lists of lexicon entries can be filtered by three criteria (which cannot be combined): the basic filter distinguishes the primary and secondary connectives, the second filter distinguishes the connectives according to discourse types they are able to express, and the last filter distinguishes the connectives according to their part of speech.
  • Each selection of connectives can be sorted alphabetically or by counts of connective occurrences in the source corpora.
  • English translations can be switched on and off.

How to cite

If you use the data of the lexicon or wish to refer to the published version of the data (version 1.0), please cite the publication of the data:

Jiří Mírovský, Pavlína Synková, Lucie Poláková, Věra Kloudová, Magdaléna Rysová: CzeDLex 1.0. Data/software, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-4595, Dec 2021

You can also cite the following journal articles describing (i) the design of the lexicon and the original extraction of the lexicon from the source corpus, and (ii) the subsequent extension of the lexicon from the second source corpus:

Jiří Mírovský, Pavlína Synková, Magdaléna Rysová, Lucie Poláková: CzeDLex – A Lexicon of Czech Discourse Connectives. In: The Prague Bulletin of Mathematical Linguistics, No. 109, Charles University, Prague, Czech Republic, ISSN 0032-6585, pp. 61-91, Oct 2017

Jiří Mírovský, Pavlína Synková, Lucie Poláková: Extending Coverage of a Lexicon of Discourse Connectives Using Annotation Projection. In: The Prague Bulletin of Mathematical Linguistics, No. 117, Charles University, Prague, Czech Republic, ISSN 0032-6585, pp. 5-26, Oct 2021

Manual Checks and Additions

Manual checks included checking of auto-filled values, assessment of suspicious usages of the connectives (in terms of complex forms – see below in the description of complex forms at level-two entries), addition of attributes/elements not filled in automatically, translation of level-one entries into English, addition of glosses to individual usages (and their translations), translation of complex forms and modifications, selection of their types, and selection of the most appropriate examples and their translation into English. Substantial information that could not be added within structural attributes/elements was provided as free text in the element note.

Referring to the latest published version of CzeDLex (1.0), there are 200 level-one entries in the lexicon, all fully manually checked and supplemented with additional information (in brackets: numbers of connective usages in PDiT 2.0 (or another of the sources), incl. variants, modifications and complex forms): a [and] (6612), a skutečně [indeed] (8), aby [(in order) to] (437), [although] (115), ale [but] (1745), alespoň [at least] (5), aneb [in other words] (3), anebo [or] (31), ani [nor, not (even)] (64), aniž [without (doing sth)] (53), argumentovat [to argue] (10), [no matter] (37), avšak [however] (70), [when, until] (53), ba [even] (5), během [during] (5), buď ~ nebo [either ~ or] (28), byť [albeit] (25), což [which] (217), či [or] (86), čili [that is, i.e.] (9), dále [further, also] (126), díky [thanks to] (15), do třetice [in the third place] (5), doba [time] (19), dokonce [even] (111), dokud [until, while] (41), dovršení [completion] (1), dříve [sooner] (29), důsledek [consequence] (3), důvod [reason] (70), hlavně [primarily] (1), i [also] (187), i když [even if] (178), i tak [even so] (7), jak [as, when] (9), jak ~ tak [both ~ and] (4), jakkoli [however] (7), jakmile [as soon as] (33), jednak ~ jednak [for one thing ~ for another] (14), jelikož [because, since] (20), jen [only, just] (83), jenže [but] (84), jestli [if] (17), jestliže [if, in case] (96), ještě [still, even] (21), ježto [as] (1), jinak [otherwise] (23), jinými slovy [in other words] (4), jmenovitě [namely] (1), k [to] (8), kdežto [whereas] (10), kdy [when] (29), kdyby [if] (175), kdykoli [whenever] (8), když [when] (768), koneckonců [after all] (10), konkrétně [specifically] (1), kontrast [contrast] (2), kontrastovat [to contrast] (1), kromě [besides] (44), kupříkladu [for example] (2), kvůli [because of] (7), leč [but] (4), leda [unless, only] (1), li [if] (296), mezitím [in the meantime] (12), mimo jiné [besides other things] (20), mimoto [apart from that] (5), místo [instead of] (21), 
na rozdíl od [in contrast with] (2), na základě [on the grounds of] (2), na závěr [in the end] (4), načež [after which] (1), nadto [moreover] (3), nakonec [eventually] (29), naopak [on the contrary] (190), naproti [opposite] (24), například [for example] (104), následek [consequence] (1), následně [subsequently] (4), nato [then, afterwards] (6), natož [let alone] (4), navíc [moreover] (203), navzdory [despite] (6), ne [not] (49), #neg [{negation}] (255), nebo [or] (230), neboli [in other words] (1), neboť [as, because] (222), nedosti na tom [that is not enough] (4), nehledě na [regardless of] (4), nejen [not only] (67), nejenže [not only that] (14), nejprve [(at) first] (15), nemluvě o [not to mention] (3), než [until] (54), nicméně [nevertheless] (77), nikoli [not] (24), nýbrž [but] (44), obdobně [similarly] (3), odůvodnění [justification] (3), okamžik [moment] (11), oproti [contrary (to)] (4), ostatně [after all] (3), ovšem [but, of course] (310), pak [then] (430), pakliže [if] (1), podmínka [condition] (20), podobně [similarly] (54), pokud [if] (473), poněvadž [since, as] (6), popřípadě [alternatively] (6), posléze [afterwards, finally, then] (17), poté [afterwards] (141), potom [then] (86), pouze [only, just] (40), později [later] (116), prostě [simply, just] (9), proto [therefore] (481), protože [because] (635), přece [after all] (31), přece jen [after all] (23), především [above all] (6), předpoklad [assumption] (9), předtím [before (that)] (32), přeloženo [translated] (1), přes [despite] (4), přesněji [more precisely] (4), přesto [despite of that] (141), přestože [although] (124), přičemž [while] (91), příčina [cause] (4), příklad [example] (11), případ [case] (80), případně [alternatively] (13), přitom [at the same time] (220), původně [originally] (3), respektive [or (more precisely)] (3), rovněž [also] (116), rozdíl [difference] (5), řečeno [speaking] (14), s tím, že [with the fact that] (52), sice [otherwise, granted] (2), sotva [the moment, hardly] (6), 
souběžně [concurrently] (3), současně [at the same time] (41), souvislost [connection, context] (18), spíše [rather] (40), srovnání [comparison] (10), stejně [equally, still] (36), strana [side] (60), tak [so] (334), také [also] (305), taktéž [also] (7), takže [so] (153), tedy [so] (337), též [also] (10), tím [thus] (28), tím pádem [thus] (8), tím spíše [all the more] (8), tj. [i.e., that is] (7), to [{N/A}] (14), totiž [you see, actually] (485), třeba [for example] (13), třebaže [although] (12), tudíž [consequently] (32), účel [purpose] (7), upřesnit [to specify] (14), v neposlední řadě [last but not least] (4), ve skutečnosti [in fact] (13), vedle [apart from] (4), více [more] (9), vinou [due to] (1), vlastně [actually] (7), však [however] (1686), výjimka [exception] (3), vyjma [excluding] (1), výsledek [result] (5), vzápětí [in no time] (19), vzhledem k [with respect to] (42), vždyť [after all] (46), záhy [soon] (2), zároveň [at the same time] (137), zase [again, in turn] (55), zásluhou [thanks to] (2), zatím [meantime] (11), zatímco [while] (207), zato [but (still)] (37), zčásti ~ zčásti [partly ~ partly] (2), zejména [particularly] (8), zkrátka [in short] (3), znamenat [to mean] (69), známka [indication] (1), způsobit [to cause] (1), zvlášť [especially] (2), že [that] (3), - [{dash}] (246), : [{colon}] (416), ; [{semicolon}] (3).

The lexicon covers all primary connectives used in the source PDiT annotated data and most of the secondary connectives from the PDiT annotation (some verbal secondary connectives have been excluded from CzeDLex 1.0).

Lexicon Structure

Level-one entry

The level-one entry in the lexicon structure is represented by the lemma of the connective. It is encoded in the element lemma and contains the following information:

  • element text: the lemma of the connective; for primary connectives, it is the connective itself (both single word primary connectives (proto [therefore]) and complex primary connectives (i když [even if])), for secondary connectives with a fixed form it is the phrase (ve skutečnosti [in fact]), for secondary connectives with a variable form it is the core word carrying the meaning of the relation (případ [case]).
  • element english: an approximate English translation for a basic orientation; more precise translations are given in connection with semantic discourse types at level-two entries
  • element type: the type of the connective: primary vs. secondary
  • element struct: the structure of the connective: it signals whether the connective is single, such as proto [therefore], or complex, such as jednak ~ jednak [on the one hand ~ on the other hand]. The complex connectives are further differentiated in the attribute type according to their placement in the argument(s): complex connectives with parts occurring in both arguments (e.g. jednak ~ jednak [on the one hand ~ on the other hand] or buď ~ nebo [either ~ or]) are labeled correlative, while complex connectives with all parts occurring in a single argument are labeled continuous if no word can be inserted between the parts of the connective (e.g. the connective i když [even if, although]), or discontinuous if other words can occur between the connective parts (e.g. a potom [and then]). Multiplied connectives in coordinations (e.g. protože ... a protože [because ... and because]) are labeled as multiple.
  • element variants: a list of variants of the connective: they are further specified in the attribute type as stylistic (cf. neutral tedy [so.neutral] vs. informal teda [so.informal]), orthographic (e.g. mimoto vs. mimo to [both meaning: besides]), or inflection (e.g. the form čímž [by which] is the instrumental form of the connective with the nominative form což [which]). If the lemma and the variant differ in the integration value (see below), this value is given for the variant and applies to all its uses.
  • element conn-usages: a list of connective usages – level-two entries
  • element non-conn-usages: a list of non-connective usages – level-two entries
  • element note: important information not encoded in other attributes
  • attribute id: a lexicon-wide unique identifier of this level-one lexicon entry
  • element src: an identifier of an annotator editing this lexicon entry
  • element is_checked: is set to 1 for entries considered to be fully checked and annotated
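
As a rough illustration of the level-one structure described above, the following Python sketch reads a simplified, hand-written XML fragment. The fragment is an assumption made for illustration only: it reuses the element and attribute names listed above (lemma, text, english, type, struct, variants, is_checked, id), but the actual PML files are namespaced and may nest or encode these items differently. The child element name variant, the id value, the type value and the helper read_level_one are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; the real PML file uses a namespace and
# may encode these elements differently. The values are illustrative.
SAMPLE = """
<lemma id="l-mimoto">
  <text>mimoto</text>
  <english>apart from that</english>
  <type>primary</type>
  <struct>single</struct>
  <variants>
    <variant type="orthographic">mimo to</variant>
  </variants>
  <is_checked>1</is_checked>
</lemma>
"""

def read_level_one(xml_text):
    """Return the basic level-one properties of one lemma as a dict."""
    lemma = ET.fromstring(xml_text)
    return {
        "id": lemma.get("id"),
        "text": lemma.findtext("text"),
        "english": lemma.findtext("english"),
        "type": lemma.findtext("type"),
        "struct": lemma.findtext("struct"),
        # Each variant is kept as a (type, form) pair.
        "variants": [(v.get("type"), v.text)
                     for v in lemma.iterfind("variants/variant")],
        "is_checked": lemma.findtext("is_checked") == "1",
    }

entry = read_level_one(SAMPLE)
print(entry["text"], entry["variants"])
```

To use such a reader on the real data, the element paths would have to be adapted to the PML schema distributed with the lexicon.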

Level-two entry

For each level-one entry in the lexicon structure, its connective and non-connective usages are represented as level-two entries. In connective-usages, the discourse type (see Table 1) is used as the base for nesting, while in non-connective-usages, the part-of-speech appurtenance of the expressions is used. The second level entry of the lexicon is encoded in the element usage and contains the following information:

  • element sense: the discourse type (see Table 1)
  • element scheme: the dependency scheme (used for all secondary connectives and for primary connectives with original structure preposition + pronoun "to" (it))
  • element gloss: a Czech expression disambiguating the meaning of the connective (a synonym or an explanatory phrase)
  • element english: an English translation (the gloss in English)
  • element pos: the part-of-speech appurtenance of the connective (the lemma) in the given usage. Conjunctions are further distinguished in the attribute subpos as coordinating or subordinating.
  • element syntax: for secondary connectives, the part-of-speech characteristic of the core word is accompanied by a syntactic characteristic of the whole secondary connective represented by this usage (nominal phrase, adjectival phrase, pronominal phrase, clause, adverbial phrase, or prepositional phrase).
  • element arg_semantics: this characteristic specifies the semantics of the argument the connective occurs in (see Table 2). From the semantic perspective, there is a basic difference between symmetric and asymmetric discourse relations. While both arguments of a symmetric relation (e.g. conjunction or synchrony) share the same general semantic characteristics, asymmetric discourse relations (e.g. reason–result or gradation) hold between arguments that differ in their semantic nature (e.g. one argument expresses the reason, the other the result). A connective of an asymmetric relation is characterized by its placement in one specific part of the relation it signals. For example, the coordinating conjunction tedy [thus] signals the result, while totiž [because] signals the reason. Similarly, the subordinating conjunctions než [until] and když [when] can be used for signalling precedence–succession – the former occurs in the argument expressing the event happening later, while the latter occurs in the argument expressing the earlier event. For symmetric relations, the element arg_semantics has the value symmetric. For complex correlative connectives forming level-one entries, the value is given for the second part of the connective. For connectives with border integration and main clause integration (see below), the value is given for the dependent clause. For not-integrated (see below) connectives, the value is given for the right (latter) argument.
  • element ordering: signals the linear order of the argument the connective occurs in (relative to the other – external – argument). In the majority of cases, ordering is connected with the part-of-speech characteristics – coordinating conjunctions, adverbs and particles are placed in the second argument in the linear order, while subordinating conjunctions can be placed in either of the arguments. There are, however, exceptions – e.g. the particle nejenže [not only that], which always occurs in the first argument – that justify the incorporation of this characteristic as a separate element into the lexicon. The element ordering has one of these four values: 1 for connectives occurring only in the first argument, 2 for connectives in the second argument, any for connectives occurring either in the first or in the second argument, and between for secondary connectives forming a separate syntactic unit (e.g. Důvod je jednoduchý. [The reason is simple.]) and therefore occurring entirely between the arguments. For complex correlative connectives forming level-one entries, the value is given for the second part of the connective.
  • element integration: captures the position of the connective within the argument. According to their origin and other possible functions in text, Czech connectives have different positions in the argument. Only subordinating conjunctions and prototypical coordinating conjunctions occupy the very beginning of the clause or sentence; the position of other connectives varies. Some of them are typically placed in the clitic, i.e. the second, position (e.g. však [however]), some of them are typically either in the first or in the second position (e.g. potom [then] or proto [therefore]), and one type of connectives occurs at the border of the main and the dependent clause (e.g. díky tomu, že [because (lit. thanks to that, that)] or potom, co [after (lit. after that, what)]). Another type forms a main clause introducing a dependent clause (e.g. příčinou je to, že [the reason is that] or jako příklad uvádí to, že [as an example, he says that]), and for the class of focusing particles (i.e. expressions such as také [also] or jenom [only]), the position is given by the information structure. For secondary connectives represented by a whole clause, the integration value is not-integrated. For secondary connectives represented by a nominal phrase attached loosely to the host sentence (e.g. podmínka: [condition:] or výsledek: [result:]), the integration value is appos. Other values of this element, as follows from the examples just mentioned, are first, second, first or second, border, main clause and any. For complex correlative connectives forming level-one entries, the value is given for the second part of the connective only.
  • element realizations: a list of non-modified and non-complex secondary connectives from PDiT 2.0 represented by the given dependency scheme (applies to secondary connectives and to primary connectives with original structure preposition + pronoun "to" (it))
  • element modifications: a list of the connective modifications: e.g. for the lemma potom [then] expressing precedence–succession, there is a modification teprve potom [only then]. Secondary connectives can be modified as well – cf. hlavní důvod proč [the main reason why]. Modifications are further distinguished in the attribute type as eval (evaluative), modal, and intense (intensifying).
  • element complex_forms: a list of complex connectives: e.g. for the lemma potom [then] expressing precedence–succession, there are for example complex forms a potom [and then] and nejdřív potom [first then]. Secondary connectives can have complex forms as well – cf. a z tohoto důvodu [and for this reason]. The criterion for a complex form to be placed in the level-two entry under a certain lemma is the ability of the basic connective (the given lemma) to express the same discourse type. It means that e.g. the complex connective přesto však [yet however] expressing the discourse type of concession is placed in the respective level-two entries under both lemmas přesto [yet] and však [however], because both these single connectives individually also express the discourse type of concession in PDiT 2.0. Further, according to its placement either in both arguments or in one argument, each complex form is labeled in the attribute type as correlative, continuous, discontinuous or multiple (see above among the level-one entry characteristics). Within each complex form, the element note may contain additional information.
  • element examples: a list of a few illustrative examples from PDiT 2.0 and their English translations. Both intra-sentential and inter-sentential examples are – if available in the corpus – given for the connective usages and marked as such in the attribute type (intra vs. inter).
  • element is_rare: signals a rare use of the connective with the given discourse type
  • element register: captures whether the connective is used in the neutral, formal or informal register
  • element note: important information not encoded in other attributes
  • attribute id: a unique identifier of this level-two entry
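
The closed value sets mentioned above for the elements ordering and integration lend themselves to a simple consistency check. The following Python sketch assumes that a level-two entry has already been loaded into a plain dict; the helper check_usage, the dict layout and the exact spelling of the values (e.g. "main clause" vs. some other encoding in the PML data) are assumptions, and the two sample entries are illustrative rather than copied from the lexicon.

```python
# Value inventories as listed in the descriptions of the elements ordering
# and integration; their exact spelling in the PML data is an assumption.
ORDERING_VALUES = {"1", "2", "any", "between"}
INTEGRATION_VALUES = {
    "first", "second", "first or second", "border",
    "main clause", "any", "not-integrated", "appos",
}

def check_usage(usage):
    """Return a list of problems found in one level-two entry (a plain dict)."""
    problems = []
    if usage.get("ordering") not in ORDERING_VALUES:
        problems.append(f"unexpected ordering: {usage.get('ordering')!r}")
    if usage.get("integration") not in INTEGRATION_VALUES:
        problems.append(f"unexpected integration: {usage.get('integration')!r}")
    return problems

# Illustrative entries, not copied from the lexicon.
ok = {"lemma": "však", "ordering": "2", "integration": "second"}
bad = {"lemma": "však", "ordering": "3", "integration": "second"}

print(check_usage(ok))
print(check_usage(bad))
```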

Table 1: List of possible discourse types (senses)

CONTRAST                 EXPANSION                 CONTINGENCY                TEMPORAL
confrontation            conjunction               reason–result              synchrony
opposition               conjunctive alternative   pragmatic reason–result    precedence–succession
restrictive opposition   disjunctive alternative   explication
pragmatic contrast       instantiation             condition
concession               specification             pragmatic condition
correction               equivalence               purpose
gradation                generalization
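
For scripts that need to group usages by the four major classes, Table 1 can be written down as a small lookup structure. This is a minimal Python sketch; the names SENSE_CLASSES and CLASS_OF are hypothetical, and the sense labels are spelled with plain hyphens here, which may differ from the spelling used in the data.

```python
# Table 1 as a mapping from the major class to its discourse types.
# Hyphens replace the en dashes of the table; the data may spell them differently.
SENSE_CLASSES = {
    "CONTRAST": [
        "confrontation", "opposition", "restrictive opposition",
        "pragmatic contrast", "concession", "correction", "gradation",
    ],
    "EXPANSION": [
        "conjunction", "conjunctive alternative", "disjunctive alternative",
        "instantiation", "specification", "equivalence", "generalization",
    ],
    "CONTINGENCY": [
        "reason-result", "pragmatic reason-result", "explication",
        "condition", "pragmatic condition", "purpose",
    ],
    "TEMPORAL": ["synchrony", "precedence-succession"],
}

# Reverse lookup: discourse type -> major class.
CLASS_OF = {
    sense: major
    for major, senses in SENSE_CLASSES.items()
    for sense in senses
}

print(CLASS_OF["concession"])
print(len(CLASS_OF))
```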

Table 2: Possible values of the argument semantics (element arg_semantics)

relation                  argument semantics
concession                concession:expectation
                          concession:contra-expectation
condition                 condition:condition
                          condition:result of condition
correction                correction:claim
                          correction:correction
explication               explication:claim
                          explication:argument
generalization            generalization:more specific
                          generalization:less specific
gradation                 gradation:lower degree
                          gradation:higher degree
instantiation             instantiation:general statement
                          instantiation:example
pragmatic condition       pragmatic condition:pragmatic condition
                          pragmatic condition:result of pragmatic condition
pragmatic reason-result   pragmatic reason-result:pragmatic reason
                          pragmatic reason-result:pragmatic result
precedence-succession     precedence-succession:precedence
                          precedence-succession:succession
purpose                   purpose:action
                          purpose:motivation
reason-result             reason-result:reason
                          reason-result:result
restrictive opposition    restrictive opposition:general statement
                          restrictive opposition:exception
specification             specification:less specific
                          specification:more specific
all other relations       symmetric
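
Table 2 follows a regular pattern: each asymmetric relation has two values of the form relation:part, and every other relation has the single value symmetric. The following Python sketch encodes the table on that basis; the names ASYMMETRIC and allowed_arg_semantics are hypothetical, and the value strings are taken from the table as printed here, which may not match the data character for character.

```python
# The two argument roles of each asymmetric relation, as listed in Table 2.
ASYMMETRIC = {
    "concession": ("expectation", "contra-expectation"),
    "condition": ("condition", "result of condition"),
    "correction": ("claim", "correction"),
    "explication": ("claim", "argument"),
    "generalization": ("more specific", "less specific"),
    "gradation": ("lower degree", "higher degree"),
    "instantiation": ("general statement", "example"),
    "pragmatic condition": ("pragmatic condition", "result of pragmatic condition"),
    "pragmatic reason-result": ("pragmatic reason", "pragmatic result"),
    "precedence-succession": ("precedence", "succession"),
    "purpose": ("action", "motivation"),
    "reason-result": ("reason", "result"),
    "restrictive opposition": ("general statement", "exception"),
    "specification": ("less specific", "more specific"),
}

def allowed_arg_semantics(sense):
    """Return the set of arg_semantics values Table 2 allows for a discourse type."""
    if sense in ASYMMETRIC:
        return {f"{sense}:{part}" for part in ASYMMETRIC[sense]}
    return {"symmetric"}

print(sorted(allowed_arg_semantics("reason-result")))
print(allowed_arg_semantics("conjunction"))
```

For instance, tedy [thus], which signals the result, would carry reason-result:result under this scheme, while totiž [because] would carry reason-result:reason.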

Corpus frequencies

Numbers of occurrences in PDiT 2.0 were added to all individual variants, complex forms, modifications and realizations, as well as to connective and non-connective usages (level-two entries) and the whole lemmas (level-one entries), in two attributes: pdt_count and pdt_intra, capturing numbers of all vs. intra-sentential occurrences of the respective items.

For level-one entries (whole lemmas) coming from the supplementary resources, numbers of occurrences (also captured in the attributes pdt_count and pdt_intra) represent numbers of occurrences in the respective resource (while in PDiT, these numbers are usually 0). Such level-one entries are marked by the value PCEDT or other in the element source.

For level-two entries (connective usages) and other smaller parts of lexicon entries (complex forms, examples, etc.) coming from the supplementary resources and added to level-one entries (lemmas) coming from PDiT, their counts in the given supplementary resource are also captured in the attributes pdt_count and pdt_intra, but they are not counted in the total numbers of occurrences of the lemma (or of the complex forms etc.). These level-two entries or smaller parts are also marked by the value PCEDT or other in the element source.
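
The counting conventions above can be sketched in a few lines of Python. The sketch makes two assumptions that go beyond the text: that the occurrences not counted in pdt_intra are the inter-sentential ones, and that the total of a lemma coming from PDiT can be approximated by summing the counts of its usages that are not marked as coming from a supplementary resource. The dict layout, the helper names and all numbers are hypothetical.

```python
# Values of the element source that mark supplementary resources.
SUPPLEMENTARY = {"PCEDT", "other"}

def inter_sentential(item):
    """Occurrences that are not intra-sentential (assumed to be inter-sentential)."""
    return item["pdt_count"] - item["pdt_intra"]

def lemma_total(usages):
    """Sum the counts of usages not marked as coming from a supplementary resource."""
    return sum(
        u["pdt_count"] for u in usages
        if u.get("source") not in SUPPLEMENTARY
    )

# Hypothetical usages of one lemma coming from PDiT; the numbers are made up.
usages = [
    {"sense": "reason-result", "pdt_count": 40, "pdt_intra": 10},
    {"sense": "conjunction", "pdt_count": 5, "pdt_intra": 5, "source": "PCEDT"},
]

print(inter_sentential(usages[0]))
print(lemma_total(usages))
```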

Translations

Apart from the English translations listed in the descriptions of level-one and level-two entries, all complex forms, modified forms, realizations, variants (when possible) and examples have been translated into English (the translations are captured in the elements english at the respective places).

Naturally, for entries coming from the PCEDT-cz, the English examples are actually the original ones and the Czech examples are translations.