CzeDLex is a Lexicon of Czech Discourse Connectives, originally (2015 – 2017) developed within the COST-cz project TextLink-cz, and later (2019 – 2021) within the project Shallow discourse parsing in Czech (GAČR GA19-03490S).
CzeDLex 0.5 (the pilot version) was published on December 24, 2017 in the Lindat/Clarin repository (also available on-line).
CzeDLex 0.6 (the first update) was published on December 19, 2019 in the Lindat/Clarin repository (also available on-line).
CzeDLex 0.7 (the second update) was published on December 24, 2020 in the Lindat/Clarin repository (also available on-line).
CzeDLex 1.0 (the third update) was published in December 2021 in the Lindat/Clarin repository (also available on-line).
For further updates, see the web pages of the current development version of CzeDLex.
The lexicon contains connectives extracted partially automatically from two large corpora manually annotated with discourse relations, and from a smaller additional resource, also manually annotated with discourse relations:
The lexicon entries have been manually checked and supplemented with additional information and English translations.
CzeDLex is publicly available under the Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
CzeDLex is available in two formats, PML and HTML:
The Prague Markup Language (PML) is the primary XML format of the lexicon. The lexicon is downloadable from the Lindat/Clarin repository (available versions: 0.5, 0.6, 0.7, 1.0) and can be opened (browsed and edited) in the tree editor TrEd. Installation instructions are part of the Lindat/Clarin package.
The on-line version of CzeDLex 1.0 in the form of HTML web pages presents the most important properties of the lexicon entries in a graphical, user-friendly way, without a need to install any tools (older on-line versions: 0.5, 0.6, 0.7), with the following filtering, sorting and presentation possibilities:
If you use the data of the lexicon or wish to refer to the published version of the data (version 1.0), please cite the publication of the data:
Jiří Mírovský, Pavlína Synková, Lucie Poláková, Věra Kloudová, Magdaléna Rysová: CzeDLex 1.0. Data/software, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-4595, Dec 2021
You can also cite the following journal articles describing (i) the design of the lexicon and the original extraction of the lexicon from the source corpus, and (ii) the subsequent extension of the lexicon from the second source corpus:
Jiří Mírovský, Pavlína Synková, Magdaléna Rysová, Lucie Poláková: CzeDLex – A Lexicon of Czech Discourse Connectives. In: The Prague Bulletin of Mathematical Linguistics, No. 109, Charles University, Prague, Czech Republic, ISSN 0032-6585, pp. 61-91, Oct 2017
Jiří Mírovský, Pavlína Synková, Lucie Poláková: Extending Coverage of a Lexicon of Discourse Connectives Using Annotation Projection. In: The Prague Bulletin of Mathematical Linguistics, No. 117, Charles University, Prague, Czech Republic, ISSN 0032-6585, pp. 5-26, Oct 2021
Manual checks included verification of auto-filled values, assessment of suspicious usages of the connectives (in terms of complex forms – see below in the description of complex forms at level-two entries), addition of attributes/elements not filled in automatically, translation of level-one entries to English, addition of glosses to individual usages (and their translations), translation of complex forms and modifications, selection of their types, and selection of the most appropriate examples and their translation to English. Substantial information that could not be added within structural attributes/elements was provided as free text in the element note.
Referring to the last published version of CzeDLex (1.0), there are 200 level-one entries in the lexicon, all fully manually checked and supplemented with additional information (in brackets: numbers of connective usages in the PDiT 2.0 (or another of the sources), incl. variants, modifications and complex forms): a [and] (6612), a skutečně [indeed] (8), aby [(in order) to] (437), ač [although] (115), ale [but] (1745), alespoň [at least] (5), aneb [in other words] (3), anebo [or] (31), ani [nor, not (even)] (64), aniž [without (doing sth)] (53), argumentovat [to argue] (10), ať [no matter] (37), avšak [however] (70), až [when, until] (53), ba [even] (5), během [during] (5), buď ~ nebo [either ~ or] (28), byť [albeit] (25), což [which] (217), či [or] (86), čili [that is, i.e.] (9), dále [further, also] (126), díky [thanks to] (15), do třetice [in the third place] (5), doba [time] (19), dokonce [even] (111), dokud [until, while] (41), dovršení [completion] (1), dříve [sooner] (29), důsledek [consequence] (3), důvod [reason] (70), hlavně [primarily] (1), i [also] (187), i když [even if] (178), i tak [even so] (7), jak [as, when] (9), jak ~ tak [both ~ and] (4), jakkoli [however] (7), jakmile [as soon as] (33), jednak ~ jednak [for one thing ~ for another] (14), jelikož [because, since] (20), jen [only, just] (83), jenže [but] (84), jestli [if] (17), jestliže [if, in case] (96), ještě [still, even] (21), ježto [as] (1), jinak [otherwise] (23), jinými slovy [in other words] (4), jmenovitě [namely] (1), k [to] (8), kdežto [whereas] (10), kdy [when] (29), kdyby [if] (175), kdykoli [whenever] (8), když [when] (768), koneckonců [after all] (10), konkrétně [specifically] (1), kontrast [contrast] (2), kontrastovat [to contrast] (1), kromě [besides] (44), kupříkladu [for example] (2), kvůli [because of] (7), leč [but] (4), leda [unless, only] (1), li [if] (296), mezitím [in the meantime] (12), mimo jiné [besides other things] (20), mimoto [apart from that] (5), místo [instead
of] (21), na rozdíl od [in contrast with] (2), na základě [on the grounds of] (2), na závěr [in the end] (4), načež [after which] (1), nadto [moreover] (3), nakonec [eventually] (29), naopak [on the contrary] (190), naproti [opposite] (24), například [for example] (104), následek [consequence] (1), následně [subsequently] (4), nato [then, afterwards] (6), natož [let alone] (4), navíc [moreover] (203), navzdory [despite] (6), ne [not] (49), #neg [{negation}] (255), nebo [or] (230), neboli [in other words] (1), neboť [as, because] (222), nedosti na tom [that is not enough] (4), nehledě na [regardless of] (4), nejen [not only] (67), nejenže [not only that] (14), nejprve [(at) first] (15), nemluvě o [not to mention] (3), než [until] (54), nicméně [nevertheless] (77), nikoli [not] (24), nýbrž [but] (44), obdobně [similarly] (3), odůvodnění [justification] (3), okamžik [moment] (11), oproti [contrary (to)] (4), ostatně [after all] (3), ovšem [but, of course] (310), pak [then] (430), pakliže [if] (1), podmínka [condition] (20), podobně [similarly] (54), pokud [if] (473), poněvadž [since, as] (6), popřípadě [alternatively] (6), posléze [afterwards, finally, then] (17), poté [afterwards] (141), potom [then] (86), pouze [only, just] (40), později [later] (116), prostě [simply, just] (9), proto [therefore] (481), protože [because] (635), přece [after all] (31), přece jen [after all] (23), především [above all] (6), předpoklad [assumption] (9), předtím [before (that)] (32), přeloženo [translated] (1), přes [despite] (4), přesněji [more precisely] (4), přesto [despite of that] (141), přestože [although] (124), přičemž [while] (91), příčina [cause] (4), příklad [example] (11), případ [case] (80), případně [alternatively] (13), přitom [at the same time] (220), původně [originally] (3), respektive [or (more precisely)] (3), rovněž [also] (116), rozdíl [difference] (5), řečeno [speaking] (14), s tím, že [with the fact that] (52), sice [otherwise, granted] (2), sotva [the moment, 
hardly] (6), souběžně [concurrently] (3), současně [at the same time] (41), souvislost [connection, context] (18), spíše [rather] (40), srovnání [comparison] (10), stejně [equally, still] (36), strana [side] (60), tak [so] (334), také [also] (305), taktéž [also] (7), takže [so] (153), tedy [so] (337), též [also] (10), tím [thus] (28), tím pádem [thus] (8), tím spíše [all the more] (8), tj. [i.e., that is] (7), to [{N/A}] (14), totiž [you see, actually] (485), třeba [for example] (13), třebaže [although] (12), tudíž [consequently] (32), účel [purpose] (7), upřesnit [to specify] (14), v neposlední řadě [last but not least] (4), ve skutečnosti [in fact] (13), vedle [apart from] (4), více [more] (9), vinou [due to] (1), vlastně [actually] (7), však [however] (1686), výjimka [exception] (3), vyjma [excluding] (1), výsledek [result] (5), vzápětí [in no time] (19), vzhledem k [with respect to] (42), vždyť [after all] (46), záhy [soon] (2), zároveň [at the same time] (137), zase [again, in turn] (55), zásluhou [thanks to] (2), zatím [meantime] (11), zatímco [while] (207), zato [but (still)] (37), zčásti ~ zčásti [partly ~ partly] (2), zejména [particularly] (8), zkrátka [in short] (3), znamenat [to mean] (69), známka [indication] (1), způsobit [to cause] (1), zvlášť [especially] (2), že [that] (3), - [{dash}] (246), : [{colon}] (416), ; [{semicolon}] (3).
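The listing above follows a regular pattern, `lemma [English gloss] (count)`, so it can be turned into structured data if needed. The following is a minimal sketch, not part of the CzeDLex distribution; the names `ITEM_RE` and `parse_items` are illustrative assumptions, and the sample string is a short excerpt of the list.

```python
import re

# One list item has the shape: lemma [English gloss] (count).
# Lemmas may contain commas ("s tím, že") and glosses may contain
# parentheses ("(in order) to"), so a plain split on "," is not enough.
ITEM_RE = re.compile(
    r"(?P<lemma>[^\[\]()]+?)\s*\[(?P<gloss>[^\]]*)\]\s*\((?P<count>\d+)\)"
)


def parse_items(text):
    """Return (lemma, gloss, count) tuples for every item found in text."""
    items = []
    for match in ITEM_RE.finditer(text):
        # Drop the ", " separator left over from the previous item.
        lemma = match.group("lemma").strip(" ,\n")
        items.append((lemma, match.group("gloss"), int(match.group("count"))))
    return items


# A short excerpt of the list above, used only as sample input.
sample = (
    "a [and] (6612), aby [(in order) to] (437), "
    "s tím, że [with the fact that] (52), - [{dash}] (246)"
).replace("że", "že")

items = parse_items(sample)
most_frequent = max(items, key=lambda item: item[2])
```

Matching whole items with one regular expression, rather than splitting on commas, keeps multi-word lemmas and glosses with internal punctuation intact.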
The lexicon covers all primary connectives used in the source PDiT annotated data and most of the secondary connectives from the PDiT annotation (some verbal secondary connectives have been excluded from CzeDLex 1.0).
The level-one entry in the lexicon structure is represented by the lemma of the connective. It is encoded in the element lemma and contains the following information:
For each level-one entry in the lexicon structure, its connective and non-connective usages are represented as level-two entries. In connective usages, the discourse type (see Table 1) is used as the basis for nesting, while in non-connective usages, the part of speech of the expression is used. The level-two entry of the lexicon is encoded in the element usage and contains the following information:
CONTRAST | EXPANSION | CONTINGENCY | TEMPORAL |
---|---|---|---|
confrontation | conjunction | reason–result | synchrony |
opposition | conjunctive alternative | pragmatic reason–result | precedence–succession |
restrictive opposition | disjunctive alternative | explication | |
pragmatic contrast | instantiation | condition | |
concession | specification | pragmatic condition | |
correction | equivalence | purpose | |
gradation | generalization | | |
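For scripts that need to group connective usages by the four major classes, Table 1 can be written down as a simple mapping. This is only a sketch: the names `DISCOURSE_TYPES`, `TYPE_TO_CLASS` and `discourse_class` are assumptions made for illustration, and the type labels are copied from the table as printed here, which may differ from the exact strings stored in the PML data.

```python
# Table 1 as a mapping from major class to discourse types
# (labels copied from the table; spelling in the PML data may differ).
DISCOURSE_TYPES = {
    "CONTRAST": [
        "confrontation", "opposition", "restrictive opposition",
        "pragmatic contrast", "concession", "correction", "gradation",
    ],
    "EXPANSION": [
        "conjunction", "conjunctive alternative", "disjunctive alternative",
        "instantiation", "specification", "equivalence", "generalization",
    ],
    "CONTINGENCY": [
        "reason–result", "pragmatic reason–result", "explication",
        "condition", "pragmatic condition", "purpose",
    ],
    "TEMPORAL": ["synchrony", "precedence–succession"],
}

# Reverse index: discourse type -> major class.
TYPE_TO_CLASS = {
    discourse_type: major_class
    for major_class, types in DISCOURSE_TYPES.items()
    for discourse_type in types
}


def discourse_class(discourse_type):
    """Return the major class of a discourse type, or None if unknown."""
    return TYPE_TO_CLASS.get(discourse_type)
```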
relation | argument semantics |
---|---|
concession | concession:expectation, concession:contra-expectation |
condition | condition:condition, condition:result of condition |
correction | correction:claim, correction:correction |
explication | explication:claim, explication:argument |
generalization | generalization:more specific, generalization:less specific |
gradation | gradation:lower degree, gradation:higher degree |
instantiation | instantiation:general statement, instantiation:example |
pragmatic condition | pragmatic condition:pragmatic condition, pragmatic condition:result of pragmatic condition |
pragmatic reason-result | pragmatic reason-result:pragmatic reason, pragmatic reason-result:pragmatic result |
precedence-succession | precedence-succession:precedence, precedence-succession:succession |
purpose | purpose:action, purpose:motivation |
reason-result | reason-result:reason, reason-result:result |
restrictive opposition | restrictive opposition:general statement, restrictive opposition:exception |
specification | specification:less specific, specification:more specific |
all other relations | symmetric |
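The table of argument semantics can be read as a lookup: asymmetric relations have two labelled argument roles, and every other relation is symmetric. A minimal sketch of such a lookup follows; the names `ARGUMENT_ROLES` and `argument_labels` are illustrative assumptions, and the returned label strings simply follow the `relation:role` pattern shown in the table.

```python
# Roles of the two arguments for the asymmetric relations listed in the table.
ARGUMENT_ROLES = {
    "concession": ("expectation", "contra-expectation"),
    "condition": ("condition", "result of condition"),
    "correction": ("claim", "correction"),
    "explication": ("claim", "argument"),
    "generalization": ("more specific", "less specific"),
    "gradation": ("lower degree", "higher degree"),
    "instantiation": ("general statement", "example"),
    "pragmatic condition": ("pragmatic condition", "result of pragmatic condition"),
    "pragmatic reason-result": ("pragmatic reason", "pragmatic result"),
    "precedence-succession": ("precedence", "succession"),
    "purpose": ("action", "motivation"),
    "reason-result": ("reason", "result"),
    "restrictive opposition": ("general statement", "exception"),
    "specification": ("less specific", "more specific"),
}


def argument_labels(relation):
    """Return the argument-semantics labels for a relation.

    Relations not listed in the table are treated as symmetric.
    """
    roles = ARGUMENT_ROLES.get(relation)
    if roles is None:
        return ("symmetric",)
    return tuple(f"{relation}:{role}" for role in roles)
```

For example, `argument_labels("purpose")` yields the two labels `purpose:action` and `purpose:motivation`, while `argument_labels("conjunction")` yields only `symmetric`.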
Numbers of occurrences in PDiT 2.0 were added to all individual variants, complex forms, modifications and realizations, as well as to connective and non-connective usages (level-two entries) and the whole lemmas (level-one entries), in two attributes: pdt_count and pdt_intra, capturing numbers of all vs. intra-sentential occurrences of the respective items.
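Since pdt_count covers all occurrences and pdt_intra only the intra-sentential ones, the number of inter-sentential occurrences can be derived as their difference. The sketch below illustrates this on a made-up XML fragment; the fragment is a simplified assumption and not the actual PML structure, the numbers are invented, and `occurrence_split` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: only the two count attributes are taken
# from the description above; the numbers are invented for illustration.
FRAGMENT = '<usage pdt_count="12" pdt_intra="9"/>'


def occurrence_split(element):
    """Split the counts of an item into all, intra- and inter-sentential."""
    total = int(element.get("pdt_count", 0))
    intra = int(element.get("pdt_intra", 0))
    # Inter-sentential occurrences are those not counted as intra-sentential.
    return {"all": total, "intra": intra, "inter": total - intra}


split = occurrence_split(ET.fromstring(FRAGMENT))
```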
For level-one entries (whole lemmas) coming from the supplementary resources, numbers of occurrences (captured also in attributes pdt_count and pdt_intra) represent numbers of occurrences in the respective resource (while in PDiT, these numbers are usually 0). Such level-one entries are marked by values PCEDT or other of element source.
For level-two entries (connective usages) and other smaller parts of lexicon entries (complex forms, examples, etc.) coming from the supplementary resources and added to level-one entries (lemmas) coming from PDiT, their counts in the given supplementary resource are also captured in attributes pdt_count and pdt_intra but they are not counted in total numbers of occurrences of the lemma (or of the complex forms etc.). These level-two entries or the smaller parts are also marked by values PCEDT or other of element source.
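The counting rule described above, in which parts added from supplementary resources carry their own counts but do not contribute to the lemma totals, can be sketched as follows. The XML fragment is a simplified, hypothetical stand-in for the real PML data: the nesting, the invented numbers, and the assumption that usages from PDiT carry no source element are illustrative only, as is the helper name `lemma_total`.

```python
import xml.etree.ElementTree as ET

# Values of the element source that mark supplementary resources.
SUPPLEMENTARY = {"PCEDT", "other"}

# Hypothetical, simplified entry; counts are invented for illustration.
ENTRY = """
<lemma>
  <usage pdt_count="40" pdt_intra="25"/>
  <usage pdt_count="7" pdt_intra="7"><source>PCEDT</source></usage>
  <usage pdt_count="3" pdt_intra="1"><source>other</source></usage>
  <usage pdt_count="10" pdt_intra="2"/>
</lemma>
"""


def lemma_total(entry):
    """Sum pdt_count over usages that do not come from a supplementary resource."""
    total = 0
    for usage in entry.iter("usage"):
        source = usage.findtext("source", default="").strip()
        if source in SUPPLEMENTARY:
            # Counted within its own resource only, not in the lemma total.
            continue
        total += int(usage.get("pdt_count", 0))
    return total


total = lemma_total(ET.fromstring(ENTRY))
```

With the sample entry, only the two usages without a supplementary source are summed, giving a total of 50.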
Apart from English translations listed in the descriptions of level-one and level-two entries, all complex forms, modified forms, realizations, variants (when possible) and examples have been translated to English (the translations are captured in elements english at the respective places).
Naturally, for entries coming from the PCEDT-cz, the English examples are actually the original ones and the Czech examples are translations.