CzeDLex 0.6 is the second version of the Lexicon of Czech Discourse Connectives, originally (2015 – 2017) developed within the COST-cz project TextLink-cz (see Czech project notes for individual years: 2015, 2016, 2017), and later (2019) within the project Shallow discourse parsing in Czech (GAČR GA19-03490S). CzeDLex 0.6 is an update of the previous version, CzeDLex 0.5, which was published in 2017.
The lexicon contains connectives partially automatically extracted from the Prague Discourse Treebank 2.0, a large corpus annotated manually with discourse relations. The most frequent entries have been manually checked and supplemented by additional information and English translations (see below "Manual Checks and Additions").
CzeDLex 0.6 was published on December 19, 2019 in the Lindat/Clarin repository. It is also available on-line.
For updates, see the web pages of the current development version of CzeDLex.
CzeDLex 0.6 is publicly available under the Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
CzeDLex 0.6 is available in two formats, PML and HTML:
The Prague Markup Language (PML) is the primary XML format of the lexicon. The lexicon in this format is downloadable from the Lindat/Clarin repository and can be opened (browsed and edited) in the tree editor TrEd. Installation instructions are part of the Lindat/Clarin package.
The on-line version of the lexicon in the form of HTML web pages presents the most important properties of the lexicon entries in a graphical, user-friendly way, without a need to install any tools.
The on-line version of the lexicon allows filtering the list of lexicon entries by three criteria (which cannot be combined): the basic filter distinguishes primary and secondary connectives, the second filter distinguishes the connectives according to the discourse types they are able to express, and the last filter distinguishes the connectives according to their part of speech.
If you use the data of the lexicon or wish to refer to the published version of the data (version 0.6), please cite the publication of the data:
Pavlína Synková, Lucie Poláková, Jiří Mírovský, Magdaléna Rysová: CzeDLex 0.6. Data/software, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-3074, Dec 2019
You can also cite the following journal article describing the design of the lexicon and the extraction of the lexicon from the source corpus:
Jiří Mírovský, Pavlína Synková, Magdaléna Rysová, Lucie Poláková: CzeDLex – A Lexicon of Czech Discourse Connectives. In: The Prague Bulletin of Mathematical Linguistics, No. 109, Charles University, Prague, Czech Republic, ISSN 0032-6585, pp. 61-91, Oct 2017
Manual checks included checking of auto-filled values, assessment of suspicious usages of the connectives (in terms of complex forms – see below in the description of complex forms at level-two entries), addition of attributes/elements not filled in automatically, translation of level-one entries into English, addition of glosses to individual usages (and their translations), translation of complex forms and modifications, selection of their types, and selection of the most appropriate examples and their translation into English. Substantial information that could not be added within structural attributes/elements was provided as free text in the element note.
In total, there are 204 level-one entries in the lexicon. The following 76 lexicon entries (covering more than 90% of the discourse relations annotated in the PDiT 2.0) have been fully manually checked and supplemented with additional information in CzeDLex 0.6 (in brackets: numbers of connective usages in the PDiT 2.0, incl. variants, modifications and complex forms): a [and] (6612), ale [but] (1745), však [however] (1686), když [when] (769), protože [because] (635), totiž [actually, you see, you know] (485), proto [therefore] (481), pokud [if] (473), aby [(in order) to] (437), pak [then] (430), : [{colon}] (416), tedy [so, I mean] (337), tak [so, in that way] (334), ovšem [but, of course] (310), také [also, too] (305), li [if] (296), nebo [or] (230), neboť [for, since] (222), přitom [at the same time] (221), což [which] (217), zatímco [while] (207), navíc [moreover] (203), naopak [on the contrary] (189), dodat [add] (187), i [also] (187), i když [although, even if] (178), kdyby [if] (175), takže [so] (153), poté [then, afterwards] (141), přesto [despite of that] (141), zároveň [at the same time] (137), dále [further, also] (126), přestože [although] (124), rovněž [also, too] (116), ač [although] (115), dokonce [even] (111), například [for example] (104), jestliže [if, in case] (96), přičemž [at the same time, while] (92), či [or] (87), potom [then] (86), jen [only, just] (83), případ [case] (83), jenže [but] (79), nicméně [nevertheless] (77), ani [nor, not (even)] (66), tím [this way, thus] (66), než [until] (54), podobně [similarly] (54), aniž [without (doing sth)] (53), až [when, until] (53), nýbrž [but] (47), vždyť [after all] (46), kromě [besides, apart from] (42), vzhledem k [with respect to] (42), dokud [until, as long as] (41), současně [at the same time] (41), spíše [rather] (40), ať [no matter how, be it or not] (37), zato [but (still)] (37), předtím [before (that), previously] (32), tudíž [consequently] (32), anebo [or] (31), za to [for that] (26), 
místo [instead of] (21), mezitím [in the meantime] (12), kdežto [whereas] (10), k tomu [moreover, also] (8), i tak [even so, even then] (7), mimoto [apart from that] (5), natož [let alone] (4), oproti [contrary (to)] (4), aneb [in other words] (3), nadto [moreover] (3), ani + případ [not even + case] (1), přece jen [after all] (1).
Realizations of several additional secondary connectives have been manually sorted according to a dependency scheme: důvod [reason] (76), strana [side] (61), naproti [on the other hand] (24), oproti [on the other hand] (4).
The level-one entry in the lexicon structure is represented by the lemma of the connective. It is encoded in the element lemma and contains the following information:
For each level-one entry in the lexicon structure, its connective and non-connective usages are represented as level-two entries. Connective usages are nested by discourse type (see Table 1), while non-connective usages are nested by the part of speech of the expression. A level-two entry is encoded in the element usage and contains the following information:
CONTRAST | EXPANSION | CONTINGENCY | TEMPORAL |
---|---|---|---|
confrontation | conjunction | reason–result | synchrony |
opposition | conjunctive alternative | pragmatic reason–result | precedence–succession |
restrictive opposition | disjunctive alternative | explication | |
pragmatic contrast | instantiation | condition | |
concession | specification | pragmatic condition | |
correction | equivalence | purpose | |
gradation | generalization | | |
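
The grouping in the table above can be sketched as a small lookup structure. This is a minimal illustration only: the names `DISCOURSE_CLASSES`, `TYPE_TO_CLASS` and `discourse_class` are hypothetical and are not part of the lexicon or its tools, and the type names are stored with a plain hyphen as an assumption of this sketch.

```python
# Discourse types grouped by class, transcribed from the table above.
# Type names are stored with a plain hyphen; en dashes are normalized on lookup.
DISCOURSE_CLASSES = {
    "CONTRAST": [
        "confrontation", "opposition", "restrictive opposition",
        "pragmatic contrast", "concession", "correction", "gradation",
    ],
    "EXPANSION": [
        "conjunction", "conjunctive alternative", "disjunctive alternative",
        "instantiation", "specification", "equivalence", "generalization",
    ],
    "CONTINGENCY": [
        "reason-result", "pragmatic reason-result", "explication",
        "condition", "pragmatic condition", "purpose",
    ],
    "TEMPORAL": ["synchrony", "precedence-succession"],
}

# Inverted index: discourse type -> class.
TYPE_TO_CLASS = {
    dtype: cls for cls, dtypes in DISCOURSE_CLASSES.items() for dtype in dtypes
}


def discourse_class(discourse_type: str) -> str:
    """Return the class of a discourse type; raise KeyError if it is unknown."""
    key = discourse_type.strip().lower().replace("\u2013", "-")
    return TYPE_TO_CLASS[key]


print(discourse_class("concession"))  # prints CONTRAST
```

Such an index could be used, for example, to group level-two entries of connective usages by class rather than by individual discourse type.
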
relation | argument semantics |
---|---|
concession | concession:expectation, concession:contra-expectation |
condition | condition:condition, condition:result of condition |
correction | correction:claim, correction:correction |
explication | explication:claim, explication:argument |
generalization | generalization:more specific, generalization:less specific |
gradation | gradation:lower degree, gradation:higher degree |
instantiation | instantiation:general statement, instantiation:example |
pragmatic condition | pragmatic condition:pragmatic condition, pragmatic condition:result of pragmatic condition |
pragmatic reason-result | pragmatic reason-result:pragmatic reason, pragmatic reason-result:pragmatic result |
precedence-succession | precedence-succession:precedence, precedence-succession:succession |
purpose | purpose:action, purpose:motivation |
reason-result | reason-result:reason, reason-result:result |
restrictive opposition | restrictive opposition:general statement, restrictive opposition:exception |
specification | specification:less specific, specification:more specific |
all other relations | symmetric |
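
The argument semantics listed above can be expressed as a lookup with a default for symmetric relations. The following is a hedged sketch, not part of the lexicon: `ASYMMETRIC_RELATIONS` and `argument_semantics` are hypothetical names, and the order of the two labels simply follows the order in which they are listed above.

```python
# Relations with distinct argument roles, transcribed from the table above.
ASYMMETRIC_RELATIONS = {
    "concession": ("expectation", "contra-expectation"),
    "condition": ("condition", "result of condition"),
    "correction": ("claim", "correction"),
    "explication": ("claim", "argument"),
    "generalization": ("more specific", "less specific"),
    "gradation": ("lower degree", "higher degree"),
    "instantiation": ("general statement", "example"),
    "pragmatic condition": ("pragmatic condition", "result of pragmatic condition"),
    "pragmatic reason-result": ("pragmatic reason", "pragmatic result"),
    "precedence-succession": ("precedence", "succession"),
    "purpose": ("action", "motivation"),
    "reason-result": ("reason", "result"),
    "restrictive opposition": ("general statement", "exception"),
    "specification": ("less specific", "more specific"),
}


def argument_semantics(relation: str) -> list[str]:
    """Return the 'relation:role' labels, or ['symmetric'] for other relations.

    Note: this sketch does not validate relation names, so any name that is
    not in the mapping (including a misspelled one) is reported as symmetric.
    """
    roles = ASYMMETRIC_RELATIONS.get(relation)
    if roles is None:
        return ["symmetric"]
    return [f"{relation}:{role}" for role in roles]


print(argument_semantics("purpose"))      # ['purpose:action', 'purpose:motivation']
print(argument_semantics("conjunction"))  # ['symmetric']
```
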
Numbers of occurrences in PDiT 2.0 were added to all individual variants, complex forms, modifications and realizations, as well as to connective and non-connective usages (level-two entries) and the whole lemmas (level-one entries), in two attributes: pdt_count and pdt_intra, capturing numbers of all vs. intra-sentential occurrences of the respective items.
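
As an illustration of how the two counts relate, the sketch below derives the number of occurrences that are not intra-sentential as the difference between pdt_count and pdt_intra. The class `OccurrenceCounts` and its properties are hypothetical helpers, the sample numbers are made up, and the sketch assumes that pdt_intra never exceeds pdt_count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OccurrenceCounts:
    """Pair of counts attached to a lexicon item (hypothetical helper)."""

    pdt_count: int  # all occurrences in PDiT 2.0
    pdt_intra: int  # intra-sentential occurrences only

    def __post_init__(self) -> None:
        # Assumption of this sketch: intra-sentential counts are a subset of all counts.
        if not 0 <= self.pdt_intra <= self.pdt_count:
            raise ValueError("expected 0 <= pdt_intra <= pdt_count")

    @property
    def inter_sentential(self) -> int:
        """Occurrences that are not intra-sentential."""
        return self.pdt_count - self.pdt_intra

    @property
    def intra_share(self) -> float:
        """Fraction of intra-sentential occurrences (0.0 if there are none at all)."""
        return self.pdt_intra / self.pdt_count if self.pdt_count else 0.0


# Made-up numbers for illustration only; they are not taken from the lexicon.
sample = OccurrenceCounts(pdt_count=10, pdt_intra=4)
print(sample.inter_sentential)  # 6
print(sample.intra_share)       # 0.4
```
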
Apart from the English translations listed in the descriptions of level-one and level-two entries, all complex forms, modified forms, realizations, variants (when possible) and some examples have been translated into English (the translations are captured in elements english at the respective places).