Documentation for GeCzLex 1.0

Introduction

GeCzLex 1.0 is an online electronic resource of translation equivalents of Czech and German discourse connectives. The lexicon is one of the outcomes of a three-year research project on anaphoricity in Czech and German connectives (Anaphoricity in Connectives: Lexical Description and Bilingual Corpus Analysis). At present, it contains anaphoric connectives for both languages, together with their possible translations (not necessarily anaphoric) documented in bilingual parallel corpora.

As a basis, we use two existing monolingual lexicons of connectives:

  • the Lexicon of Czech Discourse Connectives CzeDLex 0.6 and
  • the Lexicon of Discourse Markers (DiMLex) for German. 

Their relevant entries have been interlinked via semantic annotation of the connectives according to the PDTB 3 sense taxonomy and via statistical information on translation possibilities drawn from the Czech and German parallel data of the Intercorp project. As far as we know, the lexicon is the first bilingual inventory of connectives linked at the level of individual entries.
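The documentation does not specify how the translation statistics were computed. One simple reading, assumed here purely for illustration, is the relative frequency of each translation among all aligned occurrences of a connective. The function name translation_shares and the toy alignments below are hypothetical and are not taken from the corpus data.

```python
from collections import Counter, defaultdict


def translation_shares(aligned_pairs):
    """Map each source connective to {translation: relative frequency}."""
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1
    return {
        source: {t: n / sum(targets.values()) for t, n in targets.items()}
        for source, targets in counts.items()
    }


# Invented toy alignments (Czech connective, German translation);
# these are NOT real corpus counts.
pairs = [
    ("proto", "darum"),
    ("proto", "deshalb"),
    ("proto", "darum"),
    ("proto", "daher"),
]
shares = translation_shares(pairs)
print(shares["proto"])
```

In this toy input, "darum" accounts for half of the aligned occurrences of "proto"; a real computation would run over the aligned parallel data instead.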

License and Availability

The first version of GeCzLex was released in December 2019, as a freely available resource under the Creative Commons License.

How to Cite

Rysová Kateřina, Poláková Lucie, Rysová Magdaléna, Mírovský Jiří: Lexicon of Czech and German Anaphoric Connectives. Data/software, Charles University, Prague, Czech Republic, http://hdl.handle.net/11234/1-3075, Dec 2019

Poláková Lucie, Rysová Kateřina, Rysová Magdaléna, Mírovský Jiří: GeCzLex: Lexicon of Czech and German Anaphoric Connectives. In: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), Copyright © European Language Resources Association, Paris, France, 2020

Lexicon Entries – Anaphoric Connectives

GeCzLex aims to cover anaphoric connectives (ACs) in Czech and German. Anaphoricity is understood here from both a formal and a functional point of view. The sets of expressions and phrases that satisfy each definition differ but overlap, and both sets are included in GeCzLex.

According to the formal definition, an anaphoric connective is an expression (or a multiword phrase, depending on the degree of grammaticalization) containing an anaphoric element. Structurally, it is usually formed from a preposition (adposition) and a referential component (e.g. darum in German, proto in Czech). In this respect, German shows a stronger tendency toward formation by composition than Czech and has more grammaticalized (single-word) anaphoric connectives. Considering semantic equivalents in the two languages, many anaphoric connectives that appear as single words in German are multiword phrases in Czech. Therefore, the Czech part of the lexicon also covers multiword phrases with the structure “preposition + anaphoric element”. In this way, we selected three groups of formally anaphoric connectives: grammaticalized connectives in German (54 in total), grammaticalized connectives in Czech (17), and non-grammaticalized connectives (multiword phrases) in Czech (11).

From the functional perspective, anaphoric connectives, like demonstratives, can relate to their left-sided argument anaphorically rather than syntactically. This means that ACs can also relate “remotely”, taking distant, non-adjacent text segments as their left-sided arguments. We place no constraints on the part of speech (PoS) of anaphoric connectives and base our work solely on gold discourse-annotated data. Currently, GeCzLex contains 14 anaphoric connectives in Czech and 2 in German according to the functional definition.

Lexicon Structure

The current GeCzLex lexicon entry contains:

The entry head – the lemma of the connective. For Czech secondary connectives, the lemma is the whole (prepositional) phrase so that the referential component is visible at first sight.

The URL link to the full entry of the given connective in the underlying resource: CL (CzeDLex) for Czech connectives and DL (DiMLex) for German connectives. Such a link is also provided for every translation of a given connective.

For each entry lemma, a list of assigned semantic relations (senses) from the PDTB 3 tagset is displayed. At present, the ordering of the semantic relations in the lexicon entry is alphabetical, not sorted according to the corpus frequencies.

Intra-lexicon links: if a translation of a given connective is also an anaphoric connective, clicking on its lemma again opens up its GeCzLex entry. If it is not, it is displayed in a different color and does not contain a hypertext link.

For each translation, the syntactic category is given, i.e. the part of speech (for Czech primary connectives) or the syntactic structure (for German connectives), as extracted from the underlying lexicons. For secondary connectives (multiword structures in Czech), the syntactic structure is added manually.
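The entry components listed above could be modelled roughly as follows. This is a minimal sketch and not the actual GeCzLex data format: the class and field names (Entry, Translation, linked_entry and so on), the example.org URLs, and the senses and categories assigned to the sample lemmas are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Translation:
    lemma: str               # translation equivalent in the other language
    language: str            # "cs" or "de"
    source_url: str          # link to the CzeDLex (CL) or DiMLex (DL) entry
    syntactic_category: str  # part of speech or syntactic structure


@dataclass
class Entry:
    lemma: str               # entry head
    language: str            # "cs" or "de"
    source_url: str          # link to the full entry in the underlying lexicon
    senses: list[str] = field(default_factory=list)
    translations: list[Translation] = field(default_factory=list)

    def displayed_senses(self) -> list[str]:
        """Senses in alphabetical order, not ordered by corpus frequency."""
        return sorted(self.senses)


def linked_entry(
    lexicon: dict[tuple[str, str], Entry], t: Translation
) -> Optional[Entry]:
    """Return the GeCzLex entry of a translation, or None if it has none."""
    return lexicon.get((t.language, t.lemma))


# Placeholder values only: senses, categories and URLs are illustrative.
darum = Entry("darum", "de", "https://example.org/DL/darum",
              senses=["Contingency.Cause.Result"])
proto = Entry(
    "proto", "cs", "https://example.org/CL/proto",
    senses=["Expansion.Conjunction", "Contingency.Cause.Result"],
    translations=[
        Translation("darum", "de", "https://example.org/DL/darum", "adverb"),
        Translation("so", "de", "https://example.org/DL/so", "adverb"),
    ],
)
lexicon = {(e.language, e.lemma): e for e in (darum, proto)}
```

In this sketch, linked_entry returns the entry for "darum" because it is itself in the lexicon, and None for "so", which mirrors the difference between a clickable intra-lexicon link and a translation displayed without a hyperlink.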