Year 2 - 2020

Work on the lexicon of Czech discourse connectives (CzeDLex) continued with further manual checks and additions. Special attention was given to prepositional connectives such as k (tomu) [to (that)] and kromě (toho) [except for (that)], and to their consistent annotation. The manually processed part of the lexicon now covers over 95% of all discourse relations in the Prague Discourse Treebank (PDiT; WP4).

The projection of the discourse relation annotation from the Penn Discourse Treebank 3.0 (PDTB) onto the Czech part of the Prague Czech-English Dependency Treebank (PCEDT) was completed. A list of Czech connectives used in the projected annotation was then extracted from the data in the same way as the original version of CzeDLex was extracted from PDiT, i.e. along with their possible discourse types and examples. After initial cleaning (the word-to-word alignment in the PCEDT is automatic and therefore not error-free), the remaining entries (approx. 200 connectives) were automatically compared with the current version of CzeDLex. Two researchers then reviewed both the completely new entries and the new discourse types proposed for entries already present in CzeDLex. This effort added 24 entirely new entries (a surprisingly high number) and 26 new discourse types for existing entries to CzeDLex (WP5).
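The automatic comparison step can be pictured as a set difference over connectives and their discourse types. The following is only a minimal sketch in Python, not the project's actual tooling: the dictionary layout, the function name compare_with_lexicon, and the toy connectives and type labels are all assumptions made for illustration.

```python
def compare_with_lexicon(projected, lexicon):
    """Split projected connectives into candidates for manual review.

    Both arguments map a connective (str) to a set of discourse types (str).
    This layout is hypothetical; it is not the CzeDLex data format.
    Returns (new_entries, new_types).
    """
    new_entries = {}  # connectives missing from the lexicon altogether
    new_types = {}    # known connectives with discourse types not yet listed
    for connective, types in projected.items():
        if connective not in lexicon:
            new_entries[connective] = set(types)
        else:
            unseen = set(types) - lexicon[connective]
            if unseen:
                new_types[connective] = unseen
    return new_entries, new_types


# Toy data, invented for this example only.
lexicon = {
    "ale": {"opposition"},
    "protože": {"reason-result"},
}
projected = {
    "ale": {"opposition", "concession"},
    "protože": {"reason-result"},
    "přesto": {"concession"},
}

new_entries, new_types = compare_with_lexicon(projected, lexicon)
print(new_entries)
print(new_types)
```

In such a setup, both returned dictionaries would only be candidate lists; because the underlying alignment is automatic, each candidate still needs the kind of manual review described above.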

A new version of CzeDLex, version 0.7, was published at the end of 2020 in the Lindat/Clarin repository. It contains 218 entries in total, 131 of which have been manually checked and enriched with additional linguistic information. The lexicon, along with detailed documentation, is also available online: https://ufal.mff.cuni.cz/czedparse/czedlex0.7

Further work was done on the discourse parser for Czech and the associated test data. Over one thousand sentences of the Czech part of the PCEDT were annotated manually following the Prague approach, to serve mainly as test data for the parser. Two hundred sentences were annotated in parallel by two annotators, and the discrepancies between them were studied and resolved (WP6). Additionally, for theoretical purposes and for testing the annotation projection, part of the PCEDT-cz sentences was also annotated manually in the PDTB style using the original PDTB Annotator tool.

In 2020, the research on discourse parsing itself focused on automatic recognition of the discourse type of a discourse relation using embeddings and deep neural networks. Following similar approaches for implicit relations, we devised a method for incorporating information about the explicitly present connective into the training/test instances (WP7). We also experimented with using the data obtained by the annotation projection as additional training data. Our experiments with BERT (a widely used pre-trained language model for deep learning in NLP) were summarized in a paper that was submitted to the ICICT 2021 conference.
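The report does not spell out how the connective information enters the instances. Purely as an illustration of the general idea, and not as the method used in the project, one conceivable encoding marks the connective with a marker token inside a two-segment input of the kind sentence-pair classifiers accept. The function name build_instance, the marker string [CONN], and the example sentence are hypothetical.

```python
def build_instance(arg1, arg2, connective=None, marker="[CONN]"):
    """Build a two-segment input for a sentence-pair classifier.

    Hypothetical encoding: if an explicit connective is given, it is placed
    at the start of the second segment, wrapped in marker tokens. Without a
    connective, the two arguments are returned unchanged (apart from
    whitespace stripping), as for an implicit relation.
    """
    first = arg1.strip()
    second = arg2.strip()
    if connective:
        second = f"{marker} {connective.strip()} {marker} {second}"
    return first, second


# Invented example: "It was raining. Therefore we stayed at home."
explicit = build_instance("Pršelo.", "zůstali jsme doma.", connective="proto")
implicit = build_instance("Pršelo.", "Zůstali jsme doma.")
print(explicit)
print(implicit)
```

With a real tokenizer, a marker such as [CONN] would have to be registered as an additional special token; otherwise it would be split into subword pieces.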

Publications

Jiří Mírovský, Lucie Poláková, Pavlína Synková (2020): CzeDLex 0.6 and its Representation in the PML-TQ. In: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 1128-1134, European Language Resources Association, Marseille, France, ISBN 979-10-95546-34-4