The project aims at a theoretical and corpus-based representation of global coherence in Czech written texts. Global coherence assumes a hierarchical representation of smaller (clauses, sentences) and larger text units (e.g. paragraphs) and the existence of coherence relations between these units on all levels of the hierarchy. A single interconnected representation for the entire document is postulated, too. As a first step, up-to-date linguistic frameworks for global coherence analysis are critically evaluated. We benefit from our own long-term experience with describing various linguistic aspects of local coherence. Next, we will design a suitable scenario for representing global coherence with corpus methods and conduct a pilot annotation. The project combines and expands both the line of research on discourse and coherence developed at ÚFAL and recent advances in the international discourse-oriented community.
In the second year of the project, we focused further on assessing the early findings and applying them to the intended corpus-based analysis of global coherence. A large summarizing study was published in the international journal Dialogue and Discourse (Poláková et al., 2021, see below). In this study, apart from deepening our early work on hierarchies in local annotation and extending and comparing it to English locally annotated data, we offer a) a description of distributional differences in semantic types of relations in cross-paragraph vs. intra-paragraph settings in the Prague Dependency Treebank, b) a study of paragraph-initial discourse connectives, identifying Czech connectives typical only of higher structures, c) the detection of a prevalence of large left-sided arguments in locally annotated data, d) some new reflections on the methodologies of the approaches under scrutiny.
Another line of research was dedicated to the role of subjectivity and intentionality in discourse structure, more precisely the role of the so-called pragmatic (epistemic, speech-act) relations in discourse structuring (Poláková and Synková, 2021). The starting points of the analysis were the extent and manner of the author's involvement in relation to the text content and text structuring, and an analysis of inferences. The detailed study of pragmatic relations (as opposed to semantic relations) in their widest contexts reveals a considerable diversity within this group. It shows some room for improvement in local annotation schemes and has direct consequences for understanding the nature of rhetorical labels, not only in the global RST framework. In RST, there is a similar division into semantic and presentational pragmatic relations (and a third category, textual), which needs to be reviewed for our purposes.
In preparation for the planned annotation of global coherence in the last year of the project, we have collected and prepared appropriate Czech texts from the PDiT-EDA treebank, a subset of the Prague Dependency Treebank annotated for local implicit relations. Further, we have selected and installed the annotation tool and tested its functionality.
In the first stage of the project, we concentrated on researching mutual configurations and hierarchical structures in local discourse relations (in a local analytic approach), as compared to the principles of a global approach like Rhetorical Structure Theory (RST). Using qualitative and quantitative corpus methods and an advanced querying system, we have described the ways in which, and the extent to which, Czech data annotated for local coherence display features of higher text structure/global coherence. A first step of this research was published this year (Poláková and Mírovský, TSD 2020, see below).
On the basis of these findings, in terms of underlying theories and analytical methods in coherence processing, we have examined how well some of the principles of local and global approaches to the description of discourse coherence hold up on real texts, such as the tree-like representation of documents (RST) or the minimality principle (Penn Discourse Treebank, PDTB). The findings for Czech data are quite similar to those for English data published earlier: very few configurations of pairs of local discourse relations in fact break the tree-ness constraint applied in RST. The most decisive factor here is the definition of a discourse unit (argument) in each theoretical framework, together with the annotators' biases in the local, incrementally proceeding analysis vs. the global perspective. A specific role is also played by the treatment of cues/signals of these relations, in our case specifically the treatment of secondary connectives (connective phrases).
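The idea of a pair of relations breaking the tree-ness constraint can be illustrated with a minimal sketch. It is a strong simplification and not the querying procedure used in the project: each local relation is reduced to the span of text units covered by its two arguments, and a pair counts as tree-compatible if the spans are disjoint or one is nested in the other. The Relation type, the function names and the sample spans are hypothetical, chosen only for illustration.

```python
from itertools import combinations
from typing import NamedTuple


class Relation(NamedTuple):
    """A local discourse relation, reduced to the span it covers (assumption)."""
    label: str
    start: int  # index of the first text unit covered by either argument
    end: int    # index of the last text unit covered (inclusive)


def crosses(a: Relation, b: Relation) -> bool:
    """True if the spans partially overlap, i.e. are neither disjoint nor nested."""
    disjoint = a.end < b.start or b.end < a.start
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return not (disjoint or a_in_b or b_in_a)


def crossing_pairs(relations):
    """Return the labels of all relation pairs that cannot share one tree."""
    return [(a.label, b.label)
            for a, b in combinations(relations, 2)
            if crosses(a, b)]


# Invented sample data: r1 is nested in r2, r3 partially overlaps r2.
relations = [
    Relation("r1", 0, 1),
    Relation("r2", 0, 3),
    Relation("r3", 2, 5),
    Relation("r4", 6, 7),
]
print(crossing_pairs(relations))
```

In this toy input only the pair r2 and r3 is reported, since their spans share units 2 and 3 without either containing the other. How often such pairs occur in real data depends heavily on how the argument spans are delimited, which is why the definition of a discourse unit matters so much for the result.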
We have further explored the role of long-distance (mostly anaphoric) relations and connectives, which, in a different (global) analytic perspective, can be regarded as relations between large discourse units, i.e. relations of higher structure (Poláková et al., LREC 2020). We have also studied the specific connective roles of the most common focalizers, which play a role in the thematic progressions of a text and also function as operators in discourse relations (Hajičová, Mírovský, Štěpánková, PBML 2020).
Discourse Relations and Connectives in Higher Text Structure. In: Dialogue and Discourse, ISSN 2152-9620, vol. 12, no. 2, pp. 1-37, https://journals.uic.edu/ojs/index.php/dad/article/view/11537/10198
Pragmatické aspekty v popisu textové koherence. In: Naše řeč, ISSN 0027-8203, vol. 104, no. 4, pp. 225-242, http://nase-rec.ujc.cas.cz/archiv.php?art=8638
Mining Local Discourse Annotation for Features of Global Discourse Structure. In: 23rd International Conference on Text, Speech and Dialogue, pp. 50-60, Springer, Cham, Switzerland, ISBN 978-3-030-58322-4, https://www.springer.com/gp/book/9783030583224#aboutBook
Focalizers and Discourse Relations. In: The Prague Bulletin of Mathematical Linguistics, ISSN 0032-6585, 115, pp. 187-197, https://ufal.mff.cuni.cz/pbml/115/art-hajicova-mirovsky-stepankova.pdf
GeCzLex: Lexicon of Czech and German Anaphoric Connectives. In: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 1082-1089, European Language Resources Association, Marseille, France, ISBN 979-10-95546-34-4, http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.137.pdf
Workshop: "Explicit and implicit coherence relations: Different, but how exactly?", Humboldt-Universität zu Berlin, Germany, January 17-18, 2020:
Lucie Poláková: Implicit relation questions surfacing in Prague discourse projects
Šárka Zikánová: Factors influencing implicit discourse relations in Czech
Annual Meeting of the Societas Linguistica Europaea (SLE 2020), August 27:
Eva Hajičová: Focalizers and discourse relations