The project aims at a theoretical and corpus-based representation of global coherence in Czech written texts. Global coherence assumes a hierarchical representation of smaller text units (clauses, sentences) and larger ones (e.g. paragraphs), and the existence of coherence relations between these units at all levels of the hierarchy. A single interconnected representation for the entire document is postulated as well. As a first step, up-to-date linguistic frameworks for global coherence analysis are critically evaluated, drawing on our own long-term experience with describing various linguistic aspects of local coherence. Next, we will design a suitable scenario for representing global coherence with corpus methods and conduct a pilot annotation. The project combines and extends both the line of research on discourse and coherence at ÚFAL and recent advances in the international discourse-oriented community.
The main focus of the project in its third year was on determining a suitable scheme for annotating higher text structure in Czech and on the annotation itself. We critically evaluated the Rhetorical Structure Theory (RST) framework, assessed its usability for the intended coherence annotation task and finally adjusted it for our annotation purposes. The adjustments are both technical (treatment of interrupted clauses and embedded content, types of unit attachment points, treatment of juxtaposed structures) and linguistic (introduction of new semantico-pragmatic labels, reconsideration of some original labels primarily used for English, analysis of attribution, i.e. the relation of the speaker/writer to the uttered content). The resulting annotation design proposal is described in a new annotation manual for annotating RST in Czech, currently in development.
On this basis, we annotated 50 Czech written documents of different genres (news, comments, essays, etc.), with one interconnected representation (a projective tree structure) for each document. To annotate Czech texts in the RST system, we used a local installation of the rstWeb web tool, modified for the needs of the project by extending the list of tags offered to annotators. A portion of the texts was annotated by two annotators in order to address consistency issues and general issues of text interpretation and understanding. The outputs of the rstWeb tool were used to measure inter-annotator agreement with the RST-Tace tool. The annotated data are currently being checked and cleaned and will subsequently be published in the Lindat/Clarin repository as an open-access resource. Due to the pandemic situation, the project was extended by six months.
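The idea of chance-corrected agreement between two annotators can be illustrated with a minimal Python sketch. It is not the RST-Tace implementation, which does its own matching of tree constituents. The sketch assumes that the two annotations have already been aligned so that each position holds the relation label both annotators assigned to the same span, and it computes Cohen's kappa over those labels. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of category labels.

    Assumption: item i in both sequences refers to the same (already
    aligned) text span; alignment itself is not handled here.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: share of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical relation labels for four aligned spans.
annotator_1 = ["elaboration", "contrast", "cause", "elaboration"]
annotator_2 = ["elaboration", "contrast", "elaboration", "elaboration"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa is used here instead of raw percentage agreement because frequent labels would otherwise inflate the score.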
In the second year of the project, we focused further on assessing the early findings and applying them to the intended analysis of global coherence by corpus methods. A large summarizing study was published in the international journal Dialogue and Discourse (Poláková et al., 2021, see below). In this study, apart from deepening our early work on hierarchies in local annotation and extending and comparing it to English locally annotated data, we offer a) a description of distribution differences in semantic types of relations in cross-paragraph vs. intra-paragraph settings in the Prague Dependency Treebank, b) a study of paragraph-initial discourse connectives, identifying Czech connectives that are typical only of higher structures, c) the detection of a prevalence of large left-sided arguments in locally annotated data, d) some new reflections on the methodologies of the approaches under scrutiny.
Another line of research was dedicated to the role of subjectivity and intentionality in discourse structure, more precisely to the role of the so-called pragmatic (epistemic, speech-act) relations in discourse structuring (Poláková and Synková, 2021). The starting points of the analysis were the extent and manner of the author's involvement in relation to the text content and text structuring, and an analysis of inferences. The detailed study of pragmatic relations (as opposed to semantic relations) in their widest contexts reveals considerable diversity within this group. It shows some room for improvement in local annotation schemes and has direct consequences for understanding, among other things, the nature of rhetorical labels in the global RST framework. RST has a similar division into semantic and presentational (pragmatic) relations, plus a third category, textual relations, which needs to be reviewed for our purposes.
In preparation for the planned annotation of global coherence in the last year of the project, we have collected and prepared appropriate Czech texts from the PDiT-EDA treebank, a subset of the Prague Dependency Treebank annotated for local implicit relations. Further, we have selected and installed the annotation tool and tested its functionality.
In the first stage of the project, we concentrated on researching mutual configurations and hierarchical structures of local discourse relations (in the local analytic approach), as compared to the principles of a global approach such as the Rhetorical Structure Theory (RST). Using qualitative and quantitative corpus methods and an advanced querying system, we described the ways in which, and the extent to which, Czech data annotated for local coherence display features of higher text structure/global coherence. A first step of this research was published this year (Poláková and Mírovský, TSD 2020, see below).
On the basis of these findings, in terms of underlying theories and analytical methods in coherence processing, we addressed the adequacy, on real texts, of some of the principles of local and global approaches to the description of discourse coherence, such as the tree-like representation of documents (RST) or the minimality principle (Penn Discourse Treebank, PDTB). The findings for Czech data are quite similar to those published earlier for English data: very few configurations of pairs of local discourse relations in fact break the tree-ness constraint applied in RST. The most decisive factor here is the definition of a discourse unit (argument) in each theoretical framework, together with the annotators' biases in the local, incrementally proceeding analysis vs. the global perspective. A specific role is also played by the treatment of cues/signals of these relations, in our case specifically the treatment of secondary connectives (connective phrases).
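What it means for a pair of local relations to break the tree-ness constraint can be shown with a minimal Python sketch. It rests on a simplifying assumption that is ours, not the project's querying method: each relation is reduced to the overall span of units covered by its two arguments, and a pair counts as a violation when the two spans partially overlap, i.e. they are neither disjoint nor nested. All names and sample data are hypothetical.

```python
from itertools import combinations
from typing import NamedTuple


class Relation(NamedTuple):
    label: str
    start: int  # index of the first unit covered by the relation
    end: int    # index of the last unit covered (inclusive)


def crosses(a, b):
    """True if the two spans partially overlap (neither disjoint nor nested)."""
    overlap = a.start <= b.end and b.start <= a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return overlap and not (a_in_b or b_in_a)


def treeness_violations(relations):
    """Return all pairs of relations whose spans cannot coexist in one tree."""
    return [(a, b) for a, b in combinations(relations, 2) if crosses(a, b)]


# Hypothetical local relations over units 0..5 of one document.
relations = [
    Relation("reason", 0, 3),
    Relation("contrast", 1, 2),   # nested inside "reason": compatible with a tree
    Relation("condition", 2, 5),  # partially overlaps both relations above
]
violations = treeness_violations(relations)
```

In this toy example, two of the three pairs are violations. Real local annotation would need a more careful treatment, for instance of discontinuous arguments, which this sketch ignores.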
We further explored the role of long-distance (mostly anaphoric) relations and connectives, which, in a different (global) analytic perspective, can be regarded as relations between large discourse units, i.e. relations of higher structure (Poláková et al., LREC 2020). We also studied specific connective roles of the most common focalizers, which play a role in the thematic progression of a text and also function as operators in discourse relations (Hajičová, Mírovský, Štěpánková, PBML 2020).
Eva Hajičová, Jan Hajič, Barbora Hladká, Jiří Mírovský, Lucie Poláková, Kateřina Rysová, Magdaléna Rysová, Pavel Straňák, Barbora Štěpánková, Šárka Zikánová (2022): Corpus Annotation as a Feasible and Scientifically Beneficial Task. In: CLARIN: The Infrastructure for Language Resources, Copyright © Walter de Gruyter GmbH, Berlin/Boston, Mannheim, Germany, ISBN 978-3-11-076734-6, pp. 613-646. https://www.degruyter.com/document/doi/10.1515/9783110767377-024/html
Lucie Poláková (2022): Globální koherence českých textů a možnosti jejího korpusového zpracování. Zpráva o aktuálním projektu Ústavu formální a aplikované lingvistiky MFF UK. Jazykovědné aktuality, Vol. LIX, No. 1-2, Copyright © Ústav pro jazyk český AV ČR, Praha, Česká republika, ISSN 1212-5326, pp. 45-50. https://www.jazykovednesdruzeni.cz/jazykovedne-aktuality-2021-2024/
Šárka Zikánová, Jiří Mírovský, Lucie Poláková (2022): Structuration globale du texte: une étude de corpus. In: Écho des études romanes, Vol. 18, Copyright © Université de Bohême du Sud, České Budějovice, ISSN 1801-0865, pp. 99-115. PDF on request.
Discourse Relations and Connectives in Higher Text Structure. In: Dialogue and Discourse, ISSN 2152-9620, vol. 12, no. 2, pp. 1-37, https://journals.uic.edu/ojs/index.php/dad/article/view/11537/10198
Pragmatické aspekty v popisu textové koherence. In: Naše řeč, ISSN 0027-8203, vol. 104, no. 4, pp. 225-242, http://nase-rec.ujc.cas.cz/archiv.php?art=8638
Mining Local Discourse Annotation for Features of Global Discourse Structure. In: 23rd International Conference on Text, Speech and Dialogue, pp. 50-60, Springer, Cham, Switzerland, ISBN 978-3-030-58322-4, https://www.springer.com/gp/book/9783030583224#aboutBook
Focalizers and Discourse Relations. In: The Prague Bulletin of Mathematical Linguistics, ISSN 0032-6585, 115, pp. 187-197, https://ufal.mff.cuni.cz/pbml/115/art-hajicova-mirovsky-stepankova.pdf
GeCzLex: Lexicon of Czech and German Anaphoric Connectives. In: Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 1082-1089, European Language Resources Association, Marseille, France, ISBN 979-10-95546-34-4, http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.137.pdf
Lucie Poláková: Globální textové struktury a jejich anotace v Praze. Contributed talk, Workshop Text Structures and Discourse Relations, Jihočeská Univerzita, České Budějovice, Czech Republic, Dec 2022
Lucie Poláková: Rhetorical Structure Theory as a Model of Global Coherence. Talk, Linguistic Mondays, ÚFAL MFF UK, Prague, Czech Republic, Nov 2022
Workshop: "Explicit and implicit coherence relations: Different, but how exactly?", Humboldt-Universität zu Berlin, Germany, January 17-18, 2020:
Lucie Poláková: Implicit relation questions surfacing in Prague discourse projects
Šárka Zikánová: Factors influencing implicit discourse relations in Czech
Annual Meeting of the Societas Linguistica Europaea (SLE 2020), August 27:
Eva Hajičová: Focalizers and discourse relations