Year 3 - 2024

Data from the last two remaining subcorpora of PDT-C, i.e. the Prague Dependency Treebank of Spoken Czech (PDTSC) and Faust, were parsed with the most recent version of the discourse parser, using the current underlying data of both corpora (which were being updated simultaneously in another project).

Most of 2024 was dedicated to WP8, i.e. checks and corrections of the automatic annotation of PDTSC and Faust. Faust, being a small corpus, was checked manually in full. For PDTSC, the most problematic parts were checked and corrected; they were identified on the basis of previous knowledge and various analyses of samples of the pre-annotated data.

Inter-annotator agreement between the resulting data and a sample of 1,000 PDTSC sentences annotated completely manually indicates a very high annotation quality of the resulting data: the F1 measure on the existence of relations is 0.94, and the agreement on discourse types is 83% (Cohen’s kappa 0.8).
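For readers unfamiliar with these measures, the following minimal sketch shows how an F1 score on the existence of relations and Cohen's kappa on discourse types are commonly computed. It is an illustration only and not the evaluation script used in the project: the function names, the representation of a relation as a pair of argument identifiers, and the toy data are all assumptions, as is the choice to compare types only on relations found in both annotations.

```python
from collections import Counter


def f1_on_existence(gold, pred):
    """F1 over two sets of relation identifiers.

    Assumption: a relation is identified by a hashable key, here a
    hypothetical (arg1_id, arg2_id) pair.
    """
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label proportions.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data (invented for illustration, not taken from PDTSC).
gold_relations = {(1, 2), (2, 3), (3, 4), (5, 6)}
parsed_relations = {(1, 2), (2, 3), (3, 4), (6, 7)}

# Discourse types of the three relations present in both annotations.
gold_types = ["reason", "conjunction", "opposition"]
parsed_types = ["reason", "conjunction", "conjunction"]

print(f1_on_existence(gold_relations, parsed_relations))
print(cohen_kappa(gold_types, parsed_types))
```

Kappa is reported alongside the raw percentage because it discounts the agreement that two annotations would reach by chance given their label distributions.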

At the end of 2024, as WP9, all data annotated in the whole project, i.e. all four subcorpora of PDT-C (PDT, PCEDT-cz, PDTSC, Faust), were updated to the most recent version of the underlying data. Discrepancies were checked and fixed, and the data were transformed to the Penn Discourse Treebank 3.0 data format and sense taxonomy. After final checks, the discourse annotation in both frameworks was published in the LINDAT/CLARIAH-CZ repository: as a part of the new PDT-C 2.0 (in the PML data format) and as the Prague Discourse Treebank 4.0 (in the PDTB data format), thus fulfilling the main goal of the project.

Two long conference papers were presented at LREC-COLING 2024, elaborating on key parts of the data annotation process and on the quality of the resulting data; based on the response from the audience, future international cooperation has been established.