The Czech RST Discourse Treebank 1.0 (CzRST-DT 1.0, Poláková et al. 2023) is a dataset of 54 Czech journalistic texts manually annotated using Rhetorical Structure Theory (RST, Mann and Thompson 1988). Each text document in the treebank is represented as a single tree-like structure whose nodes (discourse units) are interconnected through hierarchical rhetorical relations.

The dataset also contains concurrent annotations of five double-annotated documents.

The original texts are a part of the data annotated in the Prague Dependency Treebank (Hajič et al., 2020), although the two projects are independent.

The annotation in the Czech RST Discourse Treebank is based in large part on the RST version used for annotation in the Potsdam Commentary Corpus and documented in the following two annotation guidelines (English and German):

The annotation scheme for the Czech treebank is described in the Annotation Manual (in Czech, available upon request). Compared to the Stede et al. (2017) version, the guidelines for Czech have been modified in the following basic points:

  • Segmentation: segmentation of discontinuous units, segmentation of relative clauses, attribution and reported content
  • Structure: changes resulting from the new segmentation principles, some further constraints on the structure
  • Relation inventory: introduction of 5 new labels (mostly for the needs of reversed nuclearity), for a total of 36 rhetorical relations + 1 technical relation (Same-unit).



