Introduction

Annotation of discourse relations is a project related to the Prague Dependency Treebank 2.5 (PDT; Bejček et al. 2011), which is a revised, updated and extended version of the Prague Dependency Treebank 2.0 (Hajič et al. 2006). It represents a new manually annotated layer of language description above the existing layers of the PDT (morphology, surface syntax and underlying syntax), and it portrays linguistic phenomena from the perspective of discourse structure and coherence. The discourse layer of the treebank contains two subprojects:

  1. a lexically grounded approach to identifying discourse connectives, the discourse units they link and the semantic relations between these units, and
  2. annotations of extended textual coreference and bridging relations.

With its 49,431 manually annotated sentences from Czech newspapers, the project serves as a large-scale resource for linguistic research in the area of discourse analysis as well as for computational experiments concerning automatic text analysis, information extraction, text summarization and other branches of NLP research.

Unlike most corpus projects with similar aims, the discourse-related information has been annotated directly on the syntactic trees and is technically part of the underlying syntax layer of the PDT. This methodological approach allows us to include discourse-relevant syntactic phenomena annotated earlier (such as discourse relations expressed by dependent clauses) in the discourse representation, and to take advantage of the syntactic structure itself (resolution of elliptical structures, parentheses, appositions etc.). Also, for querying and visualizing the treebank, all the different types of linguistic information are interlinked and available at once.

The Prague Discourse Treebank 1.0 can be downloaded from the LINDAT-Clarin repository (see the Licence).

UPDATE (2013): Please note that an updated version of the corpus was published in December 2013 as Prague Dependency Treebank 3.0.

UPDATE (2016): An even more recent version (enriched with the annotation of secondary connectives) was published in December 2016 as Prague Discourse Treebank 2.0.