Prague Discourse Treebank 2.0 (PDiT 2.0)

PDiT 2.0 (Rysová et al. 2016) is a new version of the annotation of discourse relations in the Prague Dependency Treebank. It contains a complex annotation of discourse phenomena, newly enriched with the annotation of secondary connectives, i.e. non-grammaticalized and mostly multiword expressions such as z tohoto důvodu “for this reason” or za těchto podmínek “under these conditions”. It also contains a revised annotation of primary discourse connectives from the previous versions, PDiT 1.0 and PDT 3.0.

Introduction

The annotation of discourse relations is a project related to the Prague Dependency Treebank 3.0 (PDT; Bejček et al. 2013), which is a revised, updated and extended version of the Prague Dependency Treebank 2.5 (Bejček et al. 2011) and the Prague Dependency Treebank 2.0 (Hajič et al. 2006). It represents a new manually annotated layer of language description above the existing layers of the PDT (morphology, surface syntax and underlying syntax), and it portrays linguistic phenomena from the perspective of discourse structure and coherence. The discourse layer of the treebank comprises two subprojects:

  1. a lexically grounded approach to identifying discourse connectives, the discourse units they link and the semantic relations between these units, and
  2. annotations of extended textual coreference and bridging relations.

With its 49,431 manually annotated sentences from Czech newspapers, the project serves as a large-scale resource for linguistic research in the area of discourse analysis as well as for computational experiments concerning automatic text analysis, information extraction, text summarization and other branches of NLP research.

In contrast to most corpus projects with similar aims, the discourse-related information has been annotated directly on the syntactic trees and is technically part of the underlying syntax layer of the PDT. This methodological approach allows us to include discourse-relevant syntactic phenomena annotated earlier (such as discourse relations expressed by dependent clauses) in the discourse representation, and to take advantage of the syntactic structure itself (resolution of elliptical structures, parentheses, appositions etc.). Also, when querying and visualizing the treebank, all the different types of linguistic information are interlinked and visible at once.

The Prague Discourse Treebank 2.0 annotates the same texts as the Prague Dependency Treebank 3.0 (PDT 3.0, Bejček et al. 2013), PDT 2.5 (Bejček et al. 2011), PDT 2.0 (Hajič et al. 2006), and the Prague Discourse Treebank 1.0 (PDiT 1.0, Poláková et al. 2012). Apart from fixing errors and improving the annotation on all annotation layers, new information was added to the data in each new release:

  • from PDT 2.0 to PDT 2.5
    • Multiword expressions
    • Pair/group meaning
    • Clause segmentation
  • from PDT 2.5 to PDiT 1.0
    • Extended textual coreference
    • Bridging anaphora
    • Discourse relations marked by explicit connectives
  • from PDiT 1.0 to PDT 3.0
    • Revision of several grammatemes
    • Revision of sentence modality annotation
    • Replacement of t_lemma #Benef
    • Genres of documents
    • Pronominal textual coreference of 1st and 2nd person
    • Updated discourse relations marked by explicit connectives
  • from PDT 3.0 to PDiT 2.0
    • Annotation of secondary connectives and the senses (semantico-pragmatic discourse relations) they express
    • Updated annotation of discourse relations marked by primary connectives:
      • fixes of various individual errors
      • missing connectives filled in (except for relations of 'specification')
      • relations marked with discourse type 'other' changed to the nearest other type
      • fixes of unusual low-count connectives

All the additional annotation (with the exception of clause segmentation) was performed on the tectogrammatical trees and is technically part of the underlying syntax layer of the PDT. The annotation of clause segmentation was done on the analytical layer.

The Prague Discourse Treebank 2.0 can be downloaded from the LINDAT-Clarin repository (see the Licence).

References

Bejček, E., Hajičová, E., Hajič, J., Jínová, P., Kettnerová, V., Kolářová, V., Mikulová, M., Mírovský, J., Nedoluzhko, A., Panevová, J., Poláková, L., Ševčíková, M., Štěpánek, J., Zikánová, Š.: Prague Dependency Treebank 3.0. Data/software, Univerzita Karlova v Praze, MFF, ÚFAL, Prague, 2013. (http://ufal.mff.cuni.cz/pdt3.0/)

Bejček, E., Panevová, J., Popelka, J., Smejkalová, L., Straňák, P., Ševčíková, M., Štěpánek, J., Toman, J., Žabokrtský, Z., Hajič, J.: Prague Dependency Treebank 2.5. Data/software, Univerzita Karlova v Praze, MFF, ÚFAL, Prague, 2011. (http://ufal.mff.cuni.cz/pdt2.5/)

Hajič et al.: Prague Dependency Treebank 2.0. Data/software, Linguistic Data Consortium, Philadelphia, PA, USA, 2006. ISBN 1-58563-370-4 (http://www.ldc.upenn.edu)

Poláková, L., Jínová, P., Zikánová, Š., Hajičová, E., Mírovský, J., Nedoluzhko, A., Rysová, M., Pavlíková, V., Zdeňková, J., Pergler, J., Ocelák, R.: Prague Discourse Treebank 1.0. Data/software, Univerzita Karlova v Praze, MFF, ÚFAL, Prague, 2012. (http://ufal.mff.cuni.cz/pdit/)

Rysová, M., Synková, P., Mírovský, J., Hajičová, E., Nedoluzhko, A., Ocelák, R., Pergler, J., Poláková, L., Scheller, V., Zdeňková, J., Zikánová, Š.: Prague Discourse Treebank 2.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, Lindat/Clarin: http://hdl.handle.net/11234/1-1905, Dec 2016