Prague Discourse Treebank 3.0 (PDiT 3.0)

The Prague Discourse Treebank 3.0 (PDiT 3.0; Synková et al. 2022) is a new version of the annotation of discourse relations marked by primary and secondary discourse connectives in the data of the Prague Dependency Treebank. Compared with the previous versions (mainly PDiT 2.0 and PDiT 1.0; for a complete list see below), PDiT 3.0 brings a largely revised annotation of discourse relations and also offers the data in the Penn Discourse Treebank 3.0 (PDTB 3.0; Prasad et al. 2019) format and sense taxonomy.

Introduction

The annotation of discourse relations is a project related to the Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0; Hajič et al. 2020), a consolidated release of several previously published Prague dependency treebanks. PDiT 3.0 relates to the PDT part of the PDT-C 1.0. It represents a new manually annotated layer of language description above the existing layers of the PDT (morphology, surface syntax and underlying syntax), and it portrays linguistic phenomena from the perspective of discourse structure and coherence. The discourse annotation takes a lexically grounded approach to identifying discourse connectives, the discourse units they link, and the semantic relations between these units.

With its 49,431 manually annotated sentences from Czech newspapers, the project serves as a large-scale resource for linguistic research in the area of discourse analysis as well as for computational experiments concerning automatic text analysis, information extraction, text summarization and other branches of NLP research.

Unlike the majority of similarly aimed corpus projects, the discourse-related information has been annotated directly on the syntactic trees and is technically a part of the underlying syntax layer of the PDT. This methodological approach allows us to include discourse-relevant syntactic phenomena annotated earlier (e.g. discourse relations expressed by dependent clauses) in the discourse representation, and to take advantage of the syntactic structure itself (resolution of elliptical structures, parentheses, appositions etc.). Also, for querying and visualizing the treebank, all the different types of linguistic information are interlinked and available at once.

To make the data more accessible to the international community, the discourse annotation in PDiT 3.0 is also published in the Penn Discourse Treebank 3.0 format and formalism: the annotation of discourse relations is recorded in a textual column format with links to the underlying plain texts, and the Prague discourse types have been transformed to the Penn senses.
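As an illustration of how a column-format annotation can link back to the underlying plain text, the following Python sketch resolves character-offset span lists against a raw text. It is only a sketch under stated assumptions: the five-column layout of the example line, the "start..end;start..end" span notation with end-exclusive offsets, and the function names are all hypothetical and chosen for illustration. The actual column order and offset conventions should be taken from the format documentation distributed with the data.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def parse_span_list(field: str) -> List[Span]:
    """Parse a span list such as '0..17;22..45' into (start, end) pairs.

    Assumption: offsets are character positions, start inclusive, end exclusive.
    """
    spans = []
    for part in field.split(";"):
        part = part.strip()
        if not part:
            continue  # an empty field yields an empty span list
        start, end = part.split("..")
        spans.append((int(start), int(end)))
    return spans


def span_text(raw_text: str, spans: List[Span]) -> str:
    """Return the text covered by the spans, joining discontinuous parts with a space."""
    return " ".join(raw_text[start:end] for start, end in spans)


def split_columns(line: str) -> List[str]:
    """Split one pipe-delimited annotation line into its columns."""
    return line.rstrip("\n").split("|")


# Toy data, NOT taken from the treebank. Hypothetical column layout:
# relation type | connective spans | sense | Arg1 spans | Arg2 spans
raw = "It rained heavily, so the match was cancelled."
line = "Explicit|19..21|Contingency.Cause.Result|0..17|22..45"

columns = split_columns(line)
connective = span_text(raw, parse_span_list(columns[1]))
arg1 = span_text(raw, parse_span_list(columns[3]))
arg2 = span_text(raw, parse_span_list(columns[4]))
print(columns[0], columns[2], repr(connective), repr(arg1), repr(arg2))
```

Keeping the offsets separate from the text means the annotation file never duplicates the source text; each relation is recovered by slicing the plain-text file at the recorded positions.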

The Prague Discourse Treebank 3.0 annotates the same texts as all the previous releases of Prague discourse annotation on top of the PDT data, i.e. the PDT part of the Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0, Hajič et al. 2020), the Prague Dependency Treebank 3.5 (PDT 3.5, Hajič et al. 2018), the Prague Discourse Treebank 2.0 (PDiT 2.0, Rysová et al. 2016), the Prague Dependency Treebank 3.0 (PDT 3.0, Bejček et al. 2013), and the Prague Discourse Treebank 1.0 (PDiT 1.0, Poláková et al. 2012). For completeness, the underlying PDT corpus had two earlier releases, the Prague Dependency Treebank 2.5 (PDT 2.5, Bejček et al. 2011) and the Prague Dependency Treebank 2.0 (PDT 2.0, Hajič et al. 2006). The following overview lists the main changes related to discourse annotation between the individual releases:

  • from PDT 2.5 to PDiT 1.0
    • Extended textual coreference
    • Bridging anaphora
    • Discourse relations marked by explicit connectives
  • from PDiT 1.0 to PDT 3.0
    • Genres of documents
    • Pronominal textual coreference of 1st and 2nd person
    • Updated discourse relations marked by explicit connectives
  • from PDT 3.0 to PDiT 2.0
    • Annotation of secondary connectives and the senses (semantico-pragmatic discourse relations) they express
    • Updated annotation of discourse relations marked by primary connectives:
      • fixes of various individual errors
      • missing connectives filled in (except for relations of 'specification')
      • relations marked with discourse type 'other' changed to the nearest other type
      • fixes of unusual low-count connectives
  • from PDiT 2.0 to PDT 3.5 to PDT-C 1.0
    • Fixes of individual errors
  • from PDT-C 1.0 to PDiT 3.0
    • Revisions of the annotation of discourse types based on work on the Czech lexicon of discourse connectives
    • Transformation to the PDTB 3.0 format and sense taxonomy

All the additional annotation was performed on the tectogrammatical trees and is technically a part of the underlying syntax layer of the PDT.

The Prague Discourse Treebank 3.0 can be downloaded from the LINDAT-Clarin repository (see the Licence).

References

Bejček, E., Hajičová, E., Hajič, J., Jínová, P., Kettnerová, V., Kolářová, V., Mikulová, M., Mírovský, J., Nedoluzhko, A., Panevová, J., Poláková, L., Ševčíková, M., Štěpánek, J., Zikánová, Š.: Prague Dependency Treebank 3.0. Data/software, Univerzita Karlova v Praze, MFF, ÚFAL, Prague, 2013. (http://ufal.mff.cuni.cz/pdt3.0/)

Bejček, E., Panevová, J., Popelka, J., Smejkalová, L., Straňák, P., Ševčíková, M., Štěpánek, J., Toman, J., Žabokrtský, Z., Hajič, J.: Prague Dependency Treebank 2.5. Data/software, Univerzita Karlova v Praze, MFF, ÚFAL, Prague, 2011. (http://ufal.mff.cuni.cz/pdt2.5/)

Hajič, J., Bejček, E., Bémová, A., Buráňová, E., Fučíková, E., Hajičová, E., Havelka, J., Hlaváčová, J., Homola, P., Ircing, P., Kárník, J., Kettnerová, V., Klyueva, N., Kolářová, V., Kučová, L., Lopatková, M., Mareček, D., Mikulová, M., Mírovský, J., Nedoluzhko, A., Novák, M., Pajas, P., Panevová, J., Peterek, N., Poláková, L., Popel, M., Popelka, J., Romportl, J., Rysová, M., Semecký, J., Sgall, P., Spoustová, J., Straka, M., Straňák, P., Synková, P., Ševčíková, M., Šindlerová, J., Štěpánek, J., Štěpánková, B., Toman, J., Urešová, Z., Vidová Hladká, B., Zeman, D., Zikánová, Š., Žabokrtský, Z.: Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0). Data/software, LINDAT-CLARIAH, URL: http://hdl.handle.net/11234/1-3185, 2020.

Hajič, J., Bejček, E., Bémová, A., Buráňová, E., Hajičová, E., Havelka, J., Homola, P., Kárník, J., Kettnerová, V., Klyueva, N., Kolářová, V., Kučová, L., Lopatková, M., Mikulová, M., Mírovský, J., Nedoluzhko, A., Pajas, P., Panevová, J., Poláková, L., Rysová, M., Sgall, P., Spoustová, J., Straňák, P., Synková, P., Ševčíková, M., Štěpánek, J., Urešová, Z., Vidová Hladká, B., Zeman, D., Zikánová, Š., Žabokrtský, Z.: Prague Dependency Treebank 3.5. Institute of Formal and Applied Linguistics, LINDAT/CLARIN, Charles University, 2018. (http://hdl.handle.net/11234/1-2621)

Hajič et al.: Prague Dependency Treebank 2.0. Data/software, Linguistic Data Consortium, Philadelphia, PA, USA, 2006. ISBN 1-58563-370-4 (http://www.ldc.upenn.edu)

Poláková, L., Jínová, P., Zikánová, Š., Hajičová, E., Mírovský, J., Nedoluzhko, A., Rysová, M., Pavlíková, V., Zdeňková, J., Pergler, J., Ocelák, R.: Prague Discourse Treebank 1.0. Data/software, Univerzita Karlova v Praze, MFF, ÚFAL, Prague, 2012. (http://ufal.mff.cuni.cz/pdit/)

Prasad, R., Webber, B., Lee, A., Joshi, A.: Penn Discourse Treebank Version 3.0. Data/software, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, LDC2019T05, 2019

Rysová, M., Synková, P., Mírovský, J., Hajičová, E., Nedoluzhko, A., Ocelák, R., Pergler, J., Poláková, L., Scheller, V., Zdeňková, J., Zikánová, Š.: Prague Discourse Treebank 2.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, Lindat/Clarin: http://hdl.handle.net/11234/1-1905, Dec 2016

Synková, P., Rysová, M., Mírovský, J., Poláková, L., Scheller, V., Zdeňková, J., Zikánová, Š., Hajičová, E.: Prague Discourse Treebank 3.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, LINDAT/CLARIAH-CZ: http://hdl.handle.net/11234/1-4875, Dec 2022