Prague Dependency Treebank 3.5

Introduction

The Prague Dependency Treebank 3.5 is a 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied Linguistics under various projects between 1996 and 2018 on the original texts. There are other members of the "family" of the Prague Dependency Treebanks, available separately and described elsewhere; search for "Prague Dependency Treebank" in the LINDAT/CLARIN repository.

The Prague Dependency Treebank 3.5 contains the same texts as all previous versions since 2.0. There are 49,431 sentences (over 800 thousand nodes) annotated on all layers, from the tectogrammatical layer down to the word layer, plus additional sentences annotated only on the analytical (surface dependency syntax) and morphological layers (approx. 1.8 million words in total).

For more information about this version of the treebank, see the changelog and download link below; the menu tabs above lead to the description of the data, available documentation, credits and support acknowledgements.

Example sentence from PDT 3.5

Sarančata jsou doposud ve stadiu larev a pohybují se pouze lezením. V tomto období je účinné bojovat proti nim chemickými postřiky, ale dožívající družstva ani soukromí rolníci nemají na jejich nákup potřebné prostředky.

Example sentences from PDT 3.5, with tectogrammatical annotation including coreference links (blue and brown arrows), MWEs (red stripes) and discourse annotation (orange arrows and attributes/labels). Lit.: Grasshoppers are still in the larval stage and move only by crawling. At this time of the year, it is efficient to fight them using chemical sprays, but neither the ailing cooperatives nor private farmers can afford to buy them.

From PDT 1.0 to PDT 3.5

The first version of PDT was published by LDC in 2001. Since then, various branches of PDT have been developed, each adding more annotation. Most importantly, PDT 2.0 added the tectogrammatical layer, which distinguishes the PDT family of treebanks from most other available dependency treebanks. As of January 2018, PDT 3.5 is the current version, encompassing all previous versions, corrections and additional annotation. The history of the PDT editions is briefly listed below.

  • PDT 1.0
    • Words, Tokenization
    • Morphology (13 categories (features): POS, number, gender, case, negation, ...)
    • (Surface) Dependency syntax ("analytical layer"), dependency relations
  • Added in PDT 2.0
    • Tectogrammatical annotation (deep syntax, valency), including valency dictionary PDT-Vallex
    • Coreference (pronominal/textual, grammatical)
    • Information structure
    • Grammatemes (tense, modalities, number, ...)
  • Added in PDT 2.5
    • Multiword expressions
    • Pair/group meaning
    • Clause segmentation (on analytical layer)
  • Added in PDiT 1.0
    • Extended textual coreference
    • Bridging anaphora
    • Discourse relations marked by explicit connectives
  • Added in PDT 3.0
    • Revision of several grammatemes
    • Revision of sentence modality annotation
    • Replacement of t_lemma #Benef
    • Genres of documents
    • Pronominal textual coreference of 1st and 2nd person
    • Updated discourse relations marked by explicit connectives
  • Added in PDiT 2.0
    • Annotation of secondary connectives and senses (semantico-pragmatic discourse relations) they express
    • Updated annotation of discourse relations marked by primary connectives:
      • fixes of various individual errors
      • missing connectives filled in (except for relations of 'specification')
      • relations marked with discourse type 'other' changed to the nearest other type
      • fixes to unusual low-frequency connectives
  • Added in PDT 3.5
    • Consolidated documentation, authorship, licence
    • New and separate item in LINDAT/CLARIN repository

Download

To download the data, please visit the PDT 3.5 item in the LINDAT/CLARIN repository.

Search

To search the treebank, please use the PML-TQ (PML Tree Query) service at LINDAT/CLARIN. Please note that the service currently searches PDT 3.0; except for the discourse annotation added later in PDiT 2.0, the data are identical. (PDT 3.5 in PML-TQ is coming soon.)

Cite

To properly acknowledge this resource, please cite the following data item in the LINDAT/CLARIN repository:

For LREC papers (separate language resources references):


@languageresource{lrPDT35,
 title = {Prague Dependency Treebank 3.5},
 author = {Haji\v{c}, Jan and Bej\v{c}ek, Eduard and B\'{e}mov\'{a}, Alevtina 
 and Bur\'{a}\v{n}ov\'{a}, Eva and Haji\v{c}ov\'{a}, Eva and Havelka, Ji\v{r}\'{\i} 
 and Homola, Petr and K\'{a}rn\'{\i}k, Ji\v{r}\'{\i} and Kettnerov\'{a}, V\'{a}clava 
 and Klyueva, Natalia and Kol\'{a}\v{r}ov\'{a}, Veronika and Ku\v{c}ov\'{a}, Lucie 
 and Lopatkov\'{a}, Mark\'{e}ta and Mikulov\'{a}, Marie and M\'{\i}rovsk\'{y}, Ji\v{r}\'{\i} 
 and Nedoluzhko, Anna and Pajas, Petr and Panevov\'{a}, Jarmila 
 and Pol\'{a}kov\'{a}, Lucie and Rysov\'{a}, Magdal\'{e}na and Sgall, Petr 
 and Spoustov\'{a}, Johanka and Stra\v{n}\'{a}k, Pavel and Synkov\'{a}, Pavl\'{\i}na 
 and {\v{S}}ev\v{c}\'{\i}kov\'{a}, Magda and {\v{S}}t\v{e}p\'{a}nek, Jan and Ure\v{s}ov\'{a}, Zde\v{n}ka 
 and Vidov\'{a} Hladk\'{a}, Barbora and Zeman, Daniel and Zik\'{a}nov\'{a}, {\v{S}}\'{a}rka 
 and {\v{Z}}abokrtsk\'{y}, Zden\v{e}k},
 url = {http://hdl.handle.net/11234/1-2621},
 publisher={Institute of Formal and Applied Linguistics, LINDAT/CLARIN, Charles University},
 address={Prague, Czech Republic}, 
 lindat={http://hdl.handle.net/11234/1-2621},
 year = {2018} }

For general papers and citations:


@misc{11234/1-2621,
 title = {Prague Dependency Treebank 3.5},
 author = {Haji\v{c}, Jan and Bej\v{c}ek, Eduard and B\'{e}mov\'{a}, Alevtina 
 and Bur\'{a}\v{n}ov\'{a}, Eva and Haji\v{c}ov\'{a}, Eva and Havelka, Ji\v{r}\'{\i} 
 and Homola, Petr and K\'{a}rn\'{\i}k, Ji\v{r}\'{\i} and Kettnerov\'{a}, V\'{a}clava 
 and Klyueva, Natalia and Kol\'{a}\v{r}ov\'{a}, Veronika and Ku\v{c}ov\'{a}, Lucie 
 and Lopatkov\'{a}, Mark\'{e}ta and Mikulov\'{a}, Marie and M\'{\i}rovsk\'{y}, Ji\v{r}\'{\i} 
 and Nedoluzhko, Anna and Pajas, Petr and Panevov\'{a}, Jarmila 
 and Pol\'{a}kov\'{a}, Lucie and Rysov\'{a}, Magdal\'{e}na and Sgall, Petr 
 and Spoustov\'{a}, Johanka and Stra\v{n}\'{a}k, Pavel and Synkov\'{a}, Pavl\'{\i}na 
 and {\v{S}}ev\v{c}\'{\i}kov\'{a}, Magda and {\v{S}}t\v{e}p\'{a}nek, Jan and Ure\v{s}ov\'{a}, Zde\v{n}ka 
 and Vidov\'{a} Hladk\'{a}, Barbora and Zeman, Daniel and Zik\'{a}nov\'{a}, {\v{S}}\'{a}rka 
 and {\v{Z}}abokrtsk\'{y}, Zden\v{e}k},
 url = {http://hdl.handle.net/11234/1-2621},
 note = {{LINDAT}/{CLARIN} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
 copyright = {Creative Commons - Attribution-{NonCommercial}-{ShareAlike} 4.0 International ({CC} {BY}-{NC}-{SA} 4.0)},
 year = {2018} }

For "plaintext" reference:

(Hajič et al., 2018)

Hajič, J., Bejček, E., Bémová, A., Buráňová, E., Hajičová, E., Havelka, J., Homola, P., Kárník, J., Kettnerová, V., Klyueva, N., Kolářová, V., Kučová, L., Lopatková, M., Mikulová, M., Mírovský, J., Nedoluzhko, A., Pajas, P., Panevová, J., Poláková, L., Rysová, M., Sgall, P., Spoustová, J., Straňák, P., Synková, P., Ševčíková, M., Štěpánek, J., Urešová, Z., Vidová Hladká, B., Zeman, D., Zikánová, Š. and Žabokrtský, Z. (2018). Prague Dependency Treebank 3.5. Institute of Formal and Applied Linguistics, LINDAT/CLARIN, Charles University, LINDAT/CLARIN PID: http://hdl.handle.net/11234/1-2621.

For footnote references, the following is sufficient in LaTeX papers:


\url{http://hdl.handle.net/11234/1-2621}
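
As an illustration only, the footnote reference and the general @misc entry above might be combined in a paper roughly as follows. This is a minimal sketch, not a prescribed setup: the file name references.bib, the surrounding sentence and the choice of the plain bibliography style are assumptions made for the example, and the url package is assumed to provide \url.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{url} % provides the \url command used in the footnote

\begin{document}

% Footnote reference: the persistent identifier alone.
Our experiments use the Prague Dependency Treebank
3.5.\footnote{\url{http://hdl.handle.net/11234/1-2621}}

% Full citation: the key is the one used by the @misc entry above.
The data are described in \cite{11234/1-2621}.

% Assumption: the @misc entry has been saved in references.bib
% (hypothetical file name) next to this document.
\bibliographystyle{plain}
\bibliography{references}

\end{document}
```

With a standard style such as plain, the url and copyright fields of the entry are typically ignored; a style that prints URLs can be substituted if the handle should appear in the reference list.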