Data

The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0) can be downloaded as a single zip archive from the LINDAT-Clarin repository (see the Licence).

After unzipping the downloaded archive, the data can be found in directory data, where they are divided into four subdirectories representing four source corpora:

  • Faust
  • PCEDT
  • PDT
  • PDTSC

The fifth subdirectory there (dictionaries) contains two lexicons related to the corpora:

  • pdtvallex-4.0.xml - Czech valency lexicon PDT-Vallex 4.0
  • czech-morfflex-2.0.xz - Czech morphological dictionary
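
The morphological dictionary is distributed xz-compressed, so it can be inspected without unpacking it first. The following is a minimal sketch, assuming the decompressed content is UTF-8 text with one entry per line (this document does not describe the internal layout of the file); the helper name peek_xz is hypothetical.

```python
import lzma
from itertools import islice


def peek_xz(path, n=5, encoding="utf-8"):
    """Return the first n lines of an xz-compressed text file.

    Assumption: the decompressed content is text in the given encoding.
    """
    with lzma.open(path, mode="rt", encoding=encoding) as handle:
        # islice stops after n lines, so the whole file is never read.
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, peek_xz("data/dictionaries/czech-morfflex-2.0.xz") would return the first five lines, provided the archive was unzipped into the current directory.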

All four corpora are uniformly published in three formats and placed in the respective directories:

  • pml - the PML format (see below) used in previous versions of the PDT since version 2.0; each document is represented by four files corresponding to four layers: t-layer (tectogrammatics), a-layer (analytics, surface syntax), m-layer (morphology) and w-layer (word layer, tokenized text)
  • treex - technically also a PML format, used in the NLP system Treex (all annotation layers are in a single file)
  • mrp - a JSON-based format used in the CoNLL 2019 and 2020 shared tasks on meaning representation parsing (see Uniform Graph Interchange Format); unlike the PML and Treex formats, the conversion to the MRP format is lossy - it extracts part of the annotation from the t- and w-layers while discarding morphology and surface syntax (Zeman and Hajič 2020)
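
Because the mrp files are JSON-based, they can be read with a standard JSON parser. The sketch below assumes the layout used in the MRP shared tasks, with one graph serialized as a JSON object per line, and assumes the keys "id", "nodes" and "edges"; none of these details are specified in this document. The function names are hypothetical.

```python
import json


def read_mrp(lines):
    """Yield one graph (a dict) per non-empty line of MRP input.

    Assumption: each line holds one complete JSON object.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(graph):
    """Return (graph id, number of nodes, number of edges).

    The keys "id", "nodes" and "edges" are assumed; missing keys
    are treated as absent rather than as errors.
    """
    return (
        graph.get("id"),
        len(graph.get("nodes", [])),
        len(graph.get("edges", [])),
    )
```

Any iterable of lines can be passed to read_mrp, for example an open file handle.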

Corpora with special properties are also available in additional formats/directories:

  • PDTSC - the whole PDTSC, consisting of all layers from the audio up to the tectogrammatics, is placed in directory PDTSC/full in a slightly modified PML format. Please note that the numbering of files in this directory differs from the PDTSC 1.0. Also, in some files, links from the wdata layer to the zdata layer were unfortunately lost during the annotation process of the PDTSC 2.0 and could not be restored. This concerns the following wdata files: pdtsc_023, pdtsc_046, pdtsc_067, pdtsc_081, pdtsc_142, pdtsc_144 and pdtsc_148; they correspond to the following higher-layer files: hs_002.*, lk_016.*, ak_001.*, dk_110.*, dk_111.*, dk_102.* and dk_103.*, respectively.
  • PDT - not the whole PDT has been annotated on all annotation layers; the highest annotated layer is indicated by the directory name:

    • tamw - documents annotated on all three annotation layers (morphological, analytical, tectogrammatical)
    • amw - documents annotated on the morphological and analytical layers (but not on the tectogrammatical layer)
    • mw - documents annotated on the morphological layer only

    In each of these directories, the PDT data are further divided into ten subdirectories (train-1 ... train-8, dtest, etest), suggesting their recommended use in machine-learning experiments (training data, development test data, evaluation test data).
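
The recommended split of the PDT data (train-1 ... train-8, dtest, etest) can be recovered from a file path alone. This is a minimal sketch; the function name split_role and the example paths are hypothetical, and only the subdirectory names come from this document.

```python
import re


def split_role(path):
    """Map a path inside the PDT data to its recommended use.

    Returns "train", "dev" or "eval", or None if no path component
    matches one of the ten documented subdirectory names.
    """
    for part in re.split(r"[\\/]", str(path)):
        if re.fullmatch(r"train-[1-8]", part):
            return "train"
        if part == "dtest":
            return "dev"
        if part == "etest":
            return "eval"
    return None
```

For instance, a path such as "tamw/train-3/some_document" would be classified as training data, and one under "mw/etest" as evaluation test data.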

Please note that for the PDTSC, directory PDTSC/full should be considered the primary data source, as directory PDTSC/pml only contains (for each document) one reconstructed m-layer file, an artificial w-layer file and no lower layers. However, the morphological annotation is only present in the m-layer files in directory PDTSC/pml (it is not in the mdata files in directory PDTSC/full).
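
When processing PDTSC/full automatically, it may be useful to flag the documents whose links from the wdata layer to the zdata layer were lost. The sketch below records the file pairs listed above for PDTSC/full; the names LOST_LINKS and has_lost_links are hypothetical, and file names are matched by the part before the first dot only.

```python
# wdata files with lost links to the zdata layer, in the order listed
# in this document, paired with the corresponding higher-layer files.
WDATA = ["pdtsc_023", "pdtsc_046", "pdtsc_067", "pdtsc_081",
         "pdtsc_142", "pdtsc_144", "pdtsc_148"]
HIGHER = ["hs_002", "lk_016", "ak_001", "dk_110",
          "dk_111", "dk_102", "dk_103"]

LOST_LINKS = dict(zip(WDATA, HIGHER))


def has_lost_links(filename):
    """True if the file name belongs to one of the affected documents.

    Both wdata names (pdtsc_NNN) and higher-layer names (e.g. hs_002)
    are recognized; anything after the first dot is ignored.
    """
    stem = filename.split(".")[0]
    return stem in LOST_LINKS or stem in LOST_LINKS.values()
```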

PML

The Prague Markup Language format (PML, Pajas and Štěpánek 2008) is an XML-based format for linguistic annotation (esp. treebanks). With the exception of the MRP and audio files, all data formats of the corpora in the PDT-C are instances of the PML. For the sake of completeness, PML schemata of the files can be found in directory resources. (The schemata are XML files that describe the structure of the annotated files.)
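
Since PML files are XML, a generic XML parser is enough to find out which schema a file declares and which other files it refers to. The sketch below assumes the usual PML header layout (a head element containing a schema element with an href attribute and a references list of reffile elements) and the PML namespace http://ufal.mff.cuni.cz/pdt/pml/; these details are not stated in this document and should be checked against the schemata in directory resources. The function name pml_head_info is hypothetical, and compressed files would have to be decompressed before parsing.

```python
import xml.etree.ElementTree as ET

# Assumed PML namespace; verify against the actual data files.
PML_NS = {"pml": "http://ufal.mff.cuni.cz/pdt/pml/"}


def pml_head_info(source):
    """Return the root element name, schema href and referenced files.

    `source` is a file name or a file-like object with XML content.
    """
    root = ET.parse(source).getroot()
    schema = root.find("pml:head/pml:schema", PML_NS)
    reffiles = root.findall("pml:head/pml:references/pml:reffile", PML_NS)
    return {
        # Strip the "{namespace}" prefix that ElementTree adds to tags.
        "root": root.tag.split("}")[-1],
        "schema": schema.get("href") if schema is not None else None,
        "references": {r.get("name"): r.get("href") for r in reffiles},
    }
```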

How to browse the data

Tree editor TrEd

Tree editor TrEd (Pajas and Štěpánek 2008) can be used to open and browse the PML data, i.e. data in directories pml, treex and, for the PDTSC, also directory full. Please note that in directories pml and PDTSC/full, only t-files, a-files and m-files (mdata-files) can be opened directly in TrEd.

The editor can be downloaded for various platforms from its home page. Please follow the installation instructions for your operating system.

After the installation, an extension needs to be installed:

  1. Start TrEd.
  2. In the top menu, select Setup -> Manage Extensions...; a dialog window with a list of installed extensions appears.
  3. Click on the button "Get New Extensions"; a dialog window with a list of available (not yet installed) extensions appears.
    • For data from directories pml and PDTSC/full, make sure that at least the extension "Prague Dependency Treebank - Consolidated 1.0 (pdtc10)" is checked to install (if it is not in the list, it may have already been installed).
    • For data from directories treex, make sure that at least the extension "EasyTreex - browse and edit Treex files (*.treex, *.treex.gz, *.streex) (easytreex)" is checked to install (if it is not in the list, it may have already been installed).
  4. Click on the button "Install Selected"; the selected extensions get installed.
  5. Close the Manage Extensions dialog window. If you have checked the easytreex extension to install, TrEd may be unresponsive for several minutes, as it is now installing the Treex::Core Perl modules in the background. Wait until it has finished and TrEd starts to react again.
  6. Close all TrEd windows including the main application window and start TrEd again.

Now, TrEd is able to open t-layer, a-layer and m-layer (mdata-layer) files in directories pml and PDTSC/full, and files in directories treex of the PDT-C 1.0.

In case of trouble with the installation of TrEd or the extensions, or with browsing the data, please refer to the Troubleshooting section or contact the authors at (tred at ufal.mff.cuni.cz).

Editor MEd

Editor MEd can be used to open lower-layer files of data in directory PDTSC/full, i.e. mdata files interlinked down to the audio layer. To install the editor, please follow the installation instructions on its home page.

Before opening the mdata files, make sure that you have schemata m-pdtsc-schema.xml, w-pdtsc-schema.xml, z-pdtsc-schema.xml, mdata_c_schema.xml and wdata_c_schema.xml from directory resources in the same directory as the mdata files.
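
Placing the schemata next to the mdata files can be scripted. The following is a minimal sketch; the function name copy_schemata and its parameters are hypothetical, and only the five schema file names are taken from this document. Files already present in the target directory are left untouched.

```python
import shutil
from pathlib import Path

# Schema files that MEd needs next to the mdata files.
MED_SCHEMATA = [
    "m-pdtsc-schema.xml",
    "w-pdtsc-schema.xml",
    "z-pdtsc-schema.xml",
    "mdata_c_schema.xml",
    "wdata_c_schema.xml",
]


def copy_schemata(resources_dir, target_dir, names=MED_SCHEMATA):
    """Copy schema files into target_dir; return the names copied."""
    resources_dir, target_dir = Path(resources_dir), Path(target_dir)
    missing = [n for n in names if not (resources_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"not found in {resources_dir}: {missing}")
    copied = []
    for name in names:
        destination = target_dir / name
        if not destination.exists():  # never overwrite an existing file
            shutil.copy2(resources_dir / name, destination)
            copied.append(name)
    return copied
```

The first argument would be the path to directory resources of the unzipped distribution and the second the directory holding the mdata files to be opened.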

How to search the data

PML Tree Query (PML-TQ; Pajas and Štěpánek 2009) is a powerful client-server system for querying treebanks, developed primarily for searching in the PDT data. The PDT-C as a whole and also all its subcorpora separately are available for searching using the PML-TQ from ÚFAL's public PML-TQ server. There are two clients available:

  • Tree editor TrEd can be used as a user-friendly graphically oriented client, using the extension "PML Tree Query Interface for TrEd (pmltq)" (follow the same installation steps as for the pdtc10 extension described above).
  • The public PML-TQ server of our institute can also be accessed anonymously using a web browser as a client. Unlike the TrEd client, it does not allow queries to be created graphically and is limited in the ways it can display the results. However, it is the quickest way to get to the data.

For further information about the clients, see the PML-TQ clients documentation. For general info about the PML-TQ, please refer to the PML-TQ web page. For documentation about the query language and tutorials, see the PML-TQ user documentation.

There are several PDT-C data sets available for searching (treebank ids for the TrEd client are in parentheses):

  • PDT-C 1.0 (pdtc10): the whole PDT-C 1.0 data (without the English part of the PCEDT and the audio-related layers of the PDTSC)
  • PDT-C 1.0 - Faust (pdtc10_faust): the Faust part of the PDT-C 1.0
  • PDT-C 1.0 - PCEDT-cz (pdtc10_pcedt-cz): the Czech PCEDT part of the PDT-C 1.0
  • PDT-C 1.0 - PDT (pdtc10_pdt): the PDT part of the PDT-C 1.0
  • PDT-C 1.0 - PDTSC (pdtc10_pdtsc): the PDTSC part of the PDT-C 1.0 (without the audio-related layers)

In case of trouble with the PML-TQ or with searching in the PDT-C data in particular, please refer to the Troubleshooting section or contact the developers at (pmltq at ufal.mff.cuni.cz).

References

Pajas, P. and Štěpánek, J.: Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, The Coling 2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3, pp. 673-680, 2008.

Pajas, P. and Štěpánek, J.: System for Querying Syntactically Annotated Corpora. In: Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Association for Computational Linguistics, Suntec, Singapore, ISBN 1-932432-61-2, pp. 33-36, 2009.

Zeman, D. and Hajič, J.: FGD at MRP 2020: Prague Tectogrammatical Graphs. In: Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-952148-64-4, pp. 33-39, 2020.


Troubleshooting

TrEd

Trying to open .treex files, I get the error message "Couldn't open PML schema file 'treex_schema.xml'"

It seems that the TrEd extension easytreex has not succeeded in installing the Treex::Core Perl modules. Try to install them manually from the command line with cpanm Treex::Core or cpan Treex::Core.

Trying to open .treex files, I get the message "Error: member 'discourse_special' not declared for type 't-node.type' at..."

It seems that your Treex::Core Perl modules are outdated. To solve this issue quickly, copy the schemas treex_schema.xml and treex_subschema_t_layer.xml from directory resources of the PDT-C distribution to the same directory as the .treex files you are opening. For a more permanent solution, try uninstalling and reinstalling the easytreex TrEd extension. Finally, you can try to reinstall the Treex::Core Perl modules manually from the command line with cpanm --reinstall Treex::Core.