Prague Dependency Treebank 3.5

Data

The Prague Dependency Treebank 3.5 (PDT 3.5) can be downloaded as a single zip archive from the LINDAT-Clarin repository (see the Licence).

After unzipping the downloaded archive, the data can be found in the directory data, where they are divided into three directories according to the highest layer on which they have been annotated:

  • tamw contains documents annotated on all three annotation layers (morphological, analytical, and tectogrammatical), together with all additional annotation and all corrections made after PDT 2.0 was released,
  • amw contains documents annotated on the morphological and analytical layers (but not on the tectogrammatical layer),
  • mw contains documents annotated on the morphological layer only.

In each directory, the data are further divided into ten subdirectories (train-1 ... train-8, dtest, etest). Annotation of each document is captured in (up to) four interlinked files, in accordance with the layer of annotation: word layer (files *.w.gz), morphological layer (*.m.gz), analytical layer (*.a.gz), and tectogrammatical layer (*.t.gz).
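The file layout described above can be explored with a short script. The following is a minimal sketch, assuming that each file is named <document>.<layer>.gz and that the archive was unzipped so that a directory such as data/tamw exists; the function name group_layer_files and the example path are illustrative and not part of the PDT distribution.

```python
from collections import defaultdict
from pathlib import Path

# Layer suffixes as listed above: word, morphological, analytical, tectogrammatical.
LAYERS = ("w", "m", "a", "t")


def group_layer_files(data_dir):
    """Map (subdirectory, document name) to the layer files found for that document.

    Hypothetical helper: it assumes file names of the form <document>.<layer>.gz.
    """
    documents = defaultdict(dict)
    for path in sorted(Path(data_dir).rglob("*.gz")):
        parts = path.name.split(".")
        # Skip anything that does not look like <document>.<layer>.gz.
        if len(parts) >= 3 and parts[-2] in LAYERS:
            doc_name = ".".join(parts[:-2])
            documents[(path.parent.name, doc_name)][parts[-2]] = path
    return dict(documents)


if __name__ == "__main__":
    # "data/tamw" is an assumed location relative to the unzipped archive.
    for (subdir, doc_name), files in group_layer_files("data/tamw").items():
        print(subdir, doc_name, sorted(files))
```

Grouping by subdirectory and document name keeps the interlinked files of one document together, which makes it easy to spot a document for which a layer file is missing.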

 

The data are stored in the Prague Markup Language format (PML; Pajas and Štěpánek 2008; also available as a single PDF). PML is an XML-based format for linguistic annotation (especially treebanks). For the sake of completeness, the PML schemata of the files can be found in the directory resources. (The schemata are XML files that describe the structure of the annotated files.) The valency lexicon PDT-Vallex 3.0 can be found in the same directory.
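Because the files are gzip-compressed XML, they can also be inspected with generic tools, without TrEd. The following is a minimal sketch using only the Python standard library; it makes no assumptions about PML element names and simply reports the root element and how often each element occurs. The function name summarize_pml and the example path are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the "{namespace}" prefix that ElementTree adds to namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def summarize_pml(path):
    """Return the root element name and a count of all element names in a gzipped XML file."""
    with gzip.open(path, "rb") as handle:
        root = ET.parse(handle).getroot()
    counts = Counter(local_name(element.tag) for element in root.iter())
    return local_name(root.tag), counts


if __name__ == "__main__":
    # Hypothetical path; substitute any *.w.gz, *.m.gz, *.a.gz or *.t.gz file from the archive.
    root_name, counts = summarize_pml("data/tamw/train-1/example.t.gz")
    print(root_name)
    for name, number in counts.most_common(10):
        print(name, number)
```

This reads the whole file into memory, which is adequate for a quick look at a single document; the PML schemata in the directory resources remain the authoritative description of the structure.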

How to browse the data

The tree editor TrEd (Pajas and Štěpánek 2008) can be used to open and browse the data. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions for your operating system.

After the installation, an extension needs to be installed:

  1. Start TrEd.
  2. In the top menu, select Setup -> Manage Extensions...; a dialog window with a list of installed extensions appears.
  3. Click on the button "Get New Extensions"; a dialog window with a list of available (not yet installed) extensions appears.
  4. Make sure that at least the extension "Prague Dependency Treebank 3.5 (pdt35 x.y)" is selected for installation (if it is not in the list, it may have already been installed); it contains everything necessary to view the files of PDT 3.5.
  5. Click on the button "Install Selected"; the selected extensions get installed.
  6. Close all TrEd windows including the main application window and start TrEd again.

TrEd is now able to open the data of PDT 3.5. To see the annotation of a document on the tectogrammatical layer, open the respective file with the extension .t.gz. You can switch between the standard PDT view (no discourse) and the discourse annotation view by using the "Style:" selection button in the upper right corner of the TrEd window; PDT_35_T shows the standard (no discourse) annotation, and PDT_35_T_Discourse adds the discourse annotation. In both styles, a context of two preceding and two following sentences is shown in the upper text window. This can be changed by clicking on the neigh_sent setting in the lower right corner, next to the "Scale" slider. In addition, the PDT_35_T_Discourse style can display a selected number of neighbouring trees (left and right context) in the main TrEd window; click on the neigh_trees setting, next to the neigh_sent one in the lower right corner of the TrEd window.

In case of trouble with the installation of TrEd or with browsing the data, please contact the authors at (tred at ufal.mff.cuni.cz).

How to search the data

PML Tree Query (PML-TQ; Pajas and Štěpánek 2009) is a powerful client-server system for querying treebanks, developed primarily for searching in PDT data. TrEd can be used as a user-friendly, graphically oriented client by means of the extension "PML Tree Query Interface for TrEd (pmltq)" (follow the same installation steps as for the pdt35 extension described above). To get access to the full data on the public PML-TQ server provided by the UFAL institute, please contact the administrators at (tred at ufal.mff.cuni.cz). Please refer to the PML-TQ web page for further information.

The public PML-TQ server of our institute can also be accessed anonymously, using a web browser as the client. Unlike the TrEd client, the web client requires no registration, but it cannot be used to create queries graphically and is limited in the ways it can display the results. However, it is the quickest way to get to the data.

References

Pajas, P. and Štěpánek, J.: Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, The Coling 2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3, pp. 673-680, 2008.

Pajas, P. and Štěpánek, J.: System for Querying Syntactically Annotated Corpora. In: Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Association for Computational Linguistics, Suntec, Singapore, ISBN 1-932432-61-2, pp. 33-36, 2009.