Data

The Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0) can be downloaded as a single zip archive from the LINDAT-Clarin repository (see the Licence). After the archive is unzipped, the data can be found in the directory data.
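The unpacking step can also be scripted. The following is a minimal sketch in Python, not part of the corpus distribution: the function name extract_corpus and both path arguments are hypothetical, and since the exact position of the directory data inside the archive is not specified here, the sketch simply searches for it after extraction.

```python
import zipfile
from pathlib import Path


def extract_corpus(archive_path, target_dir):
    """Unpack the zip archive and return the path of its 'data' directory.

    Returns None if no directory called 'data' is found after extraction.
    """
    target = Path(target_dir)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)

    # Assumption: the directory is literally named "data"; its nesting level
    # inside the archive is unknown, so take the shallowest match.
    candidates = [p for p in target.rglob("data") if p.is_dir()]
    if not candidates:
        return None
    return min(candidates, key=lambda p: len(p.parts))
```

Calling extract_corpus with the path of the downloaded archive and an output directory of your choice returns the location of the data files, or None if the layout differs from what the sketch assumes.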

The corpus is published in a PML format closely related to the format of the Prague Dependency Treebank. Each document is represented by multiple interlinked files corresponding to the following annotation layers:

  • t-layer (tectogrammatics, deep syntax),
  • a-layer (analytics, surface syntax),
  • mdata layer (reconstructed text on the morphological layer),
  • wdata layer (word layer, tokenized text, manual transcript of the audio),
  • zdata layer (automatic speech recognition of the audio), and
  • audio layer (the original audio).

Please note that the numbering of files differs from the previous version of the corpus, the PDTSC 1.0. Also, in some files, links from the wdata layer to the zdata layer were unfortunately lost during the annotation of the PDTSC 2.0 and could not be restored. This concerns the following wdata files: pdtsc_023, pdtsc_046, pdtsc_067, pdtsc_081, pdtsc_142, pdtsc_144 and pdtsc_148; they correspond to the following higher-layer files: hs_002.*, lk_016.*, ak_001.*, dk_110.*, dk_111.*, dk_102.* and dk_103.*, respectively.
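For scripts that rely on the links from the wdata layer to the zdata layer, it may be convenient to have the affected files as a lookup table. The sketch below only restates the two lists given above in Python; the names LOST_ZDATA_LINKS and has_lost_zdata_links are hypothetical and are not part of any PDTSC tool.

```python
# wdata files whose links to the zdata layer were lost (as listed above) ...
WDATA_FILES = [
    "pdtsc_023", "pdtsc_046", "pdtsc_067", "pdtsc_081",
    "pdtsc_142", "pdtsc_144", "pdtsc_148",
]
# ... and the corresponding higher-layer file stems, in the same order.
HIGHER_LAYER_STEMS = [
    "hs_002", "lk_016", "ak_001", "dk_110",
    "dk_111", "dk_102", "dk_103",
]

# Maps each affected wdata file to its higher-layer counterpart.
LOST_ZDATA_LINKS = dict(zip(WDATA_FILES, HIGHER_LAYER_STEMS))


def has_lost_zdata_links(file_name):
    """Return True if the file belongs to a document with lost zdata links.

    Everything after the first dot is ignored, so any file extension works.
    """
    stem = file_name.split(".", 1)[0]
    return stem in LOST_ZDATA_LINKS or stem in LOST_ZDATA_LINKS.values()
```

A processing script could call has_lost_zdata_links on each file name and skip the audio alignment step for the documents where it returns True.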

Update: The PDTSC 2.0 data, with a new morphological annotation and in several additional formats, were published as part of the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0) in 2020.

PML

The Prague Markup Language format (PML, Pajas and Štěpánek 2008) is an XML-based format for linguistic annotation (esp. treebanks). With the exception of the audio files, all data formats used in the PDTSC 2.0 are instances of PML. For the sake of completeness, the PML schemata of the files can be found in the directory resources. (The schemata are XML files that describe the structure of the annotated files.) The directory resources also contains the Czech valency lexicon PDT-Vallex 4.0 (pdtvallex-4.0.xml), which corresponds to the annotation on the tectogrammatical layer.
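Because PML files are plain XML, they can be inspected with any XML library. The following Python sketch reads the header of a file and reports which schema and which other files it refers to. It assumes the usual PML convention that the root element has a head child containing a schema element and reffile elements, each with an href attribute; please check this against the actual files. Element names are compared without their XML namespace, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def pml_head_info(source):
    """Return the root element name, schema href and referenced file hrefs.

    'source' is a file name or a file object. Assumes (not verified against
    the corpus) a <head> child holding <schema href=...> and <reffile href=...>.
    """
    root = ET.parse(source).getroot()
    info = {"root": local_name(root.tag), "schema": None, "references": []}
    for child in root:
        if local_name(child.tag) != "head":
            continue
        for elem in child.iter():
            name = local_name(elem.tag)
            if name == "schema":
                info["schema"] = elem.get("href")
            elif name == "reffile":
                info["references"].append(elem.get("href"))
    return info
```

The whole file is parsed into memory, which is simple but may be slow for large documents; a streaming parser would be a reasonable alternative if only the header is needed.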

How to browse the data

Tree editor TrEd

The tree editor TrEd (Pajas and Štěpánek 2008) can be used to open and browse t-files, a-files and mdata-files in the directory data. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions for your operating system.

After the installation, an extension needs to be installed:

(Note: The PDTSC 2.0 needs the same extension as the PDT-C 1.0.)

  1. Start TrEd.
  2. In the top menu, select Setup -> Manage Extensions...; a dialog window with a list of installed extensions appears.
  3. Click on the button "Get New Extensions"; a dialog window with a list of available (not yet installed) extensions appears. Make sure that at least the extension "Prague Dependency Treebank - Consolidated 1.0 (pdtc10)" is checked for installation (if it is not in the list, it may already have been installed).
  4. Click on the button "Install Selected"; the selected extensions get installed.
  5. Close all TrEd windows including the main application window and start TrEd again.

Now, TrEd is able to open t-layer, a-layer and mdata-layer files in directory data.

In case of trouble with the installation of TrEd or the extension, or with browsing the data, please refer to the Troubleshooting section or contact the authors at (tred at ufal.mff.cuni.cz).

Editor MEd

The editor MEd can be used to open lower-layer files in the directory data, namely mdata files interlinked down to the audio layer. To install the editor, please follow the installation instructions on its home page.

Before opening the mdata files, make sure that the schemata m-pdtsc-schema.xml, w-pdtsc-schema.xml, z-pdtsc-schema.xml, mdata_c_schema.xml and wdata_c_schema.xml from the directory resources are present in the same directory as the mdata files.
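Copying the five schemata by hand is easy to get wrong, so the step can be scripted. The following is a minimal Python sketch under the assumption that the directories resources and data are both known; the function name copy_med_schemata and its two arguments are hypothetical. It copies each schema that exists and reports the ones it could not find.

```python
import shutil
from pathlib import Path

# Schema files that MEd needs next to the mdata files (as listed above).
MED_SCHEMATA = [
    "m-pdtsc-schema.xml",
    "w-pdtsc-schema.xml",
    "z-pdtsc-schema.xml",
    "mdata_c_schema.xml",
    "wdata_c_schema.xml",
]


def copy_med_schemata(resources_dir, data_dir):
    """Copy the MEd schemata into data_dir; return names that were missing."""
    resources, data = Path(resources_dir), Path(data_dir)
    missing = []
    for name in MED_SCHEMATA:
        source = resources / name
        if source.is_file():
            # copy2 also preserves the file modification time.
            shutil.copy2(source, data / name)
        else:
            missing.append(name)
    return missing
```

An empty return value means all five schemata are in place; any returned names indicate files that should be looked for elsewhere before opening the data in MEd.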

How to search the data

PML Tree Query (PML-TQ; Pajas and Štěpánek 2009) is a powerful client-server system for querying treebanks. There are two clients available:

  • Tree editor TrEd can be used as a user-friendly graphically oriented client, using the extension "PML Tree Query Interface for TrEd (pmltq)" (follow the same installation steps as for the pdtc10 extension described above).
  • The public PML-TQ server of our institute can also be accessed anonymously using a web browser as a client. Unlike the TrEd client, it does not allow queries to be created graphically and is limited in the ways it can display the results. However, it is the quickest way to get to the data.

For further information about the clients, see the PML-TQ clients documentation. For general info about the PML-TQ, please refer to the PML-TQ web page. For documentation about the query language and tutorials, see the PML-TQ user documentation.

The PDTSC 2.0 textual data, as a part of the PDT-C 1.0, is available for searching using the PML-TQ from ÚFAL's public PML-TQ server (the treebank id for the TrEd client is in parentheses):

  • PDT-C 1.0 - PDTSC (pdtc10_pdtsc): the PDTSC part of the PDT-C 1.0 (without the audio-related layers)

In case of trouble with the PML-TQ, or with searching the PDTSC data in particular, please refer to the Troubleshooting section or contact the developers at (pmltq at ufal.mff.cuni.cz).

References

Pajas, P. and Štěpánek, J.: Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, The Coling 2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3, pp. 673-680, 2008.

Pajas, P. and Štěpánek, J.: System for Querying Syntactically Annotated Corpora. In: Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Association for Computational Linguistics, Suntec, Singapore, ISBN 1-932432-61-2, pp. 33-36, 2009.


Troubleshooting

No entries yet.