The Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0) can be downloaded as a single zip archive from the LINDAT-Clarin repository (see the Licence). After unzipping the downloaded archive, the data can be found in directory
The corpus is published in a PML format closely related to the format of the Prague Dependency Treebank: each document is represented by multiple interlinked files corresponding to annotation layers:
Please note that the numbering of files differs from that of the previous version of the corpus, the PDTSC 1.0. In addition, in some files the links from the wdata layer to the zdata layer were unfortunately lost during the annotation process of the PDTSC 2.0 and could not be restored. This concerns the following wdata files:
pdtsc_148; they correspond to the following higher-layer files:
Update: The PDTSC 2.0 data, with a new morphological annotation and in several additional formats, were published in 2020 as a part of the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0).
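The unpacking step described at the start of this section can also be scripted. The following is only a minimal sketch: the function name unpack_corpus and the paths in the usage comment are hypothetical, and the actual archive may wrap its contents in an extra top-level directory.

```python
import zipfile
from pathlib import Path


def unpack_corpus(archive: Path, target: Path) -> list[str]:
    """Extract the zip archive into target and return the sorted
    names of the entries found directly under target."""
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return sorted(entry.name for entry in target.iterdir())


# Hypothetical usage (file and directory names are placeholders):
# print(unpack_corpus(Path("downloaded-archive.zip"), Path("pdtsc")))
```

Listing the top-level entries makes it easy to check where the data and resources directories ended up after extraction.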
The Prague Markup Language format (PML, Pajas and Štěpánek 2008) is an XML-based format for linguistic annotation (especially treebanks). With the exception of the audio files, all data formats used in the PDTSC 2.0 are instances of the PML. For the sake of completeness, the PML schemata of the files can be found in the directory resources. (The schemata are XML files that describe the structure of the annotated files.) The directory resources also contains the Czech valency lexicon PDT-Vallex 4.0 (pdtvallex-4.0.xml), which corresponds to the annotation on the tectogrammatical layer.
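Because PML files are plain XML, they can be inspected with any XML library. The sketch below reads one file and reports its root element together with the schema it refers to. It rests on two assumptions that are not stated in this documentation: that PML documents use the namespace http://ufal.mff.cuni.cz/pdt/pml/ and that the schema is referenced from a head/schema element with an href attribute. The function name describe_pml is hypothetical, and gzip-compressed input is handled only as a convenience.

```python
import gzip
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Optional, Tuple

# Assumption: the XML namespace used by PML documents.
PML_NS = "http://ufal.mff.cuni.cz/pdt/pml/"


def describe_pml(path: Path) -> Tuple[str, Optional[str]]:
    """Return (root element name without namespace, schema href or None)."""
    # Read gzip-compressed files transparently, plain files otherwise.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as fh:
        root = ET.parse(fh).getroot()
    # Assumption: the schema reference lives in <head><schema href="..."/>.
    schema = root.find(f"{{{PML_NS}}}head/{{{PML_NS}}}schema")
    href = schema.get("href") if schema is not None else None
    # Strip the "{namespace}" prefix that ElementTree puts on tag names.
    return root.tag.rsplit("}", 1)[-1], href
```

The returned schema name can then be matched against the schema files in the directory resources.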
The tree editor TrEd (Pajas and Štěpánek 2008) can be used to open and browse t-files, a-files and mdata-files in the directory data. The editor can be downloaded for various platforms from its home page. Please follow the installation instructions specified for your operating system.
(Note: The PDTSC 2.0 needs the same extension as the PDT-C 1.0.)
Now, TrEd is able to open t-layer, a-layer and mdata-layer files in directory
In case of trouble with the installation of TrEd or the extension, or with browsing the data, please refer to the Troubleshooting section or contact the authors at (tred
The editor MEd can be used to open lower-layer files in the directory data, namely mdata files interlinked down to the audio layer. To install the editor, please follow the installation instructions on its home page.
Before opening the mdata files, make sure that you have schemata wdata_c_schema.xml from the directory resources in the same directory as the mdata files.
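Copying the schemata next to the mdata files can be done by hand or with a few lines of script. The following sketch is one possible way to do it. The function name copy_schemata and the paths in the usage comment are hypothetical, and the list of schema names has to be supplied by the user (only wdata_c_schema.xml is named above).

```python
import shutil
from pathlib import Path


def copy_schemata(resources: Path, mdata_dir: Path, names: list[str]) -> list[Path]:
    """Copy the named schema files from resources into mdata_dir.

    Files already present in mdata_dir are left untouched.
    Returns the paths of the files that were actually copied.
    """
    copied = []
    for name in names:
        src, dst = resources / name, mdata_dir / name
        if not src.is_file():
            raise FileNotFoundError(f"schema not found: {src}")
        if not dst.exists():
            shutil.copy2(src, dst)
            copied.append(dst)
    return copied


# Hypothetical usage (directory layout is a placeholder):
# copy_schemata(Path("resources"), Path("data"), ["wdata_c_schema.xml"])
```

Existing files are skipped rather than overwritten, so running the script repeatedly does no harm.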
PML Tree Query (PML-TQ; Pajas and Štěpánek 2009) is a powerful client-server system for querying treebanks. There are two clients available:
For further information about the clients, see the PML-TQ clients documentation. For general information about the PML-TQ, please refer to the PML-TQ web page. For documentation of the query language and for tutorials, see the PML-TQ user documentation.
The PDTSC 2.0 textual data, as a part of the PDT-C 1.0, is available for searching using the PML-TQ from ÚFAL's public PML-TQ server (the treebank id for the TrEd client is in parentheses):
pdtc10_pdtsc): the PDTSC part of the PDT-C 1.0 (without the audio-related layers)
In case of trouble with the PML-TQ, or with searching the PDTSC data in particular, please refer to the Troubleshooting section or contact the developers at (pmltq
Pajas, P. and Štěpánek, J.: Recent Advances in a Feature-Rich Framework for Treebank Annotation. In: The 22nd International Conference on Computational Linguistics - Proceedings of the Conference, The Coling 2008 Organizing Committee, Manchester, UK, ISBN 978-1-905593-45-3, pp. 673-680, 2008.
Pajas, P. and Štěpánek, J.: System for Querying Syntactically Annotated Corpora. In: Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Association for Computational Linguistics, Suntec, Singapore, ISBN 1-932432-61-2, pp. 33-36, 2009.
No entries yet.