The PDTSL - Prague Dependency Treebank of Spoken Language

The project of speech reconstruction of Czech and English was started at UFAL together with the PIRE project in 2005, and has gradually grown from ideas to a (first) annotation specification, annotation software, and actual annotation. It is part of the Prague Dependency Treebank family of annotated corpus resources and tools, to which it adds the spoken language layer(s).

The project is supported by various grants, most importantly by the Center for Computational Linguistics (project MSMT CR LC536, for graduate student salaries), project MSMT CR MSM0021620838 (for senior researchers and supervision of graduate students), project MSMT CR ME838 (for travel), project EU Companions IST-FP6-034434 (mainly for the English annotation effort), and project GA405/06/0589 of the Grant Agency of the CR (for programming work, annotation, and travel support). Cooperation within the PIRE project (NSF PIRE grant No. 0530118) is essential even though it does not fund any salaries directly. Several graduate students also have supplemental grants from the Grant Agency of Charles University, such as GAUK 52408.

The Data

The data (in both languages) are in the PML format (PML schemas are included) and consist of three annotation layers:

  • zdata - ASR output aligned to audio (audio files are not included in the snapshots; please write to pdtsl@ufal.mff.cuni.cz for more details about licensing and availability of the audio data)
  • wdata - manual transcription
  • mdata - reconstructed text with word alignment to wdata tokens
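
To illustrate how the mdata layer points back to the wdata layer, here is a minimal Python sketch. It is not the PML schema: the class names, field names, identifiers, and the sample utterance are all hypothetical and were invented for this illustration. It only shows the idea that each reconstructed token carries references to the transcription tokens it was derived from, so that one can recover both the alignment and the transcribed words that reconstruction dropped.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class WToken:
    """A token of the manual transcription (hypothetical model of wdata)."""
    id: str
    form: str


@dataclass
class MToken:
    """A token of the reconstructed text (hypothetical model of mdata).

    w_refs lists the ids of the wdata tokens this token is aligned to;
    it is empty for tokens added during reconstruction, e.g. punctuation.
    """
    id: str
    form: str
    w_refs: List[str] = field(default_factory=list)


def aligned_source(m_tokens: List[MToken],
                   w_tokens: List[WToken]) -> List[Tuple[str, List[str]]]:
    """Pair each reconstructed form with the transcribed forms it aligns to."""
    by_id = {w.id: w for w in w_tokens}
    return [(m.form, [by_id[ref].form for ref in m.w_refs]) for m in m_tokens]


def dropped_tokens(m_tokens: List[MToken],
                   w_tokens: List[WToken]) -> List[str]:
    """Return transcribed forms that no reconstructed token refers to."""
    used = {ref for m in m_tokens for ref in m.w_refs}
    return [w.form for w in w_tokens if w.id not in used]


# Invented sample: a transcription with a filler and a repetition ...
w_layer = [
    WToken("w1", "well"),
    WToken("w2", "i"),
    WToken("w3", "i"),
    WToken("w4", "went"),
    WToken("w5", "home"),
]
# ... and its reconstructed counterpart with added punctuation.
m_layer = [
    MToken("m1", "I", ["w3"]),
    MToken("m2", "went", ["w4"]),
    MToken("m3", "home", ["w5"]),
    MToken("m4", ".", []),
]

alignment = aligned_source(m_layer, w_layer)
dropped = dropped_tokens(m_layer, w_layer)
```

In the actual data, the alignment is stored in the PML files themselves; the included PML schemas define the real element and attribute names.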

You can download the data annotated so far (until 2008) via the download link in the left menu (after you register and electronically sign the accompanying license). The MEd tool (see below) can be downloaded freely to view and search the annotated data.


The speech reconstruction editor MEd

The annotation tool used for the manual annotation at the speech reconstruction level of the above corpora can be downloaded from its author's page at http://ufal.mff.cuni.cz/~pajas/med.


The People

The project as a whole is coordinated by Jan Hajic. Silvie Cinkova coordinates the English specification and annotation effort; Marie Mikulova does the same for Czech. Petr Podvesky (a recent UFAL graduate) wrote the first version of the annotation tool, which was used for the first experimental annotation in 2005 and 2006 by Martina Otradovcova and by Erin Fitzgerald for her experiments within the PIRE project. The tool was then completely rewritten by Petr Pajas and renamed "MEd". Other UFAL people contribute either as annotators or as technical and programming support (for a full list see below). Many more people will start working on the project once the manual annotation of the syntactic and semantic layers starts.

We are also grateful to many other people who have contributed data, programs, and/or their time to the project, such as the former Malach project team; the people from the University of West Bohemia in Pilsen (part of the Center for Computational Linguistics, but also others), who contributed part of the Czech Companions recordings; Nino Peterek of UFAL, who is preparing the other part of the Czech recordings; the Napier Companions team, who delivered the English dialog recordings; and many others.

Miroslav Spousta of UFAL is now working on the first experiments with automatic speech reconstruction based on the data annotated as described above. Watch this page for the first results...




Contents of the project web pages: Jan Hajic.
Authors and contributors: Jan Hajic, Silvie Cinkova, Marie Mikulova, Petr Pajas, Petr Podvesky, Martina Otradovcova, Jan Ptacek, Josef Toman, Zdenka Uresova and all the annotators: Anna Hlavacova, Heather McGadie, Petra Mickova, Christine Warkentin, Helena Glucksmannova, Ludmila Kaplanova, Michaela Lunackova, Jana Grollova, Anna Kapsova, Petra Schnaubertova, Hana Stepankova and Jan Ures.
This work was funded in part by the Companions project (www.companions-project.org), sponsored by the European Commission as part of the Information Society Technologies (IST) programme under EC grant number IST-FP6-034434; by MSM0021620838, ME838, and LC536 of the Ministry of Education, Youth and Sports of the Czech Republic; and by GA405/06/0589 of the Grant Agency of the Czech Republic. The data themselves are the sole result of the project GACR GA405/06/0589. The ME838 project is attached to project No. 0530118 of the PIRE program of the NSF/OISE.
2008 © Institute of Formal and Applied Linguistics. All Rights Reserved.