PDTSC 1.0 brings you a multi-purpose corpus of spoken language. 768,888 tokens, 73,374 sentences* and 7,324 minutes of spontaneous dialog speech have been recorded, transcribed and edited in several interlinked layers: audio recordings, automatic and manual transcription and manually reconstructed text.**
The corpus consists of two types of dialogs. The first part comes from the Czech portion of the Malach project corpus. The Czech Malach corpus consists of lightly moderated dialogs (testimonies) with Holocaust survivors, originally recorded for the Shoah memory project by the Shoah Visual History Foundation. The dialogs usually start with shorter turns but continue as longer monologues by the survivors, often showing emotion, disfluencies caused by the interviewees recollecting distant memories, etc.
The second portion of the corpus consists of dialogs recorded within the Companions project. The domain is reminiscing about personal photograph collections. The goal of this project was to create virtual companions able to hold a natural conversation with humans. The corpus was recorded entirely in a Wizard-of-Oz setup: the interviewing speaker is an avatar on a computer screen, controlled by a human in a different room. The interviewee is a user of a system designed to discuss their photographs with them. The user does not know that the avatar is controlled by a human and believes they are interacting with a real artificial intelligence.
* - Each dialog has multiple annotations whose sentence counts differ. The sentence count for the entire corpus was estimated as the sum of the average counts per dialog.
** - PDTSC 1.0 is a delayed release of data annotated in 2012. It is an update of the Prague Dependency Treebank of Spoken Language (PDTSL) 0.5, published in 2009. In 2017, the Prague Dependency Treebank of Spoken Czech (PDTSC) 2.0 was published as an update of PDTSC 1.0.
Layers of annotation
Figure 1: Linking the layers
PDTSC 1.0 has three hierarchical layers and one external base layer (audio). Figure 1 shows the relations between the layers as annotated and represented in the data.
The highest layer of the corpus contains the reconstructed text, which can be further subjected to morphological annotation (tagging, lemmatization) and then annotated at syntactic layers (surface and deep syntactic annotation). Please note that none of these annotations is part of this release.
Automatic Speech Recognition
The bottom layer of the corpus (z-layer) contains automatic speech recognition output aligned to the audio. It is a simplified token layer, interlinked with the manual transcription via synchronization points.
The second layer (w-layer) is a literal manual transcript, i.e. everything the speaker has said, including all slips of the tongue, coughing, laughter, etc. The transcription was produced in Transcriber. The XML output from Transcriber has been converted into PML (Prague Markup Language), an XML-based format customized for multi-layered linguistic annotation. By means of XML references, the transcription is interlinked with the tokens at the bottom layer and synchronized with the audio track.
The topmost layer (m-layer), called speech reconstruction, is an edited version of the literal transcript. Disfluencies are removed and sentences are smoothed to meet written-text standards. The editing guidelines are specified in the annotation manual. The text on the speech-reconstruction layer is divided into segments and tokens. The segments correspond to sentences and are mapped onto the raw-transcript segments. The tokens are likewise interlinked with their counterparts in the raw transcript. There are many ways to produce correct written text from a literal transcript. To capture this fact, we provide multiple parallel annotations for each transcript (two or three versions made by different annotators).
The external base layer is the audio file in the Vorbis format (Ogg).
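The cross-layer linking described above can be sketched in a few lines of code. The snippet below uses a much-simplified, hypothetical XML structure (the element and attribute names are illustrative only, not the actual PML schema): each w-layer token carries an XML reference to its z-layer counterpart, and resolving that reference recovers the token's position in the audio.

```python
# Illustrative sketch only: the element and attribute names below are
# hypothetical simplifications, not the actual PML schema used in PDTSC.
import xml.etree.ElementTree as ET

# A toy w-layer fragment: each manual-transcript token points to the
# corresponding ASR token (z-layer) via an XML reference ("zref").
W_LAYER = """
<w_layer>
  <token id="w1" zref="z1">dobry</token>
  <token id="w2" zref="z2">den</token>
</w_layer>
"""

# A toy z-layer fragment: ASR tokens aligned to the audio by time stamps.
Z_LAYER = """
<z_layer>
  <token id="z1" start="0.00" end="0.45">dobry</token>
  <token id="z2" start="0.45" end="0.80">den</token>
</z_layer>
"""

def align(w_xml, z_xml):
    """Resolve w-layer -> z-layer references, yielding (word, start, end)."""
    z_tokens = {t.get("id"): t for t in ET.fromstring(z_xml).iter("token")}
    for w in ET.fromstring(w_xml).iter("token"):
        z = z_tokens[w.get("zref")]
        yield w.text, float(z.get("start")), float(z.get("end"))

for word, start, end in align(W_LAYER, Z_LAYER):
    print(f"{word}: {start:.2f}-{end:.2f}s")
```

The same reference-resolution pattern applies one level up, where m-layer tokens point to w-layer tokens.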
Viewing and browsing
Each annotation version is captured as a quartet of interlinked files. These quartets can be viewed in the annotation editor MEd. We have also transformed all PDTSC 1.0 data into HTML files, which you can navigate easily in our data browser. This is the best choice if you just want a quick look at the data.
What is this data good for?
Language processing tools, such as taggers and parsers, have been designed for written text, and they perform worse on spontaneous speech. One possible way to tackle speech parsing is to apply machine learning to teach the computer to transform ASR output into text conforming to written-language standards. This data was produced as a basis for such machine-learning experiments.
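To illustrate the kind of transformation such experiments aim to learn, here is a deliberately naive, rule-based sketch (not the actual method, and the filler list is a hypothetical English example): it removes filler words and collapses immediate word repetitions, two of the simplest edits that speech reconstruction involves.

```python
# Toy illustration of the transcript-to-written-text transformation this
# corpus supports. A real system would learn the mapping from data; this
# rule-based sketch only shows the kind of edits involved.
FILLERS = {"uh", "um", "er"}  # hypothetical example filler set

def normalize(tokens):
    """Drop filler words and collapse immediate word repetitions."""
    out = []
    for tok in tokens:
        if tok.lower() in FILLERS:
            continue  # skip fillers entirely
        if out and tok.lower() == out[-1].lower():
            continue  # collapse repetitions: "the the" -> "the"
        out.append(tok)
    return out

print(normalize("well uh I I went to to the the store".split()))
# -> ['well', 'I', 'went', 'to', 'the', 'store']
```

Real disfluencies (restarts, repairs, reorderings) are far harder than this, which is why the corpus provides manually reconstructed text as training material rather than relying on rules.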