The Prague Dependency Treebank of Spoken Czech 2.0 is a corpus of spoken language, consisting of 742,257 tokens and 73,835 sentences, representing 6,174 minutes (over 100 hours) of spontaneous dialogs. The dialogs have been recorded, transcribed and edited in several interlinked layers: audio recordings, automatic and manual transcripts and manually reconstructed text. These layers along with morphological annotation were part of the first version of the corpus (PDTSC 1.0). Version 2.0 is extended by annotation at the dependency syntax layer and the “deep” syntax layer, which contains semantic roles and relations as well as annotation of coreference. PDTSC 2.0 is freely and publicly available (licence).
With the release of PDTSC 2.0, we have to a large extent closed the gap between the full annotation of the Prague Dependency Treebank (PDT), a corpus based on written text, and the Prague spoken dialog corpus (PDTSC 1.0). We are not aware of any other spoken language corpus that has both the “disfluencies” marked and a full annotation of syntax and semantics. In addition, we have kept the unique “reconstruction” layer of annotation, which allows different views of the original data and the mapping of annotation onto it.
As with similar projects, this release is a step towards bigger corpora with more manual annotation. PDTSC 2.0 will also be extended in the future, most notably by manual annotation on the m- and a-layers, and will become part of a consolidated collection of Prague Dependency Treebanks, which will contain four different treebanks of Czech, uniformly annotated using the scheme described in part here, with data coming from text, speech and internet sources.
The corpus consists of two types of dialogs. First, we used the Czech portion of the Malach project corpus. The Malach corpus consists of lightly moderated dialogs (testimonies) with Holocaust survivors, originally recorded for the Shoah memory project by the Shoah Visual History Foundation. The dialogs usually start with shorter turns but continue as longer monologues by the survivors, often showing emotion, disfluencies caused by the interviewee recollecting distant memories, etc.
The second portion of the corpus consists of dialogs that were recorded within the Companions project. The domain is reminiscing about personal photograph collections. The goal of this project was to create virtual companions that would be able to have a natural conversation with humans. The corpus was recorded entirely in a Wizard-of-Oz setup: the interviewing speaker is an avatar on the computer screen, which is controlled by a human in a different room. The interviewee is a user of a system that is designed to discuss the user's photographs with them. The user does not know that the avatar is controlled by a human and believes they are interacting with a genuine artificial intelligence. Domain-identical dialogs were also created in English (PDTSE 1.0), allowing comparison with the Czech data, even though the English data have not yet been upgraded to version 2.0.
Layers of annotation
PDTSC 2.0 is a treebank from the family of PDT-style corpora developed in Prague. PDTSC differs from other PDT-style corpora mainly in the “spoken” part of the corpus. The layer stack starts at the external base layer with the audio files (in the Vorbis format). The bottom layer of the corpus (z-layer) contains automatic speech recognition output synchronized to the audio. The next layer, the w-layer, contains a manual transcript of the audio, i.e. everything the speaker has said, including all slips of the tongue as well as non-speech events such as coughing, laughter, etc. The w-layer is synchronized to the automatic transcript and, through it, to the original audio.
The subsequent m-layer contains a manually “reconstructed” version of the transcript, i.e. one that has been edited and grammatically corrected, including punctuation and assumed sentence boundaries. The reconstructed tokens are automatically morphologically tagged and lemmatized.
From this point on, annotation on the upper layers is the same as in the other PDT-style corpora. The dependency syntax layer (a-layer) is parsed automatically, while the “deep” syntax layer (t-layer) is annotated manually. There is a one-to-one correspondence between the tokens at the m-layer and the nodes at the a-layer. The syntactic dependencies are labeled with dependency relations (e.g., Subject or Adverbial). The t-layer, which is also a tree-shaped graph (with content words only), is the highest and most complex linguistic representation; it combines syntax and semantics in the form of semantic labeling, coreference annotation and argument structure description based on a valency lexicon.
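The relationship between the m-layer and the a-layer described above can be illustrated with a minimal Python sketch. The class names, field names, relation labels and the three-word example sentence are all hypothetical and chosen only for illustration; they are not the corpus file format or its actual tagset. The sketch checks the two properties stated in the text: every m-layer token is covered by exactly one a-layer node, and the a-layer nodes form a tree.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MToken:
    """A reconstructed m-layer token (hypothetical structure)."""
    mid: str
    form: str
    lemma: str


@dataclass
class ANode:
    """An a-layer node (hypothetical structure)."""
    aid: str
    m_ref: str               # id of the single m-layer token this node stands for
    parent: Optional[str]    # id of the governing node, None for the root
    relation: str            # dependency relation label (illustrative names)


def check_one_to_one(m_tokens: List[MToken], a_nodes: List[ANode]) -> bool:
    """True if each m-layer token is referenced by exactly one a-layer node."""
    return sorted(n.m_ref for n in a_nodes) == sorted(t.mid for t in m_tokens)


def is_tree(a_nodes: List[ANode]) -> bool:
    """True if there is a single root and every node reaches it without a cycle."""
    by_id = {n.aid: n for n in a_nodes}
    if sum(1 for n in a_nodes if n.parent is None) != 1:
        return False
    for node in a_nodes:
        seen = set()
        cur = node
        while cur.parent is not None:
            if cur.aid in seen or cur.parent not in by_id:
                return False
            seen.add(cur.aid)
            cur = by_id[cur.parent]
    return True


# Invented toy sentence "Já jsem byl" ("I was"), not taken from the corpus.
m_tokens = [
    MToken("m1", "Já", "já"),
    MToken("m2", "jsem", "být"),
    MToken("m3", "byl", "být"),
]
a_nodes = [
    ANode("a1", "m1", "a3", "Subject"),
    ANode("a2", "m2", "a3", "Auxiliary"),
    ANode("a3", "m3", None, "Predicate"),
]

print(check_one_to_one(m_tokens, a_nodes), is_tree(a_nodes))
```

Both checks succeed for the toy sentence. Parent pointers are used here simply because they make the tree property easy to verify.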
In order not to lose any piece of the original information, tokens (nodes) on a lower layer are explicitly referenced from the corresponding closest (immediately higher) layer. This allows the morphological, syntactic and semantic annotation to be deterministically and fully mapped back to the transcript and audio. It brings new possibilities for modeling morphology, syntax and semantics in spoken language, working either on the original transcript with mapped annotation or on the new layer after (automatic) editing.
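To make the mapping back to the audio concrete, here is a minimal Python sketch of following the references from a node on any layer down to the z-layer and collecting the audio time span it covers. The `Unit` class, the `audio_span` function, the identifiers and the timings are assumptions made for this illustration; the example utterance (a repeated "já" that reconstruction merges into one token) is invented and does not come from the corpus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


@dataclass
class Unit:
    """A token or node on one layer (hypothetical structure, not the corpus format)."""
    uid: str
    layer: str                                      # "z", "w", "m", "a" or "t"
    form: str
    refs: List[str] = field(default_factory=list)   # ids on the immediately lower layer
    span: Optional[Span] = None                     # audio timing, z-layer only


def audio_span(uid: str, index: Dict[str, Unit]) -> Optional[Span]:
    """Follow references layer by layer and return the covered audio span."""
    unit = index[uid]
    if unit.layer == "z":
        return unit.span
    spans = [s for s in (audio_span(r, index) for r in unit.refs) if s is not None]
    if not spans:
        return None  # e.g. punctuation added during reconstruction
    return (min(s[0] for s in spans), max(s[1] for s in spans))


# Invented utterance: the speaker says "já já jsem byl"; the m-layer keeps one "Já".
units = [
    Unit("z1", "z", "já", span=(0.0, 0.3)),
    Unit("z2", "z", "já", span=(0.4, 0.7)),
    Unit("z3", "z", "jsem", span=(0.7, 0.9)),
    Unit("z4", "z", "byl", span=(0.9, 1.3)),
    Unit("w1", "w", "já", ["z1"]),
    Unit("w2", "w", "já", ["z2"]),
    Unit("w3", "w", "jsem", ["z3"]),
    Unit("w4", "w", "byl", ["z4"]),
    Unit("m1", "m", "Já", ["w1", "w2"]),   # the repetition is merged
    Unit("m2", "m", "jsem", ["w3"]),
    Unit("m3", "m", "byl", ["w4"]),
    Unit("m4", "m", ".", []),              # no spoken counterpart
    Unit("a1", "a", "Já", ["m1"]),
    Unit("a2", "a", "jsem", ["m2"]),
    Unit("a3", "a", "byl", ["m3"]),
    Unit("t1", "t", "být", ["a2", "a3"]),  # content word with its auxiliary
    Unit("t2", "t", "já", ["a1"]),
]
index = {u.uid: u for u in units}

print(audio_span("t2", index))  # (0.0, 0.7)
print(audio_span("t1", index))  # (0.7, 1.3)
print(audio_span("m4", index))  # None
```

Because each unit points only to the layer immediately below it, the lookup is a plain recursive walk, and a disfluency such as the repeated word stays recoverable from the highest layer.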