PDTSE 1.0

Introduction

This CD brings you a multi-purpose corpus of spoken language. It contains 145,469 tokens*, 12,203 sentences* and 864 minutes of spontaneous dialog speech, which have been recorded, manually transcribed and manually edited in three interlinked layers. The layers conform to a specific XML schema suitable for multi-layered linguistic annotation.

Dialogs

The dialogs, which were recorded within the Companions project, revolve around reminiscing about personal photograph collections. The goal of this project was to create virtual companions that would be able to have a natural conversation with humans. One of the prototypes was aimed at senior citizens and was meant to encourage them to reminisce about photographs.

Most dialogs were recorded by the Human-Computer Interaction Team at Napier University in Edinburgh, UK, led by David Benyon. In some dialogs, the Wizard-of-Oz setup was used; i.e., the recorded person ("user") was interacting with an avatar on a computer screen, without knowing that the talking head was operated by another human. A few Wizard-of-Oz-like dialogs were also recorded at the Institute of Formal and Applied Linguistics at Charles University in Prague. However, most of the recordings are ordinary conversations between two humans.

Similar data are available for Czech: PDTSC 1.0 - Prague DaTabase of Spoken Czech.

Audio

The bottom layer is the audio file in the Vorbis format (Ogg).

Literal Manual Transcript

The second layer is a literal manual transcript produced in Transcriber. It also includes non-speech events such as coughing, laughter or hesitation sounds. The XML output from Transcriber has been converted into PML (Prague Markup Language), an XML-based format that conforms to a specific XML schema suitable for multi-layered linguistic annotation. The transcription is divided into acoustic segments, which are delimited by pauses or by pitch (falling or rising) indicating the end of an utterance. The start and end of each segment are synchronized with the audio track by XML references.

Speech reconstruction

The topmost layer, called speech reconstruction, is an edited version of the literal transcript. Disfluencies are removed and sentences are freely edited to meet written-text standards. The editing guidelines are specified in the annotation manual. In its introduction, the annotation manual mentions additional linguistic features: POS tagging, lemmatization and two additional layers containing syntactic and underlying syntactic annotation. Please note that none of them is part of this release.

The text on the speech-reconstruction layer is divided into segments and tokens. The segments correspond to sentences. They are mapped onto the raw-transcript segments. Each token is interlinked with its corresponding item in the raw transcript, provided there is one.
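The token interlinking described above can be pictured with a small sketch. It is only an illustration: the class names, field names and identifiers below are hypothetical and do not reproduce the PML schema used in this release, and the sketch assumes that each reconstructed token links to at most one raw token.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RawToken:
    """A token on the literal-transcript layer (hypothetical structure)."""
    id: str
    text: str


@dataclass
class ReconstructedToken:
    """A token on the speech-reconstruction layer (hypothetical structure)."""
    id: str
    text: str
    raw_ref: Optional[str] = None  # id of the linked raw token, None if there is none


def unlinked_raw_tokens(raw: List[RawToken],
                        reconstructed: List[ReconstructedToken]) -> List[RawToken]:
    """Return raw tokens that no reconstructed token points to (e.g. disfluencies)."""
    linked = {t.raw_ref for t in reconstructed if t.raw_ref is not None}
    return [t for t in raw if t.id not in linked]


# Invented example: raw transcript "well uh this is me" edited to "This is me."
raw = [RawToken("w1", "well"), RawToken("w2", "uh"), RawToken("w3", "this"),
       RawToken("w4", "is"), RawToken("w5", "me")]
recon = [ReconstructedToken("m1", "This", "w3"), ReconstructedToken("m2", "is", "w4"),
         ReconstructedToken("m3", "me", "w5"), ReconstructedToken("m4", ".")]

removed = [t.text for t in unlinked_raw_tokens(raw, recon)]   # dropped during editing
inserted = [t.text for t in recon if t.raw_ref is None]       # added by the annotator
```

In this invented example, the raw tokens without a counterpart are the removed disfluencies, and the reconstructed token without a link is the punctuation added by the annotator.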

There are many ways to produce correct written text from a literal transcript. To capture this fact, we provide multiple parallel annotations for each transcript (two to three different versions made by different annotators).

Viewing and browsing

Each dialog is captured as a triplet of interlinked files for each of the annotation versions. These triplets can be viewed in the annotation editor MEd. A preview with aligned speech-reconstruction versions is available here.

What is this data good for?

Language processing tools, such as taggers and parsers, have been designed for written text. They perform worse on spontaneous speech. One possible way to tackle speech parsing is to apply machine learning to teach the computer to transform ASR output into text that conforms to written standards. This data was produced as a basis for such machine-learning experiments.

* - Each dialog has multiple annotations in which the sentence and token counts differ. The totals of sentences and tokens for the entire corpus were therefore estimated by averaging the counts over the annotations of each dialog and summing these averages over all dialogs.
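The estimate described in the note above can be sketched as follows. This is a minimal illustration under stated assumptions: the dialog identifiers and counts are invented, and the function name is hypothetical; it is not the script that produced the published figures.

```python
from statistics import mean


def estimate_corpus_total(counts_per_dialog):
    """Sum, over all dialogs, the average count across each dialog's annotations."""
    return sum(mean(versions) for versions in counts_per_dialog.values())


# Invented token counts: dialog id -> one count per annotation version.
token_counts = {
    "dialog_a": [1200, 1180, 1220],  # three annotators
    "dialog_b": [950, 970],          # two annotators
}

total = estimate_corpus_total(token_counts)
```

Averaging within each dialog first keeps a dialog with three annotations from weighing more than a dialog with two.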
