Methods for rapid discourse annotation in selected corpora

Grant:

Metody pro rychlou diskurzní anotaci ve vybraných korpusech (Methods for rapid discourse annotation in selected corpora)

Tags:

Annotations, Corpora, Data, Discourse, Parsers

RapiDisc

During the three years of the RapiDisc project (2022-2024), various methods were developed and applied to create, in a cost-effective way, a high-quality annotation of explicit discourse relations in the four subcorpora of the Prague Dependency Treebank - Consolidated (PDT-C). All available resources were incorporated, starting with the existing manual annotation on the various layers of the underlying data, from morphology up to the deep syntax layer.

Beyond that, each of the four PDT-C parts was processed with its own procedure and resources:

PDT (original Czech newspaper texts) - the previously existing manual discourse annotation was manually revised and updated (total size of the data: 49 thousand sentences, 21 thousand discourse relations),
PCEDT-cz (translated journalistic texts) - annotation projected from English was combined with automatic discourse parsing; the two outputs were merged partly automatically, and the most problematic phenomena were then manually checked and fixed (49 thousand sentences, 29 thousand discourse relations),
PDTSC (original Czech spoken texts) - the data were discourse parsed automatically, and the most problematic phenomena were then manually checked and fixed (74 thousand sentences, 31 thousand discourse relations),
Faust (original Czech user-generated data) - automatic discourse parsing, all data manually checked and fixed (3 thousand sentences, 710 discourse relations).
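
As a quick consistency check, the per-subcorpus figures listed above can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary and function names are hypothetical, and the input numbers are the rounded values given in the list, not exact counts from the data.

```python
# Rounded per-subcorpus sizes as given in the list above:
# name -> (sentences, discourse relations).
SUBCORPORA = {
    "PDT": (49_000, 21_000),
    "PCEDT-cz": (49_000, 29_000),
    "PDTSC": (74_000, 31_000),
    "Faust": (3_000, 710),
}


def totals(subcorpora):
    """Return (total sentences, total relations) over all subcorpora."""
    sentences = sum(s for s, _ in subcorpora.values())
    relations = sum(r for _, r in subcorpora.values())
    return sentences, relations


def relations_per_100_sentences(sentences, relations):
    """Relation density, useful for comparing subcorpora of different size."""
    return 100 * relations / sentences


if __name__ == "__main__":
    for name, (s, r) in SUBCORPORA.items():
        density = relations_per_100_sentences(s, r)
        print(f"{name}: {density:.1f} relations per 100 sentences")
    total_sentences, total_relations = totals(SUBCORPORA)
    print(f"Total: {total_sentences} sentences, {total_relations} relations")
```

With these rounded inputs the totals come to 175,000 sentences and 81,710 relations, which agrees with the overall size of 175 thousand sentences and 82 thousand relations reported for the whole resource.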

Measurements of the inter-annotator agreement between the result of the described methods and completely manually annotated samples (in PCEDT-cz and PDTSC) indicate a high quality of the resulting discourse annotation, exceeding the inter-annotator agreement between human annotators reported previously on the first version of the Prague Discourse Treebank.
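
To illustrate what such an agreement measurement can look like, the sketch below computes simple agreement and Cohen's kappa between two label sequences. It is an assumption for illustration only: the text above does not say which agreement measures the project used, the function name is hypothetical, and the toy labels are invented rather than taken from the data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotations must be non-empty and equally long.")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotations agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotation's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotations use a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical toy labels: output of a semi-automatic method vs. a fully
    # manual annotation of the same four relations.
    method = ["reason", "reason", "opposition", "condition"]
    manual = ["reason", "reason", "opposition", "opposition"]
    print(cohen_kappa(method, manual))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few relation types dominate the data. In this sketch the comparison is restricted to items that both annotations label; deciding how to align relations found by only one side is a separate design choice not covered here.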

The complete annotated data, totalling 175 thousand sentences with 82 thousand annotated discourse relations, form a unique large-scale, high-quality discourse-annotated resource. The data were published under a Creative Commons licence in the LINDAT/CLARIAH-CZ repository in two annotation frameworks: the native Prague style (discourse annotation on top of the deep-syntax layer, http://hdl.handle.net/11234/1-5813) and the Penn Discourse Treebank 3.0 style (both in format and sense taxonomy, http://hdl.handle.net/11234/1-5680), which makes the data accessible to the international research community. Detailed web pages dedicated to the publication were created, with a data description, documentation, a comparison with previous versions, a list of related publications, etc.

The scripts developed and used for the annotation are available in an SVN repository; to check out your own copy, use the following command (when prompted, enter the password "nondeprel"):

svn co --username "nondeprel" https://svn.ms.mff.cuni.cz/svn/nondeprel/trunk/common/data/PDT-C-PCEDT-cz .

Please follow the individual web pages describing the progress in the individual project years:

Publications

The article "Prague to Penn Discourse Transformation" argues for publishing data resources in internationally established standard data formats and theoretical frameworks; it demonstrates this approach on the Prague Discourse Treebank and evaluates what can be done automatically and how much human intervention is required to transform data from one annotation framework to another.

Manual discourse annotation of large data is highly demanding in time and other resources. The paper "Cost-Effective Discourse Annotation in the Prague Czech–English Dependency Treebank" emphasizes the need to use all relevant available resources to lower the cost and presents a method for cost-effective, high-quality discourse annotation in a specific situation (when a certain set of language resources and tools is at hand).

Manually annotated treebanks serve many purposes, both practical (training/test data) and theoretical (language research), and are usually regarded as carriers of ground truth. Using the evolution of the Prague Discourse Treebank as an example, the paper "Announcing the Prague Discourse Treebank 3.0" demonstrates the importance of questioning the quality of manual annotations, checking their consistency, and (repeatedly) revising manually annotated treebanks to gradually improve their quality.
