Anna Nědolužko: Extended Textual Coreference and Bridging Anaphora
In this work, we present one possible model for processing extended textual coreference and bridging anaphora in a large corpus of texts, which we then use to annotate these relations in the texts of the Prague Dependency Treebank. Drawing on the literature on the theory of reference and discourse and on other findings of theoretical linguistics on the one hand, and on existing annotation methodologies on the other, we created a detailed classification of textual coreference relations and of the types of bridging relations. Within textual coreference, we distinguish two types of relations: a coreference relation between noun phrases with specific reference, and a coreference relation between noun phrases with non-specific, primarily generic, reference. For bridging anaphora we defined six types of relations: PART between a part and a whole, SUBSET between a set and its subset or element, FUNCT between an entity and a particular object that has a unique function with respect to that entity, CONTRAST expressing a semantic and contextual opposition, ANAF marking anaphoric reference between non-coreferential entities, and REST for other cases of bridging anaphora. One of the research tasks was to create a system of theoretical principles that must be observed when annotating coreference relations and bridging anaphora. This system introduces, for example, the principle of annotation consistency, the principle of maintaining the maximal coreference chain, the principle of cooperation with the syntactic structure of the tectogrammatical layer, the principle of preferring a coreference relation over bridging anaphora, and others. We applied the resulting classification to coreference and anaphoric relations in the Prague Dependency Treebank (PDT). These relations were annotated on half of the PDT corpus (approx. 25 thousand sentences).
A comparison of inter-annotator agreement on establishing relations and determining their types showed that, for the given amount of material, the classification used is reliable mainly for the purposes of theoretical research; for computational applications (machine translation, machine learning, etc.) the material base needs to be extended.
The purpose of this book is to describe the annotation of extended nominal coreference and bridging anaphora in the Prague Dependency Treebank.
The Prague Dependency Treebank (PDT 2.0) is a large collection of linguistically annotated data and documentation. In PDT 2.0, Czech newspaper texts are annotated using a three-layer annotation scenario. The most abstract (tectogrammatical) layer includes, among other mark-ups, the annotation of coreferential links.
In PDT 2.0, two types of coreference are annotated: grammatical and textual. Grammatical coreference typically occurs within a single sentence, since the antecedent can be derived from the grammatical rules of the given language. It covers relative pronouns, verbs of control, reflexive pronouns, reciprocity and verbal complements. Textual coreference has so far been restricted to cases in which the demonstrative "this" or an anaphoric third-person pronoun, including its zero form, is used. This thesis focuses mainly on the next stage of anaphoric annotation, which is currently being carried out on PDT. In this stage, textual coreference is also annotated for non-pronominal and non-zero NPs, and for some cases of adjectives, adverbs and verbs. Together with this textual coreference, bridging relations of several types are being annotated.
In the thesis, I propose to base the processing of coreference and bridging anaphora both on the theoretical background of reference theory and on practical implementations of coreference annotation in large text corpora. The theoretical point of view helped me understand many subtle linguistic details of the mechanisms of reference, anaphora and coreference. Comparison with existing coreference annotation schemes helped me restrict the wide variety of relations to a reasonable number that can be processed reliably.
The units of annotation are pairs of coreferring expressions (or, for bridging anaphora, semantically related expressions); the preceding expression is called the antecedent and the subsequent one the anaphor. An expression can be the antecedent of more than one coreferential and/or bridging expression at the same time. The reverse holds only for bridging relations: one expression may have more than one bridging antecedent but only one coreferential antecedent. Coreference and bridging relations are marked between elements of the following categories: nouns (Prague – the town), anaphoric adverbs (in the town – there), numerals (by 1999 – this year), and verbs if they corefer with NPs (They tried to teach him to read – The attempt was not successful.). Adjectives are annotated only if they are coreferential with a named entity, so we annotate pairs such as German – Germany. Names and other named entities are all subject to annotation. A substring of a named entity, however, is not annotated unless it is a named entity itself. Thus, in the sequence The Charles University of Prague... Prague... the two instances of the NP Prague are marked as coreferential, but in Institute of Nuclear Research... nuclear research the two instances of the NP research are not. Due to the syntactic structure of tectogrammatical trees, roots of coordinating and appositional structures can technically also serve as antecedents.
Most of the thesis describes the annotation scheme of extended nominal coreference and bridging anaphora.
Extended textual coreference is further subclassified into two types: coreference of NPs with specific reference (coref_text, type SPEC) and relations between NPs with generic reference (coref_text, type GEN). This decision rests on the expectation that generic coreferential chains follow different anaphoric rules from specific ones. The generic group also includes a large number of abstract nouns whose coreference is not entirely clear in every particular case, so the generic type of textual coreference also serves as a group for ambiguous cases.
Textual coreference also covers endophoric references to a segment of (preceding) text larger than one sentence or phrase, including cases in which the antecedent is inferred from the broader co-text. Since pronominal anaphors of this kind are already annotated in PDT 2.0, we add links in which the anaphor is expressed by an NP or an adverb.
A specially marked link for exophora indicates that the referent lies outside the co-text and is known only from the situation. As with segments, new nominal and adverbial links are added.
As bridging relations, we annotate only expressions that are non-coreferential and that stand in some conceptual relation to their antecedent. Contribution to text cohesion is considered important, so in ambiguous cases the relations that matter for text cohesion are annotated.
At present, we consider the following relations to be relevant:
- part-whole (having two directions PART_WHOLE and WHOLE_PART),
- set-subset/element of the set (also two-directional SET_SUB and SUB_SET),
- object-function (FUNCT for e.g. class-teacher),
- CONTRAST for coherence relevant discourse opposites (e.g. People don't chew, it's cows who chew),
- ANAF for non-cospecifying anaphoric NPs,
- underspecified group REST for capturing bridging references – potential candidates for a new group of bridging relations (e.g. location – resident, relations between relatives (mother – son, etc.), event – argument (listening – listener) and some other relations).
In some cases, the distinction between the SUB_SET and PART groups is quite problematic, and the only criterion for choosing the type of bridging relation is the countability of the nouns involved. For the time being, the instruction for such ambiguities is to annotate the type PART only in clear cases of non-separable parts.
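The inventory of link types and the antecedent constraint stated earlier (at most one coreferential antecedent per expression, but possibly several bridging antecedents) can be sketched as a small data model. This is only an illustration: the class names `CorefType`, `BridgingType` and `Mention` and the field names are assumptions made for this sketch and do not reflect the actual PDT attribute layout.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class CorefType(Enum):
    SPEC = "SPEC"  # NPs with specific reference
    GEN = "GEN"    # NPs with generic reference (also the ambiguity group)


class BridgingType(Enum):
    PART_WHOLE = "PART_WHOLE"
    WHOLE_PART = "WHOLE_PART"
    SET_SUB = "SET_SUB"
    SUB_SET = "SUB_SET"
    FUNCT = "FUNCT"
    CONTRAST = "CONTRAST"
    ANAF = "ANAF"
    REST = "REST"


@dataclass
class Mention:
    """Hypothetical container for the links that start at one anaphor."""
    node_id: str
    coref: Optional[Tuple[str, CorefType]] = None
    bridging: List[Tuple[str, BridgingType]] = field(default_factory=list)

    def set_coref(self, antecedent_id: str, kind: CorefType) -> None:
        # An expression may have only one coreferential antecedent.
        if self.coref is not None:
            raise ValueError(f"{self.node_id} already has a coreferential antecedent")
        self.coref = (antecedent_id, kind)

    def add_bridging(self, antecedent_id: str, kind: BridgingType) -> None:
        # Several bridging antecedents are allowed for the same expression.
        self.bridging.append((antecedent_id, kind))
```

Keeping the coreference link as a single optional value and the bridging links as a list makes the asymmetry between the two relation kinds explicit in the structure itself.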
In order to develop a maximally consistent annotation scheme, we follow a number of basic principles. Some of them are presented below:
- Chain principle: Coreference relations in text are organized in ordered chains. The most recent mention of an entity is marked as antecedent. This principle is checked automatically. The chain principle does not concern bridging anaphora.
- Principle of the maximum length of coreferential chains. This principle, similar to the chain principle, concerns only the cases of textual coreference. It states that in case of multiple choices, we prefer to continue the existing coreference chain, rather than to begin a new one. To fulfill this principle, grammatical coreferential chains (already annotated in PDT) are being continued by textual ones, and similarly, the already annotated textual coreferential chains are continued by currently annotated non-pronominal links in turn.
- Principle of maximal size of an anaphoric expression. This principle states that the whole subtree of the antecedent/anaphor is always subject to annotation. It is partly governed by the dependency structure of the tectogrammatical trees and may sometimes be counter-intuitive.
- Principle of cooperation with the syntactic structure of the given dependency tree. We do not annotate relations that are already captured by the syntactic structure of the tectogrammatical tree. For example, we do not annotate predication and apposition relations. Likewise, bridging relations are not annotated if the anaphor is a direct child of its antecedent in the tectogrammatical tree and carries one of the predefined labels for valency relations (functors), such as PAT(iens), AUTH(or), APP(urtenance), etc. For example, the relation between strop (ceiling) and místnost (room) in the phrase strop této místnosti (the ceiling of this room) is not annotated, because in the tectogrammatical tree the node místnost has the functor APP and is a direct child of the node strop.
- Principle of the primacy of coreference over anaphora. Coreference, not anaphora, is the subject of textual coreference annotation. Unlike most existing coreference schemes, we try to distinguish strictly between identity relations and anaphoric relations. In many cases an anaphoric relation is also a coreferential relation, but this is not always so. In a Slavic language, which lacks the grammatical category of definiteness, we cannot restrict anaphoric annotation to definite NPs, so we have to annotate all NPs that refer to the same entity. Non-coreferential anaphoric entities are annotated separately as bridging relations.
- Preference of coreference over bridging anaphora. Where more than one analysis is possible, we always prefer textual coreference to a bridging relation.
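Two of the principles above lend themselves to mechanical checks. The following is a minimal sketch, not the checker actually used for PDT: it assumes that mentions are identified by their position in document order, that `links` maps each anaphor to its single coreferential antecedent, and that a tree node is a plain dictionary with hypothetical keys `parent` and `functor`. The functor set contains only the three labels named in the text; the full list is not given there.

```python
from collections import defaultdict

# Functors named in the text; the source list ends in "etc.",
# so this set is deliberately incomplete.
VALENCY_FUNCTORS = {"PAT", "AUTH", "APP"}


def chain_violations(links):
    """Return (anaphor, actual, expected) triples that break the chain principle.

    links: dict mapping anaphor position -> antecedent position.
    The chain principle requires each anaphor to point at the most
    recent preceding mention of the same entity.
    """
    for anaphor, antecedent in links.items():
        if antecedent >= anaphor:
            raise ValueError("antecedent must precede its anaphor in this sketch")

    def root(mention):
        # Follow links back to the first mention of the chain.
        while mention in links:
            mention = links[mention]
        return mention

    chains = defaultdict(set)
    for anaphor, antecedent in links.items():
        chains[root(anaphor)].update((anaphor, antecedent))

    violations = []
    for mentions in chains.values():
        ordered = sorted(mentions)
        for previous, current in zip(ordered, ordered[1:]):
            if links[current] != previous:
                violations.append((current, links[current], previous))
    return sorted(violations)


def bridging_allowed(anaphor_node, antecedent_id):
    """Cooperation principle: skip bridging links already expressed by the tree."""
    is_direct_child = anaphor_node["parent"] == antecedent_id
    return not (is_direct_child and anaphor_node["functor"] in VALENCY_FUNCTORS)
```

For instance, `chain_violations({5: 2, 9: 2, 12: 9})` reports the link from mention 9, which points at mention 2 although mention 5 of the same entity is more recent, and `bridging_allowed({"parent": "strop", "functor": "APP"}, "strop")` rejects the strop/místnost pair from the example above.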
Coreference and bridging annotation is being performed using the TrEd annotation tool, developed at the Institute of Formal and Applied Linguistics at Charles University in Prague. The annotation is carried out on tectogrammatical tree structures assigned to the sentences in text. The present scenario of PDT provides a number of coreferential attributes. Coreference relations are captured by arrows leading from the anaphor to the antecedent and the various types of relations (bridging, textual, grammatical) are distinguished by different colours of the arrows.
The annotation scheme described in the thesis is being applied on a large scale to the PDT corpus by two trained annotators, both students of linguistics. So far, 50% of PDT has been annotated.
To check and improve the annotation guidelines, we regularly measure and report inter-annotator agreement. A detailed study of the texts annotated by both annotators revealed several sources of typical errors. Inter-annotator agreement is also greatly affected by parameters of the text as a whole. The interpretations of short texts are generally far less than of the longer texts of 20 to 120 sentences. The more complex the judgments the annotators have to make, the harder it is to reach agreement. The degree of abstraction also plays a crucial role in the agreement results.
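The summary does not say which agreement measures were used. As an illustration only, the sketch below assumes two common choices: an F1 score over the sets of links drawn by the two annotators, and Cohen's kappa over the relation types assigned to links that both annotators drew. The function names and the input format are hypothetical.

```python
from collections import Counter


def link_f1(links_a, links_b):
    """F1 overlap between two sets of (anaphor, antecedent) pairs."""
    if not links_a and not links_b:
        return 1.0
    common = len(links_a & links_b)
    return 2 * common / (len(links_a) + len(links_b))


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long, non-empty sequences of type labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's label distribution.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

Separating the two measures mirrors the two decisions an annotator makes: whether a link exists at all, and which type it receives.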
The first phase of the coreference annotation process has revealed several problematic cases concerning the annotation of anaphoric relations in Czech. The most problematic aspect of annotating textual coreference concerns abstract nouns. Given that in some cases such NPs are clearly coreferential and anaphoric, we cannot exclude them from the annotation. However, there are many more cases in which the decision to postulate coreference is uncertain and sometimes appears quite redundant. The following questions arise when abstract nouns are annotated: Should we annotate such cases at all? If we annotate them, which type of coreference is it (specific or non-specific)? For the time being, we annotate relations between abstract nouns as generic coreference (coref_text, type GEN), in order to be able to exclude them if needed. Yet there still remains the problem of distinguishing between abstract and concrete nouns, the boundary between them being rather gradual.
There are some other questions left unanswered, such as annotating coreference in prepositional phrases, annotation of complex nouns, etc., which are mainly solved using formal conventions.