VPS-GradeUp

Version 1.0, November 19, 2015

Vít Baisa, Silvie Cinková, Ema Krejčová, Anna Vernerová

Introduction

VPS-GradeUp is a collection of triple manual annotations of 29 English verbs based on the Pattern Dictionary of English Verbs (PDEV) [1] and comprising the following lemmas: abolish, act, adjust, advance, answer, approve, bid, cancel, conceive, cultivate, cure, distinguish, embrace, execute, hire, last, manage, murder, need, pack, plan, point, praise, prescribe, sail, seal, see, talk, urge. It contains results from two different tasks:

  1. Graded decisions
  2. Best-fit pattern (WSD).

In both tasks, the annotators matched verb senses defined by the PDEV patterns with 50 actual uses of each verb (using concordances from the BNC [2]). The verbs were randomly selected from a list of completed PDEV lemmas with at least 3 patterns and at least 100 BNC concordances not previously annotated by PDEV’s own annotators. The selection also excluded verbs contained in VPS-30-En [3], a data set we developed earlier. This data set was built within the project Reviving Zellig S. Harris: more linguistic information for distributional lexical analysis of English and Czech and in connection with SEMEVAL 2015.

The annotators were all trained linguists familiar with PDEV, but they were not native speakers of English.

Data format

VPS-GradeUp comes as a single .csv file encoded in UTF-8, with semicolons as separators and each cell enclosed in double quotes. It contains 22,800 rows and 43 columns with a header. The data, documentation and PDEV entry snapshots from October 2015 (as used for the annotation) can be downloaded via the LINDAT-CLARIN repository.
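A minimal sketch of reading a file in this layout with Python's standard csv module. The sample rows are made up and show only a handful of the columns, and the helper name read_rows is hypothetical.

```python
import csv
import io

# Two made-up rows in the layout described above: semicolon-separated,
# every cell double-quoted. Only a few of the real columns are shown.
SAMPLE = (
    '"JointID";"PatternID";"Lemma";"SentID";"LikAV"\n'
    '"abolish:Sent_1.1:Pattern_1";"1";"abolish";"1.1";"7"\n'
    '"abolish:Sent_1.1:Pattern_2";"2";"abolish";"1.1";"2"\n'
)

def read_rows(handle):
    """Return the rows of a VPS-GradeUp-style CSV as a list of dicts."""
    reader = csv.DictReader(handle, delimiter=";", quotechar='"')
    return list(reader)

rows = read_rows(io.StringIO(SAMPLE))
print(rows[0]["JointID"])   # abolish:Sent_1.1:Pattern_1
```

For the downloaded file, the same function can be given a handle opened with encoding="utf-8" and newline="". The KWIC column contains escaped quote characters, so the reader options may need adjusting for it; this sketch does not attempt that.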

Rows

Each row primarily represents one observation in the Graded-Decision experiment, i.e. one score on a 7-point Likert scale indicating how well a given PDEV pattern (e.g. [Pattern] 3) of a given verb lemma (e.g. abolish) illustrates a given KWIC identified by an index (e.g. 2.1).

Graded decisions are filled in all rows containing .1 at the end of the KWIC index (Column SentID). Most rows with KWICs indexed with .2 do not contain any graded decisions (NA filled in). These rows contain the unused alternative readings. 
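Since most but not all .2 rows lack graded decisions, it may be safer to select graded rows by checking the Likert cells than by the index suffix. The sketch below assumes that missing values appear as the literal string NA and that the rows have already been read into dictionaries; the sample rows are made up.

```python
ANNOTATORS = ("AV", "EK", "SC")

def has_graded_decisions(row):
    """True if at least one annotator's Likert cell is filled in (not NA)."""
    return any(row["Lik" + a] != "NA" for a in ANNOTATORS)

# Made-up rows: a .1 reading with scores and an unused .2 alternative reading.
rows = [
    {"SentID": "2.1", "LikAV": "7", "LikEK": "5", "LikSC": "7"},
    {"SentID": "2.2", "LikAV": "NA", "LikEK": "NA", "LikSC": "NA"},
]

graded = [r for r in rows if has_graded_decisions(r)]
print([r["SentID"] for r in graded])   # ['2.1']
```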

Each row also contains all WSD (best-fit) annotation related to the given KWIC; i.e. the WSD information repeats for each KWIC as many times as the given verb has PDEV patterns. To explore the WSD results independently of the graded decisions, remember to eliminate duplicate rows.
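One possible way to remove the repeats is to keep a single record per combination of Lemma and SentID. This is only a sketch on made-up rows; the helper name wsd_rows is hypothetical, and further WSD-related columns (e.g. the exploitation markup) could be added to the tuple in the same way.

```python
WSD_COLUMNS = ("WSDNumAV", "WSDNumEK", "WSDNumSC")

def wsd_rows(rows):
    """Keep one record per (Lemma, SentID), dropping the per-pattern repeats."""
    unique = {}
    for row in rows:
        key = (row["Lemma"], row["SentID"])
        if key not in unique:
            unique[key] = {"Lemma": row["Lemma"], "SentID": row["SentID"],
                           **{c: row[c] for c in WSD_COLUMNS}}
    return list(unique.values())

# Made-up rows: one KWIC of a three-pattern verb, so its WSD cells
# appear three times, once per pattern.
rows = [
    {"Lemma": "abolish", "SentID": "1.1", "PatternID": str(p),
     "WSDNumAV": "2", "WSDNumEK": "3", "WSDNumSC": "3"}
    for p in (1, 2, 3)
]
print(len(rows), len(wsd_rows(rows)))   # 3 1
```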

Columns

| Column name | Description | Example |
| --- | --- | --- |
| JointID | Unique ID for each row containing lemma, KWIC ID, and pattern number | abolish:Sent_1.1:Pattern_1 |
| PatternID | NB: only unique in combination with the Lemma column; when working with all lemmas, use JointID! | 1 |
| Lemma | | abolish |
| SentID | NB: only unique in combination with the Lemma column; when working with all lemmas, use JointID! | 1.1 |
| LikAV, LikEK, LikSC | Score on the 7-point Likert scale saying how well the given PDEV pattern illustrates the given KWIC according to the annotator identified by their initials. 1 = Irrelevant, 7 = Perfect match. | 7, 5, 7 |
| WSDNumAV, WSDNumEK, WSDNumSC | For each annotator separately: ID of the best-fitting pattern in a classical WSD setup, where the annotator is forced to select only one pattern, or to claim that the given KWIC is not a verb (value not verb) or that no pattern is really suitable (unclassified). | 2, 3, 3 |
| UnderstandAV, UnderstandEK, UnderstandSC | For each annotator, the options are 1 and 0 (1 = the annotator is confident that they understand the KWIC well, 0 = comprehension problems). | 1, 1, 1 |
| KWIC | The annotated BNC KWIC – the largest span allowed by BNC. The key word is capitalized and surrounded by three spaces on both sides. Apostrophes and double quotes are escaped. Horizontal ellipsis is rendered by the corresponding HTML entity &hellip; (as copied from the BNC). | Anna Tomforde and Michael Farr PRESIDENT Franois Mitterrand , the first head of state of the wartime Allies to visit East Germany , said yesterday that the existence of two sovereign German states could not be `   ABOLISHED   at a stroke \' . Reflecting French anxiety over German reunification , Mr Mitterrand said the two Germanys were jointly responsible for stability in Europe . ` German unity depends first of all on the German people … |
| BNCdocID | The document code the KWIC was associated with in the BNC | AAK/1 |
| Number of Patterns | How many patterns (senses) the given verb lemma has in PDEV (Pattern Dictionary of English Verbs) | 3 |
| CommentsAV, CommentsEK, CommentsSC | Annotators’ comments. Most of them are in English, but some are in Czech. | NA |
| WSDExploitAV_ coercion_agent, WSDExploitEK_ coercion_agent, WSDExploitSC_ coercion_agent | Exploitation markup. Binary values. 1 = the agent of the keyword was coerced into a different PDEV Semantic Type, although it actually corresponds to the Semantic Type listed in the pattern definition. 0 = no markup. | 0 (1 would occur e.g. if the pattern definition contained the Semantic Type Liquid for agent and the KWIC said: The second cup poured on the floor. Although, strictly speaking, cup corresponds to Container, Liquid is evidently meant at the same time.) |
| WSDExploitAV_ coercion_object, WSDExploitEK_ coercion_object, WSDExploitSC_ coercion_object | Cf. coercion agent above. Applies to direct object. Binary (0,1). | |
| WSDExploitAV_ coercion_other, WSDExploitEK_ coercion_other, WSDExploitSC_ coercion_other | Cf. coercion agent above. Typically applies to indirect object and adverbials, but it can apply to any clause element except agent and object. Binary (0,1). | |
| WSDExploitAV_ meaning_shift, WSDExploitEK_ meaning_shift, WSDExploitSC_ meaning_shift | Exploitation markup indicating any type of meaning shift between the implicature of the selected pattern (sense) and the KWIC; e.g., metaphor or any rhetorical figure. Binary (0,1). | |
| WSDExploitAV_ unexpected_agent, WSDExploitEK_ unexpected_agent, WSDExploitSC_ unexpected_agent | Exploitation markup indicating that the agent of the given KWIC does not conform to the Semantic Type prescribed by PDEV. A more general markup than coercion. Binary (0,1). | |
| WSDExploitAV_ unexpected_object, WSDExploitEK_ unexpected_object, WSDExploitSC_ unexpected_object | Cf. unexpected agent and coercion object above. Applies to direct object. Binary (0,1). | |
| WSDExploitAV_ unexpected_other, WSDExploitEK_ unexpected_other, WSDExploitSC_ unexpected_other | Cf. unexpected agent and coercion other above. Applies to indirect object and all other clause elements except agent and direct object. Binary (0,1). | |
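Because PatternID and SentID are only unique within a lemma, it can be handy to recover all three parts from JointID. The sketch below assumes that every JointID follows the same three-part layout as the single example given for that column (abolish:Sent_1.1:Pattern_1); the function name is hypothetical.

```python
def split_joint_id(joint_id):
    """Split 'lemma:Sent_X:Pattern_N' into (lemma, sentence ID, pattern ID)."""
    lemma, sent, pattern = joint_id.split(":")
    # Drop the 'Sent_' and 'Pattern_' prefixes, keeping what follows the underscore.
    return lemma, sent.split("_", 1)[1], pattern.split("_", 1)[1]

print(split_joint_id("abolish:Sent_1.1:Pattern_1"))   # ('abolish', '1.1', '1')
```

The resulting (lemma, pattern ID) and (lemma, sentence ID) pairs can then serve as keys when grouping rows across all lemmas.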

 

 

References

[1] P. Hanks and J. Pustejovsky, “A Pattern Dictionary for Natural Language Processing,” Rev. Française Linguist. Appliquée, vol. 10, no. 2, 2005.

[2] “British National Corpus, version 3 (BNC XML edition).” British National Corpus Consortium, 2007.

[3] S. Cinková, M. Holub, A. Rambousek, and L. Smejkalová, “A database of semantic clusters of verb usages,” in Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, 2012, pp. 3176–3183.

 

How to cite

If you make use of this data set in 2015, please cite this web site. Several papers have been submitted, but we have not received any notification yet. Please return to this web site to obtain a more appropriate reference by March 2016.