VPS-30-En

VPS-30-En (Verb Pattern Sample, 30 English verbs) is a newly developed lexical resource.
It contains descriptions of the following 30 English verbs:

The table has three sections; each row lists, in order:

  Verb Entries: #, verb, patterns, pattern definitions (html, xml)
  Lexicographer-revised Annotation: original (orig-N, the N-concordance reference sample), trial ("-" if none), adjudicated (adj)
  Adjudicated Multiple Annotation: annotators, samples (adjudication tables in html, xls, csv), IAA, confusion matrices (txt)
1 access 8 html xml orig-250 trial1 adj 4 html xls csv 0.600 txt
2 ally 6 html xml orig-250 - adj 4 html xls csv 0.710 txt
3 arrive 6 html xml orig-250 - adj 4 html xls csv 0.806 txt
4 breathe 17 html xml orig-250 trial1 trial2 adj 4 html xls csv 0.793 txt
5 claim 9 html xml orig-500 - adj 4 html xls csv 0.764 txt
6 cool 14 html xml orig-250 trial1 adj 4 html xls csv 0.843 txt
7 crush 19 html xml orig-250 trial1 trial2 adj 4 html xls csv 0.549 txt
8 cry 18 html xml orig-250 - adj 4 html xls csv 0.754 txt
9 deny 10 html xml orig-250 trial1 adj 4 html xls csv 0.651 txt
10 enlarge 4 html xml orig-250 trial1 adj 4 html xls csv 0.536 txt
11 enlist 5 html xml orig-250 trial1 adj 4 html xls csv 0.693 txt
12 forge 12 html xml orig-250 trial1 trial2 adj 4 html xls csv 0.594 txt
13 furnish 7 html xml orig-250 trial1 adj 4 html xls csv 0.773 txt
14 hail 9 html xml orig-250 trial1 adj 4 html xls csv 0.727 txt
15 halt 3 html xml orig-250 - adj 4 html xls csv 0.540 txt
16 part 11 html xml orig-250 trial1 adj 4 html xls csv 0.791 txt
17 plough 17 html xml orig-250 - adj 4 html xls csv 0.820 txt
18 plug 12 html xml orig-250 trial1 adj 4 html xls csv 0.607 txt
19 pour 21 html xml orig-250 trial1 adj 4 html xls csv 0.652 txt
20 say 14 html xml orig-500 - adj 4 html xls csv 0.798 txt
21 smash 10 html xml orig-250 trial1 adj 4 html xls csv 0.657 txt
22 smell 9 html xml orig-250 trial1 adj 4 html xls csv 0.746 txt
23 steer 22 html xml orig-250 trial1 adj 4 html xls csv 0.572 txt
24 submit 5 html xml orig-250 - adj 4 html xls csv 0.764 txt
25 swell 23 html xml orig-250 trial1 adj 4 html xls csv 0.765 txt
26 tell 19 html xml orig-500 - adj 4 html xls csv 0.715 txt
27 throw 72 html xml orig-1000 - adj 4 html xls csv 0.524 txt
28 trouble 13 html xml orig-250 trial1 adj 4 html xls csv 0.693 txt
29 wake 10 html xml orig-250 trial1 adj 4 html xls csv 0.717 txt
30 yield 11 html xml orig-250 trial1 adj 4 html xls csv 0.716 txt

If you would like to get all the data as a single package, please write an e-mail to <smejkalova(at)ufal.mff.cuni.cz>.

Data overview

The table above lists all pattern definitions (lexicon entries) as well as all annotations we have produced, with direct links to the actual files in several formats.

The table is divided into three sections:

The first section, Verb Entries, lists the verbs in alphabetical order, the number of their patterns in the Validation Database, and the pattern definitions. The entries are provided in both html format (preview only) and xml format.

The following two sections contain annotated corpus concordances. The section Lexicographer-revised Annotation comprises the columns original, trial and adjudicated, described below.

The column named adjudicated contains the adjudicated results of the last multiple-annotation round. “Adjudicated” means that the lexicographer considered all values suggested by the annotators and finally selected “the best one”, which is the only one kept in the file. The files stored in this column therefore constitute a single-value annotation based on the feedback from a multiple annotation. Adjudication was only performed for a round where the interannotator agreement came out reasonably good and the manual disagreement analysis did not reveal any obvious need to correct the verb entry.

Whenever the outcome of the multiple annotation was not good enough and the entry needed a revision, the multiple annotation was discarded. The lexicographer revised the entry and updated the reference sample (“original”) to match the revised pattern definitions. The same was done to the sample that had been subject to the multiple-annotation round that triggered the revision. Each annotation round that resulted in an entry revision thus produced one such sample. These samples are called trial; usually there are one or two per verb.

The last section, Adjudicated Multiple Annotation, contains the final annotation round with its multiple values – the one that was declared satisfactory and after which no entry revision followed. The lexicographer checked all the annotations and deleted evident errors. The record of the disagreement analysis is stored in the adjudication tables located in the samples column. Each table is available as html (a preview without concordance ID numbers), xls (the original file with errors marked in red) and csv (without colors; there, the erroneous values are instead excluded from the list of acceptable values). All files except the html preview also contain the BNC-native sentence IDs.

The section Adjudicated Multiple Annotation also gives the number of annotators (the annotators column), the interannotator agreement (the IAA column) and the confusion matrices (the confusion matrices column) for the last annotation round.
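
As an illustration of working with the released adjudication tables, the following minimal sketch loads a csv table and flags the concordances that show disagreement. The column layout (one sentence-ID column followed by one value column per annotator) and the file name are assumptions for illustration, not a documented format.

    import csv

    def read_adjudication_table(path):
        """Read a csv adjudication table and flag rows with disagreements.

        Hypothetical layout: one BNC sentence-ID column followed by one
        pattern-value column per annotator; the real files may differ.
        """
        rows = []
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            for record in reader:
                sentence_id, values = record[0], record[1:]
                rows.append({
                    "id": sentence_id,
                    "values": values,
                    "unanimous": len(set(values)) == 1,
                })
        return rows

    # "access_adj.csv" is a made-up file name for illustration.
    rows = read_adjudication_table("access_adj.csv")
    disputed = [r for r in rows if not r["unanimous"]]
    print(f"{len(disputed)} of {len(rows)} concordances show disagreement")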

Usual sample size

The sample released along with the entry (“original”) usually contains 250 concordances; more frequent or more complex verbs (e.g. say and throw) get a larger sample. The other samples (“trial” and “adjudicated”) contain 50 concordances each. The multiple annotation is only available for the last annotation round, i.e. for 50 concordances.

Pattern Compilation Procedure

  1. There are three annotators and one lexicographer. The lexicographer is in charge of revising the entry and of keeping all annotation samples in line with the current pattern definitions; this includes analyzing the interannotator disagreements after each annotation round and adjudicating the last one. The lexicographer also takes part in the multiple annotations (hence the four annotators listed in the table).
  2. The annotators receive the entry along with the 250-concordance reference sample. They then annotate a new 50-concordance set, drawing on the entry, the reference sample and the annotation manual together.
  3. Interannotator agreement (IAA) is measured, confusion matrices are computed for each annotator pair, and the disagreements are manually analyzed (see the sketch after this list).
  4. When the interannotator confusion suggests that a revision of the entry is desirable, the lexicographer revises the entry, the 250-concordance (“original”) sample and the 50-concordance sample. The annotators receive them along with a new 50-concordance sample for annotation. (In other words, the previous sample was not approved as the final multiple annotation and has become a “trial”.) This procedure can be repeated as long as the agreement stays low and the entry is identified as the problem, but in practice we have needed three rounds at worst.
  5. When the IAA is satisfactory and the entry does not require any further modifications, the lexicographer performs the final disagreement analysis. A record of the analysis is kept (column Adjudicated Multiple Annotation/samples in the table); it contains the multiple values for each concordance. The lexicographer marks evident errors and selects one “best” value.
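
The IAA measurement in step 3 can be made concrete with a short sketch. The resource does not state here which agreement coefficient its published IAA figures use, so the example computes pairwise Cohen's kappa and a pairwise confusion table as one plausible instantiation; the annotator names and pattern labels below are made up.

    from collections import Counter, defaultdict
    from itertools import combinations

    def cohens_kappa(a, b):
        """Cohen's kappa between two annotators' pattern labels."""
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n
        fa, fb = Counter(a), Counter(b)
        expected = sum(fa[lab] * fb.get(lab, 0) for lab in fa) / (n * n)
        # Degenerate case: both annotators used a single identical label.
        return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

    def confusion(a, b):
        """Counts of (label by first annotator, label by second annotator)."""
        m = defaultdict(int)
        for x, y in zip(a, b):
            m[x, y] += 1
        return dict(m)

    # Toy labels standing in for pattern numbers (purely illustrative).
    annotations = {
        "ann1": ["1", "2", "1", "3", "1"],
        "ann2": ["1", "2", "2", "3", "1"],
        "ann3": ["1", "2", "1", "3", "2"],
        "lex":  ["1", "2", "1", "3", "1"],
    }
    for p, q in combinations(annotations, 2):
        k = cohens_kappa(annotations[p], annotations[q])
        print(p, q, round(k, 3), confusion(annotations[p], annotations[q]))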

By this procedure we make sure that the 250-concordance “original” sample, as well as all the subsequent 50-concordance (“trial”) samples, is in line with the last entry revision, and we merge them into an emerging gold-standard sample for machine learning (in the table they are kept separate). Consequently, we get at least 300 consistently annotated concordances for each verb; entries of more complex verbs are based on a larger reference sample. In addition, we gain a multiple-value annotation cleared of evident annotator errors: typos, confusing a transitive pattern with an intransitive one, etc.
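
Since the released samples are kept as separate files, a user who wants the merged gold-standard set described above would concatenate them. A minimal sketch, assuming hypothetical file and column names:

    import csv

    def merge_samples(paths):
        """Concatenate consistently annotated samples into one gold-standard set.

        Assumes csv files with columns sentence_id and pattern (hypothetical
        names); in the release the samples are kept as separate files.
        """
        gold = {}
        for path in paths:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    # The samples should not overlap, but guard against duplicates.
                    gold.setdefault(row["sentence_id"], row["pattern"])
        return gold

    # Made-up file names: the original sample, one trial round, the final one.
    gold = merge_samples(["access_orig-250.csv", "access_trial1.csv", "access_adj.csv"])
    print(len(gold), "consistently annotated concordances")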

Annotation Scheme Description

Infrastructure

The infrastructure is provided by the CPA project of the Natural Language Processing Centre at Masaryk University in Brno, under the supervision of Pavel Rychlý, Adam Rambousek, and Vít Baisa.

2012 © Institute of Formal and Applied Linguistics. All Rights Reserved.