This project draws on Corpus Pattern Analysis, developed by Patrick W. Hanks, and on the Pattern Dictionary of English Verbs (PDEV). We find the idea of semantically analyzing words (here, verbs) in context very appealing and seek to explore its application in NLP tasks. We have been investigating whether PDEV can serve as gold-standard data for statistical machine learning.

Since PDEV has been created mainly by a single person, the first question to ask was: can humans agree when recognizing semantic patterns of verb usage? As a first step, we had a sample of PDEV data processed by human annotators working in parallel. We measured the interannotator agreement and analyzed the causes of disagreement. At that point, we took a snapshot of PDEV and created a small data sample containing revisions of both the corpus annotation and the entries, driven by the analysis of interannotator disagreements. This snapshot is not part of PDEV. You can browse it as VPS-30-En (Verb Pattern Sample, 30 English verbs). The entire sample is most conveniently downloaded from the LINDAT-CLARIN repository.
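
The text above does not say which agreement measure was used. Purely as an illustration, the following minimal sketch computes Cohen's kappa, one common chance-corrected measure, for two annotators who each assign a pattern number to the same corpus lines. The function name, the label values and the choice of kappa are all assumptions made for this example, not a description of the project's actual procedure.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative only: the measure actually used for VPS-30-En
    is not stated in the text.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical pattern numbers assigned to six concordance lines of one verb.
annotator_1 = ["1", "1", "2", "3", "2", "1"]
annotator_2 = ["1", "2", "2", "3", "2", "1"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

With more than two annotators, a multi-rater measure such as Fleiss' kappa would be the analogous choice.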

As a next step, we moved away from manual resources towards automatically created data. We have built GRASS, a tagger that provides morphosyntactic information on verb forms and verb complementizers in dependency-parsed trees and is tailored to the distributional analysis of verbs and nouns.

We are very grateful to Patrick W. Hanks, as well as to Karel Pala, Pavel Rychlý, Adam Rambousek and Vít Baisa from the Natural Language Processing Centre at Masaryk University in Brno, for all the know-how, infrastructure and support we received from them when creating VPS-30-En.

Please see our publications for more detail.