TLT 2006 Prague, Czech Republic

Invited Talks

Gosse Bouma:

Searching Treebanks

Information Science, University of Groningen
Netherlands

Automatic annotation with an accurate wide-coverage parser can be used to produce annotated corpora of arbitrary size. In this talk, we will review a number of application areas where such corpora have been found useful. Large, automatically constructed treebanks are not free of errors, but because of their size (we have used syntactically annotated corpora of up to 600M words) they can provide information which is hard to obtain otherwise. We have used syntactically annotated corpora to study the distribution of syntactic constructions (e.g. word order in indirect object constructions, the distribution of focus particles inside PPs, and (alleged) extraction of PPs from NPs). In addition, we have used such corpora to acquire lexical and ontological information (ranging from support verb constructions to definition sentences). Finally, annotated corpora can be used in information extraction and question answering.

An issue in all applications is the development of tools for searching, extracting, and combining information from XML data. In the second part of the talk we will address the potential of (emerging) XML standards for sophisticated retrieval of data from treebanks. In particular, XQuery and XPath offer the expressive power needed for straightforward retrieval of virtually all conceivable relations from treebanks. For retrieval of relations from graphs, or of relations which are defined by a combination of syntactic and semantic (ontological) knowledge (e.g. causes of diseases), semantic web languages such as RDF(S) and OWL and graph query languages such as SPARQL are available. Adopting standard XML technology has the benefit that a wide range of tools and languages which support the standard can be used. Adopting semantic web standards facilitates integration of ontological resources and syntactic annotation.
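As an illustration of XPath-style retrieval from a treebank, the following minimal sketch finds sentences that contain an indirect object. The XML layout (the element name `node` and the attributes `rel`, `cat`, `pos`, `word`, with `obj2` marking the indirect object) is a hypothetical toy format invented for this example and is not taken from the talk. It uses the XPath subset supported by Python's standard `xml.etree.ElementTree` module, not a full XPath or XQuery engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy treebank: element and attribute names are assumptions
# made for this sketch, not the format of any particular treebank.
TREEBANK = """
<treebank>
  <sentence id="s1">
    <node cat="smain">
      <node rel="su" pos="noun" word="Kim"/>
      <node rel="hd" pos="verb" word="gave"/>
      <node rel="obj2" pos="noun" word="Sandy"/>
      <node rel="obj1" cat="np">
        <node rel="det" pos="det" word="a"/>
        <node rel="hd" pos="noun" word="book"/>
      </node>
    </node>
  </sentence>
  <sentence id="s2">
    <node cat="smain">
      <node rel="su" pos="noun" word="Kim"/>
      <node rel="hd" pos="verb" word="slept"/>
    </node>
  </sentence>
</treebank>
"""


def sentences_with_indirect_object(xml_text):
    """Return the ids of sentences containing a node with rel='obj2'."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for sentence in root.findall("sentence"):
        # './/node[@rel="obj2"]' matches any descendant node carrying
        # the (assumed) indirect-object relation label.
        if sentence.find(".//node[@rel='obj2']") is not None:
            hits.append(sentence.get("id"))
    return hits


if __name__ == "__main__":
    print(sentences_with_indirect_object(TREEBANK))
```

Queries that combine several constraints or return restructured results would typically need a full XPath or XQuery processor rather than this subset.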

Martha Palmer:

SemLink - Linking PropBank, VerbNet, FrameNet and WordNet

University of Colorado in Boulder
USA

PropBank has been widely used as training data for Semantic Role Labeling. However, because this training data is taken from the WSJ, the resulting machine learning models tend to overfit on idiosyncrasies of that text's style, and do not port well to other genres. In addition, since PropBank was designed on a verb-by-verb basis, the argument labels Arg2-Arg5 are used for very diverse argument roles, which yields inconsistent training instances. For example, the verb "make" uses Arg2 for the "Material" argument, but the verb "multiply" uses Arg2 for the "Extent" argument. As a result, it can be difficult for automatic classifiers to learn to distinguish arguments Arg2-Arg5. We have created a mapping between PropBank and VerbNet that provides a VerbNet thematic role label for each verb-specific PropBank label. Since VerbNet uses argument labels that are more consistent across verbs, we are able to demonstrate that these new labels are easier to learn. This talk will describe PropBank and VerbNet and the mapping between them, and present the resulting improvements in automatic Semantic Role Labeling. Preliminary mappings to FrameNet and CYC will also be described.
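The relabeling idea can be sketched as a lookup from (verb, PropBank label) pairs to thematic roles. The two entries below come from the example in the abstract; the table structure, the function name `relabel`, and the fallback behaviour are assumptions made for illustration and do not reflect the actual format of the mapping resource.

```python
# Toy mapping from verb-specific PropBank labels to thematic roles.
# Only the two entries mentioned in the abstract are included; the
# real mapping is far larger and is not reproduced here.
PROPBANK_TO_ROLE = {
    ("make", "Arg2"): "Material",
    ("multiply", "Arg2"): "Extent",
}


def relabel(verb, propbank_label):
    """Return the thematic role for a verb-specific PropBank label.

    Falls back to the original PropBank label when no mapping is
    known (an assumption of this sketch).
    """
    return PROPBANK_TO_ROLE.get((verb, propbank_label), propbank_label)


if __name__ == "__main__":
    for verb in ("make", "multiply", "run"):
        print(verb, "Arg2 ->", relabel(verb, "Arg2"))
```

After relabeling, the same surface label Arg2 is split into distinct classes, which is the property the abstract argues makes the labels easier for a classifier to learn.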
