Monday, May 20, 2013 - 13:30

Towards a multilayer and multidimensional corpus annotation: Following the footprints of the Meaning-Text Theory

Abstract: An increasing number of treebanks are available for training statistical Natural Language Processing applications. Nearly all of them capture linguistic phenomena of different kinds (at least word order, morphological features, and syntactic dependencies), but only a few (among them the Prague Dependency Treebank, PDT) actually separate these phenomena into different levels of annotation; the majority use a single agglomerated annotation structure. Such a structure can be considered deficient from a theoretical (linguistic) point of view. It also reduces the quality of the annotated resources, which in turn degrades the quality of the applications trained on them. As numerous scholars have already pointed out, corpus annotation is of higher quality when it follows a well-defined linguistic model that supports multi-level annotation. In my talk, I will present the annotation of Spanish and English corpora rooted in the linguistic model of the Meaning-Text Theory. I will introduce the annotation schema we have developed for the surface-syntactic layer of Spanish and discuss how we (semi-)automatically derive the more abstract deep-syntactic and semantic annotations from the surface-syntactic annotation. In the second half of my talk, I will report on our work in progress on annotating the Penn Treebank with the Theme/Rheme structure. To conclude, I will draw some parallels between the annotation philosophy underlying PDT 2.0 and ours.

Leo Wanner earned his Diploma degree in Computer Science from the University of Karlsruhe and his PhD in Linguistics from the University of the Saarland. Prior to joining ICREA, he held positions at the German National Centre for Computer Science, the University of Waterloo, the University of Stuttgart, and the Pompeu Fabra University, Barcelona. As a visiting researcher, he was also affiliated with the University of Montreal, the University of Sydney, the University of Southern California's Information Sciences Institute, the University of Paris 7, and Columbia University, New York. Throughout his career, Dr Wanner has been involved in various large-scale national, European, and transatlantic research projects. He has published seven books and over 100 refereed journal and conference articles and serves as a regular reviewer for a number of high-profile conferences and journals.