Monday, November 14, 2011, 13:30 to 15:00

Combining symbolic and statistical methods in corpus-based NLP

Abstract: Linguists developing formal models of language seek to provide detailed accounts of linguistic phenomena, making predictions that can be tested systematically. Computational linguists building broad-coverage grammar implementations must balance several competing demands if the resulting systems are to be both effective and linguistically satisfying. There is an emerging consensus within computational linguistics that hybrid approaches combining rich symbolic resources and powerful statistical techniques will be necessary to produce NLP applications with a satisfactory balance of robustness and precision. In this talk I will present one approach to this division of labor, which we have been exploring at CSLI as part of an international consortium of researchers working on deep linguistic processing. I will argue for the respective roles of a large-scale effort at manual construction of a grammar of English, and of the systematic construction of statistical models built on annotated corpora that are parsed with such a grammar and then manually disambiguated. Illustrations of this approach will come from three applications of NLP: machine translation, information extraction from scientific texts, and grammar checking in online elementary school writing courses.

Dan Flickinger is a Senior Research Associate with the Education Program for Gifted Youth (EPGY) at Stanford University. He is the principal developer of the English Resource Grammar (ERG), a precise broad-coverage implementation of Head-driven Phrase Structure Grammar (HPSG), under steady development since 1994 within the Linguistic Grammars Online (LinGO) lab at the Center for the Study of Language and Information (CSLI). Current LinGO research is focused on collaborating with the University of Oslo in parsing the English Wikipedia for improved information retrieval. His research for EPGY centers on applying the ERG to improved educational software for teaching writing in a self-paced online course. Flickinger's primary research interests are in wide-coverage grammar engineering for both parsing and generation, lexical representation, the syntax-semantics interface within the HPSG framework, the development of grammar-based treebanks for disambiguation and regression testing, and practical applications of deep processing.