Monday, March 26, 2012 - 13:30

The Statistical Problem of Language Acquisition

Abstract: The talk reports recent work with Tom Kwiatkowski, Sharon Goldwater, and Luke Zettlemoyer on machine induction of semantic parsers from a number of corpora pairing sentences with logical forms, including GeoQuery and a corpus of real child-directed utterances from the CHILDES corpus. The problems of semantic parser induction and child language acquisition are both similar to the problem of inducing a grammar and a parsing model from a treebank such as the Penn Treebank, except that the trees are unordered logical forms whose preterminals are not aligned with words in the target language, and the data may include noise and spurious distracting logical forms that are supported by the context but irrelevant to the utterance. The talk shows that this class of problem can be solved if the child or machine initially parses with the entire space of possibilities that universal grammar allows under the assumptions of Combinatory Categorial Grammar (CCG), and learns a statistical parsing model for that space using EM-related methods such as Variational Bayes learning. This can be done without all-or-none "parameter-setting" or attendant "triggers", and without invoking any "subset principle" of the kind proposed in linguistic theory, provided the system is presented with a representative sample of reasonably short string-meaning pairs from the target language.
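
As a rough illustration of the kind of statistical learning the abstract refers to, the sketch below runs plain EM on a toy set of string-meaning pairs in which meaning symbols are not aligned with words. It is not the model from the talk: it uses no CCG, no logical-form structure, and ordinary EM rather than Variational Bayes. The data, the symbol names, and the functions `em_word_meanings` and `best_word` are all hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical toy data: each utterance is paired with an unordered set of
# meaning symbols; nothing says which word expresses which symbol.
PAIRS = [
    (["the", "dog"], ["DOG"]),
    (["the", "cat"], ["CAT"]),
    (["dog", "runs"], ["DOG", "RUN"]),
    (["cat", "runs"], ["CAT", "RUN"]),
]


def em_word_meanings(pairs, iterations=20):
    """Estimate t[symbol][word] = p(word | symbol) with plain EM.

    Assumes every utterance has at least one word and one symbol.
    """
    vocab = {w for words, _ in pairs for w in words}
    symbols = {s for _, syms in pairs for s in syms}
    # Start from a uniform distribution over words for every symbol.
    t = {s: {w: 1.0 / len(vocab) for w in vocab} for s in symbols}
    for _ in range(iterations):
        counts = {s: defaultdict(float) for s in symbols}
        # E-step: share each word token among the symbols in its context,
        # in proportion to the current p(word | symbol).
        for words, syms in pairs:
            for w in words:
                z = sum(t[s][w] for s in syms)
                for s in syms:
                    counts[s][w] += t[s][w] / z
        # M-step: renormalise the expected counts for each symbol.
        for s in symbols:
            total = sum(counts[s].values())
            for w in vocab:
                t[s][w] = counts[s][w] / total
    return t


def best_word(t, symbol):
    """Return the word with the highest estimated p(word | symbol)."""
    return max(t[symbol], key=t[symbol].get)


if __name__ == "__main__":
    model = em_word_meanings(PAIRS)
    for symbol in sorted(model):
        print(symbol, "->", best_word(model, symbol))
```

The probabilities change gradually over the iterations as evidence accumulates across utterances, which is the sense in which such learning avoids all-or-none decisions. Function words such as "the" have no symbol of their own in this toy setup and are simply absorbed by the symbols they co-occur with.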

Mark Steedman is Professor of Cognitive Science in the School of Informatics at the University of Edinburgh, to which he moved in 1998 from the University of Pennsylvania, where he previously taught as Professor in the Department of Computer and Information Science. He is a Fellow of the British Academy, the Royal Society of Edinburgh, and the American Association for Artificial Intelligence. His research covers a range of problems in computational linguistics, artificial intelligence, computer science, and cognitive science, including syntax and semantics of natural language, and parsing and comprehension of natural language discourse by humans and by machine using Combinatory Categorial Grammar (CCG). Much of his current NLP research concerns wide-coverage parsing for robust semantic interpretation and natural language inference. Some of his research concerns the analysis of music by humans and machines.