Monday, March 17, 2014 - 13:30

How to evaluate a corpus

15th lecture of the Fred Jelinek Seminar series


Linguistics researchers and language technologists often wonder:
“What corpus should I use, or should I build one of my own? If I build one of my
own, how will I know whether I have done a good job?” Currently there is very
little help available to them. They need a framework for evaluating corpora.
We develop such a framework, in relation to corpora which aim for good coverage
of ‘general language’. The task we set is automatic creation of a
publication-quality collocations dictionary. For a sample of 100 headwords of
Czech and 100 of English, we identify a gold standard dataset of (ideally) all
the collocations that should appear for these headwords in a collocations
dictionary. We then use this gold standard to determine precision and recall for
a range of corpora, under a range of parameter settings.
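
As a rough illustration of the evaluation step described above, the following minimal sketch computes precision and recall for one headword by comparing collocation candidates extracted from a corpus against a gold standard set. The function name, the pair representation, and the sample data are all hypothetical and are not taken from the talk; the actual extraction method and scoring details may differ.

```python
def precision_recall(candidates, gold):
    """Return (precision, recall) of candidate collocations against a gold set.

    Both arguments are iterables of hashable items, here (collocate, headword)
    pairs. This representation is an assumption made for illustration.
    """
    candidates, gold = set(candidates), set(gold)
    true_positives = len(candidates & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard collocations for the headword "tea".
gold = {("strong", "tea"), ("green", "tea"), ("cup", "tea"), ("herbal", "tea")}

# Hypothetical candidates extracted from some corpus.
extracted = [
    ("strong", "tea"),
    ("green", "tea"),
    ("hot", "tea"),
    ("time", "tea"),
    ("cup", "tea"),
]

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the five candidates are in the gold standard, and three of the four gold collocations were found. Repeating the calculation for each corpus and parameter setting, and averaging over headwords, would give one way to compare corpora on this task.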

Adam Kilgarriff is Director of Lexical Computing Ltd. He has led the development of the Sketch Engine, a leading tool for corpus research used for dictionary-making at Oxford University Press, Cambridge University Press, HarperCollins, Le Robert and elsewhere. His scientific interests lie at the intersection of computational linguistics, corpus linguistics, and dictionary-making. Following a PhD on "Polysemy" from Sussex University, he has worked at Longman Dictionaries, Oxford University Press, and the University of Brighton. He is a Visiting Research Fellow at the University of Leeds. He is active in moves to make the web available as a linguists' corpus and was the founding chair of ACL-SIGWAC (Association for Computational Linguistics Special Interest Group on Web as Corpus). He has also been chair of the ACL-SIG on the lexicon and a Board member of EURALEX (European Association for Lexicography).