"Information retrieval is the task of searching a body of information for objects that statisfied an information need."
This course is offered at the Faculty of Mathematics and Physics to graduate students interested in the area of information retrieval, web search, document classification, and other related areas. It is based on the book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. The course covers both the foundations of information retrieval and some more advanced topics.
SIS code: NPFL103
Semester: summer
E-credits: 5
Examination: 2/2 C+Ex
Lecturer: Pavel Pecina, pecina@ufal.mff.cuni.cz
Language: The course is taught in English. All the materials are in English, the homework assignments and/or exam can be completed in English or Czech/Slovak.
1. Introduction, Boolean retrieval, Inverted index, Text preprocessing Slides Questions
2. Dictionaries, Tolerant retrieval, Spelling correction Slides Questions
No formal prerequisities are required. Students should have a substantial programming experience and be familar with basic algorithms, data structures, and statistical/probabilistic concepts.
To pass the course, students need to complete two homework assignments and a written test. See grading for more details.
Where stated, the teaching materials for this course are available under CC BY-SA 4.0.
Note: The slides available on this page might get updated during the semestr. For each lecture, any updates will be published before the lecture starts.
Lecture 1 Questions
Information retrieval, Information need, Query, Document relevance; Boolean model, Boolean query; Word, Term, Token; Text normalization, Stemming, Lemmatization; Stopwords, Stopword removal; Inverted index, Dictionary, Postings, Postings intersection; Biword index, Positonal index; Phrasal search, Proximity search.
Draw the postings for terms rise and home that would be built for the following document collection:
| Doc1: | new home sales top forecasts |
| Doc2: | home sales rise in july |
| Doc3: | increase in home sales in july |
| Doc4: | july new home sales rise |
Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system. Recommend a query processing order for the following two queries:
trees AND skies AND kaleidoscope(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)and the postings list sizes:
| Term | Postings Size |
|---|---|
| eyes | 213,312 |
| kaleidoscope | 87,009 |
| marmalade | 107,913 |
| skies | 271,658 |
| tangerine | 46,653 |
| trees | 316,812 |
For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal?
Explain why it is, or give an example where it isn’t.
Write out a postings merge algorithm for the query:
x OR y
How should the following Boolean query be handled?
x AND NOT y
Why is naive evaluation of this query normally very expensive?
Write out a postings merge algorithm that evaluates this query efficiently.
Are the following statements true or false?
The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs would you argue shouldn’t be conflated. Give your reasoning.
Lecture 2 Questions
Implementation of dictionary and postings in inverted index, B-trees; Wildcard queries, Permuterm index, k-gram index; Spelling correction (isolated, context-sensitive), Spelling correction in IR, (Weighted) Edit distance, Levenshtein distance, Soundex.
Assume a biword index. Give an example of a document which will be returned
for a query of New York University but is actually a false positive which should not be
returned.
How could an IR system combine use of a positional index and use of stop words? What is the potential problem, and how could it be handled?
In the permuterm index, each permuterm vocabulary term points to the original vocabulary term(s) from which it was derived. How many original vocabulary terms can there be in the postings list of a permuterm vocabulary term?
Write down the entries in the permuterm index dictionary that are generated by the term mama.
If you wanted to search for s*ng in a permuterm wildcard index, what key(s) would one do the lookup on?
Vocabulary terms in the postings of in a k-gram index are lexicographically ordered. Why is this ordering useful?
If |a| denotes the length of string a, show that the edit distance between a and b is never more than max{|a|, |b|}.
Compute the edit distance between paris and alice. Write down the 5 × 5 array of distances between all prefixes as computed by the Levenshtein algorithm.
Consider the four-term query catched in the rye and suppose that each of the query terms has five alternative terms suggested by isolated-term correction. How many possible corrected phrases must we consider if we do not trim the space of corrected phrases, but instead try all six variants for each of the terms?
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Cambridge University Press, 2008, ISBN: 978-0521865715.
Available online.
David A. Grossman and Ophir Frieder, Springer, 2004, ISBN 978-1402030048.
Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley, 1999, ISBN: 978-0201398298.
