"Information retrieval is the task of searching a body of information for objects that satisfy an information need."
This course is offered at the Faculty of Mathematics and Physics to graduate students interested in information retrieval, web search, document classification, and related areas. It is based on the book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. The course covers both the foundations of information retrieval and some more advanced topics.
SIS code: NPFL103
Semester: winter
E-credits: 5
Examination: 2/2 C+Ex
Lecturer: Pavel Pecina, pecina@ufal.mff.cuni.cz
Language: The course is taught in English. All the materials are in English; the homework assignments and/or exam can be completed in English or Czech.
No. | Date | Topics | Materials | Assignment |
---|---|---|---|---|
1 | Oct 5 | Introduction, Boolean retrieval, Inverted index, Text preprocessing | Slides, Questions | |
2 | | Dictionaries, Tolerant retrieval, Spelling correction | Slides, Questions | |
3 | | Index construction and index compression | Slides, Questions | |
4 | | Ranked retrieval, Term weighting, Vector space model | Slides, Questions | |
5 | Nov 2 | Ranking, Complete search system, Evaluation, Benchmarks | Slides | 1. Vector space models |
6 | Nov 9 | Result summaries, Relevance Feedback, Query Expansion | Slides | |
7 | Nov 16 | Probabilistic information retrieval | Slides | |
8 | Nov 23 | Language models, Text classification | Slides | |
9 | Nov 30 | Vector space classification | Slides | 2. Retrieval frameworks |
10 | Dec 7 | Document clustering | Slides | |
11 | Dec 14 | Latent Semantic Indexing | Slides | |
12 | Dec 21 | Web search, Crawling, Duplicate detection, Spam detection | Slides | |
No formal prerequisites are required. Students should have substantial programming experience and be familiar with basic algorithms, data structures, and statistical/probabilistic concepts.
To pass the course, students need to complete two homework assignments and a written test. See grading for more details.
Note: The slides available on this page might get updated during the semester. For each lecture, any updates will be published before the lecture starts.
Lectures 5 to 12 take place on Nov 2, Nov 9, Nov 16, Nov 23, Nov 30, Dec 7, Dec 14, and Dec 21. Assignment 1 (Vector space models) is listed with the Nov 2 lecture and assignment 2 (Retrieval frameworks) with the Nov 30 lecture.
Lecture 1 Questions
Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system. Recommend a query processing order for the following two queries:
a) trees AND skies AND kaleidoscope
b) (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
and the postings list sizes:
Term | Postings Size |
---|---|
eyes | 213,312 |
kaleidoscope | 87,009 |
marmalade | 107,913 |
skies | 271,658 |
tangerine | 46,653 |
trees | 316,812 |
For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal?
Explain why it is, or give an example where it isn’t.
How should the following Boolean query be handled?
x AND NOT y
Why is naive evaluation of this query normally very expensive?
Write out a postings merge algorithm that evaluates this query efficiently.
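As background for the questions above, the following is a minimal sketch of the linear two-pointer intersection of two sorted postings lists (x AND y), together with the common heuristic of processing a conjunctive query in order of increasing postings size. The function names and the toy docIDs are illustrative assumptions, not course material; the AND NOT merge is left as the exercise.

```python
from typing import List


def intersect(p1: List[int], p2: List[int]) -> List[int]:
    """Two-pointer merge of two sorted postings lists (x AND y)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer


def intersect_many(postings: List[List[int]]) -> List[int]:
    """Intersect several lists, shortest first, so intermediate results stay small."""
    if not postings:
        return []
    ordered = sorted(postings, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, plist)
    return result


# Hypothetical toy postings (docIDs), not taken from the course data.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
calpurnia = [2, 31, 54, 101]
print(intersect_many([brutus, caesar, calpurnia]))  # [2]
```

Each call to `intersect` runs in time linear in the combined length of its two inputs, which is why starting from the shortest list tends to keep the total work low.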
Lecture 2 Questions
Are the following statements true or false?
a. In a Boolean retrieval system, stemming never lowers precision.
b. In a Boolean retrieval system, stemming never lowers recall.
c. Stemming increases the size of the vocabulary.
d. Stemming should be invoked at indexing time but not while processing a query.
The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs would you argue shouldn’t be conflated? Give your reasoning.
a. abandon/abandonment
b. absorbency/absorbent
c. marketing/markets
d. university/universe
e. volume/volumes
Assume a biword index. Give an example of a document which will be returned for a query of New York University but is actually a false positive which should not be returned.
How could an IR system combine use of a positional index and use of stop words? What is the potential problem, and how could it be handled?
Write down the entries in the permuterm index dictionary that are generated by the term mama.
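To illustrate how permuterm entries are formed, here is a small sketch that generates all rotations of a term with an end-of-term marker appended. The function name is hypothetical, the `$` marker follows the usual textbook convention, and the example deliberately uses a different term than the exercise.

```python
from typing import List


def permuterm_rotations(term: str, end_marker: str = "$") -> List[str]:
    """Return every rotation of term + end_marker."""
    marked = term + end_marker
    # Rotation i moves the first i characters to the end of the string.
    return [marked[i:] + marked[:i] for i in range(len(marked))]


print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
```

A term of length n yields n + 1 rotations, each of which becomes a dictionary key pointing back to the original term.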
Vocabulary terms in the postings of a k-gram index are lexicographically ordered. Why is this ordering useful?
If |a| denotes the length of string a, show that the edit distance between a and b is never more than max{|a|, |b|}.
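For experimenting with the bound in the question above, a minimal sketch of Levenshtein edit distance by dynamic programming may help. It assumes unit costs for insertion, deletion, and substitution; the function name is illustrative. Checking the bound on a few pairs is only a sanity check, not the proof the question asks for.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3

# Spot-check the bound max(|a|, |b|) on a few hypothetical pairs.
for a, b in [("kitten", "sitting"), ("", "abc"), ("abc", "xyz")]:
    assert edit_distance(a, b) <= max(len(a), len(b))
```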
Lecture 3 Questions
For n = 2 and 1 ≤ T ≤ 30, perform a step-by-step simulation of the logarithmic merge algorithm. Create a table that shows, for each point in time at which T = 2 ∗ k tokens have been processed (1 ≤ k ≤ 15), which of the four indexes I_{0}, ..., I_{3} are in use. The first three lines of the table are given below.
T | I_{3} | I_{2} | I_{1} | I_{0} |
---|---|---|---|---|
2 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 1 |
6 | 0 | 0 | 1 | 0 |
8 | ? | ? | ? | ? |
An auxiliary index can impair the quality of collection statistics. An example is the term weighting method idf, which is defined as log(N/df_{i}) where N is the total number of documents and df_{i} is the number of documents that term i occurs in. Show that even a small auxiliary index can cause significant error in idf when it is computed on the main index only. Consider a rare term that suddenly occurs frequently (e.g., Flossie as in Tropical Storm Flossie).
Assuming one machine word per posting, what is the size of the uncompressed (nonpositional) index for different tokenizations based on Slide 48? How do these numbers compare with numbers on Slide 71?
Estimate the space usage of the Reuters-RCV1 dictionary with blocks of size k = 8 and k = 16 in blocked dictionary storage.
Lecture 4 Questions
What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.
Can the tf-idf weight of a term in a document exceed 1?
How does the base of the logarithm in idf affect the calculation of the tf-idf score for a given query and a given document? How does the base of the logarithm affect the relative scores of two documents on a given query?
If we were to stem jealous and jealousy to a common stem before setting up the vector space, detail how the definitions of tf and idf should be modified.
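To experiment with the questions above, here is a small sketch of idf and one common tf-idf variant, (1 + log tf) · log(N/df), with the logarithm base exposed as a parameter. The choice of variant, the function names, and the collection statistics are assumptions made for illustration; the course may use a different weighting.

```python
import math


def idf(n_docs: int, df: int, base: float = 10.0) -> float:
    """Inverse document frequency log_base(N / df); df must be positive."""
    if df <= 0 or df > n_docs:
        raise ValueError("df must satisfy 0 < df <= n_docs")
    return math.log(n_docs / df) / math.log(base)


def tf_idf(tf: int, n_docs: int, df: int, base: float = 10.0) -> float:
    """Log-scaled term frequency times idf; 0 if the term is absent."""
    if tf <= 0:
        return 0.0
    return (1.0 + math.log(tf) / math.log(base)) * idf(n_docs, df, base)


# Hypothetical collection statistics, chosen only for illustration.
N = 1_000_000
print(round(idf(N, df=1_000), 3))
print(round(tf_idf(tf=10, n_docs=N, df=1_000), 3))
print(round(tf_idf(tf=10, n_docs=N, df=1_000, base=2.0), 3))
```

Varying `base`, `tf`, and `df` in these calls is a quick way to test a conjecture before arguing it formally.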
Note: Detailed specification of the assignments (with a link to data download) will be distributed via email.
Assignment 1 (Vector space models). Deadline: Nov 29, 2020, 23:59. Points: 100.
Design, develop and evaluate your own retrieval system based on the vector space model.
Assignment 2 (Retrieval frameworks). Deadline: Jan 3, 2021, 23:39. Points: 100.
Design, develop and evaluate a state-of-the-art retrieval system using an off-the-shelf retrieval framework of your choice.
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008, ISBN: 978-0521865715.
Available online.
David A. Grossman and Ophir Frieder: Information Retrieval: Algorithms and Heuristics. Springer, 2004, ISBN: 978-1402030048.
Ricardo Baeza-Yates and Berthier Ribeiro-Neto: Modern Information Retrieval. Addison Wesley, 1999, ISBN: 978-0201398298.