"Information retrieval is the task of searching a body of information for objects that statisfied an information need."
This course is offered at the Faculty of Mathematics and Physics to graduate students interested in the area of information retrieval, web search, document classification, and other related areas. It is based on the book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. The course covers both the foundations of information retrieval and some more advanced topics.
SIS code: NPFL103
Semester: winter
E-credits: 5
Examination: 2/2 C+Ex
Lecturer: Pavel Pecina, pecina@ufal.mff.cuni.cz
Language: The course is taught in English. All the materials are in English, the homework assignments and/or exam can be completed in English or Czech.
Oct 5
1. Introduction, Boolean retrieval, Inverted index, Text preprocessing Slides Questions
2. Dictionaries, Tolerant retrieval, Spelling correction Slides Questions
3. Index construction and index compression Slides Questions
4. Ranked retrieval, Term weighting, Vector space model Slides Questions
5. Ranking, Complete search system, Evaluation, Benchmarks Slides 1. Vector space models
6. Result summaries, Relevance Feedback, Query Expansion Slides
7. Probabilistic information retrieval Slides
8. Language models, Text classification Slides
9. Vector space classification Slides 2. Retrieval frameworks
10. Document clustering Slides
11. Latent Semantic Indexing Slides
12. Web search, Crawling, Duplicate detection Slides
No formal prerequisities are required. Students should have a substantial programming experience and be familar with basic algorithms, data structures, and statistical/probabilistic concepts.
To pass the course, students need to complete two homework assignments and a written test. See grading for more details.
Note: The slides available on this page might get updated during the semestr. For each lecture, any updates will be published before the lecture starts.
Nov 02 Slides 1. Vector space models
Nov 09 Slides
Nov 16 Slides
Dec 30 Slides
Dec 07 Slides 2. Retrieval frameworks
Dec 14 Slides
Dec 21 Slides
Jan 4 Slides
Lecture 1 Questions
Information retrieval, Information need, Query, Document relevance; Boolean model, Boolean query; Word, Term, Token; Text normalization, Stemming, Lemmatization; Stopwords, Stopword removal; Inverted index, Dictionary, Postings, Postings intersection; Biword index, Positonal index; Phrasal search, Proximity search.
Draw the postings for terms rise and home that would be built for the following document collection:
Doc 1 new home sales top forecasts Doc 2 home sales rise in july Doc 3 increase in home sales in july Doc 4 july new home sales rise
Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system. Recommend a query processing order for the following two queries:
trees AND skies AND kaleidoscope
(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
and the postings list sizes:
Term | Postings Size |
---|---|
eyes | 213,312 |
kaleidoscope | 87,009 |
marmalade | 107,913 |
skies | 271,658 |
tangerine | 46,653 |
trees | 316,812 |
For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal?
Explain why it is, or give an example where it isn’t.
How should the following Boolean query be handled?
x AND NOT y
Why is naive evaluation of this query normally very expensive?
Write out a postings merge algorithm that evaluates this query efficiently.
Lecture 2 Questions
Implementation of dictionary and postings in inverted index, B-trees; Wildcard queries, Permuterm index, k-gram index; Spelling correction (isolated, context-sensitive), Spelling correction in IR, (Weighted) Edit distance, Levenshtein distance, Soundex.
Are the following statements true or false?
The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs would you argue shouldn’t be conflated. Give your reasoning.
Assume a biword index. Give an example of a document which will be returned
for a query of New York University
but is actually a false positive which should not be
returned.
How could an IR system combine use of a positional index and use of stop words? What is the potential problem, and how could it be handled?
In the permuterm index, each permuterm vocabulary term points to the original vocabulary term(s) from which it was derived. How many original vocabulary terms can there be in the postings list of a permuterm vocabulary term?
Write down the entries in the permuterm index dictionary that are generated by the term mama
.
If you wanted to search for s*ng in a permuterm wildcard index, what key(s) would one do the lookup on?
Vocabulary terms in the postings of in a k-gram index are lexicographically ordered. Why is this ordering useful?
If |a| denotes the length of string a, show that the edit distance between a and b is never more than max{|a|, |b|}.
Compute the edit distance between paris and alice. Write down the 5 × 5 array of distances between all prefixes as computed by the Levenshtein algorithm.
Consider the four-term query catched in the rye and suppose that each of the query terms has five alternative terms suggested by isolated-term correction. How many possible corrected phrases must we consider if we do not trim the space of corrected phrases, but instead try all six variants for each of the terms?
Lecture 3 Questions
Index construction (sort-based, blocked sort-based, single-pass); Distributed indexing, MapReduce; Dynamic indexing, Logarithmic index merging; Index compression, lossy vs. loss-less, Heap's law, Zipf's law; Dictionary compression (as a string, with blocking, with front coding); Postings compression (gap encoding).
For n = 2 and 1 ≤ T ≤ 30, perform a step-by-step simulation of the Logarithmic merge algorithm. Create a table that shows, for each point in time at which T = 2 ∗ k tokens have been processed (1 ≤ k ≤ 15), which of the three indexes l_{0}, . . . , l_{3} are in use. The first three lines of the table are given below.
I_{ 3 } | I_{ 2 } | I_{ 1 } | I_{ 0 } | |
---|---|---|---|---|
2 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 1 |
6 | 0 | 0 | 1 | 0 |
8 | ? | ? | ? | ? |
An auxiliary index can impair the quality of collection statistics. An example is the term weighting method idf, which is defined as log(N/df_{i}) where N is the total number of documents and df_{i} is the number of documents that term i occurs in. Show that even a small auxiliary index can cause significant error in idf when it is computed on the main index only. Consider a rare term that suddenly occurs frequently (e.g., Flossie as in Tropical Storm Flossie).
Assuming one machine word per posting, what is the size of the uncompressed (nonpositional) index for different tokenizations based on Slide 48? How do these numbers compare with numbers on Slide 71?
Estimate the space usage of the Reuters-RCV1 dictionary with blocks of size k = 8 and k = 16 in blocked dictionary storage.
Lecture 4 Questions
Boolean rerieval vs. Ranked retrieval, document-query scoring; Bag-of-Words model; Term weighting, Term frequency, Document frequency, TF-IDF; Vector space model, Similarity in vector space, Length normalization, Pivoted length normalization.
What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.
Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 (bellow) and compute the tf-idf weights for the terms car, auto, insurance, best, for each document, using the idf values in the following table:
term | df | idf |
---|---|---|
car | 18,165 | 1.65 |
auto | 6,723 | 2.08 |
insurance | 19,241 | 1.62 |
best | 25,235 | 1.5 |
Doc1 | Doc2 | Doc3 | |
---|---|---|---|
car | 27 | 4 | 24 |
auto | 3 | 33 | 0 |
insurance | 0 | 33 | 29 |
best | 14 | 0 | 17 |
Can the tf-idf weight of a term in a document exceed 1?
How does the base of the logarithm in idf affect the calculation of the tf-idf score for a given query and a given document? How does the base of the logarithm affect the relative scores of two documents on a given query?
If we were to stem jealous and jealousy to a common stem before setting up the vector space, detail how the definitions of tf and idf should be modified.
Show that if a query vector and document vectors are all normalized to unit vectors, then the rank ordering produced by Euclidean distance is identical to that produced by cosine similarities.
In turning a query into a unit vector we can assigned equal weights to each of the query terms. What other principled approaches are plausible?
Lecture 5 Questions
Importance of ranking in IR; Computing top-k results, Composite scores (PageRank), Document-at-a-time processing, Term-at-a-time processing; Tiered indexes; Evaluation, Document relevance, Confusion matrix, Precision, Recall, F-measure, Precision-recall trade-off, Precision-recall curve, Averaged precision, A/B testing, Benchmarking, test collections.
An IR system returns 8 relevant documents, and 10 nonrelevant documents. There are a total of 20 relevant documents in the collection. What is the precision of the system on this search, and what is its recall?
Derive the equivalence between the two formulas for F measure on slide 51 (using α ad β as parameters), given that α = 1/(β^{2} + 1).
Must there always be a break-even point between precision and recall? Either show there must be or give a counter-example.
The following list of Rs and Ns represents relevant (R) and nonrelevant (N) returned documents in a ranked list of 20 documents retrieved in response to a query from a collection of 10,000 documents. The top of the ranked list (the document the system thinks is most likely to be relevant) is on the left of the list. This list shows 6 relevant documents. Assume that there are 8 relevant documents in total in the collection.
R R N N N N N N R N R N N N R N N N N R
Lecture 6 Questions
Presenting IR results, Summaries, Snippets; Relevance feedback, Rocchio algorithm, Positive vs. negative feedback, Pseudo-relevance feedback; Query expansion, thesaurus-based, co-occurence-based.
In Rocchio’s 1971 algorithm (SMART), what weight setting for α/β/γ does a “Find pages like this one” search correspond to?
Give three reasons why relevance feedback has been little used in web search.
Why is positive feedback likely to be more useful than negative feedback to an IR system? Why might only using one nonrelevant document be more effective than using several?
Lecture 7 Questions
Probability Ranking Principle, Binary Independece Model (assumptions), Retrieval Status Value, BM25 extension.
What are the differences between standard vector space tf-idf weighting and the BIM probabilistic retrieval model (in the case where no document relevance information is available)?
Let X_{t} be a random variable indicating whether the term t appears in a document. Suppose we have |R| relevant documents in the document collection and that X_{t} = 1 in s of the documents. Take the observed data to be just these observations of X_{t} for each document in R. Show that the MLE for the parameter p_{t} = P(X_{t} = 1|R = 1,q), that is, the value for p_{t} which maximizes the probability of the observed data, is p_{t} = s/|R|.
Lecture 8 Questions
Language models for IR (principles, parameter estimation, smoothing, Jelinek-Mercer, Dirichlet); Text classification, Naive Bayes classification (classification rule, parameter estimation, smoothing, assumptions); Evaluation of classification (Precision, Recall, F), One-class vs. Multi-class classification (Micro, Macro).
Including STOP probabilities in the calculation, what will the sum of the probability estimates of all strings in the language of length 1 be? Assume that you generate a word and then decide whether to stop or not (i.e., the null string is not part of the language).
How might a language model be used in a spelling correction system? In particular, consider the case of context-sensitive spelling correction, and correcting incorrect usages of words, such as their in Are you their?
Consider making a language model from the following training text:
the martian has landed on the latin pop sensation ricky martin
Suppose we have a collection that consists of the 4 documents given in the below table.
DocID | Document text |
---|---|
Doc1 | click go the shears boys click click click |
Doc2 | click click |
Doc3 | metal here |
Doc4 | metal shears click here |
Build a query likelihood language model for this document collection. Assume a mixture model between the documents and the collection, with both weighted at 0.5. Maximum likelihood estimation (MLE) is used to estimate both as unigram models. Work out the model probabilities of the queries click, shears, and hence click shears for each document, and use those probabilities to rank the documents returned by each query. Fill in these probabilities in the below table:
Query | Doc1 | Doc2 | Doc3 | Doc4 |
---|---|---|---|---|
click | ||||
shears | ||||
click shears |
What is the final ranking of the documents for the query click shears?
The rationale for the positional independence assumption is that there is no useful information in the fact that a term occurs in position k of a document. Find exceptions. Consider formulaic documents with a fixed document structure.
Lecture 9 Questions
Vector space classification, k-nn classification (principle, complexity, probabilistic variant); Linear vs. non-linear classfiers; Multi-class classification (one-of, any-of); Support Vector Machines (large margin classification, separability, Slack variables).
Prove that the number of linear separators of two classes is either infinite or zero.
Which of the three vectors a = (0.5, 1.5), x = (2, 2), b = (4, 4), and c = (8, 6) is:
What is the minimum number of support vectors that there can be for a data set (which contains instances of each class)?
Lecture 10 Questions
Clustering (comparison with classificaton, cluster hypothesis); Clustering in IR; Flat vs. Hierarchical clustering; Hard vs. Soft clustering; K-means clustering (algorithm, convergence vs. optimality, RSS); Evaluation of clustering (Purity, Rand index); Optimization of number of clusters; Hierarchical clustering (aglomerative, bi-secting), cluster similarity, problems (chaining, inversions).
Define two documents as similar if they have at least two proper names like Clinton or Sarkozy in common. Give an example of an information need and two documents, for which the cluster hypothesis does not hold for this notion of similarity.
Make up a simple one-dimensional example (i.e. points on a line) with two clusters where the inexactness of cluster-based retrieval shows up. In your example, retrieving clusters close to the query should do worse than direct nearest neighbor search.
Purity as an external evaluation criterion for cluster quality. Majority class and number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.
Replace every point d with two identical copies of d in the same class in the above figure.
Why are documents that do not use the same term for the concept car likely to end up in the same cluster in K-means clustering?
Two of the possible termination conditions for K-means were (1) assignment does not change, (2) centroids do not change. Do these two conditions imply each other?
For a fixed set of N documents there are up to N^{2} distinct similarities between clusters in single-link and complete-link clustering. How many distinct cluster similarities are there in GAAC and centroid clustering?
Consider running 2-means clustering on a collection with documents from two different languages. What result would you expect? Would you expect the same result when running an HAC algorithm?
Lecture 11 Questions
Latent Semantic Indexing (principle, SVD, dimensionality reduction); LSI in IR; LSI as soft clustering;
Lecture 12 Questions
Web crawling (principles, sub-tasks, robots.txt), Mercator URL frontier (principles); Duplicate detection, Near duplicate detection (shingles, sketches).
Why is it better to partition hosts (rather than individual URLs) between the nodes of a distributed crawl system?
Why should the host splitter precede the Duplicate URL Eliminator?
Note: Detailed specification of the assignments (with a link to data download) will be distributed via email.
Deadline: Nov 29, 2020 23:59 100 points
Design, develop and evaluate your own retrieval system based on vector space model.
Deadline: Jan 3, 2021, 23:39 100 points
Design, develop and evaluate a state-of-the-art retrieval system using an off-the-shelf retrieval framework of your choice.
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Cambridge University Press, 2008, ISBN: 978-0521865715.
Available online.
David A. Grossman and Ophir Frieder, Springer, 2004, ISBN 978-1402030048.
Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley, 1999, ISBN: 978-0201398298.