# NPFL103 – Information Retrieval

"Information retrieval is the task of searching a body of information for objects that statisfied an information need."

This course is offered at the Faculty of Mathematics and Physics to graduate students interested in the area of information retrieval, web search, document classification, and other related areas. It is based on the book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. The course covers both the foundations of information retrieval and some more advanced topics.

SIS code: NPFL103
Semester: winter
E-credits: 5
Examination: 2/2 C+Ex
Lecturer: Pavel Pecina, pecina@ufal.mff.cuni.cz

Language: The course is taught in English. All the materials are in English, the homework assignments and/or exam can be completed in English or Czech.

### Timespace Coordinates

• Mondays, 15:40-17:10 (18:50) Zoom/S8 Malostranské nám. 25
• Consultations upon request.

### News

Oct 5

• The course in 2020/21 will be taught via Zoom (meeting details will be available to all enrolled students).
• The first class will take place on Oct 5 at 15:40.

### Prerequisities

No formal prerequisities are required. Students should have a substantial programming experience and be familar with basic algorithms, data structures, and statistical/probabilistic concepts.

### Passing Requirements

To pass the course, students need to complete two homework assignments and a written test. See grading for more details.

Note: The slides available on this page might get updated during the semestr. For each lecture, any updates will be published before the lecture starts.

#### 1. Introduction, Boolean retrieval, Inverted index, Text preprocessing

Oct 05 Slides Questions

• definition of Information Retrieval
• boolean model, boolean queries, boolean search
• inverted index, dictionary, postings, index construction, postings list intersection
• text processing, tokenization, term normalization, diacritics, case folding, stop words, lemmatization, stemming, Porter algorithm
• phrase queries, biword index, positional index, proximity search

#### 2. Dictionaries, Tolerant retrieval, Spelling correction

Oct 12 Slides Questions

• dictionary data structures, hashes, trees
• wildcard queries, permuterm index, k-gram index
• spelling correction, correcting documents, correcting queries, spellchecking dictionaries
• edit distance, Levenshtein distance, weighted edit distance, query spelling correction
• Soundex

#### 3. Index construction and index compression

Oct 19 Slides Questions

• Reuters RCV1 collection
• sort-based index construction, Blocked sort-based indexing, Single-pass in-memory indexing
• distributed indexing, Map Reduce, dynamic indexing
• index compression, Heap's law, Zipf's law, dictionary compression, postings compression

#### 4. Ranked retrieval, Term weighting, Vector space model

Oct 26 Slides Questions

• ranked retrieval vs. boolen retrieval, query-document scoring, Jaccard coefficient
• Bag-of-words model, term-frequency weighing, document-frequency weighting, tf-idf, collection frequency vs. document frequency
• vector space model, measuring similarity, distance vs. angle, cosine similarity
• length normalization, cosine normalization, pivot normalization

#### 5. Ranking, Complete search system, Evaluation, Benchmarks

• ranking, motivation and implementation, document-at-a-time vs. term-at-a-time processing
• tiered indexes
• query processing, query parser
• evaluation, precision, recall, F-measure, confusion matrix, precision-recall trade-off, precision-recall curves, average precision, mean everage precision, averaged precision-recall graph, A/B testing
• existing benchmarks

#### 6. Result summaries, Relevance Feedback, Query Expansion

Nov 9 Slides

• static summaries, dynamic summaries
• relevance feedback, Rocchio algorithm, pseudo-relevance feedback
• query expansion, thesaurus construction

#### 7. Probabilistic information retrieval

Nov 16 Slides

• probabilistic approaches to IR
• Probabilistic Ranking Principle
• Binary Independence Models
• Okapi BM25

#### 8. Language models, Text classification

Nov 23 Slides

• language models for IR
• text classification
• Naive Bayes classifier
• evaluation of text classification, micro averaging, macro averaging,

#### 9. Vector space classification

• vector space classification
• k Nearest Neighbors
• linear classifiers
• Support Vector Machines

#### 10. Document clustering

Dec 7 Slides

• document clustering in IR
• vector space clustering
• K-means clustering
• setting number of clusters
• evaluation of clustering
• hierarchical clustering, dendrogram, cluster similarity measures

#### 11. Latent Semantic Indexing

Dec 14 Slides

• Singular Value Decomposition
• dimensionality reduction
• LSI in information retrieval
• LSI as soft clustering

#### 12. Web search, Crawling, Duplicate detection, Spam detection

Dec 21 Slides

Lecture 1 Questions

##### Exercise 1.1

Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system. Recommend a query processing order for the following two queries:

a) trees AND skies AND kaleidoscope

b) (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

and the postings list sizes:

Term Postings Size
eyes 213,312
kaleidoscope 87,009
skies 271,658
tangerine 46,653
trees 316,812
##### Exercise 1.2

For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal?
Explain why it is, or give an example where it isn’t.

##### Exercise 1.3

How should the following Boolean query be handled?

x AND NOT y

Why is naive evaluation of this query normally very expensive?
Write out a postings merge algorithm that evaluates this query efficiently.

Lecture 2 Questions

##### Exercise 2.1

Are the following statements true or false?

a. In a Boolean retrieval system, stemming never lowers precision.
b. In a Boolean retrieval system, stemming never lowers recall.
c. Stemming increases the size of the vocabulary.
d. Stemming should be invoked at indexing time but not while processing a query.

##### Exercise 2.2

The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs would you argue shouldn’t be conflated. Give your reasoning.

a. abandon/abandonment
b. absorbency/absorbent
c. marketing/markets
d. university/universe
e. volume/volumes

##### Exercise 2.3

Assume a biword index. Give an example of a document which will be returned for a query of New York University but is actually a false positive which should not be returned.

##### Exercise 2.4

How could an IR system combine use of a positional index and use of stop words? What is the potential problem, and how could it be handled?

##### Exercise 2.5

Write down the entries in the permuterm index dictionary that are generated by the term mama.

##### Exercise 2.6

Vocabulary terms in the postings of in a k-gram index are lexicographically ordered. Why is this ordering useful?

##### Exercise 2.7

If |a| denotes the length of string a, show that the edit distance between a and b is never more than max{|a|, |b|}.

Lecture 3 Questions

##### Exercise 3.1

For n = 2 and 1 ≤ T ≤ 30, perform a step-by-step simulation of the Logarithmic merge algorithm. Create a table that shows, for each point in time at which T = 2 ∗ k tokens have been processed (1 ≤ k ≤ 15), which of the three indexes l0, . . . , l3 are in use. The first three lines of the table are given below.

I 3  I 2  I 1  I 0
2 0 0 0 0
4 0 0 0 1
6 0 0 1 0
8 ? ? ? ?
##### Exercise 3.2

An auxiliary index can impair the quality of collection statistics. An example is the term weighting method idf, which is defined as log(N/dfi) where N is the total number of documents and dfi is the number of documents that term i occurs in. Show that even a small auxiliary index can cause significant error in idf when it is computed on the main index only. Consider a rare term that suddenly occurs frequently (e.g., Flossie as in Tropical Storm Flossie).

##### Exercise 3.3

Assuming one machine word per posting, what is the size of the uncompressed (nonpositional) index for different tokenizations based on Slide 48? How do these numbers compare with numbers on Slide 71?

##### Exercise 3.4

Estimate the space usage of the Reuters-RCV1 dictionary with blocks of size k = 8 and k = 16 in blocked dictionary storage.

Lecture 4 Questions

##### Exercise 4.1

What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.

##### Exercise 4.2

Can the tf-idf weight of a term in a document exceed 1?

##### Exercise 4.3

How does the base of the logarithm in idf affect the calculation of the tf-idf score for a given query and a given document? How does the base of the logarithm affect the relative scores of two documents on a given query?

##### Exercise 4.4

If we were to stem jealous and jealousy to a common stem before setting up the vector space, detail how the definitions of tf and idf should be modified.

Note: Detailed specification of the assignments (with a link to data download) will be distributed via email.

### 1. Vector space models

Deadline: Nov 29, 2020 23:59  100 points

Design, develop and evaluate your own retrieval system based on vector space model.

### 2. Retrieval frameworks

Deadline: Jan 3, 2021, 23:39  100 points

Design, develop and evaluate a state-of-the-art retrieval system using an off-the-shelf retrieval framework of your choice.

### Homework assignments

• There are two homework assignments during the semester with a fixed deadline announced on the webpage.
• The assignments are to be worked on independently and require a substantial amount of programming, experimentation, and reporting to complete.
• Students will present their solutions during the practicals in 10 minute presentations.
• The assignments will be awarded by 0-100 points each.
• Late submissions received up to 2 weeks after the deadline will be penalized by 50% point reduction.
• Submissions received later than 2 weeks after the deadline will be awarded 0 points.

### Exam

• There will be an exam in a form of a written test at the end of semester.
• The test includes approximately 20 short-answer questions covered by the topics discussed during the lectures.
• The maximum duration of the test is 90 minutes.
• The test will be graded by 0-100 points.

• Completion of both the homework assignments and exam is required to pass the course.
• The students need to earn at least 50 points for each assignment (before late submission penalization) and at least 50 points for the test.
• The points received for the assignments and the test will be available in SIS.
• The final grade will be based on the average results of the exam and the two homework assignments, all three weighted equally:
• ≥ 90%: grade 1 (excellent)
• ≥ 70%: grade 2 (very good)
• ≥ 50%: grade 3 (good)
• < 50%: grade 4 (fail)

### Plagiarism

• No plagiarism will be tolerated.
• All cases of plagiarism will be reported to the Student Office.

#### Introduction to Information Retrieval

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Cambridge University Press, 2008, ISBN: 978-0521865715.

Available online.

#### Information Retrieval, Algorithms and Heuristics

David A. Grossman and Ophir Frieder, Springer, 2004, ISBN 978-1402030048.

#### Modern Information Retrieval

Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley, 1999, ISBN: 978-0201398298.