Needle in a haystack

NPFL103 – Information Retrieval

"Information retrieval is the task of searching a body of information for objects that statisfied an information need."

This course is offered at the Faculty of Mathematics and Physics to graduate students interested in information retrieval, web search, document classification, and related areas. It is based on the book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. The course covers both the foundations of information retrieval and some more advanced topics.

About

SIS code: NPFL103
Semester: summer
E-credits: 5
Examination: 2/2 C+Ex
Lecturer: Pavel Pecina, pecina@ufal.mff.cuni.cz

Language: The course is taught in English. All the materials are in English; the homework assignments and the exam can be completed in English, Czech, or Slovak.

Timespace Coordinates

  • Tuesdays, 12:20-13:50 (14:00-15:30) S9, Malostranské nám. 25
  • Consultations upon request.

News

  • The course will start on Feb 24, 2026
  • No lecture/practicals on March 10, 2026

Lectures

1. Introduction, Boolean retrieval, Inverted index, Text preprocessing Slides Questions

2. Dictionaries, Tolerant retrieval, Spelling correction Slides Questions


Assignments


Prerequisites

No formal prerequisites are required. Students should have substantial programming experience and be familiar with basic algorithms, data structures, and statistical/probabilistic concepts.

Passing Requirements

To pass the course, students need to complete two homework assignments and a written test. See grading for more details.

License

Where stated, the teaching materials for this course are available under CC BY-SA 4.0.

Note: The slides available on this page might get updated during the semester. For each lecture, any updates will be published before the lecture starts.

1. Introduction, Boolean retrieval, Inverted index, Text preprocessing

 Feb 24 Slides Questions

  • definition of Information Retrieval
  • boolean model, boolean queries, boolean search
  • inverted index, dictionary, postings, index construction, postings list intersection
  • text processing, tokenization, term normalization, diacritics, case folding, stop words, lemmatization, stemming, Porter algorithm
  • phrase queries, biword index, positional index, proximity search
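
To make the inverted-index and intersection topics above concrete, here is a minimal Python sketch (build_index and intersect are illustrative names, and the whitespace tokenizer is a deliberate simplification of the text preprocessing discussed in the slides):

  from collections import defaultdict

  def build_index(docs):
      """Toy inverted index: term -> sorted list of docIDs."""
      index = defaultdict(set)
      for doc_id, text in docs.items():
          for token in text.lower().split():   # naive whitespace tokenization
              index[token].add(doc_id)
      return {term: sorted(ids) for term, ids in index.items()}

  def intersect(p1, p2):
      """Two-pointer intersection of two sorted postings lists (x AND y)."""
      i, j, answer = 0, 0, []
      while i < len(p1) and j < len(p2):
          if p1[i] == p2[j]:
              answer.append(p1[i]); i += 1; j += 1
          elif p1[i] < p2[j]:
              i += 1
          else:
              j += 1
      return answer

With such an index, a query like home AND sales reduces to intersect(index["home"], index["sales"]).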

2. Dictionaries, Tolerant retrieval, Spelling correction

 Mar 3 Slides Questions

  • dictionary data structures, hashes, trees
  • wildcard queries, permuterm index, k-gram index
  • spelling correction, correcting documents, correcting queries, spellchecking dictionaries
  • edit distance, Levenshtein distance, weighted edit distance, query spelling correction
  • Soundex
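
As a concrete illustration of the last topic, here is a rough Python sketch of Soundex code construction, following the simplified four-character scheme from the textbook (the original algorithm's special handling of h and w is omitted, so details may differ slightly from the variant in the slides):

  def soundex(term):
      """Four-character Soundex code: first letter + three digits (assumes a non-empty alphabetic term)."""
      term = term.upper()
      codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
               **dict.fromkeys("DT", "3"), "L": "4",
               **dict.fromkeys("MN", "5"), "R": "6"}
      digits = [codes.get(ch, "0") for ch in term]   # vowels, H, W, Y -> 0
      dedup = [digits[0]]
      for d in digits[1:]:                           # drop one of each pair of identical adjacent digits
          if d != dedup[-1]:
              dedup.append(d)
      body = "".join(d for d in dedup[1:] if d != "0")
      return (term[0] + body + "000")[:4]            # keep the original first letter, pad with zeros

  print(soundex("herman"))   # H655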

Lecture 1 Questions

Key concepts

Information retrieval, Information need, Query, Document relevance; Boolean model, Boolean query; Word, Term, Token; Text normalization, Stemming, Lemmatization; Stopwords, Stopword removal; Inverted index, Dictionary, Postings, Postings intersection; Biword index, Positional index; Phrase search, Proximity search.

Exercise 1.1

Draw the postings lists for the terms rise and home that would be built for the following document collection:

Doc1:  new home sales top forecasts
Doc2:  home sales rise in july
Doc3:  increase in home sales in july
Doc4:  july new home sales rise
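
If you want to cross-check your drawing, a throwaway Python sketch along these lines builds the two postings lists from the toy collection (nothing beyond the simple term-to-docID mapping from the lecture):

  from collections import defaultdict

  docs = {1: "new home sales top forecasts",
          2: "home sales rise in july",
          3: "increase in home sales in july",
          4: "july new home sales rise"}

  postings = defaultdict(list)                 # term -> sorted list of docIDs
  for doc_id, text in sorted(docs.items()):
      for term in sorted(set(text.split())):
          postings[term].append(doc_id)

  print(postings["rise"], postings["home"])    # compare with your drawing
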
Exercise 1.2

Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system. Recommend a query processing order for the following two queries:

  1. trees AND skies AND kaleidoscope
  2. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

given the following postings list sizes:

Term           Postings size
eyes                 213,312
kaleidoscope          87,009
marmalade            107,913
skies                271,658
tangerine             46,653
trees                316,812
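
One common heuristic from the textbook is to process conjuncts in increasing order of their postings sizes, estimating the size of an OR of terms by the sum of its members' sizes; the sketch below computes such an ordering from the table (order_conjuncts is an illustrative helper, not part of any real system):

  sizes = {"eyes": 213_312, "kaleidoscope": 87_009, "marmalade": 107_913,
           "skies": 271_658, "tangerine": 46_653, "trees": 316_812}

  def order_conjuncts(conjuncts):
      """Sort conjuncts by estimated postings size (sum of term sizes upper-bounds an OR)."""
      return sorted(conjuncts, key=lambda group: sum(sizes[t] for t in group))

  print(order_conjuncts([("trees",), ("skies",), ("kaleidoscope",)]))
  print(order_conjuncts([("tangerine", "trees"),
                         ("marmalade", "skies"),
                         ("kaleidoscope", "eyes")]))
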
Exercise 1.3

For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal?
Explain why it is, or give an example where it isn’t.

Exercise 1.4

Write out a postings merge algorithm for the query:

x OR y
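
For comparison with your own answer, one possible linear-time merge for x OR y could look like the following sketch (p1 and p2 stand for the sorted postings lists of x and y):

  def union(p1, p2):
      """Merge two sorted postings lists: documents containing x OR y."""
      i, j, answer = 0, 0, []
      while i < len(p1) and j < len(p2):
          if p1[i] == p2[j]:
              answer.append(p1[i]); i += 1; j += 1
          elif p1[i] < p2[j]:
              answer.append(p1[i]); i += 1
          else:
              answer.append(p2[j]); j += 1
      return answer + p1[i:] + p2[j:]   # append whatever remains of either list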

Exercise 1.5

How should the following Boolean query be handled?

x AND NOT y

Why is naive evaluation of this query normally very expensive?
Write out a postings merge algorithm that evaluates this query efficiently.
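
One possible merge for x AND NOT y that never materializes the complement of y's postings, again only as a sketch to compare against your own solution:

  def and_not(p1, p2):
      """Documents in postings list p1 (for x) that do not appear in p2 (for y)."""
      i, j, answer = 0, 0, []
      while i < len(p1) and j < len(p2):
          if p1[i] == p2[j]:
              i += 1; j += 1                # document contains y, skip it
          elif p1[i] < p2[j]:
              answer.append(p1[i]); i += 1  # document contains x and cannot contain y here
          else:
              j += 1
      return answer + p1[i:]                # remaining x documents cannot contain y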

Exercise 1.6

Are the following statements true or false?

  1. In a Boolean retrieval system, stemming never lowers precision.
  2. In a Boolean retrieval system, stemming never lowers recall.
  3. Stemming increases the size of the vocabulary.
  4. Stemming should be invoked at indexing time but not while processing a query.
Exercise 1.7

The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs would you argue should not be conflated? Give your reasoning.

  1. abandon/abandonment
  2. absorbency/absorbent
  3. marketing/markets
  4. university/universe
  5. volume/volumes

Lecture 2 Questions

Key concepts

Implementation of dictionary and postings in inverted index, B-trees; Wildcard queries, Permuterm index, k-gram index; Spelling correction (isolated, context-sensitive), Spelling correction in IR, (Weighted) Edit distance, Levenshtein distance, Soundex.

Exercise 2.1

Assume a biword index. Give an example of a document which will be returned for a query of New York University but is actually a false positive which should not be returned.

Exercise 2.2

How could an IR system combine use of a positional index and use of stop words? What is the potential problem, and how could it be handled?

Exercise 2.3

In the permuterm index, each permuterm vocabulary term points to the original vocabulary term(s) from which it was derived. How many original vocabulary terms can there be in the postings list of a permuterm vocabulary term?

Exercise 2.4

Write down the entries in the permuterm index dictionary that are generated by the term mama.

Exercise 2.5

If you wanted to search for s*ng in a permuterm wildcard index, what key(s) would you do the lookup on?
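
To experiment with Exercises 2.4 and 2.5, here is a small sketch of the permuterm construction using the $ end-marker convention from the textbook (permuterm_keys and wildcard_lookup_key are illustrative names):

  def permuterm_keys(term):
      """All rotations of term$ that are entered into the permuterm dictionary."""
      padded = term + "$"
      return [padded[i:] + padded[:i] for i in range(len(padded))]

  def wildcard_lookup_key(query):
      """Rotate query$ so the single * comes last; everything before the * is the prefix to look up."""
      padded = query + "$"
      star = padded.index("*")
      return (padded[star + 1:] + padded[:star + 1])[:-1]

  print(permuterm_keys("mama"))        # entries generated by the term mama
  print(wildcard_lookup_key("s*ng"))   # prefix to look up in the permuterm index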

Exercise 2.6

Vocabulary terms in the postings lists of a k-gram index are lexicographically ordered. Why is this ordering useful?

Exercise 2.7

If |a| denotes the length of string a, show that the edit distance between a and b is never more than max{|a|, |b|}.

Exercise 2.8

Compute the edit distance between paris and alice. Write down the 5 × 5 array of distances between all prefixes as computed by the Levenshtein algorithm.
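
To check your array, here is a sketch of the standard unit-cost Levenshtein dynamic program (the textbook's version additionally records which operation produced each cell, which is omitted here):

  def levenshtein_table(a, b):
      """Dynamic-programming table of edit distances between all prefixes of a and b."""
      m, n = len(a), len(b)
      dist = [[0] * (n + 1) for _ in range(m + 1)]
      for i in range(m + 1):
          dist[i][0] = i                               # delete all of a[:i]
      for j in range(n + 1):
          dist[0][j] = j                               # insert all of b[:j]
      for i in range(1, m + 1):
          for j in range(1, n + 1):
              cost = 0 if a[i - 1] == b[j - 1] else 1
              dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                               dist[i][j - 1] + 1,         # insertion
                               dist[i - 1][j - 1] + cost)  # substitution or copy
      return dist

  for row in levenshtein_table("paris", "alice"):
      print(row)                                       # bottom-right cell is the edit distance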

Exercise 2.9

Consider the four-term query catched in the rye and suppose that each of the query terms has five alternative terms suggested by isolated-term correction. How many possible corrected phrases must we consider if we do not trim the space of corrected phrases, but instead try all six variants for each of the terms?


Homework assignments

  • There are two homework assignments during the semester with a fixed deadline announced on the webpage.
  • The assignments are to be worked on independently and require a substantial amount of programming, experimentation, and reporting to complete.
  • Students will present their solutions during the practicals in 5-10 minute presentations.
  • Each assignment will be awarded 0-100 points.
  • Late submissions received up to 2 weeks after the deadline will be penalized by a 50% point reduction.
  • Submissions received later than 2 weeks after the deadline will be awarded 0 points.

Exam

  • The exam takes the form of a written test at the end of the semester (up to three examination dates are offered).
  • The test includes approximately 20 questions covering the topics discussed during the lectures, similar to the ones in the lecture questions.
  • The maximum duration of the test is 90 minutes.
  • The test will be graded with 0-100 points.

Final Grading

  • Completion of both the homework assignments and exam is required to pass the course.
  • Students need to earn at least 50 points for each assignment (before any late-submission penalty) and at least 50 points for the test.
  • The points received for the assignments and the test will be available in the Study group roster in SIS.
  • The final grade will be based on the average results of the exam and the two homework assignments, all three weighted equally:
    • ≥ 90%: grade 1 (excellent)
    • ≥ 70%: grade 2 (very good)
    • ≥ 50%: grade 3 (good)
    • < 50%: grade 4 (fail)

Plagiarism

  • No plagiarism will be tolerated.
  • All cases of plagiarism will be reported to the Student Office.

Introduction to Information Retrieval

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Cambridge University Press, 2008. ISBN 978-0521865715.

Available online.

Information Retrieval, Algorithms and Heuristics

David A. Grossman and Ophir Frieder. Springer, 2004. ISBN 978-1402030048.

Modern Information Retrieval

Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Addison Wesley, 1999. ISBN 978-0201398298.