NPFL103 – Information Retrieval

"Information retrieval is the task of searching a body of information for objects that statisfied an information need."

This course is offered at the Faculty of Mathematics and Physics to graduate students interested in information retrieval, web search, document classification, and related areas. It is based on the book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. The course covers both the foundations of information retrieval and some more advanced topics.

About

SIS code: NPFL103
Semester: winter
E-credits: 6
Examination: 2/2 C+Ex
Lecturer: Pavel Pecina, pecina@ufal.mff.cuni.cz

Language: The course is taught in English. All materials are in English; the homework assignments and the exam can be completed in either English or Czech.

Timespace Coordinates

  • Mondays, 15:40-17:10 (18:50) S11, Malostranské nám. 25
  • Consultations upon request by email.

News

 Oct 15 The lecture room has changed to S11.

 Oct 08 The course will start on Oct 8 at 15:40 in S1.


Lectures

1. Introduction, Boolean retrieval, Inverted index, Text preprocessing Slides

2. Dictionaries, Tolerant retrieval, Spelling correction Slides

3. Index construction and index compression Slides

4. Ranked retrieval, Term weighting, Vector space model Slides

5. Ranking, Complete search system, Evaluation, Benchmarks Slides (Assignment 1: Vector space models)

6. Result summaries, Relevance Feedback, Query Expansion Slides

7. Probabilistic information retrieval Slides

8. Language models, Text classification Slides

9. Vector space classification Slides

10. Document clustering Slides (Assignment 2: Retrieval frameworks)

11. Latent Semantic Indexing Slides

12. Web search, Crawling, Duplicate and spam detection, PageRank Slides

13. Review session, Questions and Answers


Assignments

1. Vector space models

2. Retrieval frameworks


Prerequisites

No formal prerequisites are required. Students should have substantial programming experience and be familiar with basic algorithms, data structures, and statistical/probabilistic concepts.

Passing Requirements

To pass the course, students need to complete two homework assignments and pass a written test. See Grading below for more details.

Note: The slides available on this page may be updated during the semester. For each lecture, any updates will be published before the lecture starts.

1. Introduction, Boolean retrieval, Inverted index, Text preprocessing

 Oct 08 Slides

  • definition of information retrieval
  • boolean model, boolean queries, boolean search
  • term incidence, inverted index, dictionary, postings, index construction, postings list intersection (see the sketch below)
  • text processing, tokenization, term normalization, diacritics, case folding, stop words, lemmatization, stemming, Porter algorithm
  • phrase queries, biword index, positional index, proximity search
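
For a first intuition, here is a minimal Python sketch of the core ideas above: building a term-to-postings mapping over a toy collection and intersecting two sorted postings lists for a Boolean AND query. The documents, tokenization, and function names are illustrative only, not part of the course materials.

    # A toy inverted index and Boolean AND via postings-list intersection.
    from collections import defaultdict

    def build_index(docs):
        """Map each term to a sorted list of document IDs containing it."""
        index = defaultdict(set)
        for doc_id, text in enumerate(docs):
            for term in text.lower().split():      # naive tokenization
                index[term].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    def intersect(p1, p2):
        """Merge two sorted postings lists in O(len(p1) + len(p2))."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    docs = ["new home sales top forecasts",
            "home sales rise in july",
            "increase in home sales in july"]
    index = build_index(docs)
    print(intersect(index["home"], index["july"]))  # -> [1, 2]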

2. Dictionaries, Tolerant retrieval, Spelling correction

 Oct 15 Slides

  • dictionary data structures, hashes, trees
  • wildcard queries, permuterm index, k-gram index
  • spelling correction, correcting documents, correcting queries, spellchecking dictionaries
  • edit distance, Levenshtein distance, weighted edit distance, query spelling correction (see the sketch below)
  • Soundex
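
As a small illustration of the edit-distance topic above, here is a minimal Python sketch of Levenshtein distance with unit insertion, deletion, and substitution costs, computed by dynamic programming; the example strings are made up.

    # Levenshtein edit distance via dynamic programming (unit costs).
    def edit_distance(a, b):
        m, n = len(a), len(b)
        # dp[i][j] = distance between a[:i] and b[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution / match
        return dp[m][n]

    print(edit_distance("information", "informatoin"))  # -> 2 (a transposition costs 2 edits)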

3. Index construction and index compression

 Oct 22 Slides

  • RCV1 collection
  • sort-based index construction, Blocked sort-based indexing, Single-pass in-memory indexing
  • distributed indexing, MapReduce, dynamic indexing
  • index compression, Heaps' law, Zipf's law, dictionary compression, postings compression (see the sketch below)
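
As an illustration of postings compression, here is a minimal Python sketch of gap-based variable-byte encoding (gaps between docIDs stored in 7-bit chunks with a continuation bit); the example postings list is illustrative only.

    # Variable-byte (VB) encoding of docID gaps for postings compression.
    def vb_encode_number(n):
        chunks = []
        while True:
            chunks.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        chunks[-1] += 128                 # set the continuation bit on the last byte
        return bytes(chunks)

    def vb_encode(postings):
        out, prev = bytearray(), 0
        for doc_id in postings:
            out += vb_encode_number(doc_id - prev)   # encode the gap, not the docID
            prev = doc_id
        return bytes(out)

    def vb_decode(data):
        postings, n, prev = [], 0, 0
        for byte in data:
            if byte < 128:
                n = 128 * n + byte
            else:
                n = 128 * n + (byte - 128)
                prev += n                 # undo the gap encoding
                postings.append(prev)
                n = 0
        return postings

    postings = [824, 829, 215406]
    encoded = vb_encode(postings)
    print(len(encoded), vb_decode(encoded))   # -> 6 [824, 829, 215406]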

4. Ranked retrieval, Term weighting, Vector space model

 Oct 29 Slides

  • ranked retrieval vs. Boolean retrieval, query-document scoring, Jaccard coefficient
  • bag-of-words model, term-frequency weighting, document-frequency weighting, tf-idf, collection frequency vs. document frequency
  • vector space model, measuring similarity, distance vs. angle, cosine similarity (see the sketch below)
  • length normalization, cosine normalization, pivot normalization
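
As a small illustration of the weighting and similarity topics above, here is a minimal Python sketch of tf-idf vectors (log term frequency, standard idf) with cosine normalization, scored against a query by the dot product of unit-length vectors; the toy documents and query are made up.

    # tf-idf weighting with cosine normalization and query-document cosine scoring.
    import math
    from collections import Counter

    def tf_idf_vector(tokens, df, n_docs):
        tf = Counter(tokens)
        vec = {t: (1 + math.log10(c)) * math.log10(n_docs / df[t])
               for t, c in tf.items() if t in df}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}     # cosine normalization

    docs = [d.lower().split() for d in
            ["car insurance auto insurance", "best car repair", "auto repair shop"]]
    df = Counter(t for d in docs for t in set(d))        # document frequencies
    doc_vecs = [tf_idf_vector(d, df, len(docs)) for d in docs]

    query = "car insurance".split()
    q_vec = tf_idf_vector(query, df, len(docs))
    scores = [sum(q_vec.get(t, 0.0) * w for t, w in v.items()) for v in doc_vecs]
    print(scores)   # the first document scores highest for this query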

5. Ranking, Complete search system, Evaluation, Benchmarks

 Nov 5 Slides (Assignment 1: Vector space models)

  • ranking, motivation and implementation, document-at-a-time vs. term-at-a-time processing
  • tiered indexes
  • query processing, query parser
  • evaluation, precision, recall, F-measure, confusion matrix, precision-recall trade-off, precision-recall curves, average precision, mean average precision, averaged precision-recall graph, A/B testing (see the sketch below)
  • existing benchmarks
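
As a small illustration of the evaluation measures above, here is a minimal Python sketch of set-based precision, recall, and F1, plus rank-based average precision for a single query; the ranking and relevance judgments are made up.

    # Precision, recall, F1, and average precision for one query.
    def precision_recall_f1(retrieved, relevant):
        retrieved, relevant = set(retrieved), set(relevant)
        tp = len(retrieved & relevant)
        precision = tp / len(retrieved) if retrieved else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    def average_precision(ranking, relevant):
        """Mean of precision@k over the ranks k of relevant documents."""
        relevant = set(relevant)
        hits, precisions = 0, []
        for k, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                hits += 1
                precisions.append(hits / k)
        return sum(precisions) / len(relevant) if relevant else 0.0

    ranking = [3, 7, 1, 9, 4]          # system output, best first
    relevant = [1, 3, 8]               # ground-truth relevant documents
    print(precision_recall_f1(ranking, relevant))   # -> (0.4, 0.666..., 0.5)
    print(average_precision(ranking, relevant))     # -> (1/1 + 2/3) / 3 ≈ 0.556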

6. Result summaries, Relevance Feedback, Query Expansion

 Nov 12 Slides


7. Probabilistic information retrieval

 Nov 19 Slides


8. Language models, Text classification

 Nov 26 Slides


9. Vector space classification

 Dec 3 Slides


10. Document clustering

 Dec 10 Slides (Assignment 2: Retrieval frameworks)


11. Latent Semantic Indexing

 Dec 17 Slides


12. Web search, Crawling, Duplicate and spam detection, PageRank

 Jan 7 Slides


13. Review session, Questions and Answers

 Jan 14


Note: Detailed specifications of the assignments will be distributed via email. The deadlines are strict; no extensions will be granted.

1. Vector space models

 Deadline: Dec 2 23:59  100 points

Design, develop and evaluate your own retrieval system based on vector space models.

2. Retrieval frameworks

 Deadline: Jan 6 23:59  100 points

Design, develop and evaluate a state-of-the-art retrieval system using an off-the-shelf retrieval framework.

Homework assignments

  • There are two homework assignments during the semester.
  • The assignments are to be completed independently and require a substantial amount of programming, experimentation, and reporting.
  • Students will present their solutions during the practicals in 10-minute presentations.
  • Each assignment is graded on a scale of 0-100 points.

Exam

  • The exam takes the form of a written test scheduled at the end of the semester.
  • The test consists of approximately 20 short-answer questions covering the topics discussed during the lectures.
  • The maximum duration of the test is 90 minutes.
  • The test is graded on a scale of 0-100 points.

Grading

  • Completion of both homework assignments and the exam is required to pass the course.
  • Students must earn at least 50 points for each assignment and at least 50 points for the test.
  • The final grade is based on the average of the exam result and the two homework assignment results, all three weighted equally:
    • ≥ 90%: grade 1 (excellent)
    • ≥ 70%: grade 2 (very good)
    • ≥ 50%: grade 3 (good)
    • < 50%: grade 4 (fail)

Plagiarism

  • No plagiarism will be tolerated.
  • All cases of plagiarism will be reported to the Student Office.

Recommended Reading

Introduction to Information Retrieval

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Cambridge University Press, 2008. ISBN: 978-0521865715.

Available online.

Information Retrieval, Algorithms and Heuristics

David A. Grossman and Ophir Frieder. Springer, 2004. ISBN: 978-1402030048.

Modern Information Retrieval

Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Addison Wesley, 1999. ISBN: 978-0201398298.