Unsupervised Machine Learning in NLP

The seminar focuses on a deeper understanding of selected unsupervised machine learning methods and is intended for students who already have basic knowledge of machine learning and probability models. The first half of the semester is devoted to unsupervised learning with Bayesian inference (Dirichlet-Categorical models, Mixture of Categoricals, Mixture of Gaussians, Expectation-Maximization, Gibbs sampling) and to implementing these methods on selected tasks. The remaining lectures are devoted to clustering methods, component analysis, and unsupervised methods for inspecting deep neural networks.

About

SIS code: NPFL097
Semester: winter
E-credits: 3
Examination: 1/1 C
Guarantor: David Mareček

Timespace Coordinates

  • The lectures are given on Thursdays, 9:00 - 10:30, in room S11. The first lecture is on Sep 30.
  • Due to the low number of students enrolled so far, I have decided to hold only one section of the course, taught in English.

Course prerequisites

Students are expected to be familiar with basic probabilistic concepts, roughly to the extent covered by:

  • NPFL067 - Statistical methods in NLP I

In the second half of the course, it will be an advantage if you know the basics of deep-learning methods. I recommend attending a deep-learning course beforehand.

Course passing requirements

  • There are three programming assignments during the term, each worth 10 points. Assignments submitted after the deadline can receive at most half of the points.
  • At the end of the course, there will be a test worth an additional 15 points: you will get 5 questions from the list below, each worth 3 points.
  • You pass the course if you obtain at least 30 points.

Lectures

1. Introduction. Materials: Slides

2. Beta-Bernoulli probabilistic model. Materials: Beta-Bernoulli (by C.E.Rasmussen), Beta distribution

3. Dirichlet-Categorical probabilistic model, Modeling document collections. Materials: Dirichlet-Categorical (by C.E.Rasmussen), Posteriors and Predictions, Document collections (by C.E.Rasmussen), Categorical Mixture Models (by C.E.Rasmussen), Modeling Document Collections

4. Expectation-Maximization, Bayesian Mixture Models, Latent Dirichlet Allocation. Materials: Modeling Document Collections, Latent Dirichlet allocation (by C.E.Rasmussen)

5. Gibbs Sampling in Latent Dirichlet Allocation, Entropy, Assignment 1. Materials: Gibbs Sampling (by C.E.Rasmussen), Gibbs Sampling, Latent Dirichlet allocation (by C.E.Rasmussen), Algorithms for LDA and Mixture of Categoricals, Latent Dirichlet Allocation

6. Chinese Restaurant Process. Materials: Chinese Restaurant Process, CRP Demo, Bayesian Inference with Tears

7. Unsupervised Text Segmentation. Materials: Chinese Restaurant Process, Chinese Segmentation

8. Unsupervised Word-Alignment, POS tagging, and Dependency Parsing. Materials: Tagging, Alignment, Parsing, EM algorithm for Word Alignment (by P.Koehn)

9. K-Means clustering, Mixture of Gaussians. Materials: K-Means and Gaussian Mixture Models (by D.Rosenberg)

10. Agglomerative Clustering, Evaluation methods. Materials: Agglomerative Clustering (by A.Chouldechova), Clustering Methods and Evaluation

11. Dimensionality Reduction. Materials: Dimensionality Reduction, t-SNE and PCA demo, Clustering and Component Analysis on Word Vectors

12. Final test, Interpretation of Neural Networks. Materials: Interpretation of Neural Networks, Hidden in the Layers

1. Introduction

 Sep 30

  • Course overview. Materials: Slides
  • Revision of the basics of probability and machine learning theory

2. Beta-Bernoulli probabilistic model

 Oct 07

3. Dirichlet-Categorical probabilistic model, Modeling document collections

 Oct 14

4. Expectation-Maximization, Bayesian Mixture Models, Latent Dirichlet Allocation

 Oct 21

5. Gibbs Sampling in Latent Dirichlet Allocation, Entropy, Assignment 1

 Nov 04

6. Chinese Restaurant Process

 Nov 11

7. Unsupervised Text Segmentation

 Nov 18

8. Unsupervised Word-Alignment, POS tagging, and Dependency Parsing

 Nov 25

9. K-Means clustering, Mixture of Gaussians

 Dec 02

10. Agglomerative Clustering, Evaluation methods

 Dec 09

11. Dimensionality Reduction

 Dec 16

12. Final test, Interpretation of Neural Networks

 Jan 06

Latent Dirichlet Allocation

 Deadline: Nov 25, 23:59  10 points

Chinese Segmentation

 Deadline: Dec 09, 23:59  10 points

Clustering and Component Analysis on Word Vectors

 Deadline: Jan 20, 23:59  10 points

List of questions for the final test

  1. Define the Beta distribution and describe its parameters. Plot (roughly) the following distributions: Beta(1,1), Beta(0.1,0.1), Beta(10, 10).

  2. Derive the posterior distribution from the prior (Beta distribution) and likelihood (Binomial distribution). Derive the predictive distribution for the Beta-Bernoulli posterior. (A worked sketch of this derivation, together with the Dirichlet-Categorical case from question 4, is given after this list.)

  3. Explain the Dirichlet distribution and describe its parameters. Plot (roughly) the following distributions: Dir(1,1,1), Dir(0.1,0.1,0.1), Dir(10, 10, 10).

  4. Derive the posterior distribution from the prior (Dirichlet distribution) and likelihood (Multinomial distribution). Derive the predictive distribution for the Dirichlet-Categorical posterior.

  5. Explain the "Mixture of Categoricals" model (a topic is assigned to each document) for modeling document collections. Describe all its parameters and hyperparameters. From what distributions are they drawn? Describe the Expectation-Maximization algorithm for training such a model.

  6. Explain the Latent Dirichlet Allocation model (a topic is assigned to each word in each document). Describe all its parameters and hyperparameters. From what distributions are they drawn? What are the latent variables? Describe the learning algorithm.

  7. Explain Collapsed Gibbs sampling. Choose one unsupervised task from the lectures (word alignment, tagging, segmentation) and describe the basic algorithm. What is annealing? (A collapsed Gibbs sampler for LDA is sketched after this list.)

  8. Explain the Chinese Restaurant Process. What distributions does it generate? What is exchangeability? Explain its generalization to the Pitman-Yor process. (A short simulation sketch is given after this list.)

  9. Explain the K-means and Gaussian Mixture models for clustering. What are the advantages of the Gaussian Mixture model? Provide an example of clusters in 2D where K-means fails and where the Gaussian Mixture model works well.

  10. Explain Hierarchical Agglomerative clustering methods. What are their advantages over K-means? What linkage criteria do you know? Provide examples of clusters in 2D where these criteria fail.

  11. What is t-SNE? How does it work? What is it used for?

  12. What is Principal Component Analysis? How does it work? What is it used for? Explain it on a 2D example. (A small numpy sketch is given after this list.)
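
For questions 2 and 4, here is a compact sketch of the conjugate-posterior and predictive derivations; k denotes the number of successes among n Bernoulli trials and c_w the count of outcome w among n categorical draws (the notation is mine, not necessarily the lectures'):

    p(\theta \mid x_{1:n})
      \;\propto\; \underbrace{\theta^{k}(1-\theta)^{n-k}}_{\text{likelihood}}
      \;\underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{prior}}
      \;=\; \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1}
      \;\;\Rightarrow\;\; \theta \mid x_{1:n} \sim \mathrm{Beta}(\alpha+k,\; \beta+n-k)

    p(x_{n+1}=1 \mid x_{1:n})
      \;=\; \int_0^1 \theta\, p(\theta \mid x_{1:n})\, \mathrm{d}\theta
      \;=\; \frac{\alpha+k}{\alpha+\beta+n}

    \text{Analogously: }\;
      \boldsymbol{\theta} \mid x_{1:n} \sim \mathrm{Dir}(\alpha_1+c_1,\,\dots,\,\alpha_V+c_V),
      \qquad
      p(x_{n+1}=w \mid x_{1:n}) \;=\; \frac{\alpha_w+c_w}{\sum_{v=1}^{V}\alpha_v + n}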
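
For questions 6 and 7 (and Assignment 1), a minimal collapsed Gibbs sampler for LDA with symmetric priors, written as an illustrative Python sketch; the function name, interface, and hyperparameter values are assumptions, not the assignment's required solution:

    import numpy as np

    def lda_collapsed_gibbs(docs, K, V, alpha=0.1, beta=0.1, iters=100, seed=0):
        """Illustrative collapsed Gibbs sampler for LDA (not the reference solution).
        theta and phi are integrated out; only the topic assignments z are sampled."""
        rng = np.random.default_rng(seed)
        n_dk = np.zeros((len(docs), K))                       # topic counts per document
        n_kw = np.zeros((K, V))                               # word counts per topic
        n_k = np.zeros(K)                                     # total words per topic
        z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initialization
        for d, doc in enumerate(docs):                        # initial count tables
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    k = z[d][i]                               # remove the current assignment
                    n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                    # full conditional p(z_i = k | rest), up to normalization
                    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                    k = rng.choice(K, p=p / p.sum())          # resample the topic
                    z[d][i] = k
                    n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
        return z, n_dk, n_kw

    # Toy usage: 4 documents over a vocabulary of 6 word ids, 2 topics.
    docs = [[0, 1, 0, 2], [1, 0, 0], [3, 4, 5, 4], [5, 3, 4]]
    z, n_dk, n_kw = lda_collapsed_gibbs(docs, K=2, V=6, iters=200)

Resampling one assignment at a time from its full conditional, with the counts of the current token removed, is exactly the collapsed scheme asked about in question 7.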
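
For question 8, a short simulation of the Chinese Restaurant Process (the function name and interface are illustrative): each new customer joins an existing table with probability proportional to its occupancy and opens a new table with probability proportional to alpha.

    import numpy as np

    def crp(n_customers, alpha, seed=0):
        """Illustrative sketch: sample a random partition of n_customers from CRP(alpha)."""
        rng = np.random.default_rng(seed)
        table_sizes = []                                  # occupancy of each table
        seating = []                                      # table index of each customer
        for _ in range(n_customers):
            weights = np.array(table_sizes + [alpha], dtype=float)
            t = rng.choice(len(weights), p=weights / weights.sum())
            if t == len(table_sizes):
                table_sizes.append(1)                     # open a new table
            else:
                table_sizes[t] += 1                       # join an existing table
            seating.append(t)
        return seating, table_sizes

    # With alpha = 1.0, 100 customers typically end up at a few large tables plus
    # several singletons; the expected number of tables grows roughly as alpha * log n.
    seating, sizes = crp(100, alpha=1.0)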
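
For question 12, a minimal 2D PCA sketch using the eigendecomposition of the sample covariance matrix (a small numpy illustration with made-up data, not the lecture code):

    import numpy as np

    def pca(X):
        """Illustrative sketch: center the data, eigendecompose its covariance, and
        return the principal directions (columns) sorted by decreasing variance."""
        Xc = X - X.mean(axis=0)
        cov = Xc.T @ Xc / (len(Xc) - 1)            # sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
        order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
        return eigvals[order], eigvecs[:, order], Xc @ eigvecs[:, order]

    # Strongly correlated 2D data: the first principal component captures the
    # direction of largest variance, the second the residual orthogonal direction.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.5], [0.0, 0.5]])
    variances, directions, projected = pca(X)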

Recommended literature

  • Christopher Bishop: Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006 (read here)

  • Kevin P. Murphy: Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, Massachusetts, 2012 (read here)

  • David Mareček, Jindřich Libovický, Tomáš Musil, Rudolf Rosa, Tomasz Limisiewicz: Hidden in the Layers: Interpretation of Neural Networks for Natural Language Processing. Institute of Formal and Applied Linguistics, 2020 (read here)