SIS code: 
2/2 C+Ex

Introduction to Machine Learning

Time and place


  • Czech    Wed, 10:40–12:10, S4
  • English  Thu, 9:00–10:30, S3

Lab session

  • Czech    Mon, 17:20–18:50, SU2
             Thu, 12:20–13:50, SU2
  • English  Fri, 9:00–10:30, SU2

Math and programming requirements

Probability and statistics

  • The most important requirements from probability and statistics are listed here: Preliminaries.Probability-Statistics
  • Make sure that you are familiar at least with the very basics: Prob-Stat.zaklady.2014
  • As for MFF students, we expect the knowledge covered in the obligatory course "Pravděpodobnost a statistika" (Probability and Statistics, NMAI059).
  • Gentle entry test in probability and statistics – a brief evaluation: Oct 2019, Oct 2018.

R programming

  • You can start with a simple tutorial Tutorial-on-R.2013
  • If you are not familiar with elementary R functions, use the resources listed below.

Calendar 2019/20

  Lecture date Lecture Lab
1. 2–3/10 Introduction to Machine Learning
    – What is Machine Learning
    – Basic formal concepts
    – Entropy, its meaning and definition
    – Overview of the course  
    – Requirements for getting credits  

Annotation experiment and data analysis
    – Practical experience with manual annotation
    – Annotation data analysis
    – Inter-annotator agreement
    – Confusion matrices and error analysis
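The agreement statistics above can be computed directly in base R. The sketch below uses made-up labels from two hypothetical annotators, builds their confusion matrix, and derives Cohen's kappa; the data are purely illustrative.

```r
# Labels assigned by two annotators to the same 10 items (toy data)
a1 <- c("N","N","V","V","N","V","N","N","V","N")
a2 <- c("N","V","V","V","N","V","N","N","N","N")

# Confusion matrix of the two annotators' decisions
cm <- table(a1, a2)

# Observed agreement: proportion of items on the diagonal
p_o <- sum(diag(cm)) / sum(cm)

# Chance agreement: sum over labels of the product of marginal proportions
p_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2

# Cohen's kappa: agreement beyond chance
kappa <- (p_o - p_e) / (1 - p_e)
kappa
```

Raw agreement alone (here 8/10) overstates reliability when one label dominates; kappa corrects for that.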

A gentle tutorial on elementary data analysis in R
    – with homework

2. 9–10/10 Data analysis
    – Basic data exploration
    – Association between attributes
    – K-Means clustering
    – Hierarchical agglomerative clustering

R script
    – Feature frequency on the MOV data set
    – K-means kmeans() on the USArrests data set
    – Hierarchical clustering hclust() on USArrests
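A minimal sketch of the two clustering calls listed above, using only base R (USArrests ships with the datasets package); the choice of 4 clusters and complete linkage is illustrative, not prescribed by the course script.

```r
data(USArrests)
x <- scale(USArrests)            # standardize features before clustering

set.seed(42)
km <- kmeans(x, centers = 4, nstart = 20)   # K-means with 4 clusters
table(km$cluster)                           # cluster sizes

hc <- hclust(dist(x), method = "complete")  # agglomerative, complete linkage
cl <- cutree(hc, k = 4)                     # cut the dendrogram into 4 groups
table(cl, km$cluster)                       # compare the two clusterings
```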

Wed 9/10
    – HW #1 assignment

3. 16–17/10

Working with data, evaluation, overfitting

Intro to Decision Trees and Random Forests

Tutorial on probability distributions and entropy in R
    – Data: xy.100.csv

Hints on computing entropy in R
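Since the xy.100.csv data are not reproduced here, a base-R sketch of empirical entropy on toy vectors:

```r
# Empirical entropy (in bits) of a discrete sample
entropy <- function(x) {
  p <- table(x) / length(x)      # relative frequencies
  -sum(p * log2(p))              # H(X) = -sum_i p_i log2 p_i
}

entropy(c("a","a","b","b"))      # uniform over 2 values -> 1 bit
entropy(rep("a", 10))            # deterministic -> 0 bits
```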

Tutorial on Decision Trees

4. 23–24/10 Linear regression, Logistic regression
    – Regression and classification
    – Least square method
    – Gradient Descent Algorithm
    – Sigmoid function

R script
    – Loss function, minimal value
    – Gradient Descent Algorithm
    – Auto data set, Student data set
    – lm() for linear regression
    – glm() for logistic regression
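A sketch of the three ingredients on synthetic data: lm() for least squares, a hand-rolled gradient descent minimizing the same loss, and glm() for logistic regression. The data, learning rate, and iteration count are illustrative choices, not taken from the course materials.

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)

# Closed-form least squares fit
fit <- lm(y ~ x)
coef(fit)                        # close to (2, 3)

# The same loss minimized by gradient descent
b <- c(0, 0)                     # (intercept, slope)
for (i in 1:5000) {
  pred <- b[1] + b[2] * x
  # gradient of (1/2) * mean squared error
  grad <- c(mean(pred - y), mean((pred - y) * x))
  b <- b - 0.1 * grad
}
b                                # agrees with coef(fit)

# Logistic regression via the sigmoid link
z <- rbinom(100, 1, plogis(2 * x))
logit <- glm(z ~ x, family = binomial)
coef(logit)
```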

Fri 25/10
    – HW #1 early submission date

5. 30–31/10 Lecture #5
    – Details on learning Decision Trees
    – Decision Trees and overfitting
    – More about evaluation heuristics
    – Use of statistical tests will be discussed next time

Mon 28/10 lab canceled (State holiday)

Wed 30/10
     – HW #1 late submission date
     – HW #2 assignment

Exercises on decision trees – evaluation and tuning the complexity of classification trees
example R code
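The linked example R code is not reproduced here; the sketch below shows one standard way to tune the complexity of a classification tree with the rpart package, using iris as a stand-in data set.

```r
library(rpart)

data(iris)
set.seed(7)
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Grow a deliberately large tree, then inspect the complexity table
tree <- rpart(Species ~ ., data = train,
              control = rpart.control(cp = 0.001, minsplit = 5))
printcp(tree)                    # cross-validated error per cp value

# Prune at the cp with minimal cross-validated error
best <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best)

# Evaluate on held-out data
pred <- predict(pruned, test, type = "class")
mean(pred == test$Species)       # test-set accuracy
```

Pruning by cross-validated error is the usual guard against the overfitting discussed in the lecture: the unpruned tree fits the training folds better but generalizes worse.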

6. 6–7/11 More supervised learning algorithms
    – Instance-based learning
    – Naive Bayes classifier
    – Bayesian networks

R script
    – feature scaling
    – unbalanced data set Caravan
    – knn() for the k-Nearest Neighbor algorithm
    – naiveBayes() for the Naive Bayes classifier
    – glm() for Logistic Regression

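A minimal sketch of instance-based classification with knn() from the class package, on iris as a stand-in for the Caravan data used in the course script; it also illustrates why feature scaling matters for distance-based learners.

```r
library(class)

data(iris)
set.seed(3)
idx <- sample(nrow(iris), 100)

# Scale features: kNN distances are sensitive to feature ranges
x <- scale(iris[, 1:4])
train_x <- x[idx, ]
test_x  <- x[-idx, ]
train_y <- iris$Species[idx]

# k-nearest-neighbor classification with k = 5
pred <- knn(train_x, test_x, cl = train_y, k = 5)
mean(pred == iris$Species[-idx])   # held-out accuracy
```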
7. 13–14/11

Use of statistical t-test for evaluation
    – comparing two classifiers
    – one sample and two sample paired t-tests
    – see the materials from Lecture #5

Ensemble learning methods
    – Part I: General principles and bagging
    – boosting methods will be explained next time

Wed 13/11
     – HW #2 early submission date

Exercises on t-test
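A sketch of the comparison described above, on made-up per-fold accuracies; it also shows that the two-sample paired t-test is identical to a one-sample t-test on the per-fold differences.

```r
# Per-fold accuracies of two classifiers on the same 10 CV folds (toy numbers)
acc_a <- c(0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81)
acc_b <- c(0.78, 0.77, 0.80, 0.79, 0.80, 0.76, 0.79, 0.82, 0.78, 0.79)

# Paired t-test: the folds are matched, so test the per-fold differences
tt <- t.test(acc_a, acc_b, paired = TRUE)
tt$p.value

# Equivalent one-sample t-test on the differences against mu = 0
t.test(acc_a - acc_b, mu = 0)$p.value
```

An unpaired test on the same numbers would ignore that both classifiers saw identical folds and typically lose power.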

8. 20–21/11

Wed 20/11
    – Obligatory written test in the lecture time
    – Final Homework HW #3 assignment

Thu 21/11 lecture canceled (Open Door Day)

Wed 20/11
     – HW #2 late submission date

Thu 21/11 lab canceled (Open Door Day)

Fri 22/11
    – Obligatory written test in the lab time
    – Final Homework HW #3 assignment

9. 27–28/11

Ensemble learning methods
    – Part II: Boosting approaches

The curse of dimensionality and feature selection
    – Why we need feature selection
    – Feature selection heuristics
    – Bayes error (postponed to my last lecture)
    – Chi-square tests (will be discussed at lab sessions)

Recommended materials and illustrations
    – Curse of dimensionality – illustration
    – FSelector package

Exercises on chi-squared tests
    – Tutorial 1
    – Tutorial 2

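A base-R sketch of the chi-squared test as used for feature selection: test whether a candidate feature is associated with the class. The contingency table below is invented for illustration.

```r
# Contingency table: rows = binary feature value, cols = class (toy counts)
tab <- matrix(c(30, 10,
                15, 45), nrow = 2, byrow = TRUE,
              dimnames = list(feature = c("yes", "no"),
                              class   = c("pos", "neg")))

ct <- chisq.test(tab)    # Yates continuity correction applies to 2x2 tables
ct$statistic             # chi-squared value
ct$p.value               # small p-value -> feature and class are associated
```

Features whose table gives a large chi-squared statistic (small p-value) are kept; independent-looking features are candidates for removal.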
Random Forests
    – randomForest() from the randomForest package
    – hints on homework HW #3
    – evaluation with different cut-off values
    – evaluation using ROC
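Independently of the random forest model itself, the evaluation idea above can be sketched in base R: each cut-off on the predicted scores yields one (FPR, TPR) point, and the points trace the ROC curve. Scores and labels below are toy values.

```r
# Toy predicted scores and true binary labels
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
label <- c(1,   1,   0,   1,   1,    0,   0,   1,   0,   0)

# One ROC point per cut-off: TPR and FPR of the thresholded classifier
roc_point <- function(cut) {
  pred <- as.integer(score >= cut)
  c(fpr = sum(pred == 1 & label == 0) / sum(label == 0),
    tpr = sum(pred == 1 & label == 1) / sum(label == 1))
}
roc <- t(sapply(sort(unique(score), decreasing = TRUE), roc_point))

# AUC by the trapezoidal rule over the ROC points
xs <- c(0, roc[, "fpr"], 1)
ys <- c(0, roc[, "tpr"], 1)
auc <- sum(diff(xs) * (head(ys, -1) + tail(ys, -1)) / 2)
auc
```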

10. 4–5/12 Regularization and ROC
    – ROC curve for binary classifiers
    – Bias and Variance
    – Regularization on linear and logistic regression
    – Ridge, Lasso, Elastic net

R script I
    – data set College
    – binary classification using Decision trees
       and Logistic Regression
    – evaluation using ROC curve

R script II
    – data set College
    – regularization using glmnet() from the glmnet package
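glmnet handles ridge, lasso, and elastic net; as a base-R illustration of the ridge case only, the closed form beta = (X'X + lambda*I)^(-1) X'y on centered data shows the shrinkage effect. The synthetic data and lambda values are hypothetical.

```r
set.seed(5)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 0, 0, 1)
y <- X %*% beta_true + rnorm(n)

# Ridge closed form on centered data: beta = (X'X + lambda I)^-1 X'y
ridge <- function(X, y, lambda) {
  Xc <- scale(X, scale = FALSE)      # center columns (intercept unpenalized)
  yc <- y - mean(y)
  solve(crossprod(Xc) + lambda * diag(ncol(X)), crossprod(Xc, yc))
}

b0  <- ridge(X, y, 0)     # lambda = 0 reproduces least squares
b10 <- ridge(X, y, 10)    # larger lambda shrinks the coefficients
sum(b10^2) < sum(b0^2)    # shrinkage in the L2 norm
```

Lasso and elastic net have no closed form (the L1 penalty is non-smooth), which is why the course uses glmnet for those.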

11. 11–12/12 Support Vector Machines
    – hyperplane, dot product, quadratic programming
    – Large Margin Classifier
    – Soft Margin Classifier
    – Kernel tricks

Native Language Identification

R script
    – a subset of the TOEFL11 corpus
    – SVM using svm() from the e1071 package

12. 18–19/12 Fundamentals of Neural Networks  
13. 8–9/1 Wed 8/1 and Thu 9/1
    – Obligatory final written test in the lecture time

Mon 6/1
     – HW #3 hard deadline

Oral exam dates
  • Jan 23, 28, 30
  • 9am–1pm
  • Room S7
  • Sign up in the SIS system

Dear students, your HW #3 and Test #2 scores will not be posted this week because Dr. Holub is on vacation. The scores will then be posted in accordance with the exam dates: students signed up for January 23 will learn their scores first, followed by the January 28 students and finally the January 30 students.

Please do not hesitate to contact me if you have any questions.

All the best,

Barbora Hladka, January 13 2020


Recommended readings

  • James, Gareth, Daniela Witten, Trevor Hastie and Robert Tibshirani. An Introduction to Statistical Learning. Springer New York, 2013. (link)
  • Lantz, Brett. Machine Learning with R. Packt Publishing Ltd, 2013. [available in the MFF library]
  • Hladká, Barbora, Martin Holub and Vilém Zouhar. A Collection of Machine Learning Exercises.

Introductory readings

  • Alpaydin, Ethem. Introduction to Machine Learning. The MIT Press. 2004, 2010. (link)
  • Domingos, Pedro. A few useful things to know about Machine Learning. Communications of the ACM, vol. 55, Issue 10, October 2012, pp. 78–87, ACM, New York, USA. (link)
  • Gonick, Larry and Woollcott Smith. The Cartoon Guide to Statistics. Harper Resource. 2005.
  • Hladká, Barbora and Martin Holub. A Gentle Introduction to Machine Learning for Natural Language Processing: How to start in 16 practical steps. In: Language and Linguistics Compass, vol. 9, No. 2, pp. 55–76, 2015.
  • Hladká Barbora, Holub Martin: Machine Learning in Natural Language Processing using R. Course at ESSLLI2013, 2013.
  • Kononenko, Igor and Matjaz Kukar. Machine Learning and Data Mining: Introduction to Principles and Algorithms. Horwood Publishing, 2007. (link; a light survey of the whole field)

Advanced readings

  • Baayen, R. Harald. Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge University Press, 2008.
  • Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
  • Burges Christopher J. C.  A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998. (link)
  • Cristianini, Nello and John Shawe-Taylor. An Introduction to Support Vector Machines and other Kernel-based Learning Methods. Cambridge University Press, 2000.
  • Duda, Richard O., Peter R. Hart and David G. Stork. Pattern Classification. Second Edition. Wiley, 2001.
  • Guyon, Isabelle and Gunn, Steve and Nikravesh, Masoud and Zadeh, Lotfi A. Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing). Springer-Verlag New York, Inc. 2006.
  • Hastie, Trevor, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009. (link)
  • Hsu, Chih-Wei, Chih-Chung Chang and Chih-Jen Lin. A Practical Guide to Support Vector Classification. 2010. (link)

About the R system

  • Everitt, B.S and Hothorn, Torsten. A Handbook of Statistical Analyses using R. CRC Press. 2010.
  • Dalgaard, Peter. Introductory Statistics with R. Springer, 2008.
  • Kerns, G. Jay. Introduction to Probability and Statistics Using R. 2011. (link)
  • Paradis, Emmanuel. R for Beginners. 2005. (link)
  • Rodríguez, Germán. Introducing R: Getting Started. (link)
  • Venables, W. N., D. M. Smith and the R core team. An Introduction to R. (link)
  • Venables, W. N. and B. D. Ripley. Modern Applied Statistics with S. Springer, 2002. (link)

Sample student projects from the past

This course was originally focused on machine learning in natural language processing. To get credits for lab sessions, students needed to complete experimental projects.

  Year    Default Task    Default Task Description    Nice Student Reports
2014/15 Native Language Identification npfl054-term-project-2014-15.pdf CUNI report

Reuters-21578 Text Categorization
    Default task:
    Sentiment analysis task:

2012/13 Word Sense Disambiguation PFL054.project.2012-13.pdf
2011/12 Semantic Pattern Classification PFL054.project.2011-12.specification.pdf
2010/11 Semantic Collocation Recognition PFL054.project.2010-11.pdf
2009/10 Verb Sense Disambiguation PFL054_2009_10_project.pdf ML_report_Fabian.pdf
2008/09 Coreference Resolution
2007/08 Named-entity Type Classification PFL054_2007_08_project.pdf Jana.Kravalova-FinalReport.pdf


Other machine learning courses organized by UFAL

  • NPFL097 Selected problems in machine learning
  • NPFL104 Machine learning exercises 
  • NPFL114 Deep learning

MFF UK Internal Regulations