SIS code:
Semester: summer
E-credits: 5
Examination: 2/2 C+Ex

Úvod do strojového učení v systému R

Introduction to machine learning in the R system

Schedule

  • English lecture: Wednesday 10:40-12:10, S11
  • English lab: Wednesday 9:00-10:30, SU2

Math and programming requirements

Probability and statistics

R programming

  • You can start with the simple tutorial Tutorial-on-R.2013.
  • If you are not familiar with elementary R functions, use the resources listed below.

Calendar

1. Feb 15
    Lecture:
      Introduction to ML
    Lab:
      Entry test on practical probability calculations
      Random processes – simulations in R

2. Feb 22
    Lecture:
      Data analysis (pp. 1-38)
    Lab:
      Annotation experiment – Demo
      Inter-annotator agreement – Cohen's kappa
      Working with R – Tutorial on annotation data analysis

3. Mar 1
    Lecture:
      On evaluation and overfitting
      Decision Trees (basic structure)
      Entropy
    Lab:
      Programming questions
        – ml-lab.2023-03.01.R

4. Mar 8
    Lecture:
      Clustering (pp. 39-73)
      Linear Regression
    Lab:
      Exercises on IAA, Cohen's kappa, and error analysis
        – Presentation (by Iván)
      Exercises on entropy and conditional entropy
        – Tutorial on distributions (exercises) + data set xy.100

5. Mar 15
    Lecture:
      Decision Trees and Random Forests
    Lab:
      Programming questions
        – ml-lab.2023-03.15.R
        – tf-idf.pdf

6. Mar 22
    Lecture:
      Logistic regression (pp. 1-29)
    Lab:
      Tutorial on Decision Trees
        – forbes.data-preparation.R
        – cp-and-pruning.forbes.R
        – forbes.DT-RF.R

7. Mar 29
    Lecture:
      Evaluation of Binary Classification (pp. 29-43)
      Naive Bayes algorithm
    Lab:
      Programming questions
        – ml-lab.2023-03-29.R
      HW1 assignment

8. Apr 5
    Lecture:
      More on practical evaluation
      Bayes classifier and Bayes error
      Statistical tests in ML
    Lab:
      Test #1
      ROC and AUC
        – ml-lab2023-04.04.R

9. Apr 12
    Lecture:
      Support Vector Machines
    Lab:
      HW1 submission deadline

10. Apr 19
    Lecture:
      Bias and Variance, Regularization, IBL
    Lab:
      SVM + Multi-Class Task Evaluation
        – ml-lab.SVM.2023-04.19.R

11. Apr 26
    Lecture:
      Ensemble learning methods: Part II – Boosting
    Lab:
      Regularization
        – ml-lab.2023-04.26.R

12. May 3
    Lecture:
      Foundations of Neural Networks
    Lab:
      Exercises on statistical tests
        – t-test Example code
        – t-test Exercise
      Chi-square tests
        – Theory
        – Exercise on Goodness-of-fit test
      Discussion on the homework term project
        – Presentation (by Sára)

May 10: No classes

13. May 17

Literature

Recommended readings

  • James, Gareth and Witten, Daniela and Hastie, Trevor and Tibshirani, Robert. An Introduction to Statistical Learning. Springer New York, 2013. (link)
  • Lantz, Brett. Machine learning with R. Packt Publishing Ltd. 2013. [available in the MFF library]
  • Hladká, Barbora, Martin Holub and Vilém Zouhar. A Collection of Machine Learning Exercises.

Introductory readings

  • Alpaydin, Ethem. Introduction to Machine Learning. The MIT Press. 2004, 2010. (link)
  • Domingos, Pedro. A few useful things to know about Machine learning. Communications of the ACM, vol. 55, Issue 10, October 2012, pp. 78–87, ACM, New York, USA. (link)
  • Gonick, Larry and Woollcott Smith. The Cartoon Guide to Statistics. Harper Resource. 2005.
  • Hladká Barbora, Holub Martin: A Gentle Introduction to Machine Learning for Natural Language Processing: How to start in 16 practical steps. In: Language and Linguistics Compass, vol. 9, No. 2, pp. 55-76, 2015.
  • Hladká Barbora, Holub Martin: Machine Learning in Natural Language Processing using R. Course at ESSLLI2013, 2013.
  • Kononenko, Igor and Matjaz Kukar. Machine Learning and Data Mining: Introduction to Principles and Algorithms. Horwood Publishing, 2007. (link) (a light survey of the whole field)

Advanced readings

  • Baayen, R. Harald. Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge University Press, 2008.
  • Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
  • Burges, Christopher J. C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998. (link)
  • Cristianini, Nello and John Shawe-Taylor. An Introduction to Support Vector Machines and other Kernel-based Learning Methods. Cambridge University Press, 2000.
  • Duda, Richard O., Peter R. Hart and David G. Stork. Pattern Classification. Second Edition. Wiley, 2001.
  • Guyon, Isabelle and Gunn, Steve and Nikravesh, Masoud and Zadeh, Lotfi A. Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing). Springer-Verlag New York, Inc. 2006.
  • Hastie, Trevor, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009. (link)
  • Hsu, Chih-Wei, Chih-Chung Chang and Chih-Jen Lin. A Practical Guide to Support Vector Classification. 2010. (link)

About the R system

  • Everitt, B. S. and Hothorn, Torsten. A Handbook of Statistical Analyses using R. CRC Press. 2010.
  • Dalgaard, Peter. Introductory Statistics with R. Springer, 2008.
  • Kerns, G. Jay. Introduction to Probability and Statistics Using R. 2011. (link)
  • Paradis, Emmanuel. R for Beginners. 2005. (link)
  • Rodríguez, Germán. Introducing R – Getting started. (link)
  • Venables, W. N., D. M. Smith and the R Core Team. An Introduction to R. (link)
  • Venables, W. N. and B. D. Ripley. Modern Applied Statistics with S. Springer, 2002. (link)

Sample student projects from the past

This course originally focused on machine learning in natural language processing. To get credit for the lab sessions, students had to complete experimental projects.

2014/15 Native Language Identification
    Task description:
      – npfl054-term-project-2014-15.pdf
    Nice student reports:
      – CUNI report

2013/14 Reuters-21578 Text Categorization
    Task description:
      – text-categorization.pdf
      – test-collection.README.txt
      – 3-classes.distribution.pdf
    Nice student reports:
      – Default task: Luksova.report.final.2013-14.pdf
      – Sentiment analysis task: Tam.report.final.2013-14.pdf

2012/13 Word Sense Disambiguation
    Task description:
      – PFL054.project.2012-13.pdf
    Nice student reports:
      – Barancikova.report.final.2012-13.pdf
      – Machacek.report.final.2012-13.pdf
      – Franky.report.final.2012-13.pdf

2011/12 Semantic Pattern Classification
    Task description:
      – PFL054.project.2011-12.specification.pdf
    Nice student reports:
      – Krejcova.report.final.2011-12.pdf
      – Long.report.final.2011-12.pdf
      – Tamchyna.report.final.2011-12.pdf

2010/11 Semantic Collocation Recognition
    Task description:
      – PFL054.project.2010-11.pdf
      – features.description.pdf
    Nice student reports:
      – Lauschmannova.report.final.2010-11.pdf
      – Hajic.report.updated.2010-11.pdf
      – Kriz.report.final.2010-11.pdf

2009/10 Verb Sense Disambiguation
    Task description:
      – PFL054_2009_10_project.pdf
    Nice student reports:
      – ML_report_Fabian.pdf
      – ML_report_Galuscakova.pdf
      – ML_report_Larasati.pdf

2008/09 Coreference Resolution
    Task description:
      – PFL054_2008_09_project.pdf
    Nice student reports:
      – ML_report_Dusek.pdf
      – ML_report_LeThanhDinh.pdf
      – ML_report_Novak.pdf

2007/08 Named-entity Type Classification
    Task description:
      – PFL054_2007_08_project.pdf
    Nice student reports:
      – Jana.Kravalova-FinalReport.pdf
      – Sergio.Duante-finalReport.pdf
      – Zorana.Ratkovic-Final_Report.pdf

 

Other machine learning courses organized by UFAL

  • NPFL097 Unsupervised Machine Learning in NLP (advanced)
  • NPFL114 Deep learning (introductory)
  • NPFL122 Deep Reinforcement Learning (advanced)
  • NPFL129 Introduction to Machine Learning with Python (introductory)

Acknowledgement

The preparation of teaching materials was financed by the project "Zvýšení kvality vzdělávání na UK a jeho relevance pro potřeby trhu práce", reg. no. CZ.02.2.69/0.0/0.0/16_015/0002362.