NPFL104 Machine Learning Methods
Summer Term 2017/2018
Course aim
To provide students with intensive practical experience in applying Machine Learning techniques to real data.
Course strategy
Until time is exhausted, loop as follows:
 DoItYourself step: develop your own toy implementations of basic ML techniques (in Python) to understand the core concepts,
 DoItWell step: learn to use existing Python libraries and routinize their application on a number of example datasets.
Course schedule
Week 1 - Introduction
 Quick recap of Machine Learning principles. Why ML? Classification of ML approaches.
 Quick intro to Python (if needed)
Let's borrow some googled materials:
 Practice coding basics in Python:
 Homework codingbat:
 create your git repository at UFAL's redmine; follow these instructions, just replace npfl092 with npfl104 and 2017 with 2018
 Continue practicing your knowledge of Python:
 First, implement at least 10 tasks from CodingBat (only from categories Warmup-2, List-2, or String-2), or 10 tasks from Torbjörn Lager's list, or any 10-task mixture of the two sets.
 Second, implement a simple class: anything with at least two methods and some data attributes (really anything).
 For all tasks, add short testing code snippets to the respective source files (e.g. count_evens.py should contain your implementation of the count_evens function, as well as a short test checking the function's answer on at least one problem instance). Run them from a single Makefile: after typing 'make' we should see confirmations of correct functionality.
 Submit the solutions into hw/codingbat in your git repository for this course.
 Deadline: 12th March 2018
Week 2 - Selected classification methods
 Classification setup in ML
 Some basic methods in detail
 decision trees
 perceptron
 K nearest neighbors
 Naive Bayes
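As a DoItYourself-style taste of the perceptron listed above, here is a minimal sketch on a hypothetical toy dataset (an illustration only, not the course's reference implementation):

```python
# Minimal perceptron: learns a linear separator w.x + b > 0 by nudging
# the weights whenever a training example is misclassified.
def train_perceptron(data, epochs=200, lr=1.0):
    """data: list of (features, label) pairs with label in {-1, +1}."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:
            # Misclassified if the activation's sign disagrees with y.
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical linearly separable toy data: label +1 iff x0 + x1 > 1.
data = [((0, 0), -1), ((1, 0), -1), ((0, 1), -1),
        ((1, 1), 1), ((2, 1), 1), ((1, 2), 1)]
w, b = train_perceptron(data)
accuracy = sum(predict(w, b, x) == y for x, y in data) / len(data)
```

On linearly separable data like this, the perceptron convergence theorem guarantees the loop eventually stops making updates, so training accuracy reaches 1.0.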
 Homework myclassifiers:
 DoItYourself style: implement (in Python) simple binary classifiers and evaluate their performance on given datasets
 choose three classification techniques out of the following four: perceptron, Naive Bayes, KNN, decision trees
 apply the classifiers and measure their accuracy on the following datasets:
 finish the exercise, store it into hw/myclassifiers in the git repository
 organize the execution of the experiments using a Makefile: 'make download' downloads the data files from the URLs given above; typing 'make perc' should run training and evaluation of the perceptron (if it's in your selection) for both datasets and print out the final accuracy for both (while other output info is stored in the perc.log file); 'make nb', 'make knn', and 'make dt' should work analogously (three of the four are enough); 'make all' should call the data download and all three classification targets
 please double-check that 'make all' works in a fresh git clone in the default SU1 environment (you can access the SU1 computers remotely by ssh)
 Deadline: 12th March 2018
Week 3 - Selected classification methods, cont.
 some more classification methods in detail:
 logistic regression
 support vector machines
 Multiclass classification
 native binary vs. native multiclass setups
 conversion strategies (one against one, one against all)
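The one-against-all conversion can be sketched as follows, with the perceptron as the base binary learner and a hypothetical toy dataset (an illustration of the strategy, not the course's reference code):

```python
# One-against-all: for each class c, train a binary classifier on labels
# "is c" (+1) vs "is not c" (-1); at prediction time, pick the class whose
# classifier yields the highest raw score.

def train_binary(data, epochs=200):
    """Perceptron on (features, label) pairs with label in {-1, +1}."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def raw_score(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_one_vs_rest(data):
    classes = sorted({y for _, y in data})
    return {c: train_binary([(x, 1 if y == c else -1) for x, y in data])
            for c in classes}

def predict(models, x):
    return max(models, key=lambda c: raw_score(models[c], x))

# Hypothetical 3-class toy data: the classes cluster around three corners.
data = [((0, 0), 'a'), ((0, 1), 'a'), ((5, 0), 'b'), ((6, 0), 'b'),
        ((0, 5), 'c'), ((0, 6), 'c')]
models = train_one_vs_rest(data)
```

One-against-one works analogously, except that one classifier is trained per pair of classes and the prediction is decided by voting.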
 Homework mydataset:
Week 4 - Regression
 locally weighted regression (simple interpolation, KNN averaging, weighting by a kernel function)
 setup of the regression task in machine learning
 linear regression in detail
 basis functions for handling nonlinear dependencies
 probabilistic interpretation of the least squares optimization criterion (consequence of central limit theorem)
 optimization by stochastic gradient descent
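The least-squares-by-SGD recipe above can be sketched in a few lines; the toy dataset here is hypothetical (noise-free data from a known line, so the fit can be checked against the true coefficients):

```python
import random

# Least-squares linear regression fitted by stochastic gradient descent:
# repeatedly pick a training example and move the weights along the
# negative gradient of the squared error on that single example.

def sgd_linear_regression(xs, ys, lr=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    n = len(xs[0])
    w, b = [0.0] * n, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            pred = sum(wj * xj for wj, xj in zip(w, xs[i])) + b
            err = pred - ys[i]
            w = [wj - lr * err * xj for wj, xj in zip(w, xs[i])]
            b -= lr * err
    return w, b

# Hypothetical data generated from y = 2*x + 1 (no noise),
# so SGD should recover slope 2 and intercept 1.
xs = [(x / 10,) for x in range(11)]
ys = [2 * x[0] + 1 for x in xs]
w, b = sgd_linear_regression(xs, ys)
```

With noise-free data the optimum has zero error on every example, so the stochastic updates vanish there and the iterates settle on the true coefficients.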
 Homework myregression:
 DoItYourself-style task: implement any regression technique (e.g. least squares fitted by stochastic gradient descent) and apply it on the following datasets:
 organize the execution of the experiment using a Makefile; typing 'make all' should train and evaluate (e.g. via mean squared error) models for both datasets
 store your solution into the myregression directory in the git repository
 Deadline: 9th April 2018
Week 5 - ML Diagnostics
 Slides as presented:
 Visualizing
 Bias-Variance, Overfitting-Underfitting
 Search vs. Modelling Error
 Error Analysis, Ablative Analysis
 Other nice sources: by Cohen on bias-variance decomposition, by Ng on diagnosing in general, bias-variance proof.
 Homework scikitclass:
 Ondrej will assemble all datasets from the mydataset homework and share the link.
 (The "DoItWell" version of the myclassifiers homework.)
 Apply scikit-learn classification modules on all the datasets collected by you and your colleagues (each student has to do this homework on their own).
 Choose at least 4 different classifiers from those offered in scikit-learn (there are many of them, e.g. svm.SVC, KNeighborsClassifier, BernoulliNB, tree.DecisionTreeClassifier, LogisticRegression, ...).
 You can use sklearn modules also for preprocessing, e.g. for loading matrices of floats (numpy.loadtxt) or for replacing categorical features by numbers (from sklearn.feature_extraction import DictVectorizer).
 scikit-learn cheat sheet listing which ML methods are available in scikit-learn
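A minimal sketch of this DoItWell workflow. The shared course datasets are not assumed here; a synthetic dataset from make_classification stands in for them:

```python
# Sketch: train and compare several scikit-learn classifiers on one dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one of the shared datasets.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

classifiers = {
    'logreg': LogisticRegression(max_iter=1000),
    'nb': BernoulliNB(),
    'knn': KNeighborsClassifier(),
    'dt': DecisionTreeClassifier(random_state=0),
}
accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # score() on a classifier returns mean accuracy on the given data.
    accuracies[name] = clf.score(X_test, y_test)
```

For the real datasets, the only extra work is loading (numpy.loadtxt) and, where features are categorical, vectorizing them (DictVectorizer) before the fit/score loop.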
 Deadline: May 10, 2018 (two weeks after Ondrej distributed all the datasets).
Week 6 - Feature engineering, Regularization
 slides on feature engineering and regularization
 additional reading:
 Homework scikitregression:
 Apply at least three scikit-learn regression modules on the datasets from the previous class on regression. You can use e.g. modules for Generalized Linear Models, Support Vector regressors, KNN regressors, Decision Tree regressors, or any other.
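A sketch of this comparison, again on a hypothetical synthetic dataset (a noisy sine curve) rather than the course data:

```python
# Sketch: fit a few scikit-learn regressors and compare them by mean
# squared error on held-out data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: y = sin(x) plus a little noise.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

regressors = {
    'linear': LinearRegression(),
    'knn': KNeighborsRegressor(n_neighbors=5),
    'dt': DecisionTreeRegressor(max_depth=4, random_state=0),
}
mse = {}
for name, reg in regressors.items():
    reg.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, reg.predict(X_test))
```

Since the target is nonlinear in x, the local methods (KNN, tree) should beat plain linear regression here; that contrast is the point of trying several modules.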
 make your solution (i.e. training and evaluation (e.g. via mean squared error) on both datasets) runnable just by typing 'make'
 submit your solution into hw/scikitregression/ in your git repository.
 Deadline: 23rd April 2018
Week 7 - Kernel methods
 Slides as presented:
 Kernel trick.
 Linear, Polynomial and RBF kernels.
 SVM parameters, kernel parameters and their effects.
 Additional slides on kernelization by Mark Johnson.
 Homework classificationgridsearch:
 For PAMAP-Easy as divided into train+test:
 Cross-validate on train to choose between the linear, poly and RBF kernels.
 Create the heatmap for RBF (i.e. plot the score for all values of C and gamma).
 Use GridSearchCV to find the best C and gamma (i.e. find the best setting without plotting anything).
 NEVER USE THE test.txt.gz FOR THE GRID SEARCH.
 Enter the accuracy of the best setting on test.txt.gz into CLASSIFICATION_RESULTS.txt, mentioning C and gamma in the comment.
 Deadline: 8th May 2018
Week 8 - Clustering
 clustering vs. classification
 K-means
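The two alternating steps of K-means can be sketched in plain Python; the toy two-blob dataset here is hypothetical, chosen so the expected clustering is obvious:

```python
import random

# Minimal K-means: alternate assigning each point to its nearest centroid
# and recomputing each centroid as the mean of its assigned points.

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialize from random data points
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Update step: move each non-empty centroid to its cluster mean.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(xs)
                                     for xs in zip(*members))
    return labels, centroids

# Two well-separated toy blobs: k=2 should split them cleanly.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, 2)
```

A production version would also stop early once the assignments no longer change and restart from several random initializations, since K-means only finds a local optimum.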
 Homework myclustering:
 DoItYourself style: implement the K-means algorithm in a Python script that reads data in the PAMAP-easy format and clusters the data. At the end, the script should print a summary table like the one in this example.
 Dataset for the homework:
 Commit your script and the Makefile into the myclustering/ directory in the usual place.
 As usual, please double-check that typing 'make' in this directory in a fresh clone works in the default SU1 environment.
 Deadline: 10th May 2018
Week 9 - Clustering, cont.
 clustering vs. classification
 Mixture of Gaussians and EM
 Hierarchical clustering
 Clustering evaluation
 Rand index, purity and NMI: here or here.
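Two of these external evaluation measures are short enough to sketch directly; the six-point labeling below is a hypothetical example chosen so the values are easy to check by hand:

```python
from collections import Counter
from itertools import combinations

# External clustering evaluation: compare predicted cluster labels
# against gold class labels for the same points.

def purity(clusters, classes):
    """Fraction of points that fall in the majority class of their cluster."""
    total = 0
    for c in set(clusters):
        members = [g for cl, g in zip(clusters, classes) if cl == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(clusters)

def rand_index(clusters, classes):
    """Fraction of point pairs on which the two labelings agree
    (both in the same group, or both in different groups)."""
    agree, pairs = 0, 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        agree += same_cluster == same_class
        pairs += 1
    return agree / pairs

# Hypothetical example: one 'b' point lands in the wrong cluster,
# so purity is 5/6 and the Rand index is 10/15 = 2/3.
clusters = [0, 0, 0, 1, 1, 1]
classes = ['a', 'a', 'b', 'b', 'b', 'b']
```

NMI additionally normalizes the mutual information between the two labelings by their entropies, which, unlike purity, penalizes splitting the data into many tiny clusters.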
Literature
Required work
 each student's own solutions of all homework tasks must be submitted on time
Final exam test
 Please see the list of possible test questions here. Each test will contain around 10 questions selected from the list. The questions are to be answered in written form within 75 minutes.
Determination of final grade
 excellent: > 90 %
 very good: > 70 %
 good: > 50 %
 (homework assignments 50%, final exam written test 40%, lab activity 10%)