Machine learning has achieved notable success on complex tasks in many fields. This course introduces basic machine learning concepts and techniques, focusing both on the theoretical foundations and on the implementation and use of machine learning algorithms in the Python programming language. Considerable attention is paid to applying machine learning techniques to practical tasks, in which students try to devise a solution with the highest performance.
Python programming skills are required, together with basic probability theory knowledge.
Official name: Introduction to Machine Learning with Python
SIS code: NPFL129
Semester: winter
E-credits: 5
Examination: 2/2 C+Ex
Instructors: Jindřich Libovický (lecture), Jan Bronec, Tomáš Musil, Kristýna Onderková, Dušan Variš, Gianluca Vico (practicals), Milan Straka (assignments & ReCodEx), Jakub Klesa, Tymur Kotkov, Tymofii Shchetilin (teaching assistants)
This course is also part of the inter-university programme prg.ai Minor. It pools the best of AI education in Prague to provide students with a deeper and broader insight into the field of artificial intelligence. More information is available at prg.ai/minor.
All lectures and practicals will be recorded and available on this website.
After this course students should…
1. Introduction to Machine Learning Slides PDF Slides CS Lecture EN Lecture EN Practicals linear_regression_manual linear_regression_features Questions
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
This section lists the lecture content, including references to some additional study materials. The main study material is Pattern Recognition and Machine Learning by Christopher Bishop, referred to as PRML.
Note that the topics in italics are not required for the exam.
Sep 29, Oct 4
Learning objectives. After the lecture you should be able to…
Covered topics and where to find more:
To pass the practicals, you need to obtain at least 70 points, excluding the bonus points. Note that up to 40 points above 70 (both bonus and non-bonus) will be transferred to the exam. In total, assignments for at least 105 points (not including the bonus points) will be available.
The tasks are evaluated automatically using the ReCodEx Code Examiner.
The evaluation is performed using Python 3.11, scikit-learn 1.7.2, numpy 2.3.3, scipy 1.16.2, pandas 2.3.2, and matplotlib 3.10.6. You should install these exact versions of the packages yourselves.
Solving assignments in teams (of size at most 3) is encouraged, but everyone has to participate (it is forbidden not to work on an assignment and then submit a solution created by other team members). All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. To allow plagiarism detection, each such solution must explicitly list all members of the team using this template.
Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty. While discussing assignments with any classmate is fine, each team must complete the assignments themselves, without using code they did not write (unless explicitly allowed). Of course, inside a team you are allowed to share code and submit identical solutions. Note that all students involved in cheating will be punished, so if you share your source code with a friend, both you and your friend will be punished. That also means that you should never publish your solutions.
Deadline: Oct 15, 22:00 (3 points)
Starting with the linear_regression_manual.py template, solve a linear regression problem using the algorithm from the lecture that explicitly computes the matrix inverse. Then compute the root mean square error (RMSE) on the test set.
Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).
python3 linear_regression_manual.py --test_size=0.1
52.38
python3 linear_regression_manual.py --test_size=0.5
54.58
python3 linear_regression_manual.py --test_size=0.9
59.46
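As a rough illustration of the approach described above, the following minimal sketch fits a linear regression by explicitly inverting a matrix and then reports the RMSE on held-out data. It is not the course template: the synthetic data, the helper names `fit_linear_regression` and `rmse`, and the fixed 180/20 split are assumptions made only for this example.

```python
import numpy as np


def fit_linear_regression(train_data, train_target):
    """Return the weights (the last one is the bias) solving the normal equations."""
    # Append a column of ones so that the bias is learned as an ordinary weight.
    X = np.concatenate([train_data, np.ones((train_data.shape[0], 1))], axis=1)
    # w = (X^T X)^{-1} X^T t, with the inverse computed explicitly.
    return np.linalg.inv(X.T @ X) @ X.T @ train_target


def rmse(data, target, weights):
    """Root mean square error of the linear model on the given data."""
    X = np.concatenate([data, np.ones((data.shape[0], 1))], axis=1)
    predictions = X @ weights
    return float(np.sqrt(np.mean((predictions - target) ** 2)))


# Synthetic data (an assumption, not the dataset used by the template).
generator = np.random.RandomState(42)
data = generator.uniform(-1, 1, size=(200, 3))
true_weights = np.array([2.0, -1.0, 0.5])
target = data @ true_weights + 3.0 + generator.normal(scale=0.1, size=200)

# Hold out the last 20 examples as a test set.
train_data, test_data = data[:180], data[180:]
train_target, test_target = target[:180], target[180:]

weights = fit_linear_regression(train_data, train_target)
test_rmse = rmse(test_data, test_target, weights)
print(f"{test_rmse:.2f}")
```

In practice `np.linalg.solve` or `np.linalg.lstsq` is usually numerically preferable to an explicit inverse; the inverse is used here only because the assignment text asks for it.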
Deadline: Oct 15, 22:00 (3 points)
Starting with the linear_regression_features.py template, use scikit-learn to train a model of a 1D curve.
Try using a concatenation of features $x^i$ for $i$ from 1 to a given range, and report the RMSE of every such configuration.
Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).
python3 linear_regression_features.py --data_size=10 --test_size=5 --range=6
Maximum feature order 1: 0.74 RMSE
Maximum feature order 2: 1.87 RMSE
Maximum feature order 3: 0.53 RMSE
Maximum feature order 4: 4.52 RMSE
Maximum feature order 5: 1.70 RMSE
Maximum feature order 6: 2.82 RMSE
python3 linear_regression_features.py --data_size=30 --test_size=20 --range=9
Maximum feature order 1: 0.56 RMSE
Maximum feature order 2: 1.53 RMSE
Maximum feature order 3: 1.10 RMSE
Maximum feature order 4: 0.28 RMSE
Maximum feature order 5: 1.60 RMSE
Maximum feature order 6: 3.09 RMSE
Maximum feature order 7: 3.92 RMSE
Maximum feature order 8: 65.11 RMSE
Maximum feature order 9: 3886.97 RMSE
python3 linear_regression_features.py --data_size=50 --test_size=40 --range=9
Maximum feature order 1: 0.63 RMSE
Maximum feature order 2: 0.73 RMSE
Maximum feature order 3: 0.31 RMSE
Maximum feature order 4: 0.26 RMSE
Maximum feature order 5: 1.22 RMSE
Maximum feature order 6: 0.69 RMSE
Maximum feature order 7: 2.39 RMSE
Maximum feature order 8: 7.28 RMSE
Maximum feature order 9: 201.70 RMSE
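The idea of concatenating powers of the input can be sketched as follows. This is only an illustrative example under stated assumptions, not the template: the noisy sine curve, the split sizes, the seed, and the variable names are all made up here, and the printed numbers will not match the outputs above.

```python
import numpy as np
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection

# Synthetic 1D curve (an assumption): a noisy sine sampled at 40 points.
generator = np.random.RandomState(42)
xs = np.linspace(0, 7, num=40)
ys = np.sin(xs) + generator.normal(0, 0.2, size=40)

results = {}
for order in range(1, 6):
    # Feature matrix with columns x^1, x^2, ..., x^order.
    features = np.stack([xs ** i for i in range(1, order + 1)], axis=1)

    # The same random_state gives the same train/test split for every order.
    train_x, test_x, train_y, test_y = sklearn.model_selection.train_test_split(
        features, ys, test_size=10, random_state=42)

    model = sklearn.linear_model.LinearRegression().fit(train_x, train_y)
    predictions = model.predict(test_x)

    # RMSE is the square root of the mean squared error.
    rmse = float(np.sqrt(sklearn.metrics.mean_squared_error(test_y, predictions)))
    results[order] = rmse
    print(f"Maximum feature order {order}: {rmse:.2f} RMSE")
```

As in the sample outputs above, the test error typically stops improving and may grow sharply once the feature order becomes too high for the amount of training data.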
In the competitions, your goal is to train a model and then predict target values on the test set available only in ReCodEx.
When submitting a competition solution to ReCodEx, you should submit a trained model and a Python source capable of running it.
Furthermore, please also include the Python source and hyperparameters you used to train the submitted model. But be careful that there still must be exactly one Python source with a line starting with def main(.
Do not forget about the maximum allowed model size and time and memory limits.
Before the deadline, ReCodEx prints the exact achieved score, but only if it is worse than the baseline.
If you surpass the baseline, the assignment is marked as solved in ReCodEx and you immediately get regular points for the assignment. However, ReCodEx does not print the reached score.
After the competition deadline, the latest submission of every user surpassing the required baseline participates in a competition. Additional bonus points are then awarded according to the ordering of the performance of the participating submissions.
After the competition results announcement, ReCodEx starts to show the exact performance for all the already submitted solutions and also for the solutions submitted later.
Each competition will be scored after the first deadline.
The bonus points will be computed in the following fashion:
Let $B$ be the maximal number of bonus points that can be achieved in the competition.
All of the solutions that surpass the baseline will be sorted and divided into $B+1$ groups of equal size.
Every solution in the top group gets $B$ points, the next group gets $B-1$ points, etc., the last group gets 0 bonus points.
The team solution only occupies one position in the table of the competition results.
Please, do not forget that every member of the team needs to upload the solution to ReCodEx and to submit both the training/prediction source code and the trained model itself.
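The bonus computation can be sketched in a few lines. This is only an interpretation, not the official scoring code: it assumes the sorted solutions are split into B+1 groups receiving B, B-1, ..., 0 points, and it assumes a simple proportional rule for the case where the number of solutions is not divisible by the number of groups. The function name `bonus_points` is hypothetical.

```python
def bonus_points(num_solutions: int, max_bonus: int) -> list[int]:
    """Bonus points for each rank (0 is the best) among solutions above the baseline.

    Assumption: ranks are mapped proportionally to max_bonus + 1 groups,
    which yields equal-sized groups whenever the counts divide evenly.
    """
    groups = max_bonus + 1
    return [
        # rank * groups // num_solutions is the 0-based group index of this rank.
        max_bonus - (rank * groups) // num_solutions
        for rank in range(num_solutions)
    ]


# Twelve solutions and at most 5 bonus points give six groups of two.
print(bonus_points(12, 5))
```

A team solution would enter this ranking only once, with every team member receiving the points of that single position.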
numpy or scipy, anything you implement yourself, and, unless specified otherwise in the assignment description, any method from sklearn. Furthermore, the solution must be created by you, and you must understand it fully. Do not use deep network frameworks like TensorFlow or PyTorch.
Installing to central user packages repository
You can install all required packages to the central user packages repository using
pip3 install --user scikit-learn==1.7.2 numpy==2.3.3 scipy==1.16.2 pandas==2.3.2 matplotlib==3.10.6
Installing to a virtual environment
Python supports virtual environments, which are directories containing independent sets of installed packages. You can create a virtual environment by running
python3 -m venv VENV_DIR
followed by
VENV_DIR/bin/pip3 install scikit-learn==1.7.2 numpy==2.3.3 scipy==1.16.2 pandas==2.3.2 matplotlib==3.10.6
(or VENV_DIR/Scripts/pip3 on Windows).
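After installing, it can be useful to confirm that the environment really contains the versions used for evaluation. The following is a small sketch using only the standard library; the helper name `find_mismatches` and its `get_version` parameter are hypothetical and exist only to make the check easy to test.

```python
from importlib import metadata

# Versions stated above as used for evaluation.
EXPECTED_VERSIONS = {
    "scikit-learn": "1.7.2",
    "numpy": "2.3.3",
    "scipy": "1.16.2",
    "pandas": "2.3.2",
    "matplotlib": "3.10.6",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: installed_version_or_None} for packages that differ."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # The package is not installed at all.
        if installed != wanted:
            mismatches[package] = installed
    return mismatches


for package, installed in find_mismatches(EXPECTED_VERSIONS).items():
    print(f"{package}: expected {EXPECTED_VERSIONS[package]}, found {installed or 'nothing'}")
```

Run it with the interpreter of the environment you want to check, for example the one inside VENV_DIR; no output means all listed packages match.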
Windows installation
If python3 is not in PATH while the py command is, you can use py -m venv VENV_DIR, which uses the newest Python available, or for example py -3.11 -m venv VENV_DIR, which uses Python version 3.11.
Is it possible to keep the solutions in a Git repository?
Definitely. Keeping the solutions in a branch of your repository, where you merge them with the course repository, is probably a good idea. However, please keep the cloned repository with your solutions private.
On GitHub, do not create a public fork with your solutions
If you keep your solutions in a GitHub repository, please do not create a clone of the repository by using the Fork button – this way, the cloned repository would be public.
Of course, if you just want to create a pull request, GitHub requires a public fork and that is fine – just do not store your solutions in it.
How to clone the course repository?
To clone the course repository, run
git clone https://github.com/ufal/npfl129
This creates the repository in the npfl129 subdirectory; if you want a different name, add it as the last parameter.
To update the repository, run git pull inside the repository directory.
How to keep the course repository as a branch in your repository?
If you want to store the course repository just in a local branch of your existing repository, you can run the following commands while in it:
git remote add course_repo https://github.com/ufal/npfl129
git fetch course_repo
git checkout --track course_repo/master -b BRANCH_NAME
This creates a branch BRANCH_NAME, and when you run git pull in that branch, it will be updated to the current state of the course repository.
How to merge the course repository updates with your modified branch?
If you want to store your solutions in your branch and gradually update this branch to track the changes in the course repository, you should start by
git remote add course_repo https://github.com/ufal/npfl129
git fetch course_repo
git checkout --no-track course_repo/master -b BRANCH_NAME
which creates a branch BRANCH_NAME with the current state of the course repository. However, unlike in the previous case, git pull and git push in this branch will not operate on the course repository. Therefore, you can then commit to this branch and push it to your own repository.
To update your branch with the changes from the course repository, run
git fetch course_repo
git merge course_repo/master
while in your branch. Of course, it might be necessary to resolve conflicts if both you and I modified the same lines in the templates.
What files can be submitted to ReCodEx?
You can submit multiple files of any type to ReCodEx. There is a limit of 20 files per submission, with a total size of 20MB.
What file does ReCodEx execute and what arguments does it use?
Exactly one file with the py suffix must contain a line starting with def main(. Such a file is imported by ReCodEx and the main method is executed (during the import, __name__ == "__recodex__").
The file must also export an argument parser called parser. ReCodEx uses its arguments and default values, but it overwrites some of the arguments depending on the test being executed; the template should always indicate which arguments are set by ReCodEx and which are left intact.
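A file satisfying these requirements might look roughly like the sketch below. It is not one of the course templates: the `--seed` and `--count` arguments, the `main(args)` signature taking the parsed namespace, and the body of `main` are assumptions chosen only to show the overall shape (a module-level `parser` and a line starting with `def main(`).

```python
import argparse
import random

# Module-level argument parser; the text above says it must be called `parser`.
parser = argparse.ArgumentParser()
# Hypothetical arguments, for illustration only.
parser.add_argument("--seed", default=42, type=int, help="Random seed")
parser.add_argument("--count", default=5, type=int, help="How many numbers to draw")


def main(args: argparse.Namespace) -> float:
    # Placeholder computation: the mean of a few seeded random numbers.
    generator = random.Random(args.seed)
    values = [generator.random() for _ in range(args.count)]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    print(main(parser.parse_args()))
```

Because the file is imported with a different `__name__`, the final guard does not fire under ReCodEx; only the exported `parser` and `main` are used there.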
What are the time and memory limits?
The memory limit during evaluation is 1.5GB. The time limit varies, but it should be at least 10 seconds and at least twice the running time of my solution. For competition assignments, the time limit is 5 minutes.
To pass the exam, you need to obtain at least 60, 75, or 90 points out of a 100-point exam to receive grade 3, 2, or 1, respectively. The exam consists of questions worth 100 points in total, drawn from the list below (the questions are randomly generated, but in such a way that there is at least one question from every pair of lectures). In addition, you can get at most 40 surplus points from the practicals and at most 10 points for community work (i.e., fixing slides or reporting issues), but only the points you already have at the time of the exam count. You can take the exam without passing the practicals first.
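The point arithmetic can be summarized in a short sketch. It encodes one reading of the rules and is not an official calculator: it assumes the transferred surplus is the sum of regular and bonus practicals points above 70, capped at 40, and that surplus and community points are simply added to the exam score before the thresholds are applied. The function names are hypothetical.

```python
from typing import Optional


def transferred_points(regular: int, bonus: int) -> int:
    """Practicals points carried over to the exam (assumed interpretation)."""
    # Points above 70, counting both regular and bonus points, capped at 40.
    return min(40, max(0, regular + bonus - 70))


def exam_grade(exam_points: int, practicals_surplus: int = 0,
               community_points: int = 0) -> Optional[int]:
    """Return grade 1, 2, or 3, or None when the exam is not passed."""
    total = exam_points + min(practicals_surplus, 40) + min(community_points, 10)
    for threshold, grade in ((90, 1), (75, 2), (60, 3)):
        if total >= threshold:
            return grade
    return None


# 90 regular and 10 bonus points transfer 30 points; with 55 exam points
# the total is 85, which falls between the thresholds 75 and 90.
print(exam_grade(55, transferred_points(90, 10)))
```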
Lecture 1 Questions
Explain how reinforcement learning differs from supervised and unsupervised learning in terms of the type of input the learning algorithms use to improve model performance. [5]
Explain why we need separate training and test data. What is generalization, and how does the concept relate to underfitting and overfitting? [10]
Define the three key components of Mitchell's definition of machine learning (Task $T$, Performance measure $P$, and Experience $E$). Give a concrete example for each component in the context of email spam classification. [10]
Explain the difference between classification and regression tasks. For each task type, provide: (a) the mathematical representation of the target variable, (b) a real-world example, and (c) one appropriate evaluation metric. [10]
Define the prediction function of a linear regression model and write down the $L^2$-regularized mean squared error loss. [10]
Starting from the unregularized sum of squares error of a linear regression model, show how the explicit solution can be obtained, assuming $X^T X$ is invertible. [10]