Introduction to Machine Learning with Python – Winter 2025/26

Machine learning is achieving notable success in solving complex tasks in many fields. This course serves as an introduction to basic machine learning concepts and techniques, focusing both on the theoretical foundations and on the implementation and use of machine learning algorithms in the Python programming language. Considerable attention is paid to applying machine learning techniques to practical tasks, in which students try to devise the best-performing solution.

Python programming skills are required, together with basic probability theory knowledge.

About

Official name: Introduction to Machine Learning with Python
SIS code: NPFL129
Semester: winter
E-credits: 5
Examination: 2/2 C+Ex
Instructors: Jindřich Libovický (lecture), Jan Bronec, Tomáš Musil, Kristýna Onderková, Dušan Variš, Gianluca Vico (practicals), Milan Straka (assignments & ReCodEx), Jakub Klesa, Tymur Kotkov, Tymofii Shchetilin (teaching assistants)

This course is also part of the inter-university programme prg.ai Minor. It pools the best of AI education in Prague to provide students with a deeper and broader insight into the field of artificial intelligence. More information is available at prg.ai/minor.

Timespace Coordinates

  • lecture: English lecture is held on Monday 12:20 in S3, Czech lecture on Thursday 15:40 in S3; first lecture is on Sep 29
  • practicals: English practicals are held on Thursday 14:00 in S3, Czech practicals on Friday 9:00 in S3; first practicals are on Oct 2

All lectures and practicals will be recorded and available on this website.

Course Objectives

After this course students should…

  • Be able to reason about tasks/problems suitable for ML
    • Know when to use classification, regression and clustering
    • Be able to choose among these methods: Linear and Logistic Regression, Multilayer Perceptron, Nearest Neighbors, Naive Bayes, Gradient Boosted Decision Trees, $k$-means clustering
  • Think about learning as (mostly probabilistic) optimization on training data
    • Know how the ML methods learn, including the theoretical explanation
  • Know how to properly evaluate ML
    • Think about generalization (and avoiding overfitting)
    • Be able to choose a suitable evaluation metric
    • Responsibly decide which model is better
  • Be able to implement ML algorithms on a conceptual level
  • Be able to use Scikit-learn to solve ML problems in Python

Lectures

1. Introduction to Machine Learning Slides PDF Slides CS Lecture EN Lecture EN Practicals linear_regression_manual linear_regression_features Questions

License

Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.

This section lists the lecture content, including references to some additional study materials. The main study material is Pattern Recognition and Machine Learning by Christopher Bishop, referred to as PRML.

Note that the topics in italics are not required for the exam.

1. Introduction to Machine Learning

 Sep 29, Oct 4 Slides PDF Slides CS Lecture EN Lecture EN Practicals linear_regression_manual linear_regression_features Questions

Learning objectives. After the lecture you should be able to…

  • Explain to a non-expert what machine learning is.
  • Explain the difference between classification and regression.
  • Implement a simple linear-algebra-based algorithm for training linear regression.

Covered topics and where to find more:

  • Introduction to machine learning
  • Basic definitions [Sections 1 and 1.1 of PRML]
  • Linear regression model [Section 3.1 of PRML]

Requirements

To pass the practicals, you need to obtain at least 70 points, excluding the bonus points. Note that up to 40 points above 70 (both bonus and non-bonus) will be transferred to the exam. In total, assignments for at least 105 points (not including the bonus points) will be available.

Environment

The tasks are evaluated automatically using the ReCodEx Code Examiner.

The evaluation is performed using Python 3.11, scikit-learn 1.7.2, numpy 2.3.3, scipy 1.16.2, pandas 2.3.2, and matplotlib 3.10.6. You should install exactly these versions of the packages yourselves.
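One way to confirm that a local environment matches the versions listed above is a small check script. The following is only a sketch: the helper names `installed_version` and `find_mismatches` are hypothetical and not part of the course templates, and the check relies on the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Package versions quoted in the Environment section above.
EXPECTED = {
    "scikit-learn": "1.7.2",
    "numpy": "2.3.3",
    "scipy": "1.16.2",
    "pandas": "2.3.2",
    "matplotlib": "3.10.6",
}


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map every package whose version differs to a pair (expected, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{package}: expected {wanted}, found {found or 'not installed'}")
```

Running the script prints one line per package that is missing or installed in a different version, and prints nothing when everything matches.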

Teamwork

Solving assignments in teams (of size at most 3) is encouraged, but everyone has to participate (it is forbidden not to work on an assignment and then submit a solution created by other team members). All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. Each such solution must explicitly list all members of the team, using this template, to allow plagiarism detection.

No Cheating

Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty. While discussing assignments with any classmate is fine, each team must complete the assignments themselves, without using code they did not write (unless explicitly allowed). Of course, inside a team you are allowed to share code and submit identical solutions. Note that all students involved in cheating will be punished, so if you share your source code with a friend, both you and your friend will be punished. That also means that you should never publish your solutions.

linear_regression_manual

 Deadline: Oct 15, 22:00  3 points

Starting with the linear_regression_manual.py template, solve a linear regression problem using the algorithm from the lecture which explicitly computes the matrix inversion. Then compute root mean square error on the test set.

Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).

  1. python3 linear_regression_manual.py --test_size=0.1
52.38
  1. python3 linear_regression_manual.py --test_size=0.5
54.58
  1. python3 linear_regression_manual.py --test_size=0.9
59.46
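As a rough illustration of the technique named in this assignment, the sketch below fits linear regression by explicitly inverting a matrix, $\boldsymbol w = (\boldsymbol X^T \boldsymbol X)^{-1} \boldsymbol X^T \boldsymbol t$, and then reports the root mean square error on held-out data. It is not the course template: the synthetic data, the 80/20 split, and the function names `fit_linear_regression` and `rmse` are assumptions made only for this example.

```python
import numpy as np


def fit_linear_regression(X, t):
    """Solve the normal equations w = (X^T X)^{-1} X^T t with an explicit inverse."""
    return np.linalg.inv(X.T @ X) @ X.T @ t


def rmse(X, t, w):
    """Root mean square error of the predictions X @ w against the targets t."""
    return np.sqrt(np.mean((X @ w - t) ** 2))


# Synthetic data (an assumption for illustration): t = 3*x0 - 2*x1 + 1 + noise.
rng = np.random.default_rng(42)
features = rng.normal(size=(200, 2))
targets = features @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.1, size=200)

# Append a constant column of ones so that the last weight acts as the bias.
X = np.concatenate([features, np.ones((len(features), 1))], axis=1)

# Hold out the last 20 % of the examples as a test set.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
t_train, t_test = targets[:split], targets[split:]

weights = fit_linear_regression(X_train, t_train)
print("weights:", np.round(weights, 2))
print(f"test RMSE: {rmse(X_test, t_test, weights):.2f}")
```

The explicit inverse requires $\boldsymbol X^T \boldsymbol X$ to be invertible; in numerical practice a solver such as `np.linalg.solve` or `np.linalg.lstsq` is usually the more stable choice, but the inverse mirrors the formula most directly.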

linear_regression_features

 Deadline: Oct 15, 22:00  3 points

Starting with the linear_regression_features.py template, use scikit-learn to train a model of a 1D curve.

Try using a concatenation of features $x^1, x^2, \ldots, x^D$ for $D$ from 1 to a given range, and report RMSE of every such configuration.

Note that your results may be slightly different (because of varying floating point arithmetic on your CPU).

  1. python3 linear_regression_features.py --data_size=10 --test_size=5 --range=6
Maximum feature order 1: 0.74 RMSE
Maximum feature order 2: 1.87 RMSE
Maximum feature order 3: 0.53 RMSE
Maximum feature order 4: 4.52 RMSE
Maximum feature order 5: 1.70 RMSE
Maximum feature order 6: 2.82 RMSE

Test visualization

  1. python3 linear_regression_features.py --data_size=30 --test_size=20 --range=9
Maximum feature order 1: 0.56 RMSE
Maximum feature order 2: 1.53 RMSE
Maximum feature order 3: 1.10 RMSE
Maximum feature order 4: 0.28 RMSE
Maximum feature order 5: 1.60 RMSE
Maximum feature order 6: 3.09 RMSE
Maximum feature order 7: 3.92 RMSE
Maximum feature order 8: 65.11 RMSE
Maximum feature order 9: 3886.97 RMSE

Test visualization

  1. python3 linear_regression_features.py --data_size=50 --test_size=40 --range=9
Maximum feature order 1: 0.63 RMSE
Maximum feature order 2: 0.73 RMSE
Maximum feature order 3: 0.31 RMSE
Maximum feature order 4: 0.26 RMSE
Maximum feature order 5: 1.22 RMSE
Maximum feature order 6: 0.69 RMSE
Maximum feature order 7: 2.39 RMSE
Maximum feature order 8: 7.28 RMSE
Maximum feature order 9: 201.70 RMSE

Test visualization
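The idea of concatenating the powers $x^1, x^2, \ldots, x^D$ and comparing the test RMSE for each $D$ can be sketched with scikit-learn as follows. This is a hedged illustration rather than the assignment template: the noisy sine data, the split sizes, and the random seeds are assumptions, so the printed numbers will not match the outputs above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical 1D data: a noisy sine curve sampled at 50 points.
rng = np.random.default_rng(0)
x = np.linspace(0, 7, 50).reshape(-1, 1)
t = np.sin(x[:, 0]) + rng.normal(scale=0.2, size=len(x))

x_train, x_test, t_train, t_test = train_test_split(
    x, t, test_size=20, random_state=0)

results = {}
for order in range(1, 7):
    # For a single input column, degree=order without the bias column
    # yields exactly the features x^1, x^2, ..., x^order.
    poly = PolynomialFeatures(degree=order, include_bias=False)
    train_features = poly.fit_transform(x_train)
    test_features = poly.transform(x_test)

    model = LinearRegression().fit(train_features, t_train)
    rmse = np.sqrt(mean_squared_error(t_test, model.predict(test_features)))
    results[order] = rmse
    print(f"Maximum feature order {order}: {rmse:.2f} RMSE")
```

With few training examples, high feature orders tend to overfit, which is consistent with the rapidly growing RMSE values in the sample outputs above.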

In the competitions, your goal is to train a model and then predict target values on the test set available only in ReCodEx.

Submitting to ReCodEx

When submitting a competition solution to ReCodEx, you should submit a trained model and a Python source capable of running it.

Furthermore, please also include the Python source and hyperparameters you used to train the submitted model. However, there must still be exactly one Python source containing a line starting with def main(.

Do not forget about the maximum allowed model size and time and memory limits.

Competition Evaluation

  • Before the deadline, ReCodEx prints the exact achieved score, but only if it is worse than the baseline.

    If you surpass the baseline, the assignment is marked as solved in ReCodEx and you immediately get regular points for the assignment. However, ReCodEx does not print the reached score.

  • After the competition deadline, the latest submission of every user surpassing the required baseline participates in a competition. Additional bonus points are then awarded according to the ordering of the performance of the participating submissions.

  • After the competition results announcement, ReCodEx starts to show the exact performance for all the already submitted solutions and also for the solutions submitted later.

  • Each competition will be scored after the first deadline.

  • The bonus points will be computed in the following fashion:

    • Let $B$ be the maximal number of bonus points that can be achieved in the competition.

    • All of the solutions that surpass the baseline will be sorted and divided into $B+1$ groups of equal size.

    • Every solution in the top group gets $B$ points, the next group gets $B-1$ points, etc., the last group gets 0 bonus points.

    • The team solution only occupies one position in the table of the competition results.

  • Please, do not forget that every member of the team needs to upload the solution to ReCodEx and to submit both the training/prediction source code and the trained model itself.
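The bonus-point rule described above can be sketched in a few lines. The function name `bonus_points` is hypothetical, and two details are assumptions because the text does not specify them: how positions are split when the number of solutions is not divisible by $B+1$ (here proportionally), and how ties are broken (here by input order). Each entry in `scores` stands for one position in the results table, so a team contributes a single score.

```python
def bonus_points(scores, max_bonus, higher_is_better=True):
    """Return the bonus points for each score, in the same order as `scores`."""
    n = len(scores)
    groups = max_bonus + 1

    # Indices of the solutions ordered from best to worst.
    order = sorted(range(n), key=lambda i: scores[i], reverse=higher_is_better)

    points = [0] * n
    for rank, index in enumerate(order):
        # Assumption: ranks are split proportionally into B + 1 groups,
        # which gives groups of equal size whenever n is divisible by B + 1.
        group = rank * groups // n
        points[index] = max_bonus - group
    return points


# Six solutions above the baseline and B = 2, i.e. three groups of two.
print(bonus_points([0.90, 0.50, 0.80, 0.70, 0.95, 0.60], max_bonus=2))
```

For a metric where lower values are better, such as RMSE, pass `higher_is_better=False`.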

What Is Allowed

  • You can use only the given annotated data, both for training and evaluation.
  • Additionally, you can use any unannotated or manually created data for training and evaluation.
  • The test set annotations must be the result of your system (so you cannot manually correct them; but your system can contain other parts than just trained models, like hand-written rules).
  • Do not use test set annotations in any way, if you somehow get access to them.
  • You can use any method present in numpy or scipy, anything you implement yourself, and, unless specified otherwise in the assignment description, any method from sklearn. Furthermore, the solution must be created by you, and you must understand it fully. Do not use deep network frameworks like TensorFlow or PyTorch.

Install

  • Installing to the central user packages repository

    You can install all required packages to the central user packages repository using pip3 install --user scikit-learn==1.7.2 numpy==2.3.3 scipy==1.16.2 pandas==2.3.2 matplotlib==3.10.6.

  • Installing to a virtual environment

    Python supports virtual environments, which are directories containing independent sets of installed packages. You can create a virtual environment by running python3 -m venv VENV_DIR followed by VENV_DIR/bin/pip3 install scikit-learn==1.7.2 numpy==2.3.3 scipy==1.16.2 pandas==2.3.2 matplotlib==3.10.6 (or VENV_DIR/Scripts/pip3 on Windows).

  • Windows installation

    • On Windows, it can happen that python3 is not in PATH while the py command is; in that case you can use py -m venv VENV_DIR, which uses the newest Python available, or for example py -3.11 -m venv VENV_DIR, which uses Python version 3.11.

Git

  • Is it possible to keep the solutions in a Git repository?

    Definitely. Keeping the solutions in a branch of your repository, where you merge them with the course repository, is probably a good idea. However, please keep the cloned repository with your solutions private.

  • On GitHub, do not create a public fork with your solutions

    If you keep your solutions in a GitHub repository, please do not create a clone of the repository by using the Fork button – this way, the cloned repository would be public.

    Of course, if you just want to create a pull request, GitHub requires a public fork and that is fine – just do not store your solutions in it.

  • How to clone the course repository?

    To clone the course repository, run

    git clone https://github.com/ufal/npfl129
    

    This creates the repository in the npfl129 subdirectory; if you want a different name, add it as a last parameter.

    To update the repository, run git pull inside the repository directory.

  • How to keep the course repository as a branch in your repository?

    If you want to store the course repository just in a local branch of your existing repository, you can run the following commands while in it:

    git remote add course_repo https://github.com/ufal/npfl129
    git fetch course_repo
    git checkout --track course_repo/master -b BRANCH_NAME
    

    This creates a branch BRANCH_NAME, and when you run git pull in that branch, it will be updated to the current state of the course repository.

  • How to merge the course repository updates with your modified branch?

    If you want to store your solutions in your branch and gradually update this branch to track the changes in the course repository, you should start by

    git remote add course_repo https://github.com/ufal/npfl129
    git fetch course_repo
    git checkout --no-track course_repo/master -b BRANCH_NAME
    

    which creates a branch BRANCH_NAME with the current state of the course repository. However, unlike in the previous case, git pull and git push in this branch will not operate on the course repository. Therefore, you can then commit to this branch and push it to your own repository.

    To update your branch with the changes from the course repository, run

    git fetch course_repo
    git merge course_repo/master
    

    while in your branch. Of course, it might be necessary to resolve conflicts if both you and I modified the same lines in the templates.

ReCodEx

  • What files can be submitted to ReCodEx?

    You can submit multiple files of any type to ReCodEx. There is a limit of 20 files per submission, with a total size of 20MB.

  • What file does ReCodEx execute and what arguments does it use?

    Exactly one file with the .py suffix must contain a line starting with def main(. Such a file is imported by ReCodEx and its main function is executed (during the import, __name__ == "__recodex__").

    The file must also export an argument parser called parser. ReCodEx uses its arguments and default values, but it overwrites some of the arguments depending on the test being executed – the template should always indicate which arguments are set by ReCodEx and which are left intact.

  • What are the time and memory limits?

    The memory limit during evaluation is 1.5GB. The time limit varies, but it should be at least 10 seconds and at least twice the running time of my solution. For competition assignments, the time limit is 5 minutes.
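The entry-point convention described above (a module-level `parser` and a line starting with `def main(`) might look roughly like the following skeleton. It is a sketch only: the arguments `--seed` and `--test_size` and the placeholder computation are hypothetical, and the actual templates define their own arguments and indicate which ones ReCodEx overrides.

```python
import argparse

# ReCodEx expects the file to export an argument parser called `parser`.
parser = argparse.ArgumentParser()
# Hypothetical arguments for illustration; real templates define their own.
parser.add_argument("--seed", default=42, type=int, help="Random seed")
parser.add_argument("--test_size", default=0.5, type=float, help="Test set size")


def main(args: argparse.Namespace) -> float:
    # Placeholder computation standing in for the actual solution.
    return round(args.test_size * 100, 2)


if __name__ == "__main__":
    # This branch runs only when the file is executed directly. When ReCodEx
    # imports the file, __name__ is "__recodex__", so the branch is skipped
    # and ReCodEx calls main() itself.
    main_args = parser.parse_args()
    print(main(main_args))
```

Keeping all work inside `main` and leaving only argument parsing and printing under the `__name__` guard means the same file behaves identically when run locally and when imported by the evaluator.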

Requirements

To pass the exam, you need to obtain at least 60, 75, or 90 points on the 100-point exam to receive a grade of 3, 2, or 1, respectively. The exam consists of questions worth 100 points in total, drawn from the list below (the questions are randomly generated, but in such a way that there is at least one question from every pair of lectures). In addition, you can get at most 40 surplus points from the practicals and at most 10 points for community work (i.e., fixing slides or reporting issues), but only the points you already have at the time of the exam count. You can take the exam without passing the practicals first.
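The point arithmetic above can be illustrated with a short sketch. The function names `exam_total` and `grade` are hypothetical, and the code rests on one reading of the rules that should be treated as an assumption: `practicals_points` is the total that counts toward the transfer (bonus and non-bonus), and only the part above 70, capped at 40, is added to the exam.

```python
def exam_total(exam_points, practicals_points, community_points=0):
    """Exam points plus capped surplus from practicals and community work."""
    # Assumption: up to 40 points above the 70-point threshold are transferred.
    surplus = min(40, max(0, practicals_points - 70))
    # At most 10 points are awarded for community work.
    community = min(10, max(0, community_points))
    return exam_points + surplus + community


def grade(total):
    """Return grade 1, 2, or 3, or None when the total is below 60 points."""
    if total >= 90:
        return 1
    if total >= 75:
        return 2
    if total >= 60:
        return 3
    return None


total = exam_total(exam_points=55, practicals_points=85, community_points=3)
print(total, grade(total))
```

In this example, 15 surplus points from the practicals and 3 community points are added to the 55 exam points.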

Exam Questions

Lecture 1 Questions

  1. Explain how reinforcement learning differs from supervised and unsupervised learning in terms of the type of input the learning algorithms use to improve model performance. [5]

  2. Explain why we need separate training and test data. What is generalization, and how does the concept relate to underfitting and overfitting? [10]

  3. Define the three key components of Mitchell's definition of machine learning (Task $T$, Performance measure $P$, and Experience $E$). Give a concrete example for each component in the context of email spam classification. [10]

  4. Explain the difference between classification and regression tasks. For each task type, provide: (a) the mathematical representation of the target variable, (b) a real-world example, and (c) one appropriate evaluation metric. [10]

  5. Define the prediction function of a linear regression model and write down the $L^2$-regularized mean squared error loss. [10]

  6. Starting from the unregularized sum of squares error of a linear regression model, show how the explicit solution can be obtained, assuming $\boldsymbol X^T \boldsymbol X$ is invertible. [10]