Machine Learning for Greenhorns – Winter 2023/24

Machine learning is reaching notable success when solving complex tasks in many fields. This course serves as in introduction to basic machine learning concepts and techniques, focusing both on the theoretical foundation, and on implementation and utilization of machine learning algorithms in Python programming language. High attention is paid to the ability of application of the machine learning techniques on practical tasks, in which the students try to devise a solution with highest performance.

Python programming skills are required, together with basic probability theory knowledge.

About

Official name: Introduction to Machine Learning with Python
SIS code: NPFL129
Semester: winter
E-credits: 5
Examination: 2/2 C+Ex
Guarantors: Jindřich Libovický, Zdeněk Kasner, Tomáš Musil

Timespace Coordinates

  • lecture: Czech lecture is held on Tuesday 10:40 in S5, English lecture on Tuesday 17:20 in S9; first lecture is on Oct 03
  • practicals: Czech practicals are held on Tuesday 12:20 in S5, English practicals on Wednesday 9:00 in S9; first practicals are on Oct 03

Lectures

1. Introduction to Machine Learning Slides PDF Slides

License

Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.

The lecture content, including references to some additional study materials. The main study material is the Pattern Recognition and Machine Learning by Christopher Bishop, referred to as PRML.

Note that the topics in italics are not required for the exam.

1. Introduction to Machine Learning

 Oct 03 Slides PDF Slides

  • Introduction to machine learning
  • Basic definitions [Sections 1 and 1.1 of PRML]
  • Linear regression model [Section 3.1 of PRML]

Requirements

Environment

The tasks are evaluated automatically using the ReCodEx Code Examiner.

The evaluation is performed using Python 3.11, scikit-learn 1.3.0, numpy 1.24.2, scipy 1.11.2, pandas 2.1.0, and matplotlib 3.7.2. You should install the exact version of these packages yourselves.

Teamwork

Solving assignments in teams (of size at most 3) is encouraged, but everyone has to participate (it is forbidden not to work on an assignment and then submit a solution created by other team members). All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. Each such solution must explicitly list all members of the team to allow plagiarism detection using this template.

No Cheating

Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty. While discussing assignments with any classmate is fine, each team must complete the assignments themselves, without using code they did not write (unless explicitly allowed). Of course, inside a team you are expected to share code and submit identical solutions.

In the competitions, your goal is to train a model and then predict target values on the test set available only in ReCodEx.

Submitting to ReCodEx

When submitting a competition solution to ReCodEx, you should submit a trained model and a Python source capable of running it.

Furthermore, please also include the Python source and hyperparameters you used to train the submitted model. But be careful that there still must be exactly one Python source with a line starting with def main(.

Do not forget about the maximum allowed model size and time and memory limits.

Competition Evaluation

  • Before the deadline, ReCodEx prints the exact achieved score, but only if it is worse than the baseline.

    If you surpass the baseline, the assignment is marked as solved in ReCodEx and you immediately get regular points for the assignment. However, ReCodEx does not print the reached score.

  • After the competition deadline, the latest submission of every user surpassing the required baseline participates in a competition. Additional bonus points are then awarded according to the ordering of the performance of the participating submissions.

  • After the competition results announcement, ReCodEx starts to show the exact performance for all the already submitted solutions and also for the solutions submitted later.

What Is Allowed

  • You can use only the given annotated data, both for training and evaluation.
  • Additionally, you can use any unannotated or manually created data for training and evaluation.
  • The test set annotations must be the result of your system (so you cannot manually correct them; but your system can contain other parts than just trained models, like hand-written rules).
  • Do not use test set annotations in any way, if you somehow get access to them.
  • Unless stated otherwise, you can use any algorithm present in numpy or scipy, anything you implement yourself, and any pre/post-processing or ensembling methods in sklearn. Apart from the allowed algorithms, the implementation must be created by you and you must understand it fully. Do not use deep network frameworks like TensorFlow or PyTorch.

Install

  • Installing to central user packages repository

    You can install all required packages to central user packages repository using pip3 install --user scikit-learn==1.3.0 numpy==1.24.2 scipy==1.11.2 pandas==2.1.0 matplotlib==3.7.2.

  • Installing to a virtual environment

    Python supports virtual environments, which are directories containing independent sets of installed packages. You can create a virtual environment by running python3 -m venv VENV_DIR followed by VENV_DIR/bin/pip3 install scikit-learn==1.3.0 numpy==1.24.2 scipy==1.11.2 pandas==2.1.0 matplotlib==3.7.2 (or VENV_DIR/Scripts/pip3 on Windows).

  • Windows installation

    • On Windows, it can happen that python3 is not in PATH, while py command is – in that case you can use py -m venv VENV_DIR, which uses the newest Python available, or for example py -3.11 -m venv VENV_DIR, which uses Python version 3.11.

Git

  • Is it possible to keep the solutions in a Git repository?

    Definitely. Keeping the solutions in a branch of your repository, where you merge them with the course repository, is probably a good idea. However, please keep the cloned repository with your solutions private.

  • On GitHub, do not create a public fork with your solutions

    If you keep your solutions in a GitHub repository, please do not create a clone of the repository by using the Fork button – this way, the cloned repository would be public.

    Of course, if you just want to create a pull request, GitHub requires a public fork and that is fine – just do not store your solutions in it.

  • How to clone the course repository?

    To clone the course repository, run

    git clone https://github.com/ufal/npfl129
    

    This creates the repository in the npfl129 subdirectory; if you want a different name, add it as a last parameter.

    To update the repository, run git pull inside the repository directory.

  • How to keep the course repository as a branch in your repository?

    If you want to store the course repository just in a local branch of your existing repository, you can run the following command while in it:

    git remote add upstream https://github.com/ufal/npfl129
    git fetch upstream
    git checkout -t upstream/master
    

    This creates a branch master; if you want a different name, add -b BRANCH_NAME to the last command.

    In both cases, you can update your checkout by running git pull while in it.

  • How to merge the course repository with your modifications?

    If you want to store your solutions in a branch merged with the course repository, you should start by

    git remote add upstream https://github.com/ufal/npfl129
    git pull upstream master
    

    which creates a branch master; if you want a different name, change the last argument to master:BRANCH_NAME.

    You can then commit to this branch and push it to your repository.

    To merge the current course repository with your branch, run

    git merge upstream master
    

    while in your branch. Of course, it might be necessary to resolve conflicts if both you and I modified the same place in the templates.

ReCodEx

  • What files can be submitted to ReCodEx?

    You can submit multiple files of any type to ReCodEx. There is a limit of 20 files per submission, with a total size of 20MB.

  • What file does ReCodEx execute and what arguments does it use?

    Exactly one file with py suffix must contain a line starting with def main(. Such a file is imported by ReCodEx and the main method is executed (during the import, __name__ == "__recodex__").

    The file must also export an argument parser called parser. ReCodEx uses its arguments and default values, but it overwrites some of the arguments depending on the test being executed – the template should always indicate which arguments are set by ReCodEx and which are left intact.

  • What are the time and memory limits?

    The memory limit during evaluation is 1.5GB. The time limit varies, but it should be at least 10 seconds and at least twice the running time of my solution. For competition assignments, the time limit is 5 minutes.

Requirements

Exam Questions