The objective of this course is to provide a comprehensive introduction to deep reinforcement learning, a powerful paradigm that combines reinforcement learning with deep neural networks. This approach has demonstrated super-human capabilities in diverse domains, including mastering complex games like Go and chess, optimizing real-world systems like datacenter cooling, improving chip design, automatically discovering superior algorithms and neural network architectures, and advancing robotics and large language models.
The course focuses both on the theory, spanning from fundamental concepts to recent advancements, and on practical implementations in Python and PyTorch (students implement and train agents controlling robots, mastering video games, and planning in complex board games). Basic programming and deep learning skills are expected (for example from the Deep Learning course).
Students work either individually or in small teams on weekly assignments, including competition tasks, where the goal is to obtain the highest performance in the class.
Optionally, you can obtain a micro-credential after passing the course.
SIS code: NPFL139
Semester: summer
E-credits: 8
Examination: 3/4 C+Ex
Guarantor: Milan Straka
All lectures and practicals will be recorded and available on this website.
1. Introduction to Reinforcement Learning Slides PDF Slides Lecture MonteCarlo Practicals Questions bandits monte_carlo
2. Value and Policy Iteration, Monte Carlo, Temporal Difference Slides PDF Slides Lecture TD, Q-learning Practicals Questions policy_iteration policy_iteration_exact q_learning
3. Off-Policy Methods, N-step Methods Slides PDF Slides Lecture Practicals Questions importance_sampling td_algorithms lunar_lander
4. Function Approximation, Deep Q Network, Rainbow Slides PDF Slides Lecture Practicals Questions q_learning_tiles q_network car_racing
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
A micro-credential (aka micro-certificate) is a digital certificate attesting that you have gained knowledge and skills in a specific area. It should be internationally recognized and verifiable using an online EU-wide verification system.
A micro-credential can be obtained both by university students and external participants. You can no longer enroll in this year's course; a new run will open in February 2027.
If you are not a university student, you could have applied to the Reinforcement Learning micro-credential course and then attended the course alongside the university students. Upon successfully passing the course, a micro-credential is issued.
The price of the course is 5 000 Kč. If you require a tax receipt, please inform Magdaléna Kokešová within three business days after the payment.
The lectures run for 14 weeks from Feb 17 to May 22, with the examination period continuing until the end of September. Please note that the organization of the course and the setup instructions will be described at the first lecture; if you have already applied, you do not need to do anything else until that time.
If you have passed the course (in academic year 2025/26 or later) as a part of your study plan, you can obtain a micro-credential by paying only an administrative fee of 300 Kč; if you passed the course but it is not in your study plan, the administrative fee is 500 Kč. Detailed instructions on how to obtain the micro-credential will be sent to the course participants during the examination period.
The lecture content, including references to study materials.
The main study material is Reinforcement Learning: An Introduction, Second Edition by Richard S. Sutton and Andrew G. Barto (referred to as RLB). It is available online and also as a hardcopy.
References to study materials cover all theory required at the exam, and sometimes even more – the references in italics cover topics not required for the exam.
Feb 17 Slides PDF Slides Lecture MonteCarlo Practicals Questions bandits monte_carlo
Feb 24 Slides PDF Slides Lecture TD, Q-learning Practicals Questions policy_iteration policy_iteration_exact q_learning
Mar 3 Slides PDF Slides Lecture Practicals Questions importance_sampling td_algorithms lunar_lander
Mar 10 Slides PDF Slides Lecture Practicals Questions q_learning_tiles q_network car_racing
To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that all surplus points (both bonus and non-bonus) will be transferred to the exam. In total, assignments for at least 120 points (not including the bonus points) will be available, and if you solve all the assignments (any non-zero amount of points counts as solved), you automatically pass the exam with grade 1.
The tasks are evaluated automatically using the ReCodEx Code Examiner.
The evaluation is performed using Python 3.11, Gymnasium, and PyTorch. You should install the exact versions of these packages yourself.
Solving assignments in teams (of size at most 3) is encouraged, but everyone has to participate (it is forbidden not to work on an assignment and then submit a solution created by other team members). All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. Each such solution must explicitly list all members of the team to allow plagiarism detection using this template.
Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty. While discussing assignments with any classmate is fine, each team must complete the assignments themselves, without using code they did not write (unless explicitly allowed). Of course, inside a team you are allowed to share code and submit identical solutions. Note that all students involved in cheating will be punished, so if you share your source code with a friend, both you and your friend will be punished. That also means that you should never publish your solutions.
Relying blindly on AI during learning seems to have a negative¹ effect² on skill acquisition. Therefore, you are not allowed to directly copy the assignment descriptions to GenAI, and you are not allowed to directly use or copy-paste source code generated by GenAI. However, discussing general concepts and discussing your manually written code with GenAI is fine.
Deadline: Mar 04, 22:00 3 points
Implement the ε-greedy strategy for solving multi-armed bandits.
Start with the bandits.py template, which defines the MultiArmedBandits environment with the following three methods:
reset(): reset the environment
step(action) → reward: perform the chosen action in the environment, obtaining a reward
greedy(epsilon): return True with probability 1-epsilon
Your goal is to implement the following solution variants:
alpha=0: perform ε-greedy search, updating the estimates using averaging
alpha≠0: perform ε-greedy search, updating the estimates using a fixed learning rate alpha
Note that the initial estimates should be set to a given value, and epsilon can be zero, in which case purely greedy actions are used.
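For orientation, here is a minimal sketch of the two variants; the environment interface follows the description above, while the number of arms, the episode count, and the function name are illustrative, not the template's.

```python
import numpy as np

def run_bandit(env, n_arms, episodes, epsilon, alpha, initial):
    # Sketch only: `n_arms` and `episodes` are illustrative parameters;
    # the environment follows the MultiArmedBandits interface above.
    estimates = np.full(n_arms, initial, dtype=np.float64)
    counts = np.zeros(n_arms, dtype=np.int64)

    env.reset()
    for _ in range(episodes):
        if env.greedy(epsilon):
            action = int(np.argmax(estimates))   # greedy action
        else:
            action = np.random.randint(n_arms)   # exploratory action
        reward = env.step(action)
        counts[action] += 1
        if alpha == 0:
            # Averaging: step size 1/N(a) keeps the sample mean of rewards.
            estimates[action] += (reward - estimates[action]) / counts[action]
        else:
            # Fixed learning rate alpha (exponential recency-weighted average).
            estimates[action] += alpha * (reward - estimates[action])
    return estimates
```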
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 bandits.py --alpha=0 --epsilon=0.1 --initial=0
1.39 0.08
python3 bandits.py --alpha=0 --epsilon=0 --initial=1
1.48 0.22
python3 bandits.py --alpha=0.15 --epsilon=0.1 --initial=0
1.37 0.09
python3 bandits.py --alpha=0.15 --epsilon=0 --initial=1
1.52 0.04
Deadline: Mar 04, 22:00 4 points
Solve the discretized CartPole-v1 environment
from the Gymnasium library using the Monte Carlo
reinforcement learning algorithm. The gymnasium environments have the following
methods and properties:
observation_space: the description of the environment observations
action_space: the description of the environment actions
reset() → new_state, info: starts a new episode, returning the new state and additional environment-specific information
step(action) → new_state, reward, terminated, truncated, info: performs the chosen action in the environment, returning the new state, the obtained reward, boolean flags indicating a terminal state and episode truncation, and additional environment-specific information
We additionally extend the Gymnasium environment by:
episode: the number of the current episode (zero-based)
reset(start_evaluation=False) → new_state, info: if start_evaluation is True, an evaluation is started
Once you finish training (which you indicate by passing start_evaluation=True to reset), your goal is to reach an average return of 490 during 100 evaluation episodes. Note that the environment prints your 100-episode average return every 10 episodes even during training.
Start with the monte_carlo.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
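A condensed sketch of every-visit Monte Carlo control with an ε-greedy policy is shown below; the environment interface matches the description above, but the state/action counts, the exploration rate, and the loop structure are placeholders rather than the template's actual code.

```python
import numpy as np

def monte_carlo_control(env, n_states, n_actions, train_episodes, epsilon, gamma=1.0):
    # Sketch: n_states, n_actions, and epsilon are illustrative placeholders.
    Q = np.zeros((n_states, n_actions))
    returns_count = np.zeros((n_states, n_actions))

    for _ in range(train_episodes):
        # Generate one episode with the current epsilon-greedy policy.
        episode, done = [], False
        state, info = env.reset()
        while not done:
            if np.random.uniform() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, info = env.step(action)
            episode.append((state, action, reward))
            done = terminated or truncated
            state = next_state

        # Every-visit update: average the observed returns for each (state, action).
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            returns_count[state, action] += 1
            Q[state, action] += (G - Q[state, action]) / returns_count[state, action]
    return Q
```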
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
Deadline: Mar 11, 22:00 2 points
Consider the following gridworld:
Start with policy_iteration.py, which implements the gridworld mechanics by providing the following methods:
GridWorld.states: returns the number of states (11)
GridWorld.actions: returns the number of actions (4)
GridWorld.action_labels: returns a list with labels of the actions (["↑", "→", "↓", "←"])
GridWorld.step(state, action): returns the possible outcomes of performing the action in a given state, as a list of triples containing
probability: the probability of the outcome
reward: the reward of the outcome
new_state: the new state of the outcome
Implement the policy iteration algorithm, with --steps steps of policy evaluation/policy improvement. During policy evaluation, use the current value function and perform --iterations applications of the Bellman equation. Perform the policy evaluation asynchronously (i.e., update the value function in-place, processing the states in order). Assume the initial policy is “go North” and the initial value function is zero.
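A compact sketch of the required loop, assuming the GridWorld interface described above (the template's argument handling is omitted):

```python
import numpy as np

def policy_iteration(env, gamma, iterations, steps):
    # Sketch; `env` follows the GridWorld interface described above.
    value = np.zeros(env.states)
    policy = np.zeros(env.states, dtype=np.int32)  # initial policy "go North" (action 0, "↑")

    for _ in range(steps):
        # Policy evaluation: `iterations` asynchronous (in-place) Bellman updates.
        for _ in range(iterations):
            for s in range(env.states):
                value[s] = sum(p * (r + gamma * value[s_next])
                               for p, r, s_next in env.step(s, policy[s]))
        # Policy improvement: act greedily with respect to the current value function.
        for s in range(env.states):
            returns = [sum(p * (r + gamma * value[s_next])
                           for p, r, s_next in env.step(s, a))
                       for a in range(env.actions)]
            policy[s] = int(np.argmax(returns))
    return policy, value
```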
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -10.00← -10.95↑
0.00↑ 0.00← -7.50← -88.93←
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=2
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -8.31← -11.83←
0.00↑ 0.00← -1.50← -20.61←
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=3
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -6.46← -6.77←
0.00↑ 0.00← -0.76← -13.08↓
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=10
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -1.04← -0.83←
0.00↑ 0.00← -0.11→ -0.34↓
python3 policy_iteration.py --gamma=0.95 --iterations=10 --steps=10
11.93↓ 11.19← 10.47← 6.71↑
12.83↓ 10.30← 10.12←
13.70→ 14.73→ 15.72→ 16.40↓
python3 policy_iteration.py --gamma=1 --iterations=1 --steps=100
74.73↓ 74.50← 74.09← 65.95↑
75.89↓ 72.63← 72.72←
77.02→ 78.18→ 79.31→ 80.16↓
Deadline: Mar 11, 22:00 2 points
Starting with policy_iteration_exact.py,
extend the policy_iteration assignment to perform policy evaluation
exactly by solving a system of linear equations. Note that you need to
use 64-bit floats because lower precision results in unacceptable error.
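For a fixed policy, the Bellman equation v = r_π + γ P_π v becomes the linear system (I − γ P_π) v = r_π, which can be solved directly. A sketch of the exact evaluation step in 64-bit floats, using the GridWorld interface from the previous assignment:

```python
import numpy as np

def evaluate_policy_exactly(env, policy, gamma):
    # Build the linear system (I - gamma * P_pi) v = r_pi in float64.
    A = np.eye(env.states, dtype=np.float64)
    b = np.zeros(env.states, dtype=np.float64)
    for s in range(env.states):
        for p, r, s_next in env.step(s, policy[s]):
            A[s, s_next] -= gamma * p   # transition term of the Bellman equation
            b[s] += p * r               # expected immediate reward
    return np.linalg.solve(A, b)
```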
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 policy_iteration_exact.py --gamma=0.95 --steps=1
-0.00↑ -0.00↑ -0.00↑ -0.00↑
-0.00↑ -12.35← -12.35↑
-0.85← -8.10← -19.62← -100.71←
python3 policy_iteration_exact.py --gamma=0.95 --steps=2
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00← -11.05←
-0.00↑ -0.00↑ -0.00← -12.10↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=3
-0.00↑ 0.00↑ 0.00↑ 0.00↑
-0.00↑ -0.00← 0.69←
-0.00↑ -0.00↑ -0.00→ 6.21↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=4
-0.00↑ 0.00↑ 0.00↓ 0.00↑
-0.00↓ 5.91← 6.11←
0.65→ 6.17→ 14.93→ 15.99↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=5
2.83↓ 4.32→ 8.09↓ 5.30↑
12.92↓ 9.44← 9.35←
13.77→ 14.78→ 15.76→ 16.53↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=6
11.75↓ 8.15← 8.69↓ 5.69↑
12.97↓ 9.70← 9.59←
13.82→ 14.84→ 15.82→ 16.57↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=7
12.12↓ 11.37← 9.19← 6.02↑
13.01↓ 9.92← 9.79←
13.87→ 14.89→ 15.87→ 16.60↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=8
12.24↓ 11.49← 10.76← 7.05↑
13.14↓ 10.60← 10.42←
14.01→ 15.04→ 16.03→ 16.71↓
python3 policy_iteration_exact.py --gamma=0.9999 --steps=5
7385.23↓ 7392.62→ 7407.40↓ 7400.00↑
7421.37↓ 7411.10← 7413.16↓
7422.30→ 7423.34→ 7424.27→ 7425.84↓
Deadline: Mar 11, 22:00 4 points
Solve the discretized MountainCar-v0 environment from the Gymnasium library using the Q-learning reinforcement learning algorithm. Note that this task still does not require PyTorch.
The environment methods and properties are described in the monte_carlo assignment.
Once you finish training (which you indicate by passing start_evaluation=True
to reset), your goal is to reach an average return of -150 during 100
evaluation episodes.
You can start with the q_learning.py template, which parses several useful parameters, creates the environment, and illustrates the overall usage. Note that setting the hyperparameters of Q-learning is a bit tricky – I usually start with a larger value of ε (like 0.2 or even 0.5) and then gradually decrease it to almost zero.
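The core tabular Q-learning update is compact; below is a sketch under the environment interface described earlier, with illustrative (not prescribed) hyperparameter names and no exploration decay.

```python
import numpy as np

def q_learning(env, n_states, n_actions, train_episodes, alpha, gamma, epsilon):
    # Sketch: parameter names are illustrative, not the template's.
    Q = np.zeros((n_states, n_actions))
    for _ in range(train_episodes):
        state, info = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behavior policy.
            if np.random.uniform() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, info = env.step(action)
            # Q-learning target: bootstrap with the maximum over next actions.
            target = reward + (0 if terminated else gamma * Q[next_state].max())
            Q[state, action] += alpha * (target - Q[state, action])
            state, done = next_state, terminated or truncated
        # Here epsilon (and possibly alpha) would typically be decayed.
    return Q
```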
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
Deadline: Mar 18, 22:00 2 points
Using the FrozenLake-v1 environment, implement Monte Carlo weighted importance sampling to estimate the state value function of a target policy, which uniformly chooses either action 1 (down) or action 2 (right), using a behavior policy, which uniformly chooses among all four actions.
Start with the importance_sampling.py template, which creates the environment and generates episodes according to the behavior policy.
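A sketch of the weighted importance sampling estimate of the state values; the episode format (a list of (state, action, reward) triples generated by the behavior policy) and the policy probabilities follow the description above, while the function name and arguments are illustrative.

```python
import numpy as np

def estimate_values(episodes, n_states):
    # Sketch; `episodes` is a list of behavior-policy episodes, each a list
    # of (state, action, reward) triples.
    V = np.zeros(n_states)
    C = np.zeros(n_states)  # cumulative importance-sampling weights per state

    for episode in episodes:
        G, W = 0.0, 1.0
        for state, action, reward in reversed(episode):
            G += reward  # undiscounted return
            # Target policy: uniform over actions 1 and 2; behavior: uniform over all 4.
            pi = 0.5 if action in (1, 2) else 0.0
            W *= pi / 0.25
            if W == 0:
                break  # the rest of the (earlier) trajectory has zero weight
            C[state] += W
            V[state] += W / C[state] * (G - V[state])
    return V
```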
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 importance_sampling.py --episodes=200
0.00 0.00 0.24 0.32
0.00 0.00 0.40 0.00
0.00 0.00 0.20 0.00
0.00 0.00 0.22 0.00
python3 importance_sampling.py --episodes=5000
0.03 0.00 0.01 0.03
0.04 0.00 0.09 0.00
0.10 0.24 0.23 0.00
0.00 0.44 0.49 0.00
python3 importance_sampling.py --episodes=50000
0.03 0.02 0.05 0.01
0.13 0.00 0.07 0.00
0.21 0.33 0.36 0.00
0.00 0.35 0.76 0.00
Deadline: Mar 18, 22:00 4 points
Starting with the td_algorithms.py template, implement all of the following n-step TD method variants for solving the Taxi-v3 environment: Sarsa, Expected Sarsa, and Tree Backup, each in an on-policy and an off-policy version.
Note that while the test and example outputs just show mean 100-episode returns,
ReCodEx compares the action-value function you return from main to the
reference one.
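As one concrete member of the family, here is a sketch of the on-policy n-step Sarsa update for a single updated time step τ; the buffer layout (states[t], actions[t] hold S_t, A_t and rewards[t] holds R_{t+1}) is an assumption, not the template's convention.

```python
def n_step_sarsa_update(Q, states, actions, rewards, tau, n, T, gamma, alpha):
    # Sketch: T is the terminal time step, tau the step whose estimate is updated.
    steps = min(n, T - tau)                              # truncate at the episode end
    G = sum(gamma ** i * rewards[tau + i] for i in range(steps))
    if tau + n < T:                                      # bootstrap if not at the end
        G += gamma ** n * Q[states[tau + n], actions[tau + n]]
    Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
```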
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 td_algorithms.py --episodes=10 --mode=sarsa --n=1
Episode 10, mean 100-episode return -652.70 +-37.77
python3 td_algorithms.py --episodes=10 --mode=sarsa --n=1 --off_policy
Episode 10, mean 100-episode return -632.90 +-126.41
python3 td_algorithms.py --episodes=10 --mode=sarsa --n=4
Episode 10, mean 100-episode return -715.70 +-156.56
python3 td_algorithms.py --episodes=10 --mode=sarsa --n=4 --off_policy
Episode 10, mean 100-episode return -649.10 +-171.73
python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=1
Episode 10, mean 100-episode return -641.90 +-122.11
python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=1 --off_policy
Episode 10, mean 100-episode return -633.80 +-63.61
python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=4
Episode 10, mean 100-episode return -713.90 +-107.05
python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=4 --off_policy
Episode 10, mean 100-episode return -648.20 +-107.08
python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=1
Episode 10, mean 100-episode return -641.90 +-122.11
python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=1 --off_policy
Episode 10, mean 100-episode return -633.80 +-63.61
python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=4
Episode 10, mean 100-episode return -663.50 +-111.78
python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=4 --off_policy
Episode 10, mean 100-episode return -708.50 +-125.63
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 td_algorithms.py --mode=sarsa --n=1
Episode 200, mean 100-episode return -235.23 +-92.94
Episode 400, mean 100-episode return -133.18 +-98.63
Episode 600, mean 100-episode return -74.19 +-70.39
Episode 800, mean 100-episode return -41.84 +-54.53
Episode 1000, mean 100-episode return -31.96 +-52.14
python3 td_algorithms.py --mode=sarsa --n=1 --off_policy
Episode 200, mean 100-episode return -227.81 +-91.62
Episode 400, mean 100-episode return -131.29 +-90.07
Episode 600, mean 100-episode return -65.35 +-64.78
Episode 800, mean 100-episode return -34.65 +-44.93
Episode 1000, mean 100-episode return -8.70 +-25.74
python3 td_algorithms.py --mode=sarsa --n=4
Episode 200, mean 100-episode return -277.55 +-146.18
Episode 400, mean 100-episode return -87.11 +-152.12
Episode 600, mean 100-episode return -6.95 +-23.28
Episode 800, mean 100-episode return -1.88 +-19.21
Episode 1000, mean 100-episode return 0.97 +-11.76
python3 td_algorithms.py --mode=sarsa --n=4 --off_policy
Episode 200, mean 100-episode return -339.11 +-144.40
Episode 400, mean 100-episode return -172.44 +-176.79
Episode 600, mean 100-episode return -36.23 +-100.93
Episode 800, mean 100-episode return -22.43 +-81.29
Episode 1000, mean 100-episode return -3.95 +-17.78
python3 td_algorithms.py --mode=expected_sarsa --n=1
Episode 200, mean 100-episode return -223.35 +-102.16
Episode 400, mean 100-episode return -143.82 +-96.71
Episode 600, mean 100-episode return -79.92 +-68.88
Episode 800, mean 100-episode return -38.53 +-47.12
Episode 1000, mean 100-episode return -17.41 +-31.26
python3 td_algorithms.py --mode=expected_sarsa --n=1 --off_policy
Episode 200, mean 100-episode return -231.91 +-87.72
Episode 400, mean 100-episode return -136.19 +-94.16
Episode 600, mean 100-episode return -79.65 +-70.75
Episode 800, mean 100-episode return -35.42 +-44.91
Episode 1000, mean 100-episode return -11.79 +-23.46
python3 td_algorithms.py --mode=expected_sarsa --n=4
Episode 200, mean 100-episode return -263.10 +-161.97
Episode 400, mean 100-episode return -102.52 +-162.03
Episode 600, mean 100-episode return -7.13 +-24.53
Episode 800, mean 100-episode return -1.69 +-12.21
Episode 1000, mean 100-episode return -1.53 +-11.04
python3 td_algorithms.py --mode=expected_sarsa --n=4 --off_policy
Episode 200, mean 100-episode return -376.56 +-116.08
Episode 400, mean 100-episode return -292.35 +-166.14
Episode 600, mean 100-episode return -173.83 +-194.11
Episode 800, mean 100-episode return -89.57 +-153.70
Episode 1000, mean 100-episode return -54.60 +-127.73
python3 td_algorithms.py --mode=tree_backup --n=1
Episode 200, mean 100-episode return -223.35 +-102.16
Episode 400, mean 100-episode return -143.82 +-96.71
Episode 600, mean 100-episode return -79.92 +-68.88
Episode 800, mean 100-episode return -38.53 +-47.12
Episode 1000, mean 100-episode return -17.41 +-31.26
python3 td_algorithms.py --mode=tree_backup --n=1 --off_policy
Episode 200, mean 100-episode return -231.91 +-87.72
Episode 400, mean 100-episode return -136.19 +-94.16
Episode 600, mean 100-episode return -79.65 +-70.75
Episode 800, mean 100-episode return -35.42 +-44.91
Episode 1000, mean 100-episode return -11.79 +-23.46
python3 td_algorithms.py --mode=tree_backup --n=4
Episode 200, mean 100-episode return -270.51 +-134.35
Episode 400, mean 100-episode return -64.27 +-109.50
Episode 600, mean 100-episode return -1.80 +-13.34
Episode 800, mean 100-episode return -0.22 +-13.14
Episode 1000, mean 100-episode return 0.60 +-9.37
python3 td_algorithms.py --mode=tree_backup --n=4 --off_policy
Episode 200, mean 100-episode return -248.56 +-147.74
Episode 400, mean 100-episode return -68.60 +-126.13
Episode 600, mean 100-episode return -6.25 +-32.23
Episode 800, mean 100-episode return -0.53 +-11.82
Episode 1000, mean 100-episode return 2.33 +-8.35
Deadline: Mar 18, 22:00 5 points + 5 bonus
Solve the LunarLander-v3 environment
from the Gymnasium library. Note that this task does not require PyTorch. You can play the game interactively yourself by running the python3 -m npfl139.play.lunar_lander command.
The environment methods and properties are described in the monte_carlo assignment,
but include one additional method:
expert_episode(seed=None) → episode: This method generates one expert
trajectory, where episode is a list of triples (state, action, reward),
where the action and reward are None for the terminal state.
If a seed is given, the expert trajectory random generator is reset before
generating the trajectory.
You cannot change the implementation of this method or use its internals in
any way other than just calling expert_episode(). Furthermore,
you can use this method only during training, not during evaluation.
To pass the task, you need to reach an average return of 0 during 1000 evaluation episodes. During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 15 minutes.
The task is additionally a competition, and at most 5 points will be awarded according to the relative ordering of your solutions.
You can start with the lunar_lander.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
In the competition, you should treat the meaning of the environment states as unknown, so you cannot use any knowledge of how they are created. However, you can learn any such information from the data.
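One simple way to exploit the expert trajectories during training is to replay them through whatever update rule your agent uses; a sketch with a hypothetical update_fn callback (not part of the template):

```python
def replay_expert_episodes(env, update_fn, episodes=100):
    # Sketch: update_fn(state, action, reward, next_state, done) stands for
    # whatever update your agent uses; expert_episode is described above.
    for _ in range(episodes):
        episode = env.expert_episode()
        for (state, action, reward), (next_state, next_action, _) in zip(episode, episode[1:]):
            done = next_action is None  # the last triple is the terminal state
            update_fn(state, action, reward, next_state, done)
```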
Deadline: Mar 25, 22:00 3 points
Improve the q_learning task performance on the
MountainCar-v0 environment
using linear function approximation with tile coding.
Your goal is to reach an average reward of -110 during 100 evaluation episodes.
The environment methods are described in the q_learning assignment, with
the following changes:
The state returned by the env.step method is a list containing the weight indices of the current state (i.e., the feature vector of the state consists of zeros and ones, and only the indices of the ones are returned). The action-value function is therefore approximated as a sum of the weights whose indices are returned by env.step.
env.observation_space.nvec returns a list, where the i-th element is the number of weights used by the first i elements of state. Notably, env.observation_space.nvec[-1] is the total number of weights.
You can start with the q_learning_tiles.py template, which parses several useful parameters and creates the environment.
Implementing Q-learning is enough to pass the assignment, even if both N-step
Sarsa and Tree Backup converge a little faster. The default number of tiles in
tile encoding (i.e., the size of the list with weight indices) is
args.tiles=8, but you can use any number you want (but the assignment is
solvable with 8).
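A sketch of the approximation described above: the action value is the sum of the active weights, and the semi-gradient Q-learning step adds the scaled TD error to exactly those weights; dividing the learning rate by the number of tiles is a common (but not mandatory) choice.

```python
import numpy as np

def make_weights(total_weights, n_actions):
    # W[i, a] is the weight of feature index i for action a.
    return np.zeros((total_weights, n_actions))

def q_value(W, state_indices, action):
    # Q(s, a) is the sum of the active weights, since the features are 0/1.
    return W[state_indices, action].sum()

def q_learning_tiles_update(W, state, action, reward, next_state, done,
                            alpha, gamma, n_tiles):
    # Semi-gradient Q-learning step on the weights of the active features.
    target = reward + (0 if done else gamma * max(
        q_value(W, next_state, a) for a in range(W.shape[1])))
    td_error = target - q_value(W, state, action)
    W[state, action] += alpha / n_tiles * td_error
```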
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
Deadline: Mar 25, 22:00 5 points
Solve the continuous CartPole-v1 environment from the Gymnasium library using Q-learning with a neural network as function approximation.
You can start with the q_network.py template, which provides a simple network implementation in PyTorch.
The continuous environment is very similar to a discrete one, except
that the states are vectors of real-valued observations with shape
env.observation_space.shape.
Use Q-learning with a neural network as the function approximator, which for a given state returns the state-action values of all actions. You can use any network architecture, but one hidden layer of several dozen ReLU units is a good start. Your goal is to reach an average return of 450 during 100 evaluation episodes.
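A minimal PyTorch sketch of such a network together with one Q-learning training step on a batch of transitions; the layer sizes, the optimizer, and the loss are illustrative choices, not requirements of the template.

```python
import torch

class QNetwork(torch.nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.model = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, n_actions),
        )

    def forward(self, states):
        return self.model(states)

def train_step(network, optimizer, states, actions, rewards, next_states, dones, gamma):
    # One Q-learning step on a batch of transitions (all arguments are tensors;
    # actions are int64, dones are 0/1 floats).
    with torch.no_grad():
        targets = rewards + gamma * (1 - dones) * network(next_states).max(dim=1).values
    predictions = network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = torch.nn.functional.mse_loss(predictions, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example setup (dimensions for CartPole-v1):
# network = QNetwork(obs_dim=4, n_actions=2)
# optimizer = torch.optim.Adam(network.parameters(), lr=1e-3)
```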
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes (so you can train in ReCodEx, but you can also pretrain your network if you like).
Deadline: Mar 25, 22:00 5 points + 5 bonus
The goal of this competition is to use Deep Q Networks (and any of the Rainbow improvements) on a more real-world CarRacing-v3 environment from the Gymnasium library. If you want to experience the environment yourselves, you can drive the car using the arrow keys by running the python3 -m npfl139.play.car_racing command.
In the provided CarRacingFS-v3 environment, the states are RGB np.uint8 images, but you can downsample them even more. The actions are continuous and consist of an array with the following three elements:
steer in range [-1, 1]
gas in range [0, 1]
brake in range [0, 1]; note that full brake is quite aggressive, so you might consider using less force when braking
Internally, you should probably generate discrete actions and convert them to the required representation before the step call (as sketched below). Alternatively, you might set args.continuous=0, which changes the action space from continuous to 5 discrete actions – do nothing, steer left, steer right, gas, and brake. But you can experiment with a different action space if you want.
The environment also supports frame skipping (args.frame_skipping), which improves its simulation speed (only some frames need to be rendered). Note that ReCodEx respects both args.continuous and args.frame_skipping during evaluation.
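If you keep the continuous action space, one option is a small fixed set of discrete agent actions mapped to [steer, gas, brake] arrays before each step call; the particular set of actions and values below is only an illustration.

```python
import numpy as np

# Illustrative mapping from discrete agent actions to the continuous
# [steer, gas, brake] representation expected by env.step.
DISCRETE_ACTIONS = np.array([
    [ 0.0, 0.0, 0.0],   # do nothing
    [-1.0, 0.0, 0.0],   # steer left
    [ 1.0, 0.0, 0.0],   # steer right
    [ 0.0, 1.0, 0.0],   # gas
    [ 0.0, 0.0, 0.2],   # gentle brake (full brake is quite aggressive)
], dtype=np.float32)

def env_step(env, discrete_action):
    # Convert the discrete action to the continuous representation and step.
    return env.step(DISCRETE_ACTIONS[discrete_action])
```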
In ReCodEx, you are expected to submit an already trained model, which is evaluated on 15 different tracks with a total time limit of 15 minutes. If your average return is at least 500, you obtain 5 points. The task is also a competition, and at most 5 points will be awarded according to the relative ordering of your solutions.
The car_racing.py template parses several useful parameters and creates the environment.
You might find it useful to use a vectorized version of the environment for training, which runs several individual environments in separate processes. The template contains instructions on how to create it. The vectorized environment expects a vector of actions and returns a vector of observations, rewards, terminations, truncations, and infos. When one of the environments finishes, it is automatically reset either in the next or in the same step; see https://farama.org/Vector-Autoreset-Mode for a detailed description.
When submitting a competition solution to ReCodEx, you should submit a trained agent and a Python source capable of running it.
Furthermore, please also include the Python source and hyperparameters
you used to train the submitted model. But be careful that there still must be
exactly one Python source with a line starting with def main(.
Do not forget about the maximum allowed model size and time and memory limits.
Before the deadline, ReCodEx prints the exact performance of your agent, but only if it is worse than the baseline.
If you surpass the baseline, the assignment is marked as solved in ReCodEx and you immediately get regular points for the assignment. However, ReCodEx does not print the reached performance.
After the first deadline, the latest submission of every user surpassing the required baseline participates in a competition. Additional bonus points are then awarded according to the ordering of the performance of the participating submissions.
After the competition results announcement, ReCodEx starts to show the exact performance for all the already submitted solutions and also for the solutions submitted later.
What Python version to use
The recommended Python version is 3.11. This version is used by ReCodEx to evaluate your solutions. Supported Python versions are 3.11–3.13 (some dependencies do not yet provide wheels for Python 3.14).
You can find out the version of your Python installation using python3 --version.
Installing to central user packages repository
You can install all required packages to central user packages repository using
python3 -m pip install --user --no-cache-dir --extra-index-url=https://download.pytorch.org/whl/cu128 npfl139.
On Linux and Windows, the above command installs the CUDA 12.8 PyTorch build (which you would get also without specifying the --extra-index-url option), but you can change cu128 to:
cpu to get the CPU-only (smaller) version,
cu126 to get the CUDA 12.6 build,
rocm7.1 to get the AMD ROCm 7.1 build (Linux only).
On macOS, the above --extra-index-url values have no practical effect; the Metal support is installed in all cases.
To update the npfl139 package later, use python3 -m pip install --user --upgrade npfl139.
Installing to a virtual environment
Python supports virtual environments, which are directories containing
independent sets of installed packages. You can create a virtual environment
by running python3 -m venv VENV_DIR followed by
VENV_DIR/bin/pip install --no-cache-dir --extra-index-url=https://download.pytorch.org/whl/cu128 npfl139.
(or VENV_DIR/Scripts/pip on Windows).
Again, apart from the CUDA 12.8 build (which you would get also without specifying the --extra-index-url option), you can change cu128 on Linux and Windows to:
cpu to get the CPU-only (smaller) version,
cu126 to get the CUDA 12.6 build,
rocm7.1 to get the AMD ROCm 7.1 build (Linux only).
To update the npfl139 package later, use VENV_DIR/bin/pip install --upgrade npfl139.
Installing to a virtual environment with uv
If you would like to use uv pip to install the required packages to
a virtual environment, run the above command with VENV_DIR/bin/pip replaced
by uv pip.
If you prefer to use uv add instead and want to use a non-default build
(CUDA 12.8 on Linux and Windows), first manually add torch~=2.10.0, torchaudio~=2.10.0,
and torchvision~=0.25.0 with a specified tool.uv.index according to
https://docs.astral.sh/uv/guides/integration/pytorch/#using-a-pytorch-index.
Once you have PyTorch installed, you can then run uv add npfl139.
Windows installation
On Windows, it can happen that python3 is not in PATH, while the py command
is – in that case you can use py -m venv VENV_DIR, which uses the newest
Python available, or for example py -3.11 -m venv VENV_DIR, which uses
Python version 3.11.
If MuJoCo environments fail during construction, make sure the path of the Python site packages contains no non-ASCII characters. If it does, you can create a new virtual environment in a suitable directory to circumvent the problem.
If you encounter a problem creating the logs in the args.logdir directory,
a possible cause is that the path is longer than 260 characters, which is
the default maximum length of a complete path on Windows. However, you can
increase this limit on Windows 10, version 1607 or later, by following
the instructions.
If you encounter an Import Error: DLL load failed, install the VS 2017 Redistributable
as described in the official documentation.
MacOS installation
Install Certificates.command should be executed after installation; see https://docs.python.org/3/using/mac.html#installation-steps.
GPU support on Linux and Windows
PyTorch supports NVIDIA and AMD GPUs out of the box; you just need to select the appropriate --extra-index-url when installing the packages.
If you encounter problems loading CUDA or cuDNN libraries, make sure your
LD_LIBRARY_PATH does not contain paths to older CUDA/cuDNN libraries.
How to apply for MetaCentrum account?
After reading the Terms and conditions, you can apply for an account here.
After your account is created, please make sure that the directories containing your solutions are always private.
How to activate Python 3.11 on MetaCentrum?
On MetaCentrum, the newest currently available Python is 3.11; you need to activate it in every session by running the following command:
module add python/3.11.11-gcc-10.2.1-555dlyc
How to install the required virtual environment on MetaCentrum?
To create a virtual environment, you first need to decide where it will reside. Either you can find a permanent storage, where you have large-enough quota, or you can use scratch storage for a submitted job.
TL;DR:
Run an interactive CPU job, asking for 16GB scratch space:
qsub -l select=1:ncpus=1:mem=8gb:scratch_local=16gb -I
In the job, use the allocated scratch space as the temporary directory:
export TMPDIR=$SCRATCHDIR
You should clear the scratch space before you exit using the clean_scratch
command. You can instruct the shell to call it automatically by running:
trap clean_scratch TERM EXIT
Finally, create the virtual environment and install PyTorch in it:
module add python/3.11.11-gcc-10.2.1-555dlyc
python3 -m venv CHOSEN_VENV_DIR
CHOSEN_VENV_DIR/bin/pip install --no-cache-dir --extra-index-url=https://download.pytorch.org/whl/cu126 npfl139
How to run a GPU computation on MetaCentrum?
First, read the official MetaCentrum documentation: Basic terms, Run simple job, GPU computing, GPU clusters.
TL;DR: To run an interactive GPU job with 1 CPU, 1 GPU, 8GB RAM, and 32GB scratch space, run:
qsub -l select=1:ncpus=1:ngpus=1:mem=8gb:scratch_local=32gb -I
To run a script in a non-interactive way, replace the -I option with the script to be executed.
If you want to run a CPU-only computation, remove the ngpus=1: from the above commands.
How to install required packages on AIC?
Python 3.11.7 is available at /opt/python/3.11.7/bin/python3, so you should start by creating a virtual environment using
/opt/python/3.11.7/bin/python3 -m venv VENV_DIR
and then install the required packages in it using
VENV_DIR/bin/pip install --no-cache-dir --extra-index-url=https://download.pytorch.org/whl/cu126 npfl139
How to run a GPU computation on AIC?
First, read the official AIC documentation: Submitting CPU Jobs, Submitting GPU Jobs.
TL;DR: To run an interactive GPU job with 1 CPU, 1 GPU, and 16GB RAM, run:
srun -p gpu -c1 -G1 --mem=16G --pty bash
To run a shell script requiring a GPU in a non-interactive way, use
sbatch -p gpu -c1 -G1 --mem=16G SCRIPT_PATH
If you want to run a CPU-only computation, remove the -p gpu and -G1
from the above commands.
Is it possible to keep the solutions in a Git repository?
Definitely. Keeping the solutions in a branch of your repository, where you merge them with the course repository, is probably a good idea. However, please keep the cloned repository with your solutions private.
On GitHub, do not create a public fork containing your solutions.
If you keep your solutions in a GitHub repository, please do not create a clone of the repository by using the Fork button; this way, the cloned repository would be public.
Of course, if you want to create a pull request, GitHub requires a public fork and you need to create it, just do not store your solutions in it (so you might end up with two repositories, a public fork for pull requests and a private repo for your own solutions).
How to clone the course repository?
To clone the course repository, run
git clone https://github.com/ufal/npfl139
This creates the repository in the npfl139 subdirectory; if you want a different
name, add it as an additional parameter.
To update the repository, run git pull inside the repository directory.
How to merge the course repository updates into a private repository with additional changes?
It is possible to have a private repository that combines your solutions and the updates from the course repository. To do that, start by cloning your empty private repository, and then run the following commands in it:
git remote add course_repo https://github.com/ufal/npfl139
git fetch course_repo
git checkout --no-track course_repo/master
This creates a new remote course_repo and a clone of the master branch
from it; however, git pull and git push in this branch will operate
on the repository you cloned originally.
To update your branch with the changes from the course repository, run
git fetch course_repo
git merge course_repo/master
while in your branch (the command git pull --no-rebase course_repo master
has the same effect). Of course, it might be necessary to resolve conflicts
if both you and the course repository modified the same lines in the same files.
What files can be submitted to ReCodEx?
You can submit multiple files of any type to ReCodEx. There is a limit of 20 files per submission, with a total size of 20MB.
What file does ReCodEx execute and what arguments does it use?
Exactly one file with py suffix must contain a line starting with def main(.
Such a file is imported by ReCodEx and the main method is executed
(during the import, __name__ == "__recodex__").
The file must also export an argument parser called parser. ReCodEx uses its
arguments and default values, but it overwrites some of the arguments
depending on the test being executed; the template always indicates which
arguments are set by ReCodEx and which are left intact.
What are the time and memory limits?
The memory limit during evaluation is 1.5GB. The time limit varies, but it should be at least 10 seconds and at least twice the running time of my solution.
Do agents need to be trained directly in ReCodEx?
No, you can pre-train your agent locally (unless specified otherwise in the task description).
To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that all surplus points (both bonus and non-bonus) will be transferred to the exam. In total, assignments for at least 120 points (not including the bonus points) will be available, and if you solve all the assignments (any non-zero amount of points counts as solved), you automatically pass the exam with grade 1.
To pass the exam, you need to obtain at least 60, 75, or 90 points out of the 100-point exam to receive grade 3, 2, or 1, respectively. The exam consists of questions worth 100 points in total, drawn from the list below (the questions are randomly generated, but in such a way that there is at least one question from every lecture except the last). In addition, you can get surplus points from the practicals and at most 10 points for community work (i.e., fixing slides or reporting issues) – but only the points you already have at the time of the exam count. You can take the exam without passing the practicals first.
Lecture 1 Questions
Derive how to incrementally update a running average (how to compute an average of n numbers using the average of the first n−1 numbers). [5]
Describe multi-armed bandits and write down the ε-greedy algorithm for solving them. [5]
Define a Markov Decision Process, including the definition of a return. [5]
Describe how a partially observable Markov decision process extends a Markov decision process and how the agent is altered. [5]
Define a value function, such that all expectations are over simple random variables (actions, states, rewards), not trajectories. [5]
Define an action-value function, such that all expectations are over simple random variables (actions, states, rewards), not trajectories. [5]
Express a value function using an action-value function, and express an action-value function using a value function. [5]
Define optimal value function, optimal action-value function, and the optimal policy. [5]
Lecture 2 Questions
Write down the Bellman optimality equation. [5]
Define the Bellman backup operator. [5]
Write down the value iteration algorithm. [5]
Define the supremum norm and prove that the Bellman backup operator is a contraction with respect to this norm. [10]
Formulate and prove the policy improvement theorem. [10]
Write down the policy iteration algorithm. [10]
Write down the tabular Monte-Carlo on-policy every-visit ε-soft algorithm. [5]
Write down the Sarsa algorithm. [5]
Write down the Q-learning algorithm. [5]
Lecture 3 Questions
Elaborate on how importance sampling can estimate expectations with respect to the target policy π based on samples of the behavior policy b. [5]
Show how to estimate returns in the off-policy case, both with (a) ordinary importance sampling and (b) weighted importance sampling. [10]
Write down the Expected Sarsa algorithm and show how to obtain Q-learning from it. [10]
Write down the Double Q-learning algorithm. [10]
Show the bootstrapped estimate of the n-step return. [5]
Write down the update in on-policy n-step Sarsa (assuming you already have the previous steps, actions, and rewards). [5]
Write down the update in off-policy n-step Sarsa with importance sampling (assuming you already have the previous steps, actions, and rewards). [10]
Write down the update of the n-step Tree-backup algorithm (assuming you already have the previous steps, actions, and rewards). [10]
Lecture 4 Questions
Assuming function approximation, define Mean squared value error. [5]
Write down the gradient Monte-Carlo on-policy every-visit ε-soft algorithm. [10]
Write down the semi-gradient ε-greedy Sarsa algorithm. [10]
Prove that semi-gradient TD update is not an SGD update of any loss. [10]
What are the three elements causing off-policy divergence with function approximation? Write down Baird's counterexample. [10]
Explain the role of a replay buffer in Deep Q Networks and describe what a single element of the replay buffer looks like. [5]
How is the target network used and updated in Deep Q Networks? [5]
Explain how reward clipping is used in Deep Q Networks. What other clipping is used? [5]
Formulate the loss used in Deep Q Networks. [5]
Write down the Deep Q Networks training algorithm. [10]
Explain the difference between DQN and Double DQN, and between Double DQN and Double Q-learning. [5]
Describe prioritized replay (how are transitions sampled from the replay buffer, how up-to-date are the priorities [according to which we sample], how are unseen transitions boosted, how is importance sampling used to account for the change in the sampling distribution). [10]
