Deep Reinforcement Learning – Summer 2024/25
The objective of this course is to provide a comprehensive introduction to deep reinforcement learning, a powerful paradigm that combines reinforcement learning with deep neural networks. This approach has demonstrated super-human capabilities in diverse domains, including mastering complex games like Go and chess, optimizing real-world systems like datacenter cooling, improving chip design, automatically discovering superior algorithms and neural network architectures, and advancing robotics and large language models.
The course covers both the theory, spanning from fundamental concepts to recent advancements, and practical implementations in Python and PyTorch (students implement and train agents controlling robots, mastering video games, and planning in complex board games). Basic programming and deep learning skills are expected (for example from the Deep Learning course).
Students work either individually or in small teams on weekly assignments, including competition tasks, where the goal is to obtain the highest performance in the class.
About
SIS code: NPFL139
Semester: summer
E-credits: 8
Examination: 3/4 C+Ex
Guarantor: Milan Straka
Timespace Coordinates
- lecture: the lecture is held on Wednesday 9:00 in S5; first lecture is on Feb 19
- practicals: the practicals take place on Thursday 14:00 in S5; first practicals are on Feb 20
- consultations: entirely optional consultations take place on Wednesday 14:00 in S9; first consultations are on Feb 26
All lectures and practicals will be recorded and available on this website.
Lectures
1. Introduction to Reinforcement Learning Slides PDF Slides Lecture Questions bandits monte_carlo
2. Value and Policy Iteration, Monte Carlo, Temporal Difference Slides PDF Slides Lecture TD, Q-learning Questions policy_iteration policy_iteration_exact policy_iteration_mc_estarts policy_iteration_mc_egreedy q_learning
3. Off-Policy Methods, N-step, Function Approximation Slides PDF Slides Lecture TreeBackup Questions importance_sampling td_algorithms q_learning_tiles lunar_lander
4. Function Approximation, Deep Q Network, Rainbow Slides PDF Slides Lecture Questions q_network car_racing
5. Rainbow II, Distributional RL Slides PDF Slides Lecture Questions prioritized_replay_buffer dist_c51
6. Distributional RL II Slides PDF Slides Lecture Questions dist_qr_dqn atari_gamer
7. Policy Gradient Methods Slides PDF Slides Lecture Questions reinforce reinforce_baseline cart_pole_pixels paac
8. Continuous Action Space: DDPG, TD3, SAC Slides PDF Slides Lecture SAC Conclusion Questions paac_continuous ddpg walker
9. Eligibility traces, Impala Slides PDF Slides Lecture PopArt Questions walker_hardcore trace_algorithms cheetah humanoid humanoid_standup
10. PPO, R2D2, Agent57 Slides PDF Slides Lecture R2D2+Agent57 Questions ppo
11. UCB, Monte Carlo Tree Search, AlphaZero Slides PDF Slides Lecture Questions az_quiz
12. MuZero, AlphaZero Policy Target, Gumbel-Max, GumbelZero Slides PDF Slides Lecture Questions az_quiz_cpp az_quiz_randomized pisqorky
13. PlaNet, ST and Gumbel-softmax, DreamerV2+3, MERLIN Slides PDF Slides Lecture Merlin Questions memory_game memory_game_rl
14. Multi-Agent RL, RL from Human Feedback Slides PDF Slides Lecture Questions mappo
License
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
The lecture content, including references to study materials.
The main study material is the Reinforcement Learning: An Introduction; second edition by Richard S. Sutton and Andrew G. Barto (referred to as RLB). It is available online and also as a hardcopy.
References to study materials cover all theory required at the exam, and sometimes even more – the references in italics cover topics not required for the exam.
1. Introduction to Reinforcement Learning
Feb 19 Slides PDF Slides Lecture Questions bandits monte_carlo
- History of RL [Chapter 1 of RLB]
- Multi-armed bandits [Sections 2-2.6 of RLB]
- Markov Decision Process [Sections 3-3.3 of RLB]
- Policies and Value Functions [Sections 3.5-3.6 of RLB]
2. Value and Policy Iteration, Monte Carlo, Temporal Difference
Feb 26 Slides PDF Slides Lecture TD, Q-learning Questions policy_iteration policy_iteration_exact policy_iteration_mc_estarts policy_iteration_mc_egreedy q_learning
- Value Iteration [Sections 4 and 4.4 of RLB]
- Proof of convergence only in slides
- Policy Iteration [Sections 4.1-4.3 of RLB]
- Generalized Policy Iteration [Section 4.6 of RLB]
- Monte Carlo Methods [Sections 5-5.4 of RLB]
- Model-free and model-based methods, using state-value or action-value functions [Chapter 8 before Section 8.1, and Section 6.8 of RLB]
- Temporal-difference methods [Sections 6-6.3 of RLB]
- Sarsa [Section 6.4 of RLB]
- Q-learning [Section 6.5 of RLB]
3. Off-Policy Methods, N-step, Function Approximation
Mar 5 Slides PDF Slides Lecture TreeBackup Questions importance_sampling td_algorithms q_learning_tiles lunar_lander
- Off-policy Monte Carlo Methods [Sections 5.5-5.7 of RLB]
- Expected Sarsa [Section 6.6 of RLB]
- N-step TD policy evaluation [Section 7.1 of RLB]
- N-step Sarsa [Section 7.2 of RLB]
- Off-policy n-step Sarsa [Section 7.3 of RLB]
- Tree backup algorithm [Section 7.5 of RLB]
- Function approximation [Sections 9-9.3 of RLB]
- Tile coding [Section 9.5.4 of RLB]
- Linear function approximation [Section 9.4 of RLB, without the Proof of Convergence of Linear TD(0)]
4. Function Approximation, Deep Q Network, Rainbow
Mar 12 Slides PDF Slides Lecture Questions q_network car_racing
- Semi-Gradient TD methods [Sections 9.3, 10-10.2 of RLB]
- Off-policy function approximation TD divergence [Sections 11.2-11.3 of RLB]
- Deep Q Network [Volodymyr Mnih et al.: Human-level control through deep reinforcement learning]
- Double Deep Q Network (DDQN) [Hado van Hasselt et al.: Deep Reinforcement Learning with Double Q-learning]
- Prioritized Experience Replay [Tom Schaul et al.: Prioritized Experience Replay]
- Dueling Deep Q Network [Ziyu Wang et al.: Dueling Network Architectures for Deep Reinforcement Learning]
5. Rainbow II, Distributional RL
Mar 19 Slides PDF Slides Lecture Questions prioritized_replay_buffer dist_c51
- Noisy Nets [Meire Fortunato et al.: Noisy Networks for Exploration]
- Distributional Reinforcement Learning [Marc G. Bellemare et al.: A Distributional Perspective on Reinforcement Learning]
- Rainbow [Matteo Hessel et al.: Rainbow: Combining Improvements in Deep Reinforcement Learning]
6. Distributional RL II
Mar 26 Slides PDF Slides Lecture Questions dist_qr_dqn atari_gamer
- QR-DQN [Will Dabney et al.: Distributional Reinforcement Learning with Quantile Regression]
- IQN [Will Dabney et al.: Implicit Quantile Networks for Distributional Reinforcement Learning]
7. Policy Gradient Methods
Apr 02 Slides PDF Slides Lecture Questions reinforce reinforce_baseline cart_pole_pixels paac
- Policy Gradient Methods [Sections 13-13.1 of RLB]
- Policy Gradient Theorem [Section 13.2 of RLB]
- REINFORCE algorithm [Section 13.3 of RLB]
- REINFORCE with baseline algorithm [Section 13.4 of RLB]
- OP-REINFORCE [Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux: An operator view of policy gradient methods]
- Actor-Critic methods [Section 13.5 of RLB, without the eligibility traces variant]
- A3C and asynchronous RL [Volodymyr Mnih et al.: Asynchronous Methods for Deep Reinforcement Learning]
- PAAC [Alfredo V. Clemente et al.: Efficient Parallel Methods for Deep Reinforcement Learning]
8. Continuous Action Space: DDPG, TD3, SAC
Apr 09 Slides PDF Slides Lecture SAC Conclusion Questions paac_continuous ddpg walker
- Gradient methods with continuous actions [Section 13.7 of RLB]
- Deterministic policy gradient theorem (DPG) [David Silver et al.: Deterministic Policy Gradient Algorithms]
- Deep deterministic policy gradient (DDPG) [Timothy P. Lillicrap et al.: Continuous Control with Deep Reinforcement Learning]
- Twin delayed deep deterministic policy gradient (TD3) [Scott Fujimoto et al.: Addressing Function Approximation Error in Actor-Critic Methods]
- Soft actor-critic (SAC) [Tuomas Haarnoja et al.: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor]
9. Eligibility traces, Impala
Apr 16 Slides PDF Slides Lecture PopArt Questions walker_hardcore trace_algorithms cheetah humanoid humanoid_standup
- Eligibility traces [Sections 12, 12.1, 12.3, 12.8, 12.9 of RLB]
- TD(λ) [Section 12.2 of RLB]
- The V-trace algorithm, IMPALA [Lasse Espeholt et al.: IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures]
- PBT [Max Jaderberg et al.: Population Based Training of Neural Networks]
- PopArt reward normalization [Matteo Hessel et al.: Multi-task Deep Reinforcement Learning with PopArt]
10. PPO, R2D2, Agent57
Apr 23 Slides PDF Slides Lecture R2D2+Agent57 Questions ppo
- Natural policy gradient (NPG) [Sham Kakade: A Natural Policy Gradient]
- Truncated natural policy gradient (TNPG), Trust Region Policy Optimalization (TRPO) [John Schulman et al.: Trust Region Policy Optimization]
- PPO algorithm [John Schulman et al.: Proximal Policy Optimization Algorithms]
- Transformed rewards [Tobias Pohlen et al.: Observe and Look Further: Achieving Consistent Performance on Atari]
- Recurrent Replay Distributed DQN (R2D2) [Steven Kapturowski et al.: Recurrent Experience Replay in Distributed Reinforcement Learning]
- Retrace [Rémi Munos et al.: Safe and Efficient Off-Policy Reinforcement Learning]
- Never Give Up [Adrià Puigdomènech Badia et al.: Never Give Up: Learning Directed Exploration Strategies]
- Agent57 [Adrià Puigdomènech Badia et al.: Agent57: Outperforming the Atari Human Benchmark]
11. UCB, Monte Carlo Tree Search, AlphaZero
Apr 30 Slides PDF Slides Lecture Questions az_quiz
- UCB [Section 2.7 of RLB]
- Monte-Carlo tree search [Guillaume Chaslot et al.: Monte-Carlo Tree Search: A New Framework for Game AI]
- AlphaZero [David Silver et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]
12. MuZero, AlphaZero Policy Target, Gumbel-Max, GumbelZero
May 7 Slides PDF Slides Lecture Questions az_quiz_cpp az_quiz_randomized pisqorky
- MuZero [Julian Schrittwieser et al.: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model]
- AlphaZero as regularized policy optimization [Jean-Bastien Grill et al.: Monte-Carlo Tree Search as Regularized Policy Optimization]
- GumbelZero [Ivo Danihelka et al.: Policy Improvement by Planning with Gumbel]
13. PlaNet, ST and Gumbel-softmax, DreamerV2+3, MERLIN
May 14 Slides PDF Slides Lecture Merlin Questions memory_game memory_game_rl
- PlaNet [D. Hafner et al.: Learning Latent Dynamics for Planning from Pixels]
- Straight-Through estimator [Yoshua Bengio, Nicholas Léonard, Aaron Courville: Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation]
- Gumbel softmax [Eric Jang, Shixiang Gu, Ben Poole: Categorical Reparameterization with Gumbel-Softmax, Chris J. Maddison, Andriy Mnih, Yee Whye Teh: The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables]
- DreamerV2 [D. Hafner et al.: Mastering Atari with Discrete World Models]
- DreamerV3 [D. Hafner et al.: Mastering Diverse Domains through World Models]
- MERLIN model [Greg Wayne et al.: Unsupervised Predictive Memory in a Goal-Directed Agent]
14. Multi-Agent RL, RL from Human Feedback
May 20 Slides PDF Slides Lecture Questions mappo
- Multi-Agent Reinforcement Learning [A nice overview is given by the diploma thesis Cooperative Multi-Agent Reinforcement Learning]
- RLHF [Reinforcement Learning from Human Preferences]
- Summarize from HF [Learning to summarize from human feedback]
- InstructGPT [Training language models to follow instructions with human feedback]
- DPO [Direct Preference Optimization: Your Language Model is Secretly a Reward Model]
Requirements
To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that all surplus points (both bonus and non-bonus) will be transferred to the exam. In total, assignments for at least 120 points (not including the bonus points) will be available, and if you solve all the assignments (any non-zero amount of points counts as solved), you automatically pass the exam with grade 1.
Environment
The tasks are evaluated automatically using the ReCodEx Code Examiner.
The evaluation is performed using Python 3.11, Gymnasium 1.0.0, and PyTorch 2.6.0. You should install the exact versions of these packages yourselves.
Teamwork
Solving assignments in teams (of size at most 3) is encouraged, but everyone has to participate (it is forbidden not to work on an assignment and then submit a solution created by other team members). All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. Each such solution must explicitly list all members of the team to allow plagiarism detection using this template.
No Cheating
Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty. While discussing assignments with any classmate is fine, each team must complete the assignments themselves, without using code they did not write (unless explicitly allowed). Of course, inside a team you are allowed to share code and submit identical solutions. Note that all students involved in cheating will be punished, so if you share your source code with a friend, both you and your friend will be punished. That also means that you should never publish your solutions.
bandits
Deadline: Mar 05, 22:00 3 points
Implement the ε-greedy strategy for solving multi-armed bandits.
Start with the bandits.py
template, which defines the MultiArmedBandits environment with the following
three methods:
- reset(): reset the environment
- step(action) → reward: perform the chosen action in the environment, obtaining a reward
- greedy(epsilon): return True with probability 1-epsilon
Your goal is to implement the following solution variants:
- alpha=0: perform ε-greedy search, updating the estimates using averaging.
- alpha≠0: perform ε-greedy search, updating the estimates using a fixed learning rate alpha.
Note that the initial estimates should be set to a given value, and epsilon can
be zero, in which case purely greedy actions are used.
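A minimal sketch of the two update rules, assuming a hypothetical env object with the interface described above (the actual template may differ):

```python
import numpy as np

def epsilon_greedy_bandit(env, n_actions, steps, epsilon, alpha, initial):
    # Action-value estimates and per-action visit counts.
    Q = np.full(n_actions, initial, dtype=np.float64)
    N = np.zeros(n_actions, dtype=np.int64)

    env.reset()
    for _ in range(steps):
        # With probability 1-epsilon take the greedy action, otherwise a random one.
        if env.greedy(epsilon):
            action = int(np.argmax(Q))
        else:
            action = np.random.randint(n_actions)

        reward = env.step(action)
        N[action] += 1
        # alpha == 0: averaging (sample-average step size 1/N); otherwise a fixed step size alpha.
        step_size = 1 / N[action] if alpha == 0 else alpha
        Q[action] += step_size * (reward - Q[action])
    return Q
```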
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 bandits.py --alpha=0 --epsilon=0.1 --initial=0
1.39 0.08
python3 bandits.py --alpha=0 --epsilon=0 --initial=1
1.48 0.22
python3 bandits.py --alpha=0.15 --epsilon=0.1 --initial=0
1.37 0.09
python3 bandits.py --alpha=0.15 --epsilon=0 --initial=1
1.52 0.04
monte_carlo
Deadline: Mar 05, 22:00 4 points
Solve the discretized CartPole-v1 environment
from the Gymnasium library using the Monte Carlo
reinforcement learning algorithm. The gymnasium environments have the following
methods and properties:
- observation_space: the description of environment observations
- action_space: the description of environment actions
- reset() → new_state, info: starts a new episode, returning the new state and additional environment-specific information
- step(action) → new_state, reward, terminated, truncated, info: perform the chosen action in the environment, returning the new state, obtained reward, boolean flags indicating a terminal state and episode truncation, and additional environment-specific information
We additionally extend the gymnasium environment by:
- episode: number of the current episode (zero-based)
- reset(start_evaluation=False) → new_state, info: if start_evaluation is True, an evaluation is started
Once you finish training (which you indicate by passing start_evaluation=True
to reset), your goal is to reach an average return of 490 during 100
evaluation episodes. Note that the environment prints your 100-episode
average return every 10 episodes even during training.
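A minimal sketch of the every-visit Monte Carlo control loop under the interface above; the discretized states are assumed to be hashable, and the hyperparameters and stopping criterion are placeholders, not the template's code:

```python
import collections
import numpy as np

def monte_carlo_control(env, n_actions, train_episodes=2000, epsilon=0.1, gamma=1.0):
    # Tabular action-value estimates and visit counts, keyed by the (discretized) state.
    Q = collections.defaultdict(lambda: np.zeros(n_actions))
    N = collections.defaultdict(lambda: np.zeros(n_actions))

    for _ in range(train_episodes):
        # Generate one episode with an epsilon-greedy policy.
        episode, (state, _), done = [], env.reset(), False
        while not done:
            if np.random.uniform() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state, done = next_state, terminated or truncated

        # Every-visit Monte Carlo update of the action-value estimates.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]
    return Q
```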
Start with the monte_carlo.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
policy_iteration
Deadline: Mar 12, 22:00 2 points
Consider the following gridworld:
Start with policy_iteration.py, which implements the gridworld mechanics by providing the following methods:
- GridWorld.states: return the number of states (11)
- GridWorld.actions: return the number of actions (4)
- GridWorld.action_labels: return a list with labels of the actions (["↑", "→", "↓", "←"])
- GridWorld.step(state, action): return possible outcomes of performing the action in a given state, as a list of triples containing
  - probability: probability of the outcome
  - reward: reward of the outcome
  - new_state: new state of the outcome
Implement the policy iteration algorithm, with --steps steps of policy
evaluation/policy improvement. During policy evaluation, use the current value
function and perform --iterations applications of the Bellman equation.
Perform the policy evaluation asynchronously (i.e., update the value function
in-place for states 0, 1, …). Assume the initial policy is “go North” and
the initial value function is zero.
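A minimal sketch of the iteration structure under the GridWorld interface above (argument parsing and output formatting are left out; treat it as an illustration, not the reference solution):

```python
import numpy as np

def policy_iteration(GridWorld, gamma, steps, iterations):
    value = np.zeros(GridWorld.states)
    policy = np.zeros(GridWorld.states, dtype=np.int64)  # action 0 is "↑", i.e., "go North"

    def action_value(state, action):
        # Expected one-step return of taking `action` in `state` under the current value function.
        return sum(p * (r + gamma * value[next_state])
                   for p, r, next_state in GridWorld.step(state, action))

    for _ in range(steps):
        # Asynchronous policy evaluation: in-place Bellman updates for every state.
        for _ in range(iterations):
            for state in range(GridWorld.states):
                value[state] = action_value(state, policy[state])
        # Policy improvement: act greedily with respect to the current value function.
        for state in range(GridWorld.states):
            policy[state] = int(np.argmax([action_value(state, a) for a in range(GridWorld.actions)]))
    return policy, value
```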
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -10.00← -10.95↑
0.00↑ 0.00← -7.50← -88.93←
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=2
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -8.31← -11.83←
0.00↑ 0.00← -1.50← -20.61←
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=3
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -6.46← -6.77←
0.00↑ 0.00← -0.76← -13.08↓
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=10
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -1.04← -0.83←
0.00↑ 0.00← -0.11→ -0.34↓
python3 policy_iteration.py --gamma=0.95 --iterations=10 --steps=10
11.93↓ 11.19← 10.47← 6.71↑
12.83↓ 10.30← 10.12←
13.70→ 14.73→ 15.72→ 16.40↓
python3 policy_iteration.py --gamma=1 --iterations=1 --steps=100
74.73↓ 74.50← 74.09← 65.95↑
75.89↓ 72.63← 72.72←
77.02→ 78.18→ 79.31→ 80.16↓
policy_iteration_exact
Deadline: Mar 12, 22:00 2 points
Starting with policy_iteration_exact.py,
extend the policy_iteration assignment to perform policy evaluation
exactly by solving a system of linear equations. Note that you need to
use 64-bit floats because lower precision results in unacceptable error.
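A minimal sketch of the exact evaluation step, assuming the GridWorld interface from the policy_iteration assignment: it solves the linear system (I − γ·P_π) v = r_π with 64-bit floats.

```python
import numpy as np

def evaluate_policy_exactly(GridWorld, policy, gamma):
    # Build the transition matrix and expected one-step rewards of the current policy.
    n = GridWorld.states
    P = np.zeros((n, n), dtype=np.float64)
    r = np.zeros(n, dtype=np.float64)
    for state in range(n):
        for probability, reward, next_state in GridWorld.step(state, policy[state]):
            P[state, next_state] += probability
            r[state] += probability * reward
    # Solve (I - gamma * P) v = r for the state values.
    return np.linalg.solve(np.eye(n) - gamma * P, r)
```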
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 policy_iteration_exact.py --gamma=0.95 --steps=1
-0.00↑ -0.00↑ -0.00↑ -0.00↑
-0.00↑ -12.35← -12.35↑
-0.85← -8.10← -19.62← -100.71←
python3 policy_iteration_exact.py --gamma=0.95 --steps=2
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00← -11.05←
-0.00↑ -0.00↑ -0.00← -12.10↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=3
-0.00↑ 0.00↑ 0.00↑ 0.00↑
-0.00↑ -0.00← 0.69←
-0.00↑ -0.00↑ -0.00→ 6.21↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=4
-0.00↑ 0.00↑ 0.00↓ 0.00↑
-0.00↓ 5.91← 6.11←
0.65→ 6.17→ 14.93→ 15.99↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=5
2.83↓ 4.32→ 8.09↓ 5.30↑
12.92↓ 9.44← 9.35←
13.77→ 14.78→ 15.76→ 16.53↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=6
11.75↓ 8.15← 8.69↓ 5.69↑
12.97↓ 9.70← 9.59←
13.82→ 14.84→ 15.82→ 16.57↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=7
12.12↓ 11.37← 9.19← 6.02↑
13.01↓ 9.92← 9.79←
13.87→ 14.89→ 15.87→ 16.60↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=8
12.24↓ 11.49← 10.76← 7.05↑
13.14↓ 10.60← 10.42←
14.01→ 15.04→ 16.03→ 16.71↓
python3 policy_iteration_exact.py --gamma=0.9999 --steps=5
7385.23↓ 7392.62→ 7407.40↓ 7400.00↑
7421.37↓ 7411.10← 7413.16↓
7422.30→ 7423.34→ 7424.27→ 7425.84↓
policy_iteration_mc_estarts
Deadline: Mar 12, 22:00 2 points
Starting with policy_iteration_mc_estarts.py,
extend the policy_iteration assignment to perform policy evaluation
by using Monte Carlo estimation with exploring starts. Specifically,
we update the action-value function by running a
simulation with a given number of steps and using the observed return
as its estimate.
The estimation can now be performed model-free (without access to the full
MDP dynamics); therefore, GridWorld.step returns a randomly sampled
outcome instead of the full distribution.
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --mc_length=100 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00→ 0.00↑ 0.00↓
python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --mc_length=100 --steps=10
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00↑ -19.50↑
0.27↓ 0.48← 2.21↓ 8.52↓
python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --mc_length=100 --steps=50
0.09↓ 0.32↓ 0.22← 0.15↑
0.18↑ -2.43← -5.12↓
0.18↓ 1.80↓ 3.90↓ 9.14↓
python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --mc_length=100 --steps=100
3.09↓ 2.42← 2.39← 1.17↑
3.74↓ 1.66← 0.18←
3.92→ 5.28→ 7.16→ 11.07↓
python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --mc_length=100 --steps=200
7.71↓ 6.76← 6.66← 3.92↑
8.27↓ 6.17← 5.31←
8.88→ 10.12→ 11.36→ 13.92↓
policy_iteration_mc_egreedy
Deadline: Mar 12, 22:00 2 points
Starting with policy_iteration_mc_egreedy.py,
extend the policy_iteration_mc_estarts assignment to perform policy
evaluation by using ε-greedy Monte Carlo estimation. Specifically,
we update the action-value function by running a
simulation with a given number of steps and using the observed return
as its estimate.
For the sake of replicability, use the provided
GridWorld.epsilon_greedy(epsilon, greedy_action) method, which returns
a random action with probability of epsilon and otherwise returns the
given greedy_action.
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --mc_length=100 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00→ 0.00→
0.00↑ 0.00↑ 0.00→ 0.00→
python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --mc_length=100 --steps=10
-1.20↓ -1.43← 0.00← -6.00↑
0.78→ -20.26↓ 0.00←
0.09← 0.00↓ -9.80↓ 10.37↓
python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --mc_length=100 --steps=50
-0.16↓ -0.19← 0.56← -6.30↑
0.13→ -6.99↓ -3.51↓
0.01← 0.00← 3.18↓ 7.57↓
python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --mc_length=100 --steps=100
-0.07↓ -0.09← 0.28← -4.66↑
0.06→ -5.04↓ -8.32↓
0.00← 0.00← 1.70↓ 4.38↓
python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --mc_length=100 --steps=200
-0.04↓ -0.04← -0.76← -4.15↑
0.03→ -8.02↓ -5.96↓
0.00← 0.00← 2.53↓ 4.36↓
python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --mc_length=100 --steps=500
-0.02↓ -0.02← -0.65← -3.52↑
0.01→ -11.34↓ -8.07↓
0.00← 0.00← 3.15↓ 3.99↓
q_learning
Deadline: Mar 12, 22:00 4 points
Solve the discretized MountainCar-v0 environment from the Gymnasium library using the Q-learning reinforcement learning algorithm. Note that this task still does not require PyTorch.
The environment methods and properties are described in the monte_carlo assignment.
Once you finish training (which you indicate by passing start_evaluation=True
to reset), your goal is to reach an average return of -150 during 100
evaluation episodes.
You can start with the q_learning.py template, which parses several useful parameters, creates the environment and illustrates the overall usage. Note that setting the hyperparameters of Q-learning is a bit tricky – I usually start with a larger value of ε (like 0.2 or even 0.5) and then gradually decrease it to almost zero.
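A minimal sketch of the tabular Q-learning update with a linearly decayed ε, assuming the discretized environment interface from the monte_carlo assignment (all hyperparameter values are placeholders):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=3000, alpha=0.1, gamma=1.0,
               epsilon_start=0.3, epsilon_final=0.01):
    Q = np.zeros((n_states, n_actions))
    for episode in range(episodes):
        # Linearly decay epsilon from epsilon_start to epsilon_final over the training episodes.
        epsilon = np.interp(episode, [0, episodes], [epsilon_start, epsilon_final])
        (state, _), done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.uniform() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            # One-step Q-learning update towards the bootstrapped target.
            target = reward + gamma * np.max(Q[next_state]) * (not terminated)
            Q[state, action] += alpha * (target - Q[state, action])
            state, done = next_state, terminated or truncated
    return Q
```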
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
importance_sampling
Deadline: Mar 19, 22:00 2 points
Using the FrozenLake-v1 environment, implement Monte Carlo weighted importance sampling to estimate the state value function of a target policy, which uniformly chooses either action 1 (down) or action 2 (right), utilizing a behavior policy, which uniformly chooses among all four actions.
Start with the importance_sampling.py template, which creates the environment and generates episodes according to the behavior policy.
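A minimal sketch of the weighted importance sampling estimate, assuming the generated episodes are available as lists of (state, action, reward) triples (the template's actual representation may differ):

```python
import numpy as np

def weighted_importance_sampling(episodes, n_states, gamma=1.0):
    # Per-state accumulators of importance-weighted returns and of importance weights.
    weighted_returns = np.zeros(n_states)
    weights = np.zeros(n_states)

    for episode in episodes:
        G, rho = 0.0, 1.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            # Target policy: uniform over actions 1 and 2; behavior policy: uniform over all 4 actions.
            target_prob = 0.5 if action in (1, 2) else 0.0
            rho *= target_prob / 0.25
            if rho == 0:
                break  # All earlier updates of this episode would be zero-weighted anyway.
            weighted_returns[state] += rho * G
            weights[state] += rho

    return np.divide(weighted_returns, weights, out=np.zeros(n_states), where=weights > 0)
```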
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 importance_sampling.py --episodes=200
0.00 0.00 0.24 0.32
0.00 0.00 0.40 0.00
0.00 0.00 0.20 0.00
0.00 0.00 0.22 0.00
python3 importance_sampling.py --episodes=5000
0.03 0.00 0.01 0.03
0.04 0.00 0.09 0.00
0.10 0.24 0.23 0.00
0.00 0.44 0.49 0.00
python3 importance_sampling.py --episodes=50000
0.03 0.02 0.05 0.01
0.13 0.00 0.07 0.00
0.21 0.33 0.36 0.00
0.00 0.35 0.76 0.00
td_algorithms
Deadline: Mar 19, 22:00 4 points
Starting with the td_algorithms.py template, implement all of the following n-step TD method variants:
- SARSA, expected SARSA and Tree backup;
- either on-policy (with ε-greedy behavior policy) or off-policy (with the same behavior policy, but greedy target policy).
Note that while the test and example outputs just show mean 100-episode returns,
ReCodEx compares the action-value function you return from main to the
reference one.
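A hedged sketch of the n-step SARSA return for a stored trajectory (the indexing convention is an assumption); expected SARSA replaces the bootstrap with the policy's expectation of Q, and tree backup additionally weights the intermediate steps by the target policy probabilities:

```python
def n_step_sarsa_return(states, actions, rewards, Q, t, n, gamma, T):
    """n-step SARSA return G_{t:t+n}; T is the terminal time step (episode length).

    The arrays are indexed so that rewards[k + 1] is the reward following (states[k], actions[k])."""
    G, horizon = 0.0, min(t + n, T)
    # Discounted sum of the rewards R_{t+1}, ..., R_{horizon}.
    for k in range(t, horizon):
        G += gamma ** (k - t) * rewards[k + 1]
    if horizon < T:
        # Bootstrap with the action actually taken at the horizon (this is the SARSA variant).
        G += gamma ** n * Q[states[horizon], actions[horizon]]
    return G
```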
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 td_algorithms.py --episodes=10 --mode=sarsa --n=1
Episode 10, mean 100-episode return -652.70 +-37.77
python3 td_algorithms.py --episodes=10 --mode=sarsa --n=1 --off_policy
Episode 10, mean 100-episode return -632.90 +-126.41
python3 td_algorithms.py --episodes=10 --mode=sarsa --n=4
Episode 10, mean 100-episode return -715.70 +-156.56
python3 td_algorithms.py --episodes=10 --mode=sarsa --n=4 --off_policy
Episode 10, mean 100-episode return -649.10 +-171.73
python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=1
Episode 10, mean 100-episode return -641.90 +-122.11
python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=1 --off_policy
Episode 10, mean 100-episode return -633.80 +-63.61
python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=4
Episode 10, mean 100-episode return -713.90 +-107.05
python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=4 --off_policy
Episode 10, mean 100-episode return -648.20 +-107.08
python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=1
Episode 10, mean 100-episode return -641.90 +-122.11
python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=1 --off_policy
Episode 10, mean 100-episode return -633.80 +-63.61
python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=4
Episode 10, mean 100-episode return -663.50 +-111.78
python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=4 --off_policy
Episode 10, mean 100-episode return -708.50 +-125.63
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 td_algorithms.py --mode=sarsa --n=1
Episode 200, mean 100-episode return -235.23 +-92.94
Episode 400, mean 100-episode return -133.18 +-98.63
Episode 600, mean 100-episode return -74.19 +-70.39
Episode 800, mean 100-episode return -41.84 +-54.53
Episode 1000, mean 100-episode return -31.96 +-52.14
python3 td_algorithms.py --mode=sarsa --n=1 --off_policy
Episode 200, mean 100-episode return -227.81 +-91.62
Episode 400, mean 100-episode return -131.29 +-90.07
Episode 600, mean 100-episode return -65.35 +-64.78
Episode 800, mean 100-episode return -34.65 +-44.93
Episode 1000, mean 100-episode return -8.70 +-25.74
python3 td_algorithms.py --mode=sarsa --n=4
Episode 200, mean 100-episode return -277.55 +-146.18
Episode 400, mean 100-episode return -87.11 +-152.12
Episode 600, mean 100-episode return -6.95 +-23.28
Episode 800, mean 100-episode return -1.88 +-19.21
Episode 1000, mean 100-episode return 0.97 +-11.76
python3 td_algorithms.py --mode=sarsa --n=4 --off_policy
Episode 200, mean 100-episode return -339.11 +-144.40
Episode 400, mean 100-episode return -172.44 +-176.79
Episode 600, mean 100-episode return -36.23 +-100.93
Episode 800, mean 100-episode return -22.43 +-81.29
Episode 1000, mean 100-episode return -3.95 +-17.78
python3 td_algorithms.py --mode=expected_sarsa --n=1
Episode 200, mean 100-episode return -223.35 +-102.16
Episode 400, mean 100-episode return -143.82 +-96.71
Episode 600, mean 100-episode return -79.92 +-68.88
Episode 800, mean 100-episode return -38.53 +-47.12
Episode 1000, mean 100-episode return -17.41 +-31.26
python3 td_algorithms.py --mode=expected_sarsa --n=1 --off_policy
Episode 200, mean 100-episode return -231.91 +-87.72
Episode 400, mean 100-episode return -136.19 +-94.16
Episode 600, mean 100-episode return -79.65 +-70.75
Episode 800, mean 100-episode return -35.42 +-44.91
Episode 1000, mean 100-episode return -11.79 +-23.46
python3 td_algorithms.py --mode=expected_sarsa --n=4
Episode 200, mean 100-episode return -263.10 +-161.97
Episode 400, mean 100-episode return -102.52 +-162.03
Episode 600, mean 100-episode return -7.13 +-24.53
Episode 800, mean 100-episode return -1.69 +-12.21
Episode 1000, mean 100-episode return -1.53 +-11.04
python3 td_algorithms.py --mode=expected_sarsa --n=4 --off_policy
Episode 200, mean 100-episode return -376.56 +-116.08
Episode 400, mean 100-episode return -292.35 +-166.14
Episode 600, mean 100-episode return -173.83 +-194.11
Episode 800, mean 100-episode return -89.57 +-153.70
Episode 1000, mean 100-episode return -54.60 +-127.73
python3 td_algorithms.py --mode=tree_backup --n=1
Episode 200, mean 100-episode return -223.35 +-102.16
Episode 400, mean 100-episode return -143.82 +-96.71
Episode 600, mean 100-episode return -79.92 +-68.88
Episode 800, mean 100-episode return -38.53 +-47.12
Episode 1000, mean 100-episode return -17.41 +-31.26
python3 td_algorithms.py --mode=tree_backup --n=1 --off_policy
Episode 200, mean 100-episode return -231.91 +-87.72
Episode 400, mean 100-episode return -136.19 +-94.16
Episode 600, mean 100-episode return -79.65 +-70.75
Episode 800, mean 100-episode return -35.42 +-44.91
Episode 1000, mean 100-episode return -11.79 +-23.46
python3 td_algorithms.py --mode=tree_backup --n=4
Episode 200, mean 100-episode return -270.51 +-134.35
Episode 400, mean 100-episode return -64.27 +-109.50
Episode 600, mean 100-episode return -1.80 +-13.34
Episode 800, mean 100-episode return -0.22 +-13.14
Episode 1000, mean 100-episode return 0.60 +-9.37
python3 td_algorithms.py --mode=tree_backup --n=4 --off_policy
Episode 200, mean 100-episode return -248.56 +-147.74
Episode 400, mean 100-episode return -68.60 +-126.13
Episode 600, mean 100-episode return -6.25 +-32.23
Episode 800, mean 100-episode return -0.53 +-11.82
Episode 1000, mean 100-episode return 2.33 +-8.35
q_learning_tiles
Deadline: Mar 19, 22:00 3 points
Improve the q_learning task performance on the
MountainCar-v0 environment
using linear function approximation with tile coding.
Your goal is to reach an average reward of -110 during 100 evaluation episodes.
The environment methods are described in the q_learning assignment, with
the following changes:
- The state returned by the env.step method is a list containing weight indices of the current state (i.e., the feature vector of the state consists of zeros and ones, and only the indices of the ones are returned). The action-value function is therefore approximated as a sum of the weights whose indices are returned by env.step.
- The env.observation_space.nvec returns a list, where the i-th element is the number of weights used by the first i elements of state. Notably, env.observation_space.nvec[-1] is the total number of the weights.
You can start with the q_learning_tiles.py
template, which parses several useful parameters and creates the environment.
Implementing Q-learning is enough to pass the assignment, even though both N-step
Sarsa and Tree Backup converge a little faster. The default number of tiles in
the tile coding (i.e., the size of the list with weight indices) is
args.tiles=8, but you can use any number you want (the assignment is
solvable with 8).
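A minimal sketch of the linear approximation over the tile-index states described above (the weight layout and the division of the step size by the number of tiles are assumptions):

```python
import numpy as np

class TiledQ:
    """Linear action-value function over tile-coded states given as lists of weight indices."""
    def __init__(self, n_weights, n_actions):
        self.W = np.zeros((n_weights, n_actions))

    def value(self, state, action):
        # Q(s, a) is the sum of the weights of the active tiles.
        return self.W[state, action].sum()

    def update(self, state, action, target, alpha):
        # Semi-gradient update; the gradient is 1 for every active tile, so the step size
        # is divided by the number of tiles to keep its scale independent of args.tiles.
        delta = target - self.value(state, action)
        self.W[state, action] += alpha / len(state) * delta
```

Q-learning then uses target = reward + gamma * max over actions of Q(next_state, a), computed with the same structure.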
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
lunar_lander
Deadline: Mar 19, 22:00 5 points + 5 bonus
Solve the LunarLander-v3 environment from the Gymnasium library. Note that this task does not require PyTorch.
The environment methods and properties are described in the monte_carlo assignment;
additionally, the environment provides one extra method:
- expert_trajectory(seed=None) → trajectory: This method generates one expert trajectory, where trajectory is a list of triples (state, action, reward), in which the action and reward are None when reaching the terminal state. If a seed is given, the expert trajectory random generator is reset before generating the trajectory. You cannot change the implementation of this method or use its internals in any way other than just calling expert_trajectory(). Furthermore, you can use this method only during training, not during evaluation.
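One hedged example of consuming the expert data is to turn a trajectory into discounted returns for every visited state-action pair, e.g., to pre-train a value estimate (the reward indexing is an assumption about the triple layout described above):

```python
def expert_returns(env, gamma=1.0):
    # Convert one expert trajectory into (state, action, discounted return) triples.
    trajectory = env.expert_trajectory()
    G, result = 0.0, []
    # The final triple is the terminal state, whose action and reward are None, so skip it.
    for state, action, reward in reversed(trajectory[:-1]):
        G = reward + gamma * G
        result.append((state, action, G))
    return list(reversed(result))
```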
To pass the task, you need to reach an average return of 0 during 1000 evaluation episodes. During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 15 minutes.
The task is additionally a competition, and at most 5 points will be awarded according to the relative ordering of your solutions.
You can start with the lunar_lander.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
In the competition, you should consider the meaning of the environment states to be unknown, so you cannot use knowledge of how they are created. However, you can learn any such information from the data.
q_network
Deadline: Mar 26, 22:00 5 points
Solve the continuous CartPole-v1 environment from the Gymnasium library using Q-learning with a neural network as function approximation.
You can start with the q_network.py template, which provides a simple network implementation in PyTorch.
The continuous environment is very similar to a discrete one, except
that the states are vectors of real-valued observations with shape
env.observation_space.shape.
Use Q-learning with a neural network as function approximation; for a given state, the network returns the state-action values of all actions. You can use any network architecture, but one hidden layer of several dozen ReLU units is a good start. Your goal is to reach an average return of 450 during 100 evaluation episodes.
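A minimal PyTorch sketch of such a network and of the Q-learning loss for a sampled batch (the layer sizes, target network, and batch layout are assumptions, not the template's code):

```python
import torch

class QNetwork(torch.nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.model = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, n_actions),  # one Q-value per action
        )

    def forward(self, states):
        return self.model(states)

def q_learning_loss(network, target_network, batch, gamma):
    states, actions, rewards, next_states, dones = batch  # tensors; dones is a float mask
    q_values = network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap with the target network; terminal transitions get no bootstrap term.
        targets = rewards + gamma * target_network(next_states).max(dim=1).values * (1 - dones)
    return torch.nn.functional.mse_loss(q_values, targets)
```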
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes (so you can train in ReCodEx, but you can also pretrain your network if you like).
car_racing
Deadline: Mar 26, 22:00 5 points + 5 bonus
The goal of this competition is to use Deep Q Networks (and any of the Rainbow improvements) on the more real-world CarRacing-v3 environment from the Gymnasium library.
In the provided CarRacingFS-v3
environment, the states are RGB np.uint8 images, but you can downsample them
even more. The actions are continuous and consist of an array with the
following three elements:
- steer in range [-1, 1]
- gas in range [0, 1]
- brake in range [0, 1]; note that full brake is quite aggressive, so you might consider using less force when braking

Internally, you should probably generate discrete actions and convert them to the required representation before the step call. Alternatively, you might set args.continuous=0, which changes the action space from continuous to 5 discrete actions – do nothing, steer left, steer right, gas, and brake. But you can experiment with a different action space if you want.
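A minimal sketch of one possible mapping from discrete agent actions to the continuous representation (the particular action set and brake force are arbitrary choices, not part of the assignment):

```python
import numpy as np

# A hypothetical discrete action set, each row being [steer, gas, brake].
DISCRETE_ACTIONS = np.array([
    [ 0.0, 0.0, 0.0],   # do nothing
    [-1.0, 0.0, 0.0],   # steer left
    [ 1.0, 0.0, 0.0],   # steer right
    [ 0.0, 1.0, 0.0],   # gas
    [ 0.0, 0.0, 0.4],   # brake, with reduced force
], dtype=np.float32)

def to_continuous(action_index):
    # Convert the index chosen by a DQN-style agent to the continuous action array for env.step.
    return DISCRETE_ACTIONS[action_index]
```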
The environment also supports frame skipping (args.frame_skipping), which
improves its simulation speed (only some frames need to be rendered). Note that
ReCodEx respects both args.continuous and args.frame_skipping during
evaluation.
In ReCodEx, you are expected to submit an already trained model, which is evaluated on 15 different tracks with a total time limit of 15 minutes. If your average return is at least 500, you obtain 5 points. The task is also a competition, and at most 5 points will be awarded according to relative ordering of your solutions.
The car_racing.py
template parses several useful parameters and creates the environment.
If you want to experience the environment yourselves, you can drive the car
using arrows by running the command python3 -m npfl139.envs.car_racing_interactive.
You might find it useful to use a vectorized version of the environment for training, which runs several individual environments in separate processes. The template contains instructions on how to create it. The vectorized environment expects a vector of actions and returns a vector of observations, rewards, terminations, truncations, and infos. When one of the environments finishes, it is automatically reset either in the next or in the same step; see https://farama.org/Vector-Autoreset-Mode for a detailed description.
prioritized_replay_buffer
Deadline: Apr 09, 22:00 2 points
In this assignment, your goal is to implement an efficient prioritized replay buffer with logarithmic complexity of sampling. Start with the prioritized_replay_buffer.py template, which contains the skeleton of the implementation and a detailed description of the functionality that you must implement.
Note that compared to npfl139.ReplayBuffer, the template implementation is more memory efficient: the buffer elements must be named tuples of data convertible to Numpy arrays with a constant shape, and the whole replay buffer stores items in a single named tuple of Numpy arrays containing all the data.
When executed directly, the template runs randomized tests verifying that the prioritized replay buffer behaves as expected. In ReCodEx, similar tests are executed, and your running time must be at most twice that of the reference solution.
In both local and ReCodEx tests, the maximum replay buffer capacity is always a power of 2 (even though the reference implementation can handle an arbitrary limit). Note that to avoid precision loss, you should store the priorities using 64-bit floats.
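One standard way to reach logarithmic sampling is a binary sum tree over the priorities; below is a minimal sketch for a power-of-two capacity, independent of the interface the template actually requires:

```python
import numpy as np

class SumTree:
    """Binary sum tree over the priorities of a replay buffer with power-of-two capacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity, dtype=np.float64)  # leaves live in tree[capacity:]

    def update(self, index, priority):
        # Set the priority of leaf `index` and propagate the change up to the root.
        node = index + self.capacity
        self.tree[node] = priority
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def total(self):
        return self.tree[1]

    def sample(self, value):
        # Find the leaf whose prefix-sum interval contains `value`, in O(log capacity).
        node = 1
        while node < self.capacity:
            left = 2 * node
            if value <= self.tree[left]:
                node = left
            else:
                value -= self.tree[left]
                node = left + 1
        return node - self.capacity
```

Sampling a batch then draws value = np.random.uniform(0, tree.total()) (or stratified values) for each element and maps the returned leaf index back to a buffer position.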
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
-
python3 prioritized_replay_buffer.py --max_length=128 -
python3 prioritized_replay_buffer.py --max_length=8192 --batch_size=8 -
python3 prioritized_replay_buffer.py --max_length=131072 --batch_size=2
dist_c51
Deadline: Apr 09, 22:00 3 points
Extend the q_network assignment by solving the continuous
CartPole-v1 environment
from the Gymnasium library using the distributional
reinforcement learning algorithm C51.
Start with the dist_c51.py
template. In the template, you must implement the Network.compute_loss
method, which constitutes the core of the C51 algorithm. In ReCodEx, the first
two tests verify your implementation by comparing the results to the reference
ones. You can also run two comparison tests present in the template locally by
using the --verify option.
Using the Network.compute_loss method, you should finish the C51
implementation. In the third test, ReCodEx verifies in the usual way that your
agent reaches an average return of 450 during 100 evaluation episodes.
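A hedged PyTorch sketch of the categorical projection and cross-entropy loss at the heart of C51; the tensor layout and names are assumptions and do not correspond to the template's Network.compute_loss signature:

```python
import torch

def c51_loss(logits, target_probs, rewards, dones, gamma, support):
    """Cross-entropy between the predicted distribution of the taken actions and the projected target.

    logits: [batch, n_atoms] logits of the taken actions,
    target_probs: [batch, n_atoms] target-network probabilities of the greedy next actions,
    support: [n_atoms] equally spaced atom locations; rewards and dones are float [batch] tensors."""
    v_min, v_max = support[0], support[-1]
    delta_z = (v_max - v_min) / (support.shape[0] - 1)

    # Shift the atoms by the Bellman update and clamp them to the support.
    tz = (rewards.unsqueeze(1) + gamma * (1 - dones).unsqueeze(1) * support.unsqueeze(0)).clamp(v_min, v_max)
    b = (tz - v_min) / delta_z                    # fractional atom positions
    lower, upper = b.floor().long(), b.ceil().long()

    # Distribute the mass of every target atom between its two neighboring atoms.
    projected = torch.zeros_like(target_probs)
    projected.scatter_add_(1, lower, target_probs * (upper.float() - b))
    projected.scatter_add_(1, upper, target_probs * (b - lower.float()))
    projected.scatter_add_(1, lower, target_probs * (upper == lower).float())  # b exactly on an atom

    # Cross-entropy between the projected target and the predicted distribution.
    return -(projected * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()
```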
dist_qr_dqn
Deadline: Apr 16, 22:00 3 points
Extend the q_network assignment by solving the continuous
CartPole-v1 environment
from the Gymnasium library using the quantile
regression algorithm QR-DQN.
Start with the dist_qr_dqn.py
template. As in the dist_c51 assignment, you must implement
the Network.compute_loss method, which constitutes the core of the QR-DQN
algorithm. In ReCodEx, the first two tests verify your implementation by
comparing the results to the reference ones. You can also run two comparison
tests present in the template locally by using the --verify option.
Using the Network.compute_loss method, you should finish the QR-DQN
implementation. In the third test, ReCodEx verifies in the usual way that your
agent reaches an average return of 450 during 100 evaluation episodes.
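A hedged PyTorch sketch of the quantile (Huber) regression loss used by QR-DQN, again with an assumed tensor layout rather than the template's interface:

```python
import torch

def quantile_huber_loss(predicted, target, kappa=1.0):
    """Quantile regression loss between online and target quantiles of the taken actions.

    predicted: [batch, n_quantiles] quantiles of the online network,
    target: [batch, n_quantiles] Bellman-updated target-network quantiles (already detached)."""
    n_quantiles = predicted.shape[1]
    # Quantile midpoints tau_hat_i = (2i + 1) / (2N).
    taus = (torch.arange(n_quantiles, dtype=predicted.dtype, device=predicted.device) + 0.5) / n_quantiles

    # Pairwise TD errors between every target quantile j and predicted quantile i: [batch, j, i].
    deltas = target.unsqueeze(2) - predicted.unsqueeze(1)
    huber = torch.where(deltas.abs() <= kappa, 0.5 * deltas ** 2, kappa * (deltas.abs() - 0.5 * kappa))
    # Asymmetric quantile weights |tau_i - 1{delta < 0}|.
    weights = (taus.view(1, 1, -1) - (deltas.detach() < 0).float()).abs()
    # Average over the target quantiles, sum over the predicted quantiles, average over the batch.
    return (weights * huber).mean(dim=1).sum(dim=1).mean()
```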
atari_gamer
Deadline: June 30, 22:00 4 points
In this long-term assignment, your goal is to solve one of the Atari games. Currently, you can choose one of the following games:
- Pong (one of the easiest games), where you must surpass an average return of 16 in 10 episodes
- Breakout (a more difficult one), where you must surpass an average return of 200 in 10 episodes.
If you would like to train an agent for a different Atari game, write us on
Piazza, and we will decide the required score and add it to the list. Note that
you can play any Atari game interactively with WASD and SPACE using the
python3 -m npfl139.envs.atari_interactive GAME_NAME [--zoom=4] [--frame_skip=1]
command, so for example python3 -m npfl139.envs.atari_interactive Pong.
The template atari_gamer.py
shows how to create the Atari environment. The args.game option is the chosen game
and the args.frame_skip option specifies the required frame skipping (these
options are also used in ReCodEx to construct the evaluation environment);
additional wrappers can be applied to the environment at the beginning of the
main method.
While you can use any algorithm from the lecture to solve the environment, some distributional DQN approach (C51/QR-DQN/IQN) is a reasonable choice (as is the PPO algorithm). The time limit in ReCodEx is 15 minutes.
reinforce
Deadline: Apr 16, 22:00 3 points
Solve the continuous CartPole-v1 environment from the Gymnasium library using the REINFORCE algorithm.
Your goal is to reach an average return of 490 during 100 evaluation episodes.
Start with the reinforce.py template, which provides a simple network implementation in PyTorch.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
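A minimal PyTorch sketch of the REINFORCE policy-gradient loss for a batch of collected transitions (the network and data layout are assumptions):

```python
import torch

def reinforce_loss(policy_network, states, actions, returns):
    """REINFORCE loss: -E[log pi(a|s) * G] over the collected transitions.

    states: [n, obs_dim], actions: [n] int64, returns: [n] discounted returns from each state."""
    log_probs = torch.distributions.Categorical(logits=policy_network(states)).log_prob(actions)
    # Minimizing this loss ascends the policy-gradient estimate of the expected return.
    return -(log_probs * returns).mean()
```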
reinforce_baseline
Deadline: Apr 16, 22:00 3 points
This is a continuation of the reinforce assignment.
Using the reinforce_baseline.py template, solve the continuous CartPole-v1 environment using the REINFORCE with baseline algorithm.
Using a baseline lowers the variance of the value function gradient estimator, which allows faster training and decreases sensitivity to hyperparameter values. To reflect this effect in ReCodEx, note that the evaluation phase will automatically start after 200 episodes. Using only 200 episodes for training in this setting is probably too little for the REINFORCE algorithm, but suffices for the variant with a baseline. In this assignment, you must train your agent in ReCodEx using the provided environment only.
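A minimal sketch of the corresponding losses with a learned state-value baseline (purely illustrative; the value network and its training schedule are assumptions):

```python
import torch

def reinforce_with_baseline_losses(policy_network, value_network, states, actions, returns):
    # Baseline: a learned state-value estimate, trained by regression towards the observed returns.
    baselines = value_network(states).squeeze(-1)
    value_loss = torch.nn.functional.mse_loss(baselines, returns)

    # Policy gradient with the baseline subtracted from the returns (a lower-variance estimator).
    log_probs = torch.distributions.Categorical(logits=policy_network(states)).log_prob(actions)
    policy_loss = -(log_probs * (returns - baselines.detach())).mean()
    return policy_loss, value_loss
```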
Your goal is to reach an average return of 490 during 100 evaluation episodes.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
cart_pole_pixels
Deadline: Apr 16, 22:00 3 points + 4 bonus
The supplied CartPolePixels environment
generates a pixel representation of the CartPole environment
as an np.uint8 image with three channels, with each channel representing one time step
(i.e., the current observation and the two previous ones). You can also
play it interactively using python3 -m npfl139.envs.cart_pole_pixels_interactive.
During evaluation in ReCodEx, three different random seeds will be employed, each with time limit of 10 minutes, and if you reach an average return at least 450 on all of them, you obtain 3 points. The task is also a competition, and at most 4 points will be awarded according to relative ordering of your solutions.
The cart_pole_pixels.py template parses several parameters and creates the environment. You are again supposed to train the model beforehand and submit only the trained neural network.
paac
Deadline: Apr 23, 22:00 4 points
Solve the LunarLander-v3 environment
with continuous observations using the parallel actor-critic algorithm, employing
the vectorized environment described in the car_racing assignment.
Your goal is to reach an average return of 250 during 100 evaluation episodes.
Start with the paac.py
template, which provides a simple network implementation in PyTorch, support for
saving a trained agent, and ReCodEx evaluation of a submitted agent. The used
environment is configurable, so you can experiment also with other environments;
for example, I would first solve the CartPole environment before progressing
to LunarLander.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
paac_continuous
Deadline: Apr 23, 22:00 3 points
Solve the MountainCarContinuous-v0 environment
using the parallel actor-critic algorithm with continuous actions.
When actions are continuous, env.action_space is the same Box space
as env.observation_space, offering:
- env.action_space.shape, which specifies the shape of actions (you can assume actions are always a 1D vector),
- env.action_space.low and env.action_space.high, which specify the ranges of the corresponding actions.
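A minimal PyTorch sketch of a Gaussian policy head for such a Box action space; the architecture and the state-independent standard deviation are assumptions:

```python
import torch

class GaussianPolicy(torch.nn.Module):
    def __init__(self, obs_dim, action_dim, low, high, hidden=64):
        super().__init__()
        self.body = torch.nn.Sequential(torch.nn.Linear(obs_dim, hidden), torch.nn.ReLU())
        self.mean_head = torch.nn.Linear(hidden, action_dim)
        self.log_std = torch.nn.Parameter(torch.zeros(action_dim))  # state-independent log std
        self.register_buffer("low", torch.as_tensor(low))
        self.register_buffer("high", torch.as_tensor(high))

    def forward(self, states):
        return torch.distributions.Normal(self.mean_head(self.body(states)), self.log_std.exp())

    def act(self, states):
        # Sample an action and clip it into the range given by env.action_space.low/high.
        distribution = self.forward(states)
        actions = distribution.sample()
        return actions.clamp(self.low, self.high), distribution.log_prob(actions).sum(-1)
```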
Your goal is to reach an average return of 90 during 100 evaluation episodes.
Start with the paac_continuous.py template, which provides a simple network implementation in PyTorch.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes, and the expected approach is to train the agent during ReCodEx evaluation.
ddpg
Deadline: Apr 23, 22:00 5 points
Solve the continuous Pendulum-v1 and InvertedDoublePendulum-v5 environments using the deep deterministic policy gradient algorithm.
Your goal is to reach an average return of -200 for Pendulum-v1 and 9000 for InvertedDoublePendulum-v5
during 100 evaluation episodes.
Start with the ddpg.py template, which provides a simple network implementation in PyTorch.
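A minimal sketch of the DDPG critic and actor losses (the module interfaces and batch layout are assumptions, not the template's code):

```python
import torch

def ddpg_critic_loss(critic, target_critic, target_actor, batch, gamma):
    states, actions, rewards, next_states, dones = batch  # tensors; dones is a float mask
    with torch.no_grad():
        # The deterministic target policy chooses the next action; bootstrap with the target critic.
        next_actions = target_actor(next_states)
        targets = rewards + gamma * (1 - dones) * target_critic(next_states, next_actions).squeeze(-1)
    return torch.nn.functional.mse_loss(critic(states, actions).squeeze(-1), targets)

def ddpg_actor_loss(critic, actor, states):
    # Maximize the critic's value of the actor's actions, i.e., minimize its negation.
    return -critic(states, actor(states)).mean()
```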
During evaluation in ReCodEx, two different random seeds will be employed for both environments, and you need to reach the required return on all of them. Time limit for each test is 10 minutes, and my solution comfortably trains the agent during ReCodEx evaluation.
walker
Deadline: Apr 30, 22:00 5 points
In this exercise we explore continuous robot control by solving the continuous BipedalWalker-v3 environment.
Note that the penalty of -100 on crash makes the training considerably slower.
Even though all of DDPG, TD3, and SAC can be trained with the original rewards, overriding
the reward at the end of the episode to 0 speeds up training considerably.
In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 200, you obtain 5 points.
The walker.py template contains the skeleton for implementing the SAC agent, but you can also solve the assignment with DDPG/TD3.
walker_hardcore
Deadline: Apr 30, 22:00 5 points + 5 bonus
As an extension of the walker assignment, solve the
hardcore version of the BipedalWalker-v3 environment.
Note that the penalty of -100 on crash can discourage or even stop training,
so overriding the reward at the end of the episode to 0 (or decreasing it
substantially) makes the training considerably easier (I have not surpassed
an average return of 100 with either TD3 or SAC when keeping the original -100 penalty).
In ReCodEx, you are expected to submit an already trained model, which is evaluated with three seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 100, you obtain 5 points. The task is also a competition, and at most 5 points will be awarded according to relative ordering of your solutions.
The walker_hardcore.py
template shows a basic structure of evaluation in ReCodEx, but
you most likely want to start either with ddpg.py
or with walker.py
and just change the env argument to BipedalWalkerHardcore-v3.
trace_algorithms
Deadline: May 7, 22:00 4 points
Starting with the trace_algorithms.py template, implement the following state value estimations:
- use n-step estimates for a given n;
- if requested, use the truncated λ-return with a given λ;
- allow off-policy correction using importance sampling with control variates, optionally clipping the individual importance sampling ratios by a given threshold (a sketch of the return recursion follows below).
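A hedged sketch of the off-policy return recursion with per-decision control variates for state values, with optional clipping of the ratios (the data layout is an assumption):

```python
def off_policy_n_step_return(rewards, values, rhos, gamma, clip=None):
    """Return estimate for the first state of a stored segment.

    rewards[k] is the reward following state k, values[k] the current V(S_k) estimate
    (one more element than rewards), rhos[k] the importance sampling ratio of the action
    taken in state k. The recursion is
        G_k = rho_k * (R_{k+1} + gamma * G_{k+1}) + (1 - rho_k) * V(S_k),
    bootstrapped with G_n = V(S_n)."""
    G = values[-1]  # bootstrap with the value of the last state of the segment
    for k in reversed(range(len(rewards))):
        rho = rhos[k] if clip is None else min(rhos[k], clip)
        G = rho * (rewards[k] + gamma * G) + (1 - rho) * values[k]
    return G
```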
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 trace_algorithms.py --episodes=50 --n=1
The mean 1000-episode return after evaluation -196.80 +-25.96
python3 trace_algorithms.py --episodes=50 --n=4
The mean 1000-episode return after evaluation -165.45 +-78.01
python3 trace_algorithms.py --episodes=50 --n=8 --seed=62
The mean 1000-episode return after evaluation -180.20 +-61.48
python3 trace_algorithms.py --episodes=50 --n=4 --trace_lambda=0.6
The mean 1000-episode return after evaluation -170.70 +-72.93
python3 trace_algorithms.py --episodes=50 --n=8 --trace_lambda=0.6 --seed=77
The mean 1000-episode return after evaluation -154.24 +-86.67
python3 trace_algorithms.py --episodes=50 --n=1 --off_policy
The mean 1000-episode return after evaluation -189.16 +-46.74
python3 trace_algorithms.py --episodes=50 --n=4 --off_policy
The mean 1000-episode return after evaluation -159.09 +-83.40
python3 trace_algorithms.py --episodes=50 --n=8 --off_policy
The mean 1000-episode return after evaluation -166.82 +-76.04
python3 trace_algorithms.py --episodes=50 --n=1 --off_policy --vtrace_clip=1
The mean 1000-episode return after evaluation -198.50 +-17.93
python3 trace_algorithms.py --episodes=50 --n=4 --off_policy --vtrace_clip=1
The mean 1000-episode return after evaluation -144.76 +-92.48
python3 trace_algorithms.py --episodes=50 --n=8 --off_policy --vtrace_clip=1
The mean 1000-episode return after evaluation -167.63 +-75.87
python3 trace_algorithms.py --episodes=50 --n=4 --off_policy --vtrace_clip=1 --trace_lambda=0.6
The mean 1000-episode return after evaluation -186.28 +-52.05
python3 trace_algorithms.py --episodes=50 --n=8 --off_policy --vtrace_clip=1 --trace_lambda=0.6
The mean 1000-episode return after evaluation -185.67 +-53.04
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 trace_algorithms.py --n=1
Episode 100, mean 100-episode return -96.50 +-92.02
Episode 200, mean 100-episode return -53.64 +-76.70
Episode 300, mean 100-episode return -29.03 +-54.84
Episode 400, mean 100-episode return -8.78 +-21.69
Episode 500, mean 100-episode return -14.24 +-41.76
Episode 600, mean 100-episode return -4.57 +-17.56
Episode 700, mean 100-episode return -7.90 +-27.92
Episode 800, mean 100-episode return -2.17 +-16.67
Episode 900, mean 100-episode return -2.07 +-14.01
Episode 1000, mean 100-episode return 0.13 +-13.93
The mean 1000-episode return after evaluation -35.05 +-84.82
python3 trace_algorithms.py --n=4
Episode 100, mean 100-episode return -74.01 +-89.62
Episode 200, mean 100-episode return -4.84 +-20.95
Episode 300, mean 100-episode return 0.37 +-11.81
Episode 400, mean 100-episode return 1.82 +-8.04
Episode 500, mean 100-episode return 1.28 +-8.66
Episode 600, mean 100-episode return 3.13 +-7.02
Episode 700, mean 100-episode return 0.76 +-8.05
Episode 800, mean 100-episode return 2.05 +-8.11
Episode 900, mean 100-episode return 0.98 +-9.22
Episode 1000, mean 100-episode return 0.29 +-9.13
The mean 1000-episode return after evaluation -11.49 +-60.05
python3 trace_algorithms.py --n=8 --seed=62
Episode 100, mean 100-episode return -102.83 +-104.71
Episode 200, mean 100-episode return -5.02 +-23.36
Episode 300, mean 100-episode return -0.43 +-13.33
Episode 400, mean 100-episode return 1.99 +-8.89
Episode 500, mean 100-episode return -2.17 +-16.57
Episode 600, mean 100-episode return -2.62 +-19.87
Episode 700, mean 100-episode return 1.66 +-7.81
Episode 800, mean 100-episode return -7.40 +-36.75
Episode 900, mean 100-episode return -5.95 +-34.04
Episode 1000, mean 100-episode return 3.51 +-7.88
The mean 1000-episode return after evaluation 6.88 +-14.89
python3 trace_algorithms.py --n=4 --trace_lambda=0.6
Episode 100, mean 100-episode return -85.33 +-91.17
Episode 200, mean 100-episode return -16.06 +-39.97
Episode 300, mean 100-episode return -2.74 +-15.78
Episode 400, mean 100-episode return -0.33 +-9.93
Episode 500, mean 100-episode return 1.39 +-9.48
Episode 600, mean 100-episode return 1.59 +-9.26
Episode 700, mean 100-episode return 3.66 +-6.99
Episode 800, mean 100-episode return 2.08 +-7.26
Episode 900, mean 100-episode return 1.32 +-8.76
Episode 1000, mean 100-episode return 3.33 +-7.27
The mean 1000-episode return after evaluation 7.93 +-2.63
python3 trace_algorithms.py --n=8 --trace_lambda=0.6 --seed=77
Episode 100, mean 100-episode return -118.76 +-105.12
Episode 200, mean 100-episode return -21.82 +-49.91
Episode 300, mean 100-episode return -0.59 +-11.21
Episode 400, mean 100-episode return 2.27 +-8.29
Episode 500, mean 100-episode return 1.65 +-8.52
Episode 600, mean 100-episode return 1.16 +-10.32
Episode 700, mean 100-episode return 1.18 +-9.62
Episode 800, mean 100-episode return 3.35 +-7.34
Episode 900, mean 100-episode return 1.66 +-8.67
Episode 1000, mean 100-episode return 0.86 +-8.56
The mean 1000-episode return after evaluation -11.93 +-60.63
python3 trace_algorithms.py --n=1 --off_policy
Episode 100, mean 100-episode return -68.47 +-73.52
Episode 200, mean 100-episode return -29.11 +-34.15
Episode 300, mean 100-episode return -20.30 +-31.24
Episode 400, mean 100-episode return -13.44 +-25.04
Episode 500, mean 100-episode return -4.72 +-13.75
Episode 600, mean 100-episode return -3.07 +-17.63
Episode 700, mean 100-episode return -2.70 +-13.81
Episode 800, mean 100-episode return 1.32 +-11.79
Episode 900, mean 100-episode return 0.78 +-8.95
Episode 1000, mean 100-episode return 1.15 +-9.27
The mean 1000-episode return after evaluation -12.63 +-62.51
python3 trace_algorithms.py --n=4 --off_policy
Episode 100, mean 100-episode return -96.25 +-105.93
Episode 200, mean 100-episode return -26.21 +-74.65
Episode 300, mean 100-episode return -4.84 +-31.78
Episode 400, mean 100-episode return -0.34 +-9.46
Episode 500, mean 100-episode return 1.15 +-8.49
Episode 600, mean 100-episode return 2.95 +-7.20
Episode 700, mean 100-episode return 0.94 +-10.19
Episode 800, mean 100-episode return 0.13 +-9.27
Episode 900, mean 100-episode return 1.95 +-9.69
Episode 1000, mean 100-episode return 1.91 +-7.59
The mean 1000-episode return after evaluation 6.79 +-3.68
python3 trace_algorithms.py --n=8 --off_policy
Episode 100, mean 100-episode return -180.08 +-112.11
Episode 200, mean 100-episode return -125.56 +-124.82
Episode 300, mean 100-episode return -113.66 +-125.12
Episode 400, mean 100-episode return -77.98 +-117.08
Episode 500, mean 100-episode return -23.71 +-69.71
Episode 600, mean 100-episode return -21.44 +-67.38
Episode 700, mean 100-episode return -2.43 +-16.31
Episode 800, mean 100-episode return 2.38 +-7.42
Episode 900, mean 100-episode return 1.29 +-7.78
Episode 1000, mean 100-episode return 0.84 +-8.37
The mean 1000-episode return after evaluation 7.03 +-2.37
python3 trace_algorithms.py --n=1 --off_policy --vtrace_clip=1
Episode 100, mean 100-episode return -71.85 +-75.59
Episode 200, mean 100-episode return -29.60 +-39.91
Episode 300, mean 100-episode return -23.11 +-33.97
Episode 400, mean 100-episode return -12.00 +-21.72
Episode 500, mean 100-episode return -5.93 +-15.92
Episode 600, mean 100-episode return -7.69 +-16.03
Episode 700, mean 100-episode return -2.95 +-13.75
Episode 800, mean 100-episode return 0.45 +-9.76
Episode 900, mean 100-episode return 0.65 +-9.36
Episode 1000, mean 100-episode return -1.56 +-11.53
The mean 1000-episode return after evaluation -24.25 +-75.88
python3 trace_algorithms.py --n=4 --off_policy --vtrace_clip=1
Episode 100, mean 100-episode return -76.39 +-83.74
Episode 200, mean 100-episode return -3.32 +-13.97
Episode 300, mean 100-episode return -0.33 +-9.49
Episode 400, mean 100-episode return 2.20 +-7.80
Episode 500, mean 100-episode return 1.49 +-7.72
Episode 600, mean 100-episode return 2.27 +-8.67
Episode 700, mean 100-episode return 1.07 +-9.07
Episode 800, mean 100-episode return 3.17 +-6.27
Episode 900, mean 100-episode return 3.25 +-7.39
Episode 1000, mean 100-episode return 0.70 +-8.61
The mean 1000-episode return after evaluation 7.70 +-2.52
python3 trace_algorithms.py --n=8 --off_policy --vtrace_clip=1
Episode 100, mean 100-episode return -110.07 +-106.29
Episode 200, mean 100-episode return -7.22 +-32.31
Episode 300, mean 100-episode return 0.54 +-9.65
Episode 400, mean 100-episode return 2.03 +-7.82
Episode 500, mean 100-episode return 1.64 +-8.63
Episode 600, mean 100-episode return 1.54 +-7.28
Episode 700, mean 100-episode return 2.80 +-7.86
Episode 800, mean 100-episode return 1.69 +-7.26
Episode 900, mean 100-episode return 1.17 +-8.59
Episode 1000, mean 100-episode return 2.39 +-7.59
The mean 1000-episode return after evaluation 7.57 +-2.35
python3 trace_algorithms.py --n=4 --off_policy --vtrace_clip=1 --trace_lambda=0.6
Episode 100, mean 100-episode return -81.87 +-87.96
Episode 200, mean 100-episode return -15.94 +-29.31
Episode 300, mean 100-episode return -5.24 +-20.41
Episode 400, mean 100-episode return -1.01 +-12.52
Episode 500, mean 100-episode return 1.09 +-9.55
Episode 600, mean 100-episode return 0.73 +-9.15
Episode 700, mean 100-episode return 3.09 +-7.59
Episode 800, mean 100-episode return 3.13 +-7.60
Episode 900, mean 100-episode return 1.30 +-8.72
Episode 1000, mean 100-episode return 3.77 +-7.11
The mean 1000-episode return after evaluation 6.46 +-17.53
python3 trace_algorithms.py --n=8 --off_policy --vtrace_clip=1 --trace_lambda=0.6
Episode 100, mean 100-episode return -127.86 +-106.40
Episode 200, mean 100-episode return -27.64 +-48.34
Episode 300, mean 100-episode return -12.75 +-35.05
Episode 400, mean 100-episode return -0.38 +-14.28
Episode 500, mean 100-episode return 1.35 +-9.10
Episode 600, mean 100-episode return 0.43 +-10.53
Episode 700, mean 100-episode return 3.11 +-9.26
Episode 800, mean 100-episode return 3.58 +-6.81
Episode 900, mean 100-episode return 1.24 +-8.24
Episode 1000, mean 100-episode return 1.58 +-7.15
The mean 1000-episode return after evaluation 7.93 +-2.67
cheetah
Deadline: Jun 30, 22:00 2 points
In this exercise, use the DDPG/TD3/SAC algorithm to solve the HalfCheetah environment. If you start with DDPG, implementing the TD3 improvements should make the hyperparameter search significantly easier (a rough sketch of the TD3 target computation is shown below). However, in my experience, the SAC algorithm works best.
The template cheetah.py only creates the environment and shows the evaluation in ReCodEx.
In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 8000 on all of them, you obtain 2 points.
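For orientation only, the "TD3 improvements" mentioned above are clipped double-Q learning, target-policy smoothing, and delayed actor updates; a minimal sketch of the target computation, with assumed tensor shapes and function names that are not part of the template, could look like this:
import torch

def td3_target(rewards, next_states, dones, target_actor, target_critic_1, target_critic_2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    # A sketch only: compute TD3 bootstrap targets for a batch of transitions.
    with torch.no_grad():
        next_actions = target_actor(next_states)
        # Target-policy smoothing: perturb the target actions with clipped Gaussian noise.
        noise = (torch.randn_like(next_actions) * noise_std).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + noise).clamp(-max_action, max_action)
        # Clipped double-Q learning: bootstrap from the minimum of the two target critics.
        min_q = torch.minimum(target_critic_1(next_states, next_actions),
                              target_critic_2(next_states, next_actions))
        # Assumes rewards, dones, and min_q are broadcastable to the same shape.
        return rewards + gamma * (1.0 - dones) * min_q
Both critics are then regressed towards this single target, and the actor is updated less frequently than the critics.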
humanoid
Deadline: Jun 30, 22:00 3 points; not required for automatically passing the exam
In this exercise, use the DDPG/TD3/SAC algorithm to solve the Humanoid environment.
The template humanoid.py only creates the environment and shows the evaluation in ReCodEx.
In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 8000 on all of them, you obtain 3 points.
humanoid_standup
Deadline: Jun 30, 22:00 5 points; not required for automatically passing the exam
In this exercise, use the DDPG/TD3/SAC algorithm to solve the Humanoid Standup environment.
The template humanoid_standup.py only creates the environment and shows the evaluation in ReCodEx.
In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 380000 on all of them, you obtain 5 points.
ppo
Deadline: May 7, 22:00 5 points
Use the PPO algorithm to solve the SingleCollect environment implemented by the
single_collect.py
module. To familiarize with it, you can watch a trained agent
and you can play it interactively using python3 -m npfl139.envs.single_collect_interactive,
controlling the agent with the arrow keys. In the environment, your goal is to
reach a known place (e.g., a food source), obtaining rewards based on the
agent's distance. If the agent is continuously occupying the place for some
period of time, it gets a large reward and the place is moved randomly. The
environment runs for 250 steps and it is considered solved if you obtain
a return of at least 500.
The ppo.py template contains a skeleton of the PPO algorithm implementation. My implementation trains in approximately three minutes of CPU time.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the average return of 500 on all of them. Time limit for each test is 10 minutes.
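For orientation, the heart of PPO is the clipped surrogate objective; a minimal sketch of the policy-loss computation (the variable names are illustrative and not taken from the template) might look as follows:
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    # Probability ratio between the current policy and the policy that collected the data.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipped surrogate objective; we return its negation, because optimizers minimize.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    return -torch.minimum(unclipped, clipped).mean()
In a full implementation, this loss is combined with a value-function loss and usually an entropy bonus, and optimized for several epochs over minibatches of the collected rollout.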
az_quiz
Deadline: May 21, 22:00 5 points + 5 bonus
In this competition assignment, use Monte Carlo Tree Search to learn an agent for a simplified version of AZ-kvíz. In our version, the agent does not have to answer questions and we assume that all answers are correct.
The game itself is implemented in the az_quiz module, which is a subclass of a general board_game.
The evaluation in ReCodEx is performed via the
board_game_player
interface, most notably through the play method, which given an AZ-kvíz
instance returns the chosen move. The illustration of the interface is in the
az_quiz_player_random
module, which implements a random agent.
Your solution in ReCodEx is evaluated against a very simple heuristic
az_quiz_player_simple_heuristic,
in two different settings: first playing 560 games (280 as the starting player
and 280 as the non-starting player) with a total limit of 25 seconds,
and then playing just 112 games with a total limit of 5 minutes. During
the first test, you are expected to use just the trained policy in order to meet
the time limit; in the second test, you can also use MCTS during evaluation.
In all evaluations, you can see the win rate of your agent directly in ReCodEx,
and to pass you need to achieve at least 95% in both tests. To distinguish
the two mentioned tests in ReCodEx, the num_simulations argument is set to 0
during the first test, while it is left unmodified in the other tests.
ReCodEx also evaluates your agent against several more advanced players: a publicly available better heuristic az_quiz_player_fork_heuristic and three neural network agents (which you do not have access to apart from the ReCodEx evaluation). These additional evaluations also consist of 112 games with a time limit of 5 minutes, and are provided just for your convenience.
The final competition evaluation will be performed after the deadline by
a round-robin tournament. In this tournament, we also consider games
where the first move is chosen for the first player (FirstChosen label
in ReCodEx, --first_chosen option of the evaluator).
The evaluate method of the
board_game_evaluator
can be used in your code to evaluate any two given players. Furthermore, you can
evaluate players also from the command line using the
npfl139.board_games.evaluate
module. As players, you can use either the provided players
random,
simple_heuristic,
fork_heuristic,
mouse, and
keyboard, or
you can pass the name of any module implementing a Player: BoardGamePlayer
class. For illustration, you can use python3 -m npfl139.board_games.evaluate --render mouse fork_heuristic to interactively play against the fork heuristic, or
python3 -m npfl139.board_games.evaluate az_quiz_agent.py:--model_path=PATH:--num_simulations=0 simple_heuristic
to evaluate az_quiz_agent.py with the specified arguments against the simple heuristic.
The template for this assignment is available in the
az_quiz_agent.py
module, or in a variant board_game_agent.py.
The latter is nearly identical but slightly more general: it does not
mention AZQuiz directly in the code; instead, the game (and the player to
evaluate against) is specified only via an argument.
To get regular points, you must implement an AlphaZero-style algorithm. However, any algorithm can be used in the competition.
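As a reminder of what an AlphaZero-style search involves, child selection in the MCTS is typically governed by a PUCT-style rule; a rough sketch follows (the node attributes and the constant are illustrative, not part of the provided templates):
import math

def select_child(children, c_puct=1.25):
    # children: nodes assumed to carry visit_count, total_value, and prior attributes.
    parent_visits = sum(child.visit_count for child in children)
    def puct_score(child):
        value = child.total_value / child.visit_count if child.visit_count > 0 else 0.0
        exploration = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visit_count)
        return value + exploration
    return max(children, key=puct_score)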
az_quiz_cpp
In addition to the Python template for az_quiz, you can also use
board_game_cpp,
which is a directory containing a skeleton of C++ MCTS and self-play implementation.
Utilizing the C++ implementation is not required, but it offers a large speedup
(up to 10 times on a multi-core CPU and up to 100 times on a GPU). See
README.md
for more information.
In ReCodEx, you can submit the C++ implementation directly in the az_quiz
assignment by including all the C++ headers and sources plus the setup.py module
as part of your submission. When the setup.py module is present, ReCodEx
first compiles your submission using python3 setup.py build --build-platlib .
az_quiz_randomized
Deadline: Jun 30, 22:00
5 points; either this or pisqorky is required for automatically passing the exam
Extend the az_quiz assignment to handle the possibility of wrong
answers. Therefore, when choosing a field (an action), you might not
claim it; in such a case, the state of the field becomes “failed”. When
a “failed” field is chosen as an action by a player, then either
- it is successfully claimed by the player (they “answer correctly”); or
- if the player “answers incorrectly”, the field is claimed by the opposing player; however, the original player continues playing (i.e., the players do not alternate in this case).
To instantiate this randomized game variant, either pass randomized=True
to the npfl139.board_games.AZQuiz, or use az_quiz_randomized as the board
game name (e.g., as the argument to npfl139.board_games.evaluate or to
npfl139.board_games.BoardGame.from_name).
Your goal is to propose how to modify the Monte Carlo Tree Search to properly
handle stochastic MDPs. The information about distribution of possible next
states is provided by the AZQuiz.all_moves method, which returns a list of
(probability, az_quiz_instance) next states (in our environment, there are
always two possible next states).
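For illustration, the (probability, instance) pairs can be combined into an expectation over the chance outcomes, for example as in the sketch below (the exact all_moves signature may differ, so check the module):
def expected_value(game, action, evaluate):
    # Sketch: average an evaluation of the possible next states, weighted by the
    # probabilities returned by AZQuiz.all_moves (here assumed to take the action).
    return sum(probability * evaluate(next_game)
               for probability, next_game in game.all_moves(action))
How such expectations (or explicit chance nodes) are integrated into the tree search is exactly the part you are asked to design yourself.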
Your implementation must be capable of training and achieving at least 90% win
rate against the simple heuristic. The evaluation is performed in the same
setting as in az_quiz, so in order to pass, you must achieve 90% win rate both
in the first test (560 games in 25 seconds with num_simulations=0) and in the
second test (112 games in 5 minutes without modifying num_simulations).
Additionally, part of this assignment is also to write us on Piazza (once you pass in ReCodEx) a description of how you handle the stochasticity in MCTS; you will get points only after we finish the discussion.
pisqorky
Deadline: Jun 30, 22:00
5 points + 5 bonus; either this or az_quiz_randomized is required for automatically passing the exam
Train an agent on Piškvorky, usually called Gomoku internationally, implemented in the pisqorky module.
Because the game is more complex than az_quiz, you probably have to use the
C++ template board_game_cpp.
That template now supports both AZQuiz and Pisqorky.
The game provides quite a strong heuristic called simply heuristic;
in ReCodEx, your agent is evaluated against it, again in two different settings:
first playing 100 games (50 as a starting player and 50 as a non-starting
player) with a total limit of 15 minutes, and then playing just 10 games,
again with a total limit of 15 minutes. During the first test, you are expected to
again use just the trained policy (num_simulations is set to 0), while
in the second test you might use MCTS if you want (num_simulations is
not modified). In order to pass, you need to achieve at least 50% win rate
in both tests.
ReCodEx also evaluates your agent against a single neural network agent.
The evaluation consists of 100 games (50 starting, 50 non-starting) in
the FirstChosen setting without MCTS (num_simulations is set to 0),
with a time limit of 15 minutes.
The final competition evaluation will be performed after the deadline by a round-robin tournament, when the first move is chosen for the first player.
To get regular points, you must implement an AlphaZero-style algorithm. However, any algorithm can be used in the competition.
memory_game
Deadline: Jun 30, 22:00 3 points
In this exercise we explore a partially observable environment. Consider a one-player variant of a memory game (pexeso), where a player repeatedly flips cards. If the player flips two cards with the same symbol in succession, the cards are removed and the player receives a reward of +2. Otherwise the player receives a reward of -1. An episode ends when all cards are removed. Note that it is valid to try to flip an already removed card.
Let there be N cards in the environment, N being even. There are N+1
actions – the actions 1, …, N flip the corresponding card, and the action 0
flips the unused card with the lowest index (or some fixed card if all have
been used already). The observations consist of a pair of discrete values
(card, symbol), where the card is the index of the card flipped, and
the symbol is the symbol on the flipped card; the env.observation_space.nvec
is a pair (N, N/2), representing that there are N card indices and N/2
symbol indices.
Every episode can be ended by at most 3N/2 actions, and the required return is therefore greater or equal to zero. Note that the environment also enforces a hard limit on the number of actions per episode. The described environment is provided by the memory_game.py module.
Your goal is to solve the environment, using supervised learning via the provided
expert episodes and networks with external memory. The environment implements
an env.expert_episode() method, which returns a fresh correct episode
as a list of (state, action) pairs (with the last action being None).
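For instance, a supervised (behavior-cloning) training set can be collected from the expert episodes roughly like this (relying only on the documented env.expert_episode interface; everything else is illustrative):
def collect_expert_data(env, num_episodes):
    # Gather (state, action) pairs from expert episodes for supervised training.
    pairs = []
    for _ in range(num_episodes):
        episode = env.expert_episode()  # list of (state, action) pairs; the last action is None
        pairs.extend((state, action) for state, action in episode if action is not None)
    return pairs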
ReCodEx evaluates your solution on environments with 8, 12 and 16 cards
(utilizing the --cards argument). For each card number, 100 episodes are
simulated once you pass evaluating=True to env.reset and your solution gets
1 point if the average return is nonnegative. You can
train the agent directly in ReCodEx (the time limit is 15 minutes),
or submit a pre-trained one.
PyTorch template memory_game.py shows a possible way to use memory augmented networks.
memory_game_rl
Deadline: Jun 30, 22:00 5 points
This is a continuation of the memory_game assignment.
In this task, your goal is to solve the memory game environment
using reinforcement learning. That is, you must not use the
env.expert_episode method during training. You can start with PyTorch template
memory_game_rl.py,
which extends the memory_game template by generating training episodes
suitable for some reinforcement learning algorithm.
ReCodEx evaluates your solution on environments with 4, 6 and 8 cards (utilizing
the --cards argument). For each card number, your solution gets 2 points
(1 point for 4 cards) if the average return is nonnegative. You can train the agent
directly in ReCodEx (the time limit is 15 minutes), or submit a pre-trained one.
mappo
Deadline: Jun 30, 22:00 4 points
Implement MAPPO in a multi-agent setting. Notably, solve MultiCollect
(a multi-agent extension of SingleCollect) with 2 agents,
implemented again by the multi_collect.py
module (you can watch the trained agents).
The environment is a generalization of SingleCollect. If there are N
agents, there are also N target places, and each place rewards
the closest agent. Additionally, any agent colliding with others gets
a negative reward, and the environment reward is the average of the agents'
rewards (to keep the rewards less dependent on the number of agents).
Again, the environment runs for 250 steps and is considered solved
when reaching a return of at least 500.
The mappo.py
template contains a skeleton of the MAPPO algorithm implementation.
I use hyperparameter values quite similar to the ppo assignment, with
a notable exception of a smaller learning_rate=3e-4, which is already
specified in the template.
My implementation (with two independent networks) successfully converges in only circa 50% of the cases, and trains in roughly 10-20 minutes. You can also train using a single shared network for both agents, but then you need to indicate which agent the network should operate on (because the positions of both agents are part of the state); see the sketch below for one possible way.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the average return of 450 on all of them. Time limit for each test is 10 minutes.
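If you opt for the single shared network, one simple way of indicating the acting agent (hinted at above) is to append a one-hot agent id to the observation; a minimal sketch, assuming flat observation vectors:
import numpy as np

def with_agent_id(observation, agent_index, num_agents):
    # Append a one-hot agent identifier to a flat observation vector, so that
    # a single shared network can tell which agent it is currently acting for.
    agent_id = np.zeros(num_agents, dtype=np.float32)
    agent_id[agent_index] = 1.0
    return np.concatenate([np.asarray(observation, dtype=np.float32), agent_id])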
Submitting to ReCodEx
When submitting a competition solution to ReCodEx, you should submit a trained agent and a Python source capable of running it.
Furthermore, please also include the Python source and hyperparameters
you used to train the submitted model. But be careful that there still must be
exactly one Python source with a line starting with def main(.
Do not forget about the maximum allowed model size and time and memory limits.
Competition Evaluation
-
Before the deadline, ReCodEx prints the exact performance of your agent, but only if it is worse than the baseline.
If you surpass the baseline, the assignment is marked as solved in ReCodEx and you immediately get regular points for the assignment. However, ReCodEx does not print the reached performance.
-
After the competition deadline, the latest submission of every user surpassing the required baseline participates in a competition. Additional bonus points are then awarded according to the ordering of the performance of the participating submissions.
-
After the competition results announcement, ReCodEx starts to show the exact performance for all the already submitted solutions and also for the solutions submitted later.
Repeated Participation in Competitions
- If a participant got non-zero points for a competition task already in
previous years, they are treated slightly differently. Namely, every team
with one or more returning participants still gets competition points, but
- the returning team results are not shown on the slides on the practicals;
- the returning team results are shown in italics in ReCodEx;
- the returning team results are not used to compute the thresholds for competition points.
What Is Allowed
- Unless stated otherwise, you can use any algorithm to solve the competition task at hand, but the implementation must be created by you and you must understand it fully. You can of course take inspiration from any paper or existing implementation, but please reference it in that case.
- PyTorch, TensorFlow, and JAX are available in ReCodEx (but there are no GPUs).
Install
-
What Python version to use
The recommended Python version is 3.11. This version is used by ReCodEx to evaluate your solutions. Minimum required version is Python 3.10.
You can find out the version of your Python installation using python3 --version.
-
Installing to central user packages repository
You can install all required packages to the central user packages repository using
python3 -m pip install --user --no-cache-dir --extra-index-url=https://download.pytorch.org/whl/cu118 npfl139
On Linux and Windows, the above command installs the CUDA 11.8 PyTorch build, but you can change cu118 to:
- cpu to get the CPU-only (smaller) version,
- cu124 to get the CUDA 12.4 build,
- rocm6.2.4 to get the AMD ROCm 6.2.4 build (Linux only).
On macOS, the --extra-index-url has no effect and the Metal support is installed in any case.
To update the npfl139 package later, use python3 -m pip install --user --upgrade npfl139.
-
Installing to a virtual environment
Python supports virtual environments, which are directories containing independent sets of installed packages. You can create a virtual environment by running
python3 -m venv VENV_DIR
followed by
VENV_DIR/bin/pip install --no-cache-dir --extra-index-url=https://download.pytorch.org/whl/cu118 npfl139
(or VENV_DIR/Scripts/pip on Windows).
Again, apart from the CUDA 11.8 build, you can change cu118 on Linux and Windows to:
- cpu to get the CPU-only (smaller) version,
- cu124 to get the CUDA 12.4 build,
- rocm6.2.4 to get the AMD ROCm 6.2.4 build (Linux only).
To update the npfl139 package later, use VENV_DIR/bin/pip install --upgrade npfl139.
-
Windows installation
-
On Windows, it can happen that python3 is not in PATH, while the py command is – in that case you can use py -m venv VENV_DIR, which uses the newest Python available, or for example py -3.11 -m venv VENV_DIR, which uses Python version 3.11.
-
If MuJoCo environments fail during construction, make sure the path of the Python site packages contains no non-ASCII characters. If it does, you can create a new virtual environment in a suitable directory to circumvent the problem.
-
If you encounter a problem creating the logs in the
args.logdir directory, a possible cause is that the path is longer than 260 characters, which is the default maximum length of a complete path on Windows. However, you can increase this limit on Windows 10, version 1607 or later, by following the instructions.
-
-
MacOS installation
- If you encounter issues with SSL certificates (certificate verify failed:
self-signed certificate in certificate chain), you probably need to run the
Install Certificates.command, which should be executed after installation; see https://docs.python.org/3/using/mac.html#installation-steps.
-
GPU support on Linux and Windows
PyTorch supports NVIDIA and AMD GPUs out of the box; you just need to select the appropriate --extra-index-url when installing the packages.
If you encounter problems loading CUDA or cuDNN libraries, make sure your LD_LIBRARY_PATH does not contain paths to older CUDA/cuDNN libraries.
MetaCentrum
-
How to apply for MetaCentrum account?
After reading the Terms and conditions, you can apply for an account here.
After your account is created, please make sure that the directories containing your solutions are always private.
-
How to activate Python 3.11 on MetaCentrum?
On Metacentrum, currently the newest available Python is 3.11, which you need to activate in every session by running the following command:
module add python/3.11.11-gcc-10.2.1-555dlyc
-
How to install the required virtual environment on MetaCentrum?
To create a virtual environment, you first need to decide where it will reside. Either you can find permanent storage where you have a large enough quota, or you can use scratch storage for a submitted job.
TL;DR:
-
Run an interactive CPU job, asking for 16GB scratch space:
qsub -l select=1:ncpus=1:mem=8gb:scratch_local=16gb -I
-
In the job, use the allocated scratch space as the temporary directory:
export TMPDIR=$SCRATCHDIR
-
You should clear the scratch space before you exit using the
clean_scratch command. You can instruct the shell to call it automatically by running:
trap 'clean_scratch' TERM EXIT
-
Finally, create the virtual environment and install PyTorch in it:
module add python/3.11.11-gcc-10.2.1-555dlyc
python3 -m venv CHOSEN_VENV_DIR
CHOSEN_VENV_DIR/bin/pip install --no-cache-dir --extra-index-url=https://download.pytorch.org/whl/cu118 npfl139
-
-
How to run a GPU computation on MetaCentrum?
First, read the official MetaCentrum documentation: Basic terms, Run simple job, GPU computing, GPU clusters.
TL;DR: To run an interactive GPU job with 1 CPU, 1 GPU, 8GB RAM, and 32GB scratch space, run:
qsub -l select=1:ncpus=1:ngpus=1:mem=8gb:scratch_local=32gb -I
To run a script in a non-interactive way, replace the -I option with the script to be executed.
If you want to run a CPU-only computation, remove the ngpus=1: from the above commands.
AIC
-
How to install required packages on AIC?
Python 3.11.7 is available as /opt/python/3.11.7/bin/python3, so you should start by creating a virtual environment using
/opt/python/3.11.7/bin/python3 -m venv VENV_DIR
and then install the required packages in it using
VENV_DIR/bin/pip install --no-cache-dir --extra-index-url=https://download.pytorch.org/whl/cu118 npfl139
-
How to run a GPU computation on AIC?
First, read the official AIC documentation: Submitting CPU Jobs, Submitting GPU Jobs.
TL;DR: To run an interactive GPU job with 1 CPU, 1 GPU, and 16GB RAM, run:
srun -p gpu -c1 -G1 --mem=16G --pty bash
To run a shell script requiring a GPU in a non-interactive way, use
sbatch -p gpu -c1 -G1 --mem=16G SCRIPT_PATH
If you want to run a CPU-only computation, remove the -p gpu and -G1 from the above commands.
Git
-
Is it possible to keep the solutions in a Git repository?
Definitely. Keeping the solutions in a branch of your repository, where you merge them with the course repository, is probably a good idea. However, please keep the cloned repository with your solutions private.
-
On GitHub, do not create a public fork with your solutions
If you keep your solutions in a GitHub repository, please do not create a clone of the repository by using the Fork button – this way, the cloned repository would be public.
Of course, if you just want to create a pull request, GitHub requires a public fork and that is fine – just do not store your solutions in it.
-
How to clone the course repository?
To clone the course repository, run
git clone https://github.com/ufal/npfl139
This creates the repository in the npfl139 subdirectory; if you want a different name, add it as the last parameter.
To update the repository, run git pull inside the repository directory.
-
How to keep the course repository as a branch in your repository?
If you want to store the course repository just in a local branch of your existing repository, you can run the following commands while in it:
git remote add course_repo https://github.com/ufal/npfl139
git fetch course_repo
git checkout --track course_repo/master -b BRANCH_NAME
This creates a branch BRANCH_NAME, and when you run git pull in that branch, it will be updated to the current state of the course repository.
-
How to merge the course repository updates with your modified branch?
If you want to store your solutions in your branch and gradually update this branch to track the changes in the course repository, you should start by
git remote add course_repo https://github.com/ufal/npfl139
git fetch course_repo
git checkout --no-track course_repo/master -b BRANCH_NAME
which creates a branch BRANCH_NAME with the current state of the course repository. However, unlike in the previous case, git pull and git push in this branch will not operate on the course repository. Therefore, you can then commit to this branch and push it to your own repository.
To update your branch with the changes from the course repository, run
git fetch course_repo
git merge course_repo/master
while in your branch. Of course, it might be necessary to resolve conflicts if both you and I modified the same lines in the templates.
ReCodEx
-
What files can be submitted to ReCodEx?
You can submit multiple files of any type to ReCodEx. There is a limit of 20 files per submission, with a total size of 20MB.
-
What file does ReCodEx execute and what arguments does it use?
Exactly one file with the py suffix must contain a line starting with def main(. Such a file is imported by ReCodEx and the main method is executed (during the import, __name__ == "__recodex__").
The file must also export an argument parser called parser. ReCodEx uses its arguments and default values, but it overwrites some of the arguments depending on the test being executed – the template should always indicate which arguments are set by ReCodEx and which are left intact.
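A minimal file satisfying these requirements might be structured as follows (the concrete arguments are only an illustration; the real templates define their own):
import argparse

parser = argparse.ArgumentParser()
# ReCodEx reads the arguments and their defaults from this parser and overwrites
# some of them depending on the executed test (see the concrete template).
parser.add_argument("--seed", default=None, type=int, help="Random seed.")

def main(args: argparse.Namespace):
    # The training/evaluation logic goes here; during the ReCodEx import,
    # __name__ == "__recodex__" and main is then called by ReCodEx.
    ...

if __name__ == "__main__":
    main(parser.parse_args())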
-
What are the time and memory limits?
The memory limit during evaluation is 1.5GB. The time limit varies, but it should be at least 10 seconds and at least twice the running time of my solution.
-
Do agents need to be trained directly in ReCodEx?
No, you can pre-train your agent locally (unless specified otherwise in the task description).
Requirements
To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that all surplus points (both bonus and non-bonus) will be transferred to the exam. In total, assignments for at least 120 points (not including the bonus points) will be available, and if you solve all the assignments (any non-zero amount of points counts as solved), you automatically pass the exam with grade 1.
To pass the exam, you need to obtain at least 60, 75, or 90 points out of the 100-point exam to receive a grade 3, 2, or 1, respectively. The exam consists of questions from the list below, worth 100 points in total (the questions are randomly generated, but in such a way that there is at least one question from every lecture but the last). In addition, you can get surplus points from the practicals and at most 10 points for community work (i.e., fixing slides or reporting issues) – but only the points you already have at the time of the exam count. You can take the exam without passing the practicals first.
Exam Questions
Lecture 1 Questions
-
Derive how to incrementally update a running average (how to compute an average of N numbers using the average of the first N−1 numbers). [5]
-
Describe multi-arm bandits and write down the ε-greedy algorithm for solving it. [5]
-
Define a Markov Decision Process, including the definition of a return. [5]
-
Describe how a partially observable Markov decision process extends a Markov decision process and how the agent is altered. [5]
-
Define a value function, such that all expectations are over simple random variables (actions, states, rewards), not trajectories. [5]
-
Define an action-value function, such that all expectations are over simple random variables (actions, states, rewards), not trajectories. [5]
-
Express a value function using an action-value function, and express an action-value function using a value function. [5]
-
Define optimal value function, optimal action-value function, and the optimal policy. [5]
Lecture 2 Questions
-
Write down the Bellman optimality equation. [5]
-
Define the Bellman backup operator. [5]
-
Write down the value iteration algorithm. [5]
-
Define the supremum norm and prove that Bellman backup operator is a contraction with respect to this norm. [10]
-
Formulate and prove the policy improvement theorem. [10]
-
Write down the policy iteration algorithm. [10]
-
Write down the tabular Monte-Carlo on-policy every-visit ε-soft algorithm. [5]
-
Write down the Sarsa algorithm. [5]
-
Write down the Q-learning algorithm. [5]
Lecture 3 Questions
-
Elaborate on how importance sampling can estimate expectations with respect to the target policy π based on samples from the behavior policy b. [5]
-
Show how to estimate returns in the off-policy case, both with (a) ordinary importance sampling and (b) weighted importance sampling. [10]
-
Write down the Expected Sarsa algorithm and show how to obtain Q-learning from it. [10]
-
Write down the Double Q-learning algorithm. [10]
-
Show the bootstrapped estimate of the n-step return. [5]
-
Write down the update in on-policy n-step Sarsa (assuming you already have the previous n steps, actions, and rewards). [5]
-
Write down the update in off-policy n-step Sarsa with importance sampling (assuming you already have the previous n steps, actions, and rewards). [10]
-
Write down the update of the n-step Tree-backup algorithm (assuming you already have the previous n steps, actions, and rewards). [10]
-
Assuming function approximation, define Mean squared value error. [5]
-
Write down the gradient Monte-Carlo on-policy every-visit ε-soft algorithm. [10]
Lecture 4 Questions
-
Write down the semi-gradient ε-greedy Sarsa algorithm. [10]
-
Prove that semi-gradient TD update is not an SGD update of any loss. [10]
-
What are the three elements causing off-policy divergence with function approximation? Write down Baird's counterexample. [10]
-
Explain the role of a replay buffer in Deep Q Networks and describe what a single element of a replay buffer looks like. [5]
-
How is the target network used and updated in Deep Q Networks? [5]
-
Explain how reward clipping is used in Deep Q Networks. What other clipping is used? [5]
-
Formulate the loss used in Deep Q Networks. [5]
-
Write down the Deep Q Networks training algorithm. [10]
-
Explain the difference between DQN and Double DQN, and between Double DQN and Double Q-learning. [5]
-
Describe prioritized replay (how are transitions sampled from the replay buffer, how up-to-date are the priorities [according to which we sample], how are unseen transitions boosted, how is importance sampling used to account for the change in the sampling distribution). [10]
-
Describe a data structure that can be used to implement a prioritized replay buffer, so that it has a given maximum capacity, and insertion and sampling run in time logarithmic with respect to the maximum number of elements. [10]
-
How is the action-value function computed in dueling networks? [5]
Lecture 5 Questions
-
Describe a fully connected layer in Noisy nets (parametrization, computation, effective noise generation). [5]
-
Write down the distributional Bellman backup operator, define the Wasserstein distance, and state in which metric the distributional Bellman backup operator is a γ-contraction. [5]
-
Considering C51, describe how the distribution of returns is represented and how it is predicted using a neural network. [5]
-
Considering a distributional Q network (C51), describe how the predicted distributions are represented (what are the atoms, how do we get their probability), and write down the loss used to train a distributional Q network and an algorithm to compute it (including the mapping of atoms, which does not need to be mathematically flawless, but enough to describe how it should be done). [10]
-
Write down the final loss function in Rainbow, describe what atoms are, and explain how an atom logit is computed for a given state and action. [5]
-
How exactly are predicted distributions represented in quantile regression? What are the advantages of quantile regression compared to C51? [5]
-
Assume we are given the cumulative distribution function of a distribution and a quantile distribution approximating it. Write down the 1-Wasserstein distance between the two distributions, and explicitly write down what the closest quantile distribution looks like, assuming the cumulative distribution function is continuous. [10]
Lecture 6 Questions
-
Assume we can get samples from a distribution. Write down the loss to minimize if we want to estimate the mean of the distribution, and prove it. [5]
-
Assume we can get samples from a distribution. Write down the loss to minimize if we want to estimate the median of the distribution, and prove it. [5]
-
Assume we can get samples from a distribution. Write down the loss to minimize if we want to estimate a given quantile of the distribution, and prove it. [5]
-
Explain how we can solve the problem of quantile regression not being smooth around zero, including the formula of the result. [5]
-
Write down the QR-DQN-1 training algorithm, including the quantile Huber loss (it is fine to use κ = 1). What do the inputs and outputs of the network look like (including their shapes)? [10]
-
Describe the network inputs and outputs (including their shapes) of DQN, C51, QR-DQN, IQN. [5]
-
Describe the network architecture of IQN, including how the sampled quantile is represented. Then write down the training algorithm, including the quantile Huber loss (it is fine to use κ = 1). [10]
Lecture 7 Questions
-
Formulate the policy gradient theorem. [5]
-
Prove the part of the policy gradient theorem showing the value of ∇v_π(s). [10]
-
Assuming the policy gradient theorem, formulate the loss used by the REINFORCE algorithm and show how can its gradient be expressed as an expectation over states and actions. [5]
-
Write down the REINFORCE algorithm. [10]
-
Show that introducing a baseline does not influence the validity of the policy gradient theorem. [5]
-
Write down the REINFORCE with baseline algorithm. [10]
-
Write down the trajectory formulation of the operator version of REINFORCE, and show that the usual REINFORCE performs one gradient step to minimize the same utility function. [10]
-
Write down the one-step Actor-critic algorithm. [10]
-
How and why is entropy regularization used in policy gradient algorithms? What are the differences to ε-smooth policies? [5]
-
The Asynchronous advantage actor-critic (A3C) policy may utilize recurrent neural networks. How is the training structured to allow backpropagation through them (would vanilla DQN, vanilla REINFORCE, vanilla actor-critic work with recurrent neural networks)? [5]
-
Explain the difference between a regular Actor-critic and Parallel Advantage Actor Critic algorithms. [5]
Lecture 8 Questions
-
Considering continuous actions modeled by a normal distribution with diagonal covariance, describe how the policy distribution is computed (network architecture, output activation functions) and what the loss of a simple REINFORCE algorithm looks like. [5]
-
Formulate the deterministic policy gradient theorem for ∇v_π(s). [5]
-
Formulate the deterministic policy gradient theorem for ∇J(θ). [5]
-
Prove the part of the deterministic policy gradient theorem showing the value of ∇v_π(s). [10]
-
Write down the critic loss (or its derivative) and the actor policy loss (or its derivative) of the Deep Deterministic Policy Gradients (DDPG) algorithm. Make sure to distinguish the target networks from the ones being trained. [10]
-
How is the return estimated in the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm? [5]
-
Write down the critic loss (or its derivative) and the actor policy loss (or its derivative) of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. Make sure to distinguish the target networks from the ones being trained. [10]
-
Write down how the reward is augmented in Soft actor critic, and the definitions of the soft action-value function and the soft (state-)value function. Then, define the modified Bellman backup operator (be sure to indicate whether you are using the augmented or the non-augmented reward), whose repeated application converges to the soft action-value function, and explain why. [10]
-
Considering soft policy improvement of a policy π, write down the update formula for the improved policy π′, and prove that the soft action-value function of the improved policy is greater or equal to the soft action-value function of the original policy. [10]
-
Write down how the critics and the target critics are updated in the Soft actor critic algorithm. [5]
-
Write down how the actor is updated in the Soft actor critic algorithm, including the policy reparametrization trick. [5]
-
Regarding the entropy penalty coefficient α in the Soft actor critic, define what constrained optimization problem we are solving, what the corresponding Lagrangian is (and whether we are minimizing/maximizing it with respect to the policy and α), and what the update of α looks like. [5]
Lecture 9 Questions
-
Define a one-step TD error and express the n-step return as a sum of them. [5]
-
Define a one-step TD error and express the n-step return with off-policy correction using control variates as a sum of TD errors. [5]
-
Define the λ-return. [5]
-
Define the n-step truncated λ-return. [5]
-
Define a one-step TD error and express the n-step truncated λ-return as a sum of them. [5]
-
Define a one-step TD error and express the n-step truncated λ-return with off-policy correction as a sum of them. [5]
-
Define the V-trace estimate and write down the policy whose value function the V-trace estimate converges to. [10]
-
Explain why the fixed point of the V-trace operator does not depend on the truncation of all but the last importance sampling ratios. [10]
-
Write down the critic loss (or its derivative) and the actor policy loss (or its derivative) of the IMPALA algorithm, including the V-trace formula. [10]
-
Sketch the population based training used in the IMPALA algorithm. [5]
-
In PopArt normalization, the value function is computed based on a normalized value predictor n(s) as σ · n(s) + μ. Describe how to maintain μ and σ, how to compute the normalized advantage based on a return G, and how the normalized value predictor is modified when the estimates of μ and σ change. [10]
Lecture 10 Questions
-
What value is used in TRPO as the approximation of the objective? What is the value the TRPO algorithm maximizes, and under what constraint? [5]
-
Write down the PPO algorithm, including the generalized advantage estimation (the n-step truncated lambda return). [10]
-
Define the transformed Bellman operator. [5]
-
Define the transformed Bellman operator. Then, assuming h is a strictly monotonically increasing function and considering a deterministic Markov decision process, show to what a transformed Bellman operator converges and prove it. [10]
-
Write down the return transformation used for Atari environments (for example by R2D2). [5]
-
Describe the replay buffer elements in R2D2. What is the difference between the zero-state and stored-state strategies, and how is burn-in used? [5]
-
Write down the Retrace operator and describe the three possibilities of setting the traces c_t: importance sampling, Tree-backup(λ), and Retrace(λ). [10]
Lecture 11 Questions
-
Considering multi-arm bandits, write down the UCB algorithm. [5]
-
Describe the inputs and outputs of a neural network used in AlphaZero, and describe the inputs and outputs of a Monte-Carlo tree search. [5]
-
Write down the loss used in the AlphaZero algorithm. [5]
-
What quantities are kept in a node of an AlphaZero Monte-Carlo tree search? [5]
-
How are actions selected in an AlphaZero Monte-Carlo tree search? [10]
-
What does AlphaZero use to maintain exploration in a Monte-Carlo tree search, i.e., how does AlphaZero make sure that even actions with very low prior are sampled enough? [5]
-
Describe the backup phase of AlphaZero Monte-Carlo tree search, i.e., the steps you perform when you reach a leaf during the tree search. [5]
-
How are the actions selected in AlphaZero self-play? [5]
Lecture 12 Questions
-
Describe the three components of a MuZero model including their exact inputs and outputs, and show how they are used to traverse the MCTS tree. [10]
-
Describe the MCTS in MuZero – action selection (including the exact action-values used), how are the three components of a MuZero model used during the tree traversal and leaf evaluation, and the updates during the backup phase. [10]
-
Assuming we already have a filled replay buffer, describe the MuZero training – the losses and the target values used in them (make sure to mention where all the values come from, i.e., whether they are taken from the replay buffer or how they are computed). [10]
-
In AlphaZero, define the empirical visit count distribution generated by the AlphaZero action selection. Then define the distribution to which the visit count distribution converges; write down both the objective this distribution minimizes and the resulting solution. [10]
-
Describe the Gumbel-Max trick, i.e., write down how to perform sampling from a categorical distribution using an argmax over logits perturbed by Gumbel noise, including the procedure for sampling from the Gumbel distribution. [5]
-
Describe the Gumbel-Top-k trick, i.e., write down how to sample the top k actions from a categorical distribution when given its logits and suitable Gumbel noise. [5]
-
In GumbelZero, assuming actions are sampled using the Gumbel-Top-k trick, write down how to select an action so that choosing this action is a policy improvement (i.e., that it has greater or equal expected reward than the original policy), and prove it. [10]
Lecture 13 Questions
-
Describe the components of a typical latent-space model in PlaNet (the transition, observation, and reward functions; and the encoder) and the components of a recurrent state-space model (RSSM). [5]
-
Derive the variational lower bound on the observation log-likelihood used in PlaNet (you can utilize Jensen's inequality log E[x] ≥ E[log x]). [10]
-
Consider a model with a discrete categorical latent variable z sampled from a distribution p(z), with a loss L(z). Describe how we compute the derivative of the loss with respect to the parameters of p(z) using (a) a straight-through estimator, and (b) a REINFORCE-like gradient estimator with a baseline. [5]
-
Consider a discrete categorical variable z sampled from given logits. Define the Gumbel-softmax distribution with the same logits and a temperature T (no need to describe sampling from it), and describe the main difference between the categorical and the Gumbel-softmax distributions. [5]
-
Consider a model with a discrete categorical latent variable z sampled from a distribution p(z), with a loss L(z). Describe how we compute the derivative of the loss with respect to the parameters of p(z) using (a) a Gumbel-softmax estimator, and (b) a straight-through Gumbel-softmax estimator. [5]
-
Write down an algorithm implementing a straight-through estimator of a discrete categorical latent variable using automatic differentiation (i.e., in TensorFlow or PyTorch). [5]
-
Describe the six components of the DreamerV2 recurrent state-space model (RSSM). [5]
-
Explain the KL balancing used in DreamerV2. [5]
-
Describe the training of both a critic and an actor in DreamerV2 (including the explicit losses). [10]
Lecture 14 Questions
-
Define the Partially observable stochastic game. [5]
-
Define the Partially observable stochastic game, Agent-environment cycle game, and prove that every POSG can be represented as AECG and vice versa. [10]
-
In a Partially observable stochastic game, define the expected return for a given agent (including the probability of a trajectory), define the best response, and define the Nash equilibrium. [10]
-
In Reinforcement learning from human feedback, define what models are trained and write down the three parallel processes performed during training. [5]
-
In Reinforcement learning from human feedback, define a trajectory segment, write down how we estimate the probability that one trajectory segment is rated to be better than another trajectory segment according to the Bradley-Terry model, and then define the loss we minimize to fit the estimated reward in RLHF. [10]
-
When summarizing from human feedback, define the loss we minimize to fit the reward model, and then describe how we train the human feedback policies (make sure to describe all quantities of the loss in detail). What are the roles of the used KL term? [10]
-
In InstructGPT, write down the difference in collecting the rating compared to plain RLHF, define the loss used to train the reward model, and finally describe the reward used to train the PPO and PPO-ptx models. [5]



