In recent years, reinforcement learning has been combined with deep neural networks, giving rise to game agents with super-human performance (for example for Go, chess, or 1v1 Dota2, capable of being trained solely by self-play), datacenter cooling algorithms being 50% more efficient than trained human operators, or improved machine translation. The goal of the course is to introduce reinforcement learning employing deep neural networks, focusing both on the theory and on practical implementations.
Python programming skills and TensorFlow skills (or any other deep learning framework) are required, to the extent of the NPFL114 course. No previous knowledge of reinforcement learning is necessary.
SIS code: NPFL122
Semester: winter
E-credits: 5
Examination: 2/2 C+Ex
Guarantor: Milan Straka
All lectures and practicals will be recorded and available on this website.
1. Introduction to Reinforcement Learning Slides PDF Slides Lecture Questions bandits monte_carlo
2. Markov Decision Process, Optimal Solutions, Monte Carlo Methods Slides PDF Slides Lecture Questions policy_iteration policy_iteration_exact policy_iteration_mc_estarts policy_iteration_mc_egreedy
3. Temporal Difference Methods, Off-Policy Methods Slides PDF Slides Lecture TreeBackup Questions importance_sampling q_learning lunar_lander td_algorithms
4. Function Approximation, Deep Q Network Slides PDF Slides Lecture Questions q_learning_tiles q_network
5. Rainbow Slides PDF Slides Lecture Rainbow Questions car_racing
6. Policy Gradient Methods Slides PDF Slides Lecture Questions reinforce reinforce_baseline cart_pole_pixels
7. Continuous Actions, DDPG, TD3 Slides PDF Slides Lecture Questions paac paac_continuous ddpg
8. SAC, Eligibility Traces Slides PDF Slides Lecture Questions walker walker_hardcore
9. Eligibility traces, Impala, R2D2, Agent57 Slides PDF Slides Lecture Agent57 Questions trace_algorithms cheetah humanoid humanoid_standup
10. UCB, Monte Carlo Tree Search, AlphaZero Slides PDF Slides Lecture Questions az_quiz az_quiz_randomized
11. MuZero, PlaNet Slides PDF Slides Lecture Questions
12. ST and Gumbel-Softmax, DreamerV2, MERLIN, FTW Slides PDF Slides Lecture External Memory Questions memory_game memory_game_rl
13. Multi-Agent RL, PPO, MAPPO Slides PDF Slides Lecture PPO ppo mappo
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
The lecture content, including references to study materials.
The main study material is the Reinforcement Learning: An Introduction; second edition by Richard S. Sutton and Andrew G. Barto (reffered to as RLB). It is available online and also as a hardcopy.
References to study materials cover all theory required at the exam, and sometimes even more – the references in italics cover topics not required for the exam.
Oct 03 Slides PDF Slides Lecture Questions bandits monte_carlo
Oct 10 Slides PDF Slides Lecture Questions policy_iteration policy_iteration_exact policy_iteration_mc_estarts policy_iteration_mc_egreedy
Oct 17 Slides PDF Slides Lecture TreeBackup Questions importance_sampling q_learning lunar_lander td_algorithms
Oct 24 Slides PDF Slides Lecture Questions q_learning_tiles q_network
Oct 31 Slides PDF Slides Lecture Rainbow Questions car_racing
Nov 07 Slides PDF Slides Lecture Questions reinforce reinforce_baseline cart_pole_pixels
Nov 14 Slides PDF Slides Lecture Questions paac paac_continuous ddpg
Nov 21 Slides PDF Slides Lecture Questions walker walker_hardcore
Nov 28 Slides PDF Slides Lecture Agent57 Questions trace_algorithms cheetah humanoid humanoid_standup
Dec 05 Slides PDF Slides Lecture Questions az_quiz az_quiz_randomized
Dec 12 Slides PDF Slides Lecture Questions
Dec 19 Slides PDF Slides Lecture External Memory Questions memory_game memory_game_rl
Jan 02 Slides PDF Slides Lecture PPO ppo mappo
To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that all surplus points (both bonus and non-bonus) will be transfered to the exam. In total, assignments for at least 120 points (not including the bonus points) will be available, and if you solve all the assignments (any non-zero amount of points counts as solved), you automatically pass the exam with grade 1.
The tasks are evaluated automatically using the ReCodEx Code Examiner.
The evaluation is performed using Python 3.9, TensorFlow 2.8.3, TensorFlow Addons 0.16.1, TensorFlow Probability 0.16.0, NumPy 1.23.3, and Gym 0.26.1. You should install the exact version of these packages yourselves. For those using PyTorch, 1.12.1 is also available, together with corresponding versions of TorchVision and TorchAudio.
Solving assignments in teams (of size at most 3) is encouraged, but everyone has to participate (it is forbidden not to work on an assignment and then submit a solution created by other team members). All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. Each such solution must explicitly list all members of the team to allow plagiarism detection using this template.
Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty. While discussing assignments with any classmate is fine, each team must complete the assignments themselves, without using code they did not write (unless explicitly allowed). Of course, inside a team you are expected to share code and submit identical solutions.
Deadline: Oct 17, 7:59 a.m. 3 points
Implement the $ε$-greedy strategy for solving multi-armed bandits.
Start with the bandits.py
template, which defines MultiArmedBandits
environment, which has the following
three methods:
reset()
: reset the environmentstep(action) → reward
: perform the chosen action in the environment,
obtaining a rewardgreedy(epsilon)
: return True
with probability 1-epsilon
Your goal is to implement the following solution variants:
alpha
$=0$: perform $ε$-greedy search, updating the estimates using
averaging.alpha
$≠0$: perform $ε$-greedy search, updating the estimates using
a fixed learning rate alpha
.Note that the initial estimates should be set to a given value and epsilon
can
be zero, in which case purely greedy actions are used.
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 bandits.py --alpha=0 --epsilon=0.1 --initial=0
1.39 0.08
python3 bandits.py --alpha=0 --epsilon=0 --initial=1
1.48 0.22
python3 bandits.py --alpha=0.15 --epsilon=0.1 --initial=0
1.37 0.09
python3 bandits.py --alpha=0.15 --epsilon=0 --initial=1
1.52 0.04
Deadline: Oct 17, 7:59 a.m. 5 points
Solve the discretized CartPole-v1 environment
from the Gym library using the Monte Carlo
reinforcement learning algorithm. The gym
environments have the followng
methods and properties:
observation_space
: the description of environment observationsaction_space
: the description of environment actionsreset() → new_state, info
: starts a new episode, returning the new
state and additional environment-specific informationstep(action) → new_state, reward, terminated, truncated, info
: perform the
chosen action in the environment, returning the new state, obtained reward,
boolean flags indicating a terminal state and episode truncation, and
additional environment-specific informationWe additionaly extend the gym
environment by:
episode
: number of the current episode (zero-based)reset(start_evaluation=False) → new_state, info
: if start_evaluation
is
True
, an evaluation is startedOnce you finish training (which you indicate by passing start_evaluation=True
to reset
), your goal is to reach an average return of 490 during 100
evaluation episodes. Note that the environment prints your 100-episode
average return each 10 episodes even during training.
Start with the monte_carlo.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
Deadline: Oct 24, 7:59 a.m. 2 points
Consider the following gridworld:
Start with policy_iteration.py, which implements the gridworld mechanics, by providing the following methods:
GridWorld.states
: return the number of states (11
)GridWorld.actions
: return a list with labels of the actions (["↑", "→", "↓", "←"]
)GridWorld.step(state, action)
: return possible outcomes of performing the
action
in a given state
, as a list of triples containing
probability
: probability of the outcomereward
: reward of the outcomenew_state
: new state of the outcomeImplement policy iteration algorithm, with --steps
steps of policy
evaluation/policy improvement. During policy evaluation, use the current value
function and perform --iterations
applications of the Bellman equation.
Perform the policy evaluation asynchronously (i.e., update the value function
in-place for states $0, 1, …$). Assume the initial policy is “go North” and
initial value function is zero.
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -10.00← -10.95↑
0.00↑ 0.00← -7.50← -88.93←
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=2
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -8.31← -11.83←
0.00↑ 0.00← -1.50← -20.61←
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=3
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -6.46← -6.77←
0.00↑ 0.00← -0.76← -13.08↓
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=10
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -1.04← -0.83←
0.00↑ 0.00← -0.11→ -0.34↓
python3 policy_iteration.py --gamma=0.95 --iterations=10 --steps=10
11.93↓ 11.19← 10.47← 6.71↑
12.83↓ 10.30← 10.12←
13.70→ 14.73→ 15.72→ 16.40↓
python3 policy_iteration.py --gamma=1 --iterations=1 --steps=100
74.73↓ 74.50← 74.09← 65.95↑
75.89↓ 72.63← 72.72←
77.02→ 78.18→ 79.31→ 80.16↓
Deadline: Oct 24, 7:59 a.m. 2 points
Starting with policy_iteration_exact.py,
extend the policy_iteration
assignment to perform policy evaluation
exactly by solving a system of linear equations.
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 policy_iteration_exact.py --gamma=0.95 --steps=1
-0.00↑ -0.00↑ -0.00↑ -0.00↑
-0.00↑ -12.35← -12.35↑
-0.85← -8.10← -19.62← -100.71←
python3 policy_iteration_exact.py --gamma=0.95 --steps=2
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00← -11.05←
-0.00↑ -0.00↑ -0.00← -12.10↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=3
-0.00↑ 0.00↑ 0.00↑ 0.00↑
-0.00↑ -0.00← 0.69←
-0.00↑ -0.00↑ -0.00→ 6.21↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=4
-0.00↑ 0.00↑ 0.00↓ 0.00↑
-0.00↓ 5.91← 6.11←
0.65→ 6.17→ 14.93→ 15.99↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=5
2.83↓ 4.32→ 8.09↓ 5.30↑
12.92↓ 9.44← 9.35←
13.77→ 14.78→ 15.76→ 16.53↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=6
11.75↓ 8.15← 8.69↓ 5.69↑
12.97↓ 9.70← 9.59←
13.82→ 14.84→ 15.82→ 16.57↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=7
12.12↓ 11.37← 9.19← 6.02↑
13.01↓ 9.92← 9.79←
13.87→ 14.89→ 15.87→ 16.60↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=8
12.24↓ 11.49← 10.76← 7.05↑
13.14↓ 10.60← 10.42←
14.01→ 15.04→ 16.03→ 16.71↓
python3 policy_iteration_exact.py --gamma=0.9999 --steps=5
7385.23↓ 7392.62→ 7407.40↓ 7400.00↑
7421.37↓ 7411.10← 7413.16↓
7422.30→ 7423.34→ 7424.27→ 7425.84↓
Deadline: Oct 24, 7:59 a.m. 2 points
Starting with policy_iteration_mc_estarts.py,
extend the policy_iteration
assignment to perform policy evaluation
by using Monte Carlo estimation with exploring starts. Specifically,
we update the action-value function $q_\pi(s, a)$ by running a
simulation with a given number of steps and using the observed return
as its estimate.
The estimation can now be performed model-free (without the access to the full
MDP dynamics), therefore, the GridWorld.step
returns a randomly sampled
result instead of a full distribution.
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00→ 0.00↑ 0.00↓
python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --steps=10
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00↑ -19.50↑
0.27↓ 0.48← 2.21↓ 8.52↓
python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --steps=50
0.09↓ 0.32↓ 0.22← 0.15↑
0.18↑ -2.43← -5.12↓
0.18↓ 1.80↓ 3.90↓ 9.14↓
python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --steps=100
3.09↓ 2.42← 2.39← 1.17↑
3.74↓ 1.66← 0.18←
3.92→ 5.28→ 7.16→ 11.07↓
python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --steps=200
7.71↓ 6.76← 6.66← 3.92↑
8.27↓ 6.17← 5.31←
8.88→ 10.12→ 11.36→ 13.92↓
Deadline: Oct 24, 7:59 a.m. 2 points
Starting with policy_iteration_mc_egreedy.py,
extend the policy_iteration_mc_estarts
assignment to perform policy
evaluation by using $ε$-greedy Monte Carlo estimation. Specifically,
we update the action-value function $q_\pi(s, a)$ by running a
simulation with a given number of steps and using the observed return
as its estimate.
For the sake of replicability, use the provided
GridWorld.epsilon_greedy(epsilon, greedy_action)
method, which returns
a random action with probability of epsilon
and otherwise returns the
given greedy_action
.
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00→ 0.00→
0.00↑ 0.00↑ 0.00→ 0.00→
python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --steps=10
-1.20↓ -1.43← 0.00← -6.00↑
0.78→ -20.26↓ 0.00←
0.09← 0.00↓ -9.80↓ 10.37↓
python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --steps=50
-0.16↓ -0.19← 0.56← -6.30↑
0.13→ -6.99↓ -3.51↓
0.01← 0.00← 3.18↓ 7.57↓
python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --steps=100
-0.07↓ -0.09← 0.28← -4.66↑
0.06→ -5.04↓ -8.32↓
0.00← 0.00← 1.70↓ 4.38↓
python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --steps=200
-0.04↓ -0.04← -0.76← -4.15↑
0.03→ -8.02↓ -5.96↓
0.00← 0.00← 2.53↓ 4.36↓
python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --steps=500
-0.02↓ -0.02← -0.65← -3.52↑
0.01→ -11.34↓ -8.07↓
0.00← 0.00← 3.15↓ 3.99↓
Deadline: Oct 31, 7:59 a.m. 2 points
Using the FrozenLake-v1 environment, implement Monte Carlo weighted importance sampling to estimate state value function $V$ of target policy, which uniformly chooses either action 1 (down) or action 2 (right), utilizing behaviour policy, which uniformly chooses among all four actions.
Start with the importance_sampling.py template, which creates the environment and generates episodes according to behaviour policy.
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 importance_sampling.py --episodes=500
0.00 0.00 0.09 0.24
0.00 0.00 0.30 0.00
0.00 0.11 0.32 0.00
0.00 0.25 0.33 0.00
python3 importance_sampling.py --episodes=5000
0.03 0.00 0.01 0.03
0.04 0.00 0.09 0.00
0.10 0.24 0.23 0.00
0.00 0.44 0.49 0.00
python3 importance_sampling.py --episodes=50000
0.03 0.02 0.05 0.01
0.13 0.00 0.07 0.00
0.21 0.33 0.36 0.00
0.00 0.35 0.76 0.00
Deadline: Oct 31, 7:59 a.m. 4 points
Solve the MountainCar-v0 environment from the Gym library using the Q-learning reinforcement learning algorithm. Note that this task does not require TensorFlow.
The environment methods and properties are described in the monte_carlo
assignment.
Once you finish training (which you indicate by passing start_evaluation=True
to reset
), your goal is to reach an average return of -150 during 100
evaluation episodes.
You can start with the q_learning.py template, which parses several useful parameters, creates the environment and illustrates the overall usage. Note that setting hyperparameters of Q-learning is a bit tricky – I usually start with a larger value of $ε$ (like 0.2 or even 0.5) and then gradually decrease it to almost zero.
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
Deadline: Oct 31, 7:59 a.m. 7 points + 7 bonus
Solve the LunarLander-v2 environment from the Gym library. Note that this task does not require TensorFlow.
The environment methods and properties are described in the monte_carlo
assignment,
but include one additional method:
expert_trajectory(seed=None) → trajectory
: This method generates one expert
trajectory, where trajectory
is a list of triples (state, action, reward),
where the action and reward is None
when reaching the terminal state.
You cannot change the implementation of this method or use its internals in
any other way than just calling expert_trajectory()
. Furthermore,
you can use this method only during training, not during evaluation.
To pass the task, you need to reach an average return of 0 during 1000 evaluation episodes. During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 15 minutes.
The task is additionally a competition, and at most 7 points will be awarded according to the relative ordering of your solutions.
You can start with the lunar_lander.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
Deadline: Nov 07, 7:59 a.m. 4 points
Starting with the td_algorithms.py template, implement all of the following $n$-step TD methods variants:
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 td_algorithms.py --mode=sarsa --n=1
Episode 200, mean 100-episode return -235.23 +-92.94
Episode 400, mean 100-episode return -133.18 +-98.63
Episode 600, mean 100-episode return -74.19 +-70.39
Episode 800, mean 100-episode return -41.84 +-54.53
Episode 1000, mean 100-episode return -31.96 +-52.14
python3 td_algorithms.py --mode=sarsa --n=1 --off_policy
Episode 200, mean 100-episode return -227.81 +-91.62
Episode 400, mean 100-episode return -131.29 +-90.07
Episode 600, mean 100-episode return -65.35 +-64.78
Episode 800, mean 100-episode return -34.65 +-44.93
Episode 1000, mean 100-episode return -8.70 +-25.74
python3 td_algorithms.py --mode=sarsa --n=4
Episode 200, mean 100-episode return -277.55 +-146.18
Episode 400, mean 100-episode return -87.11 +-152.12
Episode 600, mean 100-episode return -6.95 +-23.28
Episode 800, mean 100-episode return -1.88 +-19.21
Episode 1000, mean 100-episode return 0.97 +-11.76
python3 td_algorithms.py --mode=sarsa --n=4 --off_policy
Episode 200, mean 100-episode return -339.11 +-144.40
Episode 400, mean 100-episode return -172.44 +-176.79
Episode 600, mean 100-episode return -36.23 +-100.93
Episode 800, mean 100-episode return -22.43 +-81.29
Episode 1000, mean 100-episode return -3.95 +-17.78
python3 td_algorithms.py --mode=expected_sarsa --n=1
Episode 200, mean 100-episode return -223.35 +-102.16
Episode 400, mean 100-episode return -143.82 +-96.71
Episode 600, mean 100-episode return -79.92 +-68.88
Episode 800, mean 100-episode return -38.53 +-47.12
Episode 1000, mean 100-episode return -17.41 +-31.26
python3 td_algorithms.py --mode=expected_sarsa --n=1 --off_policy
Episode 200, mean 100-episode return -231.91 +-87.72
Episode 400, mean 100-episode return -136.19 +-94.16
Episode 600, mean 100-episode return -79.65 +-70.75
Episode 800, mean 100-episode return -35.42 +-44.91
Episode 1000, mean 100-episode return -11.79 +-23.46
python3 td_algorithms.py --mode=expected_sarsa --n=4
Episode 200, mean 100-episode return -263.10 +-161.97
Episode 400, mean 100-episode return -102.52 +-162.03
Episode 600, mean 100-episode return -7.13 +-24.53
Episode 800, mean 100-episode return -1.69 +-12.21
Episode 1000, mean 100-episode return -1.53 +-11.04
python3 td_algorithms.py --mode=expected_sarsa --n=4 --off_policy
Episode 200, mean 100-episode return -376.56 +-116.08
Episode 400, mean 100-episode return -292.35 +-166.14
Episode 600, mean 100-episode return -173.83 +-194.11
Episode 800, mean 100-episode return -89.57 +-153.70
Episode 1000, mean 100-episode return -54.60 +-127.73
python3 td_algorithms.py --mode=tree_backup --n=1
Episode 200, mean 100-episode return -223.35 +-102.16
Episode 400, mean 100-episode return -143.82 +-96.71
Episode 600, mean 100-episode return -79.92 +-68.88
Episode 800, mean 100-episode return -38.53 +-47.12
Episode 1000, mean 100-episode return -17.41 +-31.26
python3 td_algorithms.py --mode=tree_backup --n=1 --off_policy
Episode 200, mean 100-episode return -231.91 +-87.72
Episode 400, mean 100-episode return -136.19 +-94.16
Episode 600, mean 100-episode return -79.65 +-70.75
Episode 800, mean 100-episode return -35.42 +-44.91
Episode 1000, mean 100-episode return -11.79 +-23.46
python3 td_algorithms.py --mode=tree_backup --n=4
Episode 200, mean 100-episode return -270.51 +-134.35
Episode 400, mean 100-episode return -64.27 +-109.50
Episode 600, mean 100-episode return -1.80 +-13.34
Episode 800, mean 100-episode return -0.22 +-13.14
Episode 1000, mean 100-episode return 0.60 +-9.37
python3 td_algorithms.py --mode=tree_backup --n=4 --off_policy
Episode 200, mean 100-episode return -248.56 +-147.74
Episode 400, mean 100-episode return -68.60 +-126.13
Episode 600, mean 100-episode return -6.25 +-32.23
Episode 800, mean 100-episode return -0.53 +-11.82
Episode 1000, mean 100-episode return 2.33 +-8.35
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 td_algorithms.py --episodes=10 --mode=sarsa --n=1
Episode 10, mean 100-episode return -652.70 +-37.77
python3 td_algorithms.py --episodes=10 --mode=sarsa --n=1 --off_policy
Episode 10, mean 100-episode return -632.90 +-126.41
python3 td_algorithms.py --episodes=10 --mode=sarsa --n=4
Episode 10, mean 100-episode return -715.70 +-156.56
python3 td_algorithms.py --episodes=10 --mode=sarsa --n=4 --off_policy
Episode 10, mean 100-episode return -649.10 +-171.73
python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=1
Episode 10, mean 100-episode return -641.90 +-122.11
python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=1 --off_policy
Episode 10, mean 100-episode return -633.80 +-63.61
python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=4
Episode 10, mean 100-episode return -713.90 +-107.05
python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=4 --off_policy
Episode 10, mean 100-episode return -648.20 +-107.08
python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=1
Episode 10, mean 100-episode return -641.90 +-122.11
python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=1 --off_policy
Episode 10, mean 100-episode return -633.80 +-63.61
python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=4
Episode 10, mean 100-episode return -663.50 +-111.78
python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=4 --off_policy
Episode 10, mean 100-episode return -708.50 +-125.63
Deadline: Nov 07, 7:59 a.m. 3 points
Improve the q_learning
task performance on the
MountainCar-v0 environment
using linear function approximation with tile coding.
Your goal is to reach an average reward of -110 during 100 evaluation episodes.
The environment methods are described in the q_learning
assignment, with
the following changes:
state
returned by the env.step
method is a list containing weight
indices of the current state (i.e., the feature vector of the state consists
of zeros and ones, and only the indices of the ones are returned). The
action-value function is therefore approximated as a sum of the weights whose
indices are returned by env.step
.env.observation_space.nvec
returns a list, where the $i$-th element
is a number of weights used by first $i$ elements of state
. Notably,
env.observation_space.nvec[-1]
is the total number of the weights.You can start with the q_learning_tiles.py
template, which parses several useful parameters and creates the environment.
Implementing Q-learning is enough to pass the assignment, even if both N-step
Sarsa and Tree Backup converge a little faster. The default number of tiles in
tile encoding (i.e., the size of the list with weight indices) is
args.tiles=8
, but you can use any number you want (but the assignment is
solvable with 8).
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
Deadline: Nov 07, 7:59 a.m. 5 points
Solve the continuous CartPole-v1 environment from the Gym library using Q-learning with neural network as a function approximation.
You can start with the q_network.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch (q_network.torch.py) or JAX instead, if you like.
The continuous environment is very similar to a discrete one, except
that the states are vectors of real-valued observations with shape
env.observation_space.shape
.
Use Q-learning with neural network as a function approximation, which for a given state returns state-action values for all actions. You can use any network architecture, but one hidden layer of several dozens ReLU units is a good start. Your goal is to reach an average return of 450 during 100 evaluation episodes.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes (so you can train in ReCodEx, but you can also pretrain your network if you like).
Deadline: Nov 14, 7:59 a.m. 8 points + 8 bonus
The goal of this competition is to use Deep Q Networks (and any of Rainbow improvements) on a more real-world CarRacing-v2 environment from the Gym library.
The supplied car_racing_environment.py
provides the environment. The states are RGB np.uint8
images of size
$96×96×3$, but you can downsample them even more. The actions
are also continuous and consist of an array with the following three elements:
steer
in range [-1, 1]gas
in range [0, 1]brake
in range [0, 1]; note that full brake is quite aggressive, so you
might consider using less force when brakingInternally you should probably generate discrete actions and convert them to the
required representation before the step
call. The smallest set is probably
left, right, gas, brake and no-op, but you can use a more fine-grained one if
you like.
The environment also supports frame skipping, which improves its performance (only some frames need to be rendered).
In ReCodEx, you are expected to submit an already trained model, which is evaluated on 15 different tracks with a total time limit of 15 minutes. If your average return is at least 300, you obtain 8 points. The task is also a competition, and at most 8 points will be awarded according to relative ordering of your solutions.
The car_racing.py template parses several useful parameters and creates the environment. Note that the car_racing_environment.py can be executed directly and in that case you can drive the car using arrows.
Also, you might want to use a vectorized version of the environment for
training, which runs several individual environments in separate processes.
The template contains instructions how to create it. The vectorized environment
expects a vector of actions and returns a vector of observations, rewards, dones
and infos. When one of the environments is done
, it is immediately reset and
state
is the initial state of a new episode.
Deadline: Nov 21, 7:59 a.m. 4 points
Solve the continuous CartPole-v1 environment from the Gym library using the REINFORCE algorithm.
Your goal is to reach an average return of 490 during 100 evaluation episodes.
Start with the reinforce.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch (reinforce.torch.py) or JAX instead, if you like.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
Deadline: Nov 21, 7:59 a.m. 4 points
This is a continuation of the reinforce
assignment.
Using the reinforce_baseline.py template, solve the continuous CartPole-v1 environment using the REINFORCE with baseline algorithm.
Using a baseline lowers the variance of the value function gradient estimator, which allows faster training and decreases sensitivity to hyperparameter values. To reflect this effect in ReCodEx, note that the evaluation phase will automatically start after 200 episodes. Using only 200 episodes for training in this setting is probably too little for the REINFORCE algorithm, but suffices for the variant with a baseline. In this assignment, you must train your agent in ReCodEx using the provided environment only.
Your goal is to reach an average return of 490 during 100 evaluation episodes.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
Deadline: Nov 21, 7:59 a.m. 4 points + 5 bonus
The supplied cart_pole_pixels_environment.py
generates a pixel representation of the CartPole
environment
as an $80×80$ np.uint8
image with three channels, with each channel representing one time step
(i.e., the current observation and the two previous ones).
During evaluation in ReCodEx, three different random seeds will be employed, each with time limit of 10 minutes, and if you reach an average return at least 400 on all of them, you obtain 4 points. The task is also a competition, and at most 5 points will be awarded according to relative ordering of your solutions.
The cart_pole_pixels.py template parses several parameters and creates the environment. You are again supposed to train the model beforehand and submit only the trained neural network.
Deadline: Nov 28, 7:59 a.m. 4 points
Solve the CartPole-v1 environment
using parallel actor-critic algorithm, employing the vectorized
environment described in the car_racing
assignment.
Your goal is to reach an average return of 450 during 100 evaluation episodes.
Start with the paac.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch or JAX instead, if you like.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.
Deadline: Nov 28, 7:59 a.m. 5 points
Solve the MountainCarContinuous-v0 environment
using parallel actor-critic algorithm with continuous actions.
When actions are continuous, env.action_space
is the same Box
space
as env.observation_space
, offering:
env.action_space.shape
, which specifies the shape of actions (you can assume
actions are always a 1D vector),env.action_space.low
and env.action_space.high
, which specify the ranges
of the corresponding actions.Your goal is to reach an average return of 90 during 100 evaluation episodes.
Start with the paac_continuous.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch or JAX instead, if you like.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.
Deadline: Nov 28, 7:59 a.m. 6 points
Solve the continuous Pendulum-v1 environment using deep deterministic policy gradient algorithm.
Your goal is to reach an average return of -200 during 100 evaluation episodes.
Start with the ddpg.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch instead, if you like.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.
Deadline: Dec 05, 7:59 a.m. 5 points
In this exercise we explore continuous robot control by solving the continuous BipedalWalker-v3 environment from the Gym library.
Note that the penalty of -100
on crash makes the training considerably slower.
Even if all of DDPG, TD3 and SAC can be trained with original rewards, overriding
the reward at the end of episode to 0
speeds up training considerably.
In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 200, you obtain 5 points.
The walker.py template contains the skeleton for implementing the SAC agent, but you can also solve the assignment with DDPG/TD3. The PyTorch template walker.torch.py is also available.
Deadline: Dec 05, 7:59 a.m. 6 points + 8 bonus
As an extension of the walker
assignment, solve the
hardcore version of the BipedalWalker-v3 environment
from the Gym library.
Note that the penalty of -100
on crash can discourage or even stop training,
so overriding the reward at the end of episode to 0
(or descresing it
substantially) makes the training considerably easier (I have not surpassed
return 0
with neither TD3 nor SAC with the original -100
penalty).
In ReCodEx, you are expected to submit an already trained model, which is evaluated with three seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 100, you obtain 6 points. The task is also a competition, and at most 8 points will be awarded according to relative ordering of your solutions.
The walker_hardcore.py
template shows a basic structure of evaluaton in ReCodEx, but
you most likely want to start either with ddpg.py.
or with walker.py
and just change the env
argument to BipedalWalkerHardcore-v3
.
Deadline: Dec 12, 7:59 a.m. 4 points
Starting with the trace_algorithms.py template, implement the following state value estimations:
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 trace_algorithms.py --n=1
Episode 100, mean 100-episode return -96.50 +-92.02
Episode 200, mean 100-episode return -53.64 +-76.70
Episode 300, mean 100-episode return -29.03 +-54.84
Episode 400, mean 100-episode return -8.78 +-21.69
Episode 500, mean 100-episode return -14.24 +-41.76
Episode 600, mean 100-episode return -4.57 +-17.56
Episode 700, mean 100-episode return -7.90 +-27.92
Episode 800, mean 100-episode return -2.17 +-16.67
Episode 900, mean 100-episode return -2.07 +-14.01
Episode 1000, mean 100-episode return 0.13 +-13.93
The mean 1000-episode return after evaluation -35.05 +-84.82
python3 trace_algorithms.py --n=4
Episode 100, mean 100-episode return -74.01 +-89.62
Episode 200, mean 100-episode return -4.84 +-20.95
Episode 300, mean 100-episode return 0.37 +-11.81
Episode 400, mean 100-episode return 1.82 +-8.04
Episode 500, mean 100-episode return 1.28 +-8.66
Episode 600, mean 100-episode return 3.13 +-7.02
Episode 700, mean 100-episode return 0.76 +-8.05
Episode 800, mean 100-episode return 2.05 +-8.11
Episode 900, mean 100-episode return 0.98 +-9.22
Episode 1000, mean 100-episode return 0.29 +-9.13
The mean 1000-episode return after evaluation -11.49 +-60.05
python3 trace_algorithms.py --n=8
Episode 100, mean 100-episode return -92.61 +-105.75
Episode 200, mean 100-episode return -6.49 +-21.66
Episode 300, mean 100-episode return -0.05 +-11.13
Episode 400, mean 100-episode return 1.40 +-7.92
Episode 500, mean 100-episode return -0.79 +-13.78
Episode 600, mean 100-episode return -3.73 +-25.97
Episode 700, mean 100-episode return 1.13 +-13.67
Episode 800, mean 100-episode return -5.98 +-28.62
Episode 900, mean 100-episode return 0.79 +-9.62
Episode 1000, mean 100-episode return 2.09 +-7.78
The mean 1000-episode return after evaluation -55.86 +-96.41
python3 trace_algorithms.py --n=4 --trace_lambda=0.6
Episode 100, mean 100-episode return -85.33 +-91.17
Episode 200, mean 100-episode return -16.06 +-39.97
Episode 300, mean 100-episode return -2.74 +-15.78
Episode 400, mean 100-episode return -0.33 +-9.93
Episode 500, mean 100-episode return 1.39 +-9.48
Episode 600, mean 100-episode return 1.59 +-9.26
Episode 700, mean 100-episode return 3.66 +-6.99
Episode 800, mean 100-episode return 2.08 +-7.26
Episode 900, mean 100-episode return 1.32 +-8.76
Episode 1000, mean 100-episode return 3.33 +-7.27
The mean 1000-episode return after evaluation 7.93 +-2.63
python3 trace_algorithms.py --n=8 --trace_lambda=0.6
Episode 100, mean 100-episode return -117.72 +-113.58
Episode 200, mean 100-episode return -24.27 +-57.52
Episode 300, mean 100-episode return -6.54 +-27.78
Episode 400, mean 100-episode return 0.11 +-9.50
Episode 500, mean 100-episode return 3.17 +-7.27
Episode 600, mean 100-episode return 1.99 +-8.11
Episode 700, mean 100-episode return 0.89 +-8.25
Episode 800, mean 100-episode return 3.24 +-8.59
Episode 900, mean 100-episode return 3.04 +-6.91
Episode 1000, mean 100-episode return 1.26 +-8.61
The mean 1000-episode return after evaluation -7.84 +-55.17
python3 trace_algorithms.py --n=1 --off_policy
Episode 100, mean 100-episode return -68.47 +-73.52
Episode 200, mean 100-episode return -29.11 +-34.15
Episode 300, mean 100-episode return -20.30 +-31.24
Episode 400, mean 100-episode return -13.44 +-25.04
Episode 500, mean 100-episode return -4.72 +-13.75
Episode 600, mean 100-episode return -3.07 +-17.63
Episode 700, mean 100-episode return -2.70 +-13.81
Episode 800, mean 100-episode return 1.32 +-11.79
Episode 900, mean 100-episode return 0.78 +-8.95
Episode 1000, mean 100-episode return 1.15 +-9.27
The mean 1000-episode return after evaluation -12.63 +-62.51
python3 trace_algorithms.py --n=4 --off_policy
Episode 100, mean 100-episode return -96.25 +-105.93
Episode 200, mean 100-episode return -26.21 +-74.65
Episode 300, mean 100-episode return -4.84 +-31.78
Episode 400, mean 100-episode return -0.34 +-9.46
Episode 500, mean 100-episode return 1.15 +-8.49
Episode 600, mean 100-episode return 2.95 +-7.20
Episode 700, mean 100-episode return 0.94 +-10.19
Episode 800, mean 100-episode return 0.13 +-9.27
Episode 900, mean 100-episode return 1.95 +-9.69
Episode 1000, mean 100-episode return 1.91 +-7.59
The mean 1000-episode return after evaluation 6.79 +-3.68
python3 trace_algorithms.py --n=8 --off_policy
Episode 100, mean 100-episode return -180.08 +-112.11
Episode 200, mean 100-episode return -125.56 +-124.82
Episode 300, mean 100-episode return -113.66 +-125.12
Episode 400, mean 100-episode return -77.98 +-117.08
Episode 500, mean 100-episode return -23.71 +-69.71
Episode 600, mean 100-episode return -21.44 +-67.38
Episode 700, mean 100-episode return -2.43 +-16.31
Episode 800, mean 100-episode return 2.38 +-7.42
Episode 900, mean 100-episode return 1.29 +-7.78
Episode 1000, mean 100-episode return 0.84 +-8.37
The mean 1000-episode return after evaluation 7.03 +-2.37
python3 trace_algorithms.py --n=1 --off_policy --vtrace_clip=1
Episode 100, mean 100-episode return -71.85 +-75.59
Episode 200, mean 100-episode return -29.60 +-39.91
Episode 300, mean 100-episode return -23.11 +-33.97
Episode 400, mean 100-episode return -12.00 +-21.72
Episode 500, mean 100-episode return -5.93 +-15.92
Episode 600, mean 100-episode return -7.69 +-16.03
Episode 700, mean 100-episode return -2.95 +-13.75
Episode 800, mean 100-episode return 0.45 +-9.76
Episode 900, mean 100-episode return 0.65 +-9.36
Episode 1000, mean 100-episode return -1.56 +-11.53
The mean 1000-episode return after evaluation -24.25 +-75.88
python3 trace_algorithms.py --n=4 --off_policy --vtrace_clip=1
Episode 100, mean 100-episode return -76.39 +-83.74
Episode 200, mean 100-episode return -3.32 +-13.97
Episode 300, mean 100-episode return -0.33 +-9.49
Episode 400, mean 100-episode return 2.20 +-7.80
Episode 500, mean 100-episode return 1.49 +-7.72
Episode 600, mean 100-episode return 2.27 +-8.67
Episode 700, mean 100-episode return 1.07 +-9.07
Episode 800, mean 100-episode return 3.17 +-6.27
Episode 900, mean 100-episode return 3.25 +-7.39
Episode 1000, mean 100-episode return 0.70 +-8.61
The mean 1000-episode return after evaluation 7.70 +-2.52
python3 trace_algorithms.py --n=8 --off_policy --vtrace_clip=1
Episode 100, mean 100-episode return -110.07 +-106.29
Episode 200, mean 100-episode return -7.22 +-32.31
Episode 300, mean 100-episode return 0.54 +-9.65
Episode 400, mean 100-episode return 2.03 +-7.82
Episode 500, mean 100-episode return 1.64 +-8.63
Episode 600, mean 100-episode return 1.54 +-7.28
Episode 700, mean 100-episode return 2.80 +-7.86
Episode 800, mean 100-episode return 1.69 +-7.26
Episode 900, mean 100-episode return 1.17 +-8.59
Episode 1000, mean 100-episode return 2.39 +-7.59
The mean 1000-episode return after evaluation 7.57 +-2.35
Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.
python3 trace_algorithms.py --episodes=50 --n=1
The mean 1000-episode return after evaluation -196.80 +-25.96
python3 trace_algorithms.py --episodes=50 --n=4
The mean 1000-episode return after evaluation -165.45 +-78.01
python3 trace_algorithms.py --episodes=50 --n=8
The mean 1000-episode return after evaluation -144.47 +-92.73
python3 trace_algorithms.py --episodes=50 --n=4 --trace_lambda=0.6
The mean 1000-episode return after evaluation -170.70 +-72.93
python3 trace_algorithms.py --episodes=50 --n=8 --trace_lambda=0.6
The mean 1000-episode return after evaluation -155.04 +-86.17
python3 trace_algorithms.py --episodes=50 --n=1 --off_policy
The mean 1000-episode return after evaluation -189.16 +-46.74
python3 trace_algorithms.py --episodes=50 --n=4 --off_policy
The mean 1000-episode return after evaluation -159.09 +-83.40
python3 trace_algorithms.py --episodes=50 --n=8 --off_policy
The mean 1000-episode return after evaluation -166.82 +-76.04
python3 trace_algorithms.py --episodes=50 --n=1 --off_policy --vtrace_clip=1
The mean 1000-episode return after evaluation -198.50 +-17.93
python3 trace_algorithms.py --episodes=50 --n=4 --off_policy --vtrace_clip=1
The mean 1000-episode return after evaluation -144.76 +-92.48
python3 trace_algorithms.py --episodes=50 --n=8 --off_policy --vtrace_clip=1
The mean 1000-episode return after evaluation -167.63 +-75.87
Deadline: Feb 12, 23:59 2 points
In this exercise, use the DDPG/TD3/SAC algorithm to solve the HalfCheetah environment.
The template cheetah.py only creates the environment and shows the evaluation in ReCodEx.
In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 8000 on all of them, you obtain 2 points.
Deadline: Feb 12, 23:59 2 points; not required for passing the exam with grade 1 by solving all assignments
In this exercise, use the DDPG/TD3/SAC algorithm to solve the Humanoid environment.
The template humanoid.py only creates the environment and shows the evaluation in ReCodEx.
In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 8000 on all of them, you obtain 2 points.
Deadline: Feb 12, 23:59 3 bonus points; not required for passing the exam with grade 1 by solving all assignments
In this exercise, use the DDPG/TD3/SAC algorithm to solve the Humanoid Standup environment.
The template humanoid_standup.py only creates the environment and shows the evaluation in ReCodEx.
In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 380000 on all of them, you obtain 3 bonus points.
Deadline: Jan 01, 23:59 (competition); Feb 12, 23:59 (regular points) 10 points + 10 bonus
In this competition assignment, use Monte Carlo Tree Search to learn an agent for a simplified version of AZ-kvíz. In our version, the agent does not have to answer questions and we assume that all answers are correct.
The game itself is implemented in the
az_quiz.py
module, using randomized=False
constructor argument.
The evaluation in ReCodEx should be implemented by returning an object
implementing a method play
, which given an AZ-kvíz instance returns the chosen
move. The illustration of the interface is in the
az_quiz_player_random.py
module, which implements a random agent.
Your solution in ReCodEx is automatically evaluated against a very simple heuristic az_quiz_player_simple_heuristic.py, playing 56 games as a starting player and 56 games as a non-starting player. The time limit for the games is 10 minutes and you should see the win rate directly in ReCodEx. If you achieve at least 90%, you will pass the assignment. A better heuristic az_quiz_player_fork_heuristic.py is also available for your evaluations.
The final competition evaluation will be performed after the deadline by
a round-robin tournament. In this tournament, we also consider games
where the first move is chosen for the first player (FirstChosen
label
in ReCodEx, --first_chosen
option of the evaluator).
The az_quiz_evaluator.py can be used to evaluate any two given implementations and there are two interactive players available, az_quiz_player_interactive_mouse.py and az_quiz_player_interactive_keyboard.py.
The starting template is available in the az_quiz_agent.py module. Furthermore, the az_quiz_cpp directory contains a skeleton of C++ MCTS and self-play implementation. Utilizing the C++ implementation is not required, but it offers a large speedup (up to 10 times on a multi-core CPU and up to 50-100 times on a GPU)
To get regular points, you must implement an AlphaZero-style algorithm. However, any algorithm can be used in the competition.
Deadline: Jan 01, 23:59 5 bonus; not required for passing the exam with grade 1 by solving all assignments
Extend the az_quiz
assignment to handle the possibility of wrong
answers. Therefore, when choosing a field, the agent might answer
incorrectly.
To instantiate this randomized game variant, pass randomized=True
to the AZQuiz
class of az_quiz.py.
The Monte Carlo Tree Search has to be slightly modified to handle stochastic
MDP. The information about distribution of possible next states is provided
by the AZQuiz.all_moves
method, which returns a list of (probability, az_quiz_instance)
next states (in our environment, there are always two
possible next states).
To enter the competition, you need to actually handle the randomness in your solution.
Deadline: Feb 12, 23:59 3 points
In this exercise we explore a partially observable environment. Consider a one-player variant of a memory game (pexeso), where a player repeatedly flip cards. If the player flips two cards with the same symbol in succession, the cards are removed and the player recieves a reward of +2. Otherwise the player recieves a reward of -1. An episode ends when all cards are removed. Note that it is valid to try to flip an already removed card.
Let there be $N$ cards in the environment, $N$ being even. There are $N+1$
actions – the first $N$ flip the corresponding card, and the last action
flips the unused card with the lowest index (or the card $N$ if all have
been used already). The observations consist of a pair of discrete values
(card, symbol), where the card is the index of the card flipped, and
the symbol is the symbol on the flipped card. The env.states
returns
a pair $(N, N/2)$, representing there are $N$ card indices and $N/2$
symbol indices.
Every episode can be ended by at most $3N/2$ actions, and the required return is therefore greater or equal to zero. Note that there is a limit of at most $2N$ actions per episode. The described environment is provided by the memory_game_environment.py module.
Your goal is to solve the environment, using supervised learning via the provided
expert episodes and networks with external memory. The environment implements
an env.expert_episode()
method, which returns a fresh correct episode
as a list of (state, action)
pairs (with the last action
being None
).
ReCodEx evaluates your solution on environments with 8, 12 and 16 cards
(utilizing the --cards
argument). For each card number, 100 episodes are
simulated once you pass evaluating=True
to env.reset
and your solution gets
1 point if the average return is nonnegative. You can
train the agent directly in ReCodEx (the time limit is 15 minutes),
or submit a pre-trained one.
A template memory_game.py is available, commenting a possible use of memory augmented networks.
Deadline: Feb 12, 23:59 5 points
This is a continuation of the memory_game
assignment.
In this task, your goal is to solve the memory game environment
using reinforcement learning. That is, you must not use the
env.expert_episode
method during training. You can start with the
memory_game_rl.py
template, which extends the memory_game
template by generating
training episodes suitable for some reinforcement learning algorithm.
ReCodEx evaluates your solution on environments with 4, 6 and 8 cards (utilizing
the --cards
argument). For each card number, your solution gets 2 points
(1 point for 4 cards) if the average return is nonnegative. You can train the agent
directly in ReCodEx (the time limit is 15 minutes), or submit a pre-trained one.
Deadline: Feb 12, 23:59 3 points
Implement the PPO algorithm in a single-agent settings. Notably, solve
the SingleCollect
environment implemented by the
multi_collect_environment.py
module. To familiarize with it, you can watch a trained agent
and you can run the module directly, controlling the agent with the arrow keys.
In the environment, your goal is to reach a known place, obtaining rewards
based on the agent's distance. If the agent is continuously occupying the place
for some period of time, it gets a large reward and the place is moved randomly.
The environment runs for 250 steps and it is considered solved if you obtain
a return of at least 500.
The ppo.py template contains a skeleton of the PPO algorithm implementation. Regarding the unspecified hyperparameters, I would consider the following ranges:
batch_size
between 64 and 512clip_epsilon
between 0.1 and 0.2epochs
between 1 and 10gamma
between 0.97 and 1.0trace_lambda
is usually 0.95envs
between 16 and 128worker_steps
between tens and hundredsMy implementation trains in approximately three minutes of CPU time.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the average return of 450 on all of them. Time limit for each test is 10 minutes.
Deadline: Feb 12, 23:59 4 points
Implement MAPPO in a multi-agent settings. Notably, solve MultiCollect
(a multi-agent extension of SingleCollect
) with 2 agents,
implemented again by the multi_collect_environment.py
module (you can watch the trained agents).
The environment is a generalization of SingleCollect
. If there are
$A$ agents, there are also $A$ target places, and each place rewards
the closest agent. Additionally, any agent colliding with others gets
a negative reward, and the environment reward is the average of the agents'
rewards (to keep the rewards less dependent on the number of agents).
Again, the environment runs for 250 steps and is considered solved
when reaching return of at least 500.
The mappo.py
template contains a skeleton of the MAPPO algorithm implementation.
I use hyperparameter values quite similar to the ppo
assignment, with
a notable exception of a smaller learning_rate=3e-4
, which is already
specified in the template.
My implementation successfully converges in only circa 50% of the cases, and trains in roughly 10-20 minutes.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the average return of 450 on all of them. Time limit for each test is 10 minutes.
When submitting a competition solution to ReCodEx, you should submit a trained agent and a Python source capable of running it.
Furthermore, please also include the Python source and hyperparameters
you used to train the submitted model. But be careful that there still must be
exactly one Python source with a line starting with def main(
.
Do not forget about the maximum allowed model size and time and memory limits.
Before the deadline, ReCodEx prints the exact performance of your agent, but only if it is worse than the baseline.
If you surpass the baseline, the assignment is marked as solved in ReCodEx and you immediately get regular points for the assignment. However, ReCodEx does not print the reached performance.
After the competition deadline, the latest submission of every user surpassing the required baseline participates in a competition. Additional bonus points are then awarded according to the ordering of the performance of the participating submissions.
After the competition results announcement, ReCodEx starts to show the exact performance for all the already submitted solutions and also for the solutions submitted later.
Installing to central user packages repository
You can install all required packages to central user packages repository using
pip3 install --user tensorflow==2.8.3 tensorflow_addons==0.16.1 tensorflow_probability==0.16.0 numpy==1.23.3 gym==0.26.1 pygame==2.1.2 mujoco==2.2.2 ufal.pybox2d==2.3.10.2 imageio==2.22.4
.
Installing to a virtual environment
Python supports virtual environments, which are directories containing
independent sets of installed packages. You can create a virtual environment
by running python3 -m venv VENV_DIR
followed by
VENV_DIR/bin/pip3 install tensorflow==2.8.3 tensorflow_addons==0.16.1 tensorflow_probability==0.16.0 numpy==1.23.3 gym==0.26.1 pygame==2.1.2 mujoco==2.2.2 ufal.pybox2d==2.3.10.2 imageio==2.22.4
.
(or VENV_DIR/Scripts/pip3
on Windows).
Windows installation
On Windows, it can happen that python3
is not in PATH, while py
command
is – in that case you can use py -m venv VENV_DIR
, which uses the newest
Python available, or for example py -3.9 -m venv VENV_DIR
, which uses
Python version 3.9.
If your Windows TensorFlow fails with ImportError: DLL load failed
,
you are probably missing
Visual C++ 2019 Redistributable.
If you encounter a problem creating the logs in the args.logdir
directory,
a possible cause is that the path is longer than 260 characters, which is
the default maximum length of a complete path on Windows. However, you can
increase this limit on Windows 10, version 1607 or later, by following
the instructions.
macOS installation
With an Intel processor, you should not need anything special.
If you have Apple Silicon, use tensorflow-macos==2.8.0 protobuf==3.19.6
instead of
tensorflow
. As of Oct 1, the dependency package grpcio
needs to be
compiled during the installation (automatically, but you need working Xcode);
the installation worked fine on my testing macOS. Furthermore, according to
this issue, a binary wheel for
grpcio
could be provided soon.
GPU support on Linux and Windows
TensorFlow 2.8 supports NVIDIA GPU out of the box, but you need to install CUDA 11.2 and cuDNN 8.1 libraries yourself.
GPU support on macOS
The AMD and Apple Silicon GPUs can be used by installing a plugin providing the GPU acceleration using:
python -m pip install tensorflow-metal==0.5.1
Errors when running with a GPU
If you encounter errors when running with a GPU:
export TF_FORCE_GPU_ALLOW_GROWTH=true
export TF_CPP_MIN_LOG_LEVEL=0
environmental variable,
which increases verbosity of the log messages.How to install TensorFlow dependencies on MetaCentrum?
To install CUDA, cuDNN, and Python 3.10 on MetaCentrum, it is enough to run in every session the following command:
module add python/python-3.10.4-gcc-8.3.0-ovkjwzd cuda/cuda-11.2.0-intel-19.0.4-tn4edsz cudnn/cudnn-8.1.0.77-11.2-linux-x64-intel-19.0.4-wx22b5t
How to install TensorFlow on MetaCentrum?
Once you have the required dependencies, you can create a virtual environment and install TensorFlow in it. However, note that by default the MetaCentrum jobs have a little disk space, so read about how to ask for scratch storage when submitting a job, and about quotas,
TL;DR:
Run an interactive CPU job, asking for 16GB scratch space:
qsub -l select=1:ncpus=1:mem=8gb:scratch_local=16gb -I
In the job, use the allocated scratch space as a temporary directory:
export TMPDIR=$SCRATCHDIR
Finally, create the virtual environment and install TensorFlow in it:
module add python/python-3.10.4-gcc-8.3.0-ovkjwzd cuda/cuda-11.2.0-intel-19.0.4-tn4edsz cudnn/cudnn-8.1.0.77-11.2-linux-x64-intel-19.0.4-wx22b5t
python3 -m venv CHOSEN_VENV_DIR
CHOSEN_VENV_DIR/bin/pip install --no-cache-dir tensorflow==2.8.3 tensorflow_addons==0.16.1 tensorflow_probability==0.16.0 numpy==1.23.3 gym==0.26.1 pygame==2.1.2 mujoco==2.2.2 ufal.pybox2d==2.3.10.2 imageio==2.22.4
How to run a GPU computation on MetaCentrum?
First, read the official MetaCentrum documentation: Beginners guide, About scheduling system, GPU clusters.
TL;DR: To run an interactive GPU job with 1 CPU, 1 GPU, 16GB RAM, and 8GB scatch space, run:
qsub -q gpu -l select=1:ncpus=1:ngpus=1:mem=16gb:scratch_local=8gb -I
To run a script in a non-interactive way, replace the -I
option with the script to be executed.
If you want to run a CPU-only computation, remove the -q gpu
and ngpus=1:
from the above commands.
How to install TensorFlow dependencies on AIC?
To enable CUDA 11.2 and cuDNN 8.1 on AIC, you can either use modules
as described in
the section “CUDA modules” at https://aic.ufal.mff.cuni.cz/index.php/Submitting_GPU_Jobs,
or you can add the following to your .profile
:
export PATH="/lnet/aic/opt/cuda/cuda-11.2/bin:$PATH"
export LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-11.2/lib64:/lnet/aic/opt/cuda/cuda-11.2/cudnn/8.1.1/lib64:/lnet/aic/opt/cuda/cuda-11.2/extras/CUPTI/lib64:$LD_LIBRARY_PATH"
How to run a GPU computation on AIC?
First, read the official AIC documentation: Submitting CPU Jobs, Submitting GPU Jobs.
TL;DR: To run an interactive GPU job with 1 CPU, 1 GPU, and 16GB RAM, run:
srun -p gpu -c1 --gpus=1 --mem=16G --pty bash
To run a shell script requiring a GPU in a non-interactive way, use
sbatch -p gpu -c1 --gpus=1 --mem=16G SCRIPT_PATH
If you want to run a CPU-only computation, remove the -p gpu
and --gpus=1
from the above commands.
What files can be submitted to ReCodEx?
You can submit multiple files of any type to ReCodEx. There is a limit of 20 files per submission, with a total size of 20MB.
What file does ReCodEx execute and what arguments does it use?
Exactly one file with py
suffix must contain a line starting with def main(
.
Such a file is imported by ReCodEx and the main
method is executed
(during the import, __name__ == "__recodex__"
).
The file must also export an argument parser called parser
. ReCodEx uses its
arguments and default values, but it overwrites some of the arguments
depending on the test being executed – the template should always indicate which
arguments are set by ReCodEx and which are left intact.
What are the time and memory limits?
The memory limit during evaluation is 1.5GB. The time limit varies, but it should be at least 10 seconds and at least twice the running time of my solution.
Do agents need to be trained directly in ReCodEx?
No, you can pre-train your agent locally (unless specified otherwise in the task description).
To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that all surplus points (both bonus and non-bonus) will be transfered to the exam. In total, assignments for at least 120 points (not including the bonus points) will be available, and if you solve all the assignments (any non-zero amount of points counts as solved), you automatically pass the exam with grade 1.
To pass the exam, you need to obtain at least 60, 75, or 90 points out of 100-point exam to receive a grade 3, 2, or 1, respectively. The exam consists of 100-point-worth questions from the list below (the questions are randomly generated, but in such a way that there is at least one question from every but the last lecture). In addition, you can get surplus points from the practicals and at most 10 points for community work (i.e., fixing slides or reporting issues) – but only the points you already have at the time of the exam count. You can take the exam without passing the practicals first.
Lecture 1 Questions
Derive how to incrementally update a running average (how to compute an average of $N$ numbers using the average of the first $N-1$ numbers). [5]
Describe multi-arm bandits and write down the $\epsilon$-greedy algorithm for solving it. [5]
Define a Markov Decision Process, including the definition of a return. [5]
Describe how does a partially observable Markov decision process extend the Markov decision process and how is the agent altered. [5]
Lecture 2 Questions
Define a value function, such that all expectations are over simple random variables (actions, states, rewards), not trajectories. [5]
Define an action-value function, such that all expectations are over simple random variables (actions, states, rewards), not trajectories. [5]
Express a value function using an action-value function, and express an action-value function using a value function. [5]
Define optimal value function and optimal action-value function. Then define optimal policy in such a way that its existence is guaranteed. [5]
Write down the Bellman optimality equation. [5]
Define the Bellman backup operator. [5]
Write down the value iteration algorithm. [5]
Define the supremum norm $||\cdot||_\infty$ and prove that Bellman backup operator is a contraction with respect to this norm. [10]
Formulate and prove the policy improvement theorem. [10]
Write down the policy iteration algorithm. [10]
Write down the tabular Monte-Carlo on-policy every-visit $\epsilon$-soft algorithm. [10]
Lecture 3 Questions
Write down the Sarsa algorithm. [10]
Write down the Q-learning algorithm. [10]
Write down the Double Q-learning algorithm. [10]
Elaborate on how can importance sampling estimate expectations with respect to $\pi$ based on samples of $b$. [5]
Show how to estimate returns in the off-policy case, both with (1) ordinary importance sampling and (2) weighted importance sampling. [10]
Write down the Expected Sarsa algorithm and show how to obtain Q-learning from it. [10]
Show the bootstrapped estimate of $n$-step return. [5]
Write down the update in on-policy $n$-step Sarsa (assuming you already have $n$ previous steps, actions and rewards). [5]
Write down the update in off-policy $n$-step Sarsa with importance sampling (assuming you already have $n$ previous steps, actions and rewards). [10]
Write down the update of $n$-step Tree-backup algorithm (assuming you already have $n$ previous steps, actions and rewards). [10]
Lecture 4 Questions
Assuming function approximation, define Mean squared value error. [5]
Write down the gradient Monte-Carlo on-policy every-visit $\epsilon$-soft algorithm. [10]
Write down the semi-gradient $\epsilon$-greedy Sarsa algorithm. [10]
Prove that semi-gradient TD update is not an SGD update of any loss. [10]
What are the three elements causing off-policy divergence with function approximation? Write down the Baird's counterexample. [10]
Explain the role of a replay buffer in Deep Q Networks. [5]
How is the target network used and updated in Deep Q Networks? [5]
Explain how is reward clipping used in Deep Q Networks. What other clipping is used? [5]
Formulate the loss used in Deep Q Networks. [5]
Write down the Deep Q Networks training algorithm. [10]
Lecture 5 Questions
Explain the difference between DQN and Double DQN. [5]
Describe prioritized replay (how are transitions sampled from the replay buffer, how up-to-date are the priorities [according to which we sample], how are unseen transitions boosted, how is importance sampling used to account for the change in the sampling distribution). [10]
How is the action-value function computed in dueling networks? [5]
Describe a fully connected layer in Noisy nets (parametrization, computation, effective noise generation). [5]
In Distributional RL, describe how is the distribution of rewards represented and how it is predicted using a neural network. [5]
Write down the distributional Bellman equation, describe how are the atom probabilities of a reward distribution modeled, and write down the loss used to train a distributional Q network (including the mapping of atoms, which does not need to be mathematically flawless -- it is enough to describe how it should be done). [10]
Lecture 6 Questions
Formulate the policy gradient theorem. [5]
Prove the part of the policy gradient theorem showing the value of $\nabla_{\boldsymbol\theta} v_\pi(s)$. [10]
Assuming the policy gradient theorem, formulate the loss used by the REINFORCE algorithm and show how can its gradient be expressed as an expectation over states and actions. [5]
Write down the REINFORCE algorithm. [10]
Show that introducing baseline does not influence validity of the policy gradient theorem. [5]
Write down the REINFORCE with baseline algorithm. [10]
Write down the trajectory formulation of the operator version of REINFORCE, and show that the usual REINFORCE performs one gradient step to minimize the same utility function. [10]
Write down the state-action formulation of the operator version of REINFORCE, and show that the usual REINFORCE performs one gradient step to minimize the same utility function. [10]
Write down the one-step Actor-critic algorithm. [10]
How and why is entropy regularization used in policy gradient algorithms? [5]
The Asynchronous advantage actor-critic (A3C) policy may utilize recurrent neural networks. How is the training structured to allow backpropagation through them (would vanilla DQN, vanilla REINFORCE, vanilla actor-critic work with recurrent neural networks)? [5]
Lecture 7 Questions
Explain the difference between a regular Actor-critic and Parallel Advantage Actor Critic algorithms. [5]
Considering continuous actions modeled by a normal distribution with diagonal covariance, describe how is the policy distribution computed (network architecture, output activation functions) and how does the loss of a simple REINFORCE algorithm look like. [5]
Formulate the deterministic policy gradient theorem for $\nabla_{\boldsymbol\theta} v_\pi(s)$. [5]
Formulate the deterministic policy gradient theorem for $\nabla_{\boldsymbol\theta} J(\boldsymbol\theta)$. [5]
Prove the part of the deterministic policy gradient theorem showing the value of $\nabla_{\boldsymbol\theta} v_\pi(s)$. [10]
Write down the critic loss (or its derivative) and the actor policy loss (or its derivative) of the Deep Deterministic Policy Gradients (DDPG) algorithm. Make sure to distinguish the target networks from the ones being trained. [10]
How is the return estimated in the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm? [5]
Write down the critic loss (or its derivative) and the actor policy loss (or its derivative) of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. Make sure to distinguish the target networks from the ones being trained. [10]
Lecture 8 Questions
Write down how is the reward augmented in Soft actor critic, and the definitions of the soft action-value function and the soft (state-)value function. Then, define the modified Bellman backup operator $\mathcal{T}_\pi$ (be sure to indicate whether you are using the augmented or non-augmented reward), whose repeated application converges to the soft actor-value function $q_\pi$, and prove it. [10]
Considering soft policy improvement of a policy $\pi$, write down the update formula for the improved policy $\pi'$, and prove that the soft action-value function of the improved policy is greater or equal to the soft action-value function of the original policy. [10]
Write down how are the critics and target critics updated in the Soft actor critic algorithm. [5]
Write down how is the actor updated in the Soft actor critic algorithm, including the policy reparametrization trick. [5]
Regarding the entropy penalty coefficient $\alpha$ in the Soft actor critic, define what constrained optimization problem we are solving, what is the corresponding Lagrangian (and whether we are minimizing/maximizing it with respect to the policy and $\alpha$), and what does the $\alpha$ update looks like. [5]
Define a one-step TD error and express the $n$-step return as a sum of them. [5]
Define a one-step TD error and express the $n$-step return with off-policy correction using control variates as a sum of TD errors. [5]
Lecture 9 Questions
Define the $\lambda$-return. [5]
Define the $n$-step truncated $\lambda$-return. [5]
Define a one-step TD error and express the $n$-step truncated $\lambda$-return as a sum of them. [5]
Define a one-step TD error and express the $n$-step truncated $\lambda$-return with off-policy correction as a sum of them. [5]
Define the V-trace estimate and write down the policy to whose value function the V-trace estimate converges to. [10]
Explain why the fixed point of the V-trace operator does not depend on the truncation of all but the last importance sampling ratios. [10]
Write down the critic loss (or its derivative) and the actor policy loss (or its derivative) of the IMPALA algorithm, including the V-trace formula. [10]
Sketch the population based training used in the IMPALA algorithm. [5]
In PopArt normalization, the value function is computed based on a normalized value predictor $n$ as $\sigma n + \mu$. Describe how to maintain $\sigma$ and $\mu$, how to compute normalized advantage based on return $G$, and how is the normalized value predictor modified when the estimates of $\sigma$ and $\mu$ change. [10]
Define the transformed Bellman operator. [5]
Define the transformed Bellman operator. Then, assuming $h$ is strictly monotonically increasing function and considering a deterministic Markov decision process, show to what does a transformed Bellman operator $\mathcal{T}_h$ converge and prove it. [10]
Write down the return transformation used for Atari environments (for example by R2D2). [5]
Describe the replay buffer elements in R2D2. What is the difference between the zero-state and stored-state strategies, and how is burn-in used? [5]
Write down the Retrace operator and describe the three possibilities of setting the traces $c_t$: importance sampling, Tree-backup($\lambda$) and Retrace($\lambda$). [10]
Lecture 10 Questions
Considering multi-arm bandits, write down the UCB algorithm. [5]
Describe the inputs and outputs of a neural network used in AlphaZero, and describe the inputs and outputs of a Monte-Carlo tree search. [5]
Write down the loss used in AlphaZero algorithm. [5]
What quantities are kept in a node of a Monte-Carlo tree search? [5]
How are actions selected in a Monte-Carlo tree search? [10]
What does AlphaZero use to maintain exploration in a Monte-Carlo tree search? [5]
Describe the backup phase of Monte-Carlo tree search, i.e., the steps you perform when you reach a leaf during the tree search. [5]
How are the actions selected in AlphaZero self-play? [5]
Lecture 11 Questions
Describe the three components of a MuZero model, and describe/draw how they are used to traverse the MCTS tree. [5]
Describe the MCTS in MuZero – action selection (including the exact action-values used), how are the three components of a MuZero model used during the tree traversal and leaf evaluation, and the updates during the backup phase. [10]
Assuming we already have a filled replay buffer, describe the MuZero training – the losses and the target values used in them. [10]
Describe the components of a typical latent-space model in PlaNet (the transition, observation and reward functions, the encoder) and the components of a recurrent state-space model (RSSM). [5]
Derive the variational lower bound on $\log p(o_{1:T} | a_{1:T})$ used in PlaNet (you can utilize the Jensen's inequality $\log \mathbb{E} [x] \ge \mathbb{E} [\log x]$). [10]
Lecture 12 Questions
Consider a model with a discrete categorical latent variable $\boldsymbol z$ sampled from $p(\boldsymbol z; \boldsymbol \theta)$, with a loss $L(\boldsymbol z; \boldsymbol \omega)$. Describe how we compute the derivative of the loss $L$ with respect to the parameters $\boldsymbol \theta$ using (a) a straight-through estimator, and (b) a REINFORCE-like gradient estimator with a baseline. [5]
Describe the Gumbel-Max trick; in other words, write down how to perform sampling from a categorical distribution using an $\operatorname{argmax}$, including the procedure for sampling from the $\operatorname{Gumbel}(0, 1)$ distribution. [5]
Consider a discrete categorical variable sampled from logits $\boldsymbol l$. Define the $\operatorname{Gumbel-softmax}(\boldsymbol l, T)$ distribution with logits $\boldsymbol l$ and a temperature $T$ (no need to describe sampling from $\operatorname{Gumbel}(0, 1)$), and describe the main difference between the $\operatorname{Gumbel-softmax}(\boldsymbol l, T)$ and the $\operatorname{softmax}(\boldsymbol l)$ distributions. [5]
Consider a model with a discrete categorical latent variable $\boldsymbol z$ sampled from $p(\boldsymbol z; \boldsymbol \theta)$, with a loss $L(\boldsymbol z; \boldsymbol \omega)$. Describe how we compute the derivative of the loss $L$ with respect to the parameters $\boldsymbol \theta$ using (a) a Gumbel-softmax estimator, and (b) a straight-through Gumbel-softmax estimator. [5]
Write down an algorithm implementing a straight-through estimator using automatic differentiation (i.e., in TensorFlow or Pytorch). [5]
Describe the six components of the DreamerV2 recurrent state-space model (RSSM). [5]
Explain the KL balancing used in DreamerV2. [5]
Describe the training of both a critic and an actor in DreamerV2 (including the explicit losses). [10]