Deep Reinforcement Learning – Winter 2020/21
In recent years, reinforcement learning has been combined with deep neural networks, giving rise to game agents with superhuman performance (for example for Go, chess, or 1v1 Dota2, capable of being trained solely by self-play), datacenter cooling algorithms being 50% more efficient than trained human operators, or improved machine translation. The goal of the course is to introduce reinforcement learning employing deep neural networks, focusing both on the theory and on practical implementations.
Python programming skills and TensorFlow skills (or skills in any other deep learning framework) are required, to the extent of the NPFL114 course. No previous knowledge of reinforcement learning is necessary.
About
SIS code: NPFL122
Semester: winter
E-credits: 6
Examination: 2/2 C+Ex
Guarantor: Milan Straka
Timespace Coordinates
 lecture: the lecture is held on Monday 15:40; first lecture is on Oct 05
 practicals: the practicals take place on Wednesday 9:00; first practicals are on Oct 07
Lectures
1. Introduction to Reinforcement Learning Slides PDF Slides Lecture Questions bandits monte_carlo
2. Markov Decision Process, Optimal Solutions, Monte Carlo Methods Slides PDF Slides Lecture Questions policy_iteration policy_iteration_exact policy_iteration_exploring_mc policy_iteration_greedy_mc
3. Temporal Difference Methods, Off-Policy Methods Slides PDF Slides Lecture Questions importance_sampling q_learning lunar_lander
4. Function Approximation, Deep Q Network Slides PDF Slides Lecture Questions td_algorithms q_learning_tiles q_network
5. Rainbow Slides PDF Slides Lecture Questions car_racing
6. Policy Gradient Methods Slides PDF Slides Lecture Questions reinforce reinforce_baseline cart_pole_pixels
7. PAAC, Continuous Actions, DDPG Slides PDF Slides Lecture Questions paac paac_continuous ddpg
8. TD3, SAC, TRPO, PPO Slides PDF Slides Lecture Questions walker walker_hardcore sac_bonus
9. Eligibility traces, V-trace, IMPALA Slides PDF Slides Lecture Questions trace_algorithms
10. UCB, Monte Carlo Tree Search, AlphaZero Slides PDF Slides Lecture Questions az_quiz az_quiz_randomized
11. PopArt Normalization, R2D2, MuZero Slides PDF Slides Lecture Questions paac_lstm
12. Retrace, NGU, Agent57 Lecture
13. Explicit Memory, MERLIN, FTW Slides PDF Slides Lecture memory_game memory_game_rl
License
Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.
The lecture content, including references to study materials.
The main study material is Reinforcement Learning: An Introduction, second edition, by Richard S. Sutton and Andrew G. Barto (referred to as RLB). It is available online and also as a hardcopy.
References to study materials cover all theory required at the exam, and sometimes even more – the references in italics cover topics not required for the exam.
1. Introduction to Reinforcement Learning
Oct 05 Slides PDF Slides Lecture Questions bandits monte_carlo
 History of RL [Chapter 1 of RLB]
 Multi-armed bandits [Sections 2–2.6 of RLB]
 Markov Decision Process [Sections 3–3.3 of RLB]
2. Markov Decision Process, Optimal Solutions, Monte Carlo Methods
Oct 12 Slides PDF Slides Lecture Questions policy_iteration policy_iteration_exact policy_iteration_exploring_mc policy_iteration_greedy_mc
 Policies and Value Functions [Sections 3.5–3.6 of RLB]
 Value Iteration [Sections 4 and 4.4 of RLB]
 Proof of convergence only in slides
 Policy Iteration [Sections 4.1–4.3 of RLB]
 Generalized Policy Iteration [Section 4.6 of RLB]
 Monte Carlo Methods [Sections 5–5.4 of RLB]
3. Temporal Difference Methods, Off-Policy Methods
Oct 19 Slides PDF Slides Lecture Questions importance_sampling q_learning lunar_lander
 Model-free and model-based methods, using state-value or action-value functions [Chapter 8 before Section 8.1, and Section 6.8 of RLB]
 Temporal-difference methods [Sections 6–6.3 of RLB]
 Sarsa [Section 6.4 of RLB]
 Q-learning [Section 6.5 of RLB]
 Off-policy Monte Carlo Methods [Sections 5.5–5.7 of RLB]
 Expected Sarsa [Section 6.6 of RLB]
 N-step TD policy evaluation [Section 7.1 of RLB]
 N-step Sarsa [Section 7.2 of RLB]
 Off-policy n-step Sarsa [Section 7.3 of RLB]
 Tree backup algorithm [Section 7.5 of RLB]
4. Function Approximation, Deep Q Network
Oct 26 Slides PDF Slides Lecture Questions td_algorithms q_learning_tiles q_network
 Function approximation [Sections 9–9.3 of RLB]
 Tile coding [Section 9.5.4 of RLB]
 Linear function approximation [Section 9.4 of RLB, without the Proof of Convergence of Linear TD(0)]
 Semi-Gradient TD methods [Sections 9.3, 10–10.2 of RLB]
 Off-policy function approximation TD divergence [Sections 11.2–11.3 of RLB]
 Deep Q Network [Volodymyr Mnih et al.: Human-level control through deep reinforcement learning]
5. Rainbow
Nov 02 Slides PDF Slides Lecture Questions car_racing
 Double Deep Q Network (DDQN) [Hado van Hasselt et al.: Deep Reinforcement Learning with Double Q-learning]
 Prioritized Experience Replay [Tom Schaul et al.: Prioritized Experience Replay]
 Dueling Deep Q Network [Ziyu Wang et al.: Dueling Network Architectures for Deep Reinforcement Learning]
 Noisy Nets [Meire Fortunato et al.: Noisy Networks for Exploration]
 Distributional Reinforcement Learning [Marc G. Bellemare et al.: A Distributional Perspective on Reinforcement Learning]
 Rainbow [Matteo Hessel et al.: Rainbow: Combining Improvements in Deep Reinforcement Learning]
6. Policy Gradient Methods
Nov 09 Slides PDF Slides Lecture Questions reinforce reinforce_baseline cart_pole_pixels
 Policy Gradient Methods [Sections 13–13.1 of RLB]
 Policy Gradient Theorem [Section 13.2 of RLB]
 REINFORCE algorithm [Section 13.3 of RLB]
 REINFORCE with baseline algorithm [Section 13.4 of RLB]
 Actor-Critic methods [Section 13.5 of RLB, without the eligibility traces variant]
 A3C and asynchronous RL [Volodymyr Mnih et al.: Asynchronous Methods for Deep Reinforcement Learning]
7. PAAC, Continuous Actions, DDPG
Nov 16 Slides PDF Slides Lecture Questions paac paac_continuous ddpg
 PAAC [Alfredo V. Clemente et al.: Efficient Parallel Methods for Deep Reinforcement Learning]
 Gradient methods with continuous actions [Section 13.7 of RLB]
 Deterministic policy gradient theorem (DPG) [David Silver et al.: Deterministic Policy Gradient Algorithms]
 Deep deterministic policy gradient (DDPG) [Timothy P. Lillicrap et al.: Continuous Control with Deep Reinforcement Learning]
8. TD3, SAC, TRPO, PPO
Nov 23 Slides PDF Slides Lecture Questions walker walker_hardcore sac_bonus
 Twin delayed deep deterministic policy gradient (TD3) [Scott Fujimoto et al.: Addressing Function Approximation Error in ActorCritic Methods]
 Soft actor-critic (SAC) [Tuomas Haarnoja et al.: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor]
 Natural policy gradient (NPG) [Sham Kakade: A Natural Policy Gradient]
 Truncated natural policy gradient (TNPG), Trust Region Policy Optimization (TRPO) [John Schulman et al.: Trust Region Policy Optimization]
 Proximal policy optimization (PPO) [John Schulman et al.: Proximal Policy Optimization Algorithms]
9. Eligibility traces, V-trace, IMPALA
Nov 30 Slides PDF Slides Lecture Questions trace_algorithms
 Off-policy correction using control variates [Section 7.4 of RLB]
 Eligibility traces [Sections 12, 12.1, 12.3, 12.8, 12.9 of RLB]
 TD(λ) [Section 12.2 of RLB]
 The V-trace algorithm, IMPALA [Lasse Espeholt et al.: IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures]
10. UCB, Monte Carlo Tree Search, AlphaZero
Dec 07 Slides PDF Slides Lecture Questions az_quiz az_quiz_randomized
 UCB [Section 2.7 of RLB]
 Monte-Carlo tree search [Guillaume Chaslot et al.: Monte-Carlo Tree Search: A New Framework for Game AI]
 AlphaZero [David Silver et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]
11. PopArt Normalization, R2D2, MuZero
Dec 14 Slides PDF Slides Lecture Questions paac_lstm
 PopArt reward normalization [Matteo Hessel et al.: Multi-task Deep Reinforcement Learning with PopArt]
 Transformed rewards [Tobias Pohlen et al.: Observe and Look Further: Achieving Consistent Performance on Atari]
 Recurrent Replay Distributed DQN (R2D2) [Steven Kapturowski et al.: Recurrent Experience Replay in Distributed Reinforcement Learning]
 MuZero [Julian Schrittwieser et al.: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model]
12. Retrace, NGU, Agent57
Dec 21 Lecture
 Retrace [Rémi Munos et al.: Safe and Efficient OffPolicy Reinforcement Learning]
 Never give up [Adrià P. Badia et al.: Never Give Up: Learning Directed Exploration Strategies]
 Agent57 [Adrià P. Badia et al.: Agent57: Outperforming the Atari Human Benchmark]
13. Explicit Memory, MERLIN, FTW
Jan 04 Slides PDF Slides Lecture memory_game memory_game_rl
 Variational RNN [Junyoung Chung et al.: A Recurrent Latent Variable Model for Sequential Data]
 Generative temporal models with memory [Mevlana Gemici: Generative Temporal Models with Memory]
 PlaNet [Danijar Hafner et al.: Learning Latent Dynamics for Planning from Pixels]
 MERLIN model [Greg Wayne et al.: Unsupervised Predictive Memory in a Goal-Directed Agent]
 FTW agent for multiplayer CTF [Max Jaderberg et al.: Human-level performance in first-person multiplayer games with population-based deep reinforcement learning]
Requirements
To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that up to 40 points above 80 (including the bonus points) will be transferred to the exam.
Environment
The tasks are evaluated automatically using the ReCodEx Code Examiner.
The evaluation is performed using Python 3.8, TensorFlow 2.3.1, TensorFlow Probability 0.11.1, NumPy 1.18.5, OpenAI Gym 0.17.2 and Box2D 2.3.10. You should install the exact versions of these packages yourselves. For those using PyTorch, version 1.6.0 is also available.
Teamwork
Solving assignments in teams of size 2 or 3 is encouraged, but everyone has to participate (it is forbidden not to work on an assignment and then submit a solution created by other team members). All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. Each such solution must explicitly list all members of the team to allow plagiarism detection using this template.
bandits
Deadline: Oct 20, 23:59 4 points
Implement the $ε$-greedy strategy for solving multi-armed bandits.
Start with the bandits.py template, which defines the MultiArmedBandits environment, which has the following three methods:
 reset(): reset the environment
 step(action) → reward: perform the chosen action in the environment, obtaining a reward
 greedy(epsilon): return True with probability 1 - epsilon
Your goal is to implement the following solution variants:
 alpha $=0$: perform $ε$-greedy search, updating the estimates using averaging.
 alpha $≠0$: perform $ε$-greedy search, updating the estimates using a fixed learning rate alpha.
Note that the initial estimates should be set to a given value and epsilon can be zero, in which case purely greedy actions are used.
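The two variants differ only in the step size of the incremental estimate update. A minimal sketch of the whole loop, assuming a Gaussian bandit (the `run_bandit` function and its signature are illustrative, not the bandits.py interface):

```python
import numpy as np

def run_bandit(arm_means, steps, epsilon, alpha=0.0, initial=0.0, seed=42):
    """Illustrative epsilon-greedy agent on a Gaussian multi-armed bandit."""
    rng = np.random.RandomState(seed)
    n = len(arm_means)
    q = np.full(n, initial, dtype=float)  # estimated action values
    counts = np.zeros(n, dtype=int)
    total = 0.0
    for _ in range(steps):
        # epsilon-greedy: explore with probability epsilon, otherwise be greedy
        if rng.uniform() < epsilon:
            action = rng.randint(n)
        else:
            action = int(np.argmax(q))
        reward = arm_means[action] + rng.normal()
        counts[action] += 1
        # averaging when alpha == 0, a fixed learning rate otherwise
        step_size = 1.0 / counts[action] if alpha == 0 else alpha
        q[action] += step_size * (reward - q[action])
        total += reward
    return total / steps, q
```

With averaging, `q[a]` is exactly the sample mean of the rewards observed for arm `a`; with a fixed `alpha`, recent rewards are weighted exponentially more.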
Please note that the results are stochastic, so your results may differ slightly.
python3 bandits.py --alpha=0 --epsilon=0.1 --initial=0
1.39 0.08
python3 bandits.py --alpha=0 --epsilon=0 --initial=1
1.48 0.22
python3 bandits.py --alpha=0.15 --epsilon=0.1 --initial=0
1.37 0.09
python3 bandits.py --alpha=0.15 --epsilon=0 --initial=1
1.52 0.04
monte_carlo
Deadline: Oct 20, 23:59 6 points
Solve the CartPole-v1 environment from the OpenAI Gym using the Monte Carlo reinforcement learning algorithm. The gym environments have the following methods and properties:
 observation_space: the description of environment observations
 action_space: the description of environment actions
 reset() → new_state: starts a new episode
 step(action) → new_state, reward, done, info: perform the chosen action in the environment, returning the new state, obtained reward, a boolean flag indicating an end of episode, and additional environment-specific information
 render(): render current environment state
We additionally extend the gym environment by:
 episode: number of the current episode (zero-based)
 reset(start_evaluation=False) → new_state: if start_evaluation is True, an evaluation is started
Once you finish training (which you indicate by passing start_evaluation=True to reset), your goal is to reach an average return of 490 during 100 evaluation episodes. Note that the environment prints your 100-episode average return every 10 episodes even during training.
Start with the monte_carlo.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
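The overall training loop can be sketched as on-policy every-visit Monte Carlo control. The `ToyChain` environment below is an illustrative stand-in for the gym `reset`/`step` interface, not CartPole itself:

```python
import numpy as np

class ToyChain:
    """Illustrative 4-state chain: action 1 moves right, reaching state 3 ends
    the episode with reward 1; action 0 stays in place with reward 0."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        if action == 1:
            self.s += 1
        done = self.s == 3
        return self.s, (1.0 if done else 0.0), done, {}

def mc_control(env, n_states, n_actions, episodes, epsilon=0.3, gamma=0.9, seed=0):
    rng = np.random.RandomState(seed)
    Q = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        # generate an episode with the current epsilon-greedy policy
        state, trajectory = env.reset(), []
        for _ in range(100):  # step cap so truncated episodes also terminate
            if rng.uniform() < epsilon:
                action = rng.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
            if done:
                break
        # compute returns backwards and update the running averages
        G = 0.0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G
            counts[state, action] += 1
            Q[state, action] += (G - Q[state, action]) / counts[state, action]
    return Q

Q = mc_control(ToyChain(), n_states=4, n_actions=2, episodes=500)
```

The same structure transfers to CartPole once the continuous observations are discretized into table indices.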
policy_iteration
Deadline: Oct 27, 23:59 4 points
Consider the following gridworld:
Start with policy_iteration.py, which implements the gridworld mechanics, by providing the following methods:
 GridWorld.states: returns the number of states (11)
 GridWorld.actions: returns a list with labels of the actions (["↑", "→", "↓", "←"])
 GridWorld.step(state, action): returns possible outcomes of performing the action in a given state, as a list of triples containing
  probability: probability of the outcome
  reward: reward of the outcome
  new_state: new state of the outcome
Implement the policy iteration algorithm, with steps steps of policy evaluation/policy improvement. During policy evaluation, use the current value function and perform iterations applications of the Bellman equation. Perform the policy evaluation synchronously (i.e., do not overwrite the current value function when computing its improvement). Assume the initial policy is “go North” and the initial value function is zero.
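The evaluation/improvement cycle can be sketched as follows. The two-state MDP here is illustrative, but its (probability, reward, new_state) triples match the GridWorld.step interface described above:

```python
import numpy as np

# dynamics[state][action] = list of (probability, reward, new_state) triples
dynamics = {
    0: {0: [(1.0, 0.0, 0)], 1: [(0.9, 0.0, 1), (0.1, 0.0, 0)]},
    1: {0: [(1.0, 1.0, 1)], 1: [(1.0, 0.0, 0)]},
}

def bellman(value, state, action, gamma):
    # expected one-step return of taking `action` in `state`
    return sum(p * (r + gamma * value[ns]) for p, r, ns in dynamics[state][action])

def evaluate(policy, value, gamma, iterations):
    for _ in range(iterations):
        # synchronous: new values are computed entirely from the old ones
        value = np.array([bellman(value, s, policy[s], gamma) for s in dynamics])
    return value

def improve(value, gamma):
    return [int(np.argmax([bellman(value, s, a, gamma) for a in dynamics[s]]))
            for s in dynamics]

value, policy = np.zeros(2), [0, 0]  # zero value function, fixed initial policy
for _ in range(10):  # `steps` iterations of evaluation + improvement
    value = evaluate(policy, value, gamma=0.9, iterations=4)
    policy = improve(value, gamma=0.9)
```

Here the optimal behaviour is to move from state 0 to the rewarding self-loop in state 1, which the iteration discovers after the first improvement step.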
Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 10.00← 10.00↑
0.00↑ 0.00→ 0.10← 79.90←
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=2
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 7.59← 11.90←
0.00→ 0.08← 0.94← 18.36←
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=3
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↓ 5.86← 7.41←
0.06↓ 0.01← 0.75← 13.49↓
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=10
0.04↓ 0.04← 0.01↑ 0.00↑
0.04↓ 0.95← 1.00←
0.04↓ 0.04← 0.10→ 0.52↓
python3 policy_iteration.py --gamma=0.95 --iterations=10 --steps=10
11.79↓ 11.03← 10.31← 6.54↑
12.69↓ 10.14← 9.95←
13.56→ 14.59→ 15.58→ 16.26↓
python3 policy_iteration.py --gamma=1 --iterations=1 --steps=100
66.54↓ 65.53← 64.42← 56.34↑
67.68↓ 63.58← 62.97←
68.69→ 69.83→ 70.84→ 71.75↓
policy_iteration_exact
Deadline: Oct 27, 23:59 2 points
Starting with policy_iteration_exact.py,
extend the policy_iteration
assignment to perform policy evaluation
exactly by solving a system of linear equations.
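Under a fixed policy the Bellman equations are linear, $v = r_\pi + \gamma P_\pi v$, so exact evaluation reduces to solving $(I - \gamma P_\pi)\,v = r_\pi$. A sketch on an illustrative two-state MDP (not the course gridworld):

```python
import numpy as np

gamma = 0.9
# Transition matrix and expected rewards under the current policy
# (illustrative: state 0 moves to state 1 with probability 0.9,
# state 1 loops on itself with reward 1).
P_pi = np.array([[0.1, 0.9],
                 [0.0, 1.0]])
r_pi = np.array([0.0, 1.0])

# Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
```

Unlike the truncated iterative evaluation of the previous assignment, this solution satisfies the Bellman equation exactly.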
Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).
python3 policy_iteration_exact.py --gamma=0.95 --steps=1
0.00→ 0.00→ 0.00↑ 0.00↑
0.00↑ 12.35← 12.35↑
0.85← 8.10← 19.62← 100.71←
python3 policy_iteration_exact.py --gamma=0.95 --steps=2
0.00→ 0.00→ 0.00→ 0.00↑
0.00↑ 0.00← 11.05←
0.00↑ 0.00→ 0.00← 12.10↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=3
0.00↓ 0.00← 0.00↓ 0.00↑
0.00↑ 0.00← 0.69←
0.00← 0.00← 0.00→ 6.21↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=8
12.12↓ 11.37← 9.19← 6.02↑
13.01↓ 9.92← 9.79←
13.87→ 14.89→ 15.87→ 16.60↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=9
12.24↓ 11.49← 10.76← 7.05↑
13.14↓ 10.60← 10.42←
14.01→ 15.04→ 16.03→ 16.71↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=10
12.24↓ 11.49← 10.76← 7.05↑
13.14↓ 10.60← 10.42←
14.01→ 15.04→ 16.03→ 16.71↓
policy_iteration_exploring_mc
Deadline: Oct 27, 23:59 2 points
Starting with policy_iteration_exploring_mc.py,
extend the policy_iteration
assignment to perform policy evaluation
by using Monte Carlo estimation with exploring starts.
The estimation can now be performed model-free (without access to the full MDP dynamics); therefore, GridWorld.step returns a randomly sampled result instead of a full distribution.
Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).
python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00→ 0.00↑ 0.00↓
python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=10
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00↑ 19.50↑
0.27↓ 0.48← 2.21↓ 8.52↓
python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=50
0.09↓ 0.32↓ 0.22← 0.15↑
0.18↑ 2.43← 5.12↓
0.18↓ 1.80↓ 3.90↓ 9.14↓
python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=100
3.09↓ 2.42← 2.39← 1.17↑
3.74↓ 1.66← 0.18←
3.92→ 5.28→ 7.16→ 11.07↓
python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=200
7.71↓ 6.76← 6.66← 3.92↑
8.27↓ 6.17← 5.31←
8.88→ 10.12→ 11.36→ 13.92↓
policy_iteration_greedy_mc
Deadline: Oct 27, 23:59 2 points
Starting with policy_iteration_greedy_mc.py,
extend the policy_iteration_exploring_mc
assignment to perform policy
evaluation by using $ε$-greedy Monte Carlo estimation.
For the sake of replicability, use the provided GridWorld.epsilon_greedy(epsilon, greedy_action) method, which returns a random action with probability of epsilon and otherwise returns the given greedy_action.
Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).
python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00→ 0.00→
0.00↑ 0.00↑ 0.00→ 0.00→
python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=10
1.20↓ 1.43← 0.00← 6.00↑
0.78→ 20.26↓ 0.00←
0.09← 0.00↓ 9.80↓ 10.37↓
python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=50
0.16↓ 0.19← 0.56← 6.30↑
0.13→ 6.99↓ 3.51↓
0.01← 0.00← 3.18↓ 7.57↓
python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=100
0.07↓ 0.09← 0.28← 4.66↑
0.06→ 5.04↓ 8.32↓
0.00← 0.00← 1.70↓ 4.38↓
python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=200
0.04↓ 0.04← 0.76← 4.15↑
0.03→ 8.02↓ 5.96↓
0.00← 0.00← 2.53↓ 4.36↓
python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=500
0.02↓ 0.02← 0.65← 3.52↑
0.01→ 11.34↓ 8.07↓
0.00← 0.00← 3.15↓ 3.99↓
importance_sampling
Deadline: Nov 03, 23:59 2 points
Using the FrozenLake-v0 environment, implement Monte Carlo weighted importance sampling to estimate the state value function $V$ of a target policy, which uniformly chooses either action 1 (down) or action 2 (right), utilizing a behaviour policy, which uniformly chooses among all four actions.
Start with the importance_sampling.py template, which creates the environment and generates episodes according to the behaviour policy.
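The weighted importance sampling update can be sketched as below; the function and the episode format are illustrative, but the policy probabilities match the assignment's setting (target puts 1/2 on actions 1 and 2, behaviour is uniform over four actions):

```python
import numpy as np

def weighted_is(episodes, n_states):
    """Illustrative weighted IS; each episode is a list of (state, action, reward)."""
    V = np.zeros(n_states)
    C = np.zeros(n_states)  # cumulative importance weights per state
    for episode in episodes:
        G, W = 0.0, 1.0
        for state, action, reward in reversed(episode):
            G += reward  # undiscounted return from this state on
            # pi(a|s) / b(a|s): target puts 1/2 on actions 1 and 2, 0 elsewhere;
            # behaviour is uniform over 4 actions (1/4), so the ratio is 2 or 0
            W *= (0.5 / 0.25) if action in (1, 2) else 0.0
            if W == 0.0:
                break  # the rest of the episode cannot occur under the target policy
            C[state] += W
            V[state] += W / C[state] * (G - V[state])  # incremental weighted mean
    return V
```

Processing the episode backwards lets the importance ratio accumulate exactly over the suffix that determines the return from each visited state.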
Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).
python3 importance_sampling.py --episodes=500
0.00 0.00 0.00 0.00
0.03 0.00 0.00 0.00
0.22 0.14 0.29 0.00
0.00 0.50 1.00 0.00
python3 importance_sampling.py --episodes=5000
0.00 0.01 0.02 0.00
0.00 0.00 0.08 0.00
0.06 0.08 0.17 0.00
0.00 0.19 0.89 0.00
python3 importance_sampling.py --episodes=50000
0.02 0.01 0.04 0.01
0.03 0.00 0.06 0.00
0.08 0.17 0.24 0.00
0.00 0.34 0.78 0.00
q_learning
Deadline: Nov 03, 23:59 6 points
Solve the MountainCar-v0 environment from the OpenAI Gym using the Q-learning reinforcement learning algorithm. Note that this task does not require TensorFlow.
The environment methods and properties are described in the monte_carlo assignment.
Once you finish training (which you indicate by passing start_evaluation=True to reset), your goal is to reach an average return of -150 during 100 evaluation episodes.
You can start with the q_learning.py template, which parses several useful parameters, creates the environment and illustrates the overall usage. Note that setting hyperparameters of Q-learning is a bit tricky – I usually start with a larger value of $ε$ (like 0.2 or even 0.5) and then gradually decrease it to almost zero.
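The core loop, including the gradual $ε$ decay mentioned above, can be sketched as follows; the `ToyEnv` chain is an illustrative stand-in (MountainCar additionally needs its continuous observations discretized into table indices):

```python
import numpy as np

class ToyEnv:
    """Illustrative 4-state chain: action 1 moves right, action 0 moves left;
    reaching state 3 ends the episode with reward 1."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, action):
        self.s = min(3, max(0, self.s + (1 if action == 1 else -1)))
        done = self.s == 3
        return self.s, (1.0 if done else 0.0), done, {}

def q_learning(env, n_states, n_actions, episodes, alpha=0.2, gamma=0.99,
               epsilon_start=0.5, epsilon_final=0.01, seed=0):
    rng = np.random.RandomState(seed)
    Q = np.zeros((n_states, n_actions))
    for episode in range(episodes):
        # linearly decay epsilon from a large value towards almost zero
        epsilon = np.interp(episode, [0, episodes - 1], [epsilon_start, epsilon_final])
        state = env.reset()
        for _ in range(200):  # step cap
            action = rng.randint(n_actions) if rng.uniform() < epsilon \
                else int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # off-policy TD target: bootstrap with the best next action
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
            if done:
                break
    return Q

Q = q_learning(ToyEnv(), n_states=4, n_actions=2, episodes=300)
```

The `max` in the target is what makes the update off-policy: the behaviour is $ε$-greedy, but the learned values correspond to the greedy policy.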
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
lunar_lander
Deadline: Nov 03, 23:59 7 points + 7 bonus
Solve the LunarLander-v2 environment from the OpenAI Gym. Note that this task does not require TensorFlow.
The environment methods and properties are described in the monte_carlo assignment, but include one additional method:
 expert_trajectory() → initial_state, trajectory: generates one expert trajectory and returns a pair of initial_state and trajectory, where trajectory is a list of the triples (action, reward, next_state). You can use this method only during training, not during evaluation.
To pass the task, you need to reach an average return of 0 during 100 evaluation episodes. During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
The task is additionally a competition and at most 7 points will be awarded according to relative ordering of your solution performances.
You can start with the lunar_lander.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
td_algorithms
Deadline: Nov 17, 23:59 (extended from Nov 10, 23:59) 5 points
Starting with the td_algorithms.py template, implement all of the following $n$-step TD method variants:
 SARSA, expected SARSA and Tree backup;
 either on-policy (with $ε$-greedy behaviour policy) or off-policy (with the same behaviour policy, but a greedy target policy).
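All the variants build on the same ingredients: an $n$-step return bootstrapped with an action value at its end, and (for tree backup) an expectation over the actions not taken. A sketch of the two targets (the helper functions are illustrative, not the template's interface):

```python
def n_step_sarsa_target(rewards, bootstrap_q, gamma):
    """G_{t:t+n} = sum_k gamma^k R_{t+k+1} + gamma^n Q(S_{t+n}, A_{t+n});
    pass bootstrap_q=0 when the episode ends within the n steps."""
    n = len(rewards)
    return sum(gamma ** k * r for k, r in enumerate(rewards)) + gamma ** n * bootstrap_q

def tree_backup_target(reward, q_next, pi_next, taken_action, deeper_target, gamma):
    """One-step tree backup: an expectation over the next actions under the
    target policy, with the taken action's value replaced by the deeper target."""
    expectation = sum(pi_next[a] * (deeper_target if a == taken_action else q_next[a])
                      for a in range(len(q_next)))
    return reward + gamma * expectation
```

Applying `tree_backup_target` recursively from the end of the stored transitions yields the $n$-step tree-backup return; the SARSA variants instead weight the return by importance sampling ratios in the off-policy case.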
This assignment is new, so if you think you have implemented everything correctly and it does not pass, do not hesitate to write me.
Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).
python3 td_algorithms.py --mode=sarsa --n=1
Episode 100, mean 100-episode return -360.33 ± 184.47
Episode 200, mean 100-episode return -224.77 ± 98.00
Episode 300, mean 100-episode return -182.97 ± 100.50
Episode 400, mean 100-episode return -138.60 ± 96.61
Episode 500, mean 100-episode return -106.91 ± 82.49
Episode 600, mean 100-episode return -88.05 ± 79.26
Episode 700, mean 100-episode return -57.03 ± 53.84
Episode 800, mean 100-episode return -33.09 ± 54.11
Episode 900, mean 100-episode return -30.81 ± 45.75
Episode 1000, mean 100-episode return -21.21 ± 35.98
python3 td_algorithms.py --mode=sarsa --n=1 --off_policy
Episode 100, mean 100-episode return -368.15 ± 184.96
Episode 200, mean 100-episode return -207.59 ± 101.46
Episode 300, mean 100-episode return -170.73 ± 100.77
Episode 400, mean 100-episode return -143.05 ± 97.48
Episode 500, mean 100-episode return -93.66 ± 90.10
Episode 600, mean 100-episode return -66.25 ± 68.43
Episode 700, mean 100-episode return -38.15 ± 56.84
Episode 800, mean 100-episode return -25.82 ± 44.67
Episode 900, mean 100-episode return -18.04 ± 37.85
Episode 1000, mean 100-episode return -14.56 ± 34.09
python3 td_algorithms.py --mode=sarsa --n=4
Episode 100, mean 100-episode return -516.63 ± 256.06
Episode 200, mean 100-episode return -205.93 ± 160.51
Episode 300, mean 100-episode return -169.65 ± 165.20
Episode 400, mean 100-episode return -68.71 ± 131.53
Episode 500, mean 100-episode return -15.79 ± 45.34
Episode 600, mean 100-episode return -8.01 ± 38.65
Episode 700, mean 100-episode return -6.21 ± 30.64
Episode 800, mean 100-episode return -5.69 ± 16.12
Episode 900, mean 100-episode return -0.68 ± 8.99
Episode 1000, mean 100-episode return -1.56 ± 10.94
python3 td_algorithms.py --mode=sarsa --n=4 --off_policy
Episode 100, mean 100-episode return -524.26 ± 195.11
Episode 200, mean 100-episode return -345.41 ± 181.73
Episode 300, mean 100-episode return -286.07 ± 165.83
Episode 400, mean 100-episode return -249.51 ± 187.19
Episode 500, mean 100-episode return -112.83 ± 158.33
Episode 600, mean 100-episode return -80.56 ± 145.49
Episode 700, mean 100-episode return -20.16 ± 71.73
Episode 800, mean 100-episode return -17.42 ± 62.07
Episode 900, mean 100-episode return -5.14 ± 27.98
Episode 1000, mean 100-episode return -1.83 ± 12.61
python3 td_algorithms.py --mode=expected_sarsa --n=1
Episode 100, mean 100-episode return -361.33 ± 186.01
Episode 200, mean 100-episode return -214.54 ± 104.67
Episode 300, mean 100-episode return -179.69 ± 103.63
Episode 400, mean 100-episode return -147.74 ± 92.59
Episode 500, mean 100-episode return -109.10 ± 89.53
Episode 600, mean 100-episode return -79.89 ± 75.51
Episode 700, mean 100-episode return -59.05 ± 57.01
Episode 800, mean 100-episode return -40.03 ± 44.50
Episode 900, mean 100-episode return -25.21 ± 38.41
Episode 1000, mean 100-episode return -19.67 ± 34.80
python3 td_algorithms.py --mode=expected_sarsa --n=1 --off_policy
Episode 100, mean 100-episode return -358.93 ± 187.30
Episode 200, mean 100-episode return -221.93 ± 91.20
Episode 300, mean 100-episode return -176.05 ± 110.42
Episode 400, mean 100-episode return -124.69 ± 92.91
Episode 500, mean 100-episode return -98.44 ± 86.99
Episode 600, mean 100-episode return -64.75 ± 69.56
Episode 700, mean 100-episode return -51.46 ± 52.95
Episode 800, mean 100-episode return -28.69 ± 44.46
Episode 900, mean 100-episode return -17.27 ± 30.60
Episode 1000, mean 100-episode return -10.83 ± 25.23
python3 td_algorithms.py --mode=expected_sarsa --n=4
Episode 100, mean 100-episode return -555.15 ± 204.64
Episode 200, mean 100-episode return -261.06 ± 131.13
Episode 300, mean 100-episode return -144.66 ± 157.24
Episode 400, mean 100-episode return -88.66 ± 144.94
Episode 500, mean 100-episode return -25.55 ± 69.55
Episode 600, mean 100-episode return -6.82 ± 30.54
Episode 700, mean 100-episode return -2.32 ± 18.24
Episode 800, mean 100-episode return -0.09 ± 10.35
Episode 900, mean 100-episode return -0.06 ± 14.05
Episode 1000, mean 100-episode return -0.28 ± 11.60
python3 td_algorithms.py --mode=expected_sarsa --n=4 --off_policy
Episode 100, mean 100-episode return -526.36 ± 202.40
Episode 200, mean 100-episode return -306.17 ± 167.38
Episode 300, mean 100-episode return -258.25 ± 180.35
Episode 400, mean 100-episode return -146.21 ± 174.19
Episode 500, mean 100-episode return -120.67 ± 167.93
Episode 600, mean 100-episode return -85.25 ± 153.25
Episode 700, mean 100-episode return -23.43 ± 92.30
Episode 800, mean 100-episode return -21.92 ± 70.71
Episode 900, mean 100-episode return -4.94 ± 22.51
Episode 1000, mean 100-episode return -5.79 ± 26.25
python3 td_algorithms.py --mode=tree_backup --n=1
Episode 100, mean 100-episode return -361.33 ± 186.01
Episode 200, mean 100-episode return -214.54 ± 104.67
Episode 300, mean 100-episode return -179.69 ± 103.63
Episode 400, mean 100-episode return -147.74 ± 92.59
Episode 500, mean 100-episode return -109.10 ± 89.53
Episode 600, mean 100-episode return -79.89 ± 75.51
Episode 700, mean 100-episode return -59.05 ± 57.01
Episode 800, mean 100-episode return -40.03 ± 44.50
Episode 900, mean 100-episode return -25.21 ± 38.41
Episode 1000, mean 100-episode return -19.67 ± 34.80
python3 td_algorithms.py --mode=tree_backup --n=1 --off_policy
Episode 100, mean 100-episode return -358.93 ± 187.30
Episode 200, mean 100-episode return -221.93 ± 91.20
Episode 300, mean 100-episode return -176.05 ± 110.42
Episode 400, mean 100-episode return -124.69 ± 92.91
Episode 500, mean 100-episode return -98.44 ± 86.99
Episode 600, mean 100-episode return -64.75 ± 69.56
Episode 700, mean 100-episode return -51.46 ± 52.95
Episode 800, mean 100-episode return -28.69 ± 44.46
Episode 900, mean 100-episode return -17.27 ± 30.60
Episode 1000, mean 100-episode return -10.83 ± 25.23
python3 td_algorithms.py --mode=tree_backup --n=4
Episode 100, mean 100-episode return -522.36 ± 226.08
Episode 200, mean 100-episode return -264.75 ± 136.85
Episode 300, mean 100-episode return -163.50 ± 168.74
Episode 400, mean 100-episode return -54.18 ± 105.95
Episode 500, mean 100-episode return -27.66 ± 70.12
Episode 600, mean 100-episode return -9.05 ± 23.62
Episode 700, mean 100-episode return -4.76 ± 31.53
Episode 800, mean 100-episode return -2.57 ± 12.74
Episode 900, mean 100-episode return -0.58 ± 12.08
Episode 1000, mean 100-episode return -1.17 ± 9.07
python3 td_algorithms.py --mode=tree_backup --n=4 --off_policy
Episode 100, mean 100-episode return -519.80 ± 233.81
Episode 200, mean 100-episode return -302.58 ± 123.70
Episode 300, mean 100-episode return -203.98 ± 153.41
Episode 400, mean 100-episode return -95.12 ± 136.49
Episode 500, mean 100-episode return -25.28 ± 65.11
Episode 600, mean 100-episode return -4.79 ± 19.20
Episode 700, mean 100-episode return -8.53 ± 29.38
Episode 800, mean 100-episode return -5.13 ± 19.44
Episode 900, mean 100-episode return -1.98 ± 12.35
Episode 1000, mean 100-episode return -1.59 ± 11.99
q_learning_tiles
Deadline: Nov 17, 23:59 (extended from Nov 10, 23:59) 4 points
Improve the q_learning task performance on the MountainCar-v0 environment using linear function approximation with tile coding. Your goal is to reach an average return of -110 during 100 evaluation episodes.
The environment methods are described in the q_learning assignment, with the following changes:
 The state returned by the env.step method is a list containing weight indices of the current state (i.e., the feature vector of the state consists of zeros and ones, and only the indices of the ones are returned). The action-value function is therefore approximated as a sum of the weights whose indices are returned by env.step.
 The env.observation_space.nvec returns a list, where the $i$-th element is the number of weights used by the first $i$ elements of state. Notably, env.observation_space.nvec[-1] is the total number of weights.
You can start with the q_learning_tiles.py template, which parses several useful parameters and creates the environment. Implementing Q-learning is enough to pass the assignment, even if both N-step Sarsa and Tree Backup converge a little faster. The default number of tiles in tile coding (i.e., the size of the list with weight indices) is args.tiles=8, but you can use any number you want (the assignment is solvable with 8).
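Because the feature vector is binary and sparse, both the value computation and the semi-gradient update only touch the active weights. A sketch with illustrative sizes (`W` is indexed by weight index and action; the function names are not the template's):

```python
import numpy as np

def q_value(W, state, action):
    # action-value = sum of the weights at the active tile indices
    return W[state, action].sum()

def td_update(W, state, action, target, alpha):
    # dividing alpha by the number of tiles makes the overall value step alpha
    W[state, action] += alpha / len(state) * (target - q_value(W, state, action))

W = np.zeros((100, 3))   # 100 weights, 3 actions (illustrative sizes)
state = [2, 17, 40]      # active weight indices, as returned by env.step
td_update(W, state, action=1, target=1.0, alpha=0.3)
```

After this single update, the approximated value of the visited state-action pair has moved by exactly `alpha` times the TD error, while all other actions are untouched.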
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
q_network
Deadline: Nov 17, 23:59 (extended from Nov 10, 23:59) 6 points
Solve the CartPole-v1 environment from the OpenAI Gym using Q-learning with a neural network as function approximation.
You can start with the q_network.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch instead, if you like.
The continuous environment is very similar to the discrete one, except that the states are vectors of real-valued observations with shape env.observation_space.shape.
Use Q-learning with a neural network as function approximation, which for a given state returns state-action values for all actions. You can use any network architecture, but one hidden layer of several dozen ReLU units is a good start. Your goal is to reach an average return of 400 during 100 evaluation episodes.
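The same semi-gradient Q-learning update works with a network in place of the table; the NumPy network below is an illustrative sketch of what the TensorFlow template computes (the TD target is treated as a constant, so only the prediction is differentiated):

```python
import numpy as np

class QNetwork:
    """Illustrative one-hidden-layer ReLU Q-network trained with plain SGD."""
    def __init__(self, n_inputs, n_hidden, n_actions, lr=0.01, seed=0):
        rng = np.random.RandomState(seed)
        self.W1 = rng.normal(scale=0.5, size=(n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.5, size=(n_hidden, n_actions))
        self.b2 = np.zeros(n_actions)
        self.lr = lr

    def predict(self, x):
        self.h = np.maximum(0.0, x @ self.W1 + self.b1)  # ReLU hidden layer
        return self.h @ self.W2 + self.b2                # one value per action

    def update(self, x, action, target):
        q = self.predict(x)
        delta = q[action] - target  # semi-gradient: target is a constant
        grad_q = np.zeros_like(q)
        grad_q[action] = delta
        # backpropagate through the hidden layer before touching W2
        grad_h = (self.W2 @ grad_q) * (self.h > 0)
        self.W2 -= self.lr * np.outer(self.h, grad_q)
        self.b2 -= self.lr * grad_q
        self.W1 -= self.lr * np.outer(x, grad_h)
        self.b1 -= self.lr * grad_h
```

In a training loop, the target for a non-terminal transition would be `reward + gamma * net.predict(next_state).max()`.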
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes (so you can train in ReCodEx, but you can also pretrain your network if you like).
car_racing
Deadline: Nov 17, 23:59 8 points + 10 bonus
The goal of this competition is to use Deep Q Networks (and any of the Rainbow improvements) on a more real-world CarRacing-v0 environment from the OpenAI Gym.
The supplied car_racing_environment.py provides the environment. It is continuous and states are RGB images of size $96×96×3$, but you can downsample them even more. The actions are also continuous and consist of an array with the following three elements:
 steer in range [-1, 1]
 gas in range [0, 1]
 brake in range [0, 1]; note that full brake is quite aggressive, so you might consider using less force when braking
Internally you should probably generate discrete actions and convert them to the required representation before the step call. The smallest set is probably left, right, gas, brake and noop, but you can use a more fine-grained one if you like.
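Such a discrete-to-continuous conversion might look as follows (the particular five-action set and the magnitudes, e.g. the reduced brake force, are illustrative choices, not prescribed by the assignment):

```python
import numpy as np

# Hypothetical mapping from five discrete actions to the continuous
# [steer, gas, brake] array expected by the environment's step call.
DISCRETE_ACTIONS = np.array([
    [-1.0, 0.0, 0.0],  # left
    [+1.0, 0.0, 0.0],  # right
    [ 0.0, 1.0, 0.0],  # gas
    [ 0.0, 0.0, 0.2],  # brake, with less than full force
    [ 0.0, 0.0, 0.0],  # noop
])

def to_continuous(action_index):
    """Convert a discrete action index to the [steer, gas, brake] array."""
    return DISCRETE_ACTIONS[action_index]

# The agent then chooses an index and calls env.step(to_continuous(index)).
```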
The environment also supports frame skipping, which improves its performance (only some frames need to be rendered).
In ReCodEx, you are expected to submit an already trained model, which is evaluated on 15 different tracks with a total time limit of 15 minutes. If your average return is at least 300, you obtain 8 points. The task is also a competition and at most 10 points will be awarded according to relative ordering of your solution performances.
The car_racing.py template parses several useful parameters and creates the environment. Note that the car_racing_environment.py can be executed directly and in that case you can drive the car using arrows.
Also, you might want to use a vectorized version of the environment for training, which runs several individual environments in separate processes. The template contains instructions on how to create it. The vectorized environment expects a vector of actions and returns vectors of observations, rewards, dones and infos. When one of the environments is done, it is immediately reset and the returned state is the initial state of a new episode.
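The contract described above can be illustrated with a toy stand-in (ToyEnv and VectorEnv below are invented for illustration only; the template explains how to construct the real vectorized environment):

```python
class ToyEnv:
    """A trivial environment whose episodes last exactly three steps."""
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        done = self.t >= 3
        # When done, the environment is immediately reset and the
        # returned state is the initial state of a new episode.
        state = self.reset() if done else self.t
        return state, 1.0, done, {}

class VectorEnv:
    """Expects a vector of actions; returns vectors of states, rewards, dones, infos."""
    def __init__(self, envs):
        self.envs = envs
    def reset(self):
        return [env.reset() for env in self.envs]
    def step(self, actions):
        states, rewards, dones, infos = zip(*(
            env.step(a) for env, a in zip(self.envs, actions)))
        return list(states), list(rewards), list(dones), list(infos)

venv = VectorEnv([ToyEnv() for _ in range(4)])
states = venv.reset()
states, rewards, dones, infos = venv.step([0, 0, 0, 0])
```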
reinforce
Deadline: Nov 24, 23:59 5 points
Solve the CartPole-v1 environment from the OpenAI Gym using the REINFORCE algorithm.
Your goal is to reach an average return of 490 during 100 evaluation episodes.
Start with the reinforce.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch instead, if you like.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
reinforce_baseline
Deadline: Nov 24, 23:59 4 points
This is a continuation of the reinforce assignment.
Using the reinforce_baseline.py template, solve the CartPole-v1 environment using the REINFORCE with baseline algorithm.
Using a baseline lowers the variance of the value function gradient estimator, which allows faster training and decreases sensitivity to hyperparameter values. To reflect this effect in ReCodEx, note that the evaluation phase will automatically start after 200 episodes. Using only 200 episodes for training in this setting is probably too little for the REINFORCE algorithm, but suffices for the variant with a baseline.
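The variance reduction can be sketched on a toy episode (a constant baseline equal to the mean return is used here for simplicity; the assignment learns a state-dependent value baseline):

```python
import numpy as np

def discounted_returns(rewards, gamma=1.0):
    """Compute the return G_t for every time step of an episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return np.array(returns[::-1])

rewards = [1.0, 1.0, 1.0, 1.0]         # one CartPole-like episode
returns = discounted_returns(rewards)  # [4, 3, 2, 1]
baseline = returns.mean()              # 2.5
advantages = returns - baseline        # centered around zero
# The policy gradient is then weighted by `advantages` instead of the raw
# `returns`, which lowers its variance without changing its expectation.
```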
Your goal is to reach an average return of 490 during 100 evaluation episodes.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
cart_pole_pixels
Deadline: Nov 24, 23:59 5 points + 6 bonus
The supplied cart_pole_pixels_environment.py generates a pixel representation of the CartPole environment as an $80×80$ image with three channels, each channel representing one time step (i.e., the current observation and the two previous ones).
During evaluation in ReCodEx, three different random seeds will be employed, each with a time limit of 10 minutes, and if you reach an average return of at least 300 on all of them, you obtain 5 points. The task is also a competition and at most 6 points will be awarded according to relative ordering of your solution performances.
The cart_pole_pixels.py template parses several parameters and creates the environment. You are again supposed to train the model beforehand and submit only the trained neural network.
paac
Deadline: Dec 01, 23:59 5 points
Solve the CartPole-v1 environment using the parallel actor-critic algorithm, employing the vectorized environment described in the car_racing assignment.
Your goal is to reach an average return of 450 during 100 evaluation episodes.
Start with the paac.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch instead, if you like.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 10 minutes.
paac_continuous
Deadline: Dec 01, 23:59 6 points
Solve the MountainCarContinuous-v0 environment using the parallel actor-critic algorithm with continuous actions.
When actions are continuous, env.action_space is the same Box space as env.observation_space, offering:
- env.action_space.shape, which specifies the shape of the actions (you can assume actions are always a 1D vector),
- env.action_space.low and env.action_space.high, which specify the ranges of the corresponding actions.
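With a normal policy, actions sampled outside the range are typically clipped into it; a minimal sketch (the one-dimensional bounds mimic MountainCarContinuous-v0 and are assumptions here, not read from a real env.action_space):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Box bounds: a single action in [-1, 1].
action_low, action_high = np.array([-1.0]), np.array([1.0])

def sample_action(mu, sigma):
    """Sample from N(mu, sigma^2) and clip into [low, high]."""
    return np.clip(rng.normal(mu, sigma), action_low, action_high)

action = sample_action(np.array([0.3]), np.array([2.0]))
assert action_low[0] <= action[0] <= action_high[0]
```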
Your goal is to reach an average return of 90 during 100 evaluation episodes.
Start with the paac_continuous.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch instead, if you like.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 10 minutes.
ddpg
Deadline: Dec 01, 23:59 7 points
Solve the Pendulum-v0 environment using the deep deterministic policy gradient algorithm. The environment is continuous; states and actions are described at the OpenAI Gym Wiki.
Your goal is to reach an average return of 200 during 100 evaluation episodes.
Start with the ddpg.py template, which provides a simple network implementation in TensorFlow. Feel free to use PyTorch instead, if you like.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 10 minutes.
walker
Deadline: Dec 08, 23:59 8 points + 10 bonus
In this exercise exploring continuous robot control, try solving the BipedalWalker-v3 environment from the OpenAI Gym. The environment is continuous; states and actions are described at the OpenAI Gym Wiki.
In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 100, you obtain 8 points. The task is also a competition and at most 10 points will be awarded according to relative ordering of your solution performances.
You can start with the walker.py template, but should probably reuse a lot of code from ddpg.py.
walker_hardcore
Deadline: Dec 15, 23:59 10 bonus
As an extension of the walker assignment, try solving the BipedalWalkerHardcore-v3 environment from the OpenAI Gym.
The task is a competition only and at most 10 points will be awarded according to relative ordering of your solution performances. In ReCodEx, your solution will be evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 0, ReCodEx shows the solution as correct.
You can start with the walker_hardcore.py template, but should probably reuse a lot of code from ddpg.py.
sac_bonus
Deadline: Feb 28, 23:59 8 bonus
In this bonus-only exercise, try using the SAC algorithm to solve the Hopper environment, using HopperBulletEnv-v0 from the open-source PyBullet framework.
Basic information about the Gym interface is in the PyBullet Quickstart Guide; generally you just need to import pybullet_envs, which will register the Gym environments.
You can install PyBullet with pip3 install pybullet. However, precompiled binaries are available only for Linux; on Windows the library must be compiled with a suitable Microsoft Visual C++ compiler.
In ReCodEx, you are expected to submit an already trained model, which is evaluated on the HopperBulletEnv-v0 environment with two seeds, each for 100 episodes with a time limit of 15 minutes. If your average return is at least 1000, you obtain 6 bonus points.
A template for the SAC algorithm will be made available.
trace_algorithms
Deadline: Feb 28, 23:59 4 points
Starting with the trace_algorithms.py template, implement the following state value estimations:
- use $n$-step estimates for a given $n$;
- if requested, use eligibility traces with a given $λ$;
- allow off-policy correction using importance sampling with control variates, optionally clipping the individual importance sampling ratios by a given threshold.
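One way to combine these ingredients is the recursive per-decision formulation with control variates, $G_t = ρ_t (r_{t+1} + γ G_{t+1}) + (1 - ρ_t) V(s_t)$, with $ρ_t$ optionally clipped. A sketch (the function name and interface are illustrative; the template defines the real ones):

```python
def corrected_return(rewards, values, next_value, rhos, gamma=0.99, clip=None):
    """Off-policy corrected n-step return with control variates.

    rewards, values, rhos cover n steps; next_value bootstraps V(s_{t+n});
    rhos are importance sampling ratios pi(a|s) / b(a|s).
    """
    g = next_value
    for r, v, rho in zip(reversed(rewards), reversed(values), reversed(rhos)):
        if clip is not None:
            rho = min(rho, clip)  # clip the individual ratio
        g = rho * (r + gamma * g) + (1.0 - rho) * v  # control variate
    return g

# On-policy (all rhos equal to 1) this reduces to the plain n-step return:
print(corrected_return([1.0, 1.0], [0.0, 0.0], 0.5, [1.0, 1.0], gamma=1.0))  # 2.5
```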
Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).
python3 trace_algorithms.py --n=1
Episode 100, mean 100episode return 96.61 +96.27
Episode 200, mean 100episode return 30.95 +57.27
Episode 300, mean 100episode return 27.00 +47.28
Episode 400, mean 100episode return 11.97 +32.78
Episode 500, mean 100episode return 10.74 +34.76
Episode 600, mean 100episode return 6.70 +32.46
Episode 700, mean 100episode return 3.87 +18.23
Episode 800, mean 100episode return 0.72 +11.11
Episode 900, mean 100episode return 1.34 +24.02
Episode 1000, mean 100episode return 2.68 +9.17
The mean 1000episode return after evaluation 38.58 +87.25
python3 trace_algorithms.py --n=4
Episode 100, mean 100episode return 81.76 +98.03
Episode 200, mean 100episode return 3.72 +19.53
Episode 300, mean 100episode return 0.59 +10.51
Episode 400, mean 100episode return 0.99 +8.76
Episode 500, mean 100episode return 0.35 +9.10
Episode 600, mean 100episode return 1.39 +8.22
Episode 700, mean 100episode return 2.42 +7.80
Episode 800, mean 100episode return 2.38 +8.33
Episode 900, mean 100episode return 2.79 +7.16
Episode 1000, mean 100episode return 0.42 +8.51
The mean 1000episode return after evaluation 9.03 +57.13
python3 trace_algorithms.py --n=8
Episode 100, mean 100episode return 107.63 +113.99
Episode 200, mean 100episode return 3.57 +16.96
Episode 300, mean 100episode return 0.17 +10.35
Episode 400, mean 100episode return 0.20 +8.34
Episode 500, mean 100episode return 0.27 +12.30
Episode 600, mean 100episode return 1.45 +8.57
Episode 700, mean 100episode return 2.39 +8.68
Episode 800, mean 100episode return 1.92 +8.32
Episode 900, mean 100episode return 2.12 +15.31
Episode 1000, mean 100episode return 5.06 +28.00
The mean 1000episode return after evaluation 69.59 +100.82
python3 trace_algorithms.py --n=4 --trace_lambda=0.6
Episode 100, mean 100episode return 87.36 +95.03
Episode 200, mean 100episode return 10.61 +28.93
Episode 300, mean 100episode return 3.48 +15.93
Episode 400, mean 100episode return 2.11 +12.50
Episode 500, mean 100episode return 1.09 +8.20
Episode 600, mean 100episode return 1.40 +8.85
Episode 700, mean 100episode return 3.78 +7.59
Episode 800, mean 100episode return 1.77 +8.44
Episode 900, mean 100episode return 0.53 +9.03
Episode 1000, mean 100episode return 1.73 +7.72
The mean 1000episode return after evaluation 7.63 +2.44
python3 trace_algorithms.py --n=8 --trace_lambda=0.6
Episode 100, mean 100episode return 110.14 +107.12
Episode 200, mean 100episode return 18.70 +45.52
Episode 300, mean 100episode return 4.57 +23.40
Episode 400, mean 100episode return 1.17 +8.73
Episode 500, mean 100episode return 1.57 +8.19
Episode 600, mean 100episode return 2.46 +8.84
Episode 700, mean 100episode return 1.47 +8.32
Episode 800, mean 100episode return 0.11 +9.15
Episode 900, mean 100episode return 1.59 +8.02
Episode 1000, mean 100episode return 0.85 +9.86
The mean 1000episode return after evaluation 6.63 +16.25
python3 trace_algorithms.py --n=1 --off_policy
Episode 100, mean 100episode return 74.33 +71.96
Episode 200, mean 100episode return 24.48 +32.66
Episode 300, mean 100episode return 19.26 +26.23
Episode 400, mean 100episode return 10.81 +22.29
Episode 500, mean 100episode return 10.40 +19.60
Episode 600, mean 100episode return 2.12 +14.89
Episode 700, mean 100episode return 3.98 +17.19
Episode 800, mean 100episode return 0.89 +11.64
Episode 900, mean 100episode return 0.04 +9.86
Episode 1000, mean 100episode return 1.02 +7.64
The mean 1000episode return after evaluation 22.17 +73.86
python3 trace_algorithms.py --n=4 --off_policy
Episode 100, mean 100episode return 88.75 +101.17
Episode 200, mean 100episode return 28.60 +73.61
Episode 300, mean 100episode return 7.32 +37.73
Episode 400, mean 100episode return 1.85 +8.22
Episode 500, mean 100episode return 1.18 +9.91
Episode 600, mean 100episode return 3.61 +7.13
Episode 700, mean 100episode return 2.85 +7.65
Episode 800, mean 100episode return 1.69 +7.72
Episode 900, mean 100episode return 1.58 +8.04
Episode 1000, mean 100episode return 1.17 +9.15
The mean 1000episode return after evaluation 7.69 +2.61
python3 trace_algorithms.py --n=8 --off_policy
Episode 100, mean 100episode return 122.49 +113.92
Episode 200, mean 100episode return 35.93 +85.81
Episode 300, mean 100episode return 13.67 +58.23
Episode 400, mean 100episode return 3.70 +7.68
Episode 500, mean 100episode return 1.98 +8.26
Episode 600, mean 100episode return 0.45 +8.44
Episode 700, mean 100episode return 2.41 +8.29
Episode 800, mean 100episode return 1.90 +6.92
Episode 900, mean 100episode return 3.42 +7.81
Episode 1000, mean 100episode return 2.59 +7.62
The mean 1000episode return after evaluation 7.88 +2.69
python3 trace_algorithms.py --n=1 --off_policy --vtrace_clip=1
Episode 100, mean 100episode return 68.42 +71.79
Episode 200, mean 100episode return 29.45 +42.22
Episode 300, mean 100episode return 18.72 +26.28
Episode 400, mean 100episode return 13.66 +27.16
Episode 500, mean 100episode return 6.64 +19.08
Episode 600, mean 100episode return 3.34 +14.91
Episode 700, mean 100episode return 5.72 +16.65
Episode 800, mean 100episode return 0.49 +11.18
Episode 900, mean 100episode return 0.67 +10.13
Episode 1000, mean 100episode return 0.11 +11.14
The mean 1000episode return after evaluation 18.13 +69.41
python3 trace_algorithms.py --n=4 --off_policy --vtrace_clip=1
Episode 100, mean 100episode return 67.42 +80.24
Episode 200, mean 100episode return 5.69 +18.39
Episode 300, mean 100episode return 0.28 +11.68
Episode 400, mean 100episode return 2.29 +8.32
Episode 500, mean 100episode return 1.61 +8.47
Episode 600, mean 100episode return 1.73 +7.92
Episode 700, mean 100episode return 1.89 +8.27
Episode 800, mean 100episode return 3.10 +7.44
Episode 900, mean 100episode return 2.72 +8.17
Episode 1000, mean 100episode return 2.86 +7.23
The mean 1000episode return after evaluation 7.81 +2.56
python3 trace_algorithms.py --n=8 --off_policy --vtrace_clip=1
Episode 100, mean 100episode return 82.55 +99.27
Episode 200, mean 100episode return 0.48 +13.67
Episode 300, mean 100episode return 0.41 +9.27
Episode 400, mean 100episode return 1.35 +8.62
Episode 500, mean 100episode return 1.43 +7.77
Episode 600, mean 100episode return 1.62 +7.91
Episode 700, mean 100episode return 0.85 +7.96
Episode 800, mean 100episode return 1.99 +7.76
Episode 900, mean 100episode return 2.13 +7.47
Episode 1000, mean 100episode return 1.71 +7.98
The mean 1000episode return after evaluation 7.59 +2.75
az_quiz
Deadline: Jan 05, 23:59 (competition); Feb 28, 23:59 (non-bonus points) 10 points + 10 bonus
In this competition assignment, use Monte Carlo Tree Search to learn an agent for a simplified version of AZ-kvíz. In our version, the agent does not have to answer questions, and we assume that all answers are correct.
The game itself is implemented in the az_quiz.py module, using the randomized=False constructor argument.
The evaluation in ReCodEx should be implemented by returning an object implementing a play method, which given an AZ-kvíz instance returns the chosen move. An illustration of the interface is in the az_quiz_player_random.py module, which implements a random agent.
Your solution in ReCodEx is automatically evaluated against a very simple heuristic az_quiz_player_simple_heuristic.py, playing 56 games as the starting player and 56 games as the non-starting player. The time limit for the games is 15 minutes and you should see the win rate directly in ReCodEx. If you achieve at least 80%, you will pass the assignment.
The final competition evaluation will be performed after the deadline by a roundrobin tournament.
Note that az_quiz_evaluator.py can be used to evaluate any two given implementations and there are two interactive players available, az_quiz_player_interactive_mouse.py and az_quiz_player_interactive_keyboard.py.
For inspiration, use the official pseudocode for AlphaZero. However, note that it contains several errors:
- Below line 215, the following line should be inserted: root.visit_count = 1. Otherwise the visit_count is 0, ucb_score will return zeros for all actions, and during the first simulation the last valid action in the root will always be chosen.
- On line 237, the next action should be sampled according to a distribution of normalized visit counts, not according to a softmax of visit counts.
- On line 258, the value of a child should be inverted if the player to play in the current node is different from the one in the child (which is almost always true). Assuming the values are in the $[-1, 1]$ range, the fixed line should be value_score = -child.value().
- On line 279, a value is inverted using 1 - value; however, for values in $[-1, 1]$, it should be inverted as -value.
- Below line 287, the sampled gamma random variables should be normalized to produce a Dirichlet random sample: noise /= np.sum(noise)
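The last fix can be checked directly: a Dirichlet($α$) sample is obtained by normalizing independent Gamma($α$) draws, which is exactly the step the pseudocode omits (the values of alpha and n_actions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, n_actions = 0.3, 5                 # illustrative values
noise = rng.gamma(alpha, 1.0, n_actions)  # independent Gamma(alpha) draws
noise /= np.sum(noise)                    # the missing normalization step
assert abs(noise.sum() - 1.0) < 1e-9      # now a valid Dirichlet sample
```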
az_quiz_randomized
Deadline: Jan 05, 23:59 10 bonus
Extend the az_quiz assignment to handle the possibility of wrong answers. Therefore, when choosing a field, the agent might answer incorrectly.
To instantiate this randomized game variant, pass randomized=True to the AZQuiz class of az_quiz.py.
The Monte Carlo Tree Search has to be slightly modified to handle a stochastic MDP. The information about the distribution of possible next states is provided by the AZQuiz.all_moves method, which returns a list of (probability, az_quiz_instance) next states (in our environment, there are always two possible next states).
paac_lstm
Deadline: Feb 28, 23:59
memory_game
Deadline: Feb 28, 23:59 3 points
In this exercise we explore a partially observable environment. Consider a one-player variant of a memory game (pexeso), where a player repeatedly flips cards. If the player flips two cards with the same symbol in succession, the cards are removed and the player receives a reward of +2. Otherwise the player receives a reward of -1. An episode ends when all cards are removed. Note that it is valid to try to flip an already removed card.
Let there be $N$ cards in the environment, $N$ being even. There are $N+1$
actions – the first $N$ flip the corresponding card, and the last action
flips the unused card with the lowest index (or the card $N$ if all have
been used already). The observations consist of a pair of discrete values
(card, symbol), where the card is the index of the card flipped, and
the symbol is the symbol on the flipped card. The env.states returns a pair $(N, N/2)$, representing that there are $N$ card indices and $N/2$ symbol indices.
Every episode can be ended by at most $3N/2$ actions, and the required return is therefore greater than or equal to zero. Note that there is a limit of at most $2N$ actions per episode. The described environment is provided by the memory_game_environment.py module.
Your goal is to solve the environment using supervised learning via the provided expert episodes and networks with external memory. The environment implements an env.expert_episode() method, which returns a fresh correct episode as a list of (state, action) pairs (with the last action being None).
ReCodEx evaluates your solution on environments with 8, 12 and 16 cards (utilizing the cards argument). For each card number, 100 episodes are simulated once you pass evaluating=True to env.reset, and your solution gets 1 point if the average return is non-negative. You can train the agent directly in ReCodEx (the time limit is 15 minutes), or submit a pretrained one.
A template memory_game.py is available, commenting on a possible use of memory-augmented networks.
memory_game_rl
Deadline: Feb 28, 23:59 5 points
This is a continuation of the memory_game assignment.
In this task, your goal is to solve the memory game environment using reinforcement learning. That is, you must not use the env.expert_episode method during training. You can start with the memory_game_rl.py template, which extends the memory_game template by generating training episodes suitable for some reinforcement learning algorithm.
ReCodEx evaluates your solution on environments with 4, 6 and 8 cards (utilizing the cards argument). For each card number, your solution gets 2 points (1 point for 4 cards) if the average return is non-negative. You can train the agent directly in ReCodEx (the time limit is 15 minutes), or submit a pretrained one.
Install

Installing to central user packages repository
You can install all required packages to the central user packages repository using pip3 install --user tensorflow==2.3.1 tensorflow_probability==0.11.1 numpy==1.18.5 gym==0.17.2 box2d==2.3.10.

Installing to a virtual environment
Python supports virtual environments, which are directories containing independent sets of installed packages. You can create a virtual environment by running python3 -m venv VENV_DIR, followed by VENV_DIR/bin/pip3 install tensorflow==2.3.1 tensorflow_probability==0.11.1 numpy==1.18.5 gym==0.17.2 box2d==2.3.10.
ReCodEx

What files can be submitted to ReCodEx?
You can submit multiple files of any type to ReCodEx. There is a limit of 20 files per submission, with a total size of 20MB.

What file does ReCodEx execute and what arguments does it use?
Exactly one file with the .py suffix must contain a line starting with def main(. Such a file is imported by ReCodEx and the main method is executed (during the import, __name__ == "__recodex__").
The file must also export an argument parser called parser. ReCodEx uses its arguments and default values, but overwrites some of the arguments depending on the test being executed – the template should always indicate which arguments are set by ReCodEx and which are left intact.
What are the time and memory limits?
The memory limit during evaluation is 1.5GB. The time limit varies, but should be at least 10 seconds and at least twice the running time of my solution.
Competitions

Which algorithms are allowed during competitions?
Unless stated otherwise, you can use any algorithm to solve the competition task at hand. However, the implementation should either be created by you, or you should understand it fully (i.e., if you use for example the PPO algorithm, you should know how it works and how it is implemented).
Requirements
To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that up to 40 points above 80 (including the bonus points) will be transferred to the exam. In total, assignments worth at least 120 points (not including the bonus points) will be available.
To pass the exam, you need to obtain at least 60, 75, or 90 points out of a 100-point exam to obtain grade 3, 2, or 1, respectively. (PhD students with binary grades require 75 points.) The exam consists of questions worth 100 points in total, chosen from the list below (the questions are randomly generated, but in such a way that there is at least one question from every lecture 1 to 11). In addition, you can get at most 40 surplus points from the practicals and at most 10 points for community work (i.e., fixing slides or reporting issues) – but only the points you already have at the time of the exam count.
Exam Questions
Lecture 1 Questions

Derive how to incrementally update a running average (how to compute an average of $N$ numbers using the average of the first $N-1$ numbers). [5]

Describe multi-armed bandits and write down the $\epsilon$-greedy algorithm for solving them. [5]

Define a Markov Decision Process, including the definition of a return. [5]

Describe how a partially observable Markov decision process extends the Markov decision process and how the agent is altered. [5]
Lecture 2 Questions

Define a value function, such that all expectations are over simple random variables (actions, states, rewards), not trajectories. [5]

Define an action-value function, such that all expectations are over simple random variables (actions, states, rewards), not trajectories. [5]

Express a value function using an action-value function, and express an action-value function using a value function. [5]

Define the optimal value function and the optimal action-value function. Then define an optimal policy in such a way that its existence is guaranteed. [5]

Write down the Bellman optimality equation. [5]

Define the Bellman backup operator. [5]

Write down the value iteration algorithm. [5]

Define the supremum norm $\|\cdot\|_\infty$ and prove that the Bellman backup operator is a contraction with respect to this norm. [10]

Formulate and prove the policy improvement theorem. [10]

Write down the policy iteration algorithm. [10]

Write down the tabular Monte-Carlo on-policy every-visit $\epsilon$-soft algorithm. [10]
Lecture 3 Questions

Write down the Sarsa algorithm. [10]

Write down the Q-learning algorithm. [10]

Write down the Double Q-learning algorithm. [10]

Elaborate on how importance sampling can estimate expectations with respect to $\pi$ based on samples from $b$. [5]

Show how to estimate returns in the off-policy case, both with (1) ordinary importance sampling and (2) weighted importance sampling. [10]

Write down the Expected Sarsa algorithm and show how to obtain Q-learning from it. [10]

Show the bootstrapped estimate of the $n$-step return. [5]

Write down the update in on-policy $n$-step Sarsa (assuming you already have $n$ previous steps, actions and rewards). [5]

Write down the update in off-policy $n$-step Sarsa with importance sampling (assuming you already have $n$ previous steps, actions and rewards). [10]

Write down the update of the $n$-step Tree-backup algorithm (assuming you already have $n$ previous steps, actions and rewards). [10]
Lecture 4 Questions

Assuming function approximation, define the mean squared value error. [5]

Write down the gradient Monte-Carlo on-policy every-visit $\epsilon$-soft algorithm. [10]

Write down the semi-gradient $\epsilon$-greedy Sarsa algorithm. [10]

Prove that the semi-gradient TD update is not an SGD update of any loss. [10]

What are the three elements causing off-policy divergence with function approximation? Write down Baird's counterexample. [10]

Explain the role of a replay buffer in Deep Q Networks. [5]

How is the target network used and updated in Deep Q Networks? [5]

Explain how reward clipping is used in Deep Q Networks. What other clipping is used? [5]

Formulate the loss used in Deep Q Networks. [5]

Write down the Deep Q Networks training algorithm. [10]
Lecture 5 Questions

Explain the difference between DQN and Double DQN. [5]

Describe prioritized replay (how transitions are sampled from the replay buffer, how up-to-date the priorities [according to which we sample] are, how unseen transitions are boosted, how importance sampling is used to account for the change in the sampling distribution). [10]

How is the action-value function computed in dueling networks? [5]

Describe a fully connected layer in Noisy nets (parametrization, computation, effective noise generation). [5]

In Distributional RL, describe how the distribution of rewards is represented and how it is predicted using a neural network. [5]

Write down the Bellman update in Distributional RL (the mapping of atoms does not need to be mathematically flawless, it is enough to describe how it should be done). [5]

Formulate the loss used in Distributional RL. [5]
Lecture 6 Questions

Formulate the policy gradient theorem. [5]

Prove the part of the policy gradient theorem showing the value of $\nabla_{\boldsymbol\theta} v_\pi(s)$. [10]

Assuming the policy gradient theorem, formulate the loss used by the REINFORCE algorithm and show how its gradient can be expressed as an expectation over states and actions. [5]

Write down the REINFORCE algorithm. [10]

Show that introducing a baseline does not influence the validity of the policy gradient theorem. [5]

Write down the REINFORCE with baseline algorithm. [10]

Write down the one-step Actor-critic algorithm. [10]

How and why is entropy regularization used in policy gradient algorithms? [5]

The Asynchronous advantage actor-critic (A3C) policy may utilize recurrent neural networks. How is the training structured to allow backpropagation through them (would vanilla DQN, vanilla REINFORCE, vanilla actor-critic work with recurrent neural networks)? [5]
Lecture 7 Questions

Explain the difference between a regular Actor-critic and the Parallel Advantage Actor-Critic algorithm. [5]

Considering continuous actions modeled by a normal distribution with diagonal covariance, describe how the policy distribution is computed (network architecture, output activation functions) and what the loss of a simple REINFORCE algorithm looks like. [5]

Formulate the deterministic policy gradient theorem for $\nabla_{\boldsymbol\theta} v_\pi(s)$. [5]

Formulate the deterministic policy gradient theorem for $\nabla_{\boldsymbol\theta} J(\boldsymbol\theta)$. [5]

Prove the part of the deterministic policy gradient theorem showing the value of $\nabla_{\boldsymbol\theta} v_\pi(s)$. [10]

Write down the critic loss (or its derivative) and the actor policy loss (or its derivative) of the Deep Deterministic Policy Gradient (DDPG) algorithm. Make sure to distinguish the target networks from the ones being trained. [10]
Lecture 8 Questions

How is the return in the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm estimated? [5]

Write down the critic loss (or its derivative) and the actor policy loss (or its derivative) of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. Make sure to distinguish the target networks from the ones being trained. [10]
Lecture 9 Questions

Define a one-step TD error and express the $n$-step return as a sum of them. [5]

Define a one-step TD error and express the $n$-step return with off-policy correction as a sum of them. [5]

Define the $\lambda$-return. [5]

Define the $n$-step truncated $\lambda$-return. [5]

Define a one-step TD error and express the $n$-step truncated $\lambda$-return as a sum of them. [5]

Define a one-step TD error and express the $n$-step truncated $\lambda$-return with off-policy correction as a sum of them. [5]

Define the V-trace estimate and write down the policy to whose value function the V-trace estimate converges. [10]

Explain why the fixed point of the V-trace operator does not depend on the truncation of all but the last importance sampling ratios. [10]

Write down the critic loss (or its derivative) and the actor policy loss (or its derivative) of the IMPALA algorithm. [10]

Sketch the population-based training used in the IMPALA algorithm. [10]
Lecture 10 Questions

Considering multi-armed bandits, write down the UCB algorithm. [5]

Describe the inputs and outputs of a neural network used in AlphaZero, and describe the inputs and outputs of a Monte-Carlo tree search. [5]

Write down the loss used in the AlphaZero algorithm. [5]

What quantities are kept in a node of a Monte-Carlo tree search? [5]

How are actions selected in a Monte-Carlo tree search? [10]

What does AlphaZero use to maintain exploration in a Monte-Carlo tree search? [5]

Describe the backup phase of a Monte-Carlo tree search, i.e., the steps you perform when you reach a leaf during the tree search. [5]

How are the actions selected in AlphaZero self-play? [5]
Lecture 11 Questions

In PopArt normalization, the value function is computed based on a normalized value predictor $n$ as $\sigma n + \mu$. Describe how to maintain $\sigma$ and $\mu$, how to compute normalized advantage based on return $G$, and how is the normalized value predictor modified when the estimates of $\sigma$ and $\mu$ change. [10]

Define the transformed Bellman operator. [5]

Define the transformed Bellman operator. Then, assuming $h$ is a strictly monotonically increasing function and considering a deterministic Markov decision process, show what the transformed Bellman operator $\mathcal{T}_h$ converges to and prove it. [10]

Write down the return transformation used for Atari environments (for example by R2D2). [5]

Describe the replay buffer elements in R2D2. What is the difference between the zero-state and stored-state strategies, and how is burn-in used? [5]