Deep Reinforcement Learning – Winter 2019/20
In recent years, reinforcement learning has been combined with deep neural networks, giving rise to game agents with superhuman performance (for example for Go, chess, or 1v1 Dota2, capable of being trained solely by selfplay), datacenter cooling algorithms being 50% more efficient than trained human operators, or improved machine translation. The goal of the course is to introduce reinforcement learning employing deep neural networks, focusing both on the theory and on practical implementations.
Python programming skills and TensorFlow skills (or any other deep learning framework) are required, to the extent of the NPFL114 course. No previous knowledge of reinforcement learning is necessary.
About
SIS code: NPFL122
Semester: winter
Ecredits: 6
Examination: 2/2 C+Ex
Guarantor: Milan Straka
Timespace Coordinates
 lecture: the lecture is held on Monday 15:40 in S4; first lecture is on Oct 07
 practicals: there are two parallel practicals, on Monday 17:20 in SU1 and on Tuesday 10:40 in SU2; first practicals are on Oct {07,08}
Lectures
1. Introduction to Reinforcement Learning Slides PDF Slides Video multiarmed_bandits
2. Markov Decision Process, Optimal Solutions, Monte Carlo Methods Slides PDF Slides Video policy_iteration monte_carlo
3. Temporal Difference Methods, OffPolicy Methods Slides PDF Slides Video importance_sampling q_learning lunar_lander
4. SelfStudy: Nstep Temporal Difference Methods Slides PDF Slides
5. Function Approximation, Deep Q Network Slides PDF Slides Video q_learning_tiles q_network
6. Rainbow Slides PDF Slides Video car_racing
7. Gradient Based Methods Slides PDF Slides Video reinforce reinforce_baseline cart_pole_pixels
8. Continuous Actions Slides PDF Slides Video paac paac_continuous ddpg walker walker_hardcore
9. TD3, Monte Carlo Tree Search Slides PDF Slides Video az_quiz az_quiz_randomized
10. Vtrace, PopArt Normalization, Partially Observable MDPs Slides PDF Slides Video vtrace memory_game memory_game_rl
License
Unless otherwise stated, teaching materials for this course are available under CC BYSA 4.0.
The lecture content, including references to study materials.
The main study material is the Reinforcement Learning: An Introduction; second edition by Richard S. Sutton and Andrew G. Barto (reffered to as RLB). It is available online and also as a hardcopy.
References to study materials cover all theory required at the exam, and sometimes even more – the references in italics cover topics not required for the exam.
1. Introduction to Reinforcement Learning
Oct 07 Slides PDF Slides Video
 History of RL [Chapter 1 of RLB]
 Multiarmed bandits [Chapter 2 of RLB]
2. Markov Decision Process, Optimal Solutions, Monte Carlo Methods
Oct 14 Slides PDF Slides Video policy_iteration monte_carlo
 Markov Decision Process [Sections 33.3 of RLB]
 Policies and Value Functions [Sections 3.53.6 of RLB]
 Value Iteration [Sections 4 and 4.4 of RLB]
 Proof of convergence only in slides
 Policy Iteration [Sections 4.14.3 of RLB]
 Generalized Policy Iteration [Section 4.6 or RLB]
 Monte Carlo Methods [Sections 55.4 of RLB]
3. Temporal Difference Methods, OffPolicy Methods
Oct 21 Slides PDF Slides Video importance_sampling q_learning lunar_lander
 Modelfree and modelbased methods, using statevalue or actionvalue functions [Chapter 8 before Section 8.1, and Section 6.8 of RLB]
 Temporaldifference methods [Sections 66.3 of RLB]
 Sarsa [Section 6.4 of RLB]
 Qlearning [Section 6.5 of RLB]
 Offpolicy Monte Carlo Methods [Sections 5.55.7 of RLB]
 Expected Sarsa [Section 6.6 of RLB]
4. SelfStudy: Nstep Temporal Difference Methods
Nov 04 Slides PDF Slides
This is a selfstudy lecture.
 Nstep TD policy evaluation [Section 7.1 of RLB]
 Nstep Sarsa [Section 7.2 of RLB]
 Offpolicy nstep Sarsa [Section 7.3 of RLB]
 Tree backup algorithm [Section 7.5 of RLB]
5. Function Approximation, Deep Q Network
Nov 11 Slides PDF Slides Video q_learning_tiles q_network
 Function approximation [Sections 99.3 of RLB]
 Tile coding [Section 9.5.4 of RLB]
 Linear function approximation [Section 9.4 of RLB, without the Proof of Convergence if Linear TD(0)]
 SemiGradient TD methods [Sections 9.3, 1010.2 of RLB]
 Offpolicy function approximation TD divergence [Sections 11.211.3 of RLB]
 Deep Q Network [Volodymyr Mnih et al.: Humanlevel control through deep reinforcement learning]
6. Rainbow
Nov 18 Slides PDF Slides Video car_racing
 Double Deep Q Network (DDQN) [Hado van Hasselt et al.: Deep Reinforcement Learning with Double Qlearning]
 Prioritized Experience Replay [Tom Schaul et al.: Prioritized Experience Replay]
 Dueling Deep Q Network [Ziyu Wang et al.: Dueling Network Architectures for Deep Reinforcement Learning]
 Noisy Nets [Meire Fortunato et al.: Noisy Networks for Exploration]
 Distributional Reinforcement Learning [Marc G. Bellemare et al.: A Distributional Perspective on Reinforcement Learning]
 Rainbow [Matteo Hessel et al.: Rainbow: Combining Improvements in Deep Reinforcement Learning]
7. Gradient Based Methods
Nov 25 Slides PDF Slides Video reinforce reinforce_baseline cart_pole_pixels
 Policy Gradient Methods [Sections 1313.1 of RLB]
 Policy Gradient Theorem [Section 13.2 of RLB]
 REINFORCE algorithm [Section 13.3 of RLB]
 REINFORCE with baseline algorithm [Section 13.4 of RLB]
 ActorCritic methods [Section 13.5 of RLB, without the eligibility traces variant]
 A3C and asynchronous RL [Volodymyr Mnih et al.: Asynchronous Methods for Deep Reinforcement Learning]
8. Continuous Actions
Dec 02 Slides PDF Slides Video paac paac_continuous ddpg walker walker_hardcore
 PAAC [Alfredo V. Clemente et al.: Efficient Parallel Methods for Deep Reinforcement Learning]
 Gradient methods with continuous actions [Section 13.7 of RLB]
 Deterministic policy gradient theorem (DPG) [David Silver et al.: Deterministic Policy Gradient Algorithms]
 Deep deterministic policy gradient (DDPG) [Timothy P. Lillicrap et al.: Continuous Control with Deep Reinforcement Learning]
9. TD3, Monte Carlo Tree Search
Dec 09 Slides PDF Slides Video az_quiz az_quiz_randomized
 Twin delayed deep deterministic policy gradient (TD3) [Scott Fujimoto et al.: Addressing Function Approximation Error in ActorCritic Methods]
 AlphaZero [David Silver et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through selfplay
10. Vtrace, PopArt Normalization, Partially Observable MDPs
Dec 16 Slides PDF Slides Video vtrace memory_game memory_game_rl
 The Vtrace algorithm of IMPALA [Lasse Espeholt et al.: IMPALA: Scalable Distributed DeepRL with Importance Weighted ActorLearner Architectures]
 PopArt reward normalization [Matteo Hessel et al.: Multitask Deep Reinforcement Learning with PopArt]
 Recurrent Replay Distributed DQN (R2D2) [Steven Kapturowski et al.: Recurrent Experience Replay in Distributed Reinforcement Learning]
 MERLIN model [Greg Wayne et al.:Unsupervised Predictive Memory in a GoalDirected Agent]
 FTW agent for multiplayer CTF [Max Jaderberg et al.: Humanlevel performance in firstperson multiplayer games with populationbased deep reinforcement learning]
Requirements
To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that up to 40 points above 80 will be transfered to the exam.
Environment
The tasks are evaluated automatically using the ReCodEx Code Examiner. The evaluation is performed using Python 3.6, TensorFlow 2.0.0, NumPy 1.17.2 and OpenAI Gym 0.14.0. For those using PyTorch, CPU version 1.2.0 is available.
You can install TensorFlow and Gym either to user packages using
pip3 install user tensorflow==2.0.0 gym==0.14.0 scipy box2dpy ataripy
(with the last three backages being optinal dependencies of gym
),
or create a virtual environment using python3 m venv VENV_DIR
and then installing
the packages inside it by running
VENV_DIR/bin/pip3 install ...
.
On Windows, you can use thirdparty precompiled versions of
box2dpy.
Teamwork
Working in teams of size 2 (or at most 3) is encouraged. All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. However, each such solution must explicitly list all members of the team to allow plagiarism detection using this template.
Submitting Data Files to ReCodEx
Because ReCodEx allows submitting only Python
sources in our settings, we need to embed models and other nonPython data
into Python sources. You can use
the embed.py
script,
which compresses the given files and directories and embeds them into a Python
module, which extracts them when imported or executed.
multiarmed_bandits
Deadline: Oct 20, 23:59 8 points
Perform a parameter study of various approaches to solving multiarmed bandits. For every hyperparameter choice, perform 100 episodes, each consisting of 1000 trials, and report the average and standard deviation of the 100 episode returns.
Start with the multiarmed_bandits.py
template, which defines MultiArmedBandits
environment. We use API based on
OpenAI Gym Environment
class, notably the following
two methods:
reset() → new_state
: starts a new episodestep(action) → new_state, reward, done, info
: perform the chosen action in the environment, returning the new state, obtained reward, a boolean flag indicating an end of episode, and additional environmentspecific information Of course, the states are not used by the multiarmed bandits (None
is returned).
Your goal is to implement the following modes of calculation. In addition to submitting the solution to ReCodEx, you should use multiarmed_bandits_draw.py to plots the results in a graph.
greedy
[2 points]: perform $ε$greedy search with parameterepsilon
, computing the value function using averaging. (Results for $ε ∈ \{1/64, 1/32, 1/16, 1/8, 1/4\}$ are plotted.)greedy
andalpha
$≠0$ [1 point]: perform $ε$greedy search with parameterepsilon
and initial function estimate of 0, using fixed learning ratealpha
. (Results for $α=0.15$ and $ε ∈ \{1/64, 1/32, 1/16, 1/8, 1/4\}$ are plotted.)greedy
,alpha
$≠0$ andinitial
$≠0$ [1 point]: perform $ε$greedy search with parameterepsilon
, giveninitial
value as starting value function and fixed learning ratealpha
. (Results forinitial
$=1$, $α=0.15$ and $ε ∈ \{1/128, 1/64, 1/32, 1/16\}$ are plotted.)ucb
[2 points]: perform UCB search with confidence levelc
and computing the value function using averaging. (Results for $c ∈ \{1/4, 1/2, 1, 2, 4\}$ are plotted.)gradient
[2 points]: choose actions according to softmax distribution, updating the parameters using SGD to maximize expected reward. (Results for $α ∈ \{1/16, 1/8, 1/4, 1/2\}$ are plotted.)
policy_iteration
Deadline: Nov 03, 23:59 5 points
Consider the following gridworld:
Start with policy_iteration.py, which implements the gridworld mechanics, by providing the following methods:
GridWorld.states
: return number of states (11
)GridWorld.actions
: return lists with labels of the actions (["↑", "→", "↓", "←"]
)GridWorld.step(state, action)
: return possible outcomes of performing theaction
in a givenstate
, as a list of triples containingprobability
: probability of the outcomereward
: reward of the outcomenew_state
: new state of the outcome
Implement policy iteration algorithm, with steps
steps of policy
evaluation/policy improvement. During policy evaluation, use the current value
function and perform iterations
applications of the Bellman equation.
Perform the policy evaluation synchronously (i.e., do not overwrite the current
value function when computing its improvement). Assume the initial policy is
“go North” and initial value function is zero.
After given number of steps and iterations, print the resulting value function and resulting policy. For example, the output after 4 steps and 4 iterations should be:
9.15→ 10.30→ 11.32→ 12.33↑
8.12↑ 3.35← 2.58←
6.95↑ 5.90← 4.66← 4.93↓
monte_carlo
Deadline: Nov 03, 23:59 6 points
Solve the CartPolev1 environment environment from the OpenAI Gym using the Monte Carlo reinforcement learning algorithm.
Use the supplied cart_pole_evaluator.py module (depending on gym_evaluator.py) to interact with the discretized environment. The environment has the following methods and properties:
states
: number of states of the environmentactions
: number of actions of the environmentepisode
: number of the current episode (zerobased)reset(start_evaluate=False) → new_state
: starts a new episodestep(action) → new_state, reward, done, info
: perform the chosen action in the environment, returning the new state, obtained reward, a boolean flag indicating an end of episode, and additional environmentspecific informationrender()
: render current environment state
Once you finish training (which you indicate by passing start_evaluate=True
to reset
), your goal is to reach an average return of 490 during 100
evaluation episodes. Note that the environment prints your 100episode
average return each 10 episodes even during training.
You can start with the monte_carlo.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
Note that you must not submit gym_evaluator.py
nor cart_pole_evaluator.py
to ReCodEx.
importance_sampling
Deadline: Nov 10, 23:59 4 points
Using the FrozenLakev0 environment environment, implement Monte Carlo weighted importance sampling to estimate state value function $V$ of target policy, which uniformly chooses either action 1 (down) or action 2 (right), utilizing behaviour policy, which uniformly chooses among all four actions.
Start with the importance_sampling.py template, which creates the environment and generates episodes according to behaviour policy.
For $1000$ episodes, the output of your program should be the following:
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.21 0.00
0.00 0.00 0.45 0.00
q_learning
Deadline: Nov 10, 23:59 6 points
Solve the MountainCarv0 environment environment from the OpenAI Gym using the Qlearning reinforcement learning algorithm. Note that this task does not require TensorFlow.
Use the supplied mountain_car_evaluator.py
module (depending on gym_evaluator.py)
to interact with the discretized environment. The environment
methods and properties are described in the monte_carlo
assignment.
Your goal is to reach an average reward of 150 during 100 evaluation episodes.
You can start with the q_learning.py template, which parses several useful parameters, creates the environment and illustrates the overall usage. Note that setting hyperparameters of Qlearning is a bit tricky – I usualy start with a larger value of $ε$ (like 0.2 or even 0.5) an then gradually decrease it to almost zero.
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
Note that you must not submit gym_evaluator.py
nor mountain_car_evaluator.py
to ReCodEx.
lunar_lander
Deadline: Nov 10, 23:59 7 points + 7 bonus
Solve the LunarLanderv2 environment environment from the OpenAI Gym. Note that this task does not require TensorFlow.
Use the supplied lunar_lander_evaluator.py
module (depending on gym_evaluator.py
to interact with the discretized environment. The environment
methods and properties are described in the monte_carlo
assignment,
but include one additional method:
expert_trajectory() → initial_state, trajectory
This method generates one expert trajectory and returns a pair ofinitial_state
andtrajectory
, wheretrajectory
is a list of the tripples (action, reward, next_state). You can use this method only during training, not during evaluation.
To pass the task, you need to reach an average return of 0 during 100 evaluation episodes. During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
The task is additionally a competition and at most 7 points will be awarded according to relative ordering of your solution performances.
You can start with the lunar_lander.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
Note that you must not submit gym_evaluator.py
nor lunar_lander_evaluator.py
to ReCodEx.
q_learning_tiles
Deadline: Nov 24, 23:59 5 points
Improve the q_learning
task performance on the
MountainCarv0 environment
environment using linear function approximation with tile coding.
Your goal is to reach an average reward of 110 during 100 evaluation episodes.
Use the mountain_car_evaluator.py
module (depending on gym_evaluator.py)
to interact with the discretized environment. The environment
methods and properties are described in the monte_carlo
assignment, with the
following changes:

The
env.weights
method return the number of weights of the linear function approximation. 
The
state
returned by theenv.step
method is a list containing weight indices of the current state (i.e., the feature vector of the state consists of zeros and ones, and only the indices of the ones are returned). The (action)value function for a state is therefore approximated as a sum of the weights whose indices are returned byenv.step
.The default number of tiles in tile encoding (i.e., the size of the list with weight indices) is
args.tiles=8
, but you can use any number you want (but the assignment is solvable with 8).
You can start with the q_learning_tiles.py template, which parses several useful parameters, creates the environment and illustrates the overall usage. Implementing Qlearning is enough to pass the assignment, even if both Nstep Sarsa and Tree Backup converge a little faster.
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
Note that you must not submit gym_evaluator.py
nor mountain_car_evaluator.py
to ReCodEx.
q_network
Deadline: Nov 24, 23:59 6 points
Solve the CartPolev1 environment environment from the OpenAI Gym using Qlearning with neural network as a function approximation.
The cart_pole_evaluator.py
module (depending on gym_evaluator.py)
can also create a continuous environment using environment(discrete=False)
.
The continuous environment is very similar to the discrete environment, except
that the states are vectors of realvalued observations with shape environment.state_shape
.
Use Qlearning with neural network as a function approximation, which for a given states returns stateaction values for all actions. You can use any network architecture, but one hidden layer of 20 ReLU units is a good start.
Your goal is to reach an average return of 400 during 100 evaluation episodes.
You can start with the q_network.py template, which provides a simple network implementation in TensorFlow.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes (so you can train in ReCodEx, but you can also pretrain your network if you like).
Note that you must not submit gym_evaluator.py
nor cart_pole_evaluator.py
to ReCodEx.
car_racing
Deadline: Dec 01, 23:59 8 points + 10 bonus
The goal of this competition is to use Deep Q Networks and its improvements on a more realworld CarRacingv0 environment environment from the OpenAI Gym.
Use the supplied car_racing_evaluator.py module (depending on gym_evaluator.py to interact with the environment. The environment is continuous and states are RGB images of size $96×96×3$, but you can downsample them even more. The actions are also continuous and consist of an array with the following three elements:
steer
in range [1, 1]gas
in range [0, 1]brake
in range [0, 1]
Internally you should generate discrete actions and convert them to the required
representation before the step
call. Good initial action space is to use
9 actions – a Cartesian product of 3 steering actions (left/right/none) and
3 driving actions (gas/brake/none).
Nov 22: The frame skipping support was changed. The
evironment supports frame skipping without rendering the skipped frames, by
passing frame_skip
parameter to car_racing_evaluator.environment(frame_skip=1)
method – the value of frame_skip
determines how many times is an action
repeated.
Nov 19: The environment also supports parallel execution (use multiple CPU threads to simulate several environments in parallel during training), by providing the following two methods:
parallel_init(num_workers) → initial_states
, which initializes the given number of parallel workers and returns their environment initial states. This method can be called at most once.parallel_step(actions) → List[next_state, reward, done, info]
, which performs given action in respective environment, and return the usual information with one exception: Ifdone=True
, thennext_state
is already an initial state of newly started episode.
In ReCodEx, you are expected to submit an already trained model, which is evaluated on 15 different tracks with a total time limit of 15 minutes. If your average return is at least 200, you obtain 8 points. The task is also a competition and at most 10 points will be awarded according to relative ordering of your solution performances.
The car_racing.py template parses several useful parameters and creates the environment. Note that the car_racing_evaluator.py can be executed directly and in that case you can drive the car using arrows.
Note that you must not submit gym_evaluator.py
nor car_racing_evaluator.py
to ReCodEx.
reinforce
Deadline: Dec 08, 23:59 5 points
Solve the CartPolev1 environment environment from the OpenAI Gym using the REINFORCE algorithm.
The supplied cart_pole_evaluator.py
module (depending on gym_evaluator.py)
can create a continuous environment using environment(discrete=False)
.
The continuous environment is very similar to the discrete environment, except
that the states are vectors of realvalued observations with shape environment.state_shape
.
Your goal is to reach an average return of 490 during 100 evaluation episodes.
You can start with the reinforce.py template, which provides a simple network implementation in TensorFlow.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
Note that you must not submit gym_evaluator.py
nor cart_pole_evaluator.py
to ReCodEx.
reinforce_baseline
Deadline: Dec 08, 23:59 5 points
This is a continuation of reinforce
assignment.
Using the reinforce_baseline.py template, solve the CartPolev1 environment environment using the REINFORCE with baseline algorithm.
Using a baseline lowers the variance of the value function gradient estimator, which allows faster training and decreases sensitivity to hyperparameter values. To reflect this effect in ReCodEx, note that the evaluation phase will automatically start after 200 episodes. Using only 200 episodes for training in this setting is probably too little for the REINFORCE algorithm, but suffices for the variant with a baseline.
Your goal is to reach an average return of 490 during 100 evaluation episodes.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
Note that you must not submit gym_evaluator.py
nor cart_pole_evaluator.py
to ReCodEx.
cart_pole_pixels
Deadline: Dec 08, 23:59 6 points + 6 bonus
The supplied cart_pole_pixels_evaluator.py
module (depending on gym_evaluator.py)
generates a pixel representation of the CartPole
environment
as an $80×80$ image with three channels, with each channel representing one time step
(i.e., the current observation and the two previous ones).
To pass the compulsory part of the assignment, you need to reach an average return of 200 during 100 evaluation episodes. During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.
The task is additionally a competition and at most 5 points will be awarded according to relative ordering of your solution performances.
The cart_pole_pixels.py template parses several parameters and creates the environment. You are again supposed to train the model beforehand and submit only the trained neural network.
Note that you must not submit gym_evaluator.py
nor cart_pole_pixels_evaluator.py
to ReCodEx.
paac
Deadline: Dec 15, 23:59 5 points
Using the paac.py
template, solve the CartPolev1 environment
environment using parallel actorcritic algorithm. Use the parallel_init
and parallel_step
methods described in car_racing
assignment.
Your goal is to reach an average return of 450 during 100 evaluation episodes.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.
Note that you must not submit gym_evaluator.py
to ReCodEx.
paac_continuous
Deadline: Dec 15, 23:59 6 points
Using the paac_continuous.py template, solve the MountainCarContinuousv0 environment environment using parallel actorcritic algorithm with continuous actions.
The gym_environment
now provides two additional methods:
action_shape
: returns required shape of continuous action. You can assume the actions are always an onedimensional vector.action_ranges
: returns a pair of vectorslow
,high
. These denote valid ranges for the actions, solow[i]
$≤$action[i]
$≤$high[i]
.
Your goal is to reach an average return of 90 during 100 evaluation episodes.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.
Note that you must not submit gym_evaluator.py
nor continuous_mountain_car_evaluator.py
to ReCodEx.
ddpg
Deadline: Dec 15, 23:59 7 points
Using the ddpg.py template, solve the Pendulumv0 environment environment using deep deterministic policy gradient algorithm.
To create the evaluator, use
gym_evaluator.py.GymEvaluator("Pendulumv0")
.
The environment is continuous, states and actions are described at
OpenAI Gym Wiki.
Your goal is to reach an average return of 200 during 100 evaluation episodes.
During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.
Note that you must not submit gym_evaluator.py
to ReCodEx.
walker
Deadline: Jan 05, 23:59 8 points + 10 bonus
In this exercise exploring continuous robot control, try solving the BipedalWalkerv2 environment environment from the OpenAI Gym.
To create the evaluator, use
gym_evaluator.py.GymEvaluator("BipedalWalkerv2")
.
The environment is continuous, states and actions are described at
OpenAI Gym Wiki.
In ReCodEx, you are expected to submit an already trained model, which is evaluated on 100 episodes with a total time limit of 10 minutes. If your average return is at least 100, you obtain 8 points. The task is also a competition and at most 10 points will be awarded according to relative ordering of your solution performances.
You can start with the ddpg.py
template, only set args.env
to BipedalWalkerv2
.
Note that you must not submit gym_evaluator.py
to ReCodEx.
walker_hardcore
Deadline: Jan 05, 23:59 10 bonus
As an extesnion of the walker
assignment, try solving the
BipedalWalkerHardcorev2 environment
environment from the OpenAI Gym.
The task is a competition only and at most 10 points will be awarded according to relative ordering of your solution performances. In ReCodEx, your solution will be evaluated on 100 episodes with a total time limit of 10 minutes. If your average return is at least 0, ReCodEx shows the solution as correct.
You can start with the ddpg.py
template, only set args.env
to BipedalWalkerHardcorev2
.
Note that you must not submit gym_evaluator.py
to ReCodEx.
az_quiz
Deadline: Jan 05, 23:59 10 points + 10 bonus
In this competition assignment, use Monte Carlo Tree Search to learn an agent for a simplified version of AZkvíz. In our version, the agent does not have to answer questions and we assume that all answers are correct.
The game itself is implemented in the
az_quiz.py
module, using randomized=False
constructor argument.
The evaluation in ReCodEx should be implemented by importing a module
az_quiz_evaluator_recodex
and calling its evaluate
function. The argument
this functions is an object providing a method play
which given an AZkvíz
instance returns the chosen move. The illustration of the interface is in the
az_quiz_evaluator_recodex.py
module, a simple random player implementing the interface is the
az_quiz_player_random.py.
Your solution in ReCodEx is automatically evaluated against a very simple heuristic az_quiz_player_simple_heuristic.py, playing 50 games as a starting player and 50 games as a nonstarting player. The time limit for the games is 15 minutes and you should see the win rate directly in ReCodEx. If you achieve at least 75%, you will pass the assignment.
The final competition evaluation will be performed after the deadline by a roundrobin tournament.
Note that az_quiz_evaluator.py can be used to evaluate any two given implementations and there are two interactive players available, az_quiz_player_interactive_mouse.py and az_quiz_player_interactive_keyboard.py.
For inspiration, use the official pseudocode for AlphaZero. However, note that there are some errors in it.
 On line 258, value of the children should be inverted, resulting in:
value_score = 1  child.value()
 On line 237, next action should be sampled according to a distribution of normalized visit counts, not according to a softmax of visit counts.
 Below line 287, the sampled gamma random variables should be normalized
to produce Dirichlet random sample:
noise /= np.sum(noise)
Note that you must not submit az_quiz.py
nor az_quiz_evaluator_recodex.py
to ReCodEx.
az_quiz_randomized
Deadline: Jan 05, 23:59 10 bonus
Extend the az_quiz
assignment to handle the possibility of wrong
answers. Therefore, when choosing a field, the agent might answer
incorrectly.
To instantiate this randomized game variant, pass randomized=True
to the AZQuiz
class of az_quiz.py.
The Monte Carlo Tree Search has to be slightly modified to handle stochastic
MDP. The information about distribution of possible next states is provided
by the AZQuiz.all_moves
method, which returns a list of (probability, az_quiz_instance)
next states (in our environment, there are always two
possible next states).
Note that you must not submit az_quiz.py
nor az_quiz_evaluator_recodex.py
to ReCodEx.
vtrace
Deadline: Mar 08, 23:59 6 points
Using the vtrace.py template, implement the Vtrace algorithm.
The evaluation in ReCodEx is performed in two ways:
 your solution is imported as a module and the output of
Network.vtrace
is compared to the referential implementation;  your solution is evaluated on
CartPolev1
and should achieve an average return of 490 (two seeds are tried, each with 10minute limit).
You can perform the test of your Network.vtrace
implementation yourself using the
vtrace_test.py
module, which loads reference data from
vtrace_test.pickle
and then evaluates Network.vtrace
implementation from a given module.
Note that you must not submit gym_environment.py
to ReCodEx.
memory_game
Deadline: Mar 15, 23:59 4 points
In this exercise we explore a partially observable environment. Consider a oneplayer variant of a memory game (pexeso), where a player repeatedly flip cards. If the player flips two cards with the same symbol in succession, the cards are removed and the player recieves a reward of +2. Otherwise the player recieves a reward of 1. An episode ends when all cards are removed. Note that it is valid to try to flip an already removed card.
Let there be $N$ cards in the environment, $N$ being even. There are $N+1$
actions – the first $N$ flip the corresponding card, and the last action
flips the unused card with the lowest index (or the card $N$ if all have
been used already). The observations consist of a pair of discrete values
(card, symbol), where the card is the index of the card flipped, and
the symbol is the symbol on the flipped card. The env.states
returns
a pair $(N, N/2)$, representing there are $N$ card indices and $N/2$
symbol indices.
Every episode can be ended by at most $3N/2$ actions, and the required return is therefore greater or equal to zero. Note that there is a limit of at most $2N$ actions per episode. The described environment is provided by the memory_game_evaluator.py module.
Your goal is to solve the environment, using supervised learning via provided
expert episodes and networks with external memory. The environment implements
an env.expert_episode()
method, which returns a fresh correct episode
as a list of (state, action)
pairs (with the last action
being None
).
ReCodEx evaluates your solution on environments with 8, 12 and 16 cards
(utilizing the cards
argument). For each card number, 100 episodes are
simulated once you pass evaluating=True
to env.reset
and your solution gets
1 point (2 points for 16 cards) if the average return is nonnegative. You can
train the agent directly in ReCodEx, or submit a pretrained one.
A template memory_game.py is available, commenting a possible use of memory augmented networks.
Note that you must not submit gym_environment.py
nor
memory_game_evaluator.py
to ReCodEx.
memory_game_rl
Deadline: Mar 15, 23:59 5 points
This is a continuation of the memory_game
assignment.
In this task, your goal is to solve the memory game environment
using reinforcement learning. That is, you must not use the
env.expert_episode
method during training.
ReCodEx evaluates your solution on environments with 4, 6 and 8 cards (utilizing
the cards
argument). For each card number, your solution gets 2 points
(1 point for 4 cards) if the average return is nonnegative. You can train the agent
directly in ReCodEx, or submit a pretrained one.
There is no specific template for this assignment, reuse the memory_game.py for the previous assignment.
Note that you must not submit gym_environment.py
nor
memory_game_evaluator.py
to ReCodEx.
Requirements
To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that up to 40 points above 80 will be transfered to the exam.
To pass the exam, you need to obtain at least 60, 75 and 90 out of 100point exam, to obtain grades 3, 2 and 1, respectively. The exam consists of five 20point questions and in addition, you can get at most 40 surplus points from the practicals and at most 10 points for community work (i.e., fixing slides or reporting issues).
Exam Questions

General RL Settings, Value Iteration Define reinforcement learning as a Markov Decision Process, define a policy, value function and an actionvalue function (and show how value and actionvalue functions can be computed from one another). Then, define optimal value and actionvalue function, and show, how optimal value function can be computed using Bellman backup operator, i.e., the value iteration algorithm (including a proof of convergence).

General RL Settings, Policy Iteration Define reinforcement learning as a Markov Decision Process, define a policy, value function and an actionvalue function (and show how value and actionvalue functions can be computed from one another). Then, define optimal value and actionvalue function, and show, how optimal policy can be computed using policy iteration algorithm (ideally including proofs).

TD Methods Describe temporal difference methods and formulate Sarsa, Qlearning, Expected Sarsa, Double Qlearning and $n$step Sarsa in tabular settings.

Offpolicy Methods Describe difference between onpolicy and offpolicy methods, and show an offpolicy variant of Monte Carlo algorithm in tabular settings, both with ordinary and weighted importance sampling. Then describe Expected Sarsa as an offpolicy algorithm not using importance sampling. Finally, describe offpolicy $n$step Sarsa.

Function Approximation Assuming function approximation, define the usual mean squared value error and describe gradient Monte Carlo and Semigradient TD algorithms. Then show how offpolicy methods in function approximation settings may diverge. Finally, sketch Deep Q Network architecture, especially the experience replay and target network.

DQN Describe Deep Q Network, and especially experience replay, target network and reward clipping tricks. Then, describe at least three improvements present in Rainbow algorithm (i.e., something of DDQN, prioritized replay, duelling architecture, noisy nets and distributional RL).

Policy Gradient Methods, REINFORCE Describe policy gradient methods, prove policy gradient theorem, and describe REINFORCE, REINFORCE with baseline (including the proof of the baseline).

Policy Gradient Methods, PAAC Describe policy gradient methods, prove policy gradient theorem, and describe Parallel advantage actorcritic (PAAC) algorithm.

Gradient Methods with Continuous Actions, DDPG Show how continuous actions can be incorporated in policy gradient algorithms (i.e., in a REINFORCE algorithm, without proving the policy gradient theorem). Then formulate and prove deterministic policy gradient theorem. Finally, describe the DDPG algorithm.

Gradient Methods with Continuous Actions, TD3 Formulate and prove deterministic policy gradient theorem. Then, describe the DDPG and TD3 algorithms.

AlphaZero Describe the algorithm used by AlphaZero – Monte Carlo tree search, overall neural network architecture, and training and inference procedures.

Vtrace and PopArt Normalization Describe the Vtrace algorithm and sketch populationbased training used in the IMPALA algorithm. Then, show how the rewards can be normalized using the PopArt normalization approach.