In recent years, reinforcement learning has been combined with deep neural networks, giving rise to game agents with super-human performance (for example for Go, chess, or 1v1 Dota2, capable of being trained solely by self-play), datacenter cooling algorithms that are 50% more efficient than trained human operators, and improved machine translation. The goal of the course is to introduce reinforcement learning employing deep neural networks, focusing both on the theory and on practical implementations.
Python programming skills and TensorFlow skills (or any other deep learning framework) are required, to the extent of the NPFL114 course. No previous knowledge of reinforcement learning is necessary.
SIS code: NPFL122
Semester: winter
E-credits: 6
Examination: 2/2 C+Ex
Guarantor: Milan Straka
1. Introduction to Reinforcement Learning Slides PDF Slides Lecture Practicals bandits monte_carlo
2. Markov Decision Process, Optimal Solutions, Monte Carlo Methods Slides PDF Slides Lecture Practicals policy_iteration policy_iteration_exact policy_iteration_exploring_mc policy_iteration_greedy_mc
3. Temporal Difference Methods, Off-Policy Methods Slides PDF Slides Lecture Practicals importance_sampling q_learning lunar_lander
4. Function Approximation, Deep Q Network Slides PDF Slides Lecture
The lecture content, including references to study materials.
The main study material is Reinforcement Learning: An Introduction, second edition, by Richard S. Sutton and Andrew G. Barto (referred to as RLB). It is available online and also as a hardcopy.
References to study materials cover all theory required at the exam, and sometimes even more – the references in italics cover topics not required for the exam.
Oct 05 Slides PDF Slides Lecture Practicals bandits monte_carlo
Oct 12 Slides PDF Slides Lecture Practicals policy_iteration policy_iteration_exact policy_iteration_exploring_mc policy_iteration_greedy_mc
Oct 19 Slides PDF Slides Lecture Practicals importance_sampling q_learning lunar_lander
Oct 26 Slides PDF Slides Lecture
To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that up to 40 points above 80 (including the bonus points) will be transferred to the exam.
The tasks are evaluated automatically using the ReCodEx Code Examiner.
The evaluation is performed using Python 3.8, TensorFlow 2.3.1, NumPy 1.18.5, OpenAI Gym 0.17.2 and Box2D 2.3.10. You should install the exact versions of these packages yourselves. For those using PyTorch, version 1.6.0 is also available.
Solving assignments in teams of size 2 or 3 is encouraged, but everyone has to participate (it is forbidden not to work on an assignment and then submit a solution created by other team members). All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. Each such solution must explicitly list all members of the team to allow plagiarism detection using this template.
Deadline: Oct 20, 23:59 4 points
Implement the $ε$-greedy strategy for solving multi-armed bandits.
Start with the bandits.py template, which defines the MultiArmedBandits environment with the following three methods:
- `reset()`: reset the environment
- `step(action) → reward`: perform the chosen action in the environment, obtaining a reward
- `greedy(epsilon)`: return `True` with probability `1-epsilon`
Your goal is to implement the following solution variants:
- `alpha` $=0$: perform $ε$-greedy search, updating the estimates using averaging.
- `alpha` $≠0$: perform $ε$-greedy search, updating the estimates using a fixed learning rate `alpha`.

Note that the initial estimates should be set to a given value, and `epsilon` can be zero, in which case purely greedy actions are used.
Please note that the results are stochastic, so your results may differ slightly.
python3 bandits.py --alpha=0 --epsilon=0.1 --initial=0
1.39 0.08
python3 bandits.py --alpha=0 --epsilon=0 --initial=1
1.48 0.22
python3 bandits.py --alpha=0.15 --epsilon=0.1 --initial=0
1.37 0.09
python3 bandits.py --alpha=0.15 --epsilon=0 --initial=1
1.52 0.04
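Both variants boil down to the same incremental update with a different step size. A minimal sketch (the function and variable names are my own, not part of the template):

```python
import random

def choose_action(estimates, epsilon, rng=random):
    """Return a random arm with probability epsilon, otherwise the greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda arm: estimates[arm])

def update_estimate(estimates, counts, action, reward, alpha=0.0):
    """Move the estimate of `action` toward the observed reward.

    With alpha == 0, the 1/n step size makes the estimate the running average
    of all observed rewards; with alpha != 0, a fixed learning rate is used.
    """
    counts[action] += 1
    step = alpha if alpha != 0 else 1.0 / counts[action]
    estimates[action] += step * (reward - estimates[action])
```

With a nonzero `--initial`, the `estimates` list would simply start at that value instead of zero.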
Deadline: Oct 20, 23:59 6 points
Solve the CartPole-v1 environment from the OpenAI Gym using the Monte Carlo reinforcement learning algorithm. The gym environments have the following methods and properties:
- `observation_space`: the description of environment observations
- `action_space`: the description of environment actions
- `reset() → new_state`: starts a new episode
- `step(action) → new_state, reward, done, info`: perform the chosen action in the environment, returning the new state, the obtained reward, a boolean flag indicating the end of the episode, and additional environment-specific information
- `render()`: render the current environment state

We additionally extend the gym environment by:
- `episode`: number of the current episode (zero-based)
- `reset(start_evaluation=False) → new_state`: if `start_evaluation` is `True`, an evaluation is started

Once you finish training (which you indicate by passing `start_evaluation=True` to `reset`), your goal is to reach an average return of 490 during 100 evaluation episodes. Note that the environment prints your 100-episode average return every 10 episodes, even during training.
Start with the monte_carlo.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
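The core of a tabular solution is the first-visit Monte Carlo update of action-value estimates from a finished episode. A sketch, assuming the continuous observations have already been discretized into hashable states (names are illustrative):

```python
from collections import defaultdict

def monte_carlo_update(q, counts, episode, gamma=1.0):
    """First-visit Monte Carlo update of action-value estimates.

    `episode` is a list of (state, action, reward) tuples; `q` and `counts`
    map (state, action) pairs to running estimates and visit counts.
    """
    # Compute the return G_t for every time step, scanning backwards.
    g = 0.0
    returns = []
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        returns.append((state, action, g))
    returns.reverse()

    seen = set()
    for state, action, g in returns:
        if (state, action) in seen:  # first-visit: update only first occurrence
            continue
        seen.add((state, action))
        counts[state, action] += 1
        q[state, action] += (g - q[state, action]) / counts[state, action]
```

An $ε$-greedy policy over `q` then generates the next episode, with `epsilon` gradually decreased during training.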
Deadline: Oct 27, 23:59 4 points
Consider the following gridworld:
Start with policy_iteration.py, which implements the gridworld mechanics by providing the following methods:
- `GridWorld.states`: return the number of states (`11`)
- `GridWorld.actions`: return a list with labels of the actions (`["↑", "→", "↓", "←"]`)
- `GridWorld.step(state, action)`: return possible outcomes of performing the action in a given state, as a list of triples containing
  - `probability`: probability of the outcome
  - `reward`: reward of the outcome
  - `new_state`: new state of the outcome

Implement the policy iteration algorithm, with `--steps` steps of policy evaluation/policy improvement. During policy evaluation, use the current value function and perform `--iterations` applications of the Bellman equation. Perform the policy evaluation synchronously (i.e., do not overwrite the current value function when computing its improvement). Assume the initial policy is “go North” and the initial value function is zero.
Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).
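The synchronous evaluation/improvement loop can be sketched as follows (an illustrative implementation against the `step(state, action)` interface described above; the function name is my own):

```python
def policy_iteration(num_states, num_actions, step, gamma, steps, iterations):
    """Policy iteration with synchronous policy evaluation.

    `step(state, action)` returns a list of (probability, reward, new_state)
    triples, mirroring the GridWorld.step interface.
    """
    value = [0.0] * num_states
    policy = [0] * num_states  # initial policy: always the first action
    for _ in range(steps):
        # Policy evaluation: `iterations` synchronous Bellman applications,
        # building a fresh value list instead of overwriting in place.
        for _ in range(iterations):
            value = [sum(p * (r + gamma * value[next_s])
                         for p, r, next_s in step(s, policy[s]))
                     for s in range(num_states)]
        # Policy improvement: act greedily with respect to the current values.
        policy = [max(range(num_actions),
                      key=lambda a: sum(p * (r + gamma * value[next_s])
                                        for p, r, next_s in step(s, a)))
                  for s in range(num_states)]
    return value, policy
```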
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -10.00← -10.00↑
0.00↑ 0.00→ 0.10← -79.90←
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=2
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ -7.59← -11.90←
0.00→ 0.08← -0.94← -18.36←
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=3
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↓ -5.86← -7.41←
0.06↓ 0.01← -0.75← -13.49↓
python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=10
0.04↓ 0.04← 0.01↑ 0.00↑
0.04↓ -0.95← -1.00←
0.04↓ 0.04← -0.10→ -0.52↓
python3 policy_iteration.py --gamma=0.95 --iterations=10 --steps=10
11.79↓ 11.03← 10.31← 6.54↑
12.69↓ 10.14← 9.95←
13.56→ 14.59→ 15.58→ 16.26↓
python3 policy_iteration.py --gamma=1 --iterations=1 --steps=100
66.54↓ 65.53← 64.42← 56.34↑
67.68↓ 63.58← 62.97←
68.69→ 69.83→ 70.84→ 71.75↓
Deadline: Oct 27, 23:59 2 points
Starting with policy_iteration_exact.py, extend the policy_iteration assignment to perform policy evaluation exactly by solving a system of linear equations.
Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).
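The exact evaluation rests on the identity $V = r_π + γ P_π V$, i.e., $(I - γ P_π) V = r_π$. A sketch using NumPy (function and variable names are my own):

```python
import numpy as np

def evaluate_policy_exactly(num_states, step, policy, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for the value function of `policy`.

    `step(state, action)` returns (probability, reward, new_state) triples,
    as in the GridWorld interface.
    """
    transitions = np.zeros((num_states, num_states))
    rewards = np.zeros(num_states)
    for s in range(num_states):
        for p, reward, new_state in step(s, policy[s]):
            transitions[s, new_state] += p
            rewards[s] += p * reward  # expected one-step reward under policy
    return np.linalg.solve(np.eye(num_states) - gamma * transitions, rewards)
```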
python3 policy_iteration_exact.py --gamma=0.95 --steps=1
-0.00→ 0.00→ 0.00↑ 0.00↑
-0.00↑ -12.35← -12.35↑
-0.85← -8.10← -19.62← -100.71←
python3 policy_iteration_exact.py --gamma=0.95 --steps=2
0.00→ 0.00→ 0.00→ 0.00↑
0.00↑ 0.00← -11.05←
-0.00↑ -0.00→ 0.00← -12.10↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=3
-0.00↓ -0.00← -0.00↓ -0.00↑
-0.00↑ 0.00← 0.69←
-0.00← -0.00← -0.00→ 6.21↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=8
12.12↓ 11.37← 9.19← 6.02↑
13.01↓ 9.92← 9.79←
13.87→ 14.89→ 15.87→ 16.60↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=9
12.24↓ 11.49← 10.76← 7.05↑
13.14↓ 10.60← 10.42←
14.01→ 15.04→ 16.03→ 16.71↓
python3 policy_iteration_exact.py --gamma=0.95 --steps=10
12.24↓ 11.49← 10.76← 7.05↑
13.14↓ 10.60← 10.42←
14.01→ 15.04→ 16.03→ 16.71↓
Deadline: Oct 27, 23:59 2 points
Starting with policy_iteration_exploring_mc.py, extend the policy_iteration assignment to perform policy evaluation by using Monte Carlo estimation with exploring starts. The estimation can now be performed model-free (without access to the full MDP dynamics); therefore, `GridWorld.step` returns a randomly sampled result instead of a full distribution.
Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).
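One episode with an exploring start could be sampled roughly as follows (a sketch; `sample_step` stands in for the sampling `GridWorld.step`, and the names are my own):

```python
import random

def run_episode_with_exploring_start(num_states, num_actions, sample_step,
                                     policy, max_length=100, rng=random):
    """Roll out one episode whose initial (state, action) pair is uniform.

    `sample_step(state, action)` returns a single sampled
    (reward, new_state, done) outcome rather than a full distribution.
    """
    state = rng.randrange(num_states)
    action = rng.randrange(num_actions)  # exploring start: any pair can begin
    episode = []
    for _ in range(max_length):
        reward, new_state, done = sample_step(state, action)
        episode.append((state, action, reward))
        if done:
            break
        state, action = new_state, policy[new_state]
    return episode
```

The returned triples can then feed the same first-visit averaging as in the monte_carlo assignment.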
python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00→ 0.00↑ 0.00↓
python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=10
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00↑ -19.50↑
0.27↓ 0.48← 2.21↓ 8.52↓
python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=50
0.09↓ 0.32↓ 0.22← 0.15↑
0.18↑ -2.43← -5.12↓
0.18↓ 1.80↓ 3.90↓ 9.14↓
python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=100
3.09↓ 2.42← 2.39← 1.17↑
3.74↓ 1.66← 0.18←
3.92→ 5.28→ 7.16→ 11.07↓
python3 policy_iteration_exploring_mc.py --gamma=0.95 --seed=42 --steps=200
7.71↓ 6.76← 6.66← 3.92↑
8.27↓ 6.17← 5.31←
8.88→ 10.12→ 11.36→ 13.92↓
Deadline: Oct 27, 23:59 2 points
Starting with policy_iteration_greedy_mc.py, extend the policy_iteration_exploring_mc assignment to perform policy evaluation by using $ε$-greedy Monte Carlo estimation.

For the sake of replicability, use the provided `GridWorld.epsilon_greedy(epsilon, greedy_action)` method, which returns a random action with probability `epsilon` and otherwise returns the given `greedy_action`.
Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).
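For illustration only (your solution should call the provided method so the results stay reproducible), the behaviour of `epsilon_greedy` corresponds to:

```python
import random

def epsilon_greedy(epsilon, greedy_action, num_actions, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return greedy_action
```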
python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=1
0.00↑ 0.00↑ 0.00↑ 0.00↑
0.00↑ 0.00→ 0.00→
0.00↑ 0.00↑ 0.00→ 0.00→
python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=10
-1.20↓ -1.43← 0.00← -6.00↑
0.78→ -20.26↓ 0.00←
0.09← 0.00↓ -9.80↓ 10.37↓
python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=50
-0.16↓ -0.19← 0.56← -6.30↑
0.13→ -6.99↓ -3.51↓
0.01← 0.00← 3.18↓ 7.57↓
python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=100
-0.07↓ -0.09← 0.28← -4.66↑
0.06→ -5.04↓ -8.32↓
0.00← 0.00← 1.70↓ 4.38↓
python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=200
-0.04↓ -0.04← -0.76← -4.15↑
0.03→ -8.02↓ -5.96↓
0.00← 0.00← 2.53↓ 4.36↓
python3 policy_iteration_greedy_mc.py --gamma=0.95 --seed=42 --steps=500
-0.02↓ -0.02← -0.65← -3.52↑
0.01→ -11.34↓ -8.07↓
0.00← 0.00← 3.15↓ 3.99↓
Deadline: Nov 03, 23:59 2 points
Using the FrozenLake-v0 environment, implement Monte Carlo weighted importance sampling to estimate the state value function $V$ of a target policy, which uniformly chooses either action 1 (down) or action 2 (right), utilizing a behaviour policy, which uniformly chooses among all four actions.
Start with the importance_sampling.py template, which creates the environment and generates episodes according to behaviour policy.
Note that your results may sometimes be slightly different (for example because of varying floating point arithmetic on your CPU).
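The weighted estimator divides the importance-weighted sum of returns by the sum of the weights, $V(s) = \sum_i w_i G_i / \sum_i w_i$. A sketch (the `target_prob`/`behaviour_prob` callables are my own interface, not part of the template):

```python
from collections import defaultdict

def weighted_importance_sampling(episodes, target_prob, behaviour_prob, gamma=1.0):
    """Estimate V under the target policy from behaviour-policy episodes.

    Each episode is a list of (state, action, reward); target_prob(state,
    action) and behaviour_prob(state, action) give action probabilities.
    """
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for episode in episodes:
        g, w = 0.0, 1.0
        updates = []
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            # The ratio covers actions from this time step to the episode end.
            w *= target_prob(state, action) / behaviour_prob(state, action)
            updates.append((state, w, g))
            if w == 0.0:
                break  # earlier visits would also carry zero weight
        for state, w, g in updates:
            numerator[state] += w * g
            denominator[state] += w
    return {s: numerator[s] / denominator[s]
            for s in numerator if denominator[s] > 0}
```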
python3 importance_sampling.py --episodes=500
0.00 0.00 0.00 0.00
0.03 0.00 0.00 0.00
0.22 0.14 0.29 0.00
0.00 0.50 1.00 0.00
python3 importance_sampling.py --episodes=5000
0.00 0.01 0.02 0.00
0.00 0.00 0.08 0.00
0.06 0.08 0.17 0.00
0.00 0.19 0.89 0.00
python3 importance_sampling.py --episodes=50000
0.02 0.01 0.04 0.01
0.03 0.00 0.06 0.00
0.08 0.17 0.24 0.00
0.00 0.34 0.78 0.00
Deadline: Nov 03, 23:59 6 points
Solve the MountainCar-v0 environment from the OpenAI Gym using the Q-learning reinforcement learning algorithm. Note that this task does not require TensorFlow.
The environment methods and properties are described in the monte_carlo assignment.

Once you finish training (which you indicate by passing `start_evaluation=True` to `reset`), your goal is to reach an average return of -150 during 100 evaluation episodes.

You can start with the q_learning.py template, which parses several useful parameters, creates the environment and illustrates the overall usage. Note that setting the hyperparameters of Q-learning is a bit tricky – I usually start with a larger value of $ε$ (like 0.2 or even 0.5) and then gradually decrease it to almost zero.
During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
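Since MountainCar observations are continuous, a tabular solution needs some state discretization; the update itself is the standard Q-learning backup. A sketch (the uniform binning scheme and the names are illustrative, not part of the template):

```python
def discretize(observation, low, high, bins):
    """Map a continuous observation to a tuple of uniform bin indices."""
    return tuple(min(int((o - l) / (h - l) * bins), bins - 1)
                 for o, l, h in zip(observation, low, high))

def q_learning_update(q, state, action, reward, next_state, done, alpha, gamma):
    """One Q-learning backup: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = reward if done else reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
```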
Deadline: Nov 03, 23:59 7 points + 7 bonus
Solve the LunarLander-v2 environment from the OpenAI Gym. Note that this task does not require TensorFlow.
The environment methods and properties are described in the monte_carlo assignment, but include one additional method:
- `expert_trajectory() → initial_state, trajectory`: generates one expert trajectory and returns a pair of `initial_state` and `trajectory`, where `trajectory` is a list of (action, reward, next_state) triples. You can use this method only during training, not during evaluation.

To pass the task, you need to reach an average return of 0 during 100 evaluation episodes. During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.
The task is additionally a competition, and at most 7 points will be awarded according to the relative ordering of your solutions’ performance.
You can start with the lunar_lander.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.
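One possible (purely illustrative) way to exploit the expert data is to replay each trajectory as off-policy Q-learning updates; treating the last transition as terminal is an assumption of this sketch:

```python
def learn_from_expert(q, expert_trajectory, alpha, gamma):
    """Replay one expert trajectory as Q-learning updates.

    `expert_trajectory()` mirrors the environment method: it returns
    (initial_state, trajectory), where trajectory is a list of
    (action, reward, next_state) triples. Assumes the final transition
    ends the episode.
    """
    state, trajectory = expert_trajectory()
    for i, (action, reward, next_state) in enumerate(trajectory):
        done = i == len(trajectory) - 1
        target = reward if done else reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
```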
Installing to central user packages repository
You can install all required packages to the central user packages repository using
`pip3 install --user tensorflow==2.3.1 numpy==1.18.5 gym==0.17.2 box2d==2.3.10`.
Installing to a virtual environment
Python supports virtual environments, which are directories containing independent sets of installed packages. You can create a virtual environment by running `python3 -m venv VENV_DIR`, followed by
`VENV_DIR/bin/pip3 install tensorflow==2.3.1 numpy==1.18.5 gym==0.17.2 box2d==2.3.10`.
What files can be submitted to ReCodEx?
You can submit multiple files of any type to ReCodEx. There is a limit of 20 files per submission, with a total size of 20MB.
What file does ReCodEx execute and what arguments does it use?
Exactly one file with the `py` suffix must contain a line starting with `def main(`. Such a file is imported by ReCodEx and the `main` method is executed (during the import, `__name__ == "__recodex__"`).

The file must also export an argument parser called `parser`. ReCodEx uses its arguments and default values, but overwrites some of the arguments depending on the test being executed – the template should always indicate which arguments are set by ReCodEx and which are left intact.
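A minimal shape of such a file might look as follows (illustrative only – the actual assignment templates already provide this structure, and the argument names here are made up):

```python
import argparse

# ReCodEx imports this module (with __name__ == "__recodex__") and calls
# main(args), overriding some of the parser's defaults per executed test.
parser = argparse.ArgumentParser()
parser.add_argument("--episodes", default=1000, type=int, help="Training episodes.")
parser.add_argument("--seed", default=None, type=int, help="Random seed.")

def main(args):
    # Training and evaluation would go here.
    return args.episodes

if __name__ == "__main__":
    main(parser.parse_args([]))  # a real template would use parse_args()
```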
What are the time and memory limits?
The memory limit during evaluation is 1.5GB. The time limit varies, but should be at least 10 seconds and at least twice the running time of my solution.
To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that up to 40 points above 80 (including the bonus points) will be transferred to the exam.
To pass the exam, you need to obtain at least 60, 75, or 90 points out of the 100-point exam to receive grades 3, 2, or 1, respectively. (PhD students with binary grades require 75 points.) The exam consists of five 20-point questions, which are randomly generated, but always cover the whole course. In addition, you can get at most 40 surplus points from the practicals and at most 10 points for community work (e.g., fixing slides or reporting issues) – but only the points you already have at the time of the exam count.