Deep Reinforcement Learning – Summer 2023/24

In recent years, reinforcement learning has been combined with deep neural networks, giving rise to game agents with super-human performance (for example for Go, chess, or StarCraft II, capable of being trained solely by self-play), datacenter cooling algorithms that are 50% more efficient than trained human operators, or faster code for sorting and matrix multiplication. The goal of the course is to introduce reinforcement learning employing deep neural networks, focusing both on the theory and on practical implementations.

Python programming skills and basic PyTorch/TensorFlow skills are required (the latter can be obtained on the Deep Learning course). No previous knowledge of reinforcement learning is necessary.

About

SIS code: NPFL139
Semester: summer
E-credits: 8
Examination: 3/4 C+Ex
Guarantor: Milan Straka

Timespace Coordinates

  • lecture: the lecture is held on Monday 15:40 in S5; first lecture is on Feb 19
  • practicals: the practicals take place on Wednesday 14:00 in S5; first practicals are on Feb 21
  • consultations: entirely optional consultations take place on Tuesday 15:40 in S4; first consultations are on Feb 27

All lectures and practicals will be recorded and available on this website.

Lectures

1. Introduction to Reinforcement Learning Slides PDF Slides Lecture MonteCarlo Practicals Questions bandits monte_carlo

2. Value and Policy Iteration, Monte Carlo, Temporal Difference Slides PDF Slides Lecture Q-learning Practicals Questions policy_iteration policy_iteration_exact policy_iteration_mc_estarts policy_iteration_mc_egreedy q_learning

3. Off-Policy Methods, N-step, Function Approximation Slides PDF Slides Lecture Practicals Questions importance_sampling td_algorithms q_learning_tiles lunar_lander

4. Function Approximation, Deep Q Network, Rainbow Slides PDF Slides Lecture Practicals Questions q_network car_racing

5. Rainbow II, Distributional RL Slides PDF Slides Lecture Practicals Questions

6. Policy Gradient Methods Slides PDF Slides Lecture Practicals Questions reinforce reinforce_baseline cart_pole_pixels

7. Easter Monday

8. PAAC, DDPG, TD3, SAC Slides PDF Slides Lecture Questions paac paac_continuous ddpg

9. Eligibility traces, Impala Slides PDF Slides Lecture Practicals Questions walker walker_hardcore cheetah humanoid humanoid_standup

10. PPO, R2D2, Agent57 Slides PDF Slides Lecture Practicals trace_algorithms ppo

License

Unless otherwise stated, teaching materials for this course are available under CC BY-SA 4.0.

The lecture content, including references to study materials.

The main study material is Reinforcement Learning: An Introduction, second edition, by Richard S. Sutton and Andrew G. Barto (referred to as RLB). It is available online and also as a hardcopy.

References to study materials cover all theory required at the exam, and sometimes even more – the references in italics cover topics not required for the exam.

1. Introduction to Reinforcement Learning

 Feb 19 Slides PDF Slides Lecture MonteCarlo Practicals Questions bandits monte_carlo

  • History of RL [Chapter 1 of RLB]
  • Multi-armed bandits [Sections 2-2.6 of RLB]
  • Markov Decision Process [Sections 3-3.3 of RLB]
  • Policies and Value Functions [Sections 3.5-3.6 of RLB]

2. Value and Policy Iteration, Monte Carlo, Temporal Difference

 Feb 26 Slides PDF Slides Lecture Q-learning Practicals Questions policy_iteration policy_iteration_exact policy_iteration_mc_estarts policy_iteration_mc_egreedy q_learning

  • Value Iteration [Sections 4 and 4.4 of RLB]
    • Proof of convergence only in slides
  • Policy Iteration [Sections 4.1-4.3 of RLB]
  • Generalized Policy Iteration [Section 4.6 of RLB]
  • Monte Carlo Methods [Sections 5-5.4 of RLB]
  • Model-free and model-based methods, using state-value or action-value functions [Chapter 8 before Section 8.1, and Section 6.8 of RLB]
  • Temporal-difference methods [Sections 6-6.3 of RLB]
  • Sarsa [Section 6.4 of RLB]
  • Q-learning [Section 6.5 of RLB]

3. Off-Policy Methods, N-step, Function Approximation

 Mar 4 Slides PDF Slides Lecture Practicals Questions importance_sampling td_algorithms q_learning_tiles lunar_lander

  • Off-policy Monte Carlo Methods [Sections 5.5-5.7 of RLB]
  • Expected Sarsa [Section 6.6 of RLB]
  • N-step TD policy evaluation [Section 7.1 of RLB]
  • N-step Sarsa [Section 7.2 of RLB]
  • Off-policy n-step Sarsa [Section 7.3 of RLB]
  • Tree backup algorithm [Section 7.5 of RLB]
  • Function approximation [Sections 9-9.3 of RLB]
  • Tile coding [Section 9.5.4 of RLB]
  • Linear function approximation [Section 9.4 of RLB, without the Proof of Convergence of Linear TD(0)]

4. Function Approximation, Deep Q Network, Rainbow

 Mar 11 Slides PDF Slides Lecture Practicals Questions q_network car_racing

5. Rainbow II, Distributional RL

 Mar 18 Slides PDF Slides Lecture Practicals Questions

6. Policy Gradient Methods

 Mar 25 Slides PDF Slides Lecture Practicals Questions reinforce reinforce_baseline cart_pole_pixels

7. Easter Monday

 Apr 01

8. PAAC, DDPG, TD3, SAC

 Apr 08 Slides PDF Slides Lecture Questions paac paac_continuous ddpg

9. Eligibility traces, Impala

 Apr 15 Slides PDF Slides Lecture Practicals Questions walker walker_hardcore cheetah humanoid humanoid_standup

10. PPO, R2D2, Agent57

 Apr 22 Slides PDF Slides Lecture Practicals trace_algorithms ppo

Requirements

To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that all surplus points (both bonus and non-bonus) will be transferred to the exam. In total, assignments for at least 120 points (not including the bonus points) will be available, and if you solve all the assignments (any non-zero amount of points counts as solved), you automatically pass the exam with grade 1.

Environment

The tasks are evaluated automatically using the ReCodEx Code Examiner.

The evaluation is performed using Python 3.11, Gymnasium 1.0.0a1 and PyTorch 2.2.0. You should install the exact versions of these packages yourselves. Additionally, ReCodEx also provides TensorFlow 2.16.1, TensorFlow Probability 0.24.0, Keras 3, and JAX 0.4.25.

Teamwork

Solving assignments in teams (of size at most 3) is encouraged, but everyone has to participate (it is forbidden not to work on an assignment and then submit a solution created by other team members). All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. Each such solution must explicitly list all members of the team to allow plagiarism detection using this template.

No Cheating

Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty. While discussing assignments with any classmate is fine, each team must complete the assignments themselves, without using code they did not write (unless explicitly allowed). Of course, inside a team you are allowed to share code and submit identical solutions. Note that all students involved in cheating will be punished, so if you share your source code with a friend, both you and your friend will be punished. That also means that you should never publish your solutions.

bandits

 Deadline: Mar 05, 22:00  3 points

Implement the ε-greedy strategy for solving multi-armed bandits.

Start with the bandits.py template, which defines the MultiArmedBandits environment with the following three methods:

  • reset(): reset the environment
  • step(action) → reward: perform the chosen action in the environment, obtaining a reward
  • greedy(epsilon): return True with probability 1-epsilon

Your goal is to implement the following solution variants:

  • alpha=0: perform ε-greedy search, updating the estimates using averaging.
  • alpha≠0: perform ε-greedy search, updating the estimates using a fixed learning rate alpha.

Note that the initial estimates should be set to a given value and epsilon can be zero, in which case purely greedy actions are used.
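
A minimal sketch of the two update variants follows; it uses the provided reset/step/greedy methods, but the function name, its arguments, and the number of arms are illustrative assumptions, not the template's API.

  import numpy as np

  # A hedged sketch of epsilon-greedy bandits; only env.reset/step/greedy
  # come from the template, everything else is an assumption.
  def run_bandit(env, num_arms, steps, epsilon, alpha, initial):
      estimates = np.full(num_arms, initial, dtype=np.float64)
      counts = np.zeros(num_arms, dtype=np.int64)

      env.reset()
      for _ in range(steps):
          if env.greedy(epsilon):
              action = int(np.argmax(estimates))      # greedy action
          else:
              action = np.random.randint(num_arms)    # explore uniformly
          reward = env.step(action)

          counts[action] += 1
          if alpha == 0:
              estimates[action] += (reward - estimates[action]) / counts[action]
          else:
              estimates[action] += alpha * (reward - estimates[action])
      return estimates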

Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.

  1. python3 bandits.py --alpha=0 --epsilon=0.1 --initial=0
1.39 0.08
  1. python3 bandits.py --alpha=0 --epsilon=0 --initial=1
1.48 0.22
  1. python3 bandits.py --alpha=0.15 --epsilon=0.1 --initial=0
1.37 0.09
  1. python3 bandits.py --alpha=0.15 --epsilon=0 --initial=1
1.52 0.04

monte_carlo

 Deadline: Mar 05, 22:00  5 points

Solve the discretized CartPole-v1 environment from the Gymnasium library using the Monte Carlo reinforcement learning algorithm. The gymnasium environments have the following methods and properties:

  • observation_space: the description of environment observations
  • action_space: the description of environment actions
  • reset() → new_state, info: starts a new episode, returning the new state and additional environment-specific information
  • step(action) → new_state, reward, terminated, truncated, info: perform the chosen action in the environment, returning the new state, obtained reward, boolean flags indicating a terminal state and episode truncation, and additional environment-specific information

We additionally extend the gymnasium environment by:

  • episode: number of the current episode (zero-based)
  • reset(start_evaluation=False) → new_state, info: if start_evaluation is True, an evaluation is started

Once you finish training (which you indicate by passing start_evaluation=True to reset), your goal is to reach an average return of 490 during 100 evaluation episodes. Note that the environment prints your 100-episode average return every 10 episodes even during training.
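
For illustration, a minimal sketch of collecting one episode through the API described above; the Monte Carlo estimation itself, the state discretization, and the policy are left to you, and the helper name is an assumption.

  # A hedged sketch of one episode of interaction; `policy` is any callable
  # mapping a (discretized) state to an action.
  def collect_episode(env, policy, start_evaluation=False):
      states, actions, rewards = [], [], []
      state, info = env.reset(start_evaluation=start_evaluation)
      done = False
      while not done:
          action = policy(state)
          next_state, reward, terminated, truncated, info = env.step(action)
          states.append(state)
          actions.append(action)
          rewards.append(reward)
          done = terminated or truncated
          state = next_state
      return states, actions, rewards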

Start with the monte_carlo.py template, which parses several useful parameters, creates the environment and illustrates the overall usage. You will also need the wrappers.py module, which wraps the standard gymnasium API with the above-mentioned added features we use.

During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.

policy_iteration

 Deadline: Mar 12, 22:00  2 points

Consider the following gridworld:

Gridworld example

Start with policy_iteration.py, which implements the gridworld mechanics by providing the following methods:

  • GridWorld.states: return the number of states (11)
  • GridWorld.actions: return a list with labels of the actions (["↑", "→", "↓", "←"])
  • GridWorld.step(state, action): return possible outcomes of performing the action in a given state, as a list of triples containing
    • probability: probability of the outcome
    • reward: reward of the outcome
    • new_state: new state of the outcome

Implement the policy iteration algorithm, with --steps steps of policy evaluation/policy improvement. During policy evaluation, use the current value function and perform --iterations applications of the Bellman equation. Perform the policy evaluation asynchronously (i.e., update the value function in-place for states 0, 1, …). Assume the initial policy is “go North” and the initial value function is zero.
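
A minimal sketch of the asynchronous (in-place) policy evaluation step, using only the GridWorld interface described above; the helper name is an assumption.

  # A hedged sketch of in-place policy evaluation: each sweep applies the
  # Bellman expectation backup to states 0, 1, ..., immediately reusing
  # the freshly updated values.
  def evaluate_policy_in_place(env, value, policy, gamma, iterations):
      for _ in range(iterations):
          for s in range(env.states):
              value[s] = sum(
                  probability * (reward + gamma * value[next_state])
                  for probability, reward, next_state in env.step(s, policy[s])
              )
      return value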

Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.

  1. python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=1
    0.00↑    0.00↑    0.00↑    0.00↑
    0.00↑           -10.00←  -10.95↑
    0.00↑    0.00←   -7.50←  -88.93←
  1. python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=2
    0.00↑    0.00↑    0.00↑    0.00↑
    0.00↑            -8.31←  -11.83←
    0.00↑    0.00←   -1.50←  -20.61←
  1. python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=3
    0.00↑    0.00↑    0.00↑    0.00↑
    0.00↑            -6.46←   -6.77←
    0.00↑    0.00←   -0.76←  -13.08↓
  1. python3 policy_iteration.py --gamma=0.95 --iterations=1 --steps=10
    0.00↑    0.00↑    0.00↑    0.00↑
    0.00↑            -1.04←   -0.83←
    0.00↑    0.00←   -0.11→   -0.34↓
  1. python3 policy_iteration.py --gamma=0.95 --iterations=10 --steps=10
   11.93↓   11.19←   10.47←    6.71↑
   12.83↓            10.30←   10.12←
   13.70→   14.73→   15.72→   16.40↓
  1. python3 policy_iteration.py --gamma=1 --iterations=1 --steps=100
   74.73↓   74.50←   74.09←   65.95↑
   75.89↓            72.63←   72.72←
   77.02→   78.18→   79.31→   80.16↓

policy_iteration_exact

 Deadline: Mar 12, 22:00  2 points

Starting with policy_iteration_exact.py, extend the policy_iteration assignment to perform policy evaluation exactly by solving a system of linear equations. Note that you need to use 64-bit floats because lower precision results in unacceptable error.
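
A minimal sketch of the exact evaluation, solving v = r_π + γ P_π v in 64-bit floats; the helper name is an assumption.

  import numpy as np

  # A hedged sketch of exact policy evaluation: build the expected rewards r
  # and the transition matrix P induced by the policy, then solve the linear
  # system (I - gamma * P) v = r in float64.
  def evaluate_policy_exactly(env, policy, gamma):
      n = env.states
      P = np.zeros((n, n), dtype=np.float64)
      r = np.zeros(n, dtype=np.float64)
      for s in range(n):
          for probability, reward, next_state in env.step(s, policy[s]):
              P[s, next_state] += probability
              r[s] += probability * reward
      return np.linalg.solve(np.eye(n) - gamma * P, r)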

Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.

  1. python3 policy_iteration_exact.py --gamma=0.95 --steps=1
   -0.00↑   -0.00↑   -0.00↑   -0.00↑
   -0.00↑           -12.35←  -12.35↑
   -0.85←   -8.10←  -19.62← -100.71←
  1. python3 policy_iteration_exact.py --gamma=0.95 --steps=2
    0.00↑    0.00↑    0.00↑    0.00↑
    0.00↑             0.00←  -11.05←
   -0.00↑   -0.00↑   -0.00←  -12.10↓
  1. python3 policy_iteration_exact.py --gamma=0.95 --steps=3
   -0.00↑    0.00↑    0.00↑    0.00↑
   -0.00↑            -0.00←    0.69←
   -0.00↑   -0.00↑   -0.00→    6.21↓
  1. python3 policy_iteration_exact.py --gamma=0.95 --steps=4
   -0.00↑    0.00↑    0.00↓    0.00↑
   -0.00↓             5.91←    6.11←
    0.65→    6.17→   14.93→   15.99↓
  1. python3 policy_iteration_exact.py --gamma=0.95 --steps=5
    2.83↓    4.32→    8.09↓    5.30↑
   12.92↓             9.44←    9.35←
   13.77→   14.78→   15.76→   16.53↓
  1. python3 policy_iteration_exact.py --gamma=0.95 --steps=6
   11.75↓    8.15←    8.69↓    5.69↑
   12.97↓             9.70←    9.59←
   13.82→   14.84→   15.82→   16.57↓
  1. python3 policy_iteration_exact.py --gamma=0.95 --steps=7
   12.12↓   11.37←    9.19←    6.02↑
   13.01↓             9.92←    9.79←
   13.87→   14.89→   15.87→   16.60↓
  1. python3 policy_iteration_exact.py --gamma=0.95 --steps=8
   12.24↓   11.49←   10.76←    7.05↑
   13.14↓            10.60←   10.42←
   14.01→   15.04→   16.03→   16.71↓
  1. python3 policy_iteration_exact.py --gamma=0.9999 --steps=5
 7385.23↓ 7392.62→ 7407.40↓ 7400.00↑
 7421.37↓          7411.10← 7413.16↓
 7422.30→ 7423.34→ 7424.27→ 7425.84↓

policy_iteration_mc_estarts

 Deadline: Mar 12, 22:00  2 points

Starting with policy_iteration_mc_estarts.py, extend the policy_iteration assignment to perform policy evaluation by using Monte Carlo estimation with exploring starts. Specifically, we update the action-value function q_π(s, a) by running a simulation with a given number of steps and using the observed return as its estimate.

The estimation can now be performed model-free (without access to the full MDP dynamics); therefore, GridWorld.step returns a randomly sampled outcome instead of a full distribution.

Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.

  1. python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --mc_length=100 --steps=1
    0.00↑    0.00↑    0.00↑    0.00↑
    0.00↑             0.00↑    0.00↑
    0.00↑    0.00→    0.00↑    0.00↓
  1. python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --mc_length=100 --steps=10
    0.00↑    0.00↑    0.00↑    0.00↑
    0.00↑             0.00↑  -19.50↑
    0.27↓    0.48←    2.21↓    8.52↓
  1. python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --mc_length=100 --steps=50
    0.09↓    0.32↓    0.22←    0.15↑
    0.18↑            -2.43←   -5.12↓
    0.18↓    1.80↓    3.90↓    9.14↓
  1. python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --mc_length=100 --steps=100
    3.09↓    2.42←    2.39←    1.17↑
    3.74↓             1.66←    0.18←
    3.92→    5.28→    7.16→   11.07↓
  1. python3 policy_iteration_mc_estarts.py --gamma=0.95 --seed=42 --mc_length=100 --steps=200
    7.71↓    6.76←    6.66←    3.92↑
    8.27↓             6.17←    5.31←
    8.88→   10.12→   11.36→   13.92↓

policy_iteration_mc_egreedy

 Deadline: Mar 12, 22:00  2 points

Starting with policy_iteration_mc_egreedy.py, extend the policy_iteration_mc_estarts assignment to perform policy evaluation by using ε-greedy Monte Carlo estimation. Specifically, we update the action-value function q_π(s, a) by running a simulation with a given number of steps and using the observed return as its estimate.

For the sake of replicability, use the provided GridWorld.epsilon_greedy(epsilon, greedy_action) method, which returns a random action with probability epsilon and otherwise returns the given greedy_action.

Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.

  1. python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --mc_length=100 --steps=1
    0.00↑    0.00↑    0.00↑    0.00↑
    0.00↑             0.00→    0.00→
    0.00↑    0.00↑    0.00→    0.00→
  1. python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --mc_length=100 --steps=10
   -1.20↓   -1.43←    0.00←   -6.00↑
    0.78→           -20.26↓    0.00←
    0.09←    0.00↓   -9.80↓   10.37↓
  1. python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --mc_length=100 --steps=50
   -0.16↓   -0.19←    0.56←   -6.30↑
    0.13→            -6.99↓   -3.51↓
    0.01←    0.00←    3.18↓    7.57↓
  1. python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --mc_length=100 --steps=100
   -0.07↓   -0.09←    0.28←   -4.66↑
    0.06→            -5.04↓   -8.32↓
    0.00←    0.00←    1.70↓    4.38↓
  1. python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --mc_length=100 --steps=200
   -0.04↓   -0.04←   -0.76←   -4.15↑
    0.03→            -8.02↓   -5.96↓
    0.00←    0.00←    2.53↓    4.36↓
  1. python3 policy_iteration_mc_egreedy.py --gamma=0.95 --seed=42 --mc_length=100 --steps=500
   -0.02↓   -0.02←   -0.65←   -3.52↑
    0.01→           -11.34↓   -8.07↓
    0.00←    0.00←    3.15↓    3.99↓

q_learning

 Deadline: Mar 12, 22:00  4 points

Solve the discretized MountainCar-v0 environment from the Gymnasium library using the Q-learning reinforcement learning algorithm. Note that this task still does not require PyTorch.

The environment methods and properties are described in the monte_carlo assignment. Once you finish training (which you indicate by passing start_evaluation=True to reset), your goal is to reach an average return of -150 during 100 evaluation episodes.

You can start with the q_learning.py template, which parses several useful parameters, creates the environment and illustrates the overall usage. Note that setting hyperparameters of Q-learning is a bit tricky – I usually start with a larger value of ε (like 0.2 or even 0.5) and then gradually decrease it to almost zero.
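
A minimal sketch of the tabular update and one possible ε schedule; the table shape, the schedule endpoints, and the helper names are assumptions.

  import numpy as np

  # A hedged sketch of one Q-learning update on a tabular Q.
  def q_learning_update(Q, state, action, reward, next_state, done, alpha, gamma):
      target = reward + (0 if done else gamma * np.max(Q[next_state]))
      Q[state, action] += alpha * (target - Q[state, action])

  # One possible epsilon schedule: linear decay over the first `decay_episodes`.
  def current_epsilon(episode, start=0.5, final=0.01, decay_episodes=1000):
      return np.interp(episode, [0, decay_episodes], [start, final])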

During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.

importance_sampling

 Deadline: Mar 19, 22:00  2 points

Using the FrozenLake-v1 environment, implement Monte Carlo weighted importance sampling to estimate the state value function V of the target policy, which uniformly chooses either action 1 (down) or action 2 (right), from episodes generated by a behaviour policy that uniformly chooses among all four actions.

Start with the importance_sampling.py template, which creates the environment and generates episodes according to the behaviour policy.
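
A minimal sketch of the weighted importance sampling update for V, assuming an episode is available as a list of (state, action, reward) triples; the episode format and variable names are assumptions, and `env` is the environment created by the template.

  import numpy as np

  # A hedged sketch of weighted importance sampling for the state value function.
  # The target policy is uniform over actions {1, 2}; the behaviour policy is
  # uniform over all four actions, so the per-step ratio is either 2 or 0.
  V = np.zeros(env.observation_space.n)
  C = np.zeros(env.observation_space.n)   # cumulative importance sampling weights

  def update_from_episode(episode, gamma=1.0):
      G, W = 0.0, 1.0
      for state, action, reward in reversed(episode):
          G = gamma * G + reward
          W *= (0.5 if action in (1, 2) else 0.0) / 0.25
          if W == 0:
              break
          C[state] += W
          V[state] += W / C[state] * (G - V[state])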

Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.

  1. python3 importance_sampling.py --episodes=200
 0.00  0.00  0.24  0.32
 0.00  0.00  0.40  0.00
 0.00  0.00  0.20  0.00
 0.00  0.00  0.22  0.00
  1. python3 importance_sampling.py --episodes=5000
 0.03  0.00  0.01  0.03
 0.04  0.00  0.09  0.00
 0.10  0.24  0.23  0.00
 0.00  0.44  0.49  0.00
  1. python3 importance_sampling.py --episodes=50000
 0.03  0.02  0.05  0.01
 0.13  0.00  0.07  0.00
 0.21  0.33  0.36  0.00
 0.00  0.35  0.76  0.00

td_algorithms

 Deadline: Mar 19, 22:00  4 points

Starting with the td_algorithms.py template, implement all of the following n-step TD method variants:

  • SARSA, expected SARSA and Tree backup;
  • either on-policy (with an ε-greedy behaviour policy) or off-policy (with the same behaviour policy, but a greedy target policy); a sketch of the one-step expected Sarsa target follows this list.
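
As a starting point, a minimal sketch of the one-step expected Sarsa target under an ε-greedy policy; the n-step and tree backup generalizations are what the assignment asks for, and the function name and arguments are assumptions.

  import numpy as np

  # A hedged sketch of the one-step expected Sarsa target: the expectation of
  # Q over the epsilon-greedy policy in the next state.
  def expected_sarsa_target(Q, next_state, reward, done, gamma, epsilon):
      num_actions = Q.shape[1]
      policy = np.full(num_actions, epsilon / num_actions)
      policy[np.argmax(Q[next_state])] += 1 - epsilon
      expectation = float(np.dot(policy, Q[next_state]))
      return reward + (0 if done else gamma * expectation)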

Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.

  1. python3 td_algorithms.py --episodes=10 --mode=sarsa --n=1
Episode 10, mean 100-episode return -652.70 +-37.77
  1. python3 td_algorithms.py --episodes=10 --mode=sarsa --n=1 --off_policy
Episode 10, mean 100-episode return -632.90 +-126.41
  1. python3 td_algorithms.py --episodes=10 --mode=sarsa --n=4
Episode 10, mean 100-episode return -715.70 +-156.56
  1. python3 td_algorithms.py --episodes=10 --mode=sarsa --n=4 --off_policy
Episode 10, mean 100-episode return -649.10 +-171.73
  1. python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=1
Episode 10, mean 100-episode return -641.90 +-122.11
  1. python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=1 --off_policy
Episode 10, mean 100-episode return -633.80 +-63.61
  1. python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=4
Episode 10, mean 100-episode return -713.90 +-107.05
  1. python3 td_algorithms.py --episodes=10 --mode=expected_sarsa --n=4 --off_policy
Episode 10, mean 100-episode return -648.20 +-107.08
  1. python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=1
Episode 10, mean 100-episode return -641.90 +-122.11
  1. python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=1 --off_policy
Episode 10, mean 100-episode return -633.80 +-63.61
  1. python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=4
Episode 10, mean 100-episode return -663.50 +-111.78
  1. python3 td_algorithms.py --episodes=10 --mode=tree_backup --n=4 --off_policy
Episode 10, mean 100-episode return -708.50 +-125.63

Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.

  • python3 td_algorithms.py --mode=sarsa --n=1
Episode 200, mean 100-episode return -235.23 +-92.94
Episode 400, mean 100-episode return -133.18 +-98.63
Episode 600, mean 100-episode return -74.19 +-70.39
Episode 800, mean 100-episode return -41.84 +-54.53
Episode 1000, mean 100-episode return -31.96 +-52.14
  • python3 td_algorithms.py --mode=sarsa --n=1 --off_policy
Episode 200, mean 100-episode return -227.81 +-91.62
Episode 400, mean 100-episode return -131.29 +-90.07
Episode 600, mean 100-episode return -65.35 +-64.78
Episode 800, mean 100-episode return -34.65 +-44.93
Episode 1000, mean 100-episode return -8.70 +-25.74
  • python3 td_algorithms.py --mode=sarsa --n=4
Episode 200, mean 100-episode return -277.55 +-146.18
Episode 400, mean 100-episode return -87.11 +-152.12
Episode 600, mean 100-episode return -6.95 +-23.28
Episode 800, mean 100-episode return -1.88 +-19.21
Episode 1000, mean 100-episode return 0.97 +-11.76
  • python3 td_algorithms.py --mode=sarsa --n=4 --off_policy
Episode 200, mean 100-episode return -339.11 +-144.40
Episode 400, mean 100-episode return -172.44 +-176.79
Episode 600, mean 100-episode return -36.23 +-100.93
Episode 800, mean 100-episode return -22.43 +-81.29
Episode 1000, mean 100-episode return -3.95 +-17.78
  • python3 td_algorithms.py --mode=expected_sarsa --n=1
Episode 200, mean 100-episode return -223.35 +-102.16
Episode 400, mean 100-episode return -143.82 +-96.71
Episode 600, mean 100-episode return -79.92 +-68.88
Episode 800, mean 100-episode return -38.53 +-47.12
Episode 1000, mean 100-episode return -17.41 +-31.26
  • python3 td_algorithms.py --mode=expected_sarsa --n=1 --off_policy
Episode 200, mean 100-episode return -231.91 +-87.72
Episode 400, mean 100-episode return -136.19 +-94.16
Episode 600, mean 100-episode return -79.65 +-70.75
Episode 800, mean 100-episode return -35.42 +-44.91
Episode 1000, mean 100-episode return -11.79 +-23.46
  • python3 td_algorithms.py --mode=expected_sarsa --n=4
Episode 200, mean 100-episode return -263.10 +-161.97
Episode 400, mean 100-episode return -102.52 +-162.03
Episode 600, mean 100-episode return -7.13 +-24.53
Episode 800, mean 100-episode return -1.69 +-12.21
Episode 1000, mean 100-episode return -1.53 +-11.04
  • python3 td_algorithms.py --mode=expected_sarsa --n=4 --off_policy
Episode 200, mean 100-episode return -376.56 +-116.08
Episode 400, mean 100-episode return -292.35 +-166.14
Episode 600, mean 100-episode return -173.83 +-194.11
Episode 800, mean 100-episode return -89.57 +-153.70
Episode 1000, mean 100-episode return -54.60 +-127.73
  • python3 td_algorithms.py --mode=tree_backup --n=1
Episode 200, mean 100-episode return -223.35 +-102.16
Episode 400, mean 100-episode return -143.82 +-96.71
Episode 600, mean 100-episode return -79.92 +-68.88
Episode 800, mean 100-episode return -38.53 +-47.12
Episode 1000, mean 100-episode return -17.41 +-31.26
  • python3 td_algorithms.py --mode=tree_backup --n=1 --off_policy
Episode 200, mean 100-episode return -231.91 +-87.72
Episode 400, mean 100-episode return -136.19 +-94.16
Episode 600, mean 100-episode return -79.65 +-70.75
Episode 800, mean 100-episode return -35.42 +-44.91
Episode 1000, mean 100-episode return -11.79 +-23.46
  • python3 td_algorithms.py --mode=tree_backup --n=4
Episode 200, mean 100-episode return -270.51 +-134.35
Episode 400, mean 100-episode return -64.27 +-109.50
Episode 600, mean 100-episode return -1.80 +-13.34
Episode 800, mean 100-episode return -0.22 +-13.14
Episode 1000, mean 100-episode return 0.60 +-9.37
  • python3 td_algorithms.py --mode=tree_backup --n=4 --off_policy
Episode 200, mean 100-episode return -248.56 +-147.74
Episode 400, mean 100-episode return -68.60 +-126.13
Episode 600, mean 100-episode return -6.25 +-32.23
Episode 800, mean 100-episode return -0.53 +-11.82
Episode 1000, mean 100-episode return 2.33 +-8.35

q_learning_tiles

 Deadline: Mar 19, 22:00  3 points

Improve the q_learning task performance on the MountainCar-v0 environment using linear function approximation with tile coding. Your goal is to reach an average reward of -110 during 100 evaluation episodes.

The environment methods are described in the q_learning assignment, with the following changes:

  • The state returned by the env.step method is a list containing weight indices of the current state (i.e., the feature vector of the state consists of zeros and ones, and only the indices of the ones are returned). The action-value function is therefore approximated as a sum of the weights whose indices are returned by env.step.
  • The env.observation_space.nvec returns a list, where the i-th element is the number of weights used by the first i elements of the state. Notably, env.observation_space.nvec[-1] is the total number of weights.

You can start with the q_learning_tiles.py template, which parses several useful parameters and creates the environment. Implementing Q-learning is enough to pass the assignment, even though both n-step Sarsa and Tree Backup converge a little faster. The default number of tiles in tile coding (i.e., the size of the list with weight indices) is args.tiles=8, but you can use any number you want (though the assignment is solvable with 8).
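
A minimal sketch of the sum-of-weights approximation described above; the weight matrix layout, the per-tile learning-rate scaling, and the use of `env` at module level are assumptions.

  import numpy as np

  # A hedged sketch of the linear approximation with tile coding: the value of
  # (state, action) is the sum of the weights at the active tile indices.
  W = np.zeros([env.observation_space.nvec[-1], env.action_space.n])

  def q_value(state, action):
      return W[state, action].sum()       # `state` is the list of weight indices

  def q_learning_tiles_update(state, action, target, alpha, tiles):
      # every active tile weight has gradient 1, so the step is spread over the tiles
      W[state, action] += alpha / tiles * (target - q_value(state, action))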

During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. The time limit for each test is 5 minutes.

lunar_lander

 Deadline: Mar 19, 22:00  5 points + 5 bonus

Solve the LunarLander-v2 environment from the Gymnasium library. Note that this task does not require PyTorch.

The environment methods and properties are described in the monte_carlo assignment, and one additional method is available:

  • expert_trajectory(seed=None) → trajectory: This method generates one expert trajectory, where trajectory is a list of triples (state, action, reward), in which the action and reward are None at the terminal state. If a seed is given, the expert trajectory random generator is reset before generating the trajectory.

    You cannot change the implementation of this method or use its internals in any other way than just calling expert_trajectory(). Furthermore, you can use this method only during training, not during evaluation.

To pass the task, you need to reach an average return of 0 during 1000 evaluation episodes. During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 15 minutes.

The task is additionally a competition, and at most 5 points will be awarded according to the relative ordering of your solutions.

You can start with the lunar_lander.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.

In the competition, you should consider the meaning of the environment states to be unknown, so you cannot use knowledge of how they are created. However, you can learn any such information from the data.

q_network

 Deadline: Mar 26, 22:00  5 points

Solve the continuous CartPole-v1 environment from the Gymnasium library using Q-learning with a neural network as the function approximator.

You can start with the q_network.py template, which provides a simple network implementation in PyTorch. Feel free to use TensorFlow or JAX instead, if you like.

The continuous environment is very similar to a discrete one, except that the states are vectors of real-valued observations with shape env.observation_space.shape.

Use Q-learning with a neural network as the function approximator, which for a given state returns state-action values for all actions. You can use any network architecture, but one hidden layer of several dozen ReLU units is a good start. Your goal is to reach an average return of 450 during 100 evaluation episodes.
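
A minimal sketch of such a network and the corresponding Q-learning targets in PyTorch; the layer sizes, the names, and whether you use a separate target network are assumptions, not requirements.

  import torch

  # A hedged sketch of a small Q-network returning values for all actions.
  class QNetwork(torch.nn.Module):
      def __init__(self, observation_dim, num_actions, hidden=64):
          super().__init__()
          self.model = torch.nn.Sequential(
              torch.nn.Linear(observation_dim, hidden),
              torch.nn.ReLU(),
              torch.nn.Linear(hidden, num_actions),
          )

      def forward(self, states):
          return self.model(states)

  # Q-learning targets from a batch of transitions; `dones` is a 0/1 float tensor.
  def q_learning_targets(network, rewards, next_states, dones, gamma):
      with torch.no_grad():
          next_values = network(next_states).max(dim=-1).values
      return rewards + gamma * next_values * (1 - dones)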

During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes (so you can train in ReCodEx, but you can also pretrain your network if you like).

car_racing

 Deadline: Mar 26, 22:00  6 points + 6 bonus

The goal of this competition is to use Deep Q Networks (and any of Rainbow improvements) on a more real-world CarRacing-v2 environment from the Gymnasium library.

The supplied car_racing_environment.py provides the environment. The states are RGB np.uint8 images of size 96×96×3, but you can downsample them even more. The actions are continuous and consist of an array with the following three elements:

  • steer in range [-1, 1]
  • gas in range [0, 1]
  • brake in range [0, 1]; note that full brake is quite aggressive, so you might consider using less force when braking.

Internally, you should probably generate discrete actions and convert them to the required representation before the step call (a sketch of such a conversion follows this paragraph). Alternatively, you might set args.continuous=0, which changes the action space from continuous to 5 discrete actions – do nothing, steer left, steer right, gas, and brake. But you can experiment with different action spaces if you want.
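A minimal sketch of such a discrete-to-continuous mapping; the particular discrete action set is an assumption.

  # A hedged sketch of converting a discrete action index into the continuous
  # [steer, gas, brake] array expected by the environment.
  DISCRETE_ACTIONS = [
      [ 0.0, 0.0, 0.0],   # do nothing
      [-1.0, 0.0, 0.0],   # steer left
      [ 1.0, 0.0, 0.0],   # steer right
      [ 0.0, 1.0, 0.0],   # gas
      [ 0.0, 0.0, 0.5],   # brake with reduced force (full brake is aggressive)
  ]

  def to_continuous(action_index):
      return DISCRETE_ACTIONS[action_index]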

The environment also supports frame skipping (args.frame_skipping), which improves its performance (only some frames need to be rendered). Note that ReCodEx respects both args.continuous and args.frame_skipping during evaluation.

In ReCodEx, you are expected to submit an already trained model, which is evaluated on 15 different tracks with a total time limit of 15 minutes. If your average return is at least 300, you obtain 6 points. The task is also a competition, and at most 6 points will be awarded according to relative ordering of your solutions.

The car_racing.py template parses several useful parameters and creates the environment. Note that the car_racing_environment.py can be executed directly and in that case you can drive the car using arrows.

Also, you might want to use a vectorized version of the environment for training, which runs several individual environments in separate processes. The template contains instructions on how to create it. The vectorized environment expects a vector of actions and returns a vector of observations, rewards, dones, and infos. When one of the environments finishes, it is automatically reset in the next step.

reinforce

 Deadline: Apr 09, 22:00  4 points

Solve the continuous CartPole-v1 environment from the Gymnasium library using the REINFORCE algorithm.

Your goal is to reach an average return of 490 during 100 evaluation episodes.

Start with the reinforce.py template, which provides a simple network implementation in PyTorch. Feel free to use TensorFlow (reinforce.tf.py) or JAX instead, if you like.
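
A minimal sketch of the REINFORCE loss in PyTorch; the tensor names are assumptions, with `returns` holding the whole-episode return for each collected step.

  import torch

  # A hedged sketch of the REINFORCE policy gradient loss for one batch of steps.
  def reinforce_loss(logits, actions, returns):
      log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
      # minimize the negative expected return weighted by the action log-probabilities
      return -(log_probs * returns).mean()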

During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.

reinforce_baseline

 Deadline: Apr 09, 22:00  3 points

This is a continuation of the reinforce assignment.

Using the reinforce_baseline.py template, solve the continuous CartPole-v1 environment using the REINFORCE with baseline algorithm. The TensorFlow version of the template reinforce_baseline.tf.py is also available.

Using a baseline lowers the variance of the policy gradient estimator, which allows faster training and decreases sensitivity to hyperparameter values. To reflect this effect in ReCodEx, note that the evaluation phase will automatically start after 200 episodes. Using only 200 episodes for training in this setting is probably too little for the REINFORCE algorithm, but suffices for the variant with a baseline. In this assignment, you must train your agent in ReCodEx using the provided environment only.

Your goal is to reach an average return of 490 during 100 evaluation episodes.

During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.

cart_pole_pixels

 Deadline: Apr 09, 22:00  3 points + 4 bonus

The supplied cart_pole_pixels_environment.py generates a pixel representation of the CartPole environment as an 80×80 np.uint8 image with three channels, with each channel representing one time step (i.e., the current observation and the two previous ones).

During evaluation in ReCodEx, three different random seeds will be employed, each with a time limit of 10 minutes, and if you reach an average return of at least 450 on all of them, you obtain 3 points. The task is also a competition, and at most 4 points will be awarded according to relative ordering of your solutions.

The cart_pole_pixels.py template parses several parameters and creates the environment. You are again supposed to train the model beforehand and submit only the trained neural network.

paac

 Deadline: Apr 23, 22:00  3 points

Solve the CartPole-v1 environment using the parallel actor-critic algorithm, employing the vectorized environment described in the car_racing assignment.

Your goal is to reach an average return of 450 during 100 evaluation episodes.

Start with the paac.py template, which provides a simple network implementation in PyTorch. Feel free to use TensorFlow or JAX instead, if you like.

During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.

paac_continuous

 Deadline: Apr 23, 22:00  4 points

Solve the MountainCarContinuous-v0 environment using the parallel actor-critic algorithm with continuous actions. When actions are continuous, env.action_space is the same Box space as env.observation_space, offering:

  • env.action_space.shape, which specifies the shape of actions (you can assume actions are always a 1D vector),
  • env.action_space.low and env.action_space.high, which specify the ranges of the corresponding actions.

Your goal is to reach an average return of 90 during 100 evaluation episodes.

Start with the paac_continuous.py template, which provides a simple network implementation in PyTorch. Feel free to use TensorFlow or JAX instead, if you like.

During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.

ddpg

 Deadline: Apr 23, 22:00  5 points

Solve the continuous Pendulum-v1 and InvertedDoublePendulum-v5 environments using the deep deterministic policy gradient (DDPG) algorithm.

Your goal is to reach an average return of -200 for Pendulum-v1 and 9000 for InvertedDoublePendulum-v5 during 100 evaluation episodes.

Start with the ddpg.py template, which provides a simple network implementation in PyTorch. Feel free to use TensorFlow (ddpg.tf.py) or JAX instead, if you like.
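
A minimal sketch of the DDPG critic targets and the soft target-network update; the network and hyperparameter names are assumptions, and `dones` is a 0/1 float tensor.

  import torch

  # A hedged sketch of the DDPG critic targets: bootstrap through the target
  # actor and target critic, without building a computation graph.
  def critic_targets(target_actor, target_critic, rewards, next_states, dones, gamma):
      with torch.no_grad():
          next_actions = target_actor(next_states)
          next_values = target_critic(next_states, next_actions).squeeze(-1)
      return rewards + gamma * next_values * (1 - dones)

  # Polyak averaging of the target network towards the trained network.
  def soft_update(target_network, network, tau=0.005):
      with torch.no_grad():
          for target_param, param in zip(target_network.parameters(), network.parameters()):
              target_param.mul_(1 - tau).add_(tau * param)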

During evaluation in ReCodEx, two different random seeds will be employed for both environments, and you need to reach the required return on all of them. Time limit for each test is 10 minutes.

walker

 Deadline: Apr 30, 22:00  5 points

In this exercise, we explore continuous robot control by solving the continuous BipedalWalker-v3 environment.

Note that the penalty of -100 on crash makes the training considerably slower. Even though all of DDPG, TD3, and SAC can be trained with the original rewards, overriding the reward at the end of the episode to 0 speeds up training considerably.
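
One possible way to apply the override during training only is sketched below; wrapping env.step like this is just an option, not part of the provided template.

  # A hedged sketch: override the -100 crash penalty during training only,
  # keeping the original rewards for evaluation.
  def training_step(env, action):
      state, reward, terminated, truncated, info = env.step(action)
      if terminated and reward == -100:
          reward = 0
      return state, reward, terminated, truncated, info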

In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 200, you obtain 5 points.

The walker.py template contains the skeleton for implementing the SAC agent, but you can also solve the assignment with DDPG/TD3. The TensorFlow template walker.tf.py is also available.

walker_hardcore

 Deadline: Apr 30, 22:00  5 points + 5 bonus

As an extension of the walker assignment, solve the hardcore version of the BipedalWalker-v3 environment.

Note that the penalty of -100 on crash can discourage or even stop training, so overriding the reward at the end of the episode to 0 (or decreasing it substantially) makes the training considerably easier (I have not surpassed a return of 0 with either TD3 or SAC when using the original -100 penalty).

In ReCodEx, you are expected to submit an already trained model, which is evaluated with three seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 100, you obtain 5 points. The task is also a competition, and at most 5 points will be awarded according to relative ordering of your solutions.

The walker_hardcore.py template shows the basic structure of evaluation in ReCodEx, but you most likely want to start either with ddpg.py or with walker.py and just change the env argument to BipedalWalkerHardcore-v3.

cheetah

 Deadline: May 17, 22:00  2 points

In this exercise, use the DDPG/TD3/SAC algorithm to solve the HalfCheetah environment. If you start with DDPG, implementing the TD3 improvements should make the hyperparameter search significantly easier. However, for me, the SAC algorithm seems to work the best.

The template cheetah.py only creates the environment and shows the evaluation in ReCodEx.

In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 8000 on all of them, you obtain 2 points.

humanoid

 Deadline: May 14, 22:00  3 points; not required for passing the exam with grade 1 by solving all assignments

In this exercise, use the DDPG/TD3/SAC algorithm to solve the Humanoid environment.

The template humanoid.py only creates the environment and shows the evaluation in ReCodEx.

In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 8000 on all of them, you obtain 3 points.

humanoid_standup

 Deadline: May 14, 22:00  5 points; not required for passing the exam with grade 1 by solving all assignments

In this exercise, use the DDPG/TD3/SAC algorithm to solve the Humanoid Standup environment.

The template humanoid_standup.py only creates the environment and shows the evaluation in ReCodEx.

In ReCodEx, you are expected to submit an already trained model, which is evaluated with two seeds, each for 100 episodes with a time limit of 10 minutes. If your average return is at least 380000 on all of them, you obtain 5 points.

trace_algorithms

 Deadline: May 7, 22:00  4 points

Starting with the trace_algorithms.py template, implement the following state value estimations:

  • use n-step estimates for a given n;
  • if requested, use eligibility traces with a given λ;
  • allow off-policy correction using importance sampling with control variates, optionally clipping the individual importance sampling ratios by a given threshold (a one-step sketch follows this list).
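
For orientation, a minimal sketch of the one-step off-policy target with control variates and optional clipping; the n-step and eligibility-trace variants generalize this, and all names are assumptions.

  # A hedged sketch of the one-step case: the TD error is scaled by the
  # (optionally clipped) importance sampling ratio rho, while the V[state]
  # term acts as the control variate.
  def one_step_target(V, state, next_state, reward, done, gamma, rho, clip=None):
      if clip is not None:
          rho = min(rho, clip)
      td_error = reward + (0 if done else gamma * V[next_state]) - V[state]
      return V[state] + rho * td_error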

Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.

  1. python3 trace_algorithms.py --episodes=50 --n=1
The mean 1000-episode return after evaluation -196.80 +-25.96
  1. python3 trace_algorithms.py --episodes=50 --n=4
The mean 1000-episode return after evaluation -165.45 +-78.01
  1. python3 trace_algorithms.py --episodes=50 --n=8 --seed=62
The mean 1000-episode return after evaluation -180.20 +-61.48
  1. python3 trace_algorithms.py --episodes=50 --n=4 --trace_lambda=0.6
The mean 1000-episode return after evaluation -170.70 +-72.93
  1. python3 trace_algorithms.py --episodes=50 --n=8 --trace_lambda=0.6 --seed=77
The mean 1000-episode return after evaluation -154.24 +-86.67
  1. python3 trace_algorithms.py --episodes=50 --n=1 --off_policy
The mean 1000-episode return after evaluation -189.16 +-46.74
  1. python3 trace_algorithms.py --episodes=50 --n=4 --off_policy
The mean 1000-episode return after evaluation -159.09 +-83.40
  1. python3 trace_algorithms.py --episodes=50 --n=8 --off_policy
The mean 1000-episode return after evaluation -166.82 +-76.04
  1. python3 trace_algorithms.py --episodes=50 --n=1 --off_policy --vtrace_clip=1
The mean 1000-episode return after evaluation -198.50 +-17.93
  1. python3 trace_algorithms.py --episodes=50 --n=4 --off_policy --vtrace_clip=1
The mean 1000-episode return after evaluation -144.76 +-92.48
  1. python3 trace_algorithms.py --episodes=50 --n=8 --off_policy --vtrace_clip=1
The mean 1000-episode return after evaluation -167.63 +-75.87
  1. python3 trace_algorithms.py --episodes=50 --n=4 --off_policy --vtrace_clip=1 --trace_lambda=0.6
The mean 1000-episode return after evaluation -186.28 +-52.05
  1. python3 trace_algorithms.py --episodes=50 --n=8 --off_policy --vtrace_clip=1 --trace_lambda=0.6
The mean 1000-episode return after evaluation -185.67 +-53.04

Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.

  • python3 trace_algorithms.py --n=1
Episode 100, mean 100-episode return -96.50 +-92.02
Episode 200, mean 100-episode return -53.64 +-76.70
Episode 300, mean 100-episode return -29.03 +-54.84
Episode 400, mean 100-episode return -8.78 +-21.69
Episode 500, mean 100-episode return -14.24 +-41.76
Episode 600, mean 100-episode return -4.57 +-17.56
Episode 700, mean 100-episode return -7.90 +-27.92
Episode 800, mean 100-episode return -2.17 +-16.67
Episode 900, mean 100-episode return -2.07 +-14.01
Episode 1000, mean 100-episode return 0.13 +-13.93
The mean 1000-episode return after evaluation -35.05 +-84.82
  • python3 trace_algorithms.py --n=4
Episode 100, mean 100-episode return -74.01 +-89.62
Episode 200, mean 100-episode return -4.84 +-20.95
Episode 300, mean 100-episode return 0.37 +-11.81
Episode 400, mean 100-episode return 1.82 +-8.04
Episode 500, mean 100-episode return 1.28 +-8.66
Episode 600, mean 100-episode return 3.13 +-7.02
Episode 700, mean 100-episode return 0.76 +-8.05
Episode 800, mean 100-episode return 2.05 +-8.11
Episode 900, mean 100-episode return 0.98 +-9.22
Episode 1000, mean 100-episode return 0.29 +-9.13
The mean 1000-episode return after evaluation -11.49 +-60.05
  • python3 trace_algorithms.py --n=8 --seed=62
Episode 100, mean 100-episode return -102.83 +-104.71
Episode 200, mean 100-episode return -5.02 +-23.36
Episode 300, mean 100-episode return -0.43 +-13.33
Episode 400, mean 100-episode return 1.99 +-8.89
Episode 500, mean 100-episode return -2.17 +-16.57
Episode 600, mean 100-episode return -2.62 +-19.87
Episode 700, mean 100-episode return 1.66 +-7.81
Episode 800, mean 100-episode return -7.40 +-36.75
Episode 900, mean 100-episode return -5.95 +-34.04
Episode 1000, mean 100-episode return 3.51 +-7.88
The mean 1000-episode return after evaluation 6.88 +-14.89
  • python3 trace_algorithms.py --n=4 --trace_lambda=0.6
Episode 100, mean 100-episode return -85.33 +-91.17
Episode 200, mean 100-episode return -16.06 +-39.97
Episode 300, mean 100-episode return -2.74 +-15.78
Episode 400, mean 100-episode return -0.33 +-9.93
Episode 500, mean 100-episode return 1.39 +-9.48
Episode 600, mean 100-episode return 1.59 +-9.26
Episode 700, mean 100-episode return 3.66 +-6.99
Episode 800, mean 100-episode return 2.08 +-7.26
Episode 900, mean 100-episode return 1.32 +-8.76
Episode 1000, mean 100-episode return 3.33 +-7.27
The mean 1000-episode return after evaluation 7.93 +-2.63
  • python3 trace_algorithms.py --n=8 --trace_lambda=0.6 --seed=77
Episode 100, mean 100-episode return -118.76 +-105.12
Episode 200, mean 100-episode return -21.82 +-49.91
Episode 300, mean 100-episode return -0.59 +-11.21
Episode 400, mean 100-episode return 2.27 +-8.29
Episode 500, mean 100-episode return 1.65 +-8.52
Episode 600, mean 100-episode return 1.16 +-10.32
Episode 700, mean 100-episode return 1.18 +-9.62
Episode 800, mean 100-episode return 3.35 +-7.34
Episode 900, mean 100-episode return 1.66 +-8.67
Episode 1000, mean 100-episode return 0.86 +-8.56
The mean 1000-episode return after evaluation -11.93 +-60.63
  • python3 trace_algorithms.py --n=1 --off_policy
Episode 100, mean 100-episode return -68.47 +-73.52
Episode 200, mean 100-episode return -29.11 +-34.15
Episode 300, mean 100-episode return -20.30 +-31.24
Episode 400, mean 100-episode return -13.44 +-25.04
Episode 500, mean 100-episode return -4.72 +-13.75
Episode 600, mean 100-episode return -3.07 +-17.63
Episode 700, mean 100-episode return -2.70 +-13.81
Episode 800, mean 100-episode return 1.32 +-11.79
Episode 900, mean 100-episode return 0.78 +-8.95
Episode 1000, mean 100-episode return 1.15 +-9.27
The mean 1000-episode return after evaluation -12.63 +-62.51
  • python3 trace_algorithms.py --n=4 --off_policy
Episode 100, mean 100-episode return -96.25 +-105.93
Episode 200, mean 100-episode return -26.21 +-74.65
Episode 300, mean 100-episode return -4.84 +-31.78
Episode 400, mean 100-episode return -0.34 +-9.46
Episode 500, mean 100-episode return 1.15 +-8.49
Episode 600, mean 100-episode return 2.95 +-7.20
Episode 700, mean 100-episode return 0.94 +-10.19
Episode 800, mean 100-episode return 0.13 +-9.27
Episode 900, mean 100-episode return 1.95 +-9.69
Episode 1000, mean 100-episode return 1.91 +-7.59
The mean 1000-episode return after evaluation 6.79 +-3.68
  • python3 trace_algorithms.py --n=8 --off_policy
Episode 100, mean 100-episode return -180.08 +-112.11
Episode 200, mean 100-episode return -125.56 +-124.82
Episode 300, mean 100-episode return -113.66 +-125.12
Episode 400, mean 100-episode return -77.98 +-117.08
Episode 500, mean 100-episode return -23.71 +-69.71
Episode 600, mean 100-episode return -21.44 +-67.38
Episode 700, mean 100-episode return -2.43 +-16.31
Episode 800, mean 100-episode return 2.38 +-7.42
Episode 900, mean 100-episode return 1.29 +-7.78
Episode 1000, mean 100-episode return 0.84 +-8.37
The mean 1000-episode return after evaluation 7.03 +-2.37
  • python3 trace_algorithms.py --n=1 --off_policy --vtrace_clip=1
Episode 100, mean 100-episode return -71.85 +-75.59
Episode 200, mean 100-episode return -29.60 +-39.91
Episode 300, mean 100-episode return -23.11 +-33.97
Episode 400, mean 100-episode return -12.00 +-21.72
Episode 500, mean 100-episode return -5.93 +-15.92
Episode 600, mean 100-episode return -7.69 +-16.03
Episode 700, mean 100-episode return -2.95 +-13.75
Episode 800, mean 100-episode return 0.45 +-9.76
Episode 900, mean 100-episode return 0.65 +-9.36
Episode 1000, mean 100-episode return -1.56 +-11.53
The mean 1000-episode return after evaluation -24.25 +-75.88
  • python3 trace_algorithms.py --n=4 --off_policy --vtrace_clip=1
Episode 100, mean 100-episode return -76.39 +-83.74
Episode 200, mean 100-episode return -3.32 +-13.97
Episode 300, mean 100-episode return -0.33 +-9.49
Episode 400, mean 100-episode return 2.20 +-7.80
Episode 500, mean 100-episode return 1.49 +-7.72
Episode 600, mean 100-episode return 2.27 +-8.67
Episode 700, mean 100-episode return 1.07 +-9.07
Episode 800, mean 100-episode return 3.17 +-6.27
Episode 900, mean 100-episode return 3.25 +-7.39
Episode 1000, mean 100-episode return 0.70 +-8.61
The mean 1000-episode return after evaluation 7.70 +-2.52
  • python3 trace_algorithms.py --n=8 --off_policy --vtrace_clip=1
Episode 100, mean 100-episode return -110.07 +-106.29
Episode 200, mean 100-episode return -7.22 +-32.31
Episode 300, mean 100-episode return 0.54 +-9.65
Episode 400, mean 100-episode return 2.03 +-7.82
Episode 500, mean 100-episode return 1.64 +-8.63
Episode 600, mean 100-episode return 1.54 +-7.28
Episode 700, mean 100-episode return 2.80 +-7.86
Episode 800, mean 100-episode return 1.69 +-7.26
Episode 900, mean 100-episode return 1.17 +-8.59
Episode 1000, mean 100-episode return 2.39 +-7.59
The mean 1000-episode return after evaluation 7.57 +-2.35
  • python3 trace_algorithms.py --n=4 --off_policy --vtrace_clip=1 --trace_lambda=0.6
Episode 100, mean 100-episode return -81.87 +-87.96
Episode 200, mean 100-episode return -15.94 +-29.31
Episode 300, mean 100-episode return -5.24 +-20.41
Episode 400, mean 100-episode return -1.01 +-12.52
Episode 500, mean 100-episode return 1.09 +-9.55
Episode 600, mean 100-episode return 0.73 +-9.15
Episode 700, mean 100-episode return 3.09 +-7.59
Episode 800, mean 100-episode return 3.13 +-7.60
Episode 900, mean 100-episode return 1.30 +-8.72
Episode 1000, mean 100-episode return 3.77 +-7.11
The mean 1000-episode return after evaluation 6.46 +-17.53
  • python3 trace_algorithms.py --n=8 --off_policy --vtrace_clip=1 --trace_lambda=0.6
Episode 100, mean 100-episode return -127.86 +-106.40
Episode 200, mean 100-episode return -27.64 +-48.34
Episode 300, mean 100-episode return -12.75 +-35.05
Episode 400, mean 100-episode return -0.38 +-14.28
Episode 500, mean 100-episode return 1.35 +-9.10
Episode 600, mean 100-episode return 0.43 +-10.53
Episode 700, mean 100-episode return 3.11 +-9.26
Episode 800, mean 100-episode return 3.58 +-6.81
Episode 900, mean 100-episode return 1.24 +-8.24
Episode 1000, mean 100-episode return 1.58 +-7.15
The mean 1000-episode return after evaluation 7.93 +-2.67

ppo

 Deadline: May 7, 22:00  4 points

Implement the PPO algorithm in a single-agent setting. Notably, solve the SingleCollect environment implemented by the multi_collect_environment.py module. To familiarize yourself with it, you can watch a trained agent, and you can run the module directly, controlling the agent with the arrow keys. In the environment, your goal is to reach a known place, obtaining rewards based on the agent's distance. If the agent is continuously occupying the place for some period of time, it gets a large reward and the place is moved randomly. The environment runs for 250 steps and is considered solved if you obtain a return of at least 500.

The ppo.tf.py TensorFlow template contains a skeleton of the PPO algorithm implementation. A PyTorch template will be added later. Regarding the unspecified hyperparameters, I would consider the following ranges:

  • batch_size between 64 and 512
  • clip_epsilon between 0.1 and 0.2
  • epochs between 1 and 10
  • gamma between 0.97 and 1.0
  • trace_lambda is usually 0.95
  • envs between 16 and 128
  • worker_steps between tens and hundreds

My implementation trains in approximately three minutes of CPU time.
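
For reference, a minimal sketch of the PPO clipped surrogate policy loss in PyTorch; the tensor names and the clip_epsilon default are assumptions, with old_log_probs and advantages assumed to be precomputed and detached.

  import torch

  # A hedged sketch of the clipped surrogate policy loss of PPO.
  def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_epsilon=0.15):
      ratios = torch.exp(new_log_probs - old_log_probs)
      clipped = torch.clamp(ratios, 1 - clip_epsilon, 1 + clip_epsilon)
      return -torch.minimum(ratios * advantages, clipped * advantages).mean()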

During evaluation in ReCodEx, two different random seeds will be employed, and you need to reach an average return of 450 on all of them. Time limit for each test is 10 minutes.

Submitting to ReCodEx

When submitting a competition solution to ReCodEx, you should submit a trained agent and a Python source capable of running it.

Furthermore, please also include the Python source and hyperparameters you used to train the submitted model. But be careful that there still must be exactly one Python source with a line starting with def main(.

Do not forget about the maximum allowed model size and time and memory limits.

Competition Evaluation

  • Before the deadline, ReCodEx prints the exact performance of your agent, but only if it is worse than the baseline.

    If you surpass the baseline, the assignment is marked as solved in ReCodEx and you immediately get regular points for the assignment. However, ReCodEx does not print the reached performance.

  • After the competition deadline, the latest submission of every user surpassing the required baseline participates in a competition. Additional bonus points are then awarded according to the ordering of the performance of the participating submissions.

  • After the competition results announcement, ReCodEx starts to show the exact performance for all the already submitted solutions and also for the solutions submitted later.

What Is Allowed

  • Unless stated otherwise, you can use any algorithm to solve the competition task at hand, but the implementation must be created by you and you must understand it fully. You can of course take inspiration from any paper or existing implementation, but please reference it in that case.
  • PyTorch, TensorFlow, and JAX are available in ReCodEx (but there are no GPUs).

Install

  • Installing to central user packages repository

    You can install all required packages to central user packages repository using python3 -m pip install --user --no-cache-dir --extra-index-url=https://download.pytorch.org/whl/cu118 torch~=2.2.0 torchaudio~=2.2.0 torchvision~=0.17.0 gymnasium~=1.0.0a1 pygame~=2.5.2 ufal.pybox2d~=2.3.10.3 mujoco==3.1.1 imageio~=2.34.0.

    The above command installs CUDA 11.8 PyTorch build, but you can change cu118 to:

    • cpu to get CPU-only (smaller) version,
    • cu121 to get CUDA 12.1 build,
    • rocm5.7 to get AMD ROCm 5.7 build.
  • Installing to a virtual environment

    Python supports virtual environments, which are directories containing independent sets of installed packages. You can create a virtual environment by running python3 -m venv VENV_DIR followed by VENV_DIR/bin/pip install --no-cache-dir --extra-index-url=https://download.pytorch.org/whl/cu118 torch~=2.2.0 torchaudio~=2.2.0 torchvision~=0.17.0 gymnasium~=1.0.0a1 pygame~=2.5.2 ufal.pybox2d~=2.3.10.3 mujoco==3.1.1 imageio~=2.34.0. (or VENV_DIR/Scripts/pip on Windows).

    Again, apart from the CUDA 11.8 build, you can change cu118 to:

    • cpu to get CPU-only (smaller) version,
    • cu121 to get CUDA 12.1 build,
    • rocm5.7 to get AMD ROCm 5.7 build.
  • Windows installation

    • On Windows, it can happen that python3 is not in PATH, while py command is – in that case you can use py -m venv VENV_DIR, which uses the newest Python available, or for example py -3.11 -m venv VENV_DIR, which uses Python version 3.11.

    • If you encounter a problem creating the logs in the args.logdir directory, a possible cause is that the path is longer than 260 characters, which is the default maximum length of a complete path on Windows. However, you can increase this limit on Windows 10, version 1607 or later, by following the instructions.

  • GPU support on Linux and Windows

PyTorch supports NVIDIA and AMD GPUs out of the box; you just need to select the appropriate --extra-index-url when installing the packages.

    If you encounter problems loading CUDA or cuDNN libraries, make sure your LD_LIBRARY_PATH does not contain paths to older CUDA/cuDNN libraries.

  • GPU support on macOS

The support for Apple Silicon GPUs in PyTorch+Keras is currently not great. Apple is working on an mlx backend for Keras, which might improve the situation in the future.

MetaCentrum

  • How to apply for MetaCentrum account?

    After reading the Terms and conditions, you can apply for an account here.

    After your account is created, please make sure that the directories containing your solutions are always private.

  • How to activate Python 3.10 on MetaCentrum?

On MetaCentrum, the newest available Python is currently 3.10, which you need to activate in every session by running the following command:

    module add python/python-3.10.4-intel-19.0.4-sc7snnf
    
  • How to install the required virtual environment on MetaCentrum?

    To create a virtual environment, you first need to decide where it will reside. Either you can find permanent storage where you have a large enough quota, or you can use the scratch storage of a submitted job.

    TL;DR:

    • Run an interactive CPU job, asking for 16GB scratch space:

      qsub -l select=1:ncpus=1:mem=8gb:scratch_local=16gb -I
      
    • In the job, use the allocated scratch space as the temporary directory:

      export TMPDIR=$SCRATCHDIR
      
    • You should clear the scratch space before you exit using the clean_scratch command. You can instruct the shell to call it automatically by running:

      trap 'clean_scratch' TERM EXIT
      
    • Finally, create the virtual environment and install PyTorch in it:

      module add python/python-3.10.4-intel-19.0.4-sc7snnf
      python3 -m venv CHOSEN_VENV_DIR
      CHOSEN_VENV_DIR/bin/pip install --no-cache-dir --upgrade pip setuptools
      CHOSEN_VENV_DIR/bin/pip install --no-cache-dir --extra-index-url=https://download.pytorch.org/whl/cu118 torch~=2.2.0 torchaudio~=2.2.0 torchvision~=0.17.0 gymnasium~=1.0.0a1 pygame~=2.5.2 ufal.pybox2d~=2.3.10.3 mujoco==3.1.1 imageio~=2.34.0
      
  • How to run a GPU computation on MetaCentrum?

    First, read the official MetaCentrum documentation: Basic terms, Run simple job, GPU computing, GPU clusters.

    TL;DR: To run an interactive GPU job with 1 CPU, 1 GPU, 8GB RAM, and 16GB scratch space, run:

    qsub -q gpu -l select=1:ncpus=1:ngpus=1:mem=8gb:scratch_local=16gb -I
    

    To run a script in a non-interactive way, replace the -I option with the path of the script to be executed; a sketch of such a script is shown at the end of this section.

    If you want to run a CPU-only computation, remove the -q gpu and ngpus=1: from the above commands.
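
    For illustration, a non-interactive job script could look like the following sketch, where CHOSEN_VENV_DIR and my_solution.py are placeholders for your actual virtual environment path and solution script:

    #!/bin/bash
    # Illustrative sketch of a MetaCentrum job script; adjust the placeholders.
    trap 'clean_scratch' TERM EXIT   # clean the scratch space when the job ends
    export TMPDIR=$SCRATCHDIR        # use the allocated scratch space as the temporary directory

    module add python/python-3.10.4-intel-19.0.4-sc7snnf
    CHOSEN_VENV_DIR/bin/python my_solution.py

    Such a script is then submitted with the qsub command above, with the -I option replaced by the path to the script.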

AIC

  • How to install required packages on AIC?

    Python 3.11.7 is available as /opt/python/3.11.7/bin/python3, so you should start by creating a virtual environment using

    /opt/python/3.11.7/bin/python3 -m venv VENV_DIR
    

    and then install the required packages in it using

    VENV_DIR/bin/pip install --no-cache-dir --extra-index-url=https://download.pytorch.org/whl/cu118 torch~=2.2.0 torchaudio~=2.2.0 torchvision~=0.17.0 gymnasium~=1.0.0a1 pygame~=2.5.2 ufal.pybox2d~=2.3.10.3 mujoco==3.1.1 imageio~=2.34.0
    
  • How to run a GPU computation on AIC?

    First, read the official AIC documentation: Submitting CPU Jobs, Submitting GPU Jobs.

    TL;DR: To run an interactive GPU job with 1 CPU, 1 GPU, and 16GB RAM, run:

    srun -p gpu -c1 -G1 --mem=16G --pty bash
    

    To run a shell script requiring a GPU in a non-interactive way (a minimal sketch of such a script is shown at the end of this section), use

    sbatch -p gpu -c1 -G1 --mem=16G SCRIPT_PATH
    

    If you want to run a CPU-only computation, remove the -p gpu and -G1 from the above commands.
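
    For illustration, the SCRIPT_PATH shell script can be as simple as the following sketch, where VENV_DIR and my_solution.py are placeholders for your actual virtual environment path and solution script:

    #!/bin/bash
    # Illustrative sketch of an AIC job script; adjust the placeholders.
    VENV_DIR/bin/python my_solution.py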

Git

  • Is it possible to keep the solutions in a Git repository?

    Definitely. Keeping the solutions in a branch of your repository, where you merge them with the course repository, is probably a good idea. However, please keep the cloned repository with your solutions private.

  • On GitHub, do not create a public fork with your solutions

    If you keep your solutions in a GitHub repository, please do not create a clone of the repository by using the Fork button – this way, the cloned repository would be public.

    Of course, if you just want to create a pull request, GitHub requires a public fork and that is fine – just do not store your solutions in it.

  • How to clone the course repository?

    To clone the course repository, run

    git clone https://github.com/ufal/npfl139
    

    This creates the repository in the npfl139 subdirectory; if you want a different name, add it as the last parameter.

    To update the repository, run git pull inside the repository directory.

  • How to keep the course repository as a branch in your repository?

    If you want to store the course repository just in a local branch of your existing repository, you can run the following commands while inside it:

    git remote add upstream https://github.com/ufal/npfl139
    git fetch upstream
    git checkout -t upstream/master
    

    This creates a branch master; if you want a different name, add -b BRANCH_NAME to the last command.

    In both cases, you can update your checkout by running git pull while in it.

  • How to merge the course repository with your modifications?

    If you want to store your solutions in a branch merged with the course repository, you should start by

    git remote add upstream https://github.com/ufal/npfl139
    git pull upstream master
    

    which creates a branch master; if you want a different name, change the last argument to master:BRANCH_NAME.

    You can then commit to this branch and push it to your repository.

    To merge the current course repository with your branch, run

    git pull upstream master
    

    while in your branch. Of course, it might be necessary to resolve conflicts if both you and I modified the same place in the templates.

ReCodEx

  • What files can be submitted to ReCodEx?

    You can submit multiple files of any type to ReCodEx. There is a limit of 20 files per submission, with a total size of 20MB.

  • What file does ReCodEx execute and what arguments does it use?

    Exactly one file with the .py suffix must contain a line starting with def main(. Such a file is imported by ReCodEx and its main function is executed (during the import, __name__ == "__recodex__").

    The file must also export an argument parser called parser. ReCodEx uses its arguments and default values, but it overwrites some of the arguments depending on the test being executed – the template should always indicate which arguments are set by ReCodEx and which are left intact. An illustrative sketch of such a file is shown at the end of this section.

  • What are the time and memory limits?

    The memory limit during evaluation is 1.5GB. The time limit varies, but it should be at least 10 seconds and at least twice the running time of my solution.

  • Do agents need to be trained directly in ReCodEx?

    No, you can pre-train your agent locally (unless specified otherwise in the task description).
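
  • What does a submitted solution usually look like?

    The assignment templates already have the required structure, so the following is purely an illustration – a minimal sketch of a file ReCodEx could execute, with made-up example arguments (each template documents which arguments ReCodEx actually overrides):

    #!/usr/bin/env python3
    # A minimal illustrative sketch; the real course templates already follow this structure.
    import argparse

    parser = argparse.ArgumentParser()
    # Made-up example arguments; ReCodEx overrides some of them depending on the test.
    parser.add_argument("--seed", default=42, type=int, help="Random seed.")
    parser.add_argument("--episodes", default=1000, type=int, help="Number of training episodes.")

    def main(args: argparse.Namespace):
        # During ReCodEx evaluation, this module is imported (with __name__ == "__recodex__")
        # and main is then executed.
        ...

    if __name__ == "__main__":
        main(parser.parse_args())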

Requirements

To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that all surplus points (both bonus and non-bonus) will be transferred to the exam. In total, assignments for at least 120 points (not including the bonus points) will be available, and if you solve all the assignments (any non-zero amount of points counts as solved), you automatically pass the exam with grade 1.

To pass the exam, you need to obtain at least 60, 75, or 90 points out of the 100-point exam to receive a grade 3, 2, or 1, respectively. The exam consists of questions worth 100 points in total, drawn from the list below (the questions are generated randomly, but in such a way that there is at least one question from every lecture except the last one). In addition, you can get surplus points from the practicals and at most 10 points for community work (e.g., fixing slides or reporting issues) – but only the points you already have at the time of the exam count. You can take the exam without passing the practicals first.

Exam Questions

Lecture 1 Questions

  • Derive how to incrementally update a running average (how to compute an average of $N$ numbers using the average of the first $N-1$ numbers; a short sketch of the calculation follows this list). [5]

  • Describe the multi-armed bandits problem and write down the $\epsilon$-greedy algorithm for solving it. [5]

  • Define a Markov Decision Process, including the definition of a return. [5]

  • Describe how a partially observable Markov decision process extends the Markov decision process and how the agent is altered. [5]

  • Define a value function, such that all expectations are over simple random variables (actions, states, rewards), not trajectories. [5]

  • Define an action-value function, such that all expectations are over simple random variables (actions, states, rewards), not trajectories. [5]

  • Express a value function using an action-value function, and express an action-value function using a value function. [5]

  • Define optimal value function and optimal action-value function. Then define optimal policy in such a way that its existence is guaranteed. [5]
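
  For illustration, the first question above amounts to a short calculation (only a sketch; $\mu_N$ denotes the average of the first $N$ numbers $x_1, \ldots, x_N$):

    \mu_N = \frac{1}{N} \sum_{i=1}^{N} x_i
          = \frac{1}{N} \big( x_N + (N-1)\, \mu_{N-1} \big)
          = \mu_{N-1} + \frac{1}{N} \big( x_N - \mu_{N-1} \big)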

Lecture 2 Questions

  • Write down the Bellman optimality equation. [5]

  • Define the Bellman backup operator. [5]

  • Write down the value iteration algorithm. [5]

  • Define the supremum norm $\|\cdot\|_\infty$ and prove that the Bellman backup operator is a contraction with respect to this norm. [10]

  • Formulate and prove the policy improvement theorem. [10]

  • Write down the policy iteration algorithm. [10]

  • Write down the tabular Monte-Carlo on-policy every-visit $\epsilon$-soft algorithm. [10]

  • Write down the Sarsa algorithm. [10]

  • Write down the Q-learning algorithm. [10]

Lecture 3 Questions

  • Elaborate on how importance sampling can estimate expectations with respect to $\pi$ based on samples from $b$. [5]

  • Show how to estimate returns in the off-policy case, both with (1) ordinary importance sampling and (2) weighted importance sampling. [10]

  • Write down the Expected Sarsa algorithm and show how to obtain Q-learning from it. [10]

  • Write down the Double Q-learning algorithm. [10]

  • Show the bootstrapped estimate of the $n$-step return. [5]

  • Write down the update in on-policy $n$-step Sarsa (assuming you already have $n$ previous steps, actions and rewards). [5]

  • Write down the update in off-policy $n$-step Sarsa with importance sampling (assuming you already have $n$ previous steps, actions and rewards). [10]

  • Write down the update of the $n$-step Tree-backup algorithm (assuming you already have $n$ previous steps, actions and rewards). [10]

  • Assuming function approximation, define the mean squared value error. [5]

  • Write down the gradient Monte-Carlo on-policy every-visit $\epsilon$-soft algorithm. [10]

Lecture 4 Questions

  • Write down the semi-gradient $\epsilon$-greedy Sarsa algorithm. [10]

  • Prove that the semi-gradient TD update is not an SGD update of any loss. [10]

  • What are the three elements causing off-policy divergence with function approximation? Write down Baird's counterexample. [10]

  • Explain the role of a replay buffer in Deep Q Networks. [5]

  • How is the target network used and updated in Deep Q Networks? [5]

  • Explain how reward clipping is used in Deep Q Networks. What other clipping is used? [5]

  • Formulate the loss used in Deep Q Networks. [5]

  • Write down the Deep Q Networks training algorithm. [10]

  • Explain the difference between DQN and Double DQN. [5]

  • Describe prioritized replay (how are transitions sampled from the replay buffer, how up-to-date are the priorities [according to which we sample], how are unseen transitions boosted, how is importance sampling used to account for the change in the sampling distribution). [10]

  • How is the action-value function computed in dueling networks? [5]

Lecture 5 Questions

  • Describe a fully connected layer in Noisy nets (parametrization, computation, effective noise generation). [5]

  • In Distributional RL, describe how the distribution of rewards is represented and how it is predicted using a neural network. [5]

  • Write down the distributional Bellman equation, describe how the atom probabilities of a reward distribution are modeled, and write down the loss used to train a distributional Q network (including the mapping of atoms, which does not need to be mathematically flawless – it is enough to describe how it should be done). [10]

  • More questions about distributional RL to be added later.

Lecture 6 Questions

  • Formulate the policy gradient theorem. [5]

  • Prove the part of the policy gradient theorem showing the value of $\nabla_{\boldsymbol\theta} v_\pi(s)$. [10]

  • Assuming the policy gradient theorem, formulate the loss used by the REINFORCE algorithm and show how its gradient can be expressed as an expectation over states and actions. [5]

  • Write down the REINFORCE algorithm. [10]

  • Show that introducing a baseline does not influence the validity of the policy gradient theorem. [5]

  • Write down the REINFORCE with baseline algorithm. [10]

  • Write down the trajectory formulation of the operator version of REINFORCE, and show that the usual REINFORCE performs one gradient step to minimize the same utility function. [10]

  • Write down the state-action formulation of the operator version of REINFORCE, and show that the usual REINFORCE performs one gradient step to minimize the same utility function. [10]

  • Write down the one-step Actor-critic algorithm. [10]

  • How and why is entropy regularization used in policy gradient algorithms? [5]

  • The Asynchronous advantage actor-critic (A3C) policy may utilize recurrent neural networks. How is the training structured to allow backpropagation through them (would vanilla DQN, vanilla REINFORCE, vanilla actor-critic work with recurrent neural networks)? [5]

Lecture 8 Questions

  • Explain the difference between a regular Actor-critic and Parallel Advantage Actor Critic algorithms. [5]

  • Considering continuous actions modeled by a normal distribution with diagonal covariance, describe how the policy distribution is computed (network architecture, output activation functions) and what the loss of a simple REINFORCE algorithm looks like. [5]

  • Formulate the deterministic policy gradient theorem for $\nabla_{\boldsymbol\theta} v_\pi(s)$. [5]

  • Formulate the deterministic policy gradient theorem for $\nabla_{\boldsymbol\theta} J(\boldsymbol\theta)$. [5]

  • Prove the part of the deterministic policy gradient theorem showing the value of $\nabla_{\boldsymbol\theta} v_\pi(s)$. [10]

  • Write down the critic loss (or its derivative) and the actor policy loss (or its derivative) of the Deep Deterministic Policy Gradients (DDPG) algorithm. Make sure to distinguish the target networks from the ones being trained. [10]

  • How is the return estimated in the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm? [5]

  • Write down the critic loss (or its derivative) and the actor policy loss (or its derivative) of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. Make sure to distinguish the target networks from the ones being trained. [10]

  • Write down how the reward is augmented in Soft actor critic, and the definitions of the soft action-value function and the soft (state-)value function. Then, define the modified Bellman backup operator $\mathcal{T}_\pi$ (be sure to indicate whether you are using the augmented or non-augmented reward), whose repeated application converges to the soft action-value function $q_\pi$, and prove it. [10]

  • Considering soft policy improvement of a policy $\pi$, write down the update formula for the improved policy $\pi'$, and prove that the soft action-value function of the improved policy is greater than or equal to the soft action-value function of the original policy. [10]

  • Write down how the critics and target critics are updated in the Soft actor critic algorithm. [5]

  • Write down how the actor is updated in the Soft actor critic algorithm, including the policy reparametrization trick. [5]

  • Regarding the entropy penalty coefficient $\alpha$ in the Soft actor critic, define what constrained optimization problem we are solving, what the corresponding Lagrangian is (and whether we are minimizing/maximizing it with respect to the policy and $\alpha$), and what the $\alpha$ update looks like. [5]

Lecture 9 Questions

  • Define a one-step TD error and express the $n$-step return as a sum of them. [5]

  • Define a one-step TD error and express the $n$-step return with off-policy correction using control variates as a sum of TD errors. [5]

  • Define the $\lambda$-return. [5]

  • Define the $n$-step truncated $\lambda$-return. [5]

  • Define a one-step TD error and express the $n$-step truncated $\lambda$-return as a sum of them. [5]

  • Define a one-step TD error and express the $n$-step truncated $\lambda$-return with off-policy correction as a sum of them. [5]

  • Define the V-trace estimate and write down the policy to whose value function the V-trace estimate converges. [10]

  • Explain why the fixed point of the V-trace operator does not depend on the truncation of all but the last importance sampling ratios. [10]

  • Write down the critic loss (or its derivative) and the actor policy loss (or its derivative) of the IMPALA algorithm, including the V-trace formula. [10]

  • Sketch the population based training used in the IMPALA algorithm. [5]

  • In PopArt normalization, the value function is computed based on a normalized value predictor $n$ as $\sigma n + \mu$. Describe how to maintain $\sigma$ and $\mu$, how to compute the normalized advantage based on the return $G$, and how the normalized value predictor is modified when the estimates of $\sigma$ and $\mu$ change. [10]

Lecture 10 Questions

  • Define the transformed Bellman operator. [5]

  • Define the transformed Bellman operator. Then, assuming $h$ is a strictly monotonically increasing function and considering a deterministic Markov decision process, show what a transformed Bellman operator $\mathcal{T}_h$ converges to and prove it. [10]

  • Write down the return transformation used for Atari environments (for example by R2D2). [5]

  • Describe the replay buffer elements in R2D2. What is the difference between the zero-state and stored-state strategies, and how is burn-in used? [5]

  • Write down the Retrace operator and describe the three possibilities of setting the traces $c_t$: importance sampling, Tree-backup($\lambda$) and Retrace($\lambda$). [10]