# Deep Reinforcement Learning – Winter 2022/23

In recent years, reinforcement learning has been combined with deep neural networks, giving rise to game agents with super-human performance (for example in Go, chess, or 1v1 Dota 2, trained solely by self-play), datacenter cooling algorithms that are 50% more efficient than trained human operators, and improved machine translation. The goal of the course is to introduce reinforcement learning employing deep neural networks, focusing both on the theory and on practical implementations.

Python programming skills and TensorFlow skills (or skills in any other deep learning framework) are required, to the extent of the NPFL114 course. No previous knowledge of reinforcement learning is necessary.

SIS code: NPFL122
Semester: winter
E-credits: 5
Examination: 2/2 C+Ex
Guarantor: Milan Straka

### Timespace Coordinates

• lecture: the lecture is held on Monday 9:00 in S9; first lecture is on Oct 03
• practicals: the practicals take place on Monday 17:20 in S3; first practicals are on Oct 03

All lectures and practicals will be recorded and available on this website.

### Lectures

The lecture content, including references to study materials.

The main study material is Reinforcement Learning: An Introduction, second edition, by Richard S. Sutton and Andrew G. Barto (referred to as RLB). It is available online and also as a hardcopy.

References to study materials cover all theory required at the exam, and sometimes even more – the references in italics cover topics not required for the exam.

### 1. Introduction to Reinforcement Learning

• History of RL [Chapter 1 of RLB]
• Multi-armed bandits [Sections 2-2.6 of RLB]
• Markov Decision Process [Sections 3-3.3 of RLB]

### Requirements

To pass the practicals, you need to obtain at least 80 points, excluding the bonus points. Note that all surplus points (both bonus and non-bonus) will be transferred to the exam. In total, assignments for at least 120 points (not including the bonus points) will be available, and if you solve all the assignments (any non-zero amount of points counts as solved), you automatically pass the exam with grade 1.

### Environment

The tasks are evaluated automatically using the ReCodEx Code Examiner.

The evaluation is performed using Python 3.9, TensorFlow 2.8.3, TensorFlow Addons 0.16.1, TensorFlow Probability 0.16.0, NumPy 1.23.3, and Gym 0.26.1. You should install the exact versions of these packages yourself. For those using PyTorch, version 1.12.1 is also available.

### Teamwork

Solving assignments in teams (of size at most 3) is encouraged, but everyone has to participate (it is forbidden not to work on an assignment and then submit a solution created by other team members). All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. Each such solution must explicitly list all members of the team to allow plagiarism detection using this template.

### No Cheating

Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty. While discussing assignments with any classmate is fine, each team must complete the assignments themselves, without using code they did not write (unless explicitly allowed). Of course, inside a team you are expected to share code and submit identical solutions.

### bandits

Deadline: Oct 17, 7:59 a.m.  3 points

Implement the $ε$-greedy strategy for solving multi-armed bandits.

Start with the bandits.py template, which defines the MultiArmedBandits environment with the following methods:

• reset(): reset the environment
• step(action) → reward: perform the chosen action in the environment, obtaining a reward
• greedy(epsilon): return True with probability 1-epsilon

Your goal is to implement the following solution variants:

• alpha$=0$: perform $ε$-greedy search, updating the estimates using averaging.
• alpha$≠0$: perform $ε$-greedy search, updating the estimates using a fixed learning rate alpha.

Note that the initial estimates should be set to a given value and epsilon can be zero, in which case purely greedy actions are used.
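Purely for illustration, the two update variants above can be sketched as follows. The MultiArmedBandits class comes from the template, so the Gaussian-reward stand-in environment below (and every name in this sketch) is hypothetical:

```python
import numpy as np

def epsilon_greedy_bandit(n_arms=10, episodes=1000, epsilon=0.1,
                          alpha=0.0, initial=0.0, seed=42):
    # Hypothetical stand-in for the template's MultiArmedBandits environment:
    # each arm pays out a unit-variance Gaussian reward around a fixed mean.
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, size=n_arms)

    estimates = np.full(n_arms, initial, dtype=float)  # Q-value estimates
    counts = np.zeros(n_arms, dtype=int)               # times each arm was pulled

    total = 0.0
    for _ in range(episodes):
        # Epsilon-greedy action selection: explore with probability epsilon.
        if rng.random() < epsilon:
            action = int(rng.integers(n_arms))
        else:
            action = int(np.argmax(estimates))

        reward = rng.normal(true_means[action], 1.0)
        counts[action] += 1

        # alpha == 0: incremental averaging; alpha != 0: fixed learning rate.
        step = 1.0 / counts[action] if alpha == 0 else alpha
        estimates[action] += step * (reward - estimates[action])
        total += reward

    return total / episodes
```

Note that both variants share the same update form `Q += step * (reward - Q)`; only the step size differs.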

Note that your results may be slightly different, depending on your CPU type and whether you use a GPU.

• python3 bandits.py --alpha=0 --epsilon=0.1 --initial=0
1.39 0.08

• python3 bandits.py --alpha=0 --epsilon=0 --initial=1
1.48 0.22

• python3 bandits.py --alpha=0.15 --epsilon=0.1 --initial=0
1.37 0.09

• python3 bandits.py --alpha=0.15 --epsilon=0 --initial=1
1.52 0.04


### monte_carlo

Deadline: Oct 17, 7:59 a.m.  5 points

Solve the discretized CartPole-v1 environment from the Gym library using the Monte Carlo reinforcement learning algorithm. The gym environments have the following methods and properties:

• observation_space: the description of environment observations
• action_space: the description of environment actions
• reset() → new_state, info: starts a new episode, returning the new state and additional environment-specific information
• step(action) → new_state, reward, terminated, truncated, info: perform the chosen action in the environment, returning the new state, obtained reward, boolean flags indicating a terminal state and episode truncation, and additional environment-specific information

We additionally extend the gym environment by:

• episode: number of the current episode (zero-based)
• reset(start_evaluation=False) → new_state, info: if start_evaluation is True, an evaluation is started

Once you finish training (which you indicate by passing start_evaluation=True to reset), your goal is to reach an average return of 490 during 100 evaluation episodes. Note that the environment prints your 100-episode average return every 10 episodes even during training.

Start with the monte_carlo.py template, which parses several useful parameters, creates the environment and illustrates the overall usage.

During evaluation in ReCodEx, three different random seeds will be employed, and you need to reach the required return on all of them. Time limit for each test is 5 minutes.
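The actual discretization and training loop belong in the template; as a hedged sketch of the underlying technique, here is every-visit Monte Carlo control with $ε$-greedy exploration on a hypothetical two-state toy environment (the ToyChain class below is invented for this sketch and only mimics the Gym 0.26 reset()/step() interface):

```python
import random
from collections import defaultdict

class ToyChain:
    """Hypothetical stand-in mimicking the Gym 0.26 interface: two states,
    action 1 moves right; reaching state 2 ends the episode with reward 1."""
    def reset(self):
        self.state = 0
        return self.state, {}
    def step(self, action):
        self.state += 1 if action == 1 else 0
        terminated = self.state == 2
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, False, {}

def monte_carlo_control(env, episodes=500, epsilon=0.1, gamma=0.9, n_actions=2):
    Q = defaultdict(float)     # state-action value estimates
    counts = defaultdict(int)  # visit counts for incremental averaging
    for _ in range(episodes):
        # Generate one episode with the epsilon-greedy policy derived from Q.
        state, _ = env.reset()
        episode = []
        terminated = truncated = False
        while not (terminated or truncated):
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state, a])
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Every-visit Monte Carlo update: walk the episode backwards,
        # accumulating the return G and averaging it into Q.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            counts[state, action] += 1
            Q[state, action] += (G - Q[state, action]) / counts[state, action]
    return Q
```

With a discount below one, the action that terminates sooner accumulates a higher estimated return, so the greedy policy learns to move right.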

### Submitting to ReCodEx

When submitting a competition solution to ReCodEx, you should submit a trained agent and a Python source capable of running it.

Furthermore, please also include the Python source and hyperparameters you used to train the submitted model. But beware: there still must be exactly one Python source with a line starting with def main(.

Do not forget about the maximum allowed model size and time and memory limits.

### Competition Evaluation

• Before the deadline, ReCodEx prints the exact performance of your agent, but only if it is worse than the baseline.

If you surpass the baseline, the assignment is marked as solved in ReCodEx and you immediately get regular points for the assignment. However, ReCodEx does not print the reached performance.

• After the competition deadline, the latest submission of every user surpassing the required baseline participates in a competition. Additional bonus points are then awarded according to the ordering of the performance of the participating submissions.

• After the competition results announcement, ReCodEx starts to show the exact performance for all the already submitted solutions and also for the solutions submitted later.

### What Is Allowed

• Unless stated otherwise, you can use any algorithm to solve the competition task at hand, but the implementation must be created by you.
• Both TensorFlow and PyTorch are available in ReCodEx (but there are no GPUs).

### Install

• Installing to central user packages repository

You can install all required packages to central user packages repository using pip3 install --user tensorflow==2.8.3 tensorflow_addons==0.16.1 tensorflow_probability==0.16.0 numpy==1.23.3 gym==0.26.1 pygame==2.1.2 mujoco==2.2.2 ufal.pybox2d==2.3.10.2.

• Installing to a virtual environment

Python supports virtual environments, which are directories containing independent sets of installed packages. You can create a virtual environment by running python3 -m venv VENV_DIR, followed by VENV_DIR/bin/pip3 install tensorflow==2.8.3 tensorflow_addons==0.16.1 tensorflow_probability==0.16.0 numpy==1.23.3 gym==0.26.1 pygame==2.1.2 mujoco==2.2.2 ufal.pybox2d==2.3.10.2 (or VENV_DIR/Scripts/pip3 on Windows).

• Windows installation

• On Windows, it can happen that python3 is not in PATH, while py command is – in that case you can use py -m venv VENV_DIR, which uses the newest Python available, or for example py -3.9 -m venv VENV_DIR, which uses Python version 3.9.

• If your Windows TensorFlow fails with ImportError: DLL load failed, you are probably missing Visual C++ 2019 Redistributable.

• If you encounter a problem creating the logs in the args.logdir directory, a possible cause is that the path is longer than 260 characters, which is the default maximum length of a complete path on Windows. However, you can increase this limit on Windows 10, version 1607 or later, by following the instructions.

• macOS installation

• With an Intel processor, you should not need anything special.

• If you have Apple Silicon, use tensorflow-macos==2.8.0 protobuf==3.19.6 instead of tensorflow. As of Oct 1, the dependency package grpcio needs to be compiled during the installation (automatically, but you need working Xcode); the installation worked fine on my testing macOS. Furthermore, according to this issue, a binary wheel for grpcio could be provided soon.

• GPU support on Linux and Windows

TensorFlow 2.8 supports NVIDIA GPU out of the box, but you need to install CUDA 11.2 and cuDNN 8.1 libraries yourself.

• GPU support on macOS

The AMD and Apple Silicon GPUs can be used by installing a plugin providing the GPU acceleration using:

python -m pip install tensorflow-metal==0.5.1

• Errors when running with a GPU

If you encounter errors when running with a GPU:

• if you are using the GPU also for displaying, try using the following environment variable: export TF_FORCE_GPU_ALLOW_GROWTH=true
• you can rerun with export TF_CPP_MIN_LOG_LEVEL=0 environmental variable, which increases verbosity of the log messages.

### MetaCentrum

• How to install TensorFlow dependencies on MetaCentrum?

To install CUDA, cuDNN, and Python 3.10 on MetaCentrum, it is enough to run in every session the following command:

module add python/python-3.10.4-gcc-8.3.0-ovkjwzd cuda/cuda-11.2.0-intel-19.0.4-tn4edsz cudnn/cudnn-8.1.0.77-11.2-linux-x64-intel-19.0.4-wx22b5t

• How to install TensorFlow on MetaCentrum?

Once you have the required dependencies, you can create a virtual environment and install TensorFlow in it. However, note that by default the MetaCentrum jobs have little disk space, so read about how to ask for scratch storage when submitting a job, and about quotas.

TL;DR:

• Run an interactive CPU job, asking for 16GB scratch space:

qsub -l select=1:ncpus=1:mem=8gb:scratch_local=16gb -I

• In the job, use the allocated scratch space as a temporary directory:

export TMPDIR=$SCRATCHDIR

• Finally, create the virtual environment and install TensorFlow in it:

module add python/python-3.10.4-gcc-8.3.0-ovkjwzd cuda/cuda-11.2.0-intel-19.0.4-tn4edsz cudnn/cudnn-8.1.0.77-11.2-linux-x64-intel-19.0.4-wx22b5t
python3 -m venv CHOSEN_VENV_DIR
CHOSEN_VENV_DIR/bin/pip install --no-cache-dir tensorflow==2.8.3 tensorflow_addons==0.16.1 tensorflow_probability==0.16.0 numpy==1.23.3 gym==0.26.1 pygame==2.1.2 mujoco==2.2.2 ufal.pybox2d==2.3.10.2

• How to run a GPU computation on MetaCentrum?

First, read the official MetaCentrum documentation: Beginners guide, About scheduling system, GPU clusters.

TL;DR: To run an interactive GPU job with 1 CPU, 1 GPU, 16GB RAM, and 8GB scratch space, run:

qsub -q gpu -l select=1:ncpus=1:ngpus=1:mem=16gb:scratch_local=8gb -I

To run a script in a non-interactive way, replace the -I option with the script to be executed. If you want to run a CPU-only computation, remove the -q gpu and ngpus=1: from the above commands.

### AIC

• How to install TensorFlow dependencies on AIC?

To install CUDA, cuDNN, and Python 3.9 on AIC, you should add the following to your .profile:

export PATH="/lnet/aic/data/python/3.9.9/bin:$PATH"
export LD_LIBRARY_PATH="/lnet/aic/opt/cuda/cuda-11.2/lib64:/lnet/aic/opt/cuda/cuda-11.2/cudnn/8.1.1/lib64:/lnet/aic/opt/cuda/cuda-11.2/extras/CUPTI/lib64:$LD_LIBRARY_PATH"

• How to run a GPU computation on AIC?

First, read the official AIC documentation: Submitting CPU Jobs, Submitting GPU Jobs.

TL;DR: To run an interactive GPU job with 1 CPU, 1 GPU, and 16GB RAM, run:

qrsh -q gpu.q -l gpu=1,mem_free=16G,h_data=16G -pty yes bash -l


To run a script requiring a GPU in a non-interactive way, use

qsub -q gpu.q -l gpu=1,mem_free=16G,h_data=16G -cwd -b y SCRIPT_PATH


If you want to run a CPU-only computation, remove the -q gpu.q and gpu=1, from the above commands.

### ReCodEx

• What files can be submitted to ReCodEx?

You can submit multiple files of any type to ReCodEx. There is a limit of 20 files per submission, with a total size of 20MB.

• What file does ReCodEx execute and what arguments does it use?

Exactly one file with py suffix must contain a line starting with def main(. Such a file is imported by ReCodEx and the main method is executed (during the import, __name__ == "__recodex__").

The file must also export an argument parser called parser. ReCodEx uses its arguments and default values, but it overwrites some of the arguments depending on the test being executed – the template should always indicate which arguments are set by ReCodEx and which are left intact.
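A minimal sketch of that layout (the --seed and --episodes arguments here are illustrative placeholders; the real templates define their own):

```python
import argparse

# ReCodEx imports this file, so the parser must be a module-level variable
# named "parser" for ReCodEx to inspect its arguments and default values.
parser = argparse.ArgumentParser()
parser.add_argument("--seed", default=42, type=int, help="Random seed.")
parser.add_argument("--episodes", default=1000, type=int, help="Training episodes.")

def main(args: argparse.Namespace):
    # During ReCodEx evaluation, __name__ == "__recodex__" and some of the
    # arguments are overwritten depending on the executed test.
    ...

if __name__ == "__main__":
    main(parser.parse_args())
```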

• What are the time and memory limits?

The memory limit during evaluation is 1.5GB. The time limit varies, but it should be at least 10 seconds and at least twice the running time of my solution.

• Do agents need to be trained directly in ReCodEx?

No, you can pre-train your agent locally (unless specified otherwise in the task description).

### Requirements


To pass the exam, you need to obtain at least 60, 75, or 90 points out of the 100-point exam to receive grade 3, 2, or 1, respectively. The exam consists of questions worth 100 points in total, drawn from the list below (the questions are randomly generated, but in such a way that there is at least one question from every lecture). In addition, you can get surplus points from the practicals and at most 10 points for community work (i.e., fixing slides or reporting issues) – but only the points you already have at the time of the exam count. You can take the exam without passing the practicals first.

### Exam Questions

Lecture 1 Questions

• Derive how to incrementally update a running average (how to compute an average of $N$ numbers using the average of the first $N-1$ numbers). [5]

• Describe multi-armed bandits and write down the $\epsilon$-greedy algorithm for solving them. [5]

• Define a Markov Decision Process, including the definition of a return. [5]

• Describe how a partially observable Markov decision process extends the Markov decision process and how the agent is altered. [5]
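For the first question above, the incremental running-average update follows from splitting off the last term of the sum (the standard derivation, cf. RLB Section 2.4):

$$Q_N = \frac{1}{N}\sum_{i=1}^{N} R_i = \frac{1}{N}\Big(R_N + (N-1)\,Q_{N-1}\Big) = Q_{N-1} + \frac{1}{N}\big(R_N - Q_{N-1}\big)$$

so the average of $N$ numbers is obtained from the average of the first $N-1$ numbers by moving it a step of size $\frac{1}{N}$ towards the new value $R_N$.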