Deep Learning – Winter 2016/17

In recent years, deep neural networks have been used to solve complex machine-learning problems. They have achieved significant state-of-the-art results in many areas.

The goal of the course is to introduce deep neural networks, from the basics to the latest advances. The course covers both theory and practical aspects (students will implement and train several deep neural networks capable of achieving state-of-the-art results, for example in named entity recognition, dependency parsing, machine translation, image labeling or in playing video games). No previous knowledge of artificial neural networks is required, but a basic understanding of their core concepts and of machine learning is advisable.

Timespace Coordinates

  • lecture: Czech lecture is held on Monday 15:40 in S9, English lecture on Monday 14:00 in S4
  • practicals: there are two parallel practicals, on Monday 17:20 in SU1 and on Tuesday 12:20 in SU1

Pass Conditions

To complete the course, you need to pass the exam and obtain at least 30 points in the practicals.

  • The list of exam topics is available here, an example exam from 17th January is available here.
  • Points in the practicals are awarded for:
    • home assignments (recommended way of getting all the points)
    • talk (contact me if you are interested)
    • optional project (depending on complexity, up to 30 points can be awarded)

Lecture Outlines

The lecture outlines below include references to study materials. The main study material is the Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville (referred to as DLB).

References to study materials cover all theory required at the exam, and sometimes even more -- the references in italics cover topics not required for the exam.

Date | Content
Oct 10
  • History of Deep Learning [Section 1.2 of DLB]
  • Machine Learning Basics [Section 5.1 of DLB]
  • Brief description of Logistic Regression, Maximum Entropy models and SVM [Sections 5.7.1 and 5.7.2 of DLB]
  • Challenges Motivating Deep Learning [Section 5.11 of DLB]
  • Maximum Likelihood Estimation [Section 5.5 of DLB, excluding equations (5.59)-(5.61)]
Oct 17
  • Capacity, overfitting and underfitting [Section 5.2 of DLB, excluding Section 5.2.1]
  • Hyperparameters and validation sets [Section 5.3 of DLB]
  • Neural network basics (this topic is treated in detail within the lecture NAIL002)
    • Neural networks as graphs [Chapter 6 before Section 6.1 of DLB]
    • Output activation functions [Section 6.2 of DLB, excluding Section 6.2.1.2 and 6.2.2.4]
    • Hidden activation functions [Section 6.3 of DLB, excluding Section 6.3.3]
    • Basic network architectures [Section 6.4 of DLB]
    • Gradient Descent and Stochastic Gradient Descent [Sections 4.3 and 5.9 of DLB]
    • Backpropagation algorithm [Section 6.5 to 6.5.3 of DLB, especially Algorithms 6.2 and 6.3; note that Algorithms 6.5 and 6.6 are used in practice]
  • Common Datasets
Name | Description | Instances
MNIST | Images (28x28, grayscale) of handwritten digits. | 60k
CIFAR-10 | Images (32x32, color) of 10 classes of objects. | 50k
CIFAR-100 | Images (32x32, color) of 100 classes of objects (with 20 defined superclasses). | 50k
ImageNet | Labeled object image database (labeled objects, some with bounding boxes). | 14.2M
ImageNet-ILSVRC | Subset of ImageNet for Large Scale Visual Recognition Challenge, annotated with 1000 object classes and their bounding boxes. | 1.2M
MS COCO (Microsoft Common Objects in Context) | Complex everyday scenes with descriptions (5) and highlighting of objects (91 types). | 2.5M
IAM-OnDB (IAM Online Handwriting Database) | Pen tip movements of handwritten English collected from 221 writers. | 86k words
TIMIT | Recordings of 630 speakers (10 sentences each) of 8 major dialects of American English. | 6.3k sentences
PTB (Penn Treebank) | 2500 stories from Wall Street Journal, annotated with POS tags and parsed into trees. | 1M words
PDT (Prague Dependency Treebank) | Czech sentences annotated on 4 layers (word, morphological, analytical, tectogrammatical). | 1.9M words
UD (Universal Dependencies) | Treebanks of 40+ languages with consistent annotation of lemmas, POS tags, morphological features and dependency trees. | 55 treebanks
Oct 24
  • Softmax with NLL (negative log likelihood) as a loss function [Section 6.2.2.3 of DLB, notably equation (6.30); you should also be able to compute the derivative of softmax + NLL with respect to the inputs of the softmax]
  • Gradient optimization algorithms (this topic is treated in detail within the lecture NAIL002)
    • SGD algorithm [Section 8.3.1 and Algorithm 8.1 of DLB]
    • Learning rate decay [tf.train.exponential_decay]
    • SGD with Momentum algorithm [Section 8.3.2 and Algorithm 8.2 of DLB]
    • SGD with Nesterov Momentum algorithm [Section 8.3.3 and Algorithm 8.3 of DLB]
  • Optimization algorithms with adaptive gradients
    • AdaGrad algorithm [Section 8.5.1 and Algorithm 8.4 of DLB]
    • RMSProp algorithm [Section 8.5.2 and Algorithm 8.5 of DLB]
    • Adam algorithm [Section 8.5.3 and Algorithm 8.7 of DLB]
  • Parameter initialization strategies [Section 8.4 of DLB]
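The derivative asked for in the softmax + NLL topic above can be checked numerically. The following is a minimal sketch in plain Python (all function names are illustrative and not part of the course materials): for logits z and gold class g, the gradient of the NLL with respect to z equals softmax(z) minus the one-hot vector of g, and a finite-difference helper is included only to confirm that.

```python
import math


def softmax(logits):
    """Numerically stable softmax of a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits, gold):
    """Negative log likelihood of the gold class index under softmax(logits)."""
    return -math.log(softmax(logits)[gold])


def nll_grad(logits, gold):
    """Analytic gradient of nll w.r.t. the logits: softmax(z) - one_hot(gold)."""
    probs = softmax(logits)
    return [p - (1.0 if i == gold else 0.0) for i, p in enumerate(probs)]


def numeric_grad(logits, gold, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = []
    for i in range(len(logits)):
        plus = logits[:i] + [logits[i] + eps] + logits[i + 1:]
        minus = logits[:i] + [logits[i] - eps] + logits[i + 1:]
        grad.append((nll(plus, gold) - nll(minus, gold)) / (2 * eps))
    return grad
```

Because the softmax probabilities sum to one, the components of the analytic gradient always sum to zero.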
Oct 31
  • Gradient clipping [Section 10.11.1 of DLB]
  • Regularization [Chapter 7 until Section 7.1 of DLB]
  • Early stopping [Section 7.8 of DLB, without the How early stopping acts as a regularizer part]
  • L1 and L2 regularization [Section 7.1 of DLB]
  • Ensembling [Section 7.11 of DLB]
  • Dropout [Section 7.12 of DLB]
  • Introduction to convolutional networks [Chapter 9 and Sections 9.1-9.3 of DLB]
Nov 07
Nov 14
  • Residual connections in ResNet [Kaiming He et al.: Deep Residual Learning for Image Recognition]
  • Sequence modelling using Recurrent Neural Networks (RNN) [Chapter 10 until Section 10.2.1 (excluding) of DLB]
  • The challenge of long-term dependencies [Section 10.7 of DLB]
  • Long Short-Term Memory (LSTM) [Section 10.10.1 of DLB]
  • Gated Recurrent Unit (GRU) [Section 10.10.2 of DLB]
Nov 21
Nov 28
Dec 06
Dec 13

Study material for Reinforcement Learning is the second edition of Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, available only as a draft.

Dec 20
  • Policy Gradient Methods [Chapter 13, Sections 13.1-13.5 of Sutton's Book]
  • Policy-gradient (aka REINFORCE) Reinforcement Learning Algorithm [Algorithm in Section 13.3 of Sutton's Book; note that the gamma^t on the last line should not be there]
  • REINFORCE with Baseline Reinforcement Learning Algorithm [Algorithm in Section 13.4 of Sutton's Book; note that the gamma^t on the last line should not be there]
  • Actor-Critic Reinforcement Learning Algorithm [Algorithm in Section 13.5 of Sutton's Book; note that the gamma on the last but one line should not be there]
  • Asynchronous Advantage Actor-Critic (aka A3C) Reinforcement Learning Algorithm [Volodymyr Mnih et al.: Asynchronous Methods for Deep Reinforcement Learning]
Jan 09
  • Autoencoders (undercomplete, sparse, denoising) [Chapter 14, Sections 14-14.2.3 of DLB]
  • Deep Generative Models using Differentiable Generator Nets [Section 20.10.2 of DLB]
  • Variational Autoencoders [Section 20.10.3 plus Reparametrization trick from Section 20.9 (but not Section 20.9.1) of DLB]
  • Generative Adversarial Networks [Section 20.10.4 of DLB]

Tasks

Please send me the solved tasks via email (straka@...).

You can send small files (sources) as attachments, but if you need to send large files, please send me links only!

Task | Points | Due To | Task Description
mnist_layers_activations | 3 | Oct 31 15:39

Modify one of the MNIST examples from labs03 so that it uses the following hyperparameters:

  • layers: number of hidden layers (1-3)
  • activation: activation function, either tf.tanh or tf.nn.relu

Then implement hyperparameter search – find the values of hyperparameters resulting in the best accuracy on the development set (mnist.validation) and, using these hyperparameters, compute the accuracy on the test set (mnist.test).
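One way such a search could be organized is sketched below. It assumes a hypothetical callable train_and_evaluate that takes the hyperparameters as keyword arguments, trains a model and returns the development set accuracy; neither that name nor grid_search comes from the labs03 examples.

```python
import itertools


def grid_search(train_and_evaluate, grid):
    """Try every combination in `grid`; return (best_dev_accuracy, best_params).

    `train_and_evaluate` is a hypothetical callable: it receives the
    hyperparameters as keyword arguments and returns development accuracy.
    """
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        dev_accuracy = train_and_evaluate(**params)
        if best is None or dev_accuracy > best[0]:
            best = (dev_accuracy, params)
    return best
```

For this task the grid would be something like {"layers": [1, 2, 3], "activation": ["tanh", "relu"]}. The test set is then evaluated once, with the returned best parameters only.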

mnist_training | 2 | Nov 07 15:39

Using the MNIST example labs03/1-mnist.py, try the following optimizers:

  • standard SGD (tf.train.GradientDescentOptimizer), with batch sizes (10,50) and learning rates (0.01,0.001,0.0001)
  • SGD with exponential learning rate decay (use tf.train.exponential_decay), with batch sizes (10,50) and the following (starting learning rate, final learning rate) pairs: (0.01,0.001), (0.01,0.0001), (0.001, 0.0001)
  • SGD with momentum (tf.train.MomentumOptimizer), with batch sizes (10,50), learning rates (0.01,0.001,0.0001) and momentum 0.9
  • Adam optimizer (tf.train.AdamOptimizer), with batch sizes (10,50) and learning rates (0.002,0.001,0.0005)

Report the development set accuracy for all the listed possibilities.
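For the exponential decay variant, the decay rate has to be derived from the (starting, final) learning rate pair. The sketch below uses the non-staircase formula documented for tf.train.exponential_decay, learning_rate * decay_rate ^ (global_step / decay_steps); the helper names and the total number of training steps are assumptions that depend on your training loop.

```python
def decay_rate_for(start_lr, final_lr, total_steps, decay_steps):
    """Decay rate for which the learning rate reaches final_lr after total_steps."""
    return (final_lr / start_lr) ** (decay_steps / total_steps)


def decayed_learning_rate(start_lr, global_step, decay_steps, decay_rate):
    """Non-staircase exponential decay: start_lr * decay_rate ** (step / decay_steps)."""
    return start_lr * decay_rate ** (global_step / decay_steps)
```

For example, with a hypothetical run of 10000 steps and decay_steps=1000, decay_rate_for(0.01, 0.001, 10000, 1000) gives the rate to pass as the decay_rate argument.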

mnist_dropout | 2 | Nov 14 15:39

Using the MNIST example from labs03/1-mnist.py, implement dropout (using tf.nn.dropout). During training, allow specifying dropout probability for the input layer and for the hidden layer separately. Then perform hyperparameter search using:

  • input layer dropout keep probability (0.8,0.9,1)
  • hidden layer dropout keep probability (0.8,0.9,1)

and report both development set accuracy for all hyperparameters and test set accuracy for the best hyperparameters.
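To make the meaning of the keep probability concrete, here is a plain-Python sketch of inverted dropout, the scheme tf.nn.dropout follows: each value is kept with probability keep_prob and the kept values are scaled by 1 / keep_prob. The function is illustrative only and is not a replacement for the TensorFlow op.

```python
import random


def dropout(values, keep_prob, rng=random):
    """Inverted dropout on a list of floats.

    Each value is kept with probability keep_prob and scaled by 1 / keep_prob,
    so the expected value of every element is unchanged.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]
```

With keep_prob equal to 1 the input passes through unchanged, which is the setting to use when measuring development and test accuracy.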

gym_cartpole_supervised | 3 | Nov 14 15:39

Solve the CartPole-v1 environment from the OpenAI Gym using supervised learning. A very small amount of training data is available in the labs04/gym-cartpole-data.txt file, each line containing one observation (four space-separated floats) and a corresponding action (the last space-separated integer).
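A minimal sketch of reading data in the format just described (four floats followed by one integer per line). The function names are made up for illustration, and the parser works on the file contents as a string so it makes no assumption about where the file is stored.

```python
def parse_cartpole_line(line):
    """Split one line into (observation, action): four floats and one integer."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError("expected 4 floats and 1 action, got: %r" % line)
    observation = [float(x) for x in fields[:4]]
    action = int(fields[4])
    return observation, action


def parse_cartpole_data(text):
    """Parse all non-empty lines; return (observations, actions) as two lists."""
    pairs = [parse_cartpole_line(l) for l in text.splitlines() if l.strip()]
    observations = [obs for obs, _ in pairs]
    actions = [act for _, act in pairs]
    return observations, actions
```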

The solution to this task should be a model which passes evaluation on random inputs. This evaluation is performed by running the labs04/gym-cartpole-evaluate.py model_file command. (You can also pass --render argument to render the evaluations interactively.) In order to pass, you should achieve an average reward of at least 475 on 100 episodes.

In order to save the model, look at the labs04/gym-cartpole-save.py, which saves a model performing random guesses.

mnist_conv | 3-5 | Nov 21 15:39

Try achieving as high accuracy on the MNIST test set as possible (you can start from labs03/1-mnist.py, but you can modify it freely). Nevertheless, remember that you should not perform hyperparameter search on the test set (when you design network architecture, you should perform hyperparameter search on the development set, and measure the test set accuracy only with the best hyperparameters; and optionally repeat with modified architecture). You will be awarded points according to the accuracy achieved:

  • 99.1% test set accuracy: 3 points
  • 99.25% test set accuracy: 4 points
  • 99.4% test set accuracy: 5 points

You should use convolution (see tf.contrib.layers.convolution2d) optionally with batch normalization (pass tf.contrib.layers.batch_norm as normalizer_fn argument of convolution2d). If you are unsure how, you can start with the following architecture (it is by no means the best solution, it is just a small network inspired by larger ImageNet processing networks): 3x3 convolution with ReLU and 8 filters, 3x3 convolution with ReLU and 8 filters, 3x3 maxpool with stride 2, 3x3 convolution with ReLU and 15 filters, 3x3 convolution with ReLU and 15 filters, 3x3 maxpool with stride 2, flatten (or possibly more convolutions and one maxpool), fully connected layer with 10 outputs and softmax (no more ReLU).
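To see how large the flattened layer of the suggested architecture is, the spatial sizes can be traced by hand. The sketch below uses the usual TensorFlow size rules ("SAME" gives ceil(size / stride), "VALID" gives floor((size - kernel) / stride) + 1). Which padding each layer actually uses is an assumption here; check the defaults of the functions you call.

```python
import math


def output_size(size, kernel, stride, padding):
    """Spatial output size of one convolution or pooling layer."""
    if padding == "SAME":
        return math.ceil(size / stride)
    if padding == "VALID":
        return (size - kernel) // stride + 1
    raise ValueError("padding must be 'SAME' or 'VALID'")


def flattened_features(input_size, layers, padding="SAME"):
    """Trace (kind, kernel, stride, filters) layers; return the flatten size.

    Assumes a square single-channel input and the same padding in all layers.
    """
    size, channels = input_size, 1
    for kind, kernel, stride, filters in layers:
        size = output_size(size, kernel, stride, padding)
        if kind == "conv":
            channels = filters  # pooling keeps the number of channels
    return size * size * channels


# The suggested starting architecture, up to the flatten step.
SUGGESTED = [
    ("conv", 3, 1, 8), ("conv", 3, 1, 8), ("pool", 3, 2, None),
    ("conv", 3, 1, 15), ("conv", 3, 1, 15), ("pool", 3, 2, None),
]
```

Under the all-"SAME" assumption, flattened_features(28, SUGGESTED) traces 28 → 28 → 28 → 14 → 14 → 14 → 7, so the final fully connected layer sees 7 * 7 * 15 = 735 inputs.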

To solve this task, send me source code I can execute (using python source.py) which trains a neural network and prints the test set accuracy on standard output (in less than a day :-).

resnet_subcaltech | 5 | Nov 28 15:39

[This task is intended mostly for people who are interested in image processing; you can pass the practicals easily without working on this task.]

Implement a network which will perform image classification on the Sub-Caltech50 dataset (this dataset was created for this task as a subset of Caltech101). The dataset contains images classified into 50 classes and has an explicit train/test partitioning (it does not have an explicit development partition; use some amount of training data if you need one).

In order to implement the image classification, use pre-trained ResNet50 network to extract image features (we do not use ResNet101 nor ResNet152 as they are more computationally demanding). To see how ResNet50 can be used to classify an image on the ImageNet classes, see the labs05/resnet50.py. When using the ResNet50 to extract features, pass num_classes=None when creating the network, and the network will return 2048 image features instead of logits of 1000 classes.

The goal of this task is to train an image classifier using the image features precomputed by ResNet50, and report the testing accuracy. The best course of action is probably to precompute the image features once (for both training and testing set) and save them to disc, and then train the classifier using the precomputed features. As for the classifier model, it is probably enough to create a fully connected layer to 50 neurons with softmax (without ReLU).

Bonus: if you are interested, you can finetune the classifier including the ResNet50 and get additional points for it. After you train the classifier as described above, put both the ResNet50 and the pretrained classifier in one Graph, and continue training including the ResNet50 (you need to pass is_training=True during ResNet construction).

sequence_generation | 4 | Nov 28 15:39

Implement a network which performs sequence generation via LSTM/GRU. Note that for training purposes, we will be using a very low-level approach.

The goal is to predict the labs06/international-airline-passengers.tsv sequence. Start with the labs06/sequence-generation-skeleton.py file, which loads the data and supports producing image summaries with the predicted sequence.

For training, construct an unrolled series of LSTM/GRU cells, using training portion of gold data as input, predicting the next value in the training sequence (the LSTM/GRU output contains several numbers, so use additional linear layer with one output, and MSE loss). In every epoch, train the same sequence several times (500 is the default in the script).

For prediction, use the last output state from the training portion of the network, and construct another unrolled series of LSTM/GRU cells, this time using the prediction from previous step as input.
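The difference between the two unrolled networks can be sketched without any TensorFlow code. In the sketch below, step is a hypothetical callable (input, state) -> (prediction, new_state) standing in for one LSTM/GRU cell followed by the linear output layer; it is not part of the skeleton.

```python
def training_pairs(sequence):
    """(input, target) pairs for next-value prediction: x[t] -> x[t + 1]."""
    return list(zip(sequence[:-1], sequence[1:]))


def generate(step, state, first_input, length):
    """Free-running generation: each prediction becomes the next input.

    `step` is a hypothetical callable (input, state) -> (prediction, new_state).
    `state` would be the last state from the training portion of the network.
    """
    outputs, current = [], first_input
    for _ in range(length):
        current, state = step(current, state)
        outputs.append(current)
    return outputs
```

During training the inputs always come from the gold data (training_pairs), while during prediction only the first input and the initial state do.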

Report results of both LSTM and GRU, each with 8, 10 and 12 cells (by sending the logs of the 6 runs).

uppercase_letters | 4 | Dec 05 15:39

Implement a network which is given an English sentence in lowercase letters and tries to uppercase the appropriate letters. Use the labs06/en-ud-train.txt as training data, labs06/en-ud-dev.txt as development data and labs06/en-ud-test.txt as testing data.

Start with the labs06/uppercase-letters-skeleton.py file, which loads the data, remaps characters to integers, generates random batches and saves summaries.

Represent letters either as one-hot vectors (tf.one_hot) or using trainable embeddings (tf.nn.embedding_lookup), and use a bidirectional LSTM/GRU (using tf.nn.bidirectional_dynamic_rnn) combined with a linear classification layer with softmax. Report test set accuracy. For your information, a straightforward approach with a small hyperparameter search on development data has a test accuracy of 97.63%.
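The skeleton already prepares the data, so the following is only an illustration of what the network inputs and targets look like: integer ids of the lowercased characters, and a 0/1 label per character saying whether it was uppercase in the original text. The helper names and the choice of id 0 for unknown characters are assumptions.

```python
def build_alphabet(sentences):
    """Assign ids 1, 2, ... to lowercased characters in first-seen order."""
    alphabet = {}
    for sentence in sentences:
        for ch in sentence.lower():
            alphabet.setdefault(ch, len(alphabet) + 1)
    return alphabet


def make_example(sentence, alphabet):
    """Return (input_ids, labels) for one cased sentence.

    input_ids encode the lowercased characters (0 for characters missing
    from `alphabet`); labels are 1 where the original character was uppercase.
    """
    input_ids = [alphabet.get(ch.lower(), 0) for ch in sentence]
    labels = [1 if ch.isupper() else 0 for ch in sentence]
    return input_ids, labels
```

The classification layer then predicts one of two classes for every position of input_ids.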

tagger | 1-7 | Dec 12 15:39

Implement a network performing part-of-speech tagging for Czech and English. The data (and word embeddings precomputed using word2vec) are available here. The files are stored in vertical format – each word is on a separate line, with an empty line denoting the end of a sentence. Each word line contains three tab-separated values: word form, lemma and tag (you can ignore the lemmas in this task). However, note that only word forms are available in the test data. You can load the dataset using the labs08/morpho_dataset.py module. You should start with the labs08/tagger-skeleton.py file.
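The labs08/morpho_dataset.py module handles loading, so the sketch below is only meant to make the vertical format concrete. It is an independent, simplified reader with a made-up name, and it fills in None for the lemma and tag when a line contains only a word form (as in the test data).

```python
def read_vertical(lines):
    """Group vertical-format lines into sentences of (form, lemma, tag) tuples.

    An empty line ends a sentence. Missing lemma or tag columns become None.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        fields += [None] * (3 - len(fields))  # pad form-only lines
        current.append(tuple(fields[:3]))
    if current:
        sentences.append(current)
    return sentences
```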

This task has several subtasks, you can solve only some of them if you want. The network in each subtask is a bidirectional GRU (with dimension 100), only the word embeddings (always with dimension 100) differ:

  • learned_we (1 point): use randomly initialized word embeddings, which you update during training
  • updated_pretrained_we (1 point): use pretrained word embeddings, which you further update during training. The pretrained embeddings are in the original data and can be loaded using the labs08/word_embeddings.py module.
  • only_pretrained_we (1 point): use pretrained word embeddings, which you do not update during training
  • char_rnn (1 point): use character-level embeddings computed using bidirectional GRU on the word letters (beginning-of-word and end-of-word characters are not needed; pass including_charseqs=True to MorphoDataset.next_batch to get character-level information)
  • char_conv (1 point): compute word embeddings as convolution of filters followed by a max-pooling layer (beginning-of-word and end-of-word characters are needed), using 25 filters of width 2, 25 filters of width 3, 25 filters of width 4 and 25 filters of width 5 (pass including_charseqs=True to MorphoDataset.next_batch to get character-level information)
  • charagram (2 points): compute word embeddings as average of embeddings of character n-grams present in the word (beginning-of-word and end-of-word characters are needed), for n in (2,3,4)
  • English competition (1-3): using any deep learning approach which uses only the data in the provided archive, try achieving highest accuracy on English testing data. The solution to this subtask is both the source code of your network and the annotated testing data, which will be evaluated using the labs08/morpho_evaluate.py script. The points will be awarded according to the accuracy reached – three best submissions get 3 points, next three best submissions get 2 points and next three submissions get 1 point.
    • Ondřej Hübsch (95.65) [3 points]
    • Martin Hora (94.48) [3 points]
    • Dušan Variš (93.83) [2 points]
    • Peter Krčah (92.28) [3 points]
    • Zafod (90.27) [2 points]
    • Kuba (89.79) [2 points]
  • Czech competition (1-3): using any deep learning approach which uses only the data in the provided archive, try achieving highest accuracy on Czech testing data. The solution to this subtask is both the source code of your network and the annotated testing data, which will be evaluated using the labs08/morpho_evaluate.py script. The points will be awarded according to the accuracy reached – three best submissions get 3 points, next three best submissions get 2 points and next three submissions get 1 point.
    • Ondřej Hübsch (96.08) [3 points]
    • Peter Krčah (95.56) [3 points]
    • Martin Hora (95.30) [3 points]
    • Dušan Variš (95.16) [2 points]
    • Kuba (86.92) [2 points]
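For the charagram subtask of the tagger task, the n-gram extraction and averaging can be sketched in plain Python. The boundary markers "<" and ">" and the dictionary of n-gram vectors are assumptions made for illustration; in the real network the n-gram embeddings would be trainable parameters.

```python
def char_ngrams(word, sizes=(2, 3, 4), bow="<", eow=">"):
    """All character n-grams of the word wrapped in boundary markers."""
    marked = bow + word + eow
    return [marked[i:i + n] for n in sizes for i in range(len(marked) - n + 1)]


def charagram_embedding(word, ngram_vectors, dim):
    """Average the vectors of known n-grams; zero vector if none is known.

    `ngram_vectors` is a hypothetical dict from n-gram to a list of floats.
    """
    vectors = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

For the word "ab", char_ngrams yields "<a", "ab", "b>", "<ab", "ab>" and "<ab>".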
lemmatizer | 2-6 | Dec 19 15:39

Implement a network performing lemmatization for Czech and English. Use the data from the previous task. Note that the lemmas are all in lowercase.

You should start with the labs09/lemmatizer-skeleton.py file.

This task has several subtasks, you can solve only some of them if you want. In every subtask, represent a form using concatenation of final states of bidirectional GRU run on the form's characters.

  • individual_decoder (2 points): generate every lemma independently, using GRU as a decoder, producing one lemma letter at a time (use labs09/contrib_seq2seq.py as a dynamic rnn decoder, see labs09/rnn_example_decoder.py for a simple usage)
  • individual_attention_decoder (2 points): as in individual_decoder, but use attention
  • combined_attention_decoder (2 points): use the same approach as in the individual_attention_decoder, but use additional sentence-level bidirectional GRU (i.e., the form representations are processed by a bidirectional GRU and the results are used for the lemma generation)
  • English competition (1-3): using any deep learning approach which uses only the data in the provided archive, try achieving highest accuracy on English testing data. The solution to this subtask is both the source code of your network and the annotated testing data, which will be evaluated using the labs08/morpho_evaluate.py script. The points will be awarded according to the accuracy reached – three best submissions get 3 points, next three best submissions get 2 points and next three submissions get 1 point.
    • Krteček (95.68) [3 points]
    • Peter Krčah (65.23) [3 points]
    • Dušan Variš (62.95) [2 points]
  • Czech competition (1-3): using any deep learning approach which uses only the data in the provided archive, try achieving highest accuracy on Czech testing data. The solution to this subtask is both the source code of your network and the annotated testing data, which will be evaluated using the labs08/morpho_evaluate.py script. The points will be awarded according to the accuracy reached – three best submissions get 3 points, next three best submissions get 2 points and next three submissions get 1 point.
    • Dušan Variš (97.45) [2 points]
    • Krteček (83.49) [3 points]
    • Peter Krčah (20.57) [3 points]
nli | 3-15 | Jan 09 15:39

Try solving the Native Language Identification task with highest accuracy possible, ideally beating current state-of-the-art.

The dataset is available under a restrictive license, so the details about how to obtain it have been sent by email to the course participants. If you have not received it, please write me an email and I will send you the instructions directly.

Your goal is to achieve highest accuracy on the test data. The dataset you have does not contain test annotations, so you cannot measure test accuracy directly. Instead, you should measure development accuracy and finally submit test annotations for the model with best development accuracy.

You can load the dataset using the labs09/nli_dataset.py module. You can start with the labs09/nli-skeleton.py file, which uses the labs09/nli_dataset.py module to load the data, passes the data to the network and finally produces test annotations using the model achieving highest development accuracy.

In order to solve the task, send me the test set annotations and also the source code. I will evaluate the test set annotations using the labs09/nli_evaluate.py script. Every working solution will get 3 points, and you will get additional points according to your test set accuracy – the best solution will get a total of 15 points, the next one 14, and so on. Also, everyone beating state-of-the-art will get a total of 15 points.

    • Peter Krčah (80.73)
    • MET + kokrous (71.18)
    • Tom Kocmi (71.18)
    • Miroslav Olšák (70.18)
    • Jan Hrach (51.36)
monte_carlo | 2 | Jan 02 15:39

Implement Monte Carlo reinforcement learning algorithm, computing exact average for every state-action pair. Start with the labs10/monte_carlo-skeleton.py module.

You should be able to reach average reward of 475 on CartPole-v1 environment (using 500 steps).
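The "exact average for every state-action pair" can be maintained incrementally, without storing all returns. The class below is a minimal every-visit sketch with made-up names; it assumes states are hashable (for CartPole this means the continuous observations have been discretized somehow) and is independent of the skeleton.

```python
from collections import defaultdict


class MonteCarloAverages:
    """Exact running mean of returns for every (state, action) pair."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.q = defaultdict(float)

    def update_episode(self, episode, gamma=1.0):
        """Update the means from one episode.

        `episode` is a list of (state, action, reward) tuples in time order.
        """
        ret = 0.0
        for state, action, reward in reversed(episode):
            ret = reward + gamma * ret  # return from this step onwards
            key = (state, action)
            self.counts[key] += 1
            # Incremental form of the arithmetic mean of all returns seen so far.
            self.q[key] += (ret - self.q[key]) / self.counts[key]
```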

q_learning | 2 | Jan 02 15:39

Implement Q-learning algorithm. Start with the labs10/q_learning-skeleton.py module.

You should be able to reach average reward of 9.7 on Taxi-v1 environment and -150 on MountainCar-v0 environment.
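As a reminder of the update rule, a single tabular Q-learning step can be written as below. This is a sketch only: the table is a plain list of per-state lists, and alpha (learning rate) and gamma (discount) are parameters you would choose yourself; none of the names come from the skeleton.

```python
def q_learning_update(q, state, action, reward, next_state, done, alpha, gamma):
    """One tabular Q-learning step; `q` is a list of per-state action-value lists.

    Moves q[state][action] towards reward + gamma * max_a q[next_state][a],
    or towards the reward alone if the episode ended.
    """
    target = reward if done else reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]
```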

q_network | 2 | Jan 02 15:39

Implement Q-learning algorithm, approximating Q-value using a simple linear network. Start with the labs10/q_network-skeleton.py module.

You should be able to reach average reward of 9.7 on Taxi-v1 environment.

reinforce | 2 | Jan 09 15:39

Implement REINFORCE algorithm, representing a policy using a neural network with a hidden layer. Start with the labs10/reinforce-skeleton.py module.

You should be able to reach average reward of 475 on CartPole-v1 environment (using 500 steps) and -100 on Acrobot-v1 environment.
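The quantities REINFORCE needs from one episode can be sketched in plain Python: the discounted return from every time step, and a loss that weights the log-probability of each taken action by that return. The function names are illustrative; in the actual solution the log-probabilities come from the policy network and the loss is minimized by TensorFlow.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every time step t."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(log_probs, returns):
    """Sum over the episode of -G_t * log pi(a_t | s_t)."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))
```

The loss uses no extra gamma^t factor, in line with the note on the REINFORCE algorithm in the lecture outline.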

reinforce_with_baseline | 2 | Jan 09 15:39

Implement REINFORCE algorithm with value function as a baseline, representing both a policy and a value function using (independent) neural networks with a hidden layer. Start with the labs11/reinforce_with_baseline-skeleton.py module.

You should be able to reach average reward of 490 on CartPole-v1 environment (using 500 steps) and -90 on Acrobot-v1 environment.

To observe the effect of the baseline, try comparing your solution to basic reinforce using batch of size 1.
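With a baseline, the return in the policy loss is replaced by the difference between the return and the predicted value of the state, while the value network is trained by regression towards the observed returns. A minimal sketch with illustrative names, assuming the baseline values are treated as constants in the policy loss:

```python
def advantages(returns, baseline_values):
    """Return G_t - b(s_t) for every time step."""
    return [g - b for g, b in zip(returns, baseline_values)]


def policy_loss(log_probs, returns, baseline_values):
    """REINFORCE loss with the return replaced by the advantage."""
    advs = advantages(returns, baseline_values)
    return -sum(lp * adv for lp, adv in zip(log_probs, advs))


def baseline_loss(returns, baseline_values):
    """Mean squared error between predicted values and observed returns."""
    errors = [(g - b) ** 2 for g, b in zip(returns, baseline_values)]
    return sum(errors) / len(errors)
```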

reinforce_with_baseline_pixels | 3 | Jan 09 15:39

Note that this task is experimental and may not be easily solvable!

Modify the solution of reinforce_with_baseline to use pixel inputs. Start with the labs11/reinforce_with_baseline_pixels-skeleton.py module.

You will get the points if you can show any improvement at all, reaching for example an average reward of 50 on CartPole-v1.

Note that according to papers, it could take hours for the network to converge. Also note that you probably have to use some kind of epsilon-greedy policy (otherwise the policy network usually converges too fast to a wrong solution; in some papers [for example in Asynchronous Methods for Deep Reinforcement Learning] entropy regularization term is used instead).
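Both exploration aids mentioned above are easy to sketch. The first function is one possible reading of "epsilon-greedy" for a stochastic policy (an assumption, not the skeleton's definition): with probability epsilon a uniformly random action is taken, otherwise an action is sampled from the policy. The second computes the entropy of the policy distribution, which is the quantity an entropy regularization term would encourage to stay large.

```python
import math
import random


def epsilon_greedy_sample(probabilities, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon,
    otherwise sample an action from the policy distribution."""
    if rng.random() < epsilon:
        return rng.randrange(len(probabilities))
    return rng.choices(range(len(probabilities)), weights=probabilities)[0]


def entropy(probabilities):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)
```

A policy that has collapsed to a single action has entropy 0, while a uniform policy over n actions has entropy log(n).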

Mean 1000-episode rewards of submitted solutions:

    • Matěj Kocián: 78.8
    • Bedřich Pišl: 46
a3c | 3 | Jan 09 15:39

Note that this task is experimental and may not be easily solvable!

Try implementing Asynchronous Advantage Actor Critic algorithm from Asynchronous Methods for Deep Reinforcement Learning paper. You can start with the labs11/a3c-skeleton.py module.

You will get the points if you can show minor improvement, reaching an average reward of at least 100 on CartPole-v1. Do not hesitate to send the solution even if it is unstable.

Note that the network frequently diverges – in addition to gradient clipping (present in the skeleton), you could use exponential learning rate decay, or some entropy regularization term (see the paper).

Mean 1000-episode rewards of submitted solutions:

    • Matěj Kocián: 500
vae | 3 | Feb 19 23:59

Implement simple Variational Autoencoder which generates MNIST digits. Start with labs12/vae-skeleton.py and proceed according to the instructions.

Note that the skeleton automatically generates several random images every 1000 training batches and stores them in the log dir (i.e., they are not accessible in TensorBoard). The generated images are random in the upper part and interpolating from left to right (and if dim(z) is 2, also from top to bottom) in the lower part of the generated summary.
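The core pieces of a VAE with a standard normal prior can be sketched in plain Python. The sketch assumes the encoder outputs a mean and a log-variance per latent dimension, which is a common parametrization but not necessarily the one used in the skeleton; all function names are illustrative.

```python
import math
import random


def reparametrize(mu, log_var, rng=random):
    """z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def interpolate(z_start, z_end, steps):
    """Evenly spaced latent vectors from z_start to z_end (steps >= 2)."""
    return [[a + (b - a) * i / (steps - 1) for a, b in zip(z_start, z_end)]
            for i in range(steps)]
```

Decoding the vectors returned by interpolate gives a row of images like the interpolating part of the generated summary.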

Bonus: If you would like to experiment with a more complicated dataset, you can use CIFAR-10 Cars, which are images of cars from the CIFAR-10 dataset, cropped and desaturated, and stored in MNIST format – therefore, in order to use it, after unpacking just pass --dataset cifar-cars to the labs12/vae-skeleton.py. Note that you will probably need a more complicated encoder (probably using convolutions) and decoder (larger hidden layers, maybe even more). If you are able to generate car images which look better than with plain labs12/vae-skeleton.py, you will get 2 additional points.

gan | 3 | Feb 19 23:59

Implement simple Generative Adversarial Network which generates MNIST digits. Start with labs12/gan-skeleton.py and proceed according to the instructions.

Note that the skeleton automatically generates several random images every 1000 training batches and stores them in the log dir (i.e., they are not accessible in TensorBoard). The generated images are random in the upper part and interpolating from left to right (and if dim(z) is 2, also from top to bottom) in the lower part of the generated summary.
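The two GAN objectives can be written down for batches of discriminator outputs. The sketch assumes the discriminator outputs probabilities strictly between 0 and 1, and for the generator it uses the non-saturating variant (maximizing log D(G(z))) rather than the original minimax form; whether the skeleton expects this variant is an assumption.

```python
import math


def discriminator_loss(d_real, d_fake):
    """-mean(log D(x)) - mean(log(1 - D(G(z)))).

    d_real and d_fake are lists of discriminator probabilities for real and
    generated images, each strictly between 0 and 1.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -real_term - fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: -mean(log D(G(z)))."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)
```

When the discriminator cannot tell the two sources apart and outputs 0.5 everywhere, the discriminator loss equals 2 log 2 and the generator loss equals log 2.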

If you would like to experiment with more complicated dataset, you can use CIFAR-10 Cars, which are images of cars from the CIFAR-10 dataset, cropped and desaturated, and stored in MNIST format – therefore, in order to use it, after unpacking just pass --dataset cifar-cars to the labs12/gan-skeleton.py.