SIS code:
Semester: winter
E-credits: 4
Examination: C+Ex
Instructor:

# NPFL099 – Dialogue Systems

This is the new course for the '21/22 Fall semester. You can find slides from last year on the archived old page.

This course presents advanced problems and the current state of the art in the field of dialogue systems, voice assistants, and conversational systems (chatbots). After a brief introduction to the topic, the course will focus mainly on the application of machine learning – especially deep learning/neural networks – in the individual components of the traditional dialogue system architecture as well as in end-to-end approaches (joining multiple components together).

This course is a follow-up to the course NPFL123 Dialogue Systems, but can be taken independently – important basics will be repeated. All required deep learning concepts will be explained, but only briefly, so some machine learning background is recommended.

### Logistics

#### Language

The course will be taught in English, but we're happy to explain in Czech, too.

#### Time & Place

In-person lectures and labs take place in the room S10 (Malá Strana, 1st floor).

• Lectures: Mon 15:40
• Labs: Mon 17:20 (every other week, starts on 11 October)

In addition, we plan to stream both lectures and lab instruction over Zoom and make the recordings available on YouTube (under a private link, on request). We'll do our best to provide a useful experience, but note that audio quality might not be ideal.

• Zoom meeting ID: 953 7826 3918
• Password is the SIS code of this course (capitalized)

If you can't access Zoom, email us or message us on Slack.

#### Passing the course

To pass this course, you will need to take an exam and complete the lab homework assignments, which will amount to training an end-to-end neural dialogue system and writing a report on it. See more details here.

### Topics covered

• Brief introduction into dialogue systems
  • dialogue systems applications
  • basic components of dialogue systems
  • knowledge representation in dialogue systems
  • data and evaluation
• Language understanding (NLU)
  • semantic representation of utterances
  • statistical methods for NLU
• Dialogue management
  • dialogue representation as a (Partially Observable) Markov Decision Process
  • dialogue state tracking
  • action selection
  • reinforcement learning
  • user simulation
  • deep reinforcement learning (using neural networks)
• Response generation (NLG)
  • introduction to NLG, basic methods (templates)
  • generation using neural networks
• End-to-end dialogue systems (one network to handle everything)
  • sequence-to-sequence systems
  • memory/attention-based systems
  • pretrained language models
• Open-domain systems (chatbots)
  • generative systems (sequence-to-sequence, hierarchical models)
  • information retrieval
  • ensemble systems
• Multimodal systems
  • component-based and end-to-end systems
  • image classification
  • visual dialogue

### Lectures

PDFs with lecture slides will appear here shortly before each lecture (more details on each lecture are on a separate tab). You can also check out last year's lecture slides.

### Literature

A list of recommended literature is on a separate tab.

## Lectures

### 1. Introduction

4 October Slides Questions

• What are dialogue systems
• Common usage areas
• Closed domain, multi-domain, open domain
• System vs. user initiative in dialogue
• Standard dialogue systems components
• Research forefront
• TTS audio examples: formant, concatenative, HMMs, neural

### 2. Data & Evaluation

11 October Slides Dataset Exploration Questions

• Types of dialogue datasets
• Dataset splits
• Intrinsic vs. extrinsic evaluation
• Objective vs. subjective evaluation
• Evaluation metrics for dialogue components

### 3. Neural Nets Basics

18 October Slides Questions

• machine learning as function approximation
• machine learning problems (classification, regression, structured prediction)
• input features (embeddings)
• network shapes -- feed forward, CNNs, RNNs, attention, Transformer

### 4. Training Neural Nets

25 October Slides DailyDialogue Loader Questions

• supervised training: gradient descent, backpropagation, cost
• learning rate, schedules & optimizers
• self-supervised: autoencoding, language modelling
• unsupervised: GANs, clustering
• reinforcement learning (short intro)

### 5. Natural Language Understanding

1 November Slides Questions

• problems of NLU
• common meaning representations -- intents + slots
• delexicalization, simple approaches
• various neural approaches to NLU (network shapes & training tasks)
• joint intent & slot models
• pretrained models, less supervision

### 6. Dialogue Management (1)

• dialogue state tracking & action selection/policy
• dialogue state, belief state
• static & dynamic trackers, various approaches
• introduction to policies
• reinforcement learning, user simulator

### 7. Dialogue Management (2)

15 November Slides Questions

• reinforcement learning, value function
• actor, critic, actor-critic
• on-policy & off-policy
• Deep Q Networks
• Policy gradient methods (REINFORCE, Actor-critic)
• learned rewards
• hierarchical RL

### 8. Language Generation

22 November Slides MultiWOZ 2.2 Loader Questions

• template-based generation
• NLG with RNN/transformer, pointer network, pretrained LMs
• decoding approaches
• data treatment
• reranking, combination with NLU
• NLG with content planning

### 9. End-to-end Models

29 November Slides Questions

• pipeline vs. single model, supervised vs. RL training
• models based on joining components
• seq2seq-based approaches, with pretrained LMs
• latent action spaces
• soft DB lookups, memory networks

### 10. Chatbots

6 December Slides Finetuning on MultiWOZ Questions

• rule-based, generative, retrieval chatbots
• problems with seq2seq models
• hybrid/ensemble chatbots
• Alexa Prize

### 11. Linguistics & Ethics

13 December Slides Questions

• dialogue phenomena: turn-taking, grounding, speech acts, conversational maxims
• prediction & alignment in dialogue
• ethical considerations of NLP systems
• robustness, bias, safety
• privacy

### 12. Multimodal systems

• Modalities in dialogue
• Standard virtual agents, embodied systems
• Convolutional networks and transformers for vision
• Neural visual dialogue, visual question answering

## Homework Assignments

There will be 7 homework assignments, typically for a maximum of 10 points (the last one will be for 20 points). Please see details on grading and deadlines on a separate tab.

Assignments should be submitted via Git – see instructions on a separate tab.

### 1. Dataset Exploration

Presented: 11 October, Deadline: 27 October

1. Find out (and mention in your report):
• What kind of data it is (domain, modality)
• How it was collected
• What kind of dialogue system or dialogue system component it's designed for
• What kind of annotation is present (if any at all) and how it was obtained (human/automatic)
• What format is it stored in

Here you can use the dataset description/paper that came out with the data. The papers are linked from the dataset webpages or from here. If you can't find a paper, ask us and we'll try to help.

2. Measure (and enter into your report):
• Total data length (dialogues, turns, sentences, words)
• Mean/std dev dialogue lengths (dialogues, turns, sentences, words)
• Vocabulary size
• User/system entropy (or just overall entropy, if no user/system distinction can be made)

Here you should use your own programming skills.
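As a starting point, the measurements listed above could be computed along these lines. This is only a minimal sketch: it assumes each dialogue is already loaded as a list of turn strings, uses naive whitespace tokenization, and interprets "entropy" as the Shannon entropy of the unigram word distribution. The function name `corpus_stats` and the returned keys are made up for this illustration.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def corpus_stats(dialogues):
    """Basic statistics for a list of dialogues (each a list of turn strings)."""
    # Naive whitespace tokenization; swap in a proper tokenizer for real data.
    turn_tokens = [[turn.lower().split() for turn in d] for d in dialogues]
    words_per_dialogue = [sum(len(t) for t in d) for d in turn_tokens]
    turns_per_dialogue = [len(d) for d in dialogues]

    counts = Counter(w for d in turn_tokens for t in d for w in t)
    total = sum(counts.values())
    # Shannon entropy (in bits) of the unigram word distribution.
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())

    return {
        "dialogues": len(dialogues),
        "turns": sum(turns_per_dialogue),
        "words": total,
        "mean_turns": mean(turns_per_dialogue),
        "std_turns": pstdev(turns_per_dialogue),
        "mean_words": mean(words_per_dialogue),
        "std_words": pstdev(words_per_dialogue),
        "vocab_size": len(counts),
        "entropy_bits": entropy,
    }
```

To get separate user and system entropy, the same function can be called on the user turns and the system turns separately, provided the dataset marks who is speaking.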

3. Have a closer look at the data and try to form an impression -- does the data look natural? How difficult do you think this dataset will be to learn from? How usable will it be in an actual system? Do you think there's some kind of problem or limitation with the data? Write a short paragraph about this in your report.

#### Things to submit:

• A short summary detailing all of your findings (basic info, measurement, impressions) in Markdown as hw01/description.md.
• Your code for analyzing the data as hw01/analysis.py or hw01/analysis.ipynb.

#### Datasets to select from

##### Others:

Dataset surveys (broader, but shallower than what we're aiming at):

### 2. DailyDialogue Loader

Presented: 25 October, Deadline: 10 November

In this assignment, you will work with the DailyDialog dataset. Your task is to create a component that will load the dataset and process the data so it is prepared for model training. This will consist of 2 Python classes -- one to hold the data, and one to prepare training batches.

In later assignments, you will train the GPT-2 model using data provided by this component. Note that this means that other assignments depend on this one.

#### Data background

DailyDialog is a chit-chat dialogue dataset labeled with intents and emotions. You can find more details in the paper describing the dataset.

Each DailyDialog entry consists of:

• dialog: a list of string features.
• act: a list of classification labels, e.g., question, commissive, ...
• emotion: a list of classification labels, e.g., anger, happiness, ...

The lists are of the same length and the order matters (it's the order of the turns in the dialogue, i.e. 5th entry in the act list corresponds to the 5th entry in the dialog list).

The data contains train, validation and test splits.

#### Dataset class

Implement a Python class for the dataset (feel free to use Pytorch Dataset, Huggingface datasets, or similar concepts of Tensorflow) that has the following properties:

• It is able to load the data and process it into individual training examples (context + response + emotion + intent).

• Each example should be a dictionary of the following structure:

    {
        'context': list[str],    # list of utterances preceding the current utterance
        'utterance': str,        # the string with the current response
        'emotion': int,          # emotion index
        'intent': int            # intent index
    }

• Note that we will work with a model that takes dialogue context as an input.
• Therefore, each dialogue of n turns will yield n examples, each with progressively longer context (starting from an empty context, up to n-1 turns of context).
• It distinguishes between data splits, i.e. it can be parameterized by split type (train, val, test).

• It can truncate long contexts to k last utterances, where k is a parameter of the class.
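The example-building logic described above can be sketched as follows. This is an illustration rather than a reference solution: the function name `dialogue_to_examples` and the treatment of `k=None` as "no truncation" are assumptions made here.

```python
def dialogue_to_examples(dialog, acts, emotions, k=None):
    """Turn one dialogue into one example per turn, with a growing context."""
    examples = []
    for i, utterance in enumerate(dialog):
        context = dialog[:i]
        if k is not None:
            # Keep only the k most recent utterances (k=0 means no context).
            context = context[-k:] if k > 0 else []
        examples.append({
            "context": list(context),
            "utterance": utterance,
            "emotion": emotions[i],
            "intent": acts[i],
        })
    return examples
```

A dialogue of three turns yields three examples, with contexts of length 0, 1 and 2; with `k=1`, the last example keeps only the immediately preceding utterance.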

Implement a data loader Python class (feel free to use Pytorch DataLoader or similar concepts in Tensorflow) that has the following properties:

• It is able to yield a batch of examples (a simple list with examples of your Dataset) of a batch size given in the constructor.
• It will always yield conversations with similar lengths (numbers of tokens) inside the same batch.
• It will not use the original data order, but will shuffle the examples randomly.
• Yielding a batch repeatedly will never include the same example twice before all the examples have been processed.
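One common way to satisfy these requirements together is to shuffle first and then sort by length only within larger chunks. The sketch below shows this idea in plain Python; `make_batches`, `length_fn` and `bucket_mult` are names invented for the illustration, and a real implementation would more likely plug the same logic into a PyTorch sampler.

```python
import random


def make_batches(examples, batch_size, length_fn, seed=0, bucket_mult=50):
    """Yield shuffled batches whose examples have similar lengths.

    Every example appears exactly once per pass over the data.
    """
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)  # discard the original data order

    batches = []
    chunk = batch_size * bucket_mult
    for start in range(0, len(indices), chunk):
        # Sort only within a chunk, so batches stay random across the dataset.
        bucket = sorted(indices[start:start + chunk],
                        key=lambda i: length_fn(examples[i]))
        for b in range(0, len(bucket), batch_size):
            batches.append([examples[i] for i in bucket[b:b + batch_size]])

    rng.shuffle(batches)  # avoid always going from short to long batches
    yield from batches
```

Here `length_fn` should return the number of tokens of an example. A larger `bucket_mult` gives more uniform lengths within a batch at the cost of less randomness.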

Machine learning models usually work with numbers and matrices. That is why we also need to convert the strings in our batches to integer ids (i.e., tokenize them).

Therefore, inside your data loader class, implement a collate function that has the following properties:

• It is able to work with batches from your data loader (lists of examples).
• It uses GPT2Tokenizer for the tokenization itself.
• It converts the batches to a single dictionary (output) of the following structure:

    output = {
        # tokenized context (list of subword ids from all preceding dialogue turns,
        # separated by the GPT-2 special <|endoftext|> token) for all batch examples
        'context': list[list[int]],
        # tokenized utterances (list of subword ids from the current dialogue turn)
        # for all batch examples
        'utterance': list[list[int]],
        # emotion ids for all batch examples
        'emotion': list[int],
        # intent ids for all batch examples
        'intent': list[int]
    }

where {k : output[k][i] for k in output} should correspond to the i-th example of the original input batch.
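A collate function with this output structure might look like the sketch below. To keep it independent of any library, the tokenizer is passed in as a plain `encode` callable together with the separator id `eos_id`; with GPT2Tokenizer these would typically be `tokenizer.encode` and `tokenizer.eos_token_id`. Placing the separator in front of every context turn is one possible layout, assumed here.

```python
def collate(batch, encode, eos_id):
    """Convert a list of example dicts into one dict of lists of token ids.

    `encode` maps a string to a list of ids; `eos_id` is the id of the
    <|endoftext|> separator.
    """
    output = {"context": [], "utterance": [], "emotion": [], "intent": []}
    for example in batch:
        context_ids = []
        for turn in example["context"]:
            # Assumed layout: the separator precedes every context turn.
            context_ids.append(eos_id)
            context_ids.extend(encode(turn))
        output["context"].append(context_ids)
        output["utterance"].append(encode(example["utterance"]))
        output["emotion"].append(example["emotion"])
        output["intent"].append(example["intent"])
    return output
```

Because every key is filled in the same loop, position `i` in each list always refers to the same input example.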

#### General implementation guidelines

You're free to use any library code that you find helpful, just make sure it installs with pip and add the appropriate requirements.txt file.

We will not restrict you to a certain machine learning framework for this course. However, we strongly recommend using Huggingface and PyTorch so you can access the pretrained models easily.

It is also OK to use Tensorflow, but we consider PyTorch the preferred framework. This means that some examples in the future might contain PyTorch-specific notes, and the reference implementations will be in PyTorch as well. Also, if you run into problems with Tensorflow, we might not be able to help you quickly.

#### Things to submit:

• Your dataset and loader class implementations, both inside data/dailydialog_loader.py
• A testing script -- either hw02.py or hw02.ipynb (your choice), which will use your two classes, will load 3 batches from the training set, each of size 5, and print out both their string and token id representations. Make sure you fix your random seed at the start, so the results are repeatable!
• A requirements.txt file listing all the required libraries.

### 3. Finetuning on DailyDialogue

Presented: 8 November, Deadline: 1 December (extended!)

In this assignment, you will be fine-tuning the GPT-2 language model on the DailyDialog dataset that you prepared.

##### Warning: Normalization needed

Maybe you noticed last time that DailyDialog does not have normalized texts and does not treat punctuation and whitespace in a uniform way. Therefore, we require you to update your dataloader from HW2 by adding a text normalization step. Use the following code for normalizing a single utterance:

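    from sacremoses import MosesTokenizer, MosesDetokenizer

    # Build the tokenizer and detokenizer once and reuse them.
    mt, md = MosesTokenizer(lang='en'), MosesDetokenizer(lang='en')
    # Tokenize and detokenize again to normalize punctuation and whitespace.
    utterance = md.detokenize(mt.tokenize(utterance))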


This process can be time-consuming, so consider caching or precomputing the normalized texts.

##### Modifications for feeding data to the model

• concatenate contexts and utterances into a single list
• add the token ID corresponding to the <|endoftext|> token as a delimiter and as the last token
• convert the contexts + utterances list of lists into a tensor, pad shorter sequences with zeros
• build 2 boolean masks -- one for the context, one for the utterance -- same size as the main tensor, with True for context/utterance tokens only (see the example below)
• convert both masks into tensors

Loader outputs from HW2 looked like this:

    contexts = [[3322, 1, 1, 3322, 2, 3, 4, 5], [3322, 6, 7]]
    utterances = [[8, 9, 10], [11, 12, 13, 14]]


What we need is to make them look like this:

    labels = [
        [3322, 1, 1, 3322, 2,  3,  4,  5,  3322, 8, 9, 10, 3322],
        [3322, 6, 7, 3322, 11, 12, 13, 14, 3322, 0, 0, 0,  0]
    ]
    context_mask = [
        [True, True, True, True, True,  True,  True,  True,  True,  False, False, False, False],
        [True, True, True, True, False, False, False, False, False, False, False, False, False]
    ]
    utterance_mask = [
        [False, False, False, False, False, False, False, False, False, True,  True,  True,  True],
        [False, False, False, False, True,  True,  True,  True,  True,  False, False, False, False]
    ]


Notice 3322 as the <|endoftext|> token and the zero padding in labels. Check the positions of True and False for both masks with respect to labels.
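The transformation from the HW2 loader outputs to the padded `labels` and the two masks can be sketched in plain Python as follows. The function name `build_inputs` is made up for this illustration, and the conversion of the resulting lists into tensors is left out.

```python
def build_inputs(contexts, utterances, eos_id, pad_id=0):
    """Concatenate contexts and utterances and build the two boolean masks."""
    rows = []
    for ctx, utt in zip(contexts, utterances):
        ctx_part = ctx + [eos_id]  # context plus delimiter
        utt_part = utt + [eos_id]  # utterance plus final token
        rows.append((ctx_part, utt_part))

    max_len = max(len(c) + len(u) for c, u in rows)
    labels, context_mask, utterance_mask = [], [], []
    for c, u in rows:
        pad = max_len - len(c) - len(u)
        labels.append(c + u + [pad_id] * pad)
        context_mask.append([True] * len(c) + [False] * (len(u) + pad))
        utterance_mask.append([False] * len(c) + [True] * len(u) + [False] * pad)
    return labels, context_mask, utterance_mask
```

Called with the `contexts` and `utterances` from the example and `eos_id=3322`, this reproduces the padded sequences and mask positions shown above.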

#### Model & training

• Load the pre-trained GPT-2 model from the Huggingface Transformers library. More precisely, instantiate the GPT2LMHeadModel class and load the weights from the pretrained model (see .from_pretrained(...)). Use the smallest version of the model ('gpt2'). If you like experimenting, you can replace the GPT-2 model with a similar model trained on conversational data only, e.g., DialoGPT. You can find and browse all pre-trained Huggingface models here.

• For training the model, i.e., connecting your data pipeline and the loaded model, use whatever you want. We recommend using the Huggingface Trainer or Pytorch Lightning, but you can also write your own training loop and logging routines.

• Fine-tune the model on the response generation task. It means that your objective is to minimize negative log-likelihood (NLL) of the training data with respect to your model. Feed the whole labels tensors into your model, but when computing the loss, only the utterance tokens should be considered (use utterance_mask for the calculation).

• Don't forget to use the attention_mask for GPT-2 training, so you avoid performing attention over padding.

• Feel free to experiment with the optimizer/scheduler and training parameters. A good choice might be the ones preset by Huggingface (AdamW, Linear schedule with warmup).

• Use the largest batch size you can (the largest where your GPU doesn't run out of memory). It might actually be very small (1-4).

• Monitor the training and validation loss and use it to determine the hyperparameters (number of training epochs, learning rate, learning rate schedule, ...).

• Start debugging with very small data, just a few batches (test if the model learns something by checking its outputs on the training data).

• Fix your random seeds so your results are repeatable, and you can tell if you actually changed something. This must be done separately for Python and Numpy and PyTorch/Tensorflow! In case you're using Pytorch Lightning, you can use pytorch_lightning.utilities.seed.seed_everything.
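Fixing the seeds separately for each library can be wrapped in a small helper such as the one below. This is a sketch: the name `fix_seeds` and the default seed are arbitrary, and NumPy and PyTorch are seeded only if they are installed.

```python
import random


def fix_seeds(seed=42):
    """Seed every random number generator that is available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed
```

Calling the helper once at the very start of a script, before the data is shuffled and before the model is created, keeps runs comparable.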

Note: Training on a CPU is usually slow, so we recommend using a GPU. You can use Google Colab, which provides GPUs for free for a limited time span. You can also ask Ondrej for an account on our in-house student computing cluster (please do that ASAP).

#### Decoding

Huggingface provides several options for decoding the outputs of your model. Go through the tutorial and choose a decoding method of your liking (you can go with greedy as the base option). Use it to generate utterances for all contexts available in the test set.

Optional -- bonus: Prepare an interactive script that allows chatting with your model directly. It reads user utterances using a prompt, stores the context, and generates system responses using the trained model. Your efforts will be rewarded with bonus points.

#### Metrics

Besides the training and validation loss, we want you to report the following measures on the test set:

• token accuracy, i.e. the proportion of correctly predicted utterance token ids (apply argmax on the predicted raw logits and compare the result with the ground-truth token ids)
• perplexity (you might find it helpful to figure out the relationship between data perplexity and the objective function used for training)
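Both measures can be computed from the raw logits in one pass. The sketch below does this for a single sequence in plain Python so that the arithmetic is visible; in practice the same computation would be done on tensors. It assumes that `logits[t]` scores the token at position `t + 1`, as in causal language models, and uses the standard definition of perplexity as the exponential of the mean negative log-likelihood. The function name `utterance_metrics` is made up for this illustration.

```python
import math


def utterance_metrics(logits, labels, utterance_mask):
    """Token accuracy and perplexity over the utterance tokens of one sequence.

    logits[t] holds the raw scores for the token at position t + 1.
    """
    nll_sum, correct, count = 0.0, 0, 0
    for t in range(len(labels) - 1):
        if not utterance_mask[t + 1]:
            continue  # context and padding do not count
        scores, target = logits[t], labels[t + 1]
        # Log-softmax with the usual max-shift for numerical stability.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        nll_sum += log_z - scores[target]
        correct += scores.index(top) == target  # argmax equals target?
        count += 1
    mean_nll = nll_sum / count
    return {"accuracy": correct / count, "nll": mean_nll,
            "perplexity": math.exp(mean_nll)}
```

With uniform scores over a vocabulary of size V, the perplexity comes out as V, which is a handy sanity check. The function assumes that the mask selects at least one token.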

#### Things to submit

• Updated DailyDialog loader class implementation (data/dailydialog_loader.py)
• Updated requirements.txt
• Code for model loading, training, decoding and metrics (model.py, you may use multiple files if you want)
• The code should include your training parameters (you can load them from a JSON/YAML config file if you want)
• Text file (hw03/dailydialog_outputs.txt) containing the generated test set responses, each on a separate line.
• Text file (hw03/dailydialog_scores.txt) containing your token accuracy and perplexity.

### 4. MultiWOZ 2.2 Loader

Presented: 22 November, Deadline: 13 December (extended)

This assignment is very similar to HW2, except you will work with the MultiWOZ 2.2 dataset, which is task-oriented. This results in some differences and modifications, so read carefully.

Your task is to create a component that will load the task-oriented dataset and process the data so it is prepared for model training. Same as for HW2, it will consist of two Python classes -- one to hold the data, and one to prepare training batches.

In later assignments, you will train the GPT-2 model (similar to SOLOIST) using data provided by this component. Note that this means that the next assignments depend on this one!

#### Data background

MultiWOZ 2.2 is a task-oriented conversational dataset labeled with dialogue acts. It contains around 10k conversations between the user and a Cambridge town info centre (system). The dialogues are about certain topics: restaurants, hotels, trains, taxi, tourist attractions, hospital, and police. You can find more details in the dataset repository.

You can write your own dataset loader from the original format (see the dataset), but it is not as simple as in the case of HW2. Therefore, we recommend using the Huggingface Datasets library version. Note that there's a bug (old checksum) in HF Datasets, so to load the dataset, use ignore_verifications=True -- it'll work fine.

This is what the data looks like if you load it using Huggingface Datasets: Each entry in the dataset represents one dialog. The information we are interested in is contained in the field turns, which is a dictionary with the following important keys:

• speaker: Role associated with the speaker. It's either 0 (user) or 1 (system).
• utterance: String representation of the dialogue utterances.
• dialogue_acts: Structured parse of the system utterances into dialog acts. It contains slot names and corresponding span_info (location of the slot in the utterance, which will come in handy later).
• frames: Present only in user utterances. Structured representation of the user's belief state.

Each of these keys is mapped to a list with labels for the corresponding turns, i.e. turns['speaker'][0] contains information for the speaker of the first turn and turns['speaker'][-1] of the last one.
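Since turns is a dictionary of parallel lists, it is often convenient to regroup it into one dictionary per turn. A minimal sketch (the helper name `iter_turns` is made up for this illustration):

```python
def iter_turns(turns):
    """Yield one dict per turn from a dict of parallel lists."""
    keys = list(turns)
    for values in zip(*(turns[k] for k in keys)):
        yield dict(zip(keys, values))
```

For instance, `[t for t in iter_turns(turns) if t["speaker"] == 1]` selects the system turns.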

Again, the dataset contains train, validation and test splits. Respect them!

#### Database

The dataset is task-oriented, and the database is an important part of it. The database stores the entities that are available for each domain and their attributes. You will use the database results when modelling the conversations, so you need to implement the database query API. However, some domains are specific and their database queries need to be handled in a special way. Also, the MultiWOZ dataset has a few rather annoying quirks. Therefore, we provide a partially implemented class that already handles the things that would be too annoying to deal with (see the attached file database.zip).

However, you still need to implement some things:

• Time-based searches, which include converting time strings to numbers and getting results before or after a particular time stamp.
• Matching of searched strings to the values in the database.

The bits that are waiting for your implementation are highlighted with # TODO: in the code.

Note that to use the provided code, you'll need to install the fuzzywuzzy library (and add it to your dependencies). It installs easily via pip.
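The two missing pieces can be approached roughly as in the sketch below. It is not the code from database.zip: the helper names, the "HH:MM" time format, the inclusive comparison and the similarity threshold are all assumptions, and the standard-library `difflib` stands in for fuzzywuzzy to keep the example dependency-free.

```python
from difflib import SequenceMatcher


def time_to_minutes(text):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = text.strip().split(":")
    return int(hours) * 60 + int(minutes)


def filter_by_time(entries, key, query, mode):
    """Keep entries at or after ('after') or at or before ('before') the query time."""
    limit = time_to_minutes(query)
    if mode == "after":
        return [e for e in entries if time_to_minutes(e[key]) >= limit]
    return [e for e in entries if time_to_minutes(e[key]) <= limit]


def fuzzy_match(query, value, threshold=0.9):
    """True if two strings are nearly identical, ignoring case and outer spaces."""
    a, b = query.strip().lower(), value.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Comparing times as integers avoids the pitfalls of comparing strings such as "9:05" and "10:00" lexicographically.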

#### Dataset class

Implement a Python class for the dataset (feel free to use Pytorch Dataset, Huggingface datasets, or similar concepts for Tensorflow) that has the following properties:

• It is able to load the data and process it into individual training examples (containing context, response, belief state, database results).

• Each example should be a dictionary of the following structure:

    {
        # list of utterances preceding the current utterance
        'context': list[str],
        # the string with the current response
        'utterance': str,
        # the string with the current response which is delexicalized,
        # i.e. slot values are replaced by corresponding slot names in the text
        'delex_utterance': str,
        # belief state dictionary, for each domain a separate belief state dictionary,
        # choose a single slot value if more than one option is available
        'belief_state': dict[str, dict[str, str]],
        # dictionary containing the number of matching results per domain
        'database_results': dict[str, int]
    }

• Each dialogue of n turns will yield n // 2 examples, each with progressively longer context (starting from a context of length 1, up to n-1 turns of context). We are modelling only system responses!

• It distinguishes between data splits, i.e. it can be parameterized by split type (train, val, test).

• It can truncate long contexts to k last utterances, where k is a parameter of the class.

• It contains delexicalized versions of the utterances (where slot values are replaced with placeholders). You can use the data field dialogue_acts and its fields span_end, span_start for localizing the parts suitable for delexicalization. Replace those parts with the corresponding slot names from act_slot_name enclosed in brackets, e.g., [name] or [pricerange].

• Belief state is a dictionary that contains mapping of domains to their corresponding belief states (slot-value pairs), i.e.

    {
        'restaurant': {'pricerange': 'ab', 'area': 'cd', ...},
        'hotel':      {'parking': 'ef', ...},
        ...
    }

Look into the frames fields of user utterances to build the belief state!

• Database results represent the counts of database entities matching the current belief state for each domain.

    {
        'restaurant': 101,
        'hotel':      42,
        ...
    }


You must distinguish between the cases where 0 entities are matching and where the domain was not mentioned in the belief state and thus was not queried at all! Don't mention the domain in the results in the latter case.
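The delexicalization step from the list above can be sketched as a replacement over character spans. The sketch assumes that the span annotations have already been collected into `(slot_name, start, end)` triples with an exclusive end offset; the function name `delexicalize` is made up for this illustration.

```python
def delexicalize(utterance, spans):
    """Replace annotated slot values with [slot_name] placeholders.

    `spans` is a list of (slot_name, start, end) character offsets, end exclusive.
    """
    # Go from right to left so earlier offsets stay valid after each replacement.
    for slot, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        utterance = utterance[:start] + f"[{slot}]" + utterance[end:]
    return utterance
```

Processing the spans from the end of the string matters: replacing left to right would shift the offsets of all later spans.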

Implement a data loader Python class (feel free to use Pytorch DataLoader or similar concepts in Tensorflow) that has the following properties:

• It is able to yield a batch of examples (a simple list with examples of your Dataset) of a batch size given in the constructor.
• It will always yield conversations with similar lengths (numbers of tokens) inside the same batch. You should take into account not only the conversation context, but also the size of the belief state!
• It will not use the original data order, but will shuffle the examples randomly.
• Yielding a batch repeatedly will never include the same example twice before all the examples have been processed.

Machine learning models usually work with numbers and matrices. That is why we also need to convert strings in our batches to integer IDs. Therefore, inside your data loader class, implement a collate function that has the following properties:

• It is able to work with batches coming from your data loader (lists of examples).

• It uses GPT2Tokenizer to split all strings into tokens (subwords) and assign them IDs.

• It converts the batches to a single dictionary (output) of the following structure:

    output = {
        # tokenized context (list of subword ids from all preceding dialogue turns,
        # system turns prepended with <|system|> token and user turns with <|user|>)
        # for all batch examples
        'context': list[list[int]],
        # tokenized utterances (list of subword ids from the current dialogue turn)
        # for all batch examples
        'utterance': list[list[int]],
        # tokenized and delexicalized utterances (list of subword ids
        # from the current dialogue turn) for all batch examples
        'delex_utterance': list[list[int]],
        # belief state dictionary serialized into a string representation and prepended with
        # the <|belief|> special token and tokenized (list of subword ids
        # from the current dialogue turn) for all batch examples
        'belief_state': list[list[int]],
        # database result counts serialized into string prepended with the <|database|>
        # special token and tokenized (list of subword ids from the current dialogue turn)
        # for all batch examples
        'database_results': list[list[int]],
    }


where {k : output[k][i] for k in output} should correspond to i-th example of the original input batch.

• Do not forget to correctly register all the new special tokens into the tokenizer (check out additional_special_tokens argument of the tokenizer)!
• You can choose your own way of the belief and database results serialization, or you can follow the format of SOLOIST.
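One possible serialization is sketched below. The textual format is invented for this illustration and is not claimed to match SOLOIST; the only fixed parts are the `<|belief|>` and `<|database|>` prefixes required above.

```python
def serialize_belief(belief_state):
    """Serialize {'domain': {'slot': 'value'}} into a single string."""
    parts = []
    for domain in sorted(belief_state):
        slots = belief_state[domain]
        pairs = ", ".join(f"{slot} = {slots[slot]}" for slot in sorted(slots))
        parts.append(f"{domain} {{ {pairs} }}")
    return "<|belief|> " + " ; ".join(parts)


def serialize_db(database_results):
    """Serialize {'domain': count} into a single string."""
    parts = [f"{domain} {database_results[domain]}"
             for domain in sorted(database_results)]
    return "<|database|> " + " ; ".join(parts)
```

Sorting domains and slots makes the output deterministic, so the same belief state always maps to the same token sequence. With Huggingface tokenizers, the new special tokens are typically registered with `tokenizer.add_special_tokens({"additional_special_tokens": [...]})`, followed by `model.resize_token_embeddings(len(tokenizer))` once the model is loaded.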

#### Things to submit:

• Your dataset and loader class implementations, both inside data/multiwoz_loader.py.
• A testing script -- either hw04.py or hw04.ipynb (your choice), which will use your two classes, will load 3 batches from the training set, each of size 5, and print out both their string and token id representations. Make sure you fix your random seed at the start, so the results are repeatable!
• Output of your testing script -- either save hw04.ipynb with the outputs, or include a separate hw04.txt.
• An (updated) requirements.txt file listing all the required libraries -- this should be in the pip readable format, i.e. the file should contain all required libraries and their versions in the following format:

    torch==a.b.c
    transformers==x.y.z
    ...



### 5. Finetuning on MultiWOZ

Presented: 6 December, Deadline: 3 January (extended)

In this assignment, you will be fine-tuning the GPT-2 language model on the MultiWOZ dataset that you prepared. Basically, we will try to mimic the SOLOIST architecture in a simplified way.

This assignment is very similar to HW3, except you will work with the MultiWOZ 2.2 dataset, which is task-oriented. This results in some differences and modifications, so read carefully.

##### Modifications for feeding data to the model

• Concatenate contexts and delexicalized utterances, as well as belief states and database results, into a single list
• Add the token ID corresponding to the <|endoftext|> token as the last token and as a delimiter between the database results and the delexicalized system utterances. The <|belief|> and <|database|> special tokens should already be present since you added them in HW4. They will serve as delimiters for the belief state and database parts of the input, respectively.
• Convert the list of lists for contexts + belief + db + delexicalized utterances into a tensor, padding shorter sequences with zeros.
• Build four boolean masks -- separate ones for each of: (1) context, (2) delexicalized utterance, (3) belief state and (4) database. They should be the same size as the main tensor, with True for the tokens of the respective part, including the final special token (see the example below).
• Convert all four masks into tensors.

Loader outputs from HW4 looked like this:

    <|ENDOFTEXT|> = 3320
    <|USER|>      = 3321
    <|SYSTEM|>    = 3322
    <|BELIEF|>    = 3323
    <|DB|>        = 3324

    contexts = [[3321, 1, 2, 3322, 3, 4, 5, 6, 3321, 7, 8, 9], [3321, 10, 11]]
    utterances = [[12, 13, 14], [15, 16, 17, 18]]
    delex_utterances = [[12, 1111, 14], [15, 16, 1112, 18]]  # some tokens replaced by delex. procedure
    beliefs = [[3323, 100, 101], [3323, 102]]
    dbs = [[3324, 204], [3324, 207]]


What we need is to make them look like this:

    labels = [
        [3321, 1, 2, 3322, 3, 4, 5, 6, 3321, 7, 8, 9, 3323, 100, 101, 3324, 204, 3320, 12, 1111, 14, 3320],
        [3321, 10, 11, 3323, 102, 3324, 207, 3320, 15, 16, 1112, 18, 3320, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ]
    context_mask = [
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ]
    belief_mask = [
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ]
    database_mask = [
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ]
    utterance_mask = [
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ]
    ...


Notice the zero padding in labels. Check the positions of 1 (True) and 0 (False) in the masks with respect to labels. Specifically, note that you want to predict the final token of the belief/utterance and you're not predicting the starting token since that's given as input.
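The layout above can be produced with a short helper such as the following sketch. The function name `build_multiwoz_inputs` and the mask names used as dictionary keys are choices made for this illustration, and the conversion into tensors is again left out.

```python
def build_multiwoz_inputs(contexts, beliefs, dbs, delex_utterances, eos_id, pad_id=0):
    """Concatenate the four segments and build one 0/1 mask per segment."""
    names = ("context", "belief", "database", "utterance")
    labels, spans = [], []
    for ctx, bel, db, utt in zip(contexts, beliefs, dbs, delex_utterances):
        # Each segment ends with the special token that follows it, so its own
        # opening token (<|belief|>, <|database|>) belongs to the previous mask.
        segments = [
            ctx + bel[:1],        # context, closed by <|belief|>
            bel[1:] + db[:1],     # belief state, closed by <|database|>
            db[1:] + [eos_id],    # database results, closed by <|endoftext|>
            utt + [eos_id],       # response, closed by <|endoftext|>
        ]
        labels.append([tok for seg in segments for tok in seg])
        spans.append([len(seg) for seg in segments])

    max_len = max(len(row) for row in labels)
    masks = {name: [] for name in names}
    for row, lengths in zip(labels, spans):
        row.extend([pad_id] * (max_len - len(row)))  # zero padding
        start = 0
        for name, length in zip(names, lengths):
            mask = [0] * max_len
            mask[start:start + length] = [1] * length
            masks[name].append(mask)
            start += length
    return labels, masks
```

Applied to the HW4 loader outputs from the example with `eos_id=3320`, this yields the same token sequences and mask positions as shown above.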

#### Model

We will again work with the pretrained GPT-2 model. Load the pretrained model using .from_pretrained() as in HW3.

For training the model, i.e., connecting your data pipeline and the loaded model, use whatever you want. We recommend using the Huggingface Trainer or Pytorch Lightning, but you can also write your own training loop and logging routines. Feel free to reuse code from HW3.

#### Training

• Fine-tune the model on two tasks simultaneously: dialog state tracking and response generation. We model both the dialog state and the response as a sequence of tokens, so our training objective boils down to minimization of negative log-likelihood (NLL) of the training data with respect to your model. Feed the whole labels tensors into your model, but when computing the loss, consider only the tokens corresponding to the dialog state and the utterance (use belief_mask and utterance_mask for the calculation). You should get two numbers, i.e., the loss associated with the belief state and the loss associated with the final response. The total loss is then the sum of those two numbers.
• Don't forget to use the attention_mask for GPT-2 training, so you avoid performing attention over padding.
• Feel free to experiment with the optimizer/scheduler and training parameters. A good choice might be the ones preset by Huggingface (AdamW, Linear schedule with warmup).
• Use the largest batch size you can (the largest where your GPU doesn't run out of memory). It will probably be the same as for HW3, i.e. relatively small (1-4).
• Monitor the training and validation loss and use it to determine the hyperparameters (number of training epochs, learning rate, learning rate schedule, ...).
• Start debugging with very small data, just a few batches (check whether the model learns something by inspecting its outputs on the training data).
• Fix your random seeds so your results are repeatable, and you can tell if you actually changed something. This must be done separately for Python and Numpy and PyTorch/Tensorflow! In case you're using Pytorch Lightning, you can use pytorch_lightning.utilities.seed.seed_everything.
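
The two-part loss described in the first bullet could be sketched roughly as follows. This is only one possible implementation in PyTorch: the function names `masked_nll` and `total_loss` are hypothetical, and averaging over the masked tokens (rather than summing) is an assumption. The code assumes the usual causal LM setup, where the logits at position t predict the token at position t + 1, so logits, labels and masks are shifted by one before comparison.

```python
import torch
import torch.nn.functional as F


def masked_nll(logits, labels, mask):
    """Mean NLL over the positions selected by a 0/1 mask.

    logits: (batch, seq_len, vocab), labels and mask: (batch, seq_len).
    """
    # Shift so that the logits at position t are scored against the token at t + 1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    shift_mask = mask[:, 1:].to(shift_logits.dtype)
    # cross_entropy expects (batch, vocab, seq_len) for sequence inputs.
    token_nll = F.cross_entropy(
        shift_logits.transpose(1, 2), shift_labels, reduction="none"
    )
    # Avoid division by zero if a mask happens to be empty.
    return (token_nll * shift_mask).sum() / shift_mask.sum().clamp(min=1)


def total_loss(logits, labels, belief_mask, utterance_mask):
    """Return (total, belief, response) losses."""
    belief_loss = masked_nll(logits, labels, belief_mask)
    response_loss = masked_nll(logits, labels, utterance_mask)
    return belief_loss + response_loss, belief_loss, response_loss
```

Returning the two partial losses alongside the total makes it easy to log them separately during training and validation.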

Same as for HW3, you can use Google Colab, which provides GPUs for free for a limited time span, or you can ask Ondrej for an account on our AIC in-house student computing cluster.

#### Decoding

Huggingface provides several options for decoding the outputs of your model. Go through the tutorial and choose a decoding method of your liking (you can go with greedy as the base option). Use it to generate responses for all contexts available in the test set.

However, there is one caveat. During decoding, you have to query the database. To be able to do this, you first need to have a belief state. Therefore, the decoding process will have multiple stages:

1. Decode the belief state based on current context
2. Construct a database query and perform the query using the database API class from HW4 (or your own equivalent)
3. Concatenate both the belief state and database results to the context (do not forget to add the special tokens!), and decode the final response.

NOTE: To be able to construct a query, you need a structured representation of the belief state (i.e. a dict). However, the model decodes strings (same as those you created in HW4). Therefore you'll need to use some kind of parser to get back the structured representation. We provide a sample parsing class that you can use directly or as an inspiration (depending on what you used in HW4) -- have a look at this code.
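
The three decoding stages can be sketched as a small driver function. Everything here is a placeholder rather than a prescribed interface: `generate` stands for your own wrapper around tokenization, model generation and cutting the output at the next special token; `parse_belief` stands for the belief state parser; `query_db` stands for a function that queries the database and returns the result already encoded as a string. The special token strings are made up for the illustration, so use the ones from your own data pipeline.

```python
def decode_turn(context, generate, parse_belief, query_db,
                belief_tok="<|belief|>", db_tok="<|database|>",
                response_tok="<|response|>"):
    """Decode one system turn in three stages.

    All arguments after `context` are hypothetical placeholders.
    """
    # 1. Decode the belief state string from the current context.
    belief_str = generate(context + belief_tok)
    # 2. Parse it into a structured state and query the database.
    db_str = query_db(parse_belief(belief_str))
    # 3. Append the belief state and database results, then decode the response.
    prompt = context + belief_tok + belief_str + db_tok + db_str + response_tok
    response = generate(prompt)
    return belief_str, db_str, response
```

Passing the components in as functions keeps the staging logic independent of the decoding method you choose (greedy, sampling, beam search).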

Optional -- bonus: Prepare an interactive script that allows you to chat directly with your model. It reads user utterances using a prompt, stores the context, and generates system responses using the trained model. Unfortunately, this also requires a mechanism for backward lexicalization of the predicted delexicalized texts. Your efforts will be rewarded with bonus points.

#### Metrics

Besides the training and validation loss, we want you to report the token accuracy, i.e. the proportion of correctly predicted token ids (apply argmax on the predicted raw logits and compare the result with the ground-truth token ids). Please report token accuracy separately for belief state prediction and response prediction.
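
A minimal sketch of the token accuracy computation for a single example follows, written with plain Python lists for readability (in practice you would do the same with tensors). The function name `token_accuracy` is hypothetical, and the code assumes the logits are already aligned with the labels, i.e. row t of `logits` is the prediction for `labels[t]`.

```python
def token_accuracy(logits, labels, mask):
    """Share of masked positions where argmax(logits) equals the label.

    logits: list of seq_len rows, each with vocab_size raw scores.
    labels, mask: lists of length seq_len (mask holds 0/1 values).
    """
    correct = total = 0
    for scores, label, keep in zip(logits, labels, mask):
        if not keep:
            continue  # skip context, database and padding positions
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
        correct += int(predicted == label)
        total += 1
    return correct / total if total else 0.0
```

Calling it once with the belief mask and once with the utterance mask gives the two separate numbers requested above.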

#### Things to submit

• Updated MultiWOZ loader class implementation (data/multiwoz_loader.py).
• Updated requirements.txt.
• Code for model loading, training, decoding and metrics (model.py or task_model.py; you may use multiple files if you want).
• The code should include your training parameters (you can load them from a JSON/YAML config file if you want).
• Text file (hw05/multiwoz_outputs.txt) containing your generated test set belief states + responses. The ideal format is one turn per line, with a tab character (\t) between the belief and the response.
• Text file (hw05/multiwoz_scores.txt) containing your token accuracy and loss.

### 6. Evaluation & State Consistency

Presented: 20th December, Deadline: 31st January (but better do it sooner)

In this assignment, you will work with the model trained in HW5 and perform some more experiments. Basically, we will try to answer two questions:

1. How well does your model perform?
2. Can we improve the model performance by modifying the training process?

You might want to take a look at HW7 simultaneously, as it can help you design your experiments.

#### Model Evaluation

To evaluate your model's performance you will report several metrics. Specifically, we want you to report:

• BLEU score,
• dialogue success rate (corpus-based, i.e. you generate each system turn with ground-truth context, then compute success at the end),
• variability of the generated language (number of distinct tokens and conditional bigram entropy).

To be able to compute the metrics, you will need to generate predictions from your model and save them in a machine-readable format, e.g. json. Use the test set for generating the predictions. For the computation of the scores themselves, you are free to use any implementation you like. However, the easiest way is to use the evaluation script that Tomáš has prepared for MultiWOZ. It can be easily installed via pip and allows you to measure all the required metrics (and some more).
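
To make the variability metrics concrete, here is a small self-contained sketch. It is not the evaluation script mentioned above: the function name `variability` is hypothetical, and the tokenization and exact definitions used by that script may differ, so the numbers are not guaranteed to match. The conditional bigram entropy is computed here as H(w2 | w1) = -Σ p(w1, w2) log2 p(w2 | w1), with bigrams counted within each response only.

```python
import math
from collections import Counter


def variability(responses):
    """responses: list of token lists.

    Returns (number of distinct tokens, conditional bigram entropy in bits).
    """
    unigrams = Counter(tok for resp in responses for tok in resp)
    bigrams = Counter(pair for resp in responses for pair in zip(resp, resp[1:]))
    # How often each token occurs as the first element of a bigram.
    history = Counter()
    for (first, _), count in bigrams.items():
        history[first] += count
    total = sum(bigrams.values())
    entropy = 0.0
    for (first, _), count in bigrams.items():
        # p(w1, w2) * log2 p(w2 | w1)
        entropy -= (count / total) * math.log2(count / history[first])
    return len(unigrams), entropy
```

For example, the two responses `["a", "b"]` and `["a", "c"]` contain three distinct tokens, and the token after `a` is a coin flip between `b` and `c`, giving a conditional entropy of one bit.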

#### Improving Performance

In this part of the assignment, you will need to modify your model's training process and retrain the model subsequently. The goal of this modification is to improve the belief state tracking performance of your model. To achieve this, we introduce an additional training objective: The model will have to distinguish between the ground-truth belief state and a corrupted version of the belief state. You will need to do the following modifications:

You will corrupt the belief state on-the-fly in your collate function, i.e. before tokenization, building masks and concatenation of subsequent parts (context, state, database results, and response) into a single string. First, you need to decide which examples of the current batch will contain a corrupted state, based on probability p_c = 1/3 (i.e., 1/3 of your training examples will contain the corrupted state on average). This decision will be described by a vector of binary flags, which should be returned from the collate function too. These flags will be used as target labels during training.

There are many options for corrupting the belief state. You can replace each slot value by a different one with some probability p_v, add or remove some slot name-value pair with probabilities p_a and p_r, respectively, or totally replace the state with another one with probability p_t. The corrupted belief state will then be encoded the same way as your ground-truth belief state.

It's best if you treat the probabilities as hyperparameters and keep them configurable. Keeping p_c = 0 will then get you the baseline model (with no state corruption).
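
One way the corruption step could look is sketched below. It makes several assumptions that are not prescribed by the assignment: the belief state is a flat `{slot: value}` dict, `ontology` is a hypothetical dict mapping each slot name to its list of possible values, the default probabilities are arbitrary, and the total replacement with probability p_t is left out for brevity.

```python
import random


def corrupt_state(state, ontology, p_v=0.3, p_r=0.1, p_a=0.1, rng=random):
    """Return a corrupted copy of `state` (a flat {slot: value} dict)."""
    corrupted = {}
    for slot, value in state.items():
        if rng.random() < p_r:
            continue  # remove this slot name-value pair
        if rng.random() < p_v:
            alternatives = [v for v in ontology.get(slot, []) if v != value]
            if alternatives:
                value = rng.choice(alternatives)  # replace by a different value
        corrupted[slot] = value
    if rng.random() < p_a:
        unused = [s for s in ontology if s not in corrupted and ontology[s]]
        if unused:
            slot = rng.choice(unused)
            corrupted[slot] = rng.choice(ontology[slot])  # add a spurious pair
    return corrupted


def corrupt_batch(states, ontology, p_c=1 / 3, rng=random, **probs):
    """Corrupt each state with probability p_c; return (states, binary flags)."""
    out, flags = [], []
    for state in states:
        new_state = dict(state)
        if rng.random() < p_c:
            new_state = corrupt_state(state, ontology, rng=rng, **probs)
        out.append(new_state)
        # Flag only states that actually changed, so the labels stay truthful.
        flags.append(int(new_state != state))
    return out, flags
```

Passing a seeded `random.Random` instance as `rng` keeps the corruption repeatable across runs.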

##### Model architecture

You will add an additional training objective to detect the consistency of the belief states. To achieve this, the model needs to be slightly modified. You can choose one of the two approaches:

1. Instantiate the GPT2LMHeadModel and add a custom binary classification layer manually.
2. Instantiate the GPT2DoubleHeadsModel and treat the inputs accordingly.

##### Training

Use the additional training objective for training the consistency classification head. You should minimize the binary cross-entropy between the predicted binary flag and the ground truth (i.e., whether you fed in the true state or the corrupted one). Combine the losses as a weighted sum.
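
For the first approach (a manually added classification layer), the extra head and the combined loss might look roughly like this. It is a sketch under assumptions: the class and function names are hypothetical, the classifier reads the hidden state of the last non-padding token (which assumes right padding and an integer attention mask), and the loss weight is a hyperparameter you would tune. The hidden states themselves can be obtained from the Huggingface model by passing `output_hidden_states=True`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConsistencyHead(nn.Module):
    """Binary classifier on the hidden state of the last non-padding token."""

    def __init__(self, hidden_size):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, seq_len, hidden), attention_mask: (batch, seq_len)
        # Index of the last real token per example (assumes right padding).
        last = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        return self.linear(hidden_states[batch_idx, last]).squeeze(-1)


def combined_loss(lm_loss, consistency_logits, flags, weight=1.0):
    """Weighted sum of the language modeling loss and the consistency BCE."""
    bce = F.binary_cross_entropy_with_logits(consistency_logits, flags.float())
    return lm_loss + weight * bce
```

Here `flags` is the vector of binary corruption flags returned from the collate function, and `lm_loss` is the belief + response loss from the base model.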

##### Experiments

Measure the same metrics as with the base version of the model (without the additional training objective). You don't need to use the additional head or any belief state corruption during the prediction.

#### What to submit

• Updated MultiWOZ loader class implementation (data/multiwoz_loader.py).
• Updated requirements.txt (if necessary).
• Updated code for model loading, training, decoding and metrics (model.py or task_model.py -- what you used for HW5).
• The code should include your training parameters (you can load them from a JSON/YAML config file if you want).
• Text files (hw06/mw_outputs.json, corrupted_mw_outputs.json) containing your generated test set belief states + responses. The ideal format is structured and machine-readable, e.g. json.
• Text file (hw06/multiwoz_metrics.txt) containing the metrics described above (BLEU, success, distinct tokens, conditional bigram entropy) for both model variants -- with and without the state corruption.

### 7. Report

Presented: 20th December, Deadline: 21st February (but better do it sooner!)

This is the last assignment, and it's worth double points! The basic idea is that you write a ca. 3-page report (1500 words), detailing your model and the experiments, so it all looks like an academic paper. The purpose of this is to give you some writing training, which might come in handy for your master's thesis or other projects. It is up to you if you focus on the chitchat model, the MultiWOZ task-oriented model (preferable), or both.

Have a look at Ondrej's tips for writing reports here before you start writing!

#### What to include in the report text

• A short abstract, summarizing the main features of your model and your main results
• An introduction, motivating the model (feel free to use the lectures for inspiration and other works) and potentially summarizing your key results/findings.
• A short related works section (again, feel free to use the lectures) -- a bit more descriptive about the related works, highlighting key differences from your own
• Model description -- describe how the model(s) operate(s) during training and inference
• Experiments -- compare at least two variants of the model. For the MultiWOZ model, it's best to compare a version with and without the consistency auxiliary task. In any case, you can discuss two versions with different hyperparameter values, or compare to a version that only uses a portion of the training data (say 25%). Describe the datasets and your experimental settings in this section.
• Results -- describe the results. Try to draw some conclusions from the scores you got. Also, do a little error analysis -- compare the outputs for at least 10 dialogues (gold contexts + outputs of your two model variants) and summarize your findings. What kinds of errors do your systems make and how frequently? Is there a difference between the systems? Do you think that the automatic scores reflect the models' performance well?
• Conclusion (optional) -- just a short summary of your findings. Not necessary if you did it in the introduction already.

The prescribed format for your report is LaTeX, with the ACL Rolling Review templates. You can get the templates directly on Overleaf or download them for offline use.

#### Things to submit

• The PDF of your report (under hw7/report.pdf)
• All the LaTeX code, including the templates, figures and references (hw7/*.*)
• The dialogues you used for your error analysis (under hw7/error_analysis/*.* -- best as either plain text or JSON)

## Homework Submission Instructions

All homework assignments will be submitted using a Git repository on MFF GitLab.

We provide an easy recipe to set up your repository below:

### Creating the repository

2. Create a new project (e.g. called NPFL099). Choose the Private visibility level.

 New project -> Create blank project

3. Invite us (@duseo7af, @hudecekv, @nekvindt) to your project so we can see it. Please give us "Reporter" access level.

 Members -> Invite Member

4. Clone the newly created repository.

5. Change into the cloned directory and run

git remote show origin


You should see these two lines:

* remote origin


6. You're all set!

### Submitting the homework assignment

1. Make sure you're on your master branch:
git checkout master

2. Check out a new branch:
git checkout -b hw-XX

3. Solve the assignment :)

4. Add new files and commit your changes:
git add hwXX/solution.py
git commit -am "commit message"

5. Push to your origin remote repository:
git push origin hw-XX

6. Create a Merge request in the web interface. Make sure you create the merge request into the master branch in your own forked repository (not into the upstream).

 Merge requests -> New merge request

7. Wait a bit till we check your solution, then enjoy your points :)!
8. Once approved, merge your changes into your master branch – you might need them for further homeworks (but feel free to branch out from the previous homework in your next one if we're too slow with checking).

## AIC Cluster Computation Tips

This is just a short primer for the AIC wiki -- better read that one. But definitely read at least this text before you start working with AIC.

When you log on to AIC, you're at the cluster head node. Do not compute here – this is just for launching computation jobs, copying files and such. All of your computation jobs will run in a batch on one of the CPU/GPU nodes.

Commands you might want to use:

• tmux -- for having multiple terminal sessions open. Tmux can survive and keep everything open even if you lose connection (just tmux attach).
• mc or WinSCP -- for copying files over ssh (use F9-“Shell link” in mc).
• miniconda -- in case you need a newer Python than 3.6 (which is the system version).

### Submitting jobs

Use the qsub command to submit your jobs (i.e. shell scripts) into a queue. For running a python command, simply create a shell script that has one line -- your command with all the parameters you need.

Have a look at the AIC wiki for all the command-line parameters.

Here's just an example of a GPU job with 1 CPU, 1 GPU and 16G system RAM (all GPUs have 8G memory):

qsub -q gpu.q -cwd -j y -l act_mem_free=16G,mem_free=16G,h_vmem=16G,h_data=16G,gpu=1  -pe smp 1 script.sh


Parameter guide:

• -q -- the queue name (cpu.q or gpu.q are available)
• -cwd -- run in the current directory, not your home directory
• -j y -- join stderr and stdout into one file (script.oXXXX) where XXXX is the job ID
• -l -- all the requested resources (yes, you need to specify all of these)
• -pe smp X -- number of CPUs to use (separate from other resources)

Notes:

• Rule #1: Always request the resources you'll need, or something will break!
• Rule #2: Don't submit too many jobs at a time (don't overfill the cluster, leave space for others).
• You can't request more than 2 CPUs or 2 GPUs.

### Checking submitted jobs

Use the qstat command to check for jobs. You can run qstat -u '*' to see every job currently running on the cluster, from any user.

### Interactive shell on a computation node

You can get an interactive console for debugging directly with a GPU -- like this:

qrsh -q gpu.q -l act_mem_free=16G,mem_free=16G,h_vmem=16G,h_data=16G,gpu=1 -pe smp 1 -pty yes bash -l


Parameter guide:

• -pty yes means “give me a console”
• bash -l is a bash login shell, which will set CUDA variables for you & start a new bash shell.

Notes:

• You always have to request the resources, same as with qsub.
• qrsh won't wait -- if the cluster is full, it will fail.
• Don't forget to exit the console after use -- you're blocking the GPU and whatever you reserve, as long as the console is open!

## Exam Question Pool

The exam will have 10 questions from the pool below. Each question counts for 10 points. We reserve the right to make slight alterations or use variants of the same questions. Note that all of them are covered by the lectures, and they cover most of the lecture content. In general, none of them requires you to memorize formulas, but you should know the main ideas and principles. See the Grading tab for details on grading.

#### Introduction

• Describe the difference between closed-domain, multi-domain, and open-domain systems.
• Describe the difference between user-initiative, mixed-initiative, and system-initiative systems.
• List the main components of a modular task-oriented dialogue system (text/voice-based)
• What is the task (input/output) of speech recognition in a dialogue system?
• What is the task (input/output) of speech synthesis in a dialogue system?

#### Data & Evaluation

• What are the usual approaches to collecting dialogue data (name at least 2)?
• How does Wizard-of-Oz data collection work?
• What’s the difference between intrinsic and extrinsic evaluation?
• What is the difference between subjective and objective evaluation?
• What are some evaluation metrics for non-task-oriented systems (chatbots)?
• How would you evaluate NLU (both slots & intents)?
• Explain an NLG evaluation metric of your choice.
• Describe how BLEU works (in principle, no details needed).
• Show at least 2 examples of subjective (human) evaluation metrics for dialogue systems.
• What is significance testing and what is it good for?
• Assume you have dialogue systems A and B, and A performs better than B in terms of response BLEU on a dataset of 100 dialogues. Describe how you’d test for significance.
• Why do you need to evaluate on a separate test set?

#### Neural Nets Basics

• What's the difference between classification and regression as a machine learning problem?
• Describe the task of sequence prediction (=autoregressive generation).
• What's the difference between classification and ranking as a machine learning problem?
• What's the difference between sequence labeling and sequence prediction?
• What is an embedding, what units can it relate to, and how can you obtain it?
• What are subwords and why are they useful?
• What's an encoder-decoder model and how does it work?
• How does an attention model work?
• What's the main principle of operation for convolutional networks?
• What's the difference between LSTM/GRU-based and Transformer-based architecture?
• What's a pretrained language model?

#### Training Neural Nets

• Describe the principle of stochastic gradient descent
• Why is it important to choose an appropriate learning rate?
• What is dropout, what is it good for and why does it work?
• What’s a variational autoencoder and how does it differ from a “regular” autoencoder?
• What is a masked language model?
• How do Generative Adversarial Networks work?
• Describe the principle of the pretraining+finetuning approach.
• How does clustering work?

#### Natural Language Understanding

• Design (a sketch of) an NLU neural architecture that joins intent detection and slot tagging.
• Describe language understanding as classification and language understanding as sequence tagging.
• What is delexicalization and why is it helpful in NLU?
• Describe one of the approaches to slot tagging as sequence tagging.
• What is the IOB/BIO format for slot tagging?
• How can you use pretrained language models (e.g. BERT) for NLU?
• How can you combine rules and neural networks in NLU?
• How can an NLU system deal with noisy ASR output? Propose an example solution.

#### Dialogue State Tracking

• What is the point of dialogue state tracking in a dialogue system?
• What is the difference between dialogue state and belief state?
• What's the difference between a static and a dynamic state tracker?
• What's a partially observable Markov decision process?
• Describe a viable architecture for a belief tracker.
• What is the difference between state trackers as classifiers vs. as candidate rankers?
• Describe the principle of state tracking as span selection.

#### Dialogue Policies

• Describe the basic reinforcement learning setup (agent, environment, actions, rewards)
• Why is reinforcement learning preferred over supervised learning for training dialogue managers?
• What are V and Q functions in a reinforcement learning scenario?
• What's the difference between actor and critic methods in reinforcement learning?
• Describe a Deep Q Network.
• Describe the REINFORCE approach.
• What’s the main principled difference between Deep Q-Networks and Policy Gradient methods?
• What are actor-critic reinforcement learning methods?
• What’s the difference between on-policy and off-policy optimization?
• Why do you typically need a user simulator to train a reinforcement learning dialogue policy?
• Give an example of possible turn-level or dialogue-level rewards for RL optimization.
• What is a user simulator? What are some common approaches to building one?

#### Natural Language Generation

• What are the main steps of a traditional NLG pipeline – describe at least 2.
• Describe a rule-based approach to NLG.
• What are the main problems with seq2seq-based NLG systems?
• What is a copy mechanism/pointer network?
• What is delexicalization and why is it helpful in NLG?
• Describe a possible neural approach to NLG with an approach to combat hallucination.
• How can you reduce NLG hallucination by altering the training data?
• How can an NLU system help in training an NLG system (for example)?
• How can you use pretrained language models in NLG?
• What are the typical decoding approaches in NLG? Explain & contrast at least 2.

#### End-to-end Models

• What are some pros and cons of end-to-end models over traditional modular ones?
• Describe an example structure of an end-to-end dialogue system.
• Describe the Sequicity (2-step decoding) model.
• Describe an end-to-end model based on pretrained language models.
• How would you adapt a pretrained language model for an end-to-end dialogue system?
• What are “soft” database lookups in end-to-end dialogue systems?
• How would you use reinforcement learning to train an end-to-end model?
• Why is it a bad idea to train end-to-end dialogue systems only with reinforcement learning on word level?

#### Chatbots

• What are the three main approaches to building chatbots?
• How does the Turing test work? Does it have any weaknesses?
• What are some techniques rule-based chatbots use to convince their users that they're human-like?
• Describe how a retrieval-based chatbot works.
• Why is plain seq2seq-based architecture for chatbots problematic?
• Describe an example approach to improving diversity or coherence in a seq2seq-based chatbot.
• How can you use a pretrained language model in a chatbot?
• Describe a possible architecture of a hybrid/ensemble chatbot.

#### Linguistics & Ethics

• What are turn taking cues/hints in a dialogue? Name a few examples.
• What is a barge-in?
• What is grounding in dialogue?
• Give some examples of grounding signals in dialogue.
• What is alignment/adaptation in dialogue?
• Describe the overgeneralization/overconfidence problem in data-driven NLP models.
• Describe the demographic bias problem in data-driven NLP models.
• Give an example of a user safety concern in dialogue systems.
• What's the problem with training neural models on private data?

#### Multimodality

• How does the structure of (traditional) multimodal dialogue systems differ from non-multimodal ones?
• Give an example of 3 alternative input modalities (i.e. not voice/text).
• Give an example of 3 alternative output modalities (i.e. not voice/text).
• How would you build a multimodal end-to-end neural dialogue system (e.g. for visual dialogue)?
• Explain some problems that may occur when a dialogue system talks to two people at once.
• What’s the difference between image classification and object detection?
• How would you build a neural end-to-end image-based system (consider using pretrained components)?

To pass this course, you will need to:

1. Take an exam (a written test covering important lecture content).
2. Do lab homeworks (implementing an end-to-end dialogue system + other tasks).

### Exam test

• There will be a written exam test at the end of the semester.
• There will be 10 questions, each worth a maximum of 10 points; we expect 2-3 sentences as an answer.
• To pass the course, you need to get at least 50% of the total points from the test.
• We plan to publish a list of possible questions beforehand.
• If needed, there will be exam dates in the summer.

In case the pandemic gets worse by the exam period, there will be a remote alternative for the exam (an essay with a discussion).

### Homework assignments

• There will be 7 homework assignments, introduced every other week.
• You will submit the homework assignments into a private Gitlab repository (where we will be given access).
• For each assignment, you will get a maximum of 10 points (except the last one, which is for double points!).
• All assignments will have a fixed deadline (typically 2-3 weeks).
• The only accepted reason for a deadline extension is a serious problem beyond your own control, such as illness.
• If you submit the assignment after the deadline, you will get:
• up to 50% of the maximum points if it is less than 2 weeks after the deadline;
• 0 points if it is more than 2 weeks after the deadline.
• Any bonus points you get will not be lowered.
• Note that most assignments depend on each other! That means that if you miss a deadline, you still might need to do an assignment without points in order to score on later assignments.
• Once we check the submitted assignments, you will see the points you got and the comments from us on Gitlab, later your points will appear in:
• To be allowed to take the exam (which is required to pass the course), you need to get at least 50% of the total points from the assignments.

The final grade for the course will be a combination of your exam score and your homework assignment score, weighted 3:1 (i.e. the exam accounts for 75% of the grade, the assignments for 25%).

• Grade 1: >=87% of the weighted combination
• Grade 2: >=74% of the weighted combination
• Grade 3: >=60% of the weighted combination
• An overall score of less than 60% means you did not pass.

In any case, you need at least 50% of points from the test and at least 50% of points from the homeworks to pass. If you get less than 50% from either, even if you get more than 60% overall, you will not pass.
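
The grading rules can be written down as a few lines of arithmetic. This is just an illustrative sketch, not an official calculator: the function name `final_grade` is made up, inputs are percentages from 0 to 100, and the 50% minimums are treated as inclusive, following the "at least 50%" wording used for the exam and the assignments.

```python
def final_grade(exam_pct, homework_pct):
    """Return grade 1, 2 or 3, or None if the course is not passed."""
    # Both parts have their own minimum, regardless of the overall score.
    if exam_pct < 50 or homework_pct < 50:
        return None
    # Exam and homework are weighted 3:1.
    overall = 0.75 * exam_pct + 0.25 * homework_pct
    if overall >= 87:
        return 1
    if overall >= 74:
        return 2
    if overall >= 60:
        return 3
    return None
```

For instance, 90% from the exam and 80% from the homeworks gives 0.75 · 90 + 0.25 · 80 = 87.5%, i.e. grade 1, while 100% from the exam with 40% from the homeworks does not pass.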

### No cheating

• Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty.
• Discussing homework assignments with your classmates is OK. Sharing code is not OK (unless explicitly allowed); by default, you must complete the assignments yourself.
• All students involved in cheating will be punished. E.g. if you share your assignment with a friend, both you and your friend will be punished.

You should be able to pass the course just by following the lectures, but here are some hints on further reading. There's nothing ideal on the topic as this is a very active research area, but some of these should give you a broader overview.

Recommended, though slightly outdated:

Recommended, but might be a bit too brief: