This is the new course for the '21/22 Fall semester. You can find slides from last year on the archived old page.
This course presents advanced problems and current state-of-the-art in the field of dialogue systems, voice assistants, and conversational systems (chatbots). After a brief introduction into the topic, the course will focus mainly on the application of machine learning – especially deep learning/neural networks – in the individual components of the traditional dialogue system architecture as well as in end-to-end approaches (joining multiple components together).
This course is a follow-up to the course NPFL123 Dialogue Systems, but can be taken independently – important basics will be repeated. All required deep learning concepts will be explained, but only briefly, so some machine learning background is recommended.
The course will be taught in English, but we're happy to explain in Czech, too.
In-person lectures and labs take place in the room S10 (Malá Strana, 1st floor).
In addition, we plan to stream both lectures and lab instruction over Zoom and make the recordings available on YouTube (under a private link, on request). We'll do our best to provide a useful experience, but note that audio quality might not be ideal.
If you can't access Zoom, email us or text us on Slack.
There's also a Slack workspace you can use to discuss assignments and get news about the course. Please contact us by email if you want to join and haven't got an invite yet.
To pass this course, you will need to take an exam and do lab homeworks, which will amount to training an end-to-end neural dialogue system and writing a report on it. See more details here.
PDFs with lecture slides will appear here shortly before each lecture (more details on each lecture are on a separate tab). You can also check out last year's lecture slides.
1. Introduction Slides Questions
2. Data & Evaluation Slides Dataset Exploration Questions
3. Neural Nets Basics Slides Questions
4. Training Neural Nets Slides DailyDialogue Loader Questions
5. Natural Language Understanding Slides Questions
6. Dialogue Management (1) Slides Finetuning on DailyDialogue Questions
7. Dialogue Management (2) Slides Questions
8. Language Generation Slides MultiWOZ 2.2 Loader Questions
9. End-to-end Models Slides Questions
10. Chatbots Slides Finetuning on MultiWOZ Questions
11. Linguistics & Ethics Slides Questions
12. Multimodal systems Slides Evaluation & State Consistency Report Questions
A list of recommended literature is on a separate tab.
11 October Slides Dataset Exploration Questions
25 October Slides DailyDialogue Loader Questions
8 November Slides Finetuning on DailyDialogue Questions
22 November Slides MultiWOZ 2.2 Loader Questions
6 December Slides Finetuning on MultiWOZ Questions
20 December Slides Evaluation & State Consistency Report Questions
There will be 7 homework assignments, typically for a maximum of 10 points (the last one will be for 20 points). Please see details on grading and deadlines on a separate tab.
Assignments should be submitted via Git – see instructions on a separate tab.
All deadlines are 23:59:59 CET/CEST.
1. Dataset Exploration
2. DailyDialogue Loader
3. Finetuning on DailyDialogue
4. MultiWOZ 2.2 Loader
5. Finetuning on MultiWOZ
6. Evaluation & State Consistency
7. Report
Presented: 11 October, Deadline: 27 October
Your task is to select one dialogue dataset, download and explore it.
For the basic overview of the data, you can use the dataset description/paper that came out with the data. The papers are linked from the dataset webpages or from here. If you can't find a paper, ask us and we'll try to help.
For the more detailed exploration, you should use your own programming skills.
Files to submit:
hw01/description.md with your findings.
hw01/analysis.py or hw01/analysis.ipynb with your data exploration code.
See the submission instructions here (create a MFF Gitlab repo and a new merge request). Do not commit the data itself into your repository (keep it in a local data subdirectory).

Dataset surveys (broader, but shallower than what we're aiming at):
Presented: 25 October, Deadline: 10 November
In this assignment, you will work with the DailyDialog dataset. Your task is to create a component that will load the dataset and process the data so it is prepared for model training. This will consist of 2 Python classes -- one to hold the data, and one to prepare training batches.
In later assignments, you will train the GPT-2 model using data provided by this component. Note that this means that other assignments depend on this one.
DailyDialog is a chit-chat dialogue dataset labeled with intents and emotions. You can find more details in the paper describing the dataset.
Each DailyDialog entry consists of:
dialog: a list of string features.
act: a list of classification labels, e.g., question, commissive, ...
emotion: a list of classification labels, e.g., anger, happiness, ...
The lists are of the same length and the order matters (it's the order of the turns in the dialogue, i.e. the 5th entry in the act list corresponds to the 5th entry in the dialog list).
The data contains train, validation and test splits.
Implement a Python class for the dataset (feel free to use Pytorch Dataset, Huggingface datasets, or similar concepts of Tensorflow) that has the following properties:
It is able to load the data and process it into individual training examples (context + response + emotion + intent).
Each example should be a dictionary of the following structure:
{
'context': list[str], # list of utterances preceding the current utterance
'utterance': str, # the string with the current response
'emotion': int, # emotion index
'intent': int # intent index
}
It distinguishes between data splits, i.e. it can be parameterized by split type (train, val, test).
It can truncate long contexts to the k last utterances, where k is a parameter of the class.
Implement a data loader Python class (feel free to use Pytorch DataLoader or similar concepts in Tensorflow) that has the following properties:
It is able to yield a batch of examples (a simple list with examples of your Dataset) of a batch size given in the constructor.
Machine learning models usually work with numbers and matrices. That is why we also need to convert strings in our batches to integer ids (e.g., tokenize).
Therefore, inside your data loader class, implement a collate function that has the following properties:
It is able to work with batches coming from your data loader (lists of examples).
It uses GPT2Tokenizer to split all strings into tokens (subwords) and assign them IDs.
It converts the batch into a single dictionary (output) of the following structure:
output = {
'context': list[list[int]], # tokenized context (list of subword ids from all preceding dialogue turns, separated by the GPT-2 special `<|endoftext|>` token) for all batch examples
'utterance': list[list[int]], # tokenized utterances (list of subword ids from the current dialogue turn) for all batch examples
'emotion': list[int], # emotion ids for all batch examples
'intent': list[int] # intent ids for all batch examples
}
where {k: output[k][i] for k in output} should correspond to the i-th example of the original input batch.
You're free to use any library code that you find helpful, just make sure it installs with pip, and add the appropriate requirements.txt file.
We will not restrict you to a certain machine learning framework for this course. However, we strongly recommend using Huggingface and PyTorch, so you can access the pretrained models easily.
It is also OK to use Tensorflow, but we consider PyTorch the preferred framework. This means that some future examples might contain PyTorch-specific notes, and the reference implementations will be in PyTorch as well. Also, if you run into problems with Tensorflow, we might not be able to help you quickly.
Files to submit:
data/dailydialog_loader.py containing your two classes.
hw02.py or hw02.ipynb (your choice), which will use your two classes, load 3 batches from the training set, each of size 5, and print out both their string and token id representations. Make sure you fix your random seed at the start, so the results are repeatable!
requirements.txt file listing all the required libraries.

Presented: 8 November, Deadline: 1 December (extended!)
In this assignment, you will be fine-tuning the GPT-2 language model on the DailyDialog dataset that you prepared.
Maybe you noticed last time that DailyDialog does not have normalized texts and does not treat punctuation and whitespace in a uniform way. Therefore, we require you to update your dataloader from HW2 by adding a text normalization step. Use the following code for normalizing a single utterance:
from sacremoses import MosesTokenizer, MosesDetokenizer
mt, md = MosesTokenizer(lang='en'), MosesDetokenizer(lang='en')
utterance = md.detokenize(mt.tokenize(utterance))
This process can be time-consuming, so consider caching or precomputing the normalized texts.
You'll need to add a few more steps to your data loader:
Concatenate the context and the current utterance into a single sequence (labels), using <|endoftext|> tokens as a delimiter and as the last token.
Build a context mask and an utterance mask over this sequence that are True for context/utterance tokens only (see the example below).

Loader outputs from HW2 looked like this:
contexts = [[3322, 1, 1, 3322, 2, 3, 4, 5], [3322, 6, 7]]
utterances = [[8, 9, 10], [11, 12, 13, 14]]
What we need is to make them look like this:
labels = [
[3322, 1, 1, 3322, 2, 3, 4, 5, 3322, 8, 9, 10, 3322],
[3322, 6, 7, 3322, 11, 12, 13, 14, 3322, 0, 0, 0, 0]
]
context_mask = [
[True, True, True, True, True, True, True, True, True, False, False, False, False],
[True, True, True, True, False, False, False, False, False, False, False, False, False]
]
utterance_mask = [
[False, False, False, False, False, False, False, False, False, True, True, True, True],
[False, False, False, False, True, True, True, True, True, False, False, False, False]
]
Notice 3322 as the <|endoftext|> token and the zero padding in labels. Check the positions of True and False for both masks with respect to labels.
Load the pre-trained GPT-2 model from the Huggingface Transformers library. More precisely, instantiate the GPT2LMHeadModel class and load the weights from the pretrained model (see .from_pretrained(...)). Use the smallest version of the model ('gpt2'). If you like experimenting, you can replace the GPT-2 model with a similar model trained on conversational data only, e.g., DialoGPT. You can find and browse all pre-trained Huggingface models here.
For training the model, i.e., connecting your data pipeline and the loaded model, use whatever you want. We recommend using the Huggingface Trainer or Pytorch Lightning, but you can also write your own training loop and logging routines.
Fine-tune the model on the response generation task. This means that your objective is to minimize the negative log-likelihood (NLL) of the training data with respect to your model. Feed the whole labels tensors into your model, but when computing the loss, only the utterance tokens should be considered (use utterance_mask for the calculation).
Don't forget to use the attention_mask for GPT-2 training, so you avoid performing attention over padding.
Feel free to experiment with the optimizer/scheduler and training parameters. A good choice might be the ones preset by Huggingface (AdamW, Linear schedule with warmup).
Use the largest batch size you can (the largest where your GPU doesn't run out of memory). It might actually be very small (1-4).
Monitor the training and validation loss and use it to determine the hyperparameters (number of training epochs, learning rate, learning rate schedule, ...).
First start debugging with very small data, just a few batches (test if the model learns something by checking its outputs on the training data).
Fix your random seeds so your results are repeatable, and you can tell if you actually changed something. This must be done separately for Python, Numpy, and PyTorch/Tensorflow! In case you're using Pytorch Lightning, you can use pytorch_lightning.utilities.seed.seed_everything.
Note: Training on CPU is usually slow, which is why we like GPUs. You can use Google Colab, which provides GPUs for free for a limited time span. You can also ask Ondrej for an account on our in-house student computing cluster (please do that ASAP).
Huggingface provides several options for decoding the outputs of your model. Go through the tutorial and choose a decoding method of your liking (you can go with greedy as the base option). Use it to generate utterances for all contexts available in the test set.
Optional -- bonus: Prepare an interactive script that allows you to chat directly with your model. It reads user utterances using a prompt, stores the context, and generates system responses using the trained model. Your efforts will be rewarded with bonus points.
Besides the training and validation loss, we want you to report the following measures on the test set:
Token accuracy, i.e. the proportion of correctly predicted token ids (apply argmax on the predicted raw logits and compare the result with the ground-truth token ids).
Perplexity of your model on the test data.

Files to submit:
Your updated loader code (data/dailydialog_loader.py).
Your updated requirements.txt.
Your model training and prediction code (in model.py, you may use multiple files if you want).
A text file (hw03/dailydialog_outputs.txt) containing the generated test set responses, each on a separate line.
A text file (hw03/dailydialog_scores.txt) containing your token accuracy and perplexity.

Presented: 22 November, Deadline: 13 December (extended)
This assignment is very similar to HW2, except you will work with the MultiWOZ 2.2 dataset, which is task-oriented. This results in some differences and modifications, so read carefully.
Your task is to create a component that will load the task-oriented dataset and process the data so it is prepared for model training. Same as for HW2, it will consist of two Python classes -- one to hold the data, and one to prepare training batches.
In later assignments, you will train the GPT-2 model (similar to SOLOIST) using data provided by this component. Note that this means that the next assignments depend on this one!
MultiWOZ 2.2 is a task-oriented conversational dataset labeled with dialogue acts. It contains around 10k conversations between the user and a Cambridge town info centre (system). The dialogues are about certain topics: restaurants, hotels, trains, taxi, tourist attractions, hospital, and police. You can find more details in the dataset repository.
You can write your own dataset loader from the original format (see the dataset), but it is not as simple as in the case of HW2. Therefore, we recommend using the Huggingface Datasets library version. Note that there's a bug (old checksum) in HF Datasets, so to load the dataset, use ignore_verifications=True -- it'll work fine.
This is what the data looks like if you load it using Huggingface Datasets: Each entry in the dataset represents one dialogue. The information we are interested in is contained in the field turns, which is a dictionary with the following important keys:
speaker: Role associated with the speaker. It's either 0 (user) or 1 (system).
utterance: String representation of the dialogue utterances.
dialogue_acts: Structured parse of the system utterances into dialog acts. It contains slot names and corresponding span_info (location of the slot in the utterance, which will come in handy later).
frames: Present only in user utterances. Structured representation of the user's belief state.
Each of these keys is mapped to a list with labels for the corresponding turns, i.e. turns['speaker'][0] contains information for the speaker of the first turn and turns['speaker'][-1] for the last one.
Again, the dataset contains train, validation and test splits. Respect them!
The dataset is task-oriented, and an important part of it is the database, which stores the entities available for each domain and their attributes. You will use the database results when modelling the conversations, therefore you need to implement the database query API. However, some domains are specific and their database queries need to be handled in a special way. Also, the MultiWOZ dataset has a few rather annoying quirks. Therefore, we provide a partially implemented class that already handles things that would be too annoying to deal with (see the attached file database.zip).
However, you still need to implement some things: the bits that are waiting for your implementation are highlighted with # TODO: in the code.
Note that to use the provided code, you'll need to install the fuzzywuzzy library (and add it to your dependencies). It installs easily via pip.
Implement a Python class for the dataset (feel free to use Pytorch Dataset, Huggingface datasets, or similar concepts for Tensorflow) that has the following properties:
It is able to load the data and process it into individual training examples (containing context, response, belief state, database results).
Each example should be a dictionary of the following structure:
{
'context': list[str], # list of utterances preceding the current utterance
'utterance': str, # the string with the current response
'delex_utterance': str, # the string with the current response which is delexicalized, i.e. slot values are
# replaced by corresponding slot names in the text.
'belief_state': dict[str, dict[str, str]], # belief state dictionary, for each domain a separate belief state dictionary,
# choose a single slot value if more than one option is available
'database_results': dict[str, int] # dictionary containing the number of matching results per domain
}
Each dialogue of n turns will yield n // 2 examples, each with a progressively longer context (starting from a context of length 1, up to n-1 turns of context). We are modelling only system responses!
It distinguishes between data splits, i.e. it can be parameterized by split type (train, val, test).
It can truncate long contexts to the k last utterances, where k is a parameter of the class.
It contains delexicalized versions of the utterances (where slot values are replaced with placeholders). You can use the data field dialogue_acts and its fields span_start and span_end for localizing the parts suitable for delexicalization. Replace those parts with the corresponding slot names from act_slot_name enclosed in brackets, e.g., [name] or [pricerange].
Belief state is a dictionary that contains mapping of domains to their corresponding belief states (slot-value pairs), i.e.
{
'restaurant': {'pricerange': 'ab', 'area': 'cd', ...},
'hotel': {'parking': 'ef', ...},
...
}
Look into the frames fields of the user utterances to build the belief state!
Database results represent the counts of database entities matching the current belief state for each domain.
{
'restaurant': 101,
'hotel': 42,
...
}
You must distinguish between the cases where 0 entities are matching and where the domain was not mentioned in the belief state and thus was not queried at all! Don't mention the domain in the results in the latter case.
Implement a data loader Python class (feel free to use Pytorch DataLoader or similar concepts in Tensorflow) that has the following properties:
It is able to yield a batch of examples (a simple list with examples of your Dataset) of a batch size given in the constructor.
Machine learning models usually work with numbers and matrices. That is why we also need to convert strings in our batches to integer IDs. Therefore, inside your data loader class, implement a collate function that has the following properties:
It is able to work with batches coming from your data loader (lists of examples).
It uses GPT2Tokenizer to split all strings into tokens (subwords) and assign them IDs.
It converts the batches to a single dictionary (output) of the following structure:
output = {
'context': list[list[int]], # tokenized context (list of subword ids from all preceding dialogue turns,
# system turns prepended with `<|system|>` token and user turns with `<|user|>`)
# for all batch examples
'utterance': list[list[int]], # tokenized utterances (list of subword ids from the current dialogue turn)
# for all batch examples
'delex_utterance': list[list[int]], # tokenized and delexicalized utterances (list of subword ids
# from the current dialogue turn) for all batch examples
'belief_state': list[list[int]], # belief state dictionary serialized into a string representation and prepended with
# the `<|belief|>` special token and tokenized (list of subword ids
# from the current dialogue turn) for all batch examples
'database_results': list[list[int]], # database result counts serialized into string prepended with the `<|database|>`
# special token and tokenized (list of subword ids from the current dialogue turn)
# for all batch examples
}
where {k: output[k][i] for k in output} should correspond to the i-th example of the original input batch.
Don't forget to add the special tokens you use (<|user|>, <|system|>, <|belief|>, <|database|>) into the tokenizer's vocabulary (using the additional_special_tokens argument of the tokenizer)!
Files to submit:
Your loader code (data/multiwoz_loader.py).
hw04.py or hw04.ipynb (your choice), which will use your two classes, load 3 batches from the training set, each of size 5, and print out both their string and token id representations. Make sure you fix your random seed at the start, so the results are repeatable! Either submit hw04.ipynb with the outputs included, or add a separate hw04.txt with the outputs.
requirements.txt file listing all the required libraries -- this should be in the pip-readable format, i.e. the file should contain all required libraries and their versions in the following format:
torch==a.b.c
transformers==x.y.z
...
Presented: 6th December, Deadline: 3 January (extended)
In this assignment, you will be fine-tuning the GPT-2 language model on the MultiWOZ dataset that you prepared. Basically, we will try to mimic the SOLOIST architecture in a simplified way.
This assignment is very similar to HW3, except you will work with the MultiWOZ 2.2 dataset, which is task-oriented. This results in some differences and modifications, so read carefully.
Same as for HW3 vs. HW2, you'll need to add a few more steps to the data loader you made in HW4:
Concatenate the context, belief state, database results, and delexicalized response into a single sequence (labels), using the <|endoftext|> token as the last token and as a delimiter between the database results and the delexicalized system utterance. The <|belief|> and <|database|> special tokens should already be present since you added them in HW4. They will serve as delimiters for the belief state and database parts of the input, respectively.
Build masks that are True for the tokens of the respective part, including the final special token (see the example below).

Loader outputs from HW4 looked like this:
<|ENDOFTEXT|> = 3320
<|USER|> = 3321
<|SYSTEM|> = 3322
<|BELIEF|> = 3323
<|DB|> = 3324
contexts = [[3321, 1, 2, 3322, 3, 4, 5, 6, 3321, 7, 8, 9], [3321, 10, 11]]
utterances = [[12, 13, 14], [15, 16, 17, 18]]
delex_utterances = [[12, 1111, 14], [15, 16, 1112, 18]] # some tokens replaced by delex. procedure
beliefs = [[3323, 100, 101], [3323, 102]]
dbs = [[3324, 204], [3324, 207]]
What we need is to make them look like this:
labels = [
[3321, 1, 2, 3322, 3, 4, 5, 6, 3321, 7, 8, 9, 3323, 100, 101, 3324, 204, 3320, 12, 1111, 14, 3320],
[3321, 10, 11, 3323, 102, 3324, 207, 3320, 15, 16, 1112, 18, 3320, 0, 0, 0, 0, 0, 0, 0, 0, 0]
]
context_mask = [
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
]
belief_mask = [
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
]
database_mask = [
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
]
utterance_mask = [
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
]
...
Notice the zero padding in labels. Check the positions of True and False for the masks with respect to labels. Specifically, note that you want to predict the final token of the belief/utterance, and that you're not predicting the starting token, since that's given as input.
We will again work with the pretrained GPT-2 model. Load the pretrained model using .from_pretrained() as in HW3.
For training the model, i.e., connecting your data pipeline and the loaded model, use whatever you want. We recommend using the Huggingface Trainer or Pytorch Lightning, but you can also write your own training loop and logging routines. Feel free to reuse code from HW3.
Feed the whole labels tensors into your model, but when computing the loss, only consider the tokens corresponding to the dialogue state and the utterance (use belief_mask and utterance_mask for the calculation). You should get two numbers, i.e., the loss associated with the belief state and the loss associated with the final response. The total loss is then the sum of those two numbers (see the sketch below).
Don't forget to use the attention_mask for GPT-2 training, so you avoid performing attention over padding.
Fix your random seeds so your results are repeatable. In case you're using Pytorch Lightning, you can use pytorch_lightning.utilities.seed.seed_everything.
Same as for HW3, you can use Google Colab, which provides GPUs for free for a limited time span, or you can ask Ondrej for an account on our AIC in-house student computing cluster.
Huggingface provides several options for decoding the outputs of your model. Go through the tutorial and choose a decoding method of your liking (you can go with greedy as the base option). Use it to generate responses for all contexts available in the test set.
However, there is one caveat. During decoding, you have to query the database. To be able to do this, you first need to have a belief state. Therefore, the decoding process will have multiple stages: first decode the belief state given the context, then parse it and query the database, and finally generate the delexicalized response conditioned on the context, belief state, and database results.
NOTE: To be able to construct a query, you need a structured representation of the belief state (i.e. a dict). However, the model decodes strings (same as those you created in HW4). Therefore you'll need to use some kind of parser to get back the structured representation. We provide a sample parsing class that you can use directly or as an inspiration (depending on what you used in HW4) -- have a look at this code.
Optional -- bonus: Prepare an interactive script that allows you to chat directly with your model. It reads user utterances using a prompt, stores the context, and generates system responses using the trained model. Unfortunately, this also requires a mechanism for backward lexicalization of the predicted delexicalized texts. Your efforts will be rewarded with bonus points.
Besides the training and validation loss, we want you to report the token accuracy, i.e. the proportion of correctly predicted token ids (apply argmax on the predicted raw logits and compare the result with the ground-truth token ids). Please report token accuracy separately for belief state prediction and response prediction.
Files to submit:
Your updated loader code (data/multiwoz_loader.py).
Your updated requirements.txt.
Your model training and prediction code (model.py or task_model.py; you may use multiple files if you want).
A text file (hw05/multiwoz_outputs.txt) containing your generated test set belief states + responses. The ideal format is one turn per line, with a tab character (\t) between the belief state and the response.
A text file (hw05/multiwoz_scores.txt) containing your token accuracy and loss.

Presented: 20th December, Deadline: 31st January (but better do it sooner)
In this assignment, you will work with the model trained in HW5 and perform some more experiments. Basically, we will try to answer two questions: how well does the model actually perform, and does an additional state consistency training objective help?
You might want to take a look at HW7 simultaneously, as it can help you design your experiments.
To evaluate your model's performance, you will report several metrics. Specifically, we want you to report BLEU, dialogue success rate, the number of distinct tokens, and conditional bigram entropy.
To be able to compute the metrics, you will need to generate predictions from your model and save them in a machine-readable format, e.g. JSON. Use the test set for generating the predictions.
For the computation of the scores itself, you are free to use any implementation you like. However, the easiest way is to use the evaluation script that Tomáš has prepared for MultiWOZ. It can be easily installed via pip and allows you to measure all the required metrics (and some more).
In this part of the assignment, you will need to modify your model's training process and retrain the model subsequently. The goal of this modification is to improve the belief state tracking performance of your model. To achieve this, we introduce an additional training objective: The model will have to distinguish between the ground-truth belief state and a corrupted version of the belief state. You will need to do the following modifications:
You will corrupt the belief state on-the-fly in your collate function, i.e. before tokenization, mask building, and concatenation of the subsequent parts (context, state, database results, and response) into a single string. First, you need to decide which examples of the current batch will contain a corrupted state, based on the probability p_c = 1/3 (i.e., 1/3 of your training examples will contain a corrupted state on average). This decision will be described by a vector of binary flags, which should be returned from the collate function too. These flags will be used as target labels during training.
There are many options for corrupting the belief state. You can replace each slot value with a different one with some probability p_v, add or remove a slot name-value pair with probabilities p_a and p_r, or replace the whole state with another state with probability p_t. The corrupted belief state will then be encoded the same way as your ground-truth belief state.
It's best if you treat the probabilities as hyperparameters and keep them configurable. Setting p_c = 0 will then get you the baseline model (with no state corruption). A sketch of a possible corruption routine follows below.
You will add an additional training objective to detect the consistency of the belief states. To achieve this, the model needs to be slightly modified. You can choose one of the two approaches:
Use the additional training objective for training the consistency classification head. You should minimize the binary cross-entropy between the predicted binary flag and the ground truth (i.e., whether you fed in the true state or the corrupted one). Combine the losses as a weighted sum.
Measure the same metrics as with the base version of the model (without the additional training objective). You don't need to use the additional head or any belief state corruption during the prediction.
Files to submit:
Your updated loader code (data/multiwoz_loader.py).
Your updated requirements.txt (if necessary).
Your model training and prediction code (model.py or task_model.py -- what you used for HW5).
Output files (hw06/mw_outputs.json, hw06/corrupted_mw_outputs.json) containing your generated test set belief states + responses. The ideal format is structured and machine-readable, e.g. JSON.
A text file (hw06/multiwoz_metrics.txt) containing the metrics described above (BLEU, success, distinct tokens, conditional bigram entropy) for both model variants -- with and without the state corruption.

Presented: 20th December, Deadline: 21st February (but better do it sooner!)
This is the last assignment, and it's worth double points! The basic idea is that you write a ca. 3-page report (1500 words), detailing your model and the experiments, so it all looks like an academic paper. The purpose of this is to give you some writing training, which might come in handy for your master's thesis or other projects. It is up to you whether you focus on the chitchat model, the MultiWOZ task-oriented model (preferable), or both.
Have a look at Ondrej's tips for writing reports here before you start writing!
The prescribed format for your report is LaTeX, with the ACL Rolling Review templates. You can get the templates directly on Overleaf or download them for offline use.
Files to submit:
Your report (hw7/report.pdf).
Your report's source files (hw7/*.*).
Files documenting your error analysis (hw7/error_analysis/*.* -- best as either plain text or JSON).

All homework assignments will be submitted using a Git repository on MFF GitLab.
We provide an easy recipe to set up your repository below:
Log into your MFF gitlab account. Your username and password should be the same as in the CAS, see this.
Create a new project (e.g. called NPFL099). Choose the Private visibility level.
New project -> Create blank project
Invite us (@duseo7af, @hudecekv, @nekvindt) to your project so we can see it. Please give us "Reporter" access level.
Members -> Invite Member
Clone the newly created repository.
Change into the cloned directory and run
git remote show origin
You should see these two lines:
* remote origin
Fetch URL: git@gitlab.mff.cuni.cz:your_username/NPFL099.git
Push URL: git@gitlab.mff.cuni.cz:your_username/NPFL099.git
For each assignment, create a new branch based on master:
git checkout master
git checkout -b hw-XX
Solve the assignment :)
Add new files (if applicable) and commit your changes:
git add hwXX/solution.py
git commit -am "commit message"
git push origin hw-XX
Create a Merge request in the web interface. Make sure you create the merge request into the master branch in your own forked repository (not into the upstream).
Merge requests -> New merge request
This is just a short primer -- you'd better read the full AIC wiki. But definitely read at least this text before you start working with AIC.
When you log on to AIC, you're at the cluster head node. Do not compute here -- it's just for launching computation jobs, copying files and such. All of your computation jobs will run in a batch on one of the CPU/GPU nodes.
Commands you might want to use: tmux -- run it after login to get a persistent terminal session (if you get disconnected, you can later reattach using tmux attach).

Use the qsub command to submit your jobs (i.e. shell scripts) into a queue. For running a python command, simply create a shell script that has one line -- your command with all the parameters you need.
Have a look at the AIC wiki for all the command-line parameters.
Here's just an example of a GPU job with 1 CPU, 1 GPU and 16G system RAM (all GPUs have 8G memory):
qsub -q gpu.q -cwd -j y -l act_mem_free=16G,mem_free=16G,h_vmem=16G,h_data=16G,gpu=1 -pe smp 1 script.sh
Parameter guide:
-q -- the queue name (cpu.q or gpu.q are available)
-cwd -- run in the current directory, not your home directory
-j y -- join stderr and stdout into one file (script.oXXXX, where XXXX is the job ID)
-l -- all the requested resources (yes, you need to specify all of these)
-pe smp X -- number of CPUs to use (separate from other resources)

Notes:
Use the qstat command to check for jobs. You can run qstat -u '*' to see every job currently running on the cluster, from any user.
You can get an interactive console for debugging directly with a GPU -- like this:
qrsh -q gpu.q -l act_mem_free=16G,mem_free=16G,h_vmem=16G,h_data=16G,gpu=1 -pe smp 1 -pty yes bash -l
Parameter guide:
-pty yes means "give me a console"
bash -l is a bash login shell, which will set CUDA variables for you & start a new bash shell.

Notes:
Interactive consoles are meant for debugging only; submit regular jobs via qsub.
qrsh won't wait -- if the cluster is full, it will fail.
Don't forget to exit the console after use -- you're blocking the GPU and whatever else you reserve, as long as the console is open!

The exam will have 10 questions from the pool below. Each question counts for 10 points. We reserve the right to make slight alterations or use variants of the same questions. Note that all of them are covered by the lectures, and they cover most of the lecture content. In general, none of them requires you to memorize formulas, but you should know the main ideas and principles. See the Grading tab for details on grading.
To pass this course, you will need to take an exam and complete the homework assignments.
In case the pandemic gets worse by the exam period, there will be a remote alternative for the exam (an essay with a discussion).
The final grade for the course will be a combination of your exam score and your homework assignment score, weighted 3:1 (i.e. the exam accounts for 75% of the grade, the assignments for 25%).
Grading:
In any case, you need >50% of points from the test and >50% of points from the homeworks to pass. If you get less than 50% from either, you will not pass, even if you get more than 60% overall.
You should be able to pass the course just by following the lectures, but here are some hints on further reading. There's nothing ideal on the topic as this is a very active research area, but some of these should give you a broader overview.
Recommended, though slightly outdated:
Recommended, but might be a bit too brief:
Further reading: