This is the new course for the '22/23 Fall semester. You can find slides from last year on the archived old page.
This course presents advanced problems and current state-of-the-art in the field of dialogue systems, voice assistants, and conversational systems (chatbots). After a brief introduction into the topic, the course will focus mainly on the application of machine learning – especially deep learning/neural networks – in the individual components of the traditional dialogue system architecture as well as in end-to-end approaches (joining multiple components together).
This course is a follow-up to the course NPFL123 Dialogue Systems, but can be taken independently – important basics will be repeated. All required deep learning concepts will be explained, but only briefly, so some machine learning background is recommended.
The course will be taught in English, but we're happy to explain in Czech, too.
In-person lectures and labs take place in the room S8 (Malá Strana, 1st floor).
In addition, we plan to stream both lectures and lab instruction over Zoom and make the recordings available on Youtube (under a private link, on request). We'll do our best to provide a useful experience, just note that the quality might not be ideal.
If you can't access Zoom, email us or text us on Slack.
There's also a Slack workspace you can use to discuss assignments and get news about the course. Please contact us by email if you want to join and haven't got an invite yet.
To pass this course, you will need to take an exam and do lab homeworks, which will amount to training an end-to-end neural dialogue system and writing a report on it. See more details here.
PDFs with lecture slides will appear here shortly before each lecture (more details on each lecture are on a separate tab). You can also check out last year's lecture slides.
1. Introduction Slides Questions
2. Data & Evaluation Slides Dataset Exploration Questions
3. Neural Nets Basics Slides Questions
4. Training Neural Nets Slides MultiWOZ 2.2 Loader Questions
5. Natural Language Understanding Slides Questions
6. Dialogue Management (1) Slides MultiWOZ 2.2 DB + State Questions
7. Dialogue Management (2) Slides Questions
8. Language Generation Slides Finetuning GPT-2 on MultiWOZ Questions
9. End-to-end Models Slides Questions
10. Chatbots Slides Two-stage decoding Questions
11. Multimodal systems Slides Questions
12. Linguistics & Ethics Slides Experiment with your model Bonus 1: Training on DailyDialog Bonus 2: Report Questions
A list of recommended literature is on a separate tab.
10 October Slides Dataset Exploration Questions
24 October Slides MultiWOZ 2.2 Loader Questions
7 November Slides MultiWOZ 2.2 DB + State Questions
21 November Slides Finetuning GPT-2 on MultiWOZ Questions
6 December Slides Two-stage decoding Questions
19 December Slides Experiment with your model Bonus 1: Training on DailyDialog Bonus 2: Report Questions
There will be 6 homework assignments + 2 bonuses, each for a maximum of 10 points. Please see details on grading and deadlines on a separate tab.
Assignments should be submitted via Git – see instructions on a separate tab.
All deadlines are 23:59:59 CET/CEST.
Note: If you don't have a faculty Gitlab account yet, please create one as soon as possible (see the instructions). Don't wait until the deadline! It takes 5 minutes, and if you don't do it, you won't have any way of submitting.
1. Dataset Exploration
2. MultiWOZ 2.2 Loader
3. MultiWOZ 2.2 DB + State
4. Finetuning GPT-2 on MultiWOZ
5. Two-stage decoding
6. Experiment with your model
7. Bonus 1: Training on DailyDialog
8. Bonus 2: Report
Presented: 10 October, Deadline: 27 October
Your task is to select one dialogue dataset, download and explore it.
Here you can use the dataset description/paper that came out with the data. The papers are linked from the dataset webpages or from here. If you can't find a paper, ask us and we'll try to help.
Here you should use your own programming skills. If your dataset has a train/dev/test split, use the training set. If there's no clear separation between a user and a system (e.g. human-human chitchat data, or NLU-only data), provide just the overall numbers.
hw1/README.md
hw1/analysis.py or hw1/analysis.ipynb
See the submission instructions here (clone your Gitlab repo and add a new merge request).
Dataset surveys (broader, but shallower than what we're aiming at):
Presented: 24 October, Deadline: 10 November
Your task is to create a component that loads the task-oriented dataset MultiWOZ 2.2 and processes the data so it is ready for model training. It will consist of two Python classes -- one to hold the data and one to prepare the training batches.
In later assignments, you will train a GPT-2 based model (similar to SOLOIST) using the data provided by this loader. Note that this means that the next assignments depend on this one!
We prepared a set of templates for you to guide your implementation. You should not need to modify the templates, but if you do, please comment on your code changes in the MR. Do not modify the file run.py under any circumstances (contact us if you really think you need to).
The bits that are waiting for your implementation are highlighted with # TODO: in the code.
Note that to use the provided code, you'll need to install the dependencies listed in requirements.txt. They can be installed easily via pip install -r requirements.txt.
MultiWOZ 2.2 is a task-oriented conversational dataset labeled with dialogue acts. It contains around 10k conversations between the user and a Cambridge town info centre (system). The dialogues are about certain topics: restaurants, hotels, trains, taxi, tourist attractions, hospital, and police. You can find more details in the dataset repository.
You can write your own dataset loader from the original format (see the dataset), but we recommend using the Huggingface Datasets library version. Note that there's a bug (old checksum) in HF Datasets, so to load the dataset, use ignore_verifications=True -- it'll work fine.
This is what the data looks like if you load it using Huggingface Datasets: Each entry in the dataset represents one dialogue. The information we are interested in is contained in the field turns, which is a dictionary with the following important keys:
- speaker: Role associated with the speaker. It's either 0 (user) or 1 (system).
- utterance: String representation of the dialogue utterances.
- dialogue_acts: Structured parse of the system utterances into dialogue acts (only in system utterances). It contains slot names and corresponding span_info (location of the slot in the utterance, which will come in handy later).
- frames: Present only in user utterances. Structured representation of the user's belief state.
Each of these keys is mapped to a list with labels for the corresponding turns, i.e. turns['speaker'][0] contains information for the speaker of the first turn and turns['speaker'][-1] of the last one.
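For illustration, here is a minimal sketch (not part of the provided templates) of loading the data and walking through the turns structure described above. The Huggingface dataset id "multi_woz_v22" and the ignore_verifications workaround are what worked for us; double-check them against the current library version.

import datasets

# load MultiWOZ 2.2 from the Huggingface hub; ignore_verifications works around the checksum bug
data = datasets.load_dataset("multi_woz_v22", ignore_verifications=True)

dialog = data["train"][0]      # one entry = one dialogue
turns = dialog["turns"]        # dict of parallel lists, one item per turn

for i in range(len(turns["speaker"])):
    role = "user" if turns["speaker"][i] == 0 else "system"
    print(f"{role}: {turns['utterance'][i]}")
    # system turns carry dialogue_acts (with span_info), user turns carry frames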
The dataset contains the train, validation and test splits. Please respect them!
Note that MultiWOZ also contains a database (and you need database queries for your system to work correctly), but we'll address that later.
You need to implement the following properties for the Dataset class:
{
'context': list[str], # list of utterances preceding the current utterance
'utterance': str, # the string with the current response
'delex_utterance': str, # the string with the current response which is delexicalized, i.e. slot values are
# replaced by corresponding slot names in the text.
}
- A dialogue with n turns will yield n // 2 examples, each with progressively longer context (starting from a context of length 1, up to n-1 turns of context). We are modelling only system responses!
- The context is truncated to the k last utterances, where k is a parameter of the class.
- Use dialogue_acts and its fields span_start, span_end for localizing the parts suitable for delexicalization. Replace those parts with the corresponding slot names from act_slot_name enclosed in brackets, e.g., [name] or [pricerange].

Implement a data loader Python class that has the following properties:
- It yields batches of examples (a simple list with examples of your Dataset) of a batch size given in the constructor.

Machine learning models usually work with numbers and matrices. That is why we also need to convert strings in our batches to integer IDs. Therefore, inside your data loader class, implement a collate function that has the following properties:
It is able to work with batches coming from your data loader (lists of examples).
It uses GPT2Tokenizer to split all strings into tokens (subwords) and assign them IDs.
It converts the batches to a single dictionary (output) of the following structure:
output = {
'context': list[list[int]], # tokenized context (list of subword ids from all preceding dialogue turns,
# system turns prepended with `<|system|>` token and user turns with `<|user|>`)
# for all batch examples
'utterance': list[list[int]], # tokenized utterances (list of subword ids from the current dialogue turn)
# for all batch examples
'delex_utterance': list[list[int]], # tokenized and delexicalized utterances (list of subword ids
# from the current dialogue turn) for all batch examples
}
where {k : output[k][i] for k in output}
should correspond to i-th example of the original input batch.
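To make the tokenization step concrete, here is a minimal sketch of what such a collate function could look like (our own illustration, not the reference solution; the speaker-marker logic is an assumption based on the format described above):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# register the special speaker tokens (see the note below)
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|user|>", "<|system|>"]})

def collate(batch):
    output = {"context": [], "utterance": [], "delex_utterance": []}
    for ex in batch:
        ctx_ids = []
        for turn_idx, turn in enumerate(ex["context"]):
            # context turns alternate; the last context turn is always the user
            marker = "<|system|>" if (len(ex["context"]) - turn_idx) % 2 == 0 else "<|user|>"
            ctx_ids.extend(tokenizer(marker + " " + turn)["input_ids"])
        output["context"].append(ctx_ids)
        output["utterance"].append(tokenizer(ex["utterance"])["input_ids"])
        output["delex_utterance"].append(tokenizer(ex["delex_utterance"])["input_ids"])
    return output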
Don't forget to add the special tokens (<|user|>, <|system|>) to the tokenizer (use the additional_special_tokens argument of the tokenizer)!

Submit the following:
- your implementation in diallama/mw_loader.py
- the output of hw2/test.py run on your data (test set is used by default), as hw2/results_test.txt. Have a look at what the script is doing.

Presented: 7 November, Deadline: 25 November
This assignment is a continuation of HW2.
Your task will be to extend your previously created DataLoader with the belief state and database information.
When you update your repo from the upstream base repo, you should be able to merge
our added code into your HW2 implementation, and continue working on HW3 code.
In later assignments, you will train the GPT-2 model (similar to SOLOIST) using the data provided by the loader you develop here. Note that this means that the next assignments depend on this one!
The implementation includes changes to the MultiWOZDatabase class (database search handling), the Dataset class (including database results and the belief state), and the DataLoader class (also including database results and the belief state).
The MultiWOZ dataset is task-oriented and the database is an important part of it. The database stores entities that are available for each domain, along with their attributes. You will use the database results when modelling the conversations, and therefore you need to implement the database query API. However, some domains are specific and their database queries need to be handled in a special way. Also, the MultiWOZ dataset has a few rather annoying quirks. Therefore, we provide for you a partially implemented database class, which already handles things that would be too annoying to deal with. You still need to implement some things, though:
- Converting time values into a uniform format, e.g.:
  3pm -> 15:00
  noon -> 12:00
  three forty five -> 15:45
  etc.
  (see diallama/database.py).

The bits that are waiting for your implementation are highlighted with # TODO: in the code.
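For a rough idea of what time normalization can look like, here is a small sketch (our own, not the course's reference solution): it only covers a few patterns and leaves anything it doesn't recognize untouched.

import re

WORD_TIMES = {"noon": "12:00", "midnight": "00:00"}

def normalize_time(text: str) -> str:
    text = text.strip().lower()
    if text in WORD_TIMES:
        return WORD_TIMES[text]
    # e.g. "3pm" -> "15:00", "11:30am" -> "11:30"
    m = re.match(r"^(\d{1,2})(?::(\d{2}))?\s*(am|pm)?$", text)
    if not m:
        return text  # leave unrecognized expressions as they are
    hour, minute, suffix = int(m.group(1)), m.group(2) or "00", m.group(3)
    if suffix == "pm" and hour < 12:
        hour += 12
    if suffix == "am" and hour == 12:
        hour = 0
    return f"{hour:02d}:{minute}"

assert normalize_time("3pm") == "15:00"
assert normalize_time("noon") == "12:00"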
Note that to use the provided code, you need to install fuzzywuzzy. It is listed in the requirements.txt file, so if you followed the installation instructions, you probably have it already. We recommend using it for partial matches, e.g., it allows you to match "London" to "London King's Cross" and similar situations.
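A quick example of the kind of partial matching fuzzywuzzy provides (the threshold value is just an illustration, pick your own):

from fuzzywuzzy import fuzz

print(fuzz.partial_ratio("london", "london kings cross"))     # high score (~100)
print(fuzz.partial_ratio("london", "birmingham new street"))  # low score
# e.g. treat scores above some threshold (say 90) as a match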
This is an extension of the class from HW2. You'll need to implement code in the same spots as for HW2, just add some more.
You will add new belief_state and database_results fields, so each example will look like this:
{
'context': list[str], # list of utterances preceding the current utterance
'utterance': str, # the string with the current response
'delex_utterance': str, # the string with the current response which is delexicalized, i.e. slot values are
# replaced by corresponding slot names in the text.
'belief_state': dict[str, dict[str, str]], # belief state dictionary, for each domain a separate belief state dictionary,
# choose a single slot value if more than one option is available
'database_results': dict[str, int] # dictionary containing the number of matching results per domain
}
belief_state is a dictionary that maps each domain to its belief state (slot-value pairs), i.e.
{ 'restaurant': {'pricerange': 'ab', 'area': 'cd', ...}, 'hotel': {'parking': 'ef', ...}, ... }
Look into the frames fields of user utterances in the dataset to build the belief state.
database_results represents the counts of database entities matching the current belief state for each domain, e.g.
{ 'restaurant': 101, 'hotel': 42, ... }
You need to distinguish between the cases where 0 entities match and where the domain was not mentioned in the belief state and thus was not queried at all! Don't mention the domain in the results in the latter case.

Again, you just need to extend your previously implemented class, so all the previous features (yielding batches, grouping similar lengths, shuffling...) still apply. And again, you'll need to implement code in the same spots as for HW2, just add a little more.
Here you need to extend your collate function:
The output of the function should now look like this -- note the new belief_state and database_results fields:
output = {
'context': list[list[int]], # tokenized context (list of subword ids from all preceding dialogue turns,
# system turns prepended with `<|system|>` token and user turns with `<|user|>`)
# for all batch examples
'utterance': list[list[int]], # tokenized utterances (list of subword ids from the current dialogue turn)
# for all batch examples
'delex_utterance': list[list[int]], # tokenized and delexicalized utterances (list of subword ids
# from the current dialogue turn) for all batch examples
'belief_state': list[list[int]], # belief state dictionary serialized into a string representation and prepended with
# the `<|belief|>` special token and tokenized (list of subword ids
# from the current dialogue turn) for all batch examples
'database_results': list[list[int]], # database result counts serialized into string prepended with the `<|database|>`
# special token and tokenized (list of subword ids from the current dialogue turn)
# for all batch examples
}
where {k : output[k][i] for k in output}
should correspond to i-th example of the original input batch.
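One possible way to serialize the belief state and database counts into flat strings is sketched below (our own illustration; the exact spacing and ordering are up to you, as long as you match the example format shown just below):

def serialize_belief_state(belief_state: dict) -> str:
    domains = []
    for domain, slots in belief_state.items():
        pairs = " , ".join(f"{slot} : {value}" for slot, value in slots.items())
        domains.append(f"{domain} {{ {pairs} }}")
    return "<|belief|> { " + " ".join(domains) + " }"

def serialize_database_results(db_results: dict) -> str:
    counts = " , ".join(f"{domain} {count}" for domain, count in db_results.items())
    return "<|database|> { " + counts + " }"

# serialize_belief_state({'restaurant': {'area': 'center', 'pricerange': 'cheap'}})
# -> '<|belief|> { restaurant { area : center , pricerange : cheap } }'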
Don't forget to add the new special tokens (<|belief|>, <|database|>) to the tokenizer (use the additional_special_tokens argument of the tokenizer)!
The serialized belief state and database results could look e.g. like this:
<|belief|> { restaurant { area : center , pricerange : cheap } attraction { area : south } } <|database|> { restaurant 45 , attraction 23 }

Submit the following:
- your implementation in diallama/mw_loader.py and diallama/database.py
- the output of hw3/test.py run on your data (test set is used by default), as hw3/results_test.txt. Have a look at what the script is doing.

Presented: 21st November, Deadline: 5 January (extended)
In this assignment, you will be fine-tuning the GPT-2 language model on the MultiWOZ dataset that you prepared. We'll ignore the state tracking and database for now, that will come later on.
You'll need to add a few more steps to your data loader:
You will work with diallama/mw_loader.py and modify the collate() method in the following way:
- Concatenate the context and the delexicalized utterance into a single sequence of input_ids, using <|endoftext|> tokens as a delimiter and as the last token.
- Build masks over this sequence that are True for context/utterance tokens only (see the example below).

Loader outputs (collated) from HW2 looked like this:
<|ENDOFTEXT|> = 3320
<|USER|> = 3321
<|SYSTEM|> = 3322
contexts = [[3321, 1, 2, 3322, 3, 4, 5, 6, 3321, 7, 8, 9], [3321, 10, 11]]
delex_utterances = [[12, 13 , 14], [15, 16, 17, 18]]
What we need is to make them look like this:
input_ids = [
[[3321, 1, 2, 3322, 3, 4, 5, 6, 3321, 7, 8, 9, 3320, 12, 13, 14, 3320],
[3321, 10, 11, 3320, 15, 16, 17, 18, 3320, 0, 0, 0, 0, 0, 0, 0, 0]]
] # concatenation and padding
context_mask = [
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
]
utterance_mask = [
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]]
]
attention_mask = [
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
]
...
Notice 3320 as the <|endoftext|> token and the zero padding in input_ids. Check the positions of True and False for both masks with respect to input_ids.
For the model training, we have prepared the script train.py that uses the class Trainer from trainer.py. Your task will be to fill in the TODOs there to implement the training loop and validation step. You will also need to create an optimizer and scheduler.
Load the pre-trained GPT-2 model from the Huggingface Transformers library. More precisely, instantiate the GPT2LMHeadModel class and load the weights from the pretrained model (see .from_pretrained(...)). Use the smallest version of the model ('gpt2'). If you like experimenting, you can replace the GPT-2 model with a similar model trained on conversational data only, e.g., DialoGPT. You can find and browse all pre-trained Huggingface models here.
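A minimal sketch of the model setup (assuming the special tokens from the earlier assignments are registered with the tokenizer; this is our illustration, not the reference code):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|user|>", "<|system|>"]})

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # make room for the new special tokens

# drop-in alternative trained on conversational data (same API):
# model = GPT2LMHeadModel.from_pretrained("microsoft/DialoGPT-small")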
Fine-tune the model on the response generation task. This means that your objective is to minimize the negative log-likelihood (NLL) of the training data with respect to your model. Feed the whole input_ids tensors into your model, but when computing the loss, only the utterance tokens should be considered (use utterance_mask for the calculation). Don't forget to use the attention_mask for GPT-2 training, so you avoid performing attention over padding.
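One possible way to compute this masked NLL is sketched below (our own illustration under the assumption that input_ids, attention_mask, and utterance_mask are the padded tensors from your collate function; it is not the only valid implementation):

import torch
import torch.nn.functional as F

def response_nll(model, input_ids, attention_mask, utterance_mask):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # predict token t+1 from position t: shift logits and targets by one
    logits = outputs.logits[:, :-1, :]
    targets = input_ids[:, 1:]
    mask = utterance_mask[:, 1:].float()   # count loss only on utterance tokens
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none")
    return (loss * mask.reshape(-1)).sum() / mask.sum()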
Feel free to experiment with the optimizer/scheduler and training parameters. A good choice might be the ones preset by Huggingface (AdamW, Linear schedule with warmup).
Use the largest batch size you can (the largest where your GPU doesn't run out of memory). It might actually be very small (1-4).
Monitor the training and validation loss and use it to determine the hyperparameters (number of training epochs, learning rate, learning rate schedule, ...).
First start debugging with very small data, just a few batches (test if the model learns something by checking outputs on the training data).
Fix your random seeds so your results are repeatable, and you can tell if you actually changed something (this must be done separately for Python, Numpy, and PyTorch/Tensorflow -- see the sketch below).
Note: Training on CPU is usually slow, therefore we like GPUs. You can use Google Colab which provides GPUs for free for a limited time span. You can also ask Ondrej for an account on our in-house student computing cluster (please do that ASAP). The student cluster is now undergoing an upgrade, that's why the deadline is a week later. But you can prepare and debug your setup even without a GPU, then only run on the full data once you have access to a GPU.
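A small sketch of the seeding mentioned above (our own helper, fixing the separate RNGs of Python, NumPy and PyTorch):

import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op if you only run on CPU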
Huggingface provides several options for decoding the outputs of your model. Go through the tutorial and choose a decoding method of your liking (you can go with greedy as the base option). Use it to generate utterances for all contexts available in the test set.
We have prepared the class GenerationWrapper, which you will need to complete to implement generation from the model.
Optional -- bonus points: Implement batch decoding as well. This is completely optional, if you are interested in the implementation, let us know.
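For orientation, here is a sketch of greedy decoding with Huggingface's generate(), independent of the provided GenerationWrapper (the helper name and the use of <|endoftext|> as the context/response delimiter follow the data format above; treat it as an assumption, not the reference solution):

import torch

def generate_response(model, tokenizer, context_ids, max_new_tokens=60):
    # append the <|endoftext|> delimiter after the context, as in training
    input_ids = torch.tensor([context_ids + [tokenizer.eos_token_id]])
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=False,                      # greedy; switch to sampling/beams if you like
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    # keep only the newly generated part and turn it back into a string
    return tokenizer.decode(output[0, input_ids.size(1):], skip_special_tokens=True)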
Besides the training and validation loss, we want you to report the following measures on the test set:
- token accuracy (use argmax on the predicted raw logits and compare the result with the ground-truth token ids)
- perplexity

Submit the following:
- your updated data loader (diallama/mw_loader.py)
- your training code (diallama/trainer.py, diallama/train.py)
- a text file (hw4/multiwoz_outputs.txt) containing the generated test set responses, each on a separate line
- a text file (hw4/multiwoz_scores.txt) containing your token accuracy and perplexity on the validation set

Presented: 5th December, Deadline: 5th January (extended)
In this assignment, you will continue fine-tuning the GPT-2 language model on the MultiWOZ dataset you prepared.
This time, your model will be enhanced with a belief tracking component and database access using 2-stage decoding.
You'll need to add a few final modifications to your data loader.
You will work with diallama/mw_loader.py and modify the collate() method in the following way:
- Add the serialized belief state and database results into input_ids (concatenate them between the context and the utterance).
- Build the corresponding masks, which are True for relevant tokens only (see the example below).

Loader outputs (collated) from HW3 looked like this:
<|ENDOFTEXT|> = 3320
<|USER|> = 3321
<|SYSTEM|> = 3322
<|BELIEF|> = 3323
<|DB|> = 3324
contexts = [[3321, 1, 2, 3322, 3, 4, 5, 6, 3321, 7, 8, 9], [3321, 10, 11]]
utterances = [[12, 13 , 14], [15, 16, 17, 18]]
delex_utterances = [[12, 1111, 14], [15, 16, 1112, 18]] # some tokens replaced by delex. procedure
beliefs = [[3323, 100, 101], [3323, 102]]
dbs = [[3324, 204], [3324, 207]]
What we need is to make them look like this:
labels = [
[[3321, 1, 2, 3322, 3, 4, 5, 6, 3321, 7, 8, 9, 3323, 100, 101, 3324, 204, 3320, 12, 1111, 14, 3320],
[3321, 10, 11, 3323, 102, 3324, 207, 3320, 15, 16, 1112, 18, 3320, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
]
context_mask = [
[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
]
belief_mask = [
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
]
database_mask = [
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
]
utterance_mask = [
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
]
...
Notice the special tokens and the zero padding in input_ids. Check the positions of True and False (1 and 0) for all the masks with respect to input_ids.
For the model training, you will extend the code in train.py and in the Trainer class in trainer.py that you created in HW4, so the model performs belief tracking correctly. When training the model, you will train for both belief tracking and response generation simultaneously. This means that your objective is to minimize the negative log-likelihood (NLL) of the training data with respect to your model. Feed the whole input_ids tensors into your model, but when computing the loss, only the utterance and belief state tokens should be considered (use utterance_mask and belief_mask for the calculation).
For general tips about the training, see the HW4 assignment.
Here, you will implement the 2-stage decoding.
Extend your implementation of the GenerationWrapper in the following way:
- Pass the MultiWOZDatabase into the constructor and save it as a model's property.
- In the first stage, feed in the context and let the model generate the belief state; stop decoding once the <|endoftext|> token is generated.
- Parse the generated belief state into a dictionary (Dict[Text, Dict[Text, Text]]) and use it to query the database.
- Serialize the database results in the same way as in the collate() function, append them to the input, and generate the final response in the second stage.

Besides the training and validation loss, we want you to report the following measures on the test set:
- token accuracy (use argmax on the predicted raw logits and compare the result with the ground-truth token ids); report the token accuracy for belief tracking and response generation separately, as two different metrics
- perplexity

Submit the following:
- your updated data loader (diallama/mw_loader.py)
- your training code (diallama/trainer.py, diallama/train.py)
- a text file (hw5/multiwoz_outputs.txt) containing the generated responses from test subset T, each on a separate line
- a text file (hw5/multiwoz_scores.txt) containing your token accuracies and perplexity on the test set

Presented: 19th December, Deadline: 27 January
In this assignment, you will work with the model trained in HW5 and perform some more experiments. Basically, we will try to answer two questions:
To evaluate your model's performance, you will report several metrics. Specifically, we want you to report: BLEU, dialogue success rate, the number of distinct tokens, and conditional bigram entropy.
To be able to compute the metrics, you will need to generate predictions from your model and save them in a machine-readable format, e.g. json. Use a subset of the test set (first 100 dialogues) for generating the predictions.
For the computation of the scores itself, you are free to use any implementation you like. However, the easiest way is to use this evaluation script.
It can be easily installed via pip and allows you to measure all the required metrics (and some more). The script has been added to the requirements in the repository (if you update from upstream). For usage instructions, see its GitHub page.
In this part of the assignment, you will need to modify your model's training process and retrain the model subsequently. The goal of this modification is to improve the belief state tracking performance of your model. To achieve this, we introduce an additional training objective: The model will have to distinguish between the ground-truth belief state and a corrupted version of the belief state. You will need to do the following modifications:
In diallama/mw_loader.py, in the Dataset class, you will add a new field to each generated example. This field will contain a corrupted belief state. You will create the corrupted belief state by replacing each slot value with a different one with probability p_r = 1/3, i.e. approximately 1/3 of slot values will be changed.
The corrupted belief state will then be encoded in the same way as the ground truth belief state.
When you concatenate the context, belief state and database results, you will determine whether this example uses the corrupted state or not based on probability p_c = 1/4
, i.e. 1/4 of the training examples will contain the corrupted state.
You will also need to add a binary flag determining if a particular example was corrupted.
To sum up, you should follow this procedure for each example when preparing the data:
1. (Dataset) When the belief state dict is ready, make a corrupted version of it by randomly replacing each slot value with a probability of 1/3 (independently).
2. (Dataset) Add this corrupted belief state to the example dict.
3. (DataLoader) Decide if the generated example will use the corrupted belief state, with p = 1/4.
4. (DataLoader) If yes, use the corrupted belief state when concatenating the input_ids; use the ground truth otherwise.
5. (DataLoader) Add a binary flag corrupted that is set to True iff you used the corrupted belief state.

It's important to do steps 3-5 in the DataLoader, so one example can appear corrupted in one epoch but normal in others. Alternatively, everything could be done in the DataLoader, but it isn't necessary.
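A small sketch of step 1 (our own illustration; value_pool is a hypothetical helper you would collect while loading the data, mapping each domain/slot to the values seen in the training set):

import copy
import random

def corrupt_belief_state(belief_state, value_pool, p_r=1/3):
    corrupted = copy.deepcopy(belief_state)
    for domain, slots in corrupted.items():
        for slot in slots:
            if random.random() < p_r:
                # replace the value with a random one seen for this domain/slot
                slots[slot] = random.choice(value_pool[domain][slot])
    return corrupted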
In this assignment, you will implement an additional training objective. Specifically, you will add a classification module that will perform a binary decision to detect belief state consistency (i.e. whether it was corrupted or not). To achieve this, the model has to be slightly modified. You can either use a Huggingface model class that already provides a second (classification) head, or keep GPT2LMHeadModel and add a custom classification layer manually. Either way, we end up with two heads (the LM head and the new classification head) on top of the pretrained transformer model.
You will train the LM head the same way as before (i.e. with cross-entropy loss using labels derived from input_ids). To train the consistency classification head, gather logits from the new classification head from all time steps and sum them up. Then apply a linear+softmax layer on top of the sum and minimize the binary cross-entropy between the predicted binary flag and the ground truth (i.e., whether you fed in the true state or the corrupted one). Combine the losses (LM and consistency) as a weighted sum. Note that for the corrupted examples, you should not backpropagate the LM loss. That's where we use the corrupted flag.
Again, you have multiple options. You either
(1) do not backpropagate the belief loss at all (you have to split the computation of losses for belief state and response)
or (2) treat the labels for the belief state accordingly.
The (2) is easier to implement and is actually more efficient as well. Nevertheless, (1) can give you more control over the training. The decision is up to you.
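Here is a rough sketch of option (2): mask the belief-state targets of corrupted examples with -100 so the cross-entropy ignores them, and add the consistency loss as a weighted term. All variable names (loss_mask, belief_mask, corrupted, alpha) are our own assumptions about what your loader/model provides, not the reference solution.

import torch
import torch.nn.functional as F

def combined_loss(lm_logits, input_ids, loss_mask, belief_mask, corrupted,
                  consistency_logits, alpha=0.5):
    # targets: next-token prediction, counted only where loss_mask (belief + utterance) is set
    targets = input_ids.clone()
    targets[loss_mask == 0] = -100
    # for corrupted examples, also drop the belief-state targets
    targets[(belief_mask == 1) & corrupted.unsqueeze(1)] = -100
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        targets[:, 1:].reshape(-1), ignore_index=-100)
    # binary consistency loss from the classification head
    consistency_loss = F.cross_entropy(consistency_logits, corrupted.long())
    return lm_loss + alpha * consistency_loss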
Measure the same set of metrics as with the previous version of the model (see HW5), plus the metrics mentioned above (use the evaluation script). You don't need to use the additional head or any belief state corruption during the prediction. Use a subset of the test set for the experiments (first 100 dialogues) so you save time with the inference.
Submit the following:
- your updated data loader (diallama/mw_loader.py)
- your updated training code (diallama/trainer.py; you may use multiple files if you want)
- a file hw6/mw_outputs.json containing your generated test set belief states + responses (on the subset of the first 100 dialogues); the ideal format is structured and machine-readable, e.g. json
- a text file (hw06/multiwoz_metrics.txt) containing the metrics described above (BLEU, success, distinct tokens, conditional bigram entropy) for both model variants -- with and without the state corruption; compute them on the same subset of the first 100 dialogues in the test set

Presented: 19th December, Deadline: 15 September
The first bonus assignment is just making your system work on a different dataset. We will be using the DailyDialog data. Your task is to adapt the loader you created in HW2 so that it works on this dataset. You then need to run your model in the version from HW4 on this data -- the two-stage decoding is not applicable here.
DailyDialog is a chit-chat dialogue dataset labeled with intents and emotions. You can find more details in the paper describing the dataset.
Each DailyDialog entry consists of:
- dialog: a list of string features.
- act: a list of classification labels, e.g., question, commissive, ...
- emotion: a list of classification labels, e.g., anger, happiness, ...
The lists are of the same length and the order matters (it's the order of the turns in the dialogue, i.e. the 5th entry in the act list corresponds to the 5th entry in the dialog list).
The data contains train, validation and test splits.
Create your own version of the Dataset class (either by duplicating the code, or preferably by using a derived class) that is able to load DailyDialog data and process it into individual training examples. You don't need to care about acts or emotions; the only important thing is contexts and responses. Each resulting example should be a dictionary of the following structure:
{
'context': list[str], # list of utterances preceding the current utterance
'utterance': str, # the string with the current response
}
Note that in this case, you don't treat user and system turns differently -- you'll create a training example from each utterance (using the preceding utterances as context).
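A short sketch of the example creation (our own illustration; the Huggingface dataset id "daily_dialog" is an assumption on our side, double-check that it loads for you):

import datasets

data = datasets.load_dataset("daily_dialog")

examples = []
for entry in data["train"]:
    utterances = entry["dialog"]
    for i in range(1, len(utterances)):      # every turn except the first is a response
        examples.append({
            "context": utterances[:i],       # all preceding turns
            "utterance": utterances[i],      # the current response
        })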
Make any necessary adjustments in your Data Loader and Model classes, so they are able to handle DailyDialog input (these should be minimal).
Train your model on the new data exactly in the same way you did for HW4. You probably don't need as much debugging here, assuming the input looks reasonable, but you may need to change the training parameters.
Measure the same metrics you did for HW4, i.e. token accuracy and perplexity on the DailyDialog test set.
Name your branch hw7 for this submission. Include the following:
- your new data loader (diallama/dd_loader.py) and updated training code (diallama/trainer.py, diallama/train.py)
- a text file (hw7/dd_outputs.txt) containing the generated test set responses, each on a separate line
- a text file (hw7/dd_scores.txt) containing your token accuracy and perplexity on the validation set

Presented: 19th December, Deadline: 15th September
The basic idea of the second bonus assignment is that you write a ca. 3-page report (1500 words), detailing your model and the experiments, so it all looks like an academic paper. The purpose of this is to give you some writing training, which might come in handy for your master's thesis or other projects.
Have a look at Ondrej's tips for writing reports here before you start writing!
The prescribed format for your report is LaTeX, with the ACL Rolling Review templates. You can get the templates directly on Overleaf or download them for offline use.
Name your branch hw8 for this submission. Include this:
- the report itself (hw8/report.pdf)
- the report sources (hw8/*.*)
- the files supporting your error analysis (hw8/error_analysis/*.* -- best as either plain text or JSON)

All homework assignments will be submitted using a Git repository on MFF GitLab.
We provide an easy recipe to set up your repository below:
git remote show origin
You should see these two lines:
* remote origin
Fetch URL: git@gitlab.mff.cuni.cz:teaching/NPFL099/2022/your_username.git
Push URL: git@gitlab.mff.cuni.cz:teaching/NPFL099/2022/your_username.git
Add the base repository as a new remote called upstream:
git remote add upstream https://gitlab.mff.cuni.cz/teaching/NPFL099/base.git
Create a new branch for the assignment:
git checkout master
git checkout -b hwX
Solve the assignment :)
Add new files (if applicable) and commit your changes:
git add hwX/solution.py
git commit -am "commit message"
git push origin hwX
Create a Merge request in the web interface. Make sure you create the merge request into the master branch in your own forked repository (not into the upstream).
Merge requests -> New merge request
You'll probably need to update from the upstream base repository every once in a while (most probably before you start implementing each assignment). We'll let you know when we make changes to the base repo.
To upgrade from upstream, do the following:
git checkout master
git fetch upstream master
git merge upstream/master master
You can run some basic sanity checks for homework assignments -- they are included in your repository
(make sure to upgrade from upstream first).
Note that the tests require stuff from requirements.txt
to be installed in your Python environment.
The tests check files in the current directory and assume you have the correct branches set up.
For instance, to check hw1
, run:
./run_tests.py hw1
By default, this will just check your local files. If you want to check whether you have
your branches set up correctly, use the --check-git
parameter.
Note that this will run git checkout hw1
and git pull
, so be sure to save any
local changes beforehand!
Always update from upstream before running tests, we're adding checks for new assignments as we go. Some may only be available at the last minute, we're sorry for that!
This is just a short primer for the AIC wiki – better read that one, too. But definitely read at least this text before you start working with AIC.
Use the command
ssh LOGIN@aic.ufal.mff.cuni.cz
where LOGIN is your SIS username.
When you log on to AIC, you're at the cluster head node. Do not compute here – this is just for launching computation jobs, copying files and such. All of your computation jobs will run on one of the CPU/GPU nodes. (You can run a terminal multiplexer such as screen or tmux on the head node.)
There are two ways to compute on the cluster:
You should use a batch script for running longer computations. The interactive shell is useful for debugging.
Use the sbatch command to submit your jobs (i.e. shell scripts) into a queue. For running a python command, simply create a shell script that has one line – your command with all the parameters you need. You can specify the sbatch parameters either in the script or on the command line. Here are two equivalent ways of specifying a GPU job with 2 CPU cores, 1 GPU and 16G system RAM (all GPUs have 11G memory):
1. Parameters inside job_script.sh:
#!/bin/bash
#SBATCH -J hello_world # name of job
#SBATCH -p gpu # name of partition or queue (if not specified default partition is used)
#SBATCH --cpus-per-task=2 # number of cores/threads per task (default 1)
#SBATCH --gpus=1 # number of GPUs to request (default 0)
#SBATCH --mem=16G # request 16 gigabytes memory (per node, default depends on node)
# here start the actual commands
sleep 5
echo "Hello I am running on cluster!"
sbatch job_script.sh
2. Parameters on the command line; job_script.sh:
#!/bin/bash
sleep 5
echo "Hello I am running on cluster!"
sbatch -J hello_world -p gpu -c2 -G1 --mem 16G job_script.sh
Have a look at the AIC wiki or man sbatch
for all the command-line parameters.
(Note: long / short flags can be used interchangeably for both approaches.)
You can get an interactive console using srun. The following command will run bash with the same resources as in the previous example:
srun -J hello_world -p gpu -c2 -G1 --mem=16G --pty bash
- Don't forget to exit the console after use – you're blocking the GPU and whatever else you reserve as long as the console is open!
- Use sinfo to list the available queues.
- Use squeue --me or squeue -u LOGIN (where LOGIN is your username) to check your jobs.
- Use squeue to see every job currently running on the cluster.
- Use scancel JOB_ID to cancel a job.
- To copy files to/from the cluster, you can connect via sftp://LOGIN@aic.ufal.mff.cuni.cz
The exam will have 10 questions from the pool below. Each question counts for 10 points. We reserve the right to make slight alterations or use variants of the same questions. Note that all of them are covered by the lectures, and they cover most of the lecture content. In general, none of them requires you to memorize formulas, but you should know the main ideas and principles. See the Grading tab for details on grading.
To pass this course, you will need to:
In case the pandemic gets worse by the exam period, there will be a remote alternative for the exam (an essay with a discussion).
The final grade for the course will be a combination of your exam score and your homework assignment score, weighted 3:1 (i.e. the exam accounts for 75% of the grade, the assignments for 25%).
Grading:
In any case, you need >50% of points from the test and 40+ points (i.e. 66%) from the homeworks to pass. If you get less than the minimum from either, even if you get more than 60% overall, you will not pass.
You should be able to pass the course just by following the lectures, but here are some hints on further reading. There's nothing ideal on the topic as this is a very active research area, but some of these should give you a broader overview.
Recommended, though slightly outdated:
Recommended, but might be a bit too brief:
Further reading: