This course is a detailed introduction into the architecture of spoken dialogue systems, voice assistants and conversational systems (chatbots). We will introduce the main components of dialogue systems (speech recognition, language understanding, dialogue management, language generation and speech synthesis) and show alternative approaches to their implementation.
The lab sessions will be dedicated to implementing a simple dialogue system and selected components (via weekly homework assignments). We will use Python and a version of our Dialmonkey framework for this.
6 March: The lecture goes on as normal, but the lab is cancelled today. Don't worry, we'll catch up. HW2 deadline is extended by a week.
The course will be taught in English, but we're happy to explain in Czech, too.
In-person lectures and labs take place in the Malá Strana building.
In addition, we plan to stream both lectures and lab instruction over Zoom and make the recordings available on YouTube (under a private link, sent out on request to enrolled students at the start of the semester). We'll do our best to provide a useful experience.
There's also a Slack workspace you can use to discuss assignments and get news about the course. Invite links will be sent out to all enrolled students by the start of the semester. Please contact us by email if you want to join and haven't got an invite yet. (Also, if you know of a better platform, let Ondřej know, Slack's 3-month history sucks.)
To pass this course, you will need to take an exam and do lab homeworks. There's a 60% points minimum for the exam and 50% for the homeworks to pass the course. See more details here.
PDFs with lecture slides will appear here shortly before each lecture (more details on each lecture are on a separate tab). You can also check out last year's lecture slides.
1. Introduction Slides Domain selection Questions
2. What happens in a dialogue? Slides Dataset exploration Questions
3. Data & Evaluation Slides Questions
4. Natural Language Understanding Slides Questions Rule-based Natural Language Understanding
5. Neural NLU + State Tracking Slides Questions Statistical Natural Language Understanding
A list of recommended literature is on a separate tab.
20 February Slides Domain selection Questions
27 February Slides Dataset exploration Questions
13 March Slides Questions Rule-based Natural Language Understanding
20 March Slides Questions Statistical Natural Language Understanding
There will be 12 homework assignments, each for a maximum of 10 points. Please see details on grading and deadlines on a separate tab. Note that there's a 50% minimum requirement to pass the course.
Assignments should be submitted via Git – see instructions on a separate tab. Please take special care about naming your Git branches and files the way they're given in the assignments. If our automatic checks don't find your files, you'll lose points!
You should run automatic checks before submitting -- have a look at TESTS.md. Code that crashes during the automatic checks will not get any points. You may fail the checks and still get full points, or ace the checks and get no points (especially if your code games the checks). Note that you should update your checkout since the code for the assignments might be changed during the semester.
All deadlines are 23:59:59 CET/CEST.
3. Rule-based Natural Language Understanding
4. Statistical Natural Language Understanding
Presented: 20 February, Deadline: 10 March
You will be building a task-oriented dialogue system in (some of) the homeworks for this course. Your first task is to choose a domain and imagine how your system will look and work. Since you might later find that you don't like the domain, you are required to pick two now -- that way, you'll have more/better ideas later and can choose just one of them for building the system.
The required steps for this homework are:
Pick two domains of your liking that are suitable for building a task-oriented dialogue system. Think of a reasonable backend (see below).
Write 5 example system-user dialogues for each domain, each at least 5+5 turns long (5 utterances for the user and 5 for the system). This will make sure that your domain is interesting enough. You do not necessarily have to use English here (but it's easier if we understand the language you're using -- ask us if unsure; Czech & Slovak are perfectly fine).
Create a flowchart for your two domains, with labels such as “ask about phone number”, “reply with phone number”, “something else” etc. It should cover all of your example dialogues. You can use e.g. Mermaid to do this, but the format is not important. Feel free to draw this by hand and take a photo, as long as it's legible.
Please stick to the file naming conventions -- you will lose points if you don't!
hw1/README.md
with short commentary on both domains (ca. 10-15 sentences) -- what they are, what features you'd like to include, what will be the backend.
hw1/examples-<domain1>.txt
, hw1/examples-<domain2>.txt
-- 5 example dialogues for each of the domains (as described above). Use a short domain name, best with just letters and underscores.
hw1/flowchart-<domain1>.{pdf,jpg,png}
, hw1/flowchart-<domain2>.{pdf,jpg,png}
-- the flowcharts for each of the domains, as described above.
See the instructions on submission via Git -- create a branch and a merge request with your changes. Make sure to name your branch hw1
so we can find it easily.
You may choose any domain you like, be it tourist information, information about cultural events/traffic, news, scheduling/agenda, task completion etc. You can take inspiration from stuff presented in the first lecture, or you may choose your own topic.
Since your domain will likely need to be connected to some backend database, you might want to make use of some external public APIs -- feel free to choose from one of these links:
You can of course choose anything else you like as your backend, e.g. portions of Wikidata/DBPedia or other world knowledge DBs, or even a handwritten “toy” database of a meaningful size, which you'll need to write to be able to test your system.
Presented: 27 February, Deadline: 24 March (extended!)
The task in this lab is to explore dialogue datasets and find out more about them. Your job will thus be to write a script that computes some basic statistics about datasets, and then try to interpret the script's results.
Take a look at the Dialog bAbI Tasks Data 1-6 dataset. Read the description of the data format in the readme.txt file. You'll be working with Tasks 5 and 6 (containing full generated dialogues and DSTC2 data). Use the training sets for Task 5 and Task 6.
Write a script that will read all turns in the data and separate the user and system utterances in the training set.
If the user utterance is just <SILENCE>, the script should concatenate the system response from this turn to the previous turn's system response. Note that this may happen on multiple consecutive turns, and the script should join all of these together into one system response.
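The merging rule above can be sketched as follows. This is a minimal sketch only: it assumes each dialogue has already been read into a list of (user, system) utterance pairs; the surrounding file-reading code is up to you.

```python
def merge_silence(turns):
    """Merge <SILENCE> turns into the previous system response.

    `turns` is a list of (user, system) string pairs for one dialogue.
    """
    merged = []
    for user, system in turns:
        if user == "<SILENCE>" and merged:
            # no user input: glue this system response onto the previous one;
            # this also handles several consecutive <SILENCE> turns
            prev_user, prev_system = merged[-1]
            merged[-1] = (prev_user, prev_system + " " + system)
        elif user == "<SILENCE>":
            # dialogue-initial <SILENCE>: just drop the token, keep the system turn
            merged.append(("", system))
        else:
            merged.append((user, system))
    return merged
```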
If <SILENCE> is the first word in the dialogue, just delete it.
Tokenize the utterances using the word_tokenize function from the nltk package.
Implement a routine that will compute the following statistics for both bAbI tasks for system and user turns (separately, i.e., 4 sets of statistics altogether):
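Once the turns are tokenized, the statistics can be computed with the standard library alone. The following is a sketch, not the reference implementation: it assumes one side (user or system) is given as a list of dialogues, each a list of turns, each a list of tokens, and it approximates the conditional entropy from bigram counts over the concatenated token stream (a more careful version would avoid counting bigrams across turn boundaries).

```python
import math
from collections import Counter
from statistics import mean, pstdev

def turn_stats(dialogues):
    """Compute the HW2 statistics for one side (user or system)."""
    turns_per_dialogue = [len(d) for d in dialogues]
    words_per_turn = [len(t) for d in dialogues for t in d]
    tokens = [w for d in dialogues for t in d for w in t]

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    # unigram entropy: H(w) = -sum_w p(w) * log2 p(w)
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in unigrams.values())
    # conditional entropy: H(w2|w1) = -sum p(w1,w2) * log2 p(w2|w1),
    # estimating p(w2|w1) as count(w1,w2) / count(w1)
    cond_entropy = (-sum((c / (total - 1)) * math.log2(c / unigrams[w1])
                         for (w1, _), c in bigrams.items())
                    if total > 1 else 0.0)

    return {
        "dialogues_total": len(dialogues),
        "turns_total": len(words_per_turn),
        "words_total": total,
        "mean_dialogue_turns": mean(turns_per_dialogue),
        # pstdev = population std. dev.; statistics.stdev is the sample version
        "stddev_dialogue_turns": pstdev(turns_per_dialogue),
        "mean_dialogue_words_per_turn": mean(words_per_turn),
        "stddev_dialogue_words_per_turn": pstdev(words_per_turn),
        "vocab_size": len(unigrams),
        "entropy": entropy,
        "cond_entropy": cond_entropy,
    }
```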
Commit this file as hw2/stats.py.
Along with your script, also submit a dump of the results. The results should be formatted as a JSON file with the following structure:
{
"task5_user":
{
"dialogues_total": XXX,
"turns_total": XXX,
"words_total": XXX,
"mean_dialogue_turns": XXX,
"stddev_dialogue_turns": XXX,
"mean_dialogue_words_per_turn": XXX,
"stddev_dialogue_words_per_turn": XXX,
"vocab_size": XXX,
"entropy": XXX,
"cond_entropy": XXX
},
"task5_system": ...,
"task6_user": ...,
"task6_system": ...
}
(Create a dict and use json.dump for this.)
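For instance (the dict contents here are dummy values, just to show the call):

```python
import json

def save_results(results, path):
    """Dump the nested stats dict as readable JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2)

# usage: save_results(your_full_stats_dict, "hw2/results.json")
```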
Commit the JSON file as hw2/results.json.
Add your own comments, comparing the results between the two bAbI Tasks. 3-5 sentences is enough, but try to explain why you think the vocabulary and entropy numbers are different.
Put your comments in Markdown as hw2/stats.md.
There are empty files ready for you in the repo in the right places; you just need to fill them with information.
Just to sum up, the files are:
hw2/stats.py -- your data analysis script
hw2/results.json -- JSON results of the analysis
hw2/stats.md -- your comments
Create a branch and a merge request containing (changes to) all requested files. Please keep the filenames and directory structure.
Don't worry too much about the exact numbers you get, slight variations in implementation may cause them to change. We won't penalize it if you don't get the exact same numbers as us, the main point is that your implementation should be reasonable (and you shouldn't be off by orders of magnitude).
Don't worry about the system's api_call turns, just treat them as normal system turns.
Presented: 13 March, Deadline: 31 March
In this assignment, you will design a dialogue system component for Natural Language Understanding in a specific domain. To complete it, you will use our prepared Dialmonkey dialogue framework (which is the base of your Gitlab repository), so you can test the outcome directly.
Language understanding means converting user utterances (such as “I'm looking for a Chinese restaurant in the city center”) into some formal representation used by the dialogue manager.
We'll use dialogue acts as our representation -- so the example sentence would convert to something like inform(food=Chinese,area=centre) or find_restaurant(food=Chinese,area="city center"), depending on how you define the intents, slots and values within the dialogue acts for your own domain.
Note: We're not thinking about how to reply just yet! The only thing we're concerned with is representing user inputs in our domain with reasonable intents, slots, and values.
Make yourself familiar with the Dialmonkey-npfl123 repository you cloned for the homeworks. Read the README and look around a bit to get familiar with the code. Have a look at the 101 Jupyter notebook to see some examples.
Recall the domains you picked in the first homework assignment and choose one of them. If you've changed your mind in the meantime, you can even pick a different domain.
Think of the set of dialogue acts suitable to describe this domain, i.e., list all intents, slots and values that will be needed (some slots may have open sets of values, e.g. “restaurant name”, “artist name”, “address” etc.). List them, give a description and examples in Markdown under hw3/README.md.
Create a component in the dialogue system (as Python code) that:
- derives from dialmonkey.component.Component,
- lives in a module named dialmonkey.nlu.rule_<your-domain>,
- converts user utterances in your domain into dialogue acts (DA objects).
Please only use the core Python libraries and those listed in requirements.txt. If you have a good reason to use a different library from PyPi, let us know and we can discuss adding it into the requirements (but this will be global for everyone).
Create a YAML config file for your domain in the conf directory. You can use the sample_conf.yaml file or nlu_test.yaml as a starting point. These files are almost identical; just have a look at the I/O setup if you're interested. Note that instead of policy and NLG components, a system with this config will simply reply with the NLU result.
Write at least 15 distinct utterances that demonstrate the functionality of your class (a tab-separated file with input + corresponding NLU result, one-per-line). Make sure your NLU gives you the same results.
The test utterances can (but don't have to) be taken over from the example dialogues you wrote earlier for your domain. Save them as hw3/examples.tsv.
There will be empty files ready for you in the right place; just rename them according to your domain (e.g. change <your_domain> to restaurant, bus etc.) and fill them with the required content.
hw3/README.md
dialmonkey/nlu/rule_<your_domain>.py
conf/nlu_<your_domain>.yaml
hw3/examples.tsv
Create a branch and a merge request containing (changes to) all requested files. Please keep the filenames and directory structure.
Use regular expressions or keyword matching to find the intents and slot values (based on the value, you'll know which slot it belongs to).
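A minimal sketch of this approach follows. The domain, slot names, and rules here are made up for illustration, and the function returns a plain tuple -- your actual component would build a Dialmonkey DA object instead.

```python
import re

# Illustrative keyword/regex rules for a hypothetical restaurant domain.
# The matched value tells you which slot it belongs to.
FOOD_RE = re.compile(r"\b(chinese|italian|indian)\b", re.IGNORECASE)
AREA_RE = re.compile(r"\b(centre|center|north|south)\b", re.IGNORECASE)

def parse_utterance(text):
    """Keyword-match slot values and derive an intent from what was found."""
    slots = {}
    m = FOOD_RE.search(text)
    if m:
        slots["food"] = m.group(1).lower()
    m = AREA_RE.search(text)
    if m:
        slots["area"] = m.group(1).lower()
    intent = "inform" if slots else "other"
    return intent, slots
```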
If you haven't ever used regular expressions, have a look at some tutorials:
Note that you might later need to improve your NLU to handle contextual requests, but you don't need to worry about this now. For instance, the system may ask What time do you want to leave? and the user replies just 7pm. From just 7pm (without the preceding question), you don't know if that's a departure or arrival time. Once you have your dialogue policy ready and know how the system questions look like (which will be the 6th homework), you'll be able to look at the last system question and disambiguate. For now, you can keep these queries ambiguous (e.g. just mark the slot as “time”).
Presented: 20 March, Deadline: 7 April
In this assignment, you will build and evaluate a statistical Natural Language Understanding component on the DSTC2 restaurant information data. To complete it, you will use the Dialmonkey framework in your code checkout, so you can test the outcome directly.
Locate the data in your Dialmonkey-NPFL123 repository in data/hw4/. This is what we'll work with for this assignment.
Implement a script that trains statistical models to predict DAs. We'll use classifiers here. By default, you shouldn't predict the whole DA with a single classifier; rather, you should classify the correct value for each intent-slot pair where applicable (e.g. inform(food) has multiple possible values) and classify a binary 0-1 for each intent-slot pair that can't have different values (e.g. request(price) or bye()).
Don't forget that for the multi-value slots (including 2-valued slots with Y/N), you'll need a “null” value too. Have a look at examples here to get a better idea.
You can use any kind of statistical classifier you like (e.g. logistic regression, SVM, neural network, including pretrained models or LLMs), with any library of your choice (e.g. Scikit-Learn, Tensorflow, Pytorch ).
Using binary classifiers for everything may be an option too (especially if you use neural networks with shared layers), but the number of outputs will be rather high, hence the recommendation for multi-class classifiers. Note that we can't do slot tagging here, as the individual words in the texts aren't tagged with slot values.
In case you want to use an LLM, we prefer using a (small!) model locally over prompting ChatGPT or the like. In this case, you don't have to stick to individual classifications and may want to have the model predict everything at once -- just make sure you get the whole DA out of the inputs you have available.
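One way to set up the per-classifier training targets is sketched below, with a made-up slot inventory; the (intent, slot, value) triples would come from parsing the DA annotations (e.g. via parse_cambridge_da()).

```python
def build_labels(da_items, multi_value_keys, binary_keys):
    """Turn DA triples into one training label per classifier.

    da_items: [(intent, slot, value), ...] for one utterance.
    multi_value_keys: intent-slot pairs that take values, e.g. ("inform", "food");
        their label is the value, or "null" if the pair isn't mentioned.
    binary_keys: intent-slot pairs without values, e.g. ("request", "price")
        or ("bye", None); their label is 1/0 for present/absent.
    """
    present = {(intent, slot): value for intent, slot, value in da_items}
    labels = {}
    for key in multi_value_keys:
        labels[key] = present.get(key, "null")
    for key in binary_keys:
        labels[key] = 1 if key in present else 0
    return labels
```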
Train this model on the training set. You can use the development set for parameter tuning.
Using dialmonkey.DA.parse_cambridge_da() should help you get the desired DA values out of the textual representation.
Do not look at the test set at this point!
In case you're using LLMs, this would mean prompt tuning -- try out at least 3 different variations of the prompt.
Evaluate your model on the test set and report the overall precision, recall and F1 over dialogue act items (triples of intent-slot-value).
Use the script provided in dialmonkey.evaluation.eval_nlu.
You can run it directly from the console like this:
./dialmonkey/evaluation/eval_nlu.py -r data/hw4/dstc2-nlu-test.json -p hw4/predicted.txt
The script expects a reference JSON in the same format as your data here, and a system output with one DA per line. You can have a look at conf/nlu_test.yaml to see how to get one-per-line DA output.
For the purposes of our evaluation script's F1 computation, non-null values count as positives and null values as negatives. Whether they're true or false depends on whether they're correctly predicted.
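Conceptually, the metric boils down to micro-averaged precision/recall/F1 over sets of DA item triples. The following is an illustration of that computation, not the provided eval_nlu script; null values are simply absent from the sets.

```python
def f1_over_da_items(references, predictions):
    """Micro-averaged P/R/F1 over (intent, slot, value) triples.

    `references` and `predictions` are parallel lists (one entry per
    utterance) of sets of triples.
    """
    tp = fp = fn = 0
    for ref, pred in zip(references, predictions):
        tp += len(ref & pred)   # correctly predicted items
        fp += len(pred - ref)   # predicted but not in the reference
        fn += len(ref - pred)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```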
Implement a module in Dialmonkey that will load your NLU model and work with inputs in the restaurant domain. Create a copy of the nlu_test.yaml config file to work with your new NLU.
Commit the following:
- dialmonkey/nlu/stat_dstc.py and your trained model. The filename for the model will depend on your implementation; it just needs to be loaded automatically for your module to work. Preferably store it in the same directory, but it can be anywhere else if needed. If the file is too big for Git (>10MB), share it using a cloud service (e.g. CESNET OwnCloud) and make your NLU module download it automatically in __init__().
- conf/nlu_dstc.yaml -- your NLU config file.
- hw4/train_nlu.py -- your training script.
- hw4/predicted.txt and a short evaluation report under hw4/README.md (including your F1 scores).
Note 1: There are templates to get you started quickly with the training script and the prediction Dialmonkey module. Feel free to change them in any way you like or need; they're not binding (just make sure your code does what's expected).
Note 2: Please do not use any Python libraries other than the ones in requirements.txt, plus the following ones: torch (Pytorch), tensorflow, pytorch-lightning, torchtext, transformers (Huggingface). Note that scikit-learn is included already. If you need any others, please let us know beforehand.
Note 3: Please make sure that your code doesn't take more than a minute to load + classify the first 20 entries in the test data. For the sake of model storage, it's better to choose a smaller one, especially if you choose to play with pretrained language models.
Note 4: And this one is for all further assignments -- do not use absolute file paths in your solution! You may want to try out os.path.dirname(__file__), which gets you the directory of the current source file. If you need to use slashes in a (relative!) path, either use os.path.join, or use forward slashes and a pathlib.Path object (simply create it using filename = Path(filename)). This will handle the slashes properly both on Windows and Linux. Note that we're mainly checking on Linux, so any backslashes in file paths will break our workflow and we may deduct points.
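For example (model.pkl is a placeholder name; in your module you would pass __file__ as the source file):

```python
import os
from pathlib import Path

def relative_to_source(source_file, name="model.pkl"):
    """Resolve `name` relative to the directory of `source_file`,
    with path separators handled portably by pathlib."""
    here = os.path.dirname(os.path.abspath(source_file))
    return Path(here) / name

# usage inside a module: model_path = relative_to_source(__file__)
```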
Start playing with the classifier separately, only integrate it into Dialmonkey after you've trained a model and can load it.
If you have never used a machine learning tool, have a look at the Scikit-Learn tutorial. It contains most of what you'll need to finish this exercise.
You'll need to convert your texts into something your classifier understands (i.e., some input numerical features). You can probably do very well with just “bag-of-words” as input features to the classifier -- that means that you'll have a binary indicator for each word from the training data (e.g. word “restaurant”). The feature for the word “restaurant” will be 1 if the word “restaurant” appears in the sentence, 0 if it doesn't. You can also try using the same type of features for bigrams. Have a look at the DictVectorizer class in Scikit-Learn. You may also want to consider CountVectorizer, which could speed up things even more.
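The feature extraction might look like this sketch (the feature-name prefixes are arbitrary); a list of such dicts is exactly what DictVectorizer consumes:

```python
def bow_features(tokens):
    """Binary unigram + bigram indicator features for one utterance.

    Each feature is 1 if the word (or word pair) occurs in the
    utterance; absent keys mean 0.
    """
    feats = {}
    for w in tokens:
        feats["w=" + w] = 1
    for a, b in zip(tokens, tokens[1:]):
        feats["bg=" + a + "_" + b] = 1
    return feats
```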
For Scikit-Learn, you can use pickle to store your trained models. If you want to pickle whole class definitions (not just their instances), use the dill library instead, since plain pickle can't do that.
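Saving and loading with pickle is just a few lines; the model object can be anything picklable, e.g. a fitted Scikit-Learn classifier together with its vectorizer:

```python
import pickle

def save_model(model, path):
    """Serialize a trained model (or any picklable object) to disk."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

def load_model(path):
    """Load a model previously stored with save_model()."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```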
To easily load JSON files, you can use the SimpleJSONInput class.
You'd better not use the naive Bayes classifier, as it doesn't work well on this data – basically anything else works better (you won't lose any points if you use naive Bayes, just don't expect good performance).
All homework assignments will be submitted using a Git repository on MFF GitLab.
We provide an easy recipe to set up your repository below:
git remote show origin
You should see these two lines:
* remote origin
Fetch URL: git@gitlab.mff.cuni.cz:teaching/NPFL123/2025/your_username.git
Push URL: git@gitlab.mff.cuni.cz:teaching/NPFL123/2025/your_username.git
Add the base repository as a new remote called upstream:
git remote add upstream https://gitlab.mff.cuni.cz/teaching/NPFL123/base.git
git checkout master
git checkout -b hwX
Solve the assignment :)
Add new files and commit your changes -- make sure to name your files as required, or you won't pass our automatic checks!
git add hwX/solution.py
git commit -am "commit message"
git push origin hwX
Create a Merge request in the web interface. Make sure you create the merge request into the master branch in your own forked repository (not into the upstream).
Merge requests -> New merge request
You might need to update from the upstream base repository every once in a while (most probably before you start implementing each assignment). We'll let you know when we make changes to the base repo.
To upgrade from upstream, do the following:
git checkout master
git fetch upstream master
git merge upstream/master
The exam will have 10 questions, mostly from this pool. Each counts for 10 points. We reserve the right to make slight alterations or use variants of the same questions. Note that all of them are covered by the lectures, and they cover most of the lecture content. In general, none of them requires you to memorize formulas, but you should know the main ideas and principles. See the Grading tab for details on grading.
To pass this course, you will need to take an exam and complete the lab homework assignments.
In case the pandemic does not get better by the exam period, there will be a remote alternative for the exam (an essay with a discussion).
The final grade for the course will be a combination of your exam score and your homework assignment score, weighted 3:1 (i.e. the exam accounts for 75% of the grade, the assignments for 25%).
Grading:
In any case, you need to reach the minimum from the exam (60%) and from the homeworks (50%) to pass. If you fall below either minimum, you will not pass, even if your overall combined score would otherwise be sufficient.
You should pass the course just by following the lectures, but here are some hints on further reading. There's nothing ideal on the topic as this is a very active research area, but some of these should give you a broader overview.
Basic (good and up-to-date, but very brief, available online):
More detailed (very good but slightly outdated, available as e-book from our library):
Further reading (mostly outdated but still informative):