This is an archived version of the 2020/2021 run of the course. See the current version here.

NPFL123 – Dialogue Systems


This course is a detailed introduction to the architecture of spoken dialogue systems, voice assistants and conversational systems (chatbots). We will introduce the main components of dialogue systems (speech recognition, language understanding, dialogue management, language generation and speech synthesis) and show alternative approaches to their implementation. The lab sessions will be dedicated to implementing a simple dialogue system and selected components (via weekly homework assignments).



The course will be taught in English, but we're happy to explain in Czech, too.

Time & Place

Both lectures and labs take place on Zoom.

  • Lectures: Tue 10:40 Zoom (starts on 2 March)

  • Labs: Tue 9:00 Zoom (starts on 9 March at 9:50 – no labs in the 1st week!)

    • The first part of the labs (9:00-9:45) will be for discussions
    • The second part of the labs (9:50-10:30) will be for introducing new homeworks
  • Zoom meeting ID: 981 1966 7454

  • Password is the SIS code of this course (capitalized)

There's also a Slack workspace you can use to discuss assignments and get news about the course. Please contact us by email if you want to join and haven't got an invite yet.

Passing the course

To pass this course, you will need to take an exam and do lab homeworks. See more details here.

Topics covered

Dialogue systems schema
  • Dialogue system types & formats (open/closed domain, task/chat-oriented)
  • What happens in a dialogue (linguistic background)
  • Dialogue system components
    • speech recognition
    • language understanding, dialogue state tracking
    • dialogue management
    • language generation
    • speech synthesis
  • Dialogue authoring tools (IBM Watson Assistant/Google Assistant/Amazon Alexa)
  • Voice assistants & question answering
  • Chatbots
  • Data for dialogue systems
  • Dialogue systems evaluation


PDFs with lecture slides will appear here shortly before each lecture (more details on each lecture are on a separate tab). You can also check out last year's lecture slides.

1. Introduction Slides Questions

2. What happens in a dialogue? Slides Domain selection Questions

3. Data & Evaluation Slides Dataset exploration Questions

4. Voice Assistants & Question Answering Slides Rule-based Natural Language Understanding Questions

5. Dialogue Tooling Slides Statistical Natural Language Understanding Questions

6. Natural Language Understanding Slides Belief State Tracking Questions

7. Neural NLU + State Tracking Slides Dialogue Policy Questions

8. Dialogue Policy (non-neural) Slides API/Backend Calls Questions

9. Neural policies & Natural Language Generation Slides Template NLG Questions

10. Speech Recognition Slides Service Integration Questions

11. Speech Synthesis Slides Digits ASR Questions

12. Chatbots Slides Grapheme-to-phoneme conversion Retrieval chatbot Questions


A list of recommended literature is on a separate tab.


1. Introduction

 2 March Slides Questions

  • What are dialogue systems
  • Common usage areas
  • Task-oriented vs. non-task oriented systems
  • Closed domain, multi-domain, open domain
  • System vs. user initiative in dialogue
  • Standard dialogue systems components

2. What happens in a dialogue?

 9 March Slides Domain selection Questions

  • Dialogue turns
  • Utterances as acts, pragmatics
  • Grounding, grounding signals
  • Deixis
  • Conversational maxims
  • Prediction and adaptation

3. Data & Evaluation

 16 March Slides Dataset exploration Questions

  • How to get data for building dialogue systems
  • Available corpora/datasets
  • Annotation
  • Data splits
  • Evaluation metrics -- subjective & objective, intrinsic & extrinsic
  • Significance checks

4. Voice Assistants & Question Answering

 23 March Slides Rule-based Natural Language Understanding Questions

  • What are voice assistants
  • Where, how and how much are they used
  • What are their features and limitations
  • What is question answering
  • Basic question answering techniques
  • Knowledge graphs
  • A few remarks on machine learning (will be useful later)

5. Dialogue Tooling

 30 March Slides Statistical Natural Language Understanding Questions

  • What are the standard tools for building dialogue systems on various platforms
  • IBM Watson, Google Dialogflow, Alexa Skills Kit
  • How to define intents, slots, and values
  • How to build your own basic dialogues

6. Natural Language Understanding

 6 April Slides Belief State Tracking Questions

  • What needs to be handled to understand the user
  • How to represent meaning: grammars, frames, graphs, dialogue acts (“shallow parsing”)
  • Rule-based NLU
  • Classification-based NLU (features, logistic regression, SVM)
  • Sequence tagging (HMM, MEMM, CRF)
  • Handling speech recognition noise

7. Neural NLU + State Tracking

 13 April Slides Dialogue Policy Questions

  • Some basics about neural networks
  • How to use neural networks for NLU: neural classifiers and sequence taggers
  • What is dialogue state, what is belief state, and what they're good for
  • Dialogue as a Markov decision process (MDP)
  • Dialogue trackers: generative and discriminative
  • Static and dynamic trackers

8. Dialogue Policy (non-neural)

 20 April Slides API/Backend Calls Questions

  • What's a dialogue policy -- how to choose the next action
  • Finite-state, frame-based, and rule-based policies
  • Reinforcement learning basics
  • Value and policy optimization (SARSA, Q-learning, REINFORCE)
  • Mapping to POMDPs (partially observable MDPs)
  • Summary space (making it tractable)
  • User simulation

9. Neural policies & Natural Language Generation

 27 April Slides Template NLG Questions

  • Deep reinforcement learning
  • Deep Q-Networks, Policy Networks
  • Natural language generation
  • Sentence planning & Surface realization
  • Templates
  • Rule-based approaches
  • Neural: seq2seq, RNNs, Transformers

10. Speech Recognition

 4 May Slides Service Integration Questions

  • Basics of how speech recognition works
  • Main pipeline: speech activity detection, preprocessing, acoustic model, decoder
  • Features -- MFCCs
  • Acoustic model with neural nets
  • Decoding -- language model
  • End-to-end speech recognition

11. Speech Synthesis

 11 May Slides Digits ASR Questions

  • Human articulation
  • Phones, phonemes, consonants, vowels
  • Spectrum, F0, formants
  • Stress and prosody
  • Standard TTS pipeline
  • Segmentation
  • Grapheme-to-phoneme conversion
  • Formant-based, concatenative, HMM parametric synthesis
  • Neural synthesis

12. Chatbots

 18 May Slides Grapheme-to-phoneme conversion Retrieval chatbot Questions

  • Non-task-oriented systems and their specifics
  • rule-based, retrieval, generative, hybrid approaches
  • Turing test, Alexa Prize

Homework Assignments

There will be 12 homework assignments, each for a maximum of 10 points. Please see details on grading and deadlines on a separate tab.

Assignments should be submitted via Git – see instructions on a separate tab.

All deadlines are 23:59:59 CET/CEST.


1. Domain selection

2. Dataset exploration

3. Rule-based Natural Language Understanding

4. Statistical Natural Language Understanding

5. Belief State Tracking

6. Dialogue Policy

7. API/Backend Calls

8. Template NLG

9. Service Integration

10. Digits ASR

11. Grapheme-to-phoneme conversion

12. Retrieval chatbot

1. Domain selection

 Presented: 9 March, Deadline: 23 March

You will be building a dialogue system in (at least some of) the homeworks for this course. Your first task is to choose a domain and imagine how your system will look and work. Since you might later find that you don't like a domain, you are required to pick two for now, so that you have more ideas later and can choose just one of them for building the system.


The required steps for this homework are:

  1. Pick two domains of your liking that are suitable for building a dialogue system. Think of a reasonable backend (see below).

  2. Write 5 example system-user dialogues for each of the two domains, each at least 5+5 turns long (5 utterances from the user and 5 from the system). This will make sure that your domain is interesting enough. You do not necessarily have to use English here (but it's easier if we understand the language you're using -- ask us if unsure).

  3. Create a flowchart for your two domains, with labels such as “ask about phone number”, “reply with phone number”, “something else” etc. It should cover all of your example dialogues. Feel free to draw this by hand and take a photo, as long as it's legible.

    • It's OK if your example dialogues don't go all in a straight line (e.g. some of them might loop or go back to the start).
  4. In your repository, create a directory called hw1/ and save both the example dialogues and the flowchart into this directory. Create a branch and a merge request with your changes.


You may choose any domain you like, be it tourist information, information about culture events/traffic, news, scheduling/agenda, task completion etc. You can take inspiration from stuff presented in the first lecture, or you may choose your own topic.

Since your domain will likely need to be connected to some backend database, you might want to make use of some external public APIs -- feel free to choose one of these links:

You can of course choose anything else you like as your backend, e.g. portions of Wikidata/DBPedia or other world knowledge DBs, or even a handwritten “toy” database of a meaningful size, which you'll need to write to be able to test your system.

2. Dataset exploration

 Presented: 16 March, Deadline: 30 March

The task in this lab is to explore dialogue datasets and find out more about them. Your job will thus be to write a script that computes some basic statistics about datasets, and then try to interpret the script's results.


  1. Download the Dialog bAbI Tasks Data 1-6 dataset. Read the description of the data format on the website. You'll be working with Tasks 5 and 6 (containing full generated dialogues and DSTC2 data). Use the training sets for Task 5 and Task 6.

  2. Write a script that will read all turns in the data and separate the user and system utterances in the training set.

    • Make the script ignore any search results lines in the data (they don't contain a tab character).
    • If the script finds a turn where the user is silent (the user turn contains only <SILENCE>), it should concatenate the system response from this turn to the previous turn. Note that this may happen on multiple consecutive turns, and the script should join all of these together into one system response.
      • If <SILENCE> is the first word in the dialogue, just delete it.
    • Don't worry too much about tokenization (word segmentation) -- tokenizing on whitespace is OK.
  3. Implement a routine that will compute the following statistics for both bAbI tasks for system and user turns (separately, i.e., 4 sets of statistics altogether):

    • data length (total number of dialogues, turns, words)
    • mean and standard deviations for individual dialogue lengths (number of turns in a dialogue, number of words in a turn)
    • vocabulary size
    • Shannon entropy and bigram conditional entropy, i.e. entropy conditioned on 1 preceding word (see lecture 2 slides)

    Commit this file as hw2/ Alternatively, you can use a Jupyter Notebook for the implementation and commit the notebook itself as hw2/stats.ipynb.

  4. Along with your script, also submit a printout of the results and your own comments comparing the results between the two bAbI Tasks. 3-5 sentences are enough, but try to explain why you think the vocabulary and entropy numbers differ.

    Commit the printout and your comments in Markdown as hw2/ If you've used Jupyter to compute the statistics, you can store the results and the commentary directly in your notebook (hw2/stats.ipynb) and the separate commentary file is not needed.

Create a branch and a merge request containing both files.


Don't worry about system api_calls, just treat them as a normal system turn.
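The entropy statistics from step 3 can be sketched as follows (whitespace tokenization, as suggested above; the function name is made up):

```python
import math
from collections import Counter

def entropies(turns):
    """Shannon entropy and bigram conditional entropy over whitespace-tokenized turns."""
    unigrams, bigrams, firsts = Counter(), Counter(), Counter()
    for turn in turns:
        words = turn.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        firsts.update(words[:-1])  # counts of words in bigram-initial position
    total = sum(unigrams.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in unigrams.values())
    # H(W2 | W1) = - sum over bigrams of p(w1, w2) * log2 p(w2 | w1)
    total_bi = sum(bigrams.values())
    cond_entropy = -sum((c / total_bi) * math.log2(c / firsts[w1])
                        for (w1, w2), c in bigrams.items())
    return entropy, cond_entropy
```

The Shannon entropy is computed over the unigram distribution; the conditional entropy averages -log2 p(w2|w1) over all bigrams in the data.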

3. Rule-based Natural Language Understanding

 Presented: 23 March, Deadline: 13 April

In this assignment, you will design a dialogue system component for Natural Language Understanding in a specific domain. For completing it, you will use our prepared Dialmonkey dialogue framework so you can test the outcome directly.


  1. Make yourself familiar with the Dialmonkey repository you cloned for the homeworks. Read the README and look around a bit to get familiar with the code. Have a look at the 101 Jupyter notebook to see some examples.

  2. Recall the domains you picked in the first homework assignment and choose one of them.

  3. Think of the set of dialogue acts suitable to describe this domain, i.e., list all intents, slots and values that will be needed (some slots may have open sets of values, e.g. “restaurant name”, “artist name”, “address” etc.). List them, give a description and examples in Markdown under hw3/

  4. Create a component in the dialogue system (as Python code) that:

    • inherits from dialmonkey.component.Component
    • is placed under dialmonkey.nlu.your_domain_nlu
    • implements a rule-based NLU for your domain -- i.e., given user utterance, finds its intent, slots and values
    • yields Dialogue Act items you designed in step 3 (as DA objects).
  5. Create a YAML config file for your domain in the conf directory. You can use the nlu_test.yaml file as a starting point (since you don't have any reasonable dialogue policy at this point).

  6. Write at least 15 utterances that demonstrate the functionality of your class (a tab-separated file with input + corresponding desired NLU result, one-per-line). The test utterances can (but don't have to) be taken over from the example dialogues you wrote earlier for your domain. Save them as hw3/examples.tsv.

Files to include in your merge request

  • Lists of intents, slots and values in hw3/
  • Your NLU component in dialmonkey/nlu/<your_domain>.py
  • Your configuration file in conf/<
  • Example test utterances & outputs in hw3/examples.tsv


Use regular expressions or keyword matching to find the intents and slot values (based on the value, you'll know which slot it belongs to).
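As a tiny illustration (the domain and rules below are made up -- your actual component should inherit from dialmonkey.component.Component and yield DA objects as described in step 4):

```python
import re

# Hypothetical rules for a restaurant-like domain: pattern, intent, slot.
# The matched value itself tells us which slot it belongs to.
RULES = [
    (r'\b(chinese|italian|indian)\b', 'inform', 'food'),
    (r'\b(cheap|moderate|expensive)\b', 'inform', 'price'),
    (r'\bphone( number)?\b', 'request', 'phone'),
]

def parse_utterance(utterance):
    """Return (intent, slot, value) triples found in the utterance."""
    utterance = utterance.lower()
    triples = []
    for pattern, intent, slot in RULES:
        match = re.search(pattern, utterance)
        if match:
            # for request() acts there is no value to extract
            value = match.group(1) if intent == 'inform' else None
            triples.append((intent, slot, value))
    return triples
```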

If you have never used regular expressions, have a look at some tutorials:

Note that you might later need to improve your NLU to handle contextual requests, but you don't need to worry about this now. For instance, the system may ask What time do you want to leave? and the user replies just 7pm. From just 7pm (without the preceding question), you don't know if that's a departure or arrival time. Once you have your dialogue policy ready and know how the system questions look like (which will be the 6th homework), you'll be able to look at the last system question and disambiguate. For now, you can keep these queries ambiguous (e.g. just mark the slot as “time”).

4. Statistical Natural Language Understanding

 Presented: 30 March, Deadline: 13 April

In this assignment, you will build and evaluate a statistical Natural Language Understanding component on the DSTC2 data. For completing it, you will use the Dialmonkey framework in your code checkout so you can test the outcome directly.


  1. Download the training, development and test data from here.

  2. Implement a script that trains a statistical model to predict DAs. It shouldn't predict the whole DA with a single classifier; rather, it should classify the correct value for each intent-slot pair where applicable (e.g. inform(food) has multiple possible values) and classify a binary 0-1 for each intent-slot pair that can't have different values (e.g. request(price) or bye()).

    Don't forget that for the multi-value slots, you'll need a “null” value too.

    You can use any kind of statistical classifier you like (e.g. logistic regression, SVM, neural network), with any library of your choice (e.g. Scikit-Learn, TensorFlow, PyTorch).

    Note that we're not doing slot tagging since the words in the texts aren't tagged with slot values.

  3. Train this model on the training set you downloaded. You can use the development set for parameter tuning. Using dialmonkey.DA.parse_cambridge_da() should help you get the desired DA values out of the textual representation. Do not look at the test set at this point!

  4. Evaluate your model on the test set and report the overall precision, recall and F1 over dialogue act items (triples of intent-slot-value).

    Use the script provided in dialmonkey.evaluation.eval_nlu. You can run it directly from the console like this:

    ./dialmonkey/evaluation/ -r dstc2-nlu-test.json -p predicted.txt

    The script expects reference JSON in the same format as your data here, and a system output with one DA per line. You can have a look at conf/nlu_test.yaml to see how to get one-per-line DA output. Note that you can override the input/output stream types and input/output file names directly from the console (see ./ -h).

    For the purposes of the F1 computation in our evaluation script, non-null values count as positives and null values count as negatives. Whether they're true or false depends on whether they're correctly predicted.

  5. Implement a module in Dialmonkey that will load your NLU model and work with inputs in the restaurant domain. Create a copy of the nlu_test.yaml config file to work with your new NLU.
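To give an idea of what the metric in step 4 measures, here's a simplified sketch that compares plain intent-slot-value triples (it ignores the null-value bookkeeping described above -- use the provided eval_nlu script for the actual reporting):

```python
def nlu_prf(references, predictions):
    """Corpus-level precision/recall/F1 over dialogue act items.
    references, predictions: lists (one entry per utterance) of sets of
    (intent, slot, value) triples."""
    tp = fp = fn = 0
    for ref, pred in zip(references, predictions):
        tp += len(ref & pred)   # correctly predicted items
        fp += len(pred - ref)   # predicted but not in the reference
        fn += len(ref - pred)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```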

Files to include in your merge request

  • Your statistical NLU module under dialmonkey/nlu/ and your YAML config file under conf/.
  • Your training script and evaluation report under a new directory called hw4/.


  • Start playing with the classifier separately, only integrate it into Dialmonkey after you've trained a model and can load it.

  • If you have never used a machine learning tool, have a look at the Scikit-Learn tutorial. It contains most of what you'll need to finish this exercise.

  • You'll need to convert your texts into something your classifier understands (i.e., some input numerical features). You can probably do very well with just “bag-of-words” as input features to the classifier -- that means that you'll have a binary indicator for each word from the training data (e.g. word “restaurant”). The feature for the word “restaurant” will be 1 if the word “restaurant” appears in the sentence, 0 if it doesn't. You can also try using the same type of features for bigrams. Have a look at the DictVectorizer class in Scikit-Learn.

  • For Scikit-Learn, you can use pickle to store your trained models.
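Putting the hints together, a minimal sketch of one such binary classifier might look like this (the toy utterances and feature names are made up):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def bow_features(utterance):
    """Binary bag-of-words and bag-of-bigrams indicators for one utterance."""
    words = utterance.lower().split()
    feats = {f'w_{w}': 1 for w in words}
    feats.update({f'b_{w1}_{w2}': 1 for w1, w2 in zip(words, words[1:])})
    return feats

# One binary 0-1 classifier for an intent-slot pair without values, e.g. request(price):
train_utts = ['how much does it cost', 'what is the price', 'goodbye', 'thank you']
train_labels = [1, 1, 0, 0]
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(bow_features(u) for u in train_utts)
classifier = LogisticRegression().fit(X, train_labels)
```

At prediction time, run vectorizer.transform() (not fit_transform()) on the test features; each multi-value slot gets a multi-class classifier of the same shape, with "null" as an extra class.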

5. Belief State Tracking

 Presented: 6 April, Deadline: 20 April

This week, you will build a simple probabilistic dialogue/belief state tracker to work with NLU for both your own domain of choice (3rd homework) and DSTC2 (4th homework).


  1. Implement a dialogue state tracker that works with the dial['state'] structure and fills it with a probability distribution of values over each slot (assume slots are independent), updated during the dialogue. Don't forget None is a valid value, meaning “we don't know/user didn't say anything about this”.

    At the beginning of the dialogue, each slot should be initialized with the distribution {None: 1.0}.

    The update rule for a slot, say food, should go like this:

    • Take all mentions of food in the current NLU, with their probabilities. Say you got Chinese with a probability of 0.7 and Italian with a probability of 0.2. This means None has a probability of 0.1.
    • Use the probability of None to multiply current values with it (e.g. if the distribution was {'Chinese': 0.2, None: 0.8}, it should be changed to {'Chinese': 0.02, None: 0.08}).
    • Now add the non-null values with their respective probabilities from the NLU. This should result in {'Chinese': 0.72, 'Italian': 0.2, None: 0.08}.
  2. Add the tracker into configuration files both for your own rule-based domain and for DSTC2. In addition, replace dialmonkey.policy.dummy.ReplyWithNLU with dialmonkey.policy.dummy.ReplyWithState.

  3. Run your NLU + tracker over your NLU examples from 3rd homework and the first 20 lines from the DSTC2 development data (file dstc2-nlu-dev.json) from the 4th homework data package and save the outputs to a text file.

Example for the update rule

You can have a look at an example dialogue with commentary for the update rule in a separate file (it's basically the same stuff as above, just more detailed).
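In code, the whole update rule boils down to a few lines (a sketch; slot distributions are plain dicts as above):

```python
def update_slot(state_dist, nlu_hyps):
    """Belief update for a single slot.
    state_dist: current distribution, e.g. {'Chinese': 0.2, None: 0.8}
    nlu_hyps:   non-null NLU hypotheses for the slot, e.g. {'Chinese': 0.7, 'Italian': 0.2}
    """
    p_none = 1.0 - sum(nlu_hyps.values())   # probability mass for "not mentioned"
    # scale down all current values by the None probability ...
    new_dist = {value: prob * p_none for value, prob in state_dist.items()}
    # ... then add the newly observed values on top
    for value, prob in nlu_hyps.items():
        new_dist[value] = new_dist.get(value, 0.0) + prob
    return new_dist
```

With the example above, update_slot({'Chinese': 0.2, None: 0.8}, {'Chinese': 0.7, 'Italian': 0.2}) yields {'Chinese': 0.72, None: 0.08, 'Italian': 0.2}.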

Files to include in your merge request

  • Your tracker code under dialmonkey/dst/
  • Your updated configuration files for both domains under conf/dst_<my-domain>.yaml and conf/dst_dstc2.yaml.
  • Your text file with the outputs into hw5/outputs.txt.

6. Dialogue Policy

 Presented: 13 April, Deadline: 4 May

This week, you will build a rule-based policy for your domain.


  1. Implement a rule-based policy that uses the current NLU intent (or intents) and the dialogue state probability distribution to produce system action dialogue acts. You can use just the most probable value from each state, assuming its probability is higher than a certain threshold (e.g. 0.7).

    The policy should:

    • Check the current intent(s) and split the action according to that
    • Given the intent, check that the state contains all necessary slots to respond:
      • if it does, fill in a response system DA into the dial.action field.
      • if it doesn't, fill in a system DA requesting more information into the dial.action field.

    Use the flowcharts you built in HW1 to guide the policy's decisions.

    For now, skip any API queries and hardcode the responses (you will build the actual backend a week later).

  2. Save the policy under dialmonkey.policy.<your_domain> and add it into the configuration file for your own domain.

  3. In your configuration file from the last homework, replace dialmonkey.policy.dummy.ReplyWithState with dialmonkey.policy.dummy.ReplyWithSystemAction (don't worry that you have two things from the policy package in your pipeline).

  4. Check that your policy returns reasonable system actions for each of your NLU test utterances, if you treat that utterance as a start of a dialogue (it's OK if it's not exactly what is meant, the utterances will be taken out of context).

    Run your policy over your NLU test utterances, each taken as a start of a dialogue, and save the outputs to a text file.

Files to include in your repository

  • Commit your policy (dialmonkey/policy/<your_domain>.py) and your updated configuration file (conf/<your_domain>.yaml).
  • Commit the outputs of your policy on your NLU test utterances as hw6/outputs.txt.


  • Your policy will most probably be a long bunch of if-then-else statements. Don't worry about it. Still, you may want to structure the file a bit so it doesn't end up as one huge function -- e.g., move the handling of each intent into a separate function if it's more than a line or two.

  • You may want to complete this homework together with the next one, which will be about backend integration. That's why the deadline is in 3 weeks, not 2.
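A skeleton of such a policy might look like this (the domain, intent names and the lookup helper are hypothetical; the real component would write the resulting DA into dial.action, and the lookup stays hardcoded until the next homework):

```python
THRESHOLD = 0.7  # confidence threshold suggested in the assignment

def top_value(slot_dist):
    """Most probable non-None value of a slot, if confident enough, else None."""
    value, prob = max(slot_dist.items(), key=lambda kv: kv[1])
    return value if value is not None and prob >= THRESHOLD else None

def lookup_phone(name):
    """Placeholder backend -- hardcoded for now, replaced by a real API call later."""
    return '123-456-789'

def choose_action(intent, state):
    """Pick the next system dialogue act for a hypothetical restaurant domain."""
    if intent == 'request_phone':
        name = top_value(state.get('name', {None: 1.0}))
        if name is not None:
            return ('inform', 'phone', lookup_phone(name))
        return ('request', 'name', None)   # missing slot -> ask the user for it
    return ('hello', None, None)           # fallback for unhandled intents
```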

7. API/Backend Calls

 Presented: 20 April, Deadline: 4 May

This week is basically a continuation of the last one -- filling in the blank API call placeholders you created last time. If you want to, you can complete the 6th and 7th homework at the same time (the deadline is the same).


  1. Implement all API/backend queries you need for the policy you implemented in the last homework. The implementation can be directly inside dialmonkey.policy.<your_domain>, or you can create a sub-package (a subdirectory, where you put the main policy inside and any auxiliary stuff into other files in the same directory).

  2. Test your policy with outputs on at least 3 of the test dialogues you created in the 1st homework. You can of course make slight alterations if your policy doesn't behave exactly as you imagined the first time.

    Also, don't worry that the output is just dialogue acts at the moment.

Files to include in your repository

  • Commit your policy (dialmonkey/policy/<your_domain>.py), updated with API calls.
  • Logs of your test dialogues as hw7/outputs.txt (with user inputs and system output acts).


  • If you haven't accessed an external API using Python, check out the requests library. It makes it easy to call external APIs using JSON. You can get the result with just a few lines of code.
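For illustration, a query with requests takes only a couple of lines (the endpoint and parameters here are made up):

```python
import requests

def build_request(food, price):
    """Prepare (but don't send) a GET query for a hypothetical restaurant API."""
    req = requests.Request('GET', 'https://api.example.com/restaurants',
                           params={'food': food, 'price': price})
    return req.prepare()

# Actually sending it and parsing the JSON reply would then be:
#   results = requests.Session().send(build_request('chinese', 'cheap'), timeout=5).json()
```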

8. Template NLG

 Presented: 27 April, Deadline: 11 May

In this homework, you will complete the chatbot for your domain by creating a template-based NLG component.


  1. Implement a template-based NLG with the following features:

    • The NLG system is (mostly) generic, templates for your domain are listed in a JSON or YAML file, showing a DA -> template mapping.

    • The NLG system is able to prioritize a mapping for a specific value -- e.g. inform(price=cheap) -> “You'll save money here.” should get priority over inform(price={price}) -> “This place is {price}.”

    • The NLG system is able to put together partial templates (by concatenating), so you can get a result for e.g. inform(price=cheap,rating=low) even if you only have templates defined for inform(price={price}) and inform(rating={rating}), not the specific slot combination. This doesn't need to search for best coverage, just take anything that fits, such as templates for single slots if you don't find the correct combination.

    • The system is able to produce multiple variations for certain outputs, e.g. bye() -> Goodbye. or Thanks, bye!

  2. Create templates that cover your domain well.

  3. Save your NLG system under dialmonkey.nlg.<your-domain> and add it into your conf/<your-domain>.yaml configuration file.

  4. Test your NLG system with the test dialogues you used in the previous homework.
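The prioritization and concatenation logic from step 1 can be sketched like this (toy templates; a real component would parse DA objects rather than string keys):

```python
import random

# DA pattern -> list of template variants (toy examples)
TEMPLATES = {
    'inform(price=cheap)': ["You'll save money here."],       # value-specific, takes priority
    'inform(price={price})': ['This place is {price}.'],
    'inform(rating={rating})': ['It has a {rating} rating.'],
    'bye()': ['Goodbye.', 'Thanks, bye!'],
}

def realize(intent, slots):
    """Realize one DA given as an intent and a {slot: value} dict."""
    if not slots:
        return random.choice(TEMPLATES[f'{intent}()'])
    parts = []
    for slot, value in slots.items():
        specific = f'{intent}({slot}={value})'
        generic = f'{intent}({slot}={{{slot}}})'
        if specific in TEMPLATES:                 # value-specific template wins
            parts.append(random.choice(TEMPLATES[specific]))
        elif generic in TEMPLATES:                # fall back to the slot-level template
            parts.append(random.choice(TEMPLATES[generic]).format(**{slot: value}))
    return ' '.join(parts)                        # concatenate partial templates
```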

Files to include in your merge request

  • Your NLG implementation in dialmonkey/nlg/<your-domain>.py
  • Your templates file in dialmonkey/nlg/<your-domain>.<yaml|json>
  • Your updated configuration file, which includes the NLG system, under conf/<your-domain>.yaml
  • Logs of the test dialogues, now with NLG output, in hw8/outputs.txt

9. Service Integration

 Presented: 4 May, Deadline: 18 May

In this homework, you will integrate the chatbot for your domain into an online assistant of your choice.


  1. Choose a service that you want to use for this homework. We prepared some instructions for Google Dialogflow, Alexa Skills, Facebook Messenger, and Telegram, but you're free to use IBM Watson Assistant as shown in the 5th lecture, or any other platform of your liking. Note that Dialogflow and Alexa are unfortunately not available for Czech.

  2. Implement the frontend on your selected platform. You can either carry over intents, slots & values from your NLU directly into Dialogflow/Alexa/Watson, or you can work with free text and run the NLU in the backend. For Messenger and Telegram, that's the only option (but their frontend basically comes for free).

  3. Implement a backend that will connect to your frontend – handle its calls and route them to your dialogue system in Dialmonkey (either with NLU already processed, or with free text as input). You can use the get_response method in dialmonkey.conversation_handler.ConversationHandler to control the system outside of the console. Don't forget to save context in between requests. However, you can assume that the bot will always have just one dialogue, i.e. you do not have to care about parallel conversations.

  4. Link your frontend to your backend (see Hints below).

Detailed instructions

Amazon Alexa
  • Alexa allows you to run NLU directly in your backend, but it's a bit tricky -- the only way to get free text is to use the SearchQuery built-in slot. You can set up an intent where the only part of the utterance is this slot.

  • For implementing the backend, you can use the Flask-Ask package as a base. You can have a look at an Alexa Skill Ondrej made for inspiration (not many docs, though, sorry).

  • Set your backend address under “Endpoint” (left menu).

Google Dialogflow
  • To use your own NLU, you can rely on the Default fallback intent, which gets triggered whenever no other intent is recognized.
  • For backend implementation, you can use Flask-Assistant.
  • Set your backend address under “Webhooks” for the individual intents (under the “Fulfillment” menu of each intent). If you want to get free text of the requests, have a look at this snippet.
Facebook Messenger

For Messenger, you need to perform several steps; on the other hand, it allows you to work with textual inputs directly. In general, you can follow the tutorial. Here are the important steps you need to complete:

  • Implement a webserver of your liking. You can use Flask. It might be a good idea to start from the example implementation (feel free to reuse it).
    See the tutorial here for details.
  • You need to implement a GET method handler to verify the token and a POST method handler to receive messages and send replies. You can also use pymessenger.
  • Deploy your app.
  • Visit the dev page. Create an account, add a Messenger app and create a sample page (see the lecture video).
  • Link the page to your app and obtain a token for use in your webserver (ACCESS_TOKEN).
  • Add a callback URL pointing to your webserver and the verification (arbitrary) token that you created.

Telegram also allows direct text input. There's a handy Python-Telegram-Bot library that has a webserver built in.

  • You need to get a telegram bot API token from the BotFather -- it's pretty straightforward, you talk to this “one bot to rule them all”, ask it for /newbot and it'll guide you through the process. You can find more info on their documentation page.
  • Using your API token, you can then create the Telegram bot with Python-Telegram-Bot. These examples should be a good starting point. In general, you'll need to implement a command handler that passes messages from Telegram on to your system. No web server is necessary -- Python-Telegram-Bot has one built in.
  • Start up your bot and you can talk to it on Telegram.

Files to include in your merge request

  • Commit your frontend export into hw9/:
    • In Alexa, go to “JSON Editor” in the left menu and copy out the contents into a file intent_schema.json.
    • In Dialogflow, go to your agent settings (cogged wheel next your agent/skill/app name on the top left), then select the “Export & Import” tab and choose “Export as ZIP”. Please commit the resulting subdirectory structure, not the ZIP file.
    • Nothing is required for Messenger at this point.
  • Commit your bot server code into hw9/ This code will probably import a lot from Dialmonkey and require it to be installed -- that's expected.
  • Add a short README telling us how to run your bot. You don't need to commit any API tokens (we can get our own for testing), but let us know in the readme where to add the token.


  • Use a webserver of your choice -- Flask is just a suggestion, but it's easy to use.
  • Heroku is a nice and free service that allows you to deploy your apps. See the tutorial.
  • Alternatively, you can use ngrok for testing purposes.

10. Digits ASR

 Presented: 11 May, Deadline: 25 May

This time, your task will be to train and evaluate an ASR system on a small dataset. We will be using Kaldi for these instructions since this is an advanced neural toolkit fairly similar in setup to what IBM is using (i.e. separate acoustic & language models).

If you like, you can try the same thing out with the end-to-end neural ESPnet toolkit instead, but you'd probably need a GPU to train it fast enough.

Note that all of this will most probably work on Linux and similar systems only. On Windows, you can use the Windows Subsystem for Linux to run it. You can also use the computers in the Malá Strana lab over SSH (u-pl[0-21]). Also note that even though we don't need a GPU, installing Kaldi takes a lot of time.


  1. Clone and install the Kaldi ASR toolkit

  2. Clone and install the KenLM language modelling toolkit

  3. Clone the Easy-Kaldi repository

  4. Clone Ondrej's fork of the Free Spoken Digits Dataset (FSDD) and adjust sampling rate in the configuration

  5. Prepare data for Easy-Kaldi based on the FSDD, with a per-speaker split

  6. Train the GMM-HMM Kaldi model, which provides the phoneme-level alignment

  7. Train the neural Kaldi model

  8. Repeat with a random split

Detailed Instructions

  1. Download and install Kaldi.

    • Clone the Kaldi git repo
    • Since the current master version is broken, use a slightly earlier commit -- just run git checkout 7f57eaa08093ba148e3d2abdf6c337212c130214.
    • Go to tools/ and follow the INSTALL guide to install all required tools.
    • Go to src/ and follow INSTALL again. If you don't (want to) have Intel MKL installed, you can use ./configure --mathlib=ATLAS to use ATLAS as the math backend (just install the libatlas-base-dev package on Ubuntu for that).
      • If you work in the computer lab where ATLAS is not installed, you need to go back to tools/ and run ./extras/, then use --mathlib=OPENBLAS. This is an option in general if you don't want to or can't install ATLAS.
      • Unless you have a GPU and CUDA installed, use --use-cuda=no.
  2. Install the KenLM language modelling toolkit. Clone the repo inside the tools/ Kaldi directory, then follow the build instructions.

    • Note that you need Boost and some other libraries installed, as shown in the build instructions.
    • Everything needed for KenLM is available in the computer lab.
  3. Get the Easy-Kaldi repository -- this code will make training Kaldi much easier for us. Just clone it into the egs/ subdirectory of Kaldi.

  4. We will be working with (Ondrej's fork of) the Free Spoken Digits Dataset. Just clone the repo inside egs/easy-kaldi/easy-kaldi.

    • Since Easy-Kaldi is set up to work with 16 kHz audio data and this dataset is recorded with 8 kHz sampling, you'll need to set a few configuration options. Change --sample-frequency=16000 to --sample-frequency=8000 in the following files: config/mfcc.conf, config/mfcc_hires.conf, config/pitch.conf, config/plp.conf.
  5. Now create an input_fsdd directory and prepare all the needed input files for Easy-Kaldi (see the EasyKaldi README):

    • lexicon.txt and lexicon_nosil.txt are basically pronunciation dictionaries for all words. Since your words are only English digits 0-9, you need to know the pronunciation for them. Look them up in the CMU Pronouncing Dictionary and put them into the lexicon file, one per line. In addition to that, add <unk> (unknown phoneme) and !SIL (silence). Their "pronunciation" can be anything that doesn't match the other phonemes. The final lexicon.txt file should look something like this:

      !SIL sil
      <unk> spn
      one W AH N
      two T UW

      The lexicon_nosil.txt file is exactly the same -- just remove the !SIL line. Note that you can't have empty lines in either of these files.

    • task.arpabo is your language model. Usually, you would need a text file with "all transcriptions" to build it, so in this case, a text file with one digit per line (like the one provided in Ondrej's fork of the dataset) will be fine. We'll use KenLM to build the language model:

      ../../../../tools/kenlm/build/bin/lmplz -o 3 --discount_fallback < ../free-spoken-digit-dataset/digits.txt > task.arpabo

      Adjust the path to KenLM if you put it somewhere else than tools/ or if you're running this from somewhere else than the input_fsdd directory itself. Notice that we're building a trigram model even though our "sentences" always have one word -- have a look inside the task.arpabo file to see why :-).

    • The files train_audio_path and test_audio_path in your input_fsdd directory should contain just one line each -- absolute paths to training and test audio files. We'll use free-spoken-digit-dataset/per_speaker_split/train and free-spoken-digit-dataset/per_speaker_split/test, respectively.

    • To go with the audio files, you need transcripts. They're prepared in free-spoken-digit-dataset/per_speaker_split/transcripts.{train,test}. Just copy them over.

  6. Now that your files are done, you need to first align the data on the phoneme level. This is actually done by training a GMM-HMM ASR model in Kaldi. Run this:

    ./ fsdd 001

    The fsdd corresponds to the name of your input_fsdd directory, the 001 parameter can be anything (indicates experiment number). After you run this, you can read the GMM-HMM model's WER in WER_triphones_fsdd_001.txt.

  7. Now you can finally train and evaluate the neural ASR model -- it's using the HMM phoneme alignments. Run this:

    ./ fsdd 300 2

    You can play around with the parameters (300 is the network hidden state dimension, 2 is the number of epochs). Here we just show some values that seemed to give reasonable performance for this data. After you run it, you can check the WER in WER_nnet3_easy.txt.

    Have a look at the recognized outputs for each file, they're hidden in exp_fsdd/nnet3/easy/decode/log/decode.*.log -- look for lines starting with filenames, they contain the decoded text.

    Keep your file with WER since it'll get overwritten in the next step :-).

  8. Now repeat the experiment, but with a random file split instead of a split by speaker. Make a copy of the input_fsdd directory (cp -r input_fsdd input_fsdd2) and change stuff around:

    • You can keep the lexicons and the language model.

    • Update the train_audio_path, test_audio_path, transcripts.train and transcripts.test files to use the free-spoken-digit-dataset/random_split (see step 5).

    • Now rerun both steps of training and evaluation.
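As a side note, the lexicon files from step 5 can be generated with a short script. This is only a sketch: the digit pronunciations below are hand-copied from the CMU Pronouncing Dictionary, so double-check them yourself before use.

```python
import re

# Digit pronunciations hand-copied from the CMU Pronouncing Dictionary --
# verify them against the dictionary before using this for real.
PRONS = {
    "zero": "Z IH1 R OW0", "one": "W AH1 N", "two": "T UW1",
    "three": "TH R IY1", "four": "F AO1 R", "five": "F AY1 V",
    "six": "S IH1 K S", "seven": "S EH1 V AH0 N",
    "eight": "EY1 T", "nine": "N AY1 N",
}

def strip_stress(pron):
    """The lexicon doesn't use CMUdict's stress digits, so drop them."""
    return re.sub(r"\d", "", pron)

def make_lexicon(prons, with_sil=True):
    """Build the contents of lexicon.txt (or lexicon_nosil.txt)."""
    lines = ["!SIL sil"] if with_sil else []
    lines.append("<unk> spn")
    lines += [f"{w} {strip_stress(p)}" for w, p in sorted(prons.items())]
    return "\n".join(lines) + "\n"

with open("lexicon.txt", "w") as f:
    f.write(make_lexicon(PRONS))
with open("lexicon_nosil.txt", "w") as f:
    f.write(make_lexicon(PRONS, with_sil=False))
```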

Files to include in your merge request

  • Your lexicons and language model in hw10/{lexicon.txt,lexicon_nosil.txt,task.arpabo}.

  • A Markdown file hw10/, with:

    • Your WER for four settings:
      • Per-speaker split, GMM-HMM
      • Per-speaker split, neural
      • Random split, GMM-HMM
      • Random split, neural
    • A short report -- please try to explain:
      • why the WER results ended up the way they did,
      • which digits are most difficult to recognize for the per-speaker neural model and why.
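For the report, it may help to recall how WER is computed: the word-level Levenshtein distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the number of reference words. A minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("one two three", "one three three"))  # 1 substitution / 3 words
```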

Further reading

If you want the real thing, not the Easy-Kaldi abstraction:

11. Grapheme-to-phoneme conversion

 Presented: 18 May, Deadline: 1 June

This time, your task will be to create a grapheme-to-phoneme conversion that works with the MBROLA concatenative speech synthesis system. By default, we'll assume you'll use Czech or English with this homework. Since different languages have very different grapheme-to-phoneme ratios, talk to us if you want to try it out for any other language instead.


  1. Install MBROLA. On Debian-based Linuxes, this should be as simple as sudo apt install mbrola. Building for any other Linux shouldn't be too hard either. On Windows, you can use WSL, or you can get native Windows binaries here.

  2. Install a MBROLA voice for your language. On Debian-derivatives (incl. WSL), you can go with sudo apt install mbrola-<voice> and your voices will be installed into /usr/share/mbrola; otherwise you just need to download the voice somewhere.

    • For Czech, cz2 is a good voice, cz1 is lacking some rather common diphone combinations.

    • For English, you can go with en1.

    You can check that it's working by running MBROLA on one of the test files included with the voice. There's always a “test” subdirectory with some “.pho” files.

    mbrola -e /path/to/cz2  path/to/test/some_file.pho output.wav
  3. Implement a simple normalization script. It should be able to expand numbers (just single digits) and abbreviations from a list.

    • Ignore the fact that you sometimes need context for the abbreviations.
    • Add the following abbreviations to your list to test it: Dr, Prof, kg, km, etc/atd.
  4. Add a grapheme-to-phoneme conversion to your script that produces a phoneme sequence like this:

    Czech (“ahoj”):
    a    100
    h\   50
    o    70
    j    50
    _    200

    English (“hello”):
    h    50
    @    70
    l    50
    @U   200
    _    200

    It's basically a two-column tab/space-separated file. The first column is a phoneme, the 2nd column denotes the duration in milliseconds.

    The available phonemes for each language are defined in the voices' README files (cs, en). MBROLA uses the SAMPA phonetic notation. The _ denotes a pause in any language.

    Use the following simple rules for phoneme duration:

    • Consonant – 50 ms
    • Short vowel (any vowel without “:” in Czech SAMPA, any 1-character vowel in English SAMPA): stressed – 100 ms, unstressed – 70 ms
    • Long vowel (vowels with “:” in Czech SAMPA, 2-character vowels in English SAMPA): stressed – 200 ms, unstressed – 150 ms

    If you inspect the MBROLA test files or the description here, you'll see that there's an optional third column for voice melody, saying which way F0 should develop during each phoneme. For our exercise, we'll ignore it. This will give you a rather robotic, but understandable voice. What you should do, though, is:

    • Add a 200 ms pause after each comma or dash.
    • Add a 500 ms pause after sentence-final punctuation (full stop, exclamation or question mark).

    Finally, the actual grapheme-to-phoneme rules are very different for both languages.

    • For Czech, you can do almost everything by starting from orthography and applying some relatively simple rules.

      • You should also add a dictionary for exceptions – add 10 random foreign words to it with their correct SAMPA pronunciations, to test that it works correctly.
    • For English, you can't do without a dictionary. Use the CMU Pronouncing Dictionary which you can download as a whole.

      • Since the dictionary uses Arpabet and you want SAMPA for MBROLA, you'll need to create an Arpabet-to-SAMPA mapping to use it.
      • The dictionary has stress marks (“1”, “2”, “3” etc. after vowels), so you can treat vowels with “1” or “2” as stressed and the rest as unstressed.
      • Let the system spell out any word that it doesn't find in the dictionary (get the pronunciation of each letter).
  5. Take a random Wikipedia article (say, “article of the day”) in your target language, produce the g2p conversion for the first paragraph, then run MBROLA on it and save a WAV file.
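The duration rules above can be sketched in a few lines of Python. This is a simplified illustration for English SAMPA: the vowel inventory below is partial (extend it from the voice README), and the stress flags are assumed to come from your dictionary lookup.

```python
# Partial English SAMPA vowel inventory -- extend it from the en1 README.
VOWELS = {"I", "e", "{", "V", "Q", "U", "@", "i:", "A:", "O:", "u:", "3:",
          "eI", "aI", "OI", "@U", "aU", "I@", "e@", "U@"}

def duration(phoneme, stressed=False):
    """Duration in ms according to the rules above."""
    if phoneme == "_":                    # pause (length set by the caller)
        return 200
    if phoneme not in VOWELS:             # consonant
        return 50
    if len(phoneme) == 1:                 # short vowel
        return 100 if stressed else 70
    return 200 if stressed else 150       # long vowel / diphthong

def to_pho(phonemes):
    """phonemes: list of (phoneme, stressed) pairs -> MBROLA .pho text."""
    return "\n".join(f"{p}\t{duration(p, s)}" for p, s in phonemes) + "\n"

# "hello": matches the example above (h 50, @ 70, l 50, @U 200, _ 200)
print(to_pho([("h", False), ("@", False), ("l", False),
              ("@U", True), ("_", False)]))
```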

What to include in your merge request

Create a directory hw11/ and put into it:

  • Your normalization and grapheme-to-phoneme Python script, which takes plain text input and outputs a MBROLA instructions file.
  • The text of the paragraph on which you tried your conversion system.
  • Your script's output on the paragraph (the “.pho” file for MBROLA).
  • Your resulting MBROLA-produced WAV file.

12. Retrieval chatbot

 Presented: 18 May, Deadline: 8 June

This time, you will implement a basic information-retrieval-based chatbot. We'll just use TF-IDF for retrieval, with no smart reranking.


  1. Get the DailyDialogue dataset. This is a dataset of basic day-to-day dialogues, so it's great for chatbots. Have a look at the data format; it's not very hard to parse.

  2. Implement an IR chatbot module (recommended approach, alternatives and extensions welcome):

    • Load the DailyDialogue data into memory so that you know which turn follows which.

    • From your data, create a Keys dataset containing all turns in all dialogues except each dialogue's last turn. Then create a Values dataset, which for each key contains the immediately following turn.

      • Say there's just 1 dialogue with 5 turns (represented just by numbers here). Keys should contain [0, 1, 2, 3] and the corresponding Values are [1, 2, 3, 4].
    • Use TfidfVectorizer from Scikit-Learn as the main “engine”.

      • Create a vectorizer object and call fit_transform on the Keys set to train your matching TF-IDF matrix (store this matrix for later). Feel free to play around with this method's parameters, especially with the ngram_range -- setting it slightly higher than the default (1,1) might give you better results.
    • For any input sentence, what your chatbot should do is:

      • Call transform on your vectorizer object to obtain TF-IDF scores.

      • Get the cosine similarity of the sentence's TF-IDF to all the items in the Keys dataset.

      • Find the top 10 Keys' indexes using numpy.argpartition (see the example here). Now get the corresponding top 10 Values (at the same indexes). Choose one of them at random and use it as output.

        • Instead of choosing at random, you could do some smart reranking, but we'll skip that in this exercise.
  3. Integrate your chatbot into DialMonkey. Create a module inside dialmonkey.policy and add a corresponding YAML config file. You can call it ir_chatbot.

    • Note that DailyDialog training data are already stored in data/dailydialog/dialogues_train.txt and you can use this file, but you need Git-LFS to download the file contents properly.
  4. Take the 1st sentence of the first 10 DailyDialogue validation dialogues and see what your chatbot tells you.
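The retrieval steps above can be sketched with a toy stand-in for the data (the turns below are made up, and the ngram_range and fixed seed are just illustrative choices):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for DailyDialogue: one dialogue, turn i followed by turn i+1.
dialogue = ["Hi , how are you ?", "Fine , thanks . And you ?",
            "Great . Any plans for the weekend ?", "I'm going hiking .",
            "Sounds fun !"]
keys = dialogue[:-1]      # all turns except the last one
values = dialogue[1:]     # the immediately following turns

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
key_matrix = vectorizer.fit_transform(keys)   # train TF-IDF on the Keys set

def respond(utterance, top_k=2, rng=np.random.default_rng(42)):
    vec = vectorizer.transform([utterance])
    sims = cosine_similarity(vec, key_matrix)[0]
    # indexes of the top_k most similar keys (no smart reranking)
    top = np.argpartition(sims, -top_k)[-top_k:]
    return values[rng.choice(top)]

print(respond("How are you today ?"))
```

With real data you would use top_k=10 as in the instructions; the toy example only has four keys.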

Files to include in your merge request

  • Your chatbot module under dialmonkey/policy/ and your configuration file under conf/ir_chatbot.yaml.
  • The first 10 DailyDialogue validation opening lines along with your chatbot's responses under hw12/samples.txt.

Further reading

More low-level stuff on TF-IDF:

Homework Submission Instructions

All homework assignments will be submitted using a Git repository on MFF GitLab.

Since we will be using dialmonkey a lot, please make your repository a fork of the dialmonkey repository.

We provide an easy recipe to set up your repository below:

Creating the repository

  1. Log into your MFF gitlab account. Your username and password should be the same as in the CAS, see this.

  2. Import Dialmonkey as a new project. Choose the Private visibility level.

     New project -> Import project -> Repo by URL

Use the Dialmonkey Git URL for the import (including the ".git" extension): If there is an error with that repo, you can use a copy at the MFF Gitlab:

  3. Invite us (@duseo7af, @hudecekv) to your project so we can see it. Please give us "Reporter" access level.

     Members -> Invite Member
  4. Clone the newly created repository.

  5. Change into the cloned directory and run

git remote show origin

You should see output like this:

* remote origin
  Fetch URL:
  Push  URL:

  6. Add the original repository as your upstream:
git remote add upstream
  7. You're all set!

Submitting the homework assignment

  1. Make sure you're on your master branch:
git checkout master
  2. Check out a new branch:
git checkout -b hw-XX
  3. Solve the assignment :)

  4. Add new files (if applicable) and commit your changes:

git add hwXX/
git commit -am "commit message"
  5. Push to your origin remote repository:
git push origin hw-XX
  6. Create a Merge request in the web interface. Make sure you create the merge request into the master branch in your own forked repository (not into the upstream).

     Merge requests -> New merge request
  7. Wait a bit till we check your solution, then enjoy your points :)!
  8. Once approved, merge your changes into your master branch – you might need them for further homeworks.

Upgrading your repository

It might happen that we'll make changes in the upstream dialmonkey repository. We'll notify you of this.

To upgrade from upstream, do the following:

  1. Make sure you're on your master branch:
git checkout master
  2. Fetch the changes:
git fetch upstream master
  3. Apply the diff (note the slash -- you're merging the upstream remote's master branch):
git merge upstream/master

Exam Question Pool

The exam will have 10 questions, mostly from this pool. Each counts for 10 points. We reserve the right to make slight alterations or use variants of the same questions. Note that all of them are covered by the lectures, and they cover most of the lecture content. In general, none of them requires you to memorize formulas, but you should know the main ideas and principles. See the Grading tab for details on grading.

Introduction

  • What's the difference between task-oriented and non-task-oriented systems?
  • Describe the difference between closed-domain, multi-domain, and open-domain systems.
  • Describe the difference between user-initiative, mixed-initiative, and system-initiative systems.

Linguistics of Dialogue

  • What are turn-taking cues/hints in a dialogue? Name a few examples.
  • Explain the main idea of the speech acts theory.
  • What is grounding in dialogue?
  • Give some examples of grounding signals in dialogue.
  • What is deixis? Give some examples of deictic expressions.
  • What is coreference and how is it used in dialogue?
  • What do Shannon entropy and conditional entropy measure? No need to give the formula, just the principle.
  • What is alignment/entrainment in dialogue?

Data & Evaluation

  • What are the typical options for collecting dialogue data?
  • How does Wizard-of-Oz data collection work?
  • What is corpus annotation, what is inter-annotator agreement?
  • What is the difference between intrinsic and extrinsic evaluation?
  • What is the difference between subjective and objective evaluation?
  • What are the main extrinsic evaluation techniques for task-oriented dialogue systems?
  • What are some evaluation metrics for non-task-oriented systems (chatbots)?
  • What's the main metric for evaluating ASR systems?
  • What's the main metric for NLU (both slots and intents)?
  • Explain an NLG evaluation metric of your choice.
  • Why do you need to check for significance?
  • Why do you need to evaluate on a separate test set?

Voice assistants & Question Answering

  • What is a smart speaker made of and how does it work?
  • Briefly describe a viable approach to question answering.
  • What is document retrieval and how is it used in question answering?
  • What is a knowledge graph?

Dialogue Tooling

  • What is a dialogue flow?
  • What are intents and entities/slots?
  • How can you improve a chatbot in production?

Natural Language Understanding

  • What are some alternative semantic representations of utterances, in addition to dialogue acts?
  • Describe language understanding as classification and language understanding as sequence tagging.
  • How do you deal with conflicting slots or intents in classification-based NLU?
  • What is delexicalization and why is it helpful in NLU?
  • Describe one of the approaches to slot tagging as sequence tagging.
  • What is the IOB/BIO format for slot tagging?
  • What is the label bias problem?
  • How can an NLU system deal with noisy ASR output? Propose an example solution.

Neural NLU & Dialogue State Tracking

  • Describe a neural architecture for NLU.
  • What is the dialogue state and what does it contain?
  • What is an ontology in task-oriented dialogue systems?
  • Describe the task of a dialogue state tracker.
  • What's a partially observable Markov decision process?
  • Describe a viable architecture for a belief tracker.
  • What is the difference between dialogue state and belief state?
  • What's the difference between a static and a dynamic state tracker?

Dialogue Policies

  • What are the non-statistical approaches to dialogue management/action selection?
  • Why is reinforcement learning preferred over supervised learning for training dialogue managers?
  • Describe the main idea of reinforcement learning (agent, environment, states, rewards).
  • What are deterministic and stochastic policies in dialogue management?
  • What's a value function in a reinforcement learning scenario?
  • What's the difference between actor and critic methods in reinforcement learning?
  • What's the difference between model-based and model-free approaches in RL?
  • What are the main optimization approaches in reinforcement learning?
  • Why do you typically need a user simulator to train a reinforcement learning dialogue policy?

Neural Policies & Natural Language Generation

  • How do you involve neural networks in reinforcement learning (describe a Q network or a policy network)?
  • What are the main steps of a traditional NLG pipeline – describe at least 2.
  • Describe one approach to NLG of your choice.
  • Describe how template-based NLG works.
  • What are some problems you need to deal with in template-based NLG?
  • Describe a possible neural-network-based NLG architecture.

Automatic Speech Recognition

  • What is a speech activity detector?
  • Describe the main components of an ASR pipeline system.
  • What do input features for a traditional ASR model look like?
  • What is the function of the acoustic model in an ASR system?
  • What's the function of a decoder/language model in an ASR system?

Text-to-speech Synthesis

  • How do humans produce sounds of speech?
  • What's the difference between a vowel and a consonant?
  • What is F0 and what are formants?
  • What is a spectrogram?
  • What are main distinguishing characteristics of consonants?
  • What is a phoneme?
  • What are the main distinguishing characteristics of different vowel phonemes (both how they're produced and perceived)?
  • What are the main approaches to grapheme-to-phoneme conversion in TTS?
  • Describe the main idea of concatenative speech synthesis.
  • Describe the main ideas of statistical parametric speech synthesis.
  • How can you use neural networks in speech synthesis?

Chatbots

  • What are the three main approaches to building chatbots?
  • How does the Turing test work? Does it have any weaknesses?
  • What are some techniques rule-based chatbots use to convince their users that they're human-like?
  • Describe how a retrieval-based chatbot works.
  • How can you use neural networks for chatbots? Does that have any problems?
  • Describe a possible architecture of an ensemble chatbot.

Course Grading

To pass this course, you will need to:

  1. Take an exam (a written test covering important lecture content).
  2. Do lab homeworks (various dialogue system implementation tasks).

Exam test

  • There will be a written exam test at the end of the semester.
  • There will be 10 questions; we expect 2-3 sentences as an answer, with a maximum of 10 points per question.
  • To pass the course, you need to get at least 50% of the total points from the test.
  • We plan to publish a list of possible questions beforehand.

In case the pandemic does not get better by the exam period, there will be a remote alternative for the exam (an essay with a discussion).

Homework assignments

  • There will be 12 homework assignments, introduced every week, starting on the 2nd week of the semester.
  • You will submit the homework assignments into a private Gitlab repository (where we will be given access).
  • For each assignment, you will get a maximum of 10 points.
  • All assignments will have a fixed deadline of two weeks.
  • If you submit the assignment after the deadline, you will get:
    • up to 50% of the maximum points if it is less than 2 weeks after the deadline;
    • 0 points if it is more than 2 weeks after the deadline.
  • Once we check the submitted assignments, you will see the points you got and the comments from us in:
  • To be allowed to take the exam (which is required to pass the course), you need to get at least 50% of the total points from the assignments.


The final grade for the course will be a combination of your exam score and your homework assignment score, weighted 3:1 (i.e. the exam accounts for 75% of the grade, the assignments for 25%).


  • Grade 1: >=87% of the weighted combination
  • Grade 2: >=74% of the weighted combination
  • Grade 3: >=60% of the weighted combination
  • An overall score of less than 60% means you did not pass.

In any case, you need >50% of points from the test and >50% of points from the homeworks to pass. If you get less than 50% from either, even if you get more than 60% overall, you will not pass.

No cheating

  • Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty.
  • Discussing homework assignments with your classmates is OK. Sharing code is not OK (unless explicitly allowed); by default, you must complete the assignments yourself.
  • All students involved in cheating will be punished. E.g. if you share your assignment with a friend, both you and your friend will be punished.

Recommended Reading

You should pass the course just by following the lectures, but here are some hints on further reading. There's nothing ideal on the topic as this is a very active research area, but some of these should give you a broader overview.

Basic (good but very brief, available online):

More detailed (very good, available as e-book from our library):

Further reading:

  • Janarthanam: Hands-On Chatbots and Conversational UI Development. Packt 2017.
    • practical guide on developing dialogue systems for current platforms, virtually no theory
  • Gao et al.: Neural Approaches to Conversational AI. arXiv:1809.08267
    • an advanced, good overview of the latest neural approaches in dialogue systems
  • McTear et al.: The Conversational Interface: Talking to Smart Devices. Springer 2016.
    • practical, for current platforms, more advanced and more theory than Janarthanam
  • Jokinen & McTear: Spoken dialogue systems. Morgan & Claypool 2010.
    • good but slightly outdated, some systems very specific to particular research projects
  • Rieser & Lemon: Reinforcement learning for adaptive dialogue systems. Springer 2011.
    • advanced, slightly outdated, project-specific
  • Lemon & Pietquin: Data-Driven Methods for Adaptive Spoken Dialogue Systems. Springer 2012.
    • ditto
  • Skantze: Error Handling in Spoken Dialogue Systems. PhD Thesis 2007, Chap. 2.
    • good introduction into dialogue systems in general, albeit slightly dated
  • McTear: Spoken Dialogue Technology. Springer 2004.
    • good but dated
  • Psutka et al.: Mluvíme s počítačem česky. Academia 2006.
    • virtually the only book in Czech, good for ASR but dated, not a lot about other parts of dialogue systems