SIS code: NPRG070 / NPRG071
Semester: both
E-credits: 9/6

List of ÚFAL's Research Projects (NPRG070) and Company Projects (NPRG071)

Link to the projects' rules: https://www.ksi.mff.cuni.cz/teaching/projects-web/rules.html

Student | Type | Title (abstracts below) | Supervisor | Defence date | Defence result
Barbora Štěpánková | NPRG070 | Metaphor detection in both prose and poetry for the EduPo project | Mgr. Rudolf Rosa, Ph.D., UFAL MFF UK | |
Vojtěch Dvořák | NPRG070 | mung2musicxml: towards a unified Optical Music Recognition software ecosystem | Mgr. Jan Hajič, Ph.D., UFAL MFF UK | |
Kornelia Skorupińska | NPRG070 | Grounded Language Understanding for Robotic Manipulation Using Large Language and Vision Models | Prof. Piotr Skrzypczyński, PhD., DSc., Institute of Robotics and Machine Intelligence, Poznań University of Technology | |
Hugo Hrbáň | NPRG070 | Predicting Protein Folding in Reduced Alphabet Protein Sequences | doc. RNDr. David Hoksza, Ph.D., KSI MFF UK | 6.11.2025 | defended
Anna Dvořáková | NPRG070 | PyCantus: a library for computational research of Gregorian chant | Mgr. Jan Hajič, Ph.D., UFAL MFF UK | 6.11.2025 | defended
Rishu Kumar | NPRG070 | Summarization of theatre scripts within THEaiTRE project | Mgr. Rudolf Rosa, Ph.D., UFAL MFF UK | 18.7.2023 | defended
Nalin Kumar | NPRG070 | Dialogue alignment for end-to-end task-oriented dialogue models | Mgr. et Mgr. Ondřej Dušek, Ph.D., UFAL MFF UK | 18.7.2023 | defended
Kirill Semenov | NPRG071 | Designing automatic conversational testing for task-oriented voice bots | Mgr. et Mgr. Ondřej Dušek, Ph.D., UFAL MFF UK | 18.7.2023 | defended
Goutham Venkatesh | NPRG070 | Modelling character personalities within THEaiTRE project | Mgr. Rudolf Rosa, Ph.D., UFAL MFF UK | 22.6.2023 | defended
Aditya Kurniawan | NPRG071 | Incremental Learning with Adapters | Deniz Gunceler, PhD., M.S. Anna Piunova, Amazon, Inc. | 24.3.2022 | defended
Michael Hanna | NPRG070 | Reconstruction of Lexical Resources using Large Language Models | RNDr. David Mareček, Ph.D., UFAL MFF UK | 21.9.2021 | defended
Niyati Bafna | NPRG070 | Subword-based Cross-lingual Transfer of Embeddings from Hindi to Marathi | Prof. Ing. Zdeněk Žabokrtský, Ph.D., UFAL MFF UK | 21.9.2021 | defended

Abstracts of the projects

Hugo Hrbáň: Predicting Protein Folding in Reduced Alphabet Protein Sequences
Today, all proteins consist of 20 standard amino acids. However, it is hypothesized that during the early stages of life’s formation on Earth, over 4.5 billion years ago, only a subset of 10 of these amino acids was available. This project was done in collaboration with Klára Hlouchová’s research group at the Faculty of Science (PřF UK), which focuses on protein evolution and the effect of the amino acid alphabet on protein structure, trying to answer the question of whether contemporary proteins can be built using only the early amino acids. The project was divided into two main parts. In the first, we analyzed a dataset of protein sequences from an experimental assay, provided to us by Klára Hlouchová. In the second, we developed a method for translating a given protein sequence into the prebiotic alphabet by iteratively substituting the non-prebiotic residues while keeping the protein structure as similar to the original as possible. We analyzed how the method performs on a small dataset of proteins with diverse folds, and finally designed a few candidate translations of a particular protein of interest, which will be experimentally created and analyzed.
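
The substitution idea can be sketched in a few lines. This is a simplified illustration only: the prebiotic set shown is one commonly hypothesized early alphabet, and the static fallback map stands in for the project's actual iterative optimization against predicted structural similarity.

```python
# Sketch of an alphabet-reduction step (hypothetical scoring; the actual
# project substitutes iteratively while checking predicted structure).
PREBIOTIC = set("ADEGILPSTV")  # one commonly hypothesized early alphabet

# Hypothetical fallback map: each non-prebiotic amino acid is replaced
# by a roughly physicochemically similar prebiotic one.
FALLBACK = {
    "K": "T", "R": "T", "H": "T",   # basic / polar
    "N": "D", "Q": "E",             # amides -> acidic analogues
    "C": "S", "M": "L",
    "W": "L", "F": "L", "Y": "L",   # aromatic / hydrophobic
}

def to_prebiotic(seq: str) -> str:
    """Substitute every non-prebiotic residue, keeping prebiotic ones."""
    return "".join(aa if aa in PREBIOTIC else FALLBACK[aa] for aa in seq)

print(to_prebiotic("MKWVTA"))  # -> "LTLVTA"
```

In the project itself, each candidate substitution would be accepted or rejected based on how much it perturbs the predicted fold, rather than by a fixed lookup.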

Anna Dvořáková: PyCantus: a library for computational research of Gregorian chant
Digital Gregorian chant scholarship has for decades enjoyed the privilege of a large digital resource cataloguing chant sources: the Cantus ecosystem, with nearly 900,000 chants catalogued across more than 2,000 sources. The Cantus Database data model and the Cantus ID mechanism have been adopted by 18 more chant databases, jointly accessible through the Cantus Index interface. However, this data has only been available piecemeal, via the individual online user interfaces and via exports pre-computed by some individual databases (notably Cantus DB); computational methods have therefore had only limited opportunity to process these immense resources. To mitigate this hurdle, we collected CantusCorpus v1.0, a dataset that combines everything that was available across the Cantus Index-centered network of databases as of mid-2025, and we also provide code that makes it easier to update this data as the databases grow. We then created the lightweight PyCantus library for working with this data, decoupling the data model from the Cantus codebase and thus allowing integration of further chant data sources, which we illustrate by harmonising pilot data from the Corpus Monodicum project.
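
To make the data-model idea concrete, here is a minimal sketch of the kind of interface such a library can expose: records keyed by Cantus ID, loaded from a flat export. The field names, the sample rows, and the Cantus ID shown are illustrative stand-ins, not the actual PyCantus schema or real catalogue entries.

```python
# Illustrative sketch only; not the real PyCantus API or data.
import csv
import io
from dataclasses import dataclass

@dataclass
class Chant:
    cantus_id: str   # shared identifier across the database network
    incipit: str     # opening words of the chant text
    source: str      # manuscript siglum

def load_chants(csv_text: str) -> list[Chant]:
    reader = csv.DictReader(io.StringIO(csv_text))
    return [Chant(r["cantus_id"], r["incipit"], r["source"]) for r in reader]

data = """cantus_id,incipit,source
001010,Puer natus est nobis,CH-E 611
001010,Puer natus est nobis,A-Gu 29
"""
chants = load_chants(data)

by_id: dict[str, list[Chant]] = {}
for c in chants:                       # group concordances by Cantus ID
    by_id.setdefault(c.cantus_id, []).append(c)
print(len(by_id["001010"]))  # -> 2
```

Decoupling such a model from any one database's codebase is what allows additional sources (e.g., Corpus Monodicum pilot data) to be harmonised into the same structures.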

Rishu Kumar: Summarization of theatre scripts within THEaiTRE project
The THEaiTRE project focuses on automatically generating theatre play scripts. A shortcoming of the current solution is the limited context window of the generation model: the GPT-2 model attends to at most 1,024 tokens at once, which theatre scripts exceed considerably, leading to issues with maintaining long-distance consistency. The current approach in the project is to circumvent this by applying simple extractive summarization, which is too crude an approach and leads to unsatisfactory results. This project aims to enrich the generation process by employing purpose-built abstractive summarization, trained for dialogue summarization. The plan is to build upon meeting summarization research that the student was involved in within the ELITR project and adapt it to dialogue summarization, or more specifically to theatre play script summarization. This could then be used within the generation process to ensure both short-distance and long-distance consistency of the generated scripts.
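
For context, the kind of simple extractive baseline the project aims to replace can be sketched as frequency-based sentence scoring: pick the k sentences whose words are most frequent overall, in original order. This is an illustration of the general technique, not the project's exact code.

```python
# Frequency-based extractive summarization sketch (illustrative only).
import re
from collections import Counter

def extractive_summary(text: str, k: int = 2) -> list[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Score each sentence by the average corpus frequency of its words.
    freq = Counter(w.lower() for s in sentences for w in re.findall(r"\w+", s))
    def score(s: str) -> float:
        words = re.findall(r"\w+", s.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)
    top = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in top]  # restore original order

print(extractive_summary(
    "The king speaks. The king speaks again and again. A dog barks.", 2))
```

Such a summary loses paraphrase and discourse structure, which is exactly why a trained abstractive dialogue summarizer is expected to condense long scripts more faithfully.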

Nalin Kumar: Dialogue alignment for end-to-end task-oriented dialogue models
The Humane AI microproject plans to take the MultiWOZ 2.2 dataset [Budzianowski et al., 2018; Zang et al., 2020], which is text-only, and record voice for some of the data using crowdsourcing, or record real voice-based dialogues in the same domain. This will produce datasets for all three components – ASR, TTS, and NLG/end-to-end dialogue systems – and can be used to investigate the benefits of sharing context. Specific evaluation metrics will be proposed and baselines created for all three components. These techniques can be seen as interactive grounding in two senses: (1) grounding between the user and the dialogue system – the system can react better thanks to its context-awareness; (2) grounding among system components, which gain better expectations and a better ability to react. Nalin will work on the NLG/end-to-end dialogue part of the project. The focus here will be on dialogue alignment/entrainment [Nenkova et al., 2008; Ostrand & Chodroff, 2021], i.e., aligning the system’s responses to the preceding user utterances by reusing the same vocabulary and potentially also the same syntactic constructions. This kind of alignment happens naturally in human-human dialogues, and it has been shown to improve user experience in dialogue systems [Lopes et al., 2015]. However, most current dialogue systems have no specific support or provision for alignment; they are trained simply with cross-entropy on the training data, or with additional training objectives mostly focused on dialogue content rather than specific phrasing [e.g., Peng et al., 2021].
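
A minimal way to quantify lexical alignment is the fraction of content words in the system response that echo the preceding user utterance. This sketch is illustrative; entrainment measures in the cited literature are considerably richer (covering, e.g., syntax and high-frequency word classes), and the stopword list here is a toy one.

```python
# Toy lexical-alignment score between a user turn and a system reply.
STOPWORDS = {"the", "a", "an", "is", "to", "i", "you", "of"}  # toy list

def lexical_alignment(user: str, system: str) -> float:
    """Fraction of non-stopword system tokens also used by the user."""
    user_vocab = {w for w in user.lower().split() if w not in STOPWORDS}
    sys_words = [w for w in system.lower().split() if w not in STOPWORDS]
    if not sys_words:
        return 0.0
    return sum(w in user_vocab for w in sys_words) / len(sys_words)

print(lexical_alignment("book me a cheap hotel",
                        "sure, a cheap hotel it is"))  # -> 0.5
```

A metric of this general shape could serve either for evaluation or as an auxiliary training signal rewarding responses that reuse the user's wording.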

Kirill Semenov: Designing automatic conversational testing for task-oriented voice bots
Automated bot testing depends on each bot’s specific abilities, intents, and entities. Moreover, conversational testing systems can span from the most “surface” ones, which just test for the presence of keywords or phrases in the bot’s output at each step, to “deeper” systems that can test the consistency of the bot over the whole dialogue (the variety of aspects of evaluating bots is surveyed in (Li et al., 2021)). It therefore makes sense to subdivide this aim into subtasks that incrementally develop automated bot testing at Mama AI. This project covers the first step in that development, targeting the commonly used functions of voice bots: its aim is to build a bot that tests the basic functions of the voice bots at Mama AI. The scope of languages for this project is restricted to English and Czech.
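
The “surface” testing level described above can be sketched as a scripted conversation in which each bot turn must contain expected keywords. The test harness shape and the toy bot below are illustrative assumptions, not Mama AI's actual framework.

```python
# Sketch of surface-level conversational testing: each scripted user
# turn is sent to the bot, and the reply is checked for keywords.
def run_surface_test(bot, script):
    """script: list of (user_utterance, required_keywords) pairs.
    Returns a list of (turn_index, missing_keywords) failures."""
    failures = []
    for turn, (utterance, keywords) in enumerate(script):
        reply = bot(utterance).lower()
        missing = [k for k in keywords if k.lower() not in reply]
        if missing:
            failures.append((turn, missing))
    return failures

# Toy bot standing in for a real task-oriented voice bot.
def toy_bot(utterance):
    return "Hello! How can I help you?" if "hi" in utterance else "Goodbye."

print(run_surface_test(toy_bot, [("hi there", ["hello", "help"]),
                                 ("bye", ["goodbye"])]))  # -> []
```

“Deeper” levels would replace the keyword check with checks on dialogue state, e.g., that slot values confirmed early in the conversation are still honoured at the end.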

Goutham Venkatesh: Modelling character personalities within THEaiTRE project
The THEaiTRE project focuses on automatically generating theatre play scripts. A shortcoming of the current solution is the lack of persona modelling: each line is generated by the same model, not conditioned on the personality of the character speaking the line. The project aims to enrich the script generation with an explicitly modelled character personality conditioning the generation.
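
One lightweight way to condition generation on a persona is simply to prepend an explicit character description to the prompt for each line. This is a hypothetical sketch of that idea, not the project's actual conditioning mechanism, and the character data is invented.

```python
# Sketch of persona-conditioned prompting: the generation model would
# receive this prompt and continue it with the speaker's next line.
def build_prompt(personas: dict, speaker: str, dialogue: list[str]) -> str:
    header = f"{speaker} is {personas[speaker]}.\n"
    return header + "\n".join(dialogue) + f"\n{speaker}:"

personas = {"VALERIE": "a weary robot who speaks in short, bitter sentences"}
print(build_prompt(personas, "VALERIE",
                   ["DOCTOR: How do you feel today?"]))
```

Stronger alternatives to prompt-level conditioning include fine-tuning per character or adding persona embeddings to the model input.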

Aditya Kurniawan: Incremental Learning with Adapters
Incremental learning means adapting a deep learning model (an RNN-T or an NLM in this particular case) to new data. Naive adaptation on the new data alone leads to catastrophic forgetting: the model degrades on the old data. Joint training on the old and new datasets is expensive and should be avoided. The objective of incremental learning is to incorporate new data without forgetting the old and without massive retraining costs. Adapters are small parameterized modules inserted into a pre-trained model (originally proposed for Transformers) that enable parameter- and cost-efficient model adaptation. The seed model is first pre-trained on the source-domain data or source task. During the fine-tuning stage, only the adapter parameters are trained while the pre-trained model parameters stay frozen; this efficiently mitigates catastrophic forgetting and allows simultaneous adaptation to different tasks or domains by training several adapters in parallel.
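
The adapter computation itself is small enough to sketch without a deep learning framework: a down-projection to a bottleneck, a nonlinearity, an up-projection, and a residual connection. The dimensions and weights below are toy values for illustration; in practice this module sits inside each layer of a frozen pre-trained Transformer.

```python
# Pure-Python sketch of a bottleneck adapter block.
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def adapter(x, W_down, W_up):
    """Residual bottleneck: x + W_up @ relu(W_down @ x)."""
    h = relu(matvec(W_down, x))   # d_model -> d_bottleneck
    delta = matvec(W_up, h)       # d_bottleneck -> d_model
    return [xi + di for xi, di in zip(x, delta)]

# Toy sizes: d_model = 3, d_bottleneck = 1. A near-zero up-projection
# keeps the adapter close to the identity at initialization, which is
# the standard way to start adapter training without disturbing the
# frozen model.
W_down = [[1.0, 0.0, 0.0]]
W_up = [[0.0], [0.0], [0.1]]
print(adapter([2.0, -1.0, 0.5], W_down, W_up))  # -> [2.0, -1.0, 0.7]
```

During incremental learning, only `W_down` and `W_up` would receive gradient updates; the surrounding pre-trained weights stay fixed, which is what limits forgetting.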

Michael Hanna: Reconstruction of Lexical Resources using Large Language Models
Large language models (LLMs), typically neural networks pre-trained on vast amounts of unlabeled data, have become a standard tool in the NLP toolkit. They owe their ubiquity in large part to their high performance on downstream NLP tasks, which seems to imply a high degree of language understanding. Despite this, the amount of linguistic knowledge these LLMs capture is unclear. Studies have shown that LLMs capture semantic and syntactic relationships known to linguists; however, these LLMs have also been shown to ignore linguistic information in favor of heuristics when performing certain NLP tasks. Thus, much work remains to be done to discover the exact types of linguistic information learned by LLMs. In this project, I will examine LLMs’ knowledge of relationships from lexical semantics. Specifically, I will focus on their knowledge of hypernymy (X is a type of Y), and will potentially also examine holonymy (X is a part of Y) and synonymy. I have chosen these relationships because they are contained in WordNet, a linguistic resource that encodes, for each English word, its lexicosemantic relationships with other words. The goal for this project will be to extract these aforementioned relationships from LLMs. That is, given an LLM such as BERT, and a word such as “blue”, I plan to extract the word that BERT believes is its hypernym; in this case, we hope that BERT predicts “color”. Via repeated application of this technique, we can construct a graph of hypernym relations, in which words are connected to their hypernyms. In essence, I plan to reconstruct WordNet.
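
The graph-construction step can be sketched independently of the model. Here the per-word prediction is a toy stub; in the project it would be an LLM query, e.g., a cloze prompt such as “blue is a type of [MASK]” scored with BERT. Both the stub and its toy predictions are illustrative assumptions.

```python
# Sketch of building a hypernym graph from repeated per-word queries.
TOY_PREDICTIONS = {"blue": "color", "red": "color", "color": "property"}

def predict_hypernym(word):
    """Stand-in for an LLM query; returns None if nothing is predicted."""
    return TOY_PREDICTIONS.get(word)

def build_hypernym_graph(words):
    """Follow predicted hypernym chains upward, collecting edges."""
    edges = set()
    frontier = list(words)
    while frontier:
        w = frontier.pop()
        h = predict_hypernym(w)
        if h and (w, h) not in edges:
            edges.add((w, h))
            frontier.append(h)   # also query the hypernym itself
    return edges

print(sorted(build_hypernym_graph(["blue", "red"])))
# -> [('blue', 'color'), ('color', 'property'), ('red', 'color')]
```

The resulting edge set can then be compared against WordNet's hypernym hierarchy to measure how much of the resource the model reconstructs.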

Niyati Bafna: Subword-based Cross-lingual Transfer of Embeddings from Hindi to Marathi
Subword-level embeddings are useful for many tasks, but require large amounts of monolingual data to train. While about 14 Indian languages, such as Hindi, Bengali, Tamil, and Marathi, have the required magnitudes of data and resources, most Indian languages are highly under-resourced: they have very little monolingual data and almost no parallel data, low internet presence, and not much digitization. Some examples are Marwadi, Dogri, and Mundari. However, many of these languages have very close syntactic, morphological, and lexical connections to surrounding languages, including the mentioned high-resource languages. Our approach aims to develop a method of bilingual transfer for subword-level embeddings from high-resource to low-resource languages that leverages these connections. We hope that developing methods to build embeddings for low-resource languages will aid further development of other NLP tools for them, such as MT or speech tools. In this project, we work with Hindi and Marathi as our high-resource language (HRL) and low-resource language (LRL), respectively. We simulate a low-resource environment for Marathi, since in this project we are constrained by the need for evaluation resources for the resulting embeddings; we hope to eventually apply this work to truly low-resource languages. To this end, we assume rich resources for Hindi, including large monolingual data (we use up to 2M sentences, containing 36M tokens), taggers, and robust embeddings. For Marathi, we only assume and use small monolingual data (50K sentences, containing 0.8M tokens). We evaluate the resulting embeddings using the publicly available Word Similarity dataset for Marathi. We also perform a second evaluation on Wordnet-Based Synonymy Tests (WBST), which we generate from the public Marathi Wordnet.
This is intended to be a pilot work to a broader study that applies and perhaps adapts our given approach to a much larger coverage of different typologies of Indian languages and language pairs, in the hope of making it generalizable to truly low-resource languages.
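
The core transfer idea can be sketched as follows: because the two languages share much of their subword inventory, an LRL word vector can be initialized from the HRL embeddings of its subwords. The subword segmentation, vector values, and averaging scheme below are toy stand-ins for illustration, not the project's actual method.

```python
# Toy sketch: initialize a Marathi (LRL) word vector from Hindi (HRL)
# subword embeddings shared across the closely related language pair.
HRL_SUBWORD_VECS = {        # "pretrained" on large Hindi data (toy values)
    "pa": [1.0, 0.0],
    "ni": [0.0, 1.0],
    "##i": [0.5, 0.5],
}

def embed_lrl_word(subwords):
    """Average the HRL vectors of the word's known subwords;
    returns None when no subword is covered."""
    known = [HRL_SUBWORD_VECS[s] for s in subwords if s in HRL_SUBWORD_VECS]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

# e.g., "pani" (water, shared by Hindi and Marathi) split into subwords:
print(embed_lrl_word(["pa", "ni"]))  # -> [0.5, 0.5]
```

The small Marathi monolingual data would then be used to refine these transferred vectors, and the Word Similarity and WBST evaluations measure how much of the Hindi signal carries over.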