SIS code: NPRG070 / NPRG071
Semester: both
E-credits: 9/6

List of ÚFAL's Research Projects (NPRG070) and Company Projects (NPRG071)

Link to the projects' rules: https://www.ksi.mff.cuni.cz/teaching/projects-web/rules.pdf

Student          | Type    | Title (see abstracts below)                                               | Supervisor        | Defence date | Defence result
Aditya Kurniawan | NPRG071 | Incremental Learning with Adapters                                        | Deniz Gunceler    | 24.3.2022    | defended
Michael Hanna    | NPRG070 | Reconstruction of Lexical Resources using Large Language Models           | David Mareček     | 21.9.2021    | defended
Niyati Bafna     | NPRG070 | Subword-based Cross-lingual Transfer of Embeddings from Hindi to Marathi  | Zdeněk Žabokrtský | 21.9.2021    | defended

Abstracts of the projects

Aditya Kurniawan: Incremental Learning with Adapters
Incremental learning means adapting a deep learning model (here an RNN-T or a neural language model) to new data. Naive fine-tuning on the new data alone leads to catastrophic forgetting: the model degrades on the old data. Joint training on the old and new datasets avoids this but is expensive. The objective of incremental learning is therefore to incorporate new data without forgetting the old and without massive retraining costs. Adapters are small parameterized modules inserted into a pre-trained model (originally proposed for Transformers) that enable parameter- and cost-efficient adaptation. The seed model is first pre-trained on the source-domain data or source task. During fine-tuning, only the adapter parameters are trained while the pre-trained model parameters stay frozen. This largely avoids catastrophic forgetting and allows simultaneous adaptation to several tasks or domains by training adapters in parallel.
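
As a minimal sketch of the adapter idea (a Houlsby-style bottleneck, assuming PyTorch), the snippet below freezes a stand-in backbone and trains only the adapter parameters; the generic Transformer layer, dimensions, and learning rate are illustrative assumptions, not the project's actual RNN-T/NLM setup.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
        def __init__(self, dim: int, bottleneck: int = 64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)
            self.act = nn.ReLU()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Residual connection keeps the frozen model's behaviour reachable.
            return x + self.up(self.act(self.down(x)))

    # Stand-in backbone; the project's seed model would be an RNN-T or NLM.
    backbone = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    adapter = Adapter(dim=256)

    # Freeze every pre-trained parameter; only the adapter receives gradients.
    for p in backbone.parameters():
        p.requires_grad = False

    optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)

    x = torch.randn(8, 10, 256)   # dummy batch: (batch, seq, dim)
    out = adapter(backbone(x))    # adapted representation for the new domain

A separate adapter can be trained this way for each new domain against the same frozen backbone, so switching domains amounts to swapping a small set of adapter weights.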

Michael Hanna: Reconstruction of Lexical Resources using Large Language Models
Large language models (LLMs), typically neural networks pre-trained on vast amounts of unlabeled data, have become a standard tool in the NLP toolkit. They owe their ubiquity in large part to their high performance on downstream NLP tasks, which seems to imply a high degree of language understanding. Despite this, the amount of linguistic knowledge these LLMs capture is unclear. Studies have shown that LLMs capture semantic and syntactic relationships known to linguists; however, LLMs have also been shown to ignore linguistic information in favor of heuristics when performing certain NLP tasks. Thus, much work remains to be done to discover exactly which types of linguistic information LLMs learn. In this project, I will examine LLMs’ knowledge of relationships from lexical semantics. Specifically, I will focus on their knowledge of hypernymy (X is a type of Y), and will potentially also examine holonymy (X is a part of Y) and synonymy. I have chosen these relationships because they are contained in WordNet, a linguistic resource that encodes, for each English word, its lexicosemantic relationships with other words. The goal of this project is to extract these relationships from LLMs. That is, given an LLM such as BERT and a word such as “blue”, I plan to extract the word that BERT believes is its hypernym; in this case, we hope that BERT predicts “color”. Via repeated application of this technique, we can construct a graph of hypernym relations, in which words are connected to their hypernyms. In essence, I plan to reconstruct WordNet.
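
One simple way such relations can be probed (a sketch, not necessarily the project's method) is a Hearst-style fill-mask query; the prompt wording, the bert-base-uncased checkpoint, and the probe words below are illustrative assumptions.

    from transformers import pipeline

    # Probe a masked language model with an "X is a type of [MASK]." pattern.
    fill = pipeline("fill-mask", model="bert-base-uncased")

    for word in ["blue", "oak", "hammer"]:
        predictions = fill(f"{word} is a type of [MASK].", top_k=5)
        print(word, "->", [p["token_str"] for p in predictions])
    # For "blue" we hope "color" appears among the top predictions.

Running such queries over a vocabulary and linking each word to its predicted hypernym yields the hypernym graph described above.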

Niyati Bafna: Subword-based Cross-lingual Transfer of Embeddings from Hindi to Marathi
Subword-level embeddings are useful for many tasks, but require large amounts of monolingual data to train. While about 14 Indian languages such as Hindi, Bengali, Tamil, and Marathi have data and resources of the required magnitude, most Indian languages are highly under-resourced: they have very little monolingual data, almost no parallel data, a low internet presence, and little digitization. Some examples are Marwadi, Dogri, and Mundari. However, many of these languages have very close syntactic, morphological, and lexical connections to surrounding languages, including the high-resource languages mentioned above. Our approach aims to develop a method for bilingual transfer of subword-level embeddings from high-resource to low-resource languages that leverages these connections. We hope that developing methods to build embeddings for low-resource languages will aid the further development of other NLP tools for them, such as MT or speech tools. In this project, we work with Hindi as our high-resource language (HRL) and Marathi as our low-resource language (LRL). We simulate a low-resource environment for Marathi, since we are constrained in this project by the need for resources to evaluate the resulting embeddings; we hope to eventually apply this work to truly low-resource languages. To this end, we assume rich resources for Hindi, including large monolingual data (we use up to 2M sentences, containing 36M tokens), taggers, and robust embeddings. For Marathi, we assume and use only small monolingual data (50K sentences, containing 0.8M tokens). We evaluate the resulting embeddings on the publicly available Word Similarity dataset for Marathi. We also perform a second evaluation on Wordnet-Based Synonymy Tests (WBST), which we generate from the public Marathi Wordnet. This is intended as a pilot for a broader study that applies, and perhaps adapts, our approach to a much larger range of typologies of Indian languages and language pairs, in the hope of making it generalizable to truly low-resource languages.
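
As a minimal illustration of the subword idea (not the project's actual transfer method), the sketch below trains gensim fastText embeddings, which compose word vectors from character n-grams; since Hindi and Marathi share the Devanagari script, a Marathi word unseen in training can still receive a vector from n-grams learned on Hindi. The toy corpus and all hyperparameters are assumptions.

    from gensim.models import FastText

    # Toy Hindi corpus standing in for the project's 36M-token monolingual data.
    hindi_sentences = [
        ["मैं", "किताब", "पढ़ता", "हूँ"],
        ["वह", "घर", "जाती", "है"],
    ]

    # fastText builds each word vector as a sum of character n-gram vectors,
    # so n-grams shared between Hindi and Marathi carry information over.
    model = FastText(hindi_sentences, vector_size=100, min_count=1,
                     min_n=3, max_n=6, epochs=10)

    # A word absent from the Hindi data still receives a vector, composed
    # from the character n-grams learned above.
    vec = model.wv["पुस्तक"]  # "book"; an illustrative out-of-vocabulary query
    print(vec.shape)          # (100,)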