SIS code: NPRG070 / NPRG071
Semester: both
E-credits: 9/6

List of ÚFAL's Research Projects (NPRG070) and Company Projects (NPRG071)

Link to the projects' rules: https://www.ksi.mff.cuni.cz/teaching/projects-web/rules.html

Student | Type | Title (abstracts below) | Supervisor | Defence date | Defence result
Barbora Štěpánková | NPRG070 | Metaphor detection in both prose and poetry for the EduPo project | Mgr. Rudolf Rosa, Ph.D., UFAL MFF UK | |
Vojtěch Dvořák | NPRG070 | mung2musicxml: towards a unified Optical Music Recognition software ecosystem | Mgr. Jan Hajič, Ph.D., UFAL MFF UK | |
Kornelia Skorupińska | NPRG070 | Grounded Language Understanding for Robotic Manipulation Using Large Language and Vision Models | Prof. Piotr Skrzypczyński, PhD., DSc., Institute of Robotics and Machine Intelligence, Poznań University of Technology | |
Hugo Hrbáň | NPRG070 | Predicting Protein Folding in Reduced Alphabet Protein Sequences | doc. RNDr. David Hoksza, Ph.D., KSI MFF UK | 6.11.2025 | defended
Anna Dvořáková | NPRG070 | PyCantus: a library for computational research of Gregorian chant | Mgr. Jan Hajič, Ph.D., UFAL MFF UK | 6.11.2025 | defended
Rishu Kumar | NPRG070 | Summarization of theatre scripts within THEaiTRE project | Mgr. Rudolf Rosa, Ph.D., UFAL MFF UK | 18.7.2023 | defended
Nalin Kumar | NPRG070 | Dialogue alignment for end-to-end task-oriented dialogue models | Mgr. et Mgr. Ondřej Dušek, Ph.D., UFAL MFF UK | 18.7.2023 | defended
Kirill Semenov | NPRG071 | Designing automatic conversational testing for task-oriented voice bots | Mgr. et Mgr. Ondřej Dušek, Ph.D., UFAL MFF UK | 18.7.2023 | defended
Goutham Venkatesh | NPRG070 | Modelling character personalities within THEaiTRE project | Mgr. Rudolf Rosa, Ph.D., UFAL MFF UK | 22.6.2023 | defended
Aditya Kurniawan | NPRG071 | Incremental Learning with Adapters | Deniz Gunceler, PhD., M.S. Anna Piunova, Amazon, Inc. | 24.3.2022 | defended
Michael Hanna | NPRG070 | Reconstruction of Lexical Resources using Large Language Models | RNDr. David Mareček, Ph.D., UFAL MFF UK | 21.9.2021 | defended
Niyati Bafna | NPRG070 | Subword-based Cross-lingual Transfer of Embeddings from Hindi to Marathi | Prof. Ing. Zdeněk Žabokrtský, Ph.D., UFAL MFF UK | 21.9.2021 | defended

Abstracts of the projects

Hugo Hrbáň: Predicting Protein Folding in Reduced Alphabet Protein Sequences
Today, all proteins consist of 20 standard amino acids. However, it is hypothesized that during the early stages of life’s formation on Earth, over 4.5 billion years ago, only a subset of 10 of these amino acids was available. This project was done in collaboration with Klára Hlouchová’s research group at the Faculty of Science (PřF UK), which focuses on protein evolution and the effect of the amino acid alphabet on protein structure, trying to answer the question of whether contemporary proteins can be built using only the early amino acids. The project was divided into two main parts. In the first, we analyzed a dataset of protein sequences from an experimental assay, provided to us by Klára Hlouchová. In the second, we developed a method for translating a given protein sequence into the prebiotic alphabet by iteratively substituting the non-prebiotic residues while keeping the protein structure as similar to the original as possible. We analyzed how the method performs on a small dataset of proteins with diverse folds, and finally designed a few candidate translations of a particular protein of interest, which will be experimentally created and analyzed.
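
The substitution idea can be sketched in a few lines. This is a simplified illustration only: the prebiotic set shown is one commonly hypothesized early alphabet, and the static fallback map stands in for the project's actual iterative optimization against predicted structural similarity.

```python
# Sketch of an alphabet-reduction step (hypothetical scoring; the actual
# project substitutes iteratively while checking predicted structure).
PREBIOTIC = set("ADEGILPSTV")  # one commonly hypothesized early alphabet

# Hypothetical fallback map: each non-prebiotic amino acid is replaced
# by a roughly physicochemically similar prebiotic one.
FALLBACK = {
    "K": "T", "R": "T", "H": "T",   # basic / polar
    "N": "D", "Q": "E",             # amides -> acidic analogues
    "C": "S", "M": "L",
    "W": "L", "F": "L", "Y": "L",   # aromatic / hydrophobic
}

def to_prebiotic(seq: str) -> str:
    """Substitute every non-prebiotic residue, keeping prebiotic ones."""
    return "".join(aa if aa in PREBIOTIC else FALLBACK[aa] for aa in seq)

print(to_prebiotic("MKWVTA"))  # -> "LTLVTA"
```

In the project itself, each candidate substitution would be accepted or rejected based on how much it perturbs the predicted fold, rather than by a fixed lookup.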

Anna Dvořáková: PyCantus: a library for computational research of Gregorian chant
Digital Gregorian chant scholarship has for decades enjoyed the privilege of a large digital resource cataloguing chant sources: the Cantus ecosystem, with nearly 900,000 chants catalogued across more than 2,000 sources. The Cantus Database data model and the Cantus ID mechanism have been adopted by 18 more chant databases, jointly accessible through the Cantus Index interface. However, this data has only been available piecemeal, via the individual online user interfaces and via exports pre-computed by some individual databases (notably Cantus DB); computational methods have therefore had only limited opportunity to process these immense resources. To mitigate this hurdle, we collected CantusCorpus v1.0, a dataset that combines everything that was available across the Cantus Index-centered network of databases as of mid-2025, and we also provide code that makes it easier to update this data as the databases grow. We then created the lightweight PyCantus library for working with this data, decoupling the data model from the Cantus codebase and thus allowing integration of further chant data sources, which we illustrate by harmonising pilot data from the Corpus Monodicum project.
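
To make the data-model idea concrete, here is a minimal sketch of the kind of interface such a library can expose: records keyed by Cantus ID, loaded from a flat export. The field names, the sample rows, and the Cantus ID shown are illustrative stand-ins, not the actual PyCantus schema or real catalogue entries.

```python
# Illustrative sketch only; not the real PyCantus API or data.
import csv
import io
from dataclasses import dataclass

@dataclass
class Chant:
    cantus_id: str   # shared identifier across the database network
    incipit: str     # opening words of the chant text
    source: str      # manuscript siglum

def load_chants(csv_text: str) -> list[Chant]:
    reader = csv.DictReader(io.StringIO(csv_text))
    return [Chant(r["cantus_id"], r["incipit"], r["source"]) for r in reader]

data = """cantus_id,incipit,source
001010,Puer natus est nobis,CH-E 611
001010,Puer natus est nobis,A-Gu 29
"""
chants = load_chants(data)

by_id: dict[str, list[Chant]] = {}
for c in chants:                       # group concordances by Cantus ID
    by_id.setdefault(c.cantus_id, []).append(c)
print(len(by_id["001010"]))  # -> 2
```

Decoupling such a model from any one database's codebase is what allows additional sources (e.g., Corpus Monodicum pilot data) to be harmonised into the same structures.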

Rishu Kumar: Summarization of theatre scripts within THEaiTRE project
The THEaiTRE project focuses on automatically generating theatre play scripts. A shortcoming of the current solution is the limited context window of the generation model: the GPT-2 model attends to at most 1,024 tokens at once, which theatre scripts exceed considerably, leading to issues with maintaining long-distance consistency. The current approach in the project is to circumvent this by applying simple extractive summarization, which is too crude an approach and leads to unsatisfactory results. This project aims to enrich the generation process by employing purpose-built abstractive summarization, trained for dialogue summarization. The plan is to build upon meeting summarization research that the student was involved in within the ELITR project and adapt it to dialogue summarization, or more specifically to theatre play script summarization. This could then be used within the generation process to ensure both short-distance and long-distance consistency of the generated scripts.
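
For context, the kind of simple extractive baseline the project aims to replace can be sketched as frequency-based sentence scoring: pick the k sentences whose words are most frequent overall, in original order. This is an illustration of the general technique, not the project's exact code.

```python
# Frequency-based extractive summarization sketch (illustrative only).
import re
from collections import Counter

def extractive_summary(text: str, k: int = 2) -> list[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Score each sentence by the average corpus frequency of its words.
    freq = Counter(w.lower() for s in sentences for w in re.findall(r"\w+", s))
    def score(s: str) -> float:
        words = re.findall(r"\w+", s.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)
    top = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in top]  # restore original order

print(extractive_summary(
    "The king speaks. The king speaks again and again. A dog barks.", 2))
```

Such a summary loses paraphrase and discourse structure, which is exactly why a trained abstractive dialogue summarizer is expected to condense long scripts more faithfully.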

Nalin Kumar: Dialogue alignment for end-to-end task-oriented dialogue models
The Humane AI microproject plans to take the MultiWOZ 2.2 dataset [Budzianowski et al., 2018; Zang et al., 2020], which is text-only, and record voice for some of the data using crowdsourcing, or record real voice-based dialogues in the same domain. This will produce datasets for all three components – ASR, TTS, and NLG/end-to-end dialogue systems – and can be used to investigate the benefits of sharing context. Specific evaluation metrics will be proposed and baselines created for all three components. These techniques can be seen as interactive grounding in two senses: (1) grounding between the user and the dialogue system – the system can react better thanks to its context-awareness; (2) grounding among system components, which gain better expectations and a better ability to react. Nalin will work on the NLG/end-to-end dialogue part of the project. The focus here will be on dialogue alignment/entrainment [Nenkova et al., 2008; Ostrand & Chodroff, 2021], i.e., aligning the system’s responses to the preceding user utterances by reusing the same vocabulary and potentially also the same syntactic constructions. This kind of alignment happens naturally in human-human dialogues, and it has been shown to improve user experience in dialogue systems [Lopes et al., 2015]. However, most current dialogue systems have no specific support or provision for alignment; they are trained simply with cross-entropy on the training data, or with additional training objectives mostly focused on dialogue content rather than specific phrasing [e.g., Peng et al., 2021].
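
A minimal way to quantify lexical alignment is the fraction of content words in the system response that echo the preceding user utterance. This sketch is illustrative; entrainment measures in the cited literature are considerably richer (covering, e.g., syntax and high-frequency word classes), and the stopword list here is a toy one.

```python
# Toy lexical-alignment score between a user turn and a system reply.
STOPWORDS = {"the", "a", "an", "is", "to", "i", "you", "of"}  # toy list

def lexical_alignment(user: str, system: str) -> float:
    """Fraction of non-stopword system tokens also used by the user."""
    user_vocab = {w for w in user.lower().split() if w not in STOPWORDS}
    sys_words = [w for w in system.lower().split() if w not in STOPWORDS]
    if not sys_words:
        return 0.0
    return sum(w in user_vocab for w in sys_words) / len(sys_words)

print(lexical_alignment("book me a cheap hotel",
                        "sure, a cheap hotel it is"))  # -> 0.5
```

A metric of this general shape could serve either for evaluation or as an auxiliary training signal rewarding responses that reuse the user's wording.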

Kirill Semenov: Designing automatic conversational testing for task-oriented voice bots
Automated bot testing depends on each bot’s specific abilities, intents, and entities. Moreover, conversational testing systems can span from the most “surface” ones, which just test for the presence of keywords or phrases in the bot’s output at each step, to “deeper” systems that can test the consistency of the bot over the whole dialogue (the variety of aspects of evaluating bots is surveyed in (Li et al., 2021)). It therefore makes sense to subdivide this aim into subtasks that incrementally develop automated bot testing at Mama AI. This project covers the first step in that development, targeting the commonly used functions of voice bots: its aim is to build a bot that tests the basic functions of the voice bots at Mama AI. The scope of languages for this project is restricted to English and Czech.
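
The “surface” testing level described above can be sketched as a scripted conversation in which each bot turn must contain expected keywords. The test harness shape and the toy bot below are illustrative assumptions, not Mama AI's actual framework.

```python
# Sketch of surface-level conversational testing: each scripted user
# turn is sent to the bot, and the reply is checked for keywords.
def run_surface_test(bot, script):
    """script: list of (user_utterance, required_keywords) pairs.
    Returns a list of (turn_index, missing_keywords) failures."""
    failures = []
    for turn, (utterance, keywords) in enumerate(script):
        reply = bot(utterance).lower()
        missing = [k for k in keywords if k.lower() not in reply]
        if missing:
            failures.append((turn, missing))
    return failures

# Toy bot standing in for a real task-oriented voice bot.
def toy_bot(utterance):
    return "Hello! How can I help you?" if "hi" in utterance else "Goodbye."

print(run_surface_test(toy_bot, [("hi there", ["hello", "help"]),
                                 ("bye", ["goodbye"])]))  # -> []
```

“Deeper” levels would replace the keyword check with checks on dialogue state, e.g., that slot values confirmed early in the conversation are still honoured at the end.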

Goutham Venkatesh: Modelling character personalities within THEaiTRE project
The THEaiTRE project focuses on automatically generating theatre play scripts. A shortcoming of the current solution is the lack of persona modelling: each line is generated by the same model, not conditioned on the personality of the character speaking the line. The project aims to enrich the script generation with an explicitly modelled character personality conditioning the generation.
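
One lightweight way to condition generation on a persona is simply to prepend an explicit character description to the prompt for each line. This is a hypothetical sketch of that idea, not the project's actual conditioning mechanism, and the character data is invented.

```python
# Sketch of persona-conditioned prompting: the generation model would
# receive this prompt and continue it with the speaker's next line.
def build_prompt(personas: dict, speaker: str, dialogue: list[str]) -> str:
    header = f"{speaker} is {personas[speaker]}.\n"
    return header + "\n".join(dialogue) + f"\n{speaker}:"

personas = {"VALERIE": "a weary robot who speaks in short, bitter sentences"}
print(build_prompt(personas, "VALERIE",
                   ["DOCTOR: How do you feel today?"]))
```

Stronger alternatives to prompt-level conditioning include fine-tuning per character or adding persona embeddings to the model input.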

Aditya Kurniawan: Incremental Learning with Adapters
Incremental learning means adapting a deep learning model (an RNN-T or an NLM in this particular case) to new data. Naive adaptation on the new data alone leads to catastrophic forgetting: the model degrades on the old data. Joint training on the old and new datasets is expensive and should be avoided. The objective of incremental learning is to incorporate new data without forgetting the old and without massive retraining costs. Adapters are small parameterized modules inserted into a pre-trained model (originally proposed for Transformers) that enable parameter- and cost-efficient model adaptation. The seed model is first pre-trained on the source-domain data or source task. During the fine-tuning stage, only the adapter parameters are trained while the pre-trained model parameters stay frozen; this efficiently mitigates catastrophic forgetting and allows simultaneous adaptation to different tasks or domains by training several adapters in parallel.
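
The adapter computation itself is small enough to sketch without a deep learning framework: a down-projection to a bottleneck, a nonlinearity, an up-projection, and a residual connection. The dimensions and weights below are toy values for illustration; in practice this module sits inside each layer of a frozen pre-trained Transformer.

```python
# Pure-Python sketch of a bottleneck adapter block.
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def adapter(x, W_down, W_up):
    """Residual bottleneck: x + W_up @ relu(W_down @ x)."""
    h = relu(matvec(W_down, x))   # d_model -> d_bottleneck
    delta = matvec(W_up, h)       # d_bottleneck -> d_model
    return [xi + di for xi, di in zip(x, delta)]

# Toy sizes: d_model = 3, d_bottleneck = 1. A near-zero up-projection
# keeps the adapter close to the identity at initialization, which is
# the standard way to start adapter training without disturbing the
# frozen model.
W_down = [[1.0, 0.0, 0.0]]
W_up = [[0.0], [0.0], [0.1]]
print(adapter([2.0, -1.0, 0.5], W_down, W_up))  # -> [2.0, -1.0, 0.7]
```

During incremental learning, only `W_down` and `W_up` would receive gradient updates; the surrounding pre-trained weights stay fixed, which is what limits forgetting.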

Michael Hanna: Reconstruction of Lexical Resources using Large Language Models
Large language models (LLMs), typically neural networks pre-trained on vast amounts of unlabeled data, have become a standard tool in the NLP toolkit. They owe their ubiquity in large part to their high performance on downstream NLP tasks, which seems to imply a high degree of language understanding. Despite this, the amount of linguistic knowledge these LLMs capture is unclear. Studies have shown that LLMs capture semantic and syntactic relationships known to linguists; however, these LLMs have also been shown to ignore linguistic information in favor of heuristics when performing certain NLP tasks. Thus, much work remains to be done to discover the exact types of linguistic information learned by LLMs. In this project, I will examine LLMs’ knowledge of relationships from lexical semantics. Specifically, I will focus on their knowledge of hypernymy (X is a type of Y), and will potentially also examine holonymy (X is a part of Y) and synonymy. I have chosen these relationships because they are contained in WordNet, a linguistic resource that encodes, for each English word, its lexicosemantic relationships with other words. The goal for this project will be to extract these aforementioned relationships from LLMs. That is, given an LLM such as BERT, and a word such as “blue”, I plan to extract the word that BERT believes is its hypernym; in this case, we hope that BERT predicts “color”. Via repeated application of this technique, we can construct a graph of hypernym relations, in which words are connected to their hypernyms. In essence, I plan to reconstruct WordNet.
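
The graph-construction step can be sketched independently of the model. Here the per-word prediction is a toy stub; in the project it would be an LLM query, e.g., a cloze prompt such as “blue is a type of [MASK]” scored with BERT. Both the stub and its toy predictions are illustrative assumptions.

```python
# Sketch of building a hypernym graph from repeated per-word queries.
TOY_PREDICTIONS = {"blue": "color", "red": "color", "color": "property"}

def predict_hypernym(word):
    """Stand-in for an LLM query; returns None if nothing is predicted."""
    return TOY_PREDICTIONS.get(word)

def build_hypernym_graph(words):
    """Follow predicted hypernym chains upward, collecting edges."""
    edges = set()
    frontier = list(words)
    while frontier:
        w = frontier.pop()
        h = predict_hypernym(w)
        if h and (w, h) not in edges:
            edges.add((w, h))
            frontier.append(h)   # also query the hypernym itself
    return edges

print(sorted(build_hypernym_graph(["blue", "red"])))
# -> [('blue', 'color'), ('color', 'property'), ('red', 'color')]
```

The resulting edge set can then be compared against WordNet's hypernym hierarchy to measure how much of the resource the model reconstructs.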

Niyati Bafna: Subword-based Cross-lingual Transfer of Embeddings from Hindi to Marathi
Subword-level embeddings are useful for many tasks, but require large amounts of monolingual data to train. While about 14 Indian languages, such as Hindi, Bengali, Tamil, and Marathi, have the required magnitudes of data and resources, most Indian languages are highly under-resourced: they have very little monolingual data and almost no parallel data, low internet presence, and not much digitization. Some examples are Marwadi, Dogri, and Mundari. However, many of these languages have very close syntactic, morphological, and lexical connections to surrounding languages, including the mentioned high-resource languages. Our approach aims to develop a method of bilingual transfer for subword-level embeddings from high-resource to low-resource languages that leverages these connections. We hope that developing methods to build embeddings for low-resource languages will aid further development of other NLP tools for them, such as MT or speech tools. In this project, we work with Hindi and Marathi as our high-resource language (HRL) and low-resource language (LRL), respectively. We simulate a low-resource environment for Marathi, since in this project we are constrained by the need for evaluation resources for the resulting embeddings; we hope to eventually apply this work to truly low-resource languages. To this end, we assume rich resources for Hindi, including large monolingual data (we use up to 2M sentences, containing 36M tokens), taggers, and robust embeddings. For Marathi, we only assume and use small monolingual data (50K sentences, containing 0.8M tokens). We evaluate the resulting embeddings using the publicly available Word Similarity dataset for Marathi. We also perform a second evaluation on Wordnet-Based Synonymy Tests (WBST), which we generate from the public Marathi Wordnet.
This is intended to be a pilot work to a broader study that applies and perhaps adapts our given approach to a much larger coverage of different typologies of Indian languages and language pairs, in the hope of making it generalizable to truly low-resource languages.
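
The core transfer idea can be sketched as follows: because the two languages share much of their subword inventory, an LRL word vector can be initialized from the HRL embeddings of its subwords. The subword segmentation, vector values, and averaging scheme below are toy stand-ins for illustration, not the project's actual method.

```python
# Toy sketch: initialize a Marathi (LRL) word vector from Hindi (HRL)
# subword embeddings shared across the closely related language pair.
HRL_SUBWORD_VECS = {        # "pretrained" on large Hindi data (toy values)
    "pa": [1.0, 0.0],
    "ni": [0.0, 1.0],
    "##i": [0.5, 0.5],
}

def embed_lrl_word(subwords):
    """Average the HRL vectors of the word's known subwords;
    returns None when no subword is covered."""
    known = [HRL_SUBWORD_VECS[s] for s in subwords if s in HRL_SUBWORD_VECS]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

# e.g., "pani" (water, shared by Hindi and Marathi) split into subwords:
print(embed_lrl_word(["pa", "ni"]))  # -> [0.5, 0.5]
```

The small Marathi monolingual data would then be used to refine these transferred vectors, and the Word Similarity and WBST evaluations measure how much of the Hindi signal carries over.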