Principal investigator (ÚFAL): 
Project Manager (ÚFAL): 
Provider: 
Grant id: 302425
Duration: 2025-2028

The primary objective is to explore efficient training and adaptation methods for creating general-purpose language models (LMs) for low-resource languages. We will begin with experiments on English in a simulated low-resource setting, i.e., with a deliberately limited amount of training data. We will explore efficient strategies for tuning model parameters, and we will further experiment with datasets exhibiting varied linguistic features. Building on these results, we will adapt such a linguistically rich dataset to the target low-resource language, alongside other tuning strategies. The project will cover both the theoretical and experimental aspects of the problem, and its outputs will support work on NLP tasks for low-resource languages. The contributions of this project are (1) novel, efficient pretraining strategies for languages with limited data, (2) the release of intermediate synthetic silver data, and (3) the release of the resulting models for low-resource languages.
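To make the two core ideas concrete, the following is a minimal sketch of how a simulated low-resource setting and parameter-efficient tuning might look in practice. It is not the project's actual pipeline: the base checkpoint, corpus, token budget, and LoRA hyperparameters are all illustrative assumptions, using a standard Hugging Face stack (datasets, transformers, peft).

```python
# Sketch of the simulated low-resource setup: cap the English training
# corpus at a small token budget, then attach lightweight LoRA adapters
# so only a small fraction of parameters is tuned. All concrete names
# (checkpoint, dataset, budget, hyperparameters) are placeholders, not
# the project's actual configuration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

TOKEN_BUDGET = 10_000_000  # hypothetical "low-resource" cap on training tokens

# Simulate data scarcity: shuffle a large English corpus and keep a slice.
corpus = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
corpus = corpus.shuffle(seed=0)

kept, seen = [], 0
for row in corpus:
    n = len(row["text"].split())  # crude whitespace token count
    if seen + n > TOKEN_BUDGET:
        break
    kept.append(row["text"])
    seen += n

# Parameter-efficient tuning: freeze the base LM, train only LoRA adapters.
model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base LM
lora = LoraConfig(
    r=8,                        # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of weights trainable
```

In this sketch the token budget stands in for the limited data available to a genuinely low-resource language, and the LoRA adapters stand in for the efficient parameter-tuning strategies the project will investigate; the actual experiments may use entirely different methods.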