14:00–14:20
Nam Hoang Luu
Sheer Luck or Linguistic Properties Behind LLM Pre-Pretraining? An Investigation on Multiple Languages
Abstract: Pretraining LLMs on artificial languages (“pre-pretraining”) is a technique reported to increase token efficiency by up to 30%, i.e., to save up to 30% of the training tokens needed to reach a given level of performance. We validate this prior result, originally obtained for English, on a larger set of natural languages spanning four language families, using two different tokenizers and varying model sizes. We relate the observed gains (or losses) in token efficiency to quantified linguistic properties of the languages, such as sentence length, morphological richness, and features of syntactic dependency trees (tree depth, maximum number of children, number of crossing dependencies). Our empirical results indicate that the reported gains should be attributed primarily to luck in the choice of random seed, although we can confirm two trends: stable gains from 128-Dyck pre-pretraining of small models with the Llama tokenizer, and diminishing gains with increasing model size.
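For intuition, pre-pretraining data of the kind mentioned above (a 128-Dyck language, i.e., balanced bracket sequences over 128 bracket types) can be sampled with a few lines of code. This is a minimal illustrative sketch, not the speaker's actual setup; the token format, branching probability, and depth cap are our own assumptions:

```python
import random

def dyck_sample(n_pairs, vocab_size=128, max_depth=None, rng=random):
    """Sample a balanced bracket sequence from a k-Dyck language.

    Each of the `vocab_size` bracket types i contributes an opening
    token "(i" and a matching closing token ")i".
    """
    out, stack = [], []
    opens_left = n_pairs
    while opens_left > 0 or stack:
        can_open = opens_left > 0 and (max_depth is None or len(stack) < max_depth)
        # Open a fresh bracket when forced to (empty stack) or by coin flip;
        # otherwise close the most recently opened one.
        if can_open and (not stack or rng.random() < 0.5):
            t = rng.randrange(vocab_size)
            stack.append(t)
            out.append(f"({t}")
            opens_left -= 1
        else:
            out.append(f"){stack.pop()}")
    return out
```

Sequences like these are well-nested by construction, which is the structural property such pre-pretraining is meant to expose the model to before it sees natural language.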
14:20–14:40
Ivan Kartáč
Evaluating formal and commonsense reasoning in real-world interactive scenarios
Abstract: Recent advances in Large Language Models (LLMs) have led to strong performance on various reasoning tasks, including both formal and commonsense reasoning. Although these models are often used in interactive and compound settings, evaluation is typically limited to isolated problems and relies on static benchmarks with multiple-choice questions. To investigate whether these results transfer to realistic settings such as task-oriented dialogue (TOD), we create a dynamic benchmark that frames each problem in two variants: (1) presented as a standalone problem, and (2) embedded in a TOD. The benchmark spans arithmetic, temporal, and spatial reasoning in multiple travel-related domains, and the examples are procedurally generated to mitigate data contamination. Models are allowed to generate free-form responses; answers are then extracted automatically and compared to the ground truth, followed by a bias correction that adjusts for potential parsing errors. We observe large and significant gaps in performance between the two settings across various model sizes and architectures, including large and proprietary models. To further explore the factors behind this discrepancy, we design a series of ablation experiments and find that LLMs’ reasoning performance in TOD is primarily affected by multi-turn interaction, tool use, and role conditioning. Our results point to potential limitations in the applicability of LLMs for reasoning in TOD tasks and emphasize the need to evaluate these models in interactive scenarios. Future work will focus on a deeper understanding of the mechanisms that underlie the models’ behavior in interactive and compound settings.
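The two-variant framing described above can be illustrated as follows: the same procedurally generated problem is emitted once as a standalone prompt and once embedded in a dialogue. The template, domain, and dialogue wording here are invented for illustration and do not reproduce the speaker's benchmark:

```python
import random

def make_problem(rng):
    """Generate one arithmetic travel-fare problem with fresh numbers,
    so no fixed instance can have leaked into training data."""
    fare = rng.randint(5, 60)   # price per ticket (illustrative units)
    n = rng.randint(2, 6)       # number of travellers
    question = f"A train ticket costs {fare} euros. How much do {n} tickets cost?"
    return question, fare * n

def frame(question, mode):
    """Wrap the identical problem as a standalone prompt or embed it
    in a (hypothetical) task-oriented dialogue context."""
    if mode == "standalone":
        return question
    return ("User: Hi, I'm booking a trip.\n"
            "Assistant: Sure, I can help with that.\n"
            f"User: {question}")
```

Because both variants share the exact underlying problem and ground-truth answer, any performance gap between them can be attributed to the interactive framing rather than to problem difficulty.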
14:40–15:00
Evelin Kitti Ács
Identifying and Analyzing Internationalisms Across Four Languages
Abstract: Loanwords are words borrowed from one language and incorporated into the vocabulary of another through the process of linguistic borrowing. They adapt to the phonological, graphical, and grammatical system of the recipient language to different extents, depending on the features of both the recipient and the source languages; compare, for example, the counterparts of the English noun telephone in Hungarian (telefon), Czech (telefon), and Hindi (टेलीफ़ोन ṭelīfon). We aim to identify internationalisms across these four typologically diverse languages using semi-automatic methods on available data resources, and to describe their assimilation patterns. We have surveyed relevant literature, linguistic datasets, and NLP tools for the languages under analysis, and clarified the notion of internationalisms with respect to both the languages and the available data. We have also conducted an initial analysis of the formal features of internationalisms in these languages, with particular attention to their assimilation patterns. Using Wiktionary as a primary data source, we have extracted internationalisms from available resources, including parallel corpora, dictionaries, and other relevant datasets. We have also collected their inflected forms (where applicable) and identified words sharing the same roots (e.g., derivatives and compounds) in order to establish morphological families. Future work will consist of analyzing the inflectional and word-formation behavior of internationalisms within and across the aforementioned languages.
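One simple semi-automatic heuristic for flagging internationalism candidates of the kind discussed above is normalized edit distance over per-language word lists. This sketch is illustrative only: the threshold is our own choice, it assumes non-Latin-script forms (e.g., Hindi) have already been transliterated, and it is not the speaker's actual method:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def internationalism_candidates(wordlists, threshold=0.34):
    """Flag words that appear in similar form in every language.

    `wordlists` maps language -> list of (romanized) words; a word from
    the first language is a candidate if every other language contains
    a word within the normalized edit-distance threshold.
    """
    langs = list(wordlists)
    pivot, rest = wordlists[langs[0]], langs[1:]
    hits = []
    for w in pivot:
        match = {}
        for lang in rest:
            best = min(wordlists[lang],
                       key=lambda v: levenshtein(w, v) / max(len(w), len(v)))
            if levenshtein(w, best) / max(len(w), len(best)) <= threshold:
                match[lang] = best
            else:
                break
        if len(match) == len(rest):
            hits.append((w, match))
    return hits
```

For instance, "telephone" / "telefon" / "telefon" would surface as one candidate family, while unrelated native vocabulary would not; the shortlist would then be verified manually, which is what makes the method semi-automatic.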