14:00–14:20
Vojtěch John
Morpheme-level language changes in time and space
Abstract: Many language changes can be modelled computationally. In recent years, Bayesian statistical models, often inspired by evolutionary biology, have been used successfully to explore the factors underlying lexical change as well as the development of typological features. Data-driven computational methods have proven very fruitful for studying language change in general. Nevertheless, such methods have not yet been widely applied to morpheme-level changes, even though changes at this level frequently play a key role in theoretical accounts of language evolution. This talk will discuss current bottlenecks in the computational modelling of morpheme-level language change, especially regarding morphologically annotated data and state-of-the-art methods for morphology processing. It will also report on the potential of computational methods for capturing both morpheme typology and morpheme-level language change, along with some preliminary results on these tasks.
14:20–14:40
Matěj Kripner
Neural Theorem Proving in Lean 4
Abstract: Formal theorem proving in mathematics consists of generating an automatically verifiable proof given a formalized theorem statement. Neural theorem proving seeks to utilize neural networks to suggest suitable proof steps, typically in combination with decoding strategies like tree search or chain-of-thought.
Neural theorem proving approaches can be split into two categories. In black-box approaches, a formal verifier is used only to verify the final proof, and it is left to the prover model to deduce the internal proof states. In contrast, white-box provers explicitly utilize internal proof states and consult the verifier interactively before generating each proof step. Unfortunately, tooling and datasets for white-box approaches in Lean 4 are lacking. Addressing this gap is the first result of our work.
Looking forward, our plan is to advance white-box neural theorem proving approaches. One promising avenue of research that we identified is embedding states and actions of the proof search in a latent space, circumventing the time-demanding tactic generation.
Our goal is to assist mathematicians in the formalization of mathematics and also to transfer our approach to an applied domain like code generation or robotics.
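To make the setting concrete, here is a minimal invented example (not from the talk) of a formalized statement with a tactic-mode proof in Lean 4; a neural prover would propose tactic steps like the one below, with the verifier checking each step against the current proof state:

```lean
-- A formalized theorem statement and a machine-checkable proof.
-- A white-box prover observes the proof state and consults the
-- verifier before proposing each tactic; here a single decision
-- procedure for linear arithmetic over Nat closes the goal.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  omega
```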
14:40–15:00
Kumar Nalin
Efficient methods for building LLMs for low-resourced languages
Abstract: The performance of current LLMs is largely limited to high- and moderately-resourced languages. This gap can be attributed to the disparity in training data. Since developing new datasets for the many existing tasks and languages would be costly, it is crucial to explore more efficient ways of utilizing the current ones to bridge the performance gap. First, we plan to develop new methods for training language-understanding models for low-resource languages. We start by benchmarking our methods on English in a low-resource setting. We then explore ways to transfer data, knowledge, and tuning strategies to any low-resource language. In our initial experiments, we train language-specific embeddings with an alternative method (e.g., fastText), freeze them, and train the remaining model parameters on top of them as a universal representation. Our preliminary results unsurprisingly show that fine-tuning performance depends heavily on the languages seen during pretraining. To improve performance, we plan to create better-aligned embeddings that are more isomorphic to those of higher-resourced languages. In future work, we plan to employ methods such as DPO, curriculum learning, and delexicalised parsing, and to extend the effective techniques to generation tasks.
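The frozen-embedding setup can be sketched as follows. This is a toy illustration (not the authors' code): random vectors stand in for pretrained fastText embeddings and stay frozen, while only a small task head is trained; the vocabulary, data, and dimensions are invented.

```python
import numpy as np

# "Pretrained" embeddings standing in for fastText vectors (frozen).
# Vocabulary size, embedding dimension, and data are invented.
rng = np.random.default_rng(0)
V, D = 6, 4
E = rng.normal(scale=0.5, size=(V, D))

sentences = [[0, 1], [1, 2], [3, 4], [4, 5]]  # token-id "sentences"
labels = np.array([0, 0, 1, 1])

# Mean-pool the frozen embeddings into fixed sentence representations.
X = np.stack([E[s].mean(axis=0) for s in sentences])

# Train only the head (logistic regression) by gradient descent;
# E receives no updates and acts as the universal representation.
w, b = np.zeros(D), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= X.T @ (p - labels) / len(labels)
    b -= (p - labels).mean()

p = 1 / (1 + np.exp(-(X @ w + b)))
loss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
preds = (p > 0.5).astype(int)
```

In a real transfer setting, the frozen embedding table for a new language would be swapped in while the trained head and body parameters are reused.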
15:00–15:20
Kristýna Onderková
Insight discovery in structured data
Abstract: Table-to-text generation is a challenging task which requires precision in analysing the data and intuition about which types of facts from the table may be interesting. Symbolic approaches, such as explicit logical forms, are often custom-designed and difficult to scale. Neural approaches, including the latest LLMs, are prone to hallucinations and unconcerned with logical reasoning. Existing benchmarks for this task also require little inference or any notion of the actual interestingness and relevance of the generated insights, as these are difficult to evaluate.
We use LLM ideation to select interesting insights based on the models' broad knowledge, and to generate code that retrieves the relevant data, grounding the final natural-language generation of insights. However, this approach faces many challenges: from contaminated datasets through arduous evaluation to the final bosses of clumsily defined data and implicit presuppositions.
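The ideation-then-retrieval pipeline can be sketched roughly as below. This is an illustrative toy, not the authors' system: the "LLM-generated" retrieval snippet is hard-coded here, and the table contents are invented.

```python
# Invented toy table; in practice this is the structured input data.
table = [
    {"team": "A", "wins": 12, "losses": 4},
    {"team": "B", "wins": 9,  "losses": 7},
    {"team": "C", "wins": 3,  "losses": 13},
]

# Step 1 (ideation): the LLM proposes a candidate insight type,
# e.g. "which team leads in wins?".

# Step 2 (retrieval): the LLM emits code that computes the fact
# directly from the table, so the value cannot be hallucinated.
best = max(table, key=lambda row: row["wins"])

# Step 3 (generation): the retrieved values ground the final sentence.
insight = f"Team {best['team']} leads with {best['wins']} wins."
```

The key design choice is that every number or entity in the final text is produced by executed code over the table, not by free-form decoding.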