14:00–14:20
Tomáš Sourada
Music I Care About (MusICA-MetaBench): Automated On-Demand Multimodal Music Perception Benchmarking
Abstract: Music represents a cornerstone of human culture, existing digitally across diverse modalities, including audio, symbolic encodings (e.g., MIDI, MusicXML), and sheet music. Much of computational music processing focuses on transforming musical data from one modality to another. The emerging capabilities of Multimodal Large Language Models (MLLMs) are therefore naturally of interest. However, there remains a critical lack of systematic, multimodal evaluation benchmarks capable of verifying the music perception capabilities of MLLMs. Current benchmarks are often restricted to single modalities or prioritize textual knowledge retrieval over practical musical perception, failing to accommodate the diversity of musical material. To resolve this, we introduce the Music I Care About Meta-Benchmark (MusICA-MetaBench), a novel automated framework for generating benchmarks tailored to user-provided musical data. By leveraging structured symbolic representations (e.g., MusicXML) and our pre-defined question templates, we automatically generate multiple-choice question-answer pairs that probe music perception competencies aligned with music pedagogy. MusICA-MetaBench systematically projects these questions onto aligned audio, visual notation, and symbolic files, allowing for granular cross-modal assessment. We demonstrate the efficacy of our framework by providing a benchmark tailored to the ChoraleBricks dataset, testing current state-of-the-art MLLMs across varied modalities. Our results confirm the framework's robustness, its utility in assessing perceptual musical skills, and its ability to adapt to domain-specific musical needs, marking a significant step toward reliable assessment of multimodal music intelligence.
14:20–14:40
Jan Bronec
Why do you remember me? Localizing and removing unwanted information in LLMs
Abstract: The sheer volume of data – often multilingual – required to pre-train LLMs rules out manual sanitization by human annotators. Heuristic approaches for filtering out sensitive, dangerous, or hateful content may miss paraphrases of the same content, leaving various kinds of sensitive personal information, dangerous knowledge, and copyright-protected content unfiltered. LLMs trained on these datasets are susceptible to sensitive data leakage. As these inconsistencies are generally unveiled only after pre-training, we find it necessary to sanitize pre-trained LLMs for downstream tasks, as well as between expensive re-training runs. Contemporary unlearning methods exhibit major model degradation and low robustness to paraphrased prompts. In our previous work, presented at the SemEval 2025 workshop, we sought to alleviate this degradation by imposing a constraint on parameter updates, enabling efficient computation of additional regularization in a commonly used unlearning method. To improve paraphrase robustness, we recently examined knowledge editing, which focuses on localized, fine-grained parameter edits. Applying knowledge-editing methods to remove different kinds of knowledge has so far yielded inconclusive results; a better understanding of how each part of the model contributes to conveying a concrete piece of information is crucial. During his current research stay at the University of Oslo, Jan is further exploring interpretability in LLMs, this time focusing on explaining and steering the moral foundations of LLMs across languages.
14:40–15:00
Adnan Al Ali
An Effort for Fair LLMs: From Gender Bias to Fair LLM Detection
Abstract: The talk briefly presents Adnan's past, current, and future work within their dissertation topic. It consists of three parts.

Previous work: fairness of LLM detectors. This became a popular topic after a study [1] claimed that GPT detectors are biased against non-native speakers of English because their texts have lower entropy, which is supposedly a key classification feature of the detectors. We revisit the claim in a Czech setting and find that non-native speakers of Czech produce texts with different characteristics that do not lead to lower entropy. We further find that contemporary detectors do not depend strictly on text entropy. The paper was presented at the EACL 2026 SRW.

Current work: cross-lingual alignments (CLA). CLA, defined as the similarity of hidden states across languages, has been shown to be a strong predictor of the multilingual performance of LLMs. Good predictability is important for low-resource languages, where task-specific data is scarce. We extended the existing research from multiple-choice tasks to the translation task and found that alignment with English is highly predictive even when translating between two non-English languages. This contributes evidence to the claim that LLMs implicitly use English as a pivot language [2]. The paper was submitted to ARR, and we plan to present it at EMNLP 2026.

Future work: gender bias in voice LLMs. Audio-enabled LLMs are increasingly popular due to their accessibility, yet, unlike traditional LLMs, little research has been conducted on their fairness. Preliminary results show that audio-enabled LLMs can amplify biases beyond those found in traditional LLMs. During their internship at the University of Hamburg, Adnan will work on possibilities for mitigating these biases in the models.
15:00–15:20
Konstantinos Diamantopoulos
From Grammatical Descriptions to Corpus Reality: Paving the Way for Modelling Complexity
Abstract: The present talk is part of an ongoing PhD project aimed at modelling the morphological structure of words in typologically diverse languages – English, Czech, Slovak, German, and Greek – covering both inflection and word-formation, and determining to what extent the structural complexity described in grammatical descriptions is exploited in actual language use. Pursuing this goal requires, among other things, reliable extraction of paradigms from various language data resources to assess facets of complexity such as how many distinct forms a paradigm contains and how regular these forms are within the system. An initial attempt to exploit InterCorp v16 UD proved infeasible due to inconsistent lemmatisation, POS tagging, and morphological feature assignment. Consequently, comparable corpora from the Leipzig Corpora Collection were annotated with UDPipe and Stanza for English, Czech, and Greek, yielding two datasets per language. An initial noun-level analysis shows that singular-to-plural ratios challenge the grammatical expectation that most nouns occur in both numbers, suggesting that paradigm defectiveness may be the norm rather than the exception – with direct implications for complexity modelling. Findings on number-defective nouns further vary depending on the annotation tool employed, suggesting that annotation decisions constitute a non-negligible variable in corpus-based morphological research.