Recent advances in natural language processing have demonstrated the benefits of incorporating linguistic knowledge into subword-based models. This talk presents research on morphology-aware word representations for Slovak, focusing on how morphological information can enhance tokenization quality and downstream language modeling. We introduce the Slovak Morphological Tokenizer (SKMT), which integrates root morphemes into the Byte-Pair Encoding framework to preserve their integrity within tokens. This approach is compared to standard BPE and SlovakBERT tokenizers, showing a substantial improvement in maintaining morphemic consistency. Two RoBERTa-based Slovak language models were pre-trained, one with each tokenization method, and evaluated across several NLP tasks. The morphology-aware model exhibited higher training stability and better performance, particularly in question answering and semantic similarity tasks. The talk will discuss the design of SKMT, key experimental results, and broader implications for developing linguistically informed models for morphologically rich languages.
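
Since the abstract only sketches the core idea, the following is a minimal Python illustration of how root-morpheme integrity can be enforced on top of a BPE-style segmenter by pre-splitting words at root boundaries so that no merge crosses the root. The root lexicon, helper names, and toy segmenter are assumptions made for illustration; they are not the actual SKMT implementation presented in the talk.

```python
# Minimal sketch: keep a known root morpheme intact by pre-splitting each word
# into (prefix, root, suffix) pieces before any subword segmentation, so that
# BPE-style merges never break the root apart.

# Hypothetical root-morpheme lexicon (illustrative entry only):
# Slovak "knihami" ("books", instrumental plural) with root "knih-".
ROOTS = {"knihami": "knih"}

def presegment(word: str, roots: dict[str, str]) -> list[str]:
    """Split a word at the root boundaries so the root stays a single piece."""
    root = roots.get(word)
    if root is None or root not in word:
        return [word]                       # unknown word: fall back to plain BPE
    start = word.index(root)
    pieces = [word[:start], root, word[start + len(root):]]
    return [p for p in pieces if p]         # drop empty prefix/suffix

def tokenize(word: str, roots: dict[str, str], bpe) -> list[str]:
    """Apply a BPE segmenter to each pre-split piece independently.

    `bpe` is any callable mapping a string to a list of subwords; because the
    root is passed to it as a separate piece, its integrity is preserved.
    """
    tokens = []
    for piece in presegment(word, roots):
        tokens.extend(bpe(piece))
    return tokens

if __name__ == "__main__":
    # Stand-in "BPE": keeps short pieces whole, otherwise splits them in half.
    toy_bpe = lambda s: [s] if len(s) <= 4 else [s[: len(s) // 2], s[len(s) // 2:]]
    print(tokenize("knihami", ROOTS, toy_bpe))   # ['knih', 'ami'] – root preserved
    print(tokenize("slovo", {}, toy_bpe))        # plain BPE fallback
```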
*** The talk will be delivered in person (MFF UK, Malostranské nám. 25, 4th floor, room S1) and will be streamed via Zoom. For details on how to join the Zoom meeting, please write to sevcikova et ufal.mff.cuni.cz ***