Principal investigator (ÚFAL):
BeTok
Better Tokenization for Multilingual Language Models and Machine Translation
The project will develop better tokenizers for multilingual language modeling and machine translation.
Publications
- Kirill Semenov, Martin Popel (2025): InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability. In: Tokenization Workshop (TokShop), pp. 1-20, OpenReview, Amherst, MA, USA
- Gianluca Vico, Jindřich Libovický (2025): Conditional Unigram Tokenization with Parallel Data. In: Tokenization Workshop (TokShop), pp. 1-21, OpenReview, Amherst, MA, USA
Other
- Nathan Schneider, Agata Savary, Elizabeth Salesky, Jindřich Libovický, John McCrae, Yuval Pinter (2025): Panel discussion: Tokenization in the era of LLMs (talk).