Principal investigator (ÚFAL):
BeTok
Better Tokenization for Multilingual Language Models and Machine Translation
The project will develop better tokenizers for multilingual language modeling and machine translation.
Publications
- Abishek Stephen, Jindřich Libovický (2026): Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features. In: Findings of the Association for Computational Linguistics: EACL 2026, pp. 3783-3791, Association for Computational Linguistics, Kerrville, TX, USA, ISBN 979-8-89176-386-9
- Gianluca Vico, Jindřich Libovický (2026): Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography. In: Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 70-86, Association for Computational Linguistics, Kerrville, TX, USA, ISBN 979-8-89176-372-2
- Kirill Semenov, Martin Popel (2025): InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability. In: Tokenization Workshop (TokShop), pp. 1-20, OpenReview, Amherst, MA, USA
- Gianluca Vico, Jindřich Libovický (2025): Conditional Unigram Tokenization with Parallel Data. In: Tokenization Workshop (TokShop), pp. 1-21, OpenReview, Amherst, MA, USA
Other
- Nathan Schneider, Agata Savary, Elizabeth Salesky, Jindřich Libovický, John McCrae, Yuval Pinter (2025): Panel discussion: Tokenization in the era of LLMs (talk).