Principal investigator (ÚFAL): 
Project Manager (ÚFAL): 
Provider: 
Grant id: 25-16242S
ÚFAL budget: 6704000
Duration: 2025-2027

BeTok

Better Tokenization for Multilingual Language Models and Machine Translation

The project will develop better tokenizers for multilingual language modeling and machine translation.

Publications

  1. Abishek Stephen, Jindřich Libovický (2026): Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features. In: Findings of the Association for Computational Linguistics: EACL 2026, pp. 3783-3791, Association for Computational Linguistics, Kerrville, TX, USA, ISBN 979-8-89176-386-9
  2. Gianluca Vico, Jindřich Libovický (2026): Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography. In: Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 70-86, Association for Computational Linguistics, Kerrville, TX, USA, ISBN 979-8-89176-372-2
  3. Kirill Semenov, Martin Popel (2025): InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability. In: Tokenization Workshop (TokShop), pp. 1-20, OpenReview, Amherst, MA, USA
  4. Gianluca Vico, Jindřich Libovický (2025): Conditional Unigram Tokenization with Parallel Data. In: Tokenization Workshop (TokShop), pp. 1-21, OpenReview, Amherst, MA, USA

Other

  1. Nathan Schneider, Agata Savary, Elizabeth Salesky, Jindřich Libovický, John McCrae, Yuval Pinter (2025): Panel discussion: Tokenization in the era of LLMs (talk).