Principal investigator (ÚFAL): 
Project Manager (ÚFAL): 
Provider: 
Grant id: 25-16242S
ÚFAL budget: 6704000
Duration: 2025-2027

BeTok

Better Tokenization for Multilingual Language Models and Machine Translation

The project will develop better tokenizers for multilingual language modeling and machine translation.

Publications

  1. Kirill Semenov, Martin Popel (2025): InCa and InDia: Inline Casing and Diacritization Preprocessing For Robust-to-Noise Tokenization and Interpretability. In: Tokenization Workshop (TokShop), pp. 1-20, OpenReview, Amherst, MA, USA
  2. Gianluca Vico, Jindřich Libovický (2025): Conditional Unigram Tokenization with Parallel Data. In: Tokenization Workshop (TokShop), pp. 1-21, OpenReview, Amherst, MA, USA

Other

  1. Nathan Schneider, Agata Savary, Elizabeth Salesky, Jindřich Libovický, John McCrae, Yuval Pinter (2025): Panel discussion: Tokenization in the era of LLMs (talk).