Subword tokenization has become the dominant method of segmenting textual input for language models. It offers a compromise between covering rare words and avoiding excessive segmentation of text. Nevertheless, the popular subword tokenization algorithms rely on word frequency, which limits their effectiveness for low-resource languages and domains.
This presentation will examine the aspects of subword tokenization that influence language model performance and costs: the allocation and overlap of vocabulary units across languages. I will also discuss potential improvements and alternatives aimed at producing better and fairer textual representations for NLP models.
*** The talk will be delivered in person (MFF UK, Malostranské nám. 25, 4th floor, room S1) and will be streamed via Zoom. For details on how to join the Zoom meeting, please write to sevcikova et ufal.mff.cuni.cz ***