MT Marathon 2024 Talks

Monday through Saturday, MT Marathon includes keynote talks.

Confirmed Speakers

Ona de Gibert, Joseph Attieh (University of Helsinki)

Knowledge distillation (TBA)

TBA

Raj Dabre (NICT, Japan)

Advances in Multilingual Machine Translation and Evaluation for Indian Languages

Given the proliferation of internet usage in India, machine translation of Indian languages has become an increasingly important topic. In this talk, I will cover recent advances in multilingual machine translation and evaluation for Indian languages, focusing on two major efforts: IndicTrans2 and IndicMT Eval. For IndicTrans2, I will describe how we scaled up both human and automatically mined data, and how this led to robust open-source machine translation systems that outperform previously existing models, both open- and closed-source. I will then discuss MT evaluation for Indian languages, where we developed meta-evaluation benchmarks and analyzed a large number of metrics to establish their efficacy. I will also briefly talk about IndicComet, a COMET model designed specifically for Indian languages. Towards the end of the talk, I will briefly cover the future of Indian language machine translation, especially in the context of LLMs.

Liz Salesky (JHU)

Pixel models (TBA)

TBA

Ricardo Rei, Nuno Guerreiro, Sweta Agrawal (Unbabel, Instituto de Telecomunicações)

Tower LLM (TBA)

TBA

Vilém Zouhar (ETH Zürich)

Token(s) of Appreciation for BPE

Tokenization is present in almost all NLP pipelines, but it is rarely examined mathematically. During the talk, we'll formalize the most popular tokenization algorithm, Byte-Pair Encoding, show its boundaries, and overall grok it. Using information theory, we also show what makes some tokenizations better than others and how to use this as a metric before training your expensive models. Lastly, we cover stochastic tokenization variants and talk about how the tokenization story is far from being over...
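For readers unfamiliar with Byte-Pair Encoding, the following is a minimal sketch of the textbook greedy merge procedure, not the formalization presented in the talk. The function names (`merge_pair`, `learn_bpe`, `encode`) and the toy corpus are hypothetical, and ties between equally frequent pairs are broken by first-seen order, which is an arbitrary choice made for this illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` in `symbols`, left to right."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy BPE training)."""
    # Each distinct word is a tuple of characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs over the whole corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Greedy step: pick the most frequent pair (ties: first seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the corpus.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    learned = learn_bpe(["abab", "abc"], num_merges=1)
    print(learned)
    print(encode("abab", learned))
```

Replaying merges in the order they were learned keeps encoding deterministic. Production tokenizers typically add details omitted here, such as end-of-word markers and byte-level fallback.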

Laurie Burchell (University of Edinburgh)

Language ID (TBA)

TBA

Tsz Kin Lam (University of Edinburgh)

Speech Translation (TBA)

TBA

Julius Cheng, Andreas Vlachos (University of Cambridge)

Decoding (TBA)

TBA