MT Marathon 2024 Talks

From Monday through Saturday, MT Marathon features keynote talks.

Confirmed Speakers

Ona de Gibert, Joseph Attieh (University of Helsinki)

Knowledge Distillation for Machine Translation

Large-scale Machine Translation systems pose a challenge in terms of their environmental impact and accessibility. One method to limit the carbon footprint of these systems is Knowledge Distillation (KD): in KD, a larger (teacher) model guides the learning of a smaller (student) model to replicate its performance, enhancing computational efficiency without sacrificing accuracy. This talk comprehensively explores the application of KD in the domain of MT. We propose a double taxonomy that classifies previous work according to the method used and its application.
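
The mechanism is compact enough to sketch in code. Below is a minimal, illustrative word-level KD loss in PyTorch (the tensor shapes and temperature are assumptions of this sketch, not the speakers' setup): the student is trained to match the teacher's softened output distribution.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Word-level KD: train the student to match the teacher's softened
    next-token distribution via KL divergence."""
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student); the T^2 factor keeps gradient magnitudes
    # comparable across temperatures.
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature ** 2

# Toy usage: 4 target positions, vocabulary of 10 tokens.
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
distillation_loss(student_logits, teacher_logits).backward()
```

In sequence-level KD, common for MT, the student is instead trained directly on the teacher's decoded outputs rather than its per-token distributions.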

Khetam Al Sharou (Dublin City University)

When Mistranslation Becomes Misinformation: Exploring Potential Risks of Machine Translation and Its Impact on End-Users

TBA

Raj Dabre (NICT, Japan)

Advances in Multilingual Machine Translation and Evaluation for Indian Languages

Given the proliferation of internet usage in India, machine translation of Indian languages has become an increasingly important topic. In this talk I will cover recent advances in multilingual machine translation and evaluation for Indian languages. Specifically, I will focus on two major efforts, namely IndicTrans2 and IndicMT Eval. Regarding IndicTrans2, I will describe how we scaled up both human-annotated and automatically mined data, and how we then developed robust open-source machine translation systems that outperform previously existing models, both open- and closed-source. I will then discuss MT evaluation for Indian languages, where we developed meta-evaluation benchmarks and analyzed a large number of metrics to establish their efficacy. I will also briefly talk about IndicComet, a COMET model specially designed for Indian languages. Towards the end of my talk, I will briefly cover the future of Indian-language machine translation, especially in the context of LLMs.

Elizabeth Salesky (JHU)

Translation and Language Modeling with Pixels

Language models are typically defined over a finite set of inputs, even if designed to have an "open vocabulary" through common techniques like subword segmentation. This creates a vocabulary bottleneck when scaling the number of supported languages, and limits models' ability to appropriately handle unseen character sequences. I will discuss a recent line of work in which we overcome the vocabulary bottleneck by replacing the embedding matrix with representations built on visually rendered text (pixels). Doing so enables more robust language representations with improved cross-lingual transfer, both within and across scripts.
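
To make the idea concrete, here is a toy sketch (my own illustration, not the speaker's code) of replacing a token vocabulary with rendered pixels: the sentence is drawn as an image and sliced into fixed-width patches, which play the role of token embeddings. Real pixel-based models render with a full Unicode font and feed the patches to a ViT-style encoder; the rendering parameters below are arbitrary.

```python
from PIL import Image, ImageDraw, ImageFont
import numpy as np

def render_to_patches(text, patch_width=16, height=16):
    """Render text as a grayscale image and slice it into fixed-width
    patches: 'tokens' with no finite vocabulary behind them."""
    font = ImageFont.load_default()  # toy font; real systems use a full Unicode font
    width = patch_width * ((len(text) * 8) // patch_width + 1)  # rough width estimate
    img = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(img).text((0, 2), text, fill=0, font=font)
    pixels = np.asarray(img, dtype=np.float32) / 255.0
    # One "token" per horizontal slice of the rendered sentence.
    n = pixels.shape[1] // patch_width
    patches = pixels[:, : n * patch_width].reshape(height, n, patch_width)
    return patches.transpose(1, 0, 2)  # (n_patches, height, patch_width)

# An unseen character sequence is no problem: it still renders to pixels.
print(render_to_patches("zyxgrok qwertyish").shape)
```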

Ricardo Rei, Nuno Guerreiro, Sweta Agrawal (Unbabel, Instituto de Telecomunicações)

Tower LLM

TowerLLM is one of the first large language models specifically tailored for machine translation (MT) and translation-related tasks, achieving state-of-the-art (SOTA) results. Through continued pretraining on diverse datasets and fine-tuning with task-specific instructions, TowerLLM not only surpasses open-source alternatives but also competes closely with closed LLMs. This strategy ensures exceptional proficiency across various translation workflows, significantly enhancing both quality and efficiency. In this talk, we will delve into the development of TowerLLM and share insights from our WMT24 General MT shared task submission, where TowerLLM secured first place in automatic metrics for all language pairs. Furthermore, we will discuss why TowerLLM is strongly positioned to win the task, having outperformed other advanced models such as Claude 3.5 and GPT-4o.

Vilém Zouhar (ETH Zürich)

Token(s) of Appreciation for BPE

Tokenization is present in almost all NLP pipelines, but rarely examined mathematically. During the talk we'll formalize, map the boundaries of, and generally grok the most popular tokenization algorithm, Byte-Pair Encoding. Using information theory, we also show what makes some tokenizations better than others and how to use this as a metric before training your expensive models. Lastly, we cover stochastic tokenization variants and discuss how the tokenization story is far from over...
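
For readers who have never looked inside BPE, here is an unoptimized toy trainer (my own sketch, not the speaker's material): at every step, the single most frequent adjacent symbol pair in the corpus is merged into a new symbol.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    # Word frequencies, with each word stored as a tuple of symbols.
    vocab = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe("low low low lower lowest newer newest", 4))
```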

Laurie Burchell (University of Edinburgh)

Language Identification for Dataset Building

Language identification is a fundamental step in many NLP pipelines and is particularly important for building reliable multilingual datasets. However, current language identification systems are far from perfect, particularly for under-resourced language varieties. In this talk, I will cover language identification and why it matters for downstream NLP applications. I will then present two case studies in which language identification systems struggle and discuss why these failures matter for NLP practitioners. Finally, I will outline some of the open questions in language identification and highlight ongoing work.
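
As a concrete example of the dataset-building role, here is a hedged sketch of a typical filtering step using the off-the-shelf fastText LID model (the file name, target language, and threshold are assumptions of this illustration, not the speaker's setup). Confidently misclassified under-resourced varieties slip straight through exactly this kind of filter.

```python
import fasttext  # model from https://fasttext.cc/docs/en/language-identification.html

# Assumes the 176-language model has been downloaded locally.
model = fasttext.load_model("lid.176.bin")

def keep_line(line, target_lang="en", threshold=0.9):
    """Keep a crawled line only if the classifier confidently assigns
    it to the target language -- a standard corpus-filtering step."""
    labels, probs = model.predict(line.strip())
    lang = labels[0].removeprefix("__label__")
    return lang == target_lang and probs[0] >= threshold

print(keep_line("This is clearly English."))
print(keep_line("Dies ist eindeutig Deutsch."))
```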

Tsz Kin Lam (University of Edinburgh)

Speech Translation: From basics to recent advances

Like text translation, speech translation (ST) is a long-standing research area. Traditionally, ST has been addressed with a cascaded approach. Given the rapid development of end-to-end approaches, is the cascade still the right solution? And how does the emergence of foundation models impact ST? In this talk, I will give a brief introduction to ST and its recent advances. In the first part, I will discuss some interesting properties of speech signals that can make their translation more challenging than translating text. In the second part, I will discuss some existing solutions, including, but not limited to, data augmentation, multi-task learning, the use of Speech Foundation Models and, in particular, the integration of speech into Large Language Models (LLMs).
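
To ground the cascade/end-to-end distinction, here is what a traditional cascade looks like with Hugging Face pipelines (the model choices and audio file name are illustrative assumptions): transcription errors survive the first step and propagate into the translation, which is precisely the weakness end-to-end ST models aim to remove.

```python
from transformers import pipeline

# Cascaded ST: transcribe first, then translate the transcript.
# Both model choices are illustrative, not recommendations.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

transcript = asr("talk_snippet.wav")["text"]          # ASR errors are introduced here...
translation = mt(transcript)[0]["translation_text"]   # ...and propagate uncorrected into MT
print(translation)
```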

Julius Cheng, Andreas Vlachos (University of Cambridge)

Language Model Decoding Beyond Beam Search

Beam search and ancestral sampling are the most well-known and widely used algorithms for one-best prediction and random sampling, respectively, owing to their simplicity, efficiency, and effectiveness. However, beam search suffers from the “beam search curse” and is outperformed by a variety of reranking algorithms, such as quality-based reranking and minimum Bayes risk (MBR) decoding. For sampling, there are methods that explicitly optimize for diversity or reduce the occurrence of low-quality samples. In this talk, I present an overview of these decoding approaches, why they work, and how they address specific weaknesses of the baseline methods.
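
As a concrete instance of reranking, here is a minimal sample-based MBR sketch (a toy illustration: sentence-level chrF stands in as the utility, though neural metrics such as COMET are common in practice). Each sampled translation is scored by its average similarity to the other samples, and the most "central" candidate wins, sidestepping the beam search curse.

```python
import sacrebleu  # chrF as a cheap stand-in for a learned utility metric

def mbr_decode(candidates):
    """Sample-based MBR: return the candidate with the highest average
    utility against all other samples, used as pseudo-references."""
    def expected_utility(hyp):
        others = [c for c in candidates if c is not hyp]
        return sum(sacrebleu.sentence_chrf(hyp, [ref]).score for ref in others) / len(others)
    return max(candidates, key=expected_utility)

# Toy usage with a handful of ancestral samples from an MT model.
samples = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "the cat is sitting on the mat",
]
print(mbr_decode(samples))
```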