Robust Speech Translation

Guidelines

The state of the art in neural machine translation (NMT), automatic speech recognition (ASR), and speech translation (ST) is based on deep learning. Across most NLP tasks, there is a continuing trend towards end-to-end architectures, which avoid the quality bottleneck often induced by intermediate representations. At the same time, such architectures often require vast amounts of data to perform well.
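
To make the contrast concrete, here is a minimal Python sketch of the two pipeline shapes, using stand-in callables rather than real neural models; all names are illustrative assumptions, not an existing API. The cascade commits to a discrete 1-best transcript, while the end-to-end model maps audio features directly to target text.

def cascaded_st(audio, asr_model, mt_model):
    """Cascade: audio -> 1-best transcript -> translation.

    The discrete transcript acts as the quality bottleneck: any ASR error
    is frozen into the MT input and cannot be undone downstream.
    """
    transcript = asr_model(audio)
    return mt_model(transcript)

def end_to_end_st(audio, st_model):
    """Direct model: audio -> translation, with no hard intermediate
    decision, but requiring scarce paired (audio, translation) data."""
    return st_model(audio)

# Toy usage with stand-in callables in place of trained networks:
asr = lambda a: "hello wold"         # simulated recognition error
mt = lambda s: "MT(" + s + ")"       # propagates the error verbatim
st = lambda a: "direct translation"  # conditions on audio, not on the error
audio = [0.0, 0.1, -0.2]             # placeholder feature sequence
print(cascaded_st(audio, asr, mt))   # -> MT(hello wold)
print(end_to_end_st(audio, st))      # -> direct translation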

The goal of this thesis is to explore the areas of Speech Translation and Automatic Speech Recognition: specifically, to examine these areas experimentally and to propose and evaluate variations of model architectures, training data layout, or training methods in order to achieve gains in Speech Translation quality and/or efficiency.

The thesis will focus on improving the robustness of ST systems with respect to multiple types of noise and domain mismatch: noise in sound acquisition, noise in training data, scarce or non-existent in-domain training data for both ASR and MT, and noise in partial outputs (e.g., some errors in ASR output should be recoverable from the larger context available in the subsequent translation step).
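
As one illustration of hardening against noise in sound acquisition, on-the-fly augmentation in the spirit of SpecAugment (Park et al., 2019) masks random frequency bands and time spans of the training spectrograms, so the model cannot over-rely on any single band or frame span. The following NumPy sketch is a simplified variant; its parameter names and default values are assumptions for illustration, not a reference implementation.

import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_width=8,
                 num_time_masks=2, time_width=20, rng=None):
    """Zero out random frequency bands and time spans of a (time, freq)
    log-mel spectrogram; a simplified SpecAugment-style masking scheme."""
    rng = rng if rng is not None else np.random.default_rng()
    out = spec.copy()
    n_time, n_freq = out.shape
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, freq_width + 1))
        start = int(rng.integers(0, max(1, n_freq - width)))
        out[:, start:start + width] = 0.0   # mask a frequency band
    for _ in range(num_time_masks):
        width = int(rng.integers(0, time_width + 1))
        start = int(rng.integers(0, max(1, n_time - width)))
        out[start:start + width, :] = 0.0   # mask a span of frames
    return out

# Toy usage: a random 300-frame, 80-band "spectrogram".
spec = np.random.default_rng(0).normal(size=(300, 80))
augmented = spec_augment(spec, rng=np.random.default_rng(1))
assert augmented.shape == spec.shape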

A promising extension of the work would be to experiment with incorporating non-verbal traits into translation decisions. While the intended output of the overall system will primarily remain in the text domain, it should still be possible to steer word choice and information structure according to non-verbal signals recognized in the ASR step.
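
One plausible mechanism for such steering, in the spirit of side-constraint and tag-based approaches to controllable NMT, is to prepend a control pseudo-token derived from the recognized non-verbal signal to the source sequence. The sketch below is purely hypothetical: the tag inventory and the mapping from acoustic cues to tags are illustrative assumptions.

def add_control_token(source_tokens, signal):
    """Prepend a control pseudo-token encoding a recognized non-verbal
    signal; the NMT model would learn to condition on it during training.
    The tag inventory here is hypothetical."""
    tags = {"emphasis": "<emph>", "question": "<ques>", "neutral": "<neut>"}
    return [tags[signal]] + source_tokens

# Example: the ASR front end flags rising intonation, and the tag lets the
# MT model prefer an interrogative rendering in the target language.
print(add_control_token(["you", "are", "coming"], "question"))
# -> ['<ques>', 'you', 'are', 'coming']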

Building on the experimental results and knowledge gained in the first stage of the studies, the work may further focus on other translation- or speech-related NLP tasks (e.g., domain-specific ST).

References

Hrinchuk, O., Popova, M., and Ginsburg, B. (2020). Correction of Automatic Speech Recognition with Transformer Sequence-to-Sequence Model. In Proceedings of ICASSP 2020.

Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., Nguyen, H., and Gadde, R. T. (2019). Jasper: An End-to-End Convolutional Neural Acoustic Model. In Proceedings of Interspeech 2019.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems 30.