Live Credible Translation
TL;DR: I aim to make live speech translation more credible by adding a quality estimation module.
What is "Live" Translation?
"Live", or Simultaneous Speech Translation (SST), is a task that combines speech processing, MT, and
simultaneous policies to deliver speech-to-text or speech-to-speech translations with a short additive
latency, typically 2-4 seconds. The translation must be produced simultaneously, while the source
speech is still being uttered. This brings challenges beyond those of offline speech translation: fast
computation, and the problem of translating partial, gradually incoming sentences without full future
context.
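For illustration, the best-known simultaneous policy is wait-k: read k source segments before writing the first target token, then alternate one READ and one WRITE. The sketch below is a generic, minimal illustration of that interleaving, not the policy of our own system; translate_next and toy_translate_next are hypothetical stand-ins for an incremental translation model.

    # Minimal sketch of a wait-k simultaneous policy: wait for the first
    # k source segments, then alternate READ (consume one more source
    # segment) and WRITE (emit one target token). Real SST policies are
    # more elaborate; this only illustrates the read/write interleaving.
    def wait_k_policy(source_stream, translate_next, k=3):
        source, target = [], []
        for segment in source_stream:
            source.append(segment)                    # READ one source segment
            if len(source) >= k:                      # after k reads, start writing
                tok = translate_next(source, target)  # WRITE one target token
                if tok is not None:
                    target.append(tok)
        # Source exhausted: flush the remaining translation.
        while (tok := translate_next(source, target)) is not None:
            target.append(tok)
        return target

    # Toy stand-in for an incremental model: copies the next uncovered
    # source word in upper case, returning None once the source is covered.
    def toy_translate_next(source, target):
        return source[len(target)].upper() if len(target) < len(source) else None

    print(wait_k_policy(iter("do you hear me".split()), toy_translate_next, k=2))
    # -> ['DO', 'YOU', 'HEAR', 'ME']

The choice of k directly trades latency for quality: a larger k delays the first output but gives the model more future source context to condition on.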
Why "Credible" Translation?
A Quality Estimation (QE) score indicates how likely the translation outputs are to be correct or wrong. Similarly
to MT QE, efficient and reliable SST QE could enable new practical applications with the
potential to enhance the credibility of automatic simultaneous translation. It could also be used in
applications beyond the current state of the art: real-time SST post-editing, such as intelligent
support for a human or an LLM correcting SST outputs in real time; multi-sourcing, i.e. using the speech
of the original speaker and one or more simultaneous interpreters as multiple sources; and others.
Research Plan
1. Acquiring baselines:
◦ SST, such as our SimulStreaming [1], which operates large foundation models such as
Whisper and EuroLLM in simultaneous mode;
◦ ASR confidence estimation (CE), starting with beam search as in [2], but applying it
to Whisper;
◦ state-of-the-art MT and MT QE.
2. Benchmarks: We plan to use our corpus of cross-lingual dialogues, InCroMin [3], among others.
3. Improvements: We have ideas for applying individual estimators to the stages of SST:
◦ acoustics: detecting whether the audio is too noisy to translate;
◦ speech processing, such as ASR confidence estimation;
◦ translation QE, analogous to MT QE;
◦ the simultaneous policy.
Moreover, a practical SST QE must be efficient. We therefore investigate methods that
do not require a large trained neural network for QE, but rather non-trainable methods such as
analyzing the output token distributions, contrasting the hypothesis with a noised or trimmed source, and
analyzing the hypotheses produced by the simultaneous policy.
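To make the first of these signals concrete, here is a minimal sketch, assuming only that the decoder exposes a per-step probability distribution over the vocabulary (as Whisper-style models do). The function name and the aggregation choices (mean log-probability and mean entropy) are illustrative, not the project's final method.

    # Minimal sketch: sequence-level confidence signals computed from the
    # decoder's per-step output distributions, with no trained QE model.
    # A low mean log-probability of the emitted tokens, or a high mean
    # entropy of the full distributions, hints at an unreliable hypothesis.
    import numpy as np

    def distribution_confidence(step_probs, token_ids):
        # Probability the model assigned to each token it actually emitted.
        chosen = np.array([p[t] for p, t in zip(step_probs, token_ids)])
        mean_logprob = float(np.mean(np.log(chosen + 1e-12)))
        # Entropy of each full distribution: high entropy means the mass
        # was spread over many competing tokens.
        entropies = [float(-(p * np.log(p + 1e-12)).sum()) for p in step_probs]
        return {"mean_logprob": mean_logprob, "mean_entropy": float(np.mean(entropies))}

    # Toy usage: three decoding steps over a 5-token vocabulary.
    rng = np.random.default_rng(0)
    probs = [rng.dirichlet(np.ones(5)) for _ in range(3)]
    tokens = [int(np.argmax(p)) for p in probs]
    print(distribution_confidence(probs, tokens))

Both signals are byproducts of decoding, so they add essentially no latency, which matters for the efficiency requirement above.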
Progress and Results
- I presented this project at a UNCE meeting (see the intro slides).
- The project was accepted for funding by the Czech national program OP JAK MSCA Fellowships CZ. It is an individual post-doc fellowship; I will be hosted at the University of Edinburgh for two years from January 2026.
- I presented a baseline SST system at IWSLT 2025, including an interactive demo.
- I am planning to co-organize the IWSLT 2026 shared task on speech metrics and QE.
- I will be at MT Marathon 2025 in Helsinki, asking for feedback on this project and showing a baseline demo (a live speech translation system, without confidence estimation yet).
References
Ours:
[1] Macháček and Polák, 2025. Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025. IWSLT 2025.
[3] InCroMin corpus preview: https://github.com/ELITR/incromin-test-calls
Others:
[2] Karel Beneš, 2024. Language models supporting imperfect hand-writing and speech recognition systems. Dissertation thesis, Brno University of Technology.