Speech Processing for Summarization in Specific Domains

Guidelines

The goal of the thesis is to examine methods for speech recognition and subsequent summarization that provide a concise overview of longer speech recordings. The recordings may include one or more speakers.

While summarization is the target application, the main focus of the thesis will remain in the speech domain. Current speech recognition systems perform exceptionally well on average in good conditions (e.g. [1], [2]), but they struggle with domain-specific terminology and are not robust to suboptimal recording conditions [3].

In order to reach a practically usable application, the state of the art still needs to be improved in several respects. Specifically, the thesis will touch upon (1) domain adaptation, especially with respect to domain-specific terminology, using fully automatic as well as interactive approaches, (2) confidence estimation to identify regions of the recording where recognition is not reliable enough for further processing, and (3) detection of the key parts of the speech and their integration with the subsequent summarization method.
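To illustrate point (2), the following minimal sketch shows how word-level confidence scores could be merged into time spans that are excluded from further processing. The WordHyp structure and the 0.7 threshold are assumptions for illustration, not part of any particular ASR toolkit.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordHyp:
    word: str
    start: float       # start time in seconds
    end: float         # end time in seconds
    confidence: float  # e.g. a word posterior estimated from the ASR lattice


def unreliable_regions(hyps: List[WordHyp],
                       threshold: float = 0.7) -> List[Tuple[float, float]]:
    """Merge consecutive low-confidence words into time spans that
    should be withheld from downstream summarization."""
    regions: List[Tuple[float, float]] = []
    for h in hyps:
        if h.confidence >= threshold:
            continue
        if regions and h.start <= regions[-1][1]:
            # extend the current low-confidence span
            regions[-1] = (regions[-1][0], h.end)
        else:
            regions.append((h.start, h.end))
    return regions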

If necessary, the thesis can handle the summarization step as a black box, relying on available systems and benefiting from their rapid development (e.g. methods based on BERT and its variants [4], [5]). Careful consideration also needs to be given to summary evaluation techniques, since standard metrics such as ROUGE do not always reflect the practical usability of the summary.
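For reference, ROUGE scores can be computed with off-the-shelf tooling; the sketch below assumes the third-party rouge_score package (pip install rouge-score) and uses placeholder texts. Since the scores are based purely on n-gram overlap, they would be complemented by task-oriented evaluation in the thesis.

from rouge_score import rouge_scorer

reference = "The committee approved the budget and postponed the hiring decision."
candidate = "The budget was approved; the hiring decision was postponed."

# Compute unigram, bigram and longest-common-subsequence variants of ROUGE.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")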

The developed methods and systems will be empirically evaluated both component-wise (ASR only, summarization only, etc.) and end-to-end. The final output could be, for example, a combination of a text summary and a selection of sound clips from the original recording that are likely important but impossible to process automatically.
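One possible shape of such a combined output is sketched below; the Segment structure, the thresholds, and the summarize() callable are assumptions standing in for whichever key-part detector and black-box summarizer are eventually used.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    text: str          # ASR transcript of the segment
    start: float       # start time in seconds
    end: float         # end time in seconds
    confidence: float  # segment-level ASR confidence
    importance: float  # key-part detection score


def combined_output(segments: List[Segment],
                    summarize: Callable[[str], str],
                    min_conf: float = 0.7,
                    min_importance: float = 0.5) -> Tuple[str, List[Tuple[float, float]]]:
    """Summarize reliably recognized text and return time spans of
    important segments that are too unreliable to process automatically."""
    reliable_text = " ".join(s.text for s in segments if s.confidence >= min_conf)
    clips = [(s.start, s.end) for s in segments
             if s.confidence < min_conf and s.importance >= min_importance]
    return summarize(reliable_text), clips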

The thesis will focus on English and Czech and build upon resources that are already available, under development, or relatively easy to obtain from the web.

References

[1] Han, K. J., Prieto, R., Wu, K., & Ma, T. (2019). State-of-the-art speech recognition using multi-stream self-attention with dilated 1d convolutions. arXiv preprint arXiv:1910.00716.
[2] Hadian, H., Sameti, H., Povey, D., & Khudanpur, S. (2018, September). End-to-end Speech Recognition Using Lattice-free MMI. In Interspeech (pp. 12-16).
[3] Macháček, D., Kratochvíl, J., Vojtěchová, T., & Bojar, O. (2019, October). A Speech Test Set of Practice Business Presentations with Additional Relevant Texts. In International Conference on Statistical Language and Speech Processing (pp. 151-161). Springer, Cham.
[4] Liu, Y. (2019). Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318.
[5] Liu, Y., & Lapata, M. (2019). Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345.