Contact

Head of division

Prof. Dr. Dr. Birger Kollmeier

+49 (0)441 798 5466 or 5470

W30 3-313

Office

Katja Warnken

+49 (0)441 798 5470

+49 (0)441 798-3902

W30 3-312

Postal address

Medizinische Physik, Fakultät VI
Universität Oldenburg
26111 Oldenburg

For specific questions regarding one of our research topics, please contact the respective people directly (see staff list).

Predicting speech intelligibility with deep neural networks

Constantin Spille, Stephan D. Ewert, Birger Kollmeier and Bernd T. Meyer (2018)
Computer Speech and Language 48, pp. 51-66, March 2018

An accurate objective prediction of human speech intelligibility is of interest for many applications, such as the evaluation of signal processing algorithms. To predict the speech recognition threshold (SRT) of normal-hearing listeners, an automatic speech recognition (ASR) system is employed that uses a deep neural network (DNN) to convert the acoustic input into phoneme predictions, which are subsequently decoded into word transcripts. ASR results are obtained with, and compared to, data presented in Schubotz et al. (2016), which comprise eight different additive maskers ranging from speech-shaped stationary noise to a single-talker interferer, and responses from eight normal-hearing subjects. The task for listeners and ASR is to identify noisy words from a German matrix sentence test in monaural conditions. Two ASR training schemes typically used in applications are considered: (A) matched training, which uses the same noise type for training and testing, and (B) multi-condition training, which covers all eight maskers. For both training schemes, ASR-based predictions outperform established measures such as the extended speech intelligibility index (ESII) and the multi-resolution speech envelope power spectrum model (mr-sEPSM). This result is obtained with a speaker-independent model that compares the word labels of the utterance with the ASR transcript and does not require separate noise and speech signals. The best predictions are obtained for multi-condition training with amplitude modulation features, which implies that the noise type has been seen during training. Predictions and measurements are analyzed by comparing speech recognition thresholds and individual psychometric functions to the DNN-based results.
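The SRT estimation described in the abstract can be illustrated with a short sketch: word-correct scores (here from comparing reference word labels against ASR transcripts) are collected at several SNRs, a logistic psychometric function is fitted, and the SRT is read off as the SNR yielding 50% correct. This is a minimal, hypothetical illustration; the function names, the grid-search fit, and the example data are not from the paper.

```python
import math

def psychometric(snr, srt, slope):
    """Logistic psychometric function: P(word correct) at a given SNR (dB).
    P = 0.5 exactly when snr == srt."""
    return 1.0 / (1.0 + math.exp(-slope * (snr - srt)))

def estimate_srt(snrs, scores):
    """Estimate the SRT (SNR at 50% words correct) by a least-squares
    grid search over logistic parameters (illustrative, not the
    fitting method used in the paper)."""
    best_srt, best_slope, best_err = None, None, float("inf")
    for srt10 in range(-200, 201):      # candidate SRTs: -20.0 .. 20.0 dB
        srt = srt10 / 10.0
        for slope10 in range(1, 31):    # candidate slopes: 0.1 .. 3.0 per dB
            slope = slope10 / 10.0
            err = sum((psychometric(s, srt, slope) - p) ** 2
                      for s, p in zip(snrs, scores))
            if err < best_err:
                best_srt, best_slope, best_err = srt, slope, err
    return best_srt, best_slope

# Hypothetical word-correct rates measured at five SNRs (dB)
snrs = [-12, -9, -6, -3, 0]
scores = [0.08, 0.25, 0.55, 0.85, 0.97]
srt, slope = estimate_srt(snrs, scores)
print(f"SRT = {srt:.1f} dB SNR, slope = {slope:.1f}/dB")
```

The same fitting step applies to both human responses and ASR scores, which is what makes the two SRTs directly comparable.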

https://doi.org/10.1016/j.csl.2017.10.004

[Free download]

(Changed: 2020-02-17)