Supervised Learning-Based Multi-Frame Filtering for Binaural Speech Enhancement

Marvin Tammen, Simon Doclo

In many speech communication scenarios, head-mounted assistive listening devices such as hearing aids capture not only the target speaker but also interfering sound sources, which degrades speech quality and speech intelligibility. To alleviate this issue, several binaural speech enhancement algorithms such as the binaural multi-channel Wiener filter have been proposed [1], which exploit spatial correlations of both the target speech and noise components. Similarly, for single-microphone scenarios it has been proposed to exploit the fact that speech is highly correlated over time, resulting in the multi-frame Wiener filter (MFWF) [2]. In this contribution, we propose a binaural extension of the MFWF, which exploits both spatial and temporal correlations. Similar to [3], the binaural MFWF is embedded into an end-to-end supervised learning framework, where the required parameters are estimated by temporal convolutional networks (TCNs) that are trained using the mean spectral absolute error loss function. Simulations are conducted to evaluate the binaural MFWF in terms of its binaural speech enhancement performance as well as its ability to preserve the binaural localization cues of the target. This evaluation is performed on a dataset comprising binaural room impulse responses measured with behind-the-ear hearing aids in realistic environments as well as diverse noise sources over a broad range of signal-to-noise ratios. The simulation results demonstrate the advantage of multi-frame filtering over single-frame masking as well as the advantage of employing the binaural MFWF structure over directly estimating the binaural multi-frame filter coefficients.
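To illustrate the filtering operation that the abstract contrasts with single-frame masking, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that, in each time-frequency bin, the N most recent STFT frames of the left and right microphone signals are stacked into one vector of length 2N, and that each output ear applies its own complex-valued filter to this vector. The function names (`stack_frames`, `binaural_multiframe_filter`), the array shapes, and the zero-padding at the start of the signal are assumptions made for illustration; in the proposed framework the filter coefficients would be derived from quantities estimated by the TCNs, which is not modeled here.

```python
import numpy as np


def stack_frames(Y, num_frames):
    """Stack the current and previous STFT frames (illustrative helper).

    Y: complex array of shape (K, L) with K frequency bins and L frames.
    Returns an array of shape (K, L, num_frames) whose entry [k, l, n]
    equals Y[k, l - n], with zeros where l - n < 0.
    """
    K, L = Y.shape
    stacked = np.zeros((K, L, num_frames), dtype=complex)
    for n in range(min(num_frames, L)):
        stacked[:, n:, n] = Y[:, : L - n]
    return stacked


def binaural_multiframe_filter(Y_left, Y_right, h_left, h_right):
    """Apply one multi-frame filter per output ear (assumed structure).

    Y_left, Y_right: noisy STFTs of shape (K, L).
    h_left, h_right: complex filters of shape (K, L, 2 * N).
    Returns the left and right output STFTs, each of shape (K, L).
    """
    N = h_left.shape[-1] // 2
    # Stacked vector per bin: N left-ear frames followed by N right-ear frames.
    y = np.concatenate(
        [stack_frames(Y_left, N), stack_frames(Y_right, N)], axis=-1
    )
    # Inner product h^H y in every time-frequency bin.
    X_left = np.sum(np.conj(h_left) * y, axis=-1)
    X_right = np.sum(np.conj(h_right) * y, axis=-1)
    return X_left, X_right
```

In this sketch, single-frame masking corresponds to the special case in which only the coefficient for the current frame of the same ear is non-zero, so the filter reduces to one gain per time-frequency bin.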

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC 2177/1 - Project ID 390895286.

[1]    S. Doclo, W. Kellermann, S. Makino, and S. Nordholm, “Multichannel signal enhancement algorithms for assisted listening devices,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 18–30, Mar. 2015.
[2]    Y. A. Huang and J. Benesty, “A Multi-Frame Approach to the Frequency-Domain Single-Channel Noise Reduction Problem,” IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1256–1269, May 2012.
[3]    M. Tammen and S. Doclo, “Deep Multi-Frame MVDR Filtering for Single-Microphone Speech Enhancement,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, Jun. 2021, pp. 8443–8447.

Audio examples (noise type, level): Bagpipe, 3 dB; Frogs, 19 dB; Munching, 7 dB; Fan, 4 dB
Signals provided for each example:
noisy
clean
direct binaural single-frame filtering
direct binaural multi-frame filtering
deep binaural MFMVDR (proposed)
(Changed: 01 Aug 2022)