Multi-Frame Speech Enhancement

Parameter Estimation Procedures for Deep Multi-Frame MVDR Filtering for Single-Microphone Speech Enhancement

Marvin Tammen, Simon Doclo

Aiming at exploiting temporal correlations across consecutive time frames in the short-time Fourier transform (STFT) domain, multi-frame algorithms for single-microphone speech enhancement have been proposed, which apply a complex-valued filter to the noisy STFT coefficients. Typically, either the multi-frame filter coefficients are estimated directly using deep neural networks or a certain filter structure is imposed, e.g., the multi-frame minimum variance distortionless response (MFMVDR) filter structure. Recently, it was shown that integrating the fully differentiable MFMVDR filter into an end-to-end supervised learning framework employing temporal convolutional networks (TCNs) allows for a high estimation accuracy of the required parameters, i.e., the speech inter-frame correlation vector and the interference covariance matrix. In this paper, we investigate different covariance matrix structures, namely Hermitian positive-definite, Hermitian positive-definite Toeplitz, and rank-1. The main difference between the considered matrix structures lies in the number of parameters that need to be estimated by the TCNs and hence the computational complexity. When assuming a rank-1 matrix structure, we show that the MFMVDR filter can be written as a linear combination of the TCN outputs, significantly reducing computational complexity. In addition, we consider a covariance matrix estimation procedure based on recursive smoothing, where the smoothing factors are estimated using TCNs. Experimental results on the Deep Noise Suppression Challenge 1 and 2 datasets show that the estimation procedure using the Hermitian positive-definite matrix structure yields the best performance, closely followed by the rank-1 matrix structure at a much lower complexity.
Furthermore, it is shown for the best-performing MFMVDR filters that imposing the MFMVDR filter structure instead of directly estimating the multi-frame filter coefficients is beneficial in terms of speech enhancement performance.
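The recursive-smoothing procedure mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common first-order update Phi(l) = alpha(l) Phi(l-1) + (1 - alpha(l)) y(l) y(l)^H for a single frequency bin. The function names, the small identity initialization, and the hand-set smoothing factors are assumptions made for illustration; in the paper the smoothing factors are estimated by TCNs, which is not reproduced here.

```python
import numpy as np


def multiframe_vectors(stft_bin, num_frames):
    """Stack the current and previous STFT coefficients of one frequency bin.

    stft_bin: complex array of shape (L,), one frequency bin over L time frames.
    Returns an array of shape (L, num_frames) whose row l is
    [Y(l), Y(l-1), ..., Y(l-num_frames+1)], zero-padded at the signal start.
    """
    num_steps = stft_bin.shape[0]
    padded = np.concatenate([np.zeros(num_frames - 1, dtype=complex), stft_bin])
    # padded[l:l + num_frames] holds Y(l-num_frames+1) ... Y(l); reverse it.
    return np.stack([padded[l:l + num_frames][::-1] for l in range(num_steps)])


def recursive_covariance(vectors, alphas, eps=1e-6):
    """Recursively smoothed covariance matrices, one per time frame.

    vectors: complex array of shape (L, N) of multi-frame vectors y(l).
    alphas:  real array of shape (L,) with smoothing factors in (0, 1).
             (Hypothetical hand-set values here; TCN outputs in the paper.)
    eps:     scale of the identity used as initial matrix (an assumption),
             which keeps every estimate Hermitian positive-definite.
    """
    num_steps, n = vectors.shape
    phi = eps * np.eye(n, dtype=complex)
    out = np.empty((num_steps, n, n), dtype=complex)
    for l in range(num_steps):
        y = vectors[l]
        # Rank-1 update with the outer product y y^H.
        phi = alphas[l] * phi + (1.0 - alphas[l]) * np.outer(y, y.conj())
        out[l] = phi
    return out
```

A smoothing factor close to 1 gives a slowly varying, low-variance estimate, whereas a smaller value tracks changes more quickly; letting a network choose the factor per time-frequency point trades off between the two.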

[Audio Demos]

Deep Multi-Frame MVDR Filtering for Single-Microphone Speech Enhancement

Marvin Tammen, Simon Doclo

Multi-frame algorithms for single-microphone speech enhancement, e.g., the multi-frame minimum variance distortionless response (MFMVDR) filter, are able to exploit speech correlation across adjacent time frames in the short-time Fourier transform (STFT) domain. Provided that accurate estimates of the required speech inter-frame correlation vector and the noise correlation matrix are available, it has been shown that the MFMVDR filter yields a substantial noise reduction while hardly introducing any speech distortion.
Aiming at merging the speech enhancement potential of the MFMVDR filter and the estimation capability of temporal convolutional networks (TCNs), in this paper we propose to embed the MFMVDR filter within a deep learning framework. The TCNs are trained to map the noisy speech STFT coefficients to the required quantities by minimizing the scale-invariant signal-to-distortion ratio loss function at the MFMVDR filter output. Experimental results show that the proposed deep MFMVDR filter achieves a competitive speech enhancement performance on the Deep Noise Suppression Challenge dataset. In particular, the results show that estimating the parameters of an MFMVDR filter yields a higher performance in terms of PESQ and STOI than Conv-TasNet and than directly estimating either the multi-frame filter or single-frame masks.
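For readers unfamiliar with the filter itself, the following is a minimal NumPy sketch of the standard MFMVDR solution for one time-frequency point, w = Phi_n^{-1} gamma_x / (gamma_x^H Phi_n^{-1} gamma_x), where Phi_n is the noise correlation matrix and gamma_x the normalized speech inter-frame correlation vector. The function names and the diagonal loading term are assumptions added for illustration; the sketch takes the two quantities as given and does not include the TCN-based estimation or the training loss described above.

```python
import numpy as np


def mfmvdr_filter(phi_n, gamma_x, diag_load=1e-8):
    """MFMVDR filter coefficients for a single time-frequency point.

    phi_n:     (N, N) Hermitian positive-definite noise correlation matrix.
    gamma_x:   (N,) normalized speech inter-frame correlation vector.
    diag_load: small diagonal loading for numerical stability (an assumption,
               not taken from the paper).
    """
    n = phi_n.shape[0]
    phi_loaded = phi_n + diag_load * np.eye(n)
    # Solve Phi w' = gamma instead of forming the inverse explicitly.
    numerator = np.linalg.solve(phi_loaded, gamma_x)
    # Normalize so that the distortionless constraint w^H gamma = 1 holds.
    return numerator / (gamma_x.conj() @ numerator)


def apply_filter(w, y):
    """Filter output w^H y for the multi-frame noisy vector y of shape (N,)."""
    return np.vdot(w, y)  # np.vdot conjugates its first argument
```

By construction, `apply_filter(w, gamma_x)` equals 1 up to rounding, which is the distortionless constraint: the speech component correlated with the current frame passes unchanged while the noise output power is minimized.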

[Audio Demos]

Robust Constrained MFMVDR Filters for Single-Channel Speech Enhancement based on Spherical Uncertainty Set

Dörte Fischer, Simon Doclo

Aiming at exploiting speech correlation across consecutive time frames in the short-time Fourier transform domain, the multi-frame minimum variance distortionless response (MFMVDR) filter for single-channel speech enhancement has been proposed. The MFMVDR filter requires an accurate estimate of the normalized speech correlation vector in order to avoid speech distortion and artifacts. In this paper we investigate the potential of using robust MVDR filtering techniques to estimate the normalized speech correlation vector as the vector maximizing the total signal output power within a spherical uncertainty set, which corresponds to imposing a quadratic inequality constraint. Whereas the singly-constrained (SC) MFMVDR filter only considers the quadratic inequality constraint to estimate the (non-normalized) speech correlation vector, the doubly-constrained (DC) MFMVDR filter integrates a linear normalization constraint into the optimization problem to directly estimate the normalized speech correlation vector. To set the upper bound of the quadratic inequality constraint for each time-frequency point, we propose to use a trained non-linear mapping function that depends on the a priori signal-to-noise ratio (SNR). Experimental results for different speech signals, noise types and SNRs show that the proposed constrained approaches yield a more accurate estimate of the normalized speech correlation vector than a state-of-the-art maximum-likelihood (ML) estimator. An instrumental and a perceptual evaluation show that both constrained MFMVDR filters lead to less speech and noise distortion but a lower noise reduction than the ML-MFMVDR filter, where the DC-MFMVDR filter is preferred in terms of overall quality compared to the SC-MFMVDR and ML-MFMVDR filters.
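One way to write the two estimation problems described above, sketched in the style of robust Capon beamforming, is given below. All symbols are assumptions introduced for illustration and are not taken from the paper: $\boldsymbol{\Phi}_y$ is the noisy speech correlation matrix, $\hat{\boldsymbol{\gamma}}$ an initial estimate at the center of the uncertainty set, $\epsilon$ the upper bound of the quadratic inequality constraint, and $\mathbf{e} = [1, 0, \dots, 0]^T$ a selection vector. Since the MVDR output power is $1 / (\boldsymbol{\gamma}^H \boldsymbol{\Phi}_y^{-1} \boldsymbol{\gamma})$, maximizing it is equivalent to minimizing the denominator. The specific form of the linear normalization constraint, fixing the first element to 1, is likewise an assumption.

```latex
% Singly-constrained (SC): quadratic inequality constraint only
\hat{\boldsymbol{\gamma}}_{\mathrm{SC}}
  = \arg\min_{\boldsymbol{\gamma}} \;
    \boldsymbol{\gamma}^{H} \boldsymbol{\Phi}_{y}^{-1} \boldsymbol{\gamma}
  \quad \text{s.t.} \quad
    \lVert \boldsymbol{\gamma} - \hat{\boldsymbol{\gamma}} \rVert_2^2 \le \epsilon

% Doubly-constrained (DC): additional linear normalization constraint
\hat{\boldsymbol{\gamma}}_{\mathrm{DC}}
  = \arg\min_{\boldsymbol{\gamma}} \;
    \boldsymbol{\gamma}^{H} \boldsymbol{\Phi}_{y}^{-1} \boldsymbol{\gamma}
  \quad \text{s.t.} \quad
    \lVert \boldsymbol{\gamma} - \hat{\boldsymbol{\gamma}} \rVert_2^2 \le \epsilon,
  \quad
    \mathbf{e}^{T} \boldsymbol{\gamma} = 1
```

In this reading, the SC solution generally has to be normalized afterwards, whereas the DC solution is already normalized, which matches the distinction between the non-normalized and the normalized estimate drawn in the abstract.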

Audio samples

Audio samples II

(Changed: 19 Jan 2024)