Navigation

Phase-Aware Signal Processing

For single channel speech enhancement algorithms which rely on spectro-temporal properties, the short-term stationary of natural sounds are exploited by processing the signal in a time-frequency representation, most frequently the short time discrete Fourier transform (STFT) domain. In this domain, the coefficients of the signal are complex-valued, and can therefore be represented by their absolute value (referred to in the literature both as STFT magnitude and STFT amplitude) and their phase.

Until recently, in STFT-based speech enhancement, the focus was on modifying only the magnitude of the STFT components, because it was generally considered that most of the insight about the structure of the signal could be obtained from the magnitude, while little information could be obtained from the phase component.

However, with the recent combination of increased expectation towards speech enhancement algorithms and increased computational power of assisted listening and speech communication devices, all options for improvements are back on the table. Therefore, researchers started re-investigating the role of the STFT phase for speech enhancement.

Here we present a growing number of algorithms which either enhance the spectral phase and/or employ phase information for an improved signal processing performance.

Phase-aware speech enhancement: audio examples

Here are some audio examples illustrating the power of phase-aware speech enhancement.

For the following audio examples the spectral phase of voiced speech sounds is estimated from the noise corrupted recoding using [KG14]. This phase estimate is then used for the phase-aware clean speech amplitude estimator [GK13] up to 4 kHz. At higher frequencies and in unvoiced speech the clean amplitude is estimated using [BKM08], which is the phase-blind counterpart to [GK13]. Furthermore, the estimated amplitude is combined with the estimated phase for signal reconstruction up to 1.5 kHz. As a reference, we also provide the result of a traditional, phase-blind amplitude enhancement scheme [BKM08].

References

[KG14] Martin Krawczyk, Timo Gerkmann, "STFT Phase Reconstruction in Voiced Speech for an Improved Single-Channel Speech Enhancement", IEEE/ACM Trans. Audio, Speech, Language Proc., 2014, Vol. 22, No. 12, pp. 1931-1940, Dec. 2014.

[GK13] Timo Gerkmann, Martin Krawczyk, "MMSE-Optimal Spectral Amplitude Estimation Given the STFT-Phase", IEEE Signal Processing Letters, Vol. 20, No. 2, pp. 129-132, Feb. 2013.

[BKM08] Colin Breithaupt, Martin Krawczyk, Rainer Martin, "Parameterized MMSE spectral magnitude estimation for the enhancement of noisy speech", IEEE Int. Conf. Acoustics, Speech, Signal Processing, Las Vegas, NV, USA, Apr. 2008.

Clean Speech Noisy Speech
phase-aware amplitude + reconstr. phase phase-aware amplitude + noisy phase
phase-blind amplitude + noisy phase

Example that illustrates the relevance of the spectral phase

As an illustration emphasizing the power of phase, it is interesting to remark that, from a particular magnitude spectrogram, it is possible to reconstruct virtually any time domain signal with a carefully crafted phase. For instance, one can derive a magnitude spectrogram from that of a speech signal such that it yields either a speech signal similar to the original or a piece of rock music, depending on the choice of the phase. The point here is to exploit the inconsistency between magnitude and phase to make contributions of neighboring frames cancel each other just enough to reconstruct the energy profile of the target sound.

More details and audio examples can be found here.

Reference

[LR11] J. Le Roux, “Phase-controlled sound transfer based on maximally-inconsistent spectrograms,” in Acoustical Society of Japan Spring Meeting, no. 1-Q-51, Mar. 2011.

Phase-Aware Estimation of Complex Spectral Coefficients

A way to jointly estimate magnitudes and phases is to derive a joint MMSE estimator of magnitudes and phases directly in the STFT domain when an uncertain initial phase estimate is available. This phase-aware complex estimator is referred to as the Complex estimator with Uncertain Phase (CUP) [G14a,G14b]. The initial phase estimate can be obtained by an estimator based on signal characteristics, such as the sinusoidal model based approach [KG14]. The resulting magnitude estimate is a non-linear trade-off between a phase-blind and a phase-sensitive magnitude estimator, while the resulting phase is a trade-off between the noisy phase and the initial phase estimate. These trade-offs are controlled by the uncertainty of the initial phase estimate, avoid processing artifacts and lead to an improvement in predicted speech quality [G14a].

Audio examples can be found here.

Reference

[G14a] Timo Gerkmann, "Bayesian estimation of clean speech spectral coefficients given a priori knowledge of the phase", IEEE Trans. Signal Processing, Vol. 62, No. 16, pp. 4199-4208, Aug. 2014.

[G14b] Timo Gerkmann, “MMSE-optimal enhancement of complex speech coefficients with uncertain prior knowledge of the clean speech  phase,” in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Florence, Italy, May 2014.

Phase-Aware Spectral Amplitude Estimation

An estimate of the speech phase, e.g. obtained using a harmonic-model [KG14], can also be employed as additional information in speech amplitude estimation. For this, we derive an optimal Bayesian estimator of the clean speech spectral amplitudes when the clean speech phase is given. We show that the additional information of the phase provides new means to distinguish between noise outliers and speech.

More details and audio examples can be found here.

Reference

[GK13] Timo Gerkmann, Martin Krawczyk, "MMSE-Optimal Spectral Amplitude Estimation Given the STFT-Phase", IEEE Signal Processing Letters, Vol. 20, No. 2, pp. 129-132, Feb. 2013. 

Hamonic Model Based Phase Estimation

With the help of a harmonic model, the STFT phase of clean voiced speech sounds is estimated based only on a noisy observation and an estimate of the fundamental frequency. These phase estimates are for example employed for phase-aware spectral amplitude estimation and the estimation of the complex clean speech spectral coefficients.

Code and audio examples can be found here.

Reference

[KG14] Martin Krawczyk, Timo Gerkmann, "STFT Phase Reconstruction in Voiced Speech for an Improved Single-Channel Speech Enhancement", IEEE/ACM Trans. Audio, Speech, Language Proc., 2014, Vol. 22, No. 12, pp. 1931-1940, Dec. 2014

[KG12] Martin Krawczyk, Timo Gerkmann, "STFT Phase Improvement for Single Channel Speech Enhancement", Int. Workshop Acoust. Signal Enhancement (IWAENC), Aachen, Germany, Sep. 2012.

Consistent Wiener Filtering

While a classical Wiener filter changes only the magnitudes in the STFT domain,separately for each time-frequency point, the consistent Wiener filter takes the relationship between STFT coefficients across time and frequency into account, which modifies both the magnitude and the phase of the noisy observation to obtain the separated speech. Optimization is performed directly on the signal in the time domain or jointly on phase and magnitude in the complex time-frequency domain, through a conjugate gradient method with a well-chosen preconditioner. Thanks to this joint optimization, the consistent Wiener filter can lead to significantly better separation performance compared to the classicalWiener filter and other methods that rely on variance estimates to reconstruct speech.

More details and code can be found here.

Some audio examples can be found here.

Reference

[LRV13] J. Le Roux and E. Vincent, “Consistent Wiener filtering for audio source separation,” IEEE Signal Processing Letters, vol. 20, no. 3, pp. 217–220, Mar. 2013.

Webmaster (timo.g/berkmann@ukool.de99) (Changed: 2020-01-23)