Speech Enhancement

Sparse Coding and Dictionary Learning

19 December 2017
| |

The goal of speech enhancement in natural environments is to enhance speech signals which have been degraded by real-world interferers. Typical interferers may include speech babble in a bar, engine noise in a car, traffic noise in general and wind noise in an outdoor environment. 

This blog post aims to give you an introduction to a speech enhancement method based on sparse coding and dictionary learning as described in [1, 2, 3].

Speech enhancement aims to increase the intelligibility of speech, thus reducing word error rates in a speech recognition context, and also aims to increase perceived signal quality, making it more comfortable to listen to enhanced speech for longer periods of time. Speech enhancement is both a relevant as well as a difficult task. Its importance arises from the many practical applications, for instance in mobile communications and hearing aids. Its difficulty arises from the fact that real-world interferers are often non-stationary and speech-like, inducing a varying and possibly significant spectral overlap.

Overview

We will first introduce the general setting of single-channel speech enhancement, then review sparse coding as well as dictionary learning, and finally bring everything together to enhance speech in the presence of real-world interferers. The following audio clip is an example recording of degraded speech, which we will be able to enhance with the method discussed in this blog post.

 

Setting

We consider a one-to-one conversation in a natural environment recorded by a single microphone. This setting can be formalized as a linear additive mixture x(n) of underlying clean speech and interferer,

    \begin{align*} x(n) = s(n) + i(n) \end{align*}

where x(n) denotes the time-domain degraded speech, s(n) and i(n) denote the underlying time-domain clean speech and interferer signals, respectively.

The goal of speech enhancement is to separate the observed mixture of degraded speech x(n) into its underlying components. This under-determined problem can be solved by introducing prior knowledge in the form of signal models, more specifically in the form of learned speech and interferer dictionaries.

Method Overview

To enhance the mixture of degraded speech, the time-domain signal is first transformed into a suitable feature space. One possible choice is the short-time Fourier transform (STFT) magnitude domain, where mixture additivity holds in approximation [5]. After the feature transformation, overlapping blocks are extracted and vectorized, these vectors constitute the individual signal observations over time.

    \begin{align*} | X(\omega, n) | \approx | S(\omega, n) | + | I(\omega, n) | \\ \end{align*}

For notational simplicity, a single observation of degraded speech in the STFT magnitude domain is denoted as \mathbf{x} \in \mathbb{R}^{D}, underlying clean speech as \mathbf{s} \in \mathbb{R}^{D} and the interferer as \mathbf{i} \in \mathbb{R}^{D}, omitting the time index n and using bold symbols to indicate a vector of magnitude frequency observations. The individual mixture signal observations are then each sparsely coded (see below) using a composite dictionary consisting of a learned speech and interferer dictionary. Since speech as well as many interferes contain structure, their structured components can be sparsely explained by few signal components from a suitably learned dictionary. This observation lies at the core of speech enhancement based on sparse coding. If both the speech and interferer dictionary are coherent only to their respective structured signal components in the mixture (and sufficiently incoherent to each other), sparse coding is able to separate the mixture into its underlying components, and at the same time suppress any unstructured components (random noise).

A speaker independent speech dictionary has to be learned off-line (see below), since the clean speech signal is never observable during enhancement. Since speech is well-structured however, such a pre-trained dictionary remains valid during enhancement. On the other hand, an interferer dictionary can (and in realistic scenarios has to be be) learned and updated on-line during speech pauses.

The whole pipeline is visualized below in Figures 1 and 2, where dictionary learning and enhancement are shown as two separate steps.

Dictionary Learning

Dictionary Learning

Dictionary learning pipeline: Clean speech is transformed (‘FT’) using the short-time Fourier transform and then fed into the dictionary learning algorithm (‘DL’), resulting in a learned speech dictionary \mathbf{D}^{(s)}, and similarly, an interferer dictionary \mathbf{D}^{(i)} is learned. Both dictionaries are then concatenated into the composite dictionary \mathbf{D} = \lbrack \mathbf{D}^{(s)} \, \mathbf{D}^{(i)} \rbrack.

Speech Enhancement

Speech Enhancement

Speech enhancement pipeline: The degraded speech mixture is transformed (‘FT’) using the short-time Fourier transform to obtain \mathbf{x}, which is then sparsely coded in the composite dictionary \mathbf{D} = \lbrack \mathbf{D}^{(s)} \, \mathbf{D}^{(i)} \rbrack, to obtain a coding \mathbf{c} = \lbrack \mathbf{c}^{(s)}; \mathbf{c}^{(i)} \rbrack. This coding is then separated into two parts \mathbf{c}^{(s)} and \mathbf{c}^{(i)} corresponding to speech and interferer dictionary atoms, respectively. A Wiener-like filter is constructed and applied to the mixture signal (‘f’), which is then inverse transformed (‘IFT’) into the time-domain (using the mixture’s phase) to obtain an estimate of clean speech.

Sparse Coding

Sparse coding lies at the core of speech enhancement based on sparse coding and dictionary learning and aims to approximate a signal observation with low error by using a linear combination of only a few signal components from a pre-defined set of components (atoms from a dictionary). More formally, a K-sparse coding of a signal \mathbf{x} \in \mathbb{R}^{D} in a dictionary \mathbf{D} \in \mathbb{R}^{D \times L} of unit-norm signal components (atoms) is a sparse linear combination of K \ll L atoms approximating \mathbf{x}. The cardinality of the coding \mathbf{c} \in \mathbb{R}^{L}, || \mathbf{c} || _{0} = K is the number of non-zero coefficients. The sparse coding problem can for instance be formulated as follows using a cardinality constraint,

    \begin{gather*} c^{*} = \arg \min _{c} \, \lVert \mathbf{x} - \mathbf{D}\mathbf{c} \rVert_{2} \\ \text{s.t. } \lVert \mathbf{c} \rVert_{0} \leq K \end{gather*}

Greedy approximations can be obtained with e.g. Orthogonal Matching Pursuit (OMP) [7] and, by relaxing the \ell_{0} norm to the \ell_{1} norm, with Least Angle Regression (LARS) [4] or LARC [2].

Sparse coding is a trade-off between the signal approximation error \lVert \mathbf{x} - \mathbf{D}\mathbf{c} \rVert_{2}, the coding cardinality K, as well as the size L of the dictionary \mathbf{D} \in \mathbb{R}^{D \times L}. As previously noted, for structured signal classes, it is possible to learn dictionaries such that low approximation error can be achieved, while at the same time requiring a coding with only a few active dictionary atoms, i.e. with a low coding cardinality.

Dictionary Learning

An ideal dictionary is coherent to its respective signal class (i.e. a speech dictionary is coherent to the speech signal class, and an interferer dictionary is coherent to an interferer signal class), and at the same time is incoherent to any other signal classes. In order to ensure this property, dictionaries have to be learned from training data (an analytic dictionary, for instance a Wavelet basis, does not fulfill this property).

Spectrograms of speech enhanced with various methods.

Spectrograms of speech enhanced with various methods. (Simon Ziffermayer / Zühlke)

A dictionary learning algorithm iteratively adapts an initial dictionary (consisting of unit norm atoms sampled randomly from the unit sphere) to a particular signal class, such that observations from this signal class can be sparsely coded in the dictionary with low error.

Formally, a dictionary learning algorithm factorizes a data matrix \mathbf{X} \in \mathbb{R}^{D \times N} (where columns constitute observations in our feature space) into a dictionary matrix \mathbf{D} \in \mathbb{R}^{D \times L} and coding matrix \mathbf{C} \in \mathbb{R}^{L \times N},

    \begin{align*} \arg \min _{\mathbf{D}, \mathbf{C}} \, \lVert \mathbf{X} - \mathbf{D}\mathbf{C} \rVert _{F}^{2} \end{align*}

subject to a sparsity constraint on the columns of \mathbf{C} and a unit \ell_{2} norm constraint on the columns of \mathbf{D}, where \lVert \, . \, \rVert _{F} denotes the Frobenius norm. Since both \mathbf{D} and \mathbf{C} are unknown, this minimization is non-convex and also intractable due to the sparsity constraint on \mathbf{C}. To obtain a locally optimal solution, algorithms based on alternating minimization with respect to \mathbf{C} and \mathbf{D} can be used, for instance K-SVD [6].

First, the dictionary \mathbf{D} is initialized with atoms sampled from the unit \ell_{2} sphere. Then, alternatingly, \mathbf{C} and \mathbf{D} are updated. First, a coding update on \mathbf{C} is performed. Since the codings of individual observations are column-separable, n separate sparse coding problems can be solved to obtain an updated coding matrix \mathbf{C}. Then, the dictionary matrix is updated as follows. By separating the contribution of the l-th atom \mathbf{d}_{(:,l)}

    \begin{align*} \lVert \mathbf{X} - \mathbf{D}\mathbf{C} \rVert _{F}^{2} & \; = \; \bigg\lVert \bigg( \mathbf{X} - \sum_{m \neq l} \mathbf{d}_{(:,m)} \mathbf{c}_{(m,:)} \bigg) - \mathbf{d}_{(:,l)} \mathbf{c}_{(l,:)} \bigg\rVert _{F}^{2} \\ & \; = \; \bigg\lVert ( \mathbf{R}^{(l)} - \mathbf{d}_{(:,l)} \mathbf{c}_{(l,:)} \bigg\rVert _{F}^{2} \end{align*}

an updated approximation of \mathbf{d}_{(:,l)} can be obtained using a rank one approximation to the residual matrix \mathbf{R}^{(l)}. At the same time, this also yields an updated coding \mathbf{c}_{(l,\mathcal{N})} which ensures that the coding coefficients are adapted to the new atom. The sub-setting using \mathcal{N} keeps the locations of the non-zero coefficients at the same place. Alternating updates are performed until a stopping criterion is reached.

The success of dictionary learning is measured by the ability of a trained dictionary to sparsely code observations from its corresponding signal class not seen during training with low approximation error.

Enhancement

The enhancement of degraded speech \mathbf{x} aims to obtain an estimate \mathbf{\hat{s}} of underlying clean speech from the mixture signal \mathbf{x}, such that the residual norm \lVert \mathbf{\hat{s}} - \mathbf{s} \rVert _{2} is significantly smaller than \lVert \mathbf{x} - \mathbf{s} \rVert _{2}.

For this purpose, the mixture signal \mathbf{x} is sparsely coded in the concatenation of a learned speech dictionary and interferer dictionary, i.e. the sparse coding problem introduced earlier is solved using the composite dictionary \mathbf{D} = \lbrack \mathbf{D}^{(s)} \, \mathbf{D}^{(i)} \rbrack,

    \begin{gather*} \arg \min _{\mathbf{c}} \, \lVert \mathbf{x} - \mathbf{D}\mathbf{c} \rVert _{2} \\ = \arg \min _{\mathbf{c}^{(s)}, \mathbf{c}^{(i)}} \, \lVert \mathbf{x} - \lbrack \mathbf{D}^{(s)} \, \mathbf{D}^{(i)} \rbrack \lbrack \mathbf{c}^{(s)}; \mathbf{c}^{(i)} \rbrack \rVert _{2} \\ \text{s.t. } \lVert \lbrack \mathbf{c}^{(s)}; \mathbf{c}^{(i)} \rbrack \rVert _{1} \leq \theta \end{gather*}

The resulting coding vector \mathbf{c} linearly combines speech and interferer atoms to approximate the mixture, and can be interpreted as \mathbf{c} = \lbrack \mathbf{c}^{(s)}; \mathbf{c}^{(i)} \rbrack, where \mathbf{c}^{(s)} contains weights corresponding to speech dictionary atoms, and where \mathbf{c}^{(i)} contains weights corresponding to interferer dictionary atoms.

An estimate \mathbf{\hat{s}} of the underlying clean speech signal STFT magnitude (similarly for the interferer) can be obtained as

    \begin{align*} \mathbf{\hat{s}} = \mathbf{D}^{(s)} \mathbf{c}^{(s)} \end{align*}

which, due to the separation of the coding \mathbf{c} into \mathbf{c}^{(s)} and \mathbf{c}^{(i)}, requires that speech signal components are explained only (or mostly) using speech dictionary atoms, and interferer signal components are explained using only (or mostly) interferer dictionary atoms. Thus, speech enhancement using the above approach can result in two separate and competing errors. A too sparse coding introduces source distortion, where the underlying clean speech signal is simply explained by too few atoms. A too dense coding however, while avoiding source distortion, introduces source confusion, where speech signal components are increasingly explained also by components from the interferer dictionary, and vice-versa.

In order to obtain the time-domain estimate of the clean speech signal, first a Wiener-like filter is constructed from the estimated speech \mathbf{\hat{s}} and interferer components \mathbf{\hat{i}} and applied to the degraded speech mixture observation in order to obtain a magnitude estimate of clean speech

    \begin{align*} \mathbf{s}_{enhanced} = \mathbf{\hat{s}} \oslash (\mathbf{\hat{s}} + \mathbf{\hat{i}}) \otimes \mathbf{x} \end{align*}

where \oslash and \otimes denote element-wise division and multiplication, respectively. Finally, the magnitude estimate is inverted back into the time-domain by using the phase of the mixture signal. By using the discussed speech enhancement method, we are now able to enhance the example audio clip introduced at the beginning:

Before speech enhancement

After speech enhancement

For further details about speech enhancement based on sparse coding and dictionary learning, as well as references to previous and related work, have a look at [1, 2, 3].

 

References
[1] Christian Sigg, Tomas Dikk, Joachim Buhmann, Speech Enhancement with Sparse Coding in Learned Dictionaries, IEEE International Conference on Acoustics, Speech and Signal Processing, 2010.
[2] Christian Sigg, Tomas Dikk, Joachim Buhmann, Speech Enhancement Using Generative Dictionary Learning, IEEE Transactions on Audio, Speech, and Language Processing, 2012.
[3] Christian Sigg, Tomas Dikk, Joachim Buhmann, Learning Dictionaries With Bounded Self-Coherence, IEEE Signal Processing Letters, 2012.
[4] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, Least Angle Regression, The Annals of Statistics, 2004.
[5] Philipos Loizou, Speech Enhancement: Theory and Practice, Taylor and Francis, 2007.
[6] Michal Aharon, Michael Elad, Alfred Bruckstein, Yana Katz, K-SVD: An Algorithm for Designing of Overcomplete Dictionaries for Sparse Representations, IEEE Transactions on Signal Processing, 2006.
[7] Geoffrey Davis, Stephane Mallat, Zhifeng Zhang, Adaptive Time-Frequency Decompositions with Matching Pursuits, Journal on Optical Engineering, 1994.

Advanced Data Scientist

Tomas Dikk

Comments (0)

×

Sign up for our Updates

Sign up now for our updates.

This field is required
This field is required
This field is required

I'm interested in:

Select at least one category
You were signed up successfully.