The goal of speech enhancement in natural environments is to enhance speech signals which have been degraded by real-world interferers. Typical interferers may include speech babble in a bar, engine noise in a car, traffic noise in general and wind noise in an outdoor environment.
Speech enhancement aims to increase the intelligibility of speech, thus reducing word error rates in a speech recognition context, and also aims to increase perceived signal quality, making it more comfortable to listen to enhanced speech for longer periods of time. Speech enhancement is both a relevant and a difficult task. Its importance arises from its many practical applications, for instance in mobile communications and hearing aids. Its difficulty arises from the fact that real-world interferers are often non-stationary and speech-like, inducing a varying and possibly significant spectral overlap.
We will first introduce the general setting of single-channel speech enhancement, then review sparse coding as well as dictionary learning, and finally bring everything together to enhance speech in the presence of real-world interferers. The following audio clip is an example recording of degraded speech, which we will be able to enhance with the method discussed in this blog post.
We consider a one-to-one conversation in a natural environment recorded by a single microphone. This setting can be formalized as a linear additive mixture of underlying clean speech and interferer,

$$x(t) = s(t) + n(t),$$

where $x(t)$ denotes the time-domain degraded speech, and $s(t)$ and $n(t)$ denote the underlying time-domain clean speech and interferer signals, respectively.
The goal of speech enhancement is to separate the observed mixture of degraded speech into its underlying components. This under-determined problem can be solved by introducing prior knowledge in the form of signal models, more specifically in the form of learned speech and interferer dictionaries.
To enhance the mixture of degraded speech, the time-domain signal is first transformed into a suitable feature space. One possible choice is the short-time Fourier transform (STFT) magnitude domain, where mixture additivity holds approximately. After the feature transformation, overlapping blocks are extracted and vectorized; these vectors constitute the individual signal observations over time.
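A minimal sketch of such a feature transformation, assuming an illustrative frame length, hop size and Hann window (none of these parameters are taken from the original publications):

```python
import numpy as np

def stft_magnitude(x, frame_len=512, hop=128):
    """Compute STFT magnitude features of a time-domain signal.

    Frame length, hop size and the Hann window are illustrative
    choices, not taken from the original papers.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One column per time frame: magnitude of the one-sided spectrum.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# A 1 s signal at 16 kHz yields a (257, 122) feature matrix.
x = np.random.randn(16000)
X = stft_magnitude(x)
```

Each column of the resulting matrix is one observation vector of magnitude frequency coefficients.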
For notational simplicity, a single observation of degraded speech in the STFT magnitude domain is denoted as $\mathbf{x}$, underlying clean speech as $\mathbf{s}$ and the interferer as $\mathbf{n}$, omitting the time index and using bold symbols to indicate a vector of magnitude frequency observations. The individual mixture signal observations are then each sparsely coded (see below) using a composite dictionary consisting of a learned speech and interferer dictionary. Since speech as well as many interferers contain structure, their structured components can be sparsely explained by few signal components from a suitably learned dictionary. This observation lies at the core of speech enhancement based on sparse coding. If both the speech and interferer dictionary are coherent only to their respective structured signal components in the mixture (and sufficiently incoherent to each other), sparse coding is able to separate the mixture into its underlying components, and at the same time suppress any unstructured components (random noise).
A speaker-independent speech dictionary has to be learned off-line (see below), since the clean speech signal is never observable during enhancement. Since speech is well-structured, however, such a pre-trained dictionary remains valid during enhancement. On the other hand, an interferer dictionary can (and in realistic scenarios has to be) learned and updated on-line during speech pauses.
The whole pipeline is visualized below in Figures 1 and 2, where dictionary learning and enhancement are shown as two separate steps.
Dictionary learning pipeline: Clean speech is transformed (‘FT’) using the short-time Fourier transform and then fed into the dictionary learning algorithm (‘DL’), resulting in a learned speech dictionary $D_s$, and similarly, an interferer dictionary $D_n$ is learned. Both dictionaries are then concatenated into the composite dictionary $D = [D_s \; D_n]$.
Speech enhancement pipeline: The degraded speech mixture is transformed (‘FT’) using the short-time Fourier transform to obtain $\mathbf{x}$, which is then sparsely coded in the composite dictionary $D$, to obtain a coding $\mathbf{c}$. This coding is then separated into two parts $\mathbf{c}_s$ and $\mathbf{c}_n$ corresponding to speech and interferer dictionary atoms, respectively. A Wiener-like filter is constructed and applied to the mixture signal (‘f’), which is then inverse transformed (‘IFT’) into the time-domain (using the mixture’s phase) to obtain an estimate of clean speech.
Sparse coding lies at the core of the enhancement method discussed here. It aims to approximate a signal observation with low error by a linear combination of only a few signal components from a pre-defined set of components (atoms from a dictionary). More formally, a $k$-sparse coding of a signal $\mathbf{x}$ in a dictionary $D$ of unit-norm signal components (atoms) is a sparse linear combination $\mathbf{c}$ of atoms approximating $\mathbf{x}$. The cardinality of the coding, $\|\mathbf{c}\|_0$, is the number of non-zero coefficients. The sparse coding problem can for instance be formulated as follows using a cardinality constraint,

$$\min_{\mathbf{c}} \|\mathbf{x} - D\mathbf{c}\|_2^2 \quad \text{subject to} \quad \|\mathbf{c}\|_0 \le k.$$
Sparse coding is a trade-off between the signal approximation error $\|\mathbf{x} - D\mathbf{c}\|_2$, the coding cardinality $\|\mathbf{c}\|_0$, as well as the size of the dictionary $D$. As previously noted, for structured signal classes, it is possible to learn dictionaries such that low approximation error can be achieved, while at the same time requiring a coding with only a few active dictionary atoms, i.e. with a low coding cardinality.
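The cardinality-constrained problem is intractable in general, but greedy methods such as orthogonal matching pursuit give approximate solutions. A minimal sketch (the exact solver used in the original papers may differ):

```python
import numpy as np

def omp(x, D, k):
    """Greedy k-sparse coding of x in dictionary D (orthogonal
    matching pursuit). A minimal sketch, not the reference solver."""
    residual, support = x.copy(), []
    for _ in range(k):
        # Select the atom most correlated with the current residual.
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        # Least-squares fit of x on the selected atoms.
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    c = np.zeros(D.shape[1])
    c[support] = coeffs
    return c
```

Because the residual is re-orthogonalized against all selected atoms in each iteration, no atom is ever selected twice.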
An ideal dictionary is coherent to its respective signal class (i.e. a speech dictionary is coherent to the speech signal class, and an interferer dictionary is coherent to an interferer signal class), and at the same time is incoherent to all other signal classes. To ensure this property, dictionaries have to be learned from training data (an analytic dictionary, for instance a wavelet basis, does not fulfill this property).
A dictionary learning algorithm iteratively adapts an initial dictionary (consisting of unit norm atoms sampled randomly from the unit sphere) to a particular signal class, such that observations from this signal class can be sparsely coded in the dictionary with low error.
Formally, a dictionary learning algorithm factorizes a data matrix $X$ (whose columns constitute observations in our feature space) into a dictionary matrix $D$ and coding matrix $C$,

$$\min_{D,\, C} \|X - DC\|_F^2,$$
subject to a sparsity constraint on the columns of $C$ and a unit norm constraint on the columns of $D$, where $\|\cdot\|_F$ denotes the Frobenius norm. Since both $D$ and $C$ are unknown, this minimization is non-convex and also intractable due to the sparsity constraint on $C$. To obtain a locally optimal solution, algorithms based on alternating minimization with respect to $D$ and $C$ can be used, for instance K-SVD [6].
First, the dictionary $D$ is initialized with atoms sampled from the unit sphere. Then, $C$ and $D$ are updated alternatingly. First, a coding update on $C$ is performed. Since the codings of individual observations are column-separable, separate sparse coding problems can be solved to obtain an updated coding matrix $C$. Then, the dictionary matrix $D$ is updated as follows. By separating the contribution of the $j$-th atom,

$$E_j = X - \sum_{i \neq j} \mathbf{d}_i \mathbf{c}_i^{\mathsf{T}},$$

where $\mathbf{c}_i^{\mathsf{T}}$ denotes the $i$-th row of $C$, an updated approximation of the atom $\mathbf{d}_j$ can be obtained using a rank one approximation to the residual matrix $E_j$. At the same time, this also yields an updated coding row $\mathbf{c}_j^{\mathsf{T}}$, which ensures that the coding coefficients are adapted to the new atom. Restricting the update to those observations which actually use atom $j$ keeps the locations of the non-zero coefficients in place. Alternating updates are performed until a stopping criterion is reached.
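One alternating update can be sketched as follows, with a minimal greedy coder inlined for self-containment; this is an illustrative K-SVD-style sketch, not the reference implementation:

```python
import numpy as np

def sparse_code(x, D, k):
    """Minimal OMP-style k-sparse coder used by the sketch below."""
    residual, support = x.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    c = np.zeros(D.shape[1])
    c[support] = coeffs
    return c

def ksvd_step(X, D, k):
    """One alternating update: sparse-code every column of X, then
    refine each atom via a rank-one approximation of its residual
    matrix (K-SVD style)."""
    # Coding update: columns are independent sparse coding problems.
    C = np.stack([sparse_code(X[:, i], D, k)
                  for i in range(X.shape[1])], axis=1)
    # Dictionary update: one atom at a time.
    for j in range(D.shape[1]):
        used = np.nonzero(C[j, :])[0]   # observations that use atom j
        if used.size == 0:
            continue
        # Residual of those observations with atom j's contribution removed.
        E = X[:, used] - D @ C[:, used] + np.outer(D[:, j], C[j, used])
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, j] = U[:, 0]               # updated unit-norm atom
        C[j, used] = s[0] * Vt[0]       # coefficients adapted to the new atom
    return D, C
```

Restricting the rank-one update to the columns in `used` is what keeps the sparsity pattern of $C$ fixed during the dictionary update.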
The success of dictionary learning is measured by the ability of a trained dictionary to sparsely code, with low approximation error, observations from its signal class that were not seen during training.
The enhancement of degraded speech aims to obtain an estimate $\hat{s}(t)$ of underlying clean speech from the mixture signal $x(t)$, such that the residual norm $\|\hat{s}(t) - s(t)\|$ is significantly smaller than $\|x(t) - s(t)\|$.
For this purpose, the mixture signal $\mathbf{x}$ is sparsely coded in the concatenation of a learned speech dictionary and interferer dictionary, i.e. the sparse coding problem introduced earlier is solved using the composite dictionary $D = [D_s \; D_n]$,

$$\min_{\mathbf{c}} \|\mathbf{x} - D\mathbf{c}\|_2^2 \quad \text{subject to} \quad \|\mathbf{c}\|_0 \le k.$$
The resulting coding vector $\mathbf{c}$ linearly combines speech and interferer atoms to approximate the mixture, and can be interpreted as $\mathbf{c}^{\mathsf{T}} = [\mathbf{c}_s^{\mathsf{T}} \; \mathbf{c}_n^{\mathsf{T}}]$, where $\mathbf{c}_s$ contains the weights corresponding to speech dictionary atoms, and where $\mathbf{c}_n$ contains the weights corresponding to interferer dictionary atoms.
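The composite coding and its separation amount to simple index bookkeeping; all dictionary sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes: 257 frequency bins, 100 speech and 60 interferer atoms.
D_s = rng.standard_normal((257, 100))
D_n = rng.standard_normal((257, 60))
D = np.hstack([D_s, D_n])            # composite dictionary

c = rng.standard_normal(160)         # a coding of the mixture in D
c_s, c_n = c[:D_s.shape[1]], c[D_s.shape[1]:]

s_hat = D_s @ c_s                    # speech component estimate
n_hat = D_n @ c_n                    # interferer component estimate
# Together the two halves reproduce the full reconstruction D @ c.
```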
An estimate of the underlying clean speech signal STFT magnitude (similarly for the interferer) can be obtained as

$$\hat{\mathbf{s}} = D_s \mathbf{c}_s,$$
which, due to the separation of the coding into $\mathbf{c}_s$ and $\mathbf{c}_n$, requires that speech signal components are explained only (or mostly) by speech dictionary atoms, and interferer signal components only (or mostly) by interferer dictionary atoms. Speech enhancement using this approach can therefore suffer from two separate and competing errors. An overly sparse coding introduces source distortion, where the underlying clean speech signal is explained by too few atoms. An overly dense coding, while avoiding source distortion, introduces source confusion, where speech signal components are increasingly explained by atoms from the interferer dictionary, and vice versa.
In order to obtain the time-domain estimate of the clean speech signal, first a Wiener-like filter is constructed from the estimated speech and interferer components and applied to the degraded speech mixture observation, in order to obtain a magnitude estimate of clean speech,

$$\hat{\mathbf{s}} = \big( (D_s \mathbf{c}_s) \oslash (D_s \mathbf{c}_s + D_n \mathbf{c}_n) \big) \odot \mathbf{x},$$

where $\oslash$ and $\odot$ denote element-wise division and multiplication, respectively. Finally, the magnitude estimate is inverted back into the time-domain by using the phase of the mixture signal. Using the discussed speech enhancement method, we are now able to enhance the example audio clip introduced at the beginning:
Before speech enhancement
After speech enhancement
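The element-wise filtering step described above can be sketched as follows; the small constant guarding the division is an added safeguard, not part of the original formulation:

```python
import numpy as np

def wiener_like_filter(x_mag, s_hat, n_hat, eps=1e-12):
    """Apply the element-wise Wiener-like mask built from estimated
    speech and interferer magnitude components. The eps term guards
    against division by zero (an added safeguard)."""
    gain = s_hat / (s_hat + n_hat + eps)
    return gain * x_mag
```

The mask rescales each time-frequency bin of the mixture by the estimated fraction of speech energy in that bin, so bins dominated by the interferer estimate are strongly attenuated.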
For further details about speech enhancement based on sparse coding and dictionary learning, as well as references to previous and related work, have a look at [1, 2, 3].
[1] Christian Sigg, Tomas Dikk, Joachim Buhmann, Speech Enhancement with Sparse Coding in Learned Dictionaries, IEEE International Conference on Acoustics, Speech and Signal Processing, 2010.
[2] Christian Sigg, Tomas Dikk, Joachim Buhmann, Speech Enhancement Using Generative Dictionary Learning, IEEE Transactions on Audio, Speech, and Language Processing, 2012.
[3] Christian Sigg, Tomas Dikk, Joachim Buhmann, Learning Dictionaries With Bounded Self-Coherence, IEEE Signal Processing Letters, 2012.
[4] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, Least Angle Regression, The Annals of Statistics, 2004.
[5] Philipos Loizou, Speech Enhancement: Theory and Practice, Taylor and Francis, 2007.
[6] Michal Aharon, Michael Elad, Alfred Bruckstein, Yana Katz, K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representations, IEEE Transactions on Signal Processing, 2006.
[7] Geoffrey Davis, Stephane Mallat, Zhifeng Zhang, Adaptive Time-Frequency Decompositions with Matching Pursuits, Optical Engineering, 1994.