As mentioned before, unsupervised learning tasks support the main task of supervised learning by modeling the input distribution \(p(x)\) in one way or another. Denoising is no exception. When we use denoising as an auxiliary task, we are not interested in denoising itself, nor are we interested in drawing samples from \(p(x)\) or computing the probability of the data. What we want is to extract features that describe the data and which are useful for our primary task of supervised learning.

In denoising, we replace the probabilistic modeling task with a function approximation task. Function approximation is what deep neural networks are good at and that is why denoising is such a great match with deep learning. It is also simpler to learn the denoising function compared to probabilistic graphical models.

# What is denoising?

Let’s introduce some notation. We draw samples from the input distribution \(p(x)\) and call those \(x\). After corrupting the samples with noise we call them \(\tilde x\). Therefore, we can denote the distribution of corrupted data with \(p(\tilde x)\). The task of denoising is to recover the original \(x\) from the corrupted \(\tilde x\) using a function \(g\): \[x \approx g(\tilde x) .\] The function \(g\) is called the denoising function and it tries to invert the corruption. The noise is typically Gaussian noise or missing values (dropout). A neural network can learn to approximate \(g(\tilde x)\) by minimizing the mean squared error \[C=||x-g(\tilde x)||^2.\] This is what denoising autoencoders do: they take samples \(x\) from a dataset, corrupt them, and try to minimize the cost. Note that the function \(g(\tilde x)\) is deterministic while the corruption is random. Therefore, it is impossible to recover the original \(x\) exactly; the best the denoising function can do is output an expectation, which means it operates on the level of distributions.
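To make this concrete, here is a minimal denoising autoencoder in plain NumPy: a tiny one-hidden-layer network trained by gradient descent to map corrupted samples \(\tilde x\) back to the clean \(x\). This is only a sketch; the dataset, network sizes, and hyperparameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D dataset: two tight clusters around (-1, -1) and (+1, +1).
x = np.concatenate([rng.normal(-1.0, 0.1, size=(100, 2)),
                    rng.normal(+1.0, 0.1, size=(100, 2))])

# One-hidden-layer denoising autoencoder, trained by plain gradient descent.
d_in, d_h = 2, 8
W1 = rng.normal(0.0, 0.5, (d_in, d_h)); b1 = np.zeros(d_h)
W2 = rng.normal(0.0, 0.1, (d_h, d_in)); b2 = np.zeros(d_in)
sigma_n, lr = 0.5, 0.05

for _ in range(3000):
    x_tilde = x + rng.normal(0.0, sigma_n, x.shape)  # corrupt the input
    h = np.tanh(x_tilde @ W1 + b1)                   # encoder
    x_hat = h @ W2 + b2                              # denoiser output g(x_tilde)
    err = x_hat - x                                  # reconstruct the CLEAN x
    # Gradients of the cost ||x - g(x_tilde)||^2 (up to a constant factor).
    gW2 = h.T @ err / len(x);      gb2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h**2)
    gW1 = x_tilde.T @ dh / len(x); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

# Denoise a fresh corruption and measure the reconstruction error.
x_tilde = x + rng.normal(0.0, sigma_n, x.shape)
mse = np.mean((np.tanh(x_tilde @ W1 + b1) @ W2 + b2 - x) ** 2)
```

After training, the network has learned to pull corrupted points back towards the clusters, so its reconstruction error is well below that of a trivial constant predictor.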

The optimal denoising function \(g(\tilde x)\) can be pretty complex, but we use a machine learning model capable of representing very wide and complex classes of functions while being relatively easy to train.

# Why add noise?

One common view of denoising is as simply regularizing an autoencoder, in the same way that additive Gaussian noise or dropout is used to regularize purely supervised models. We argue that the role of denoising is more than just regularization: it allows capturing the important structure (or latent variables) of the data distribution \(p(x)\), which can boost the performance in primary tasks such as regression or classification.

Let’s first recall classic autoencoders, which learn a data representation \(h = \phi(x)\) so as to minimize the mean squared error (MSE) of reconstruction: \[\min ||x - \psi(h)||^2 .\] Such autoencoders pull information from different parts of an object \(x\) and learn features which are useful for performing the reconstruction task. For example, if \(x\) is an image of a hand-written digit, an autoencoder learns features that represent the shapes of digits:

*Autoencoders pull information from all over the object in order to perform reconstruction.*

One potential problem with classic autoencoders is that they can simply copy each part of the input to the corresponding part of the output (as shown with the red line in the previous figure) unless the model is restricted somehow. One way to fix this is to forbid the model from seeing, in its input, the part it is asked to reconstruct.

*In regression, the model cannot copy the input directly to the output and therefore it is forced to find useful features.*

This kind of modeling is done, for example, in video processing, where the task is to predict the next frame given the previous frames of a video. Another example is NADE, which models the data distribution as a product of conditional distributions \(p(x_i~|~x_{<i})\). This is better, but the problem with the MSE cost still exists: it ignores higher-order features (such as variance) of the modeled distributions. In the following example, the only thing that discriminates the digit from the background is the variance of the pixel intensities, as the expected intensities are equal throughout the image.

*Models minimizing MSE fail to represent higher-order features.*

Therefore, the variances of the pixel intensities are useful features, for example, for performing the classification task. However, even if these features were present in the latent representation \(h\), they would be totally useless for decreasing the MSE cost \(||x - \psi(h)||^2\). Such models essentially only represent the mean of the conditional distribution \(p(x~|~h)\) and fail to learn higher-order features. This is too restrictive in many real-world tasks.

This problem can be fixed by explicitly modeling the mean and the variance of the conditional distributions, for example by minimizing a cost function like \[\log \psi_{\sigma^2}(h) + \frac{||x - \psi_\mu(h)||^2}{\psi_{\sigma^2}(h)} \, .\] However, scaling this to modeling even higher-order features (beyond the mean and the variance) is likely to require a full probabilistic model \(p(x, h)\). This is the direction taken by variational autoencoders, for example.
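This cost is the Gaussian negative log-likelihood up to an additive constant. A small sketch (with made-up data) shows how it differs from plain MSE: with the mean fixed, the cost is minimized by the true variance, so a model trained on it is forced to represent the variance.

```python
import numpy as np

def gaussian_cost(x, mu, var):
    # Per-sample cost  log var + (x - mu)^2 / var  (the Gaussian negative
    # log-likelihood up to an additive constant).
    return np.log(var) + (x - mu) ** 2 / var

rng = np.random.default_rng(0)
x = rng.normal(2.0, 0.5, size=10_000)  # data with mean 2 and variance 0.25

# With the correct mean fixed, scan candidate variances: unlike plain MSE,
# this cost is minimized near the true variance 0.25.
variances = np.linspace(0.05, 1.0, 200)
costs = [gaussian_cost(x, 2.0, v).mean() for v in variances]
best_var = variances[int(np.argmin(costs))]
```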

Fortunately, learning higher-order features happens naturally with denoising, and the goal of this blog post is to explain why.

*Denoising autoencoders can represent higher-order features (such as variance).*

## Why denoising models the data distribution

In the next blog post, we will show that there are one-to-one correspondences both between the data distribution \(p(x)\) and the corrupted distribution \(p(\tilde x)\), and between \(p(\tilde x)\) and the optimal denoising function \(g(\tilde x)\). This means that \(p(x)\) and the optimal \(g(\tilde x)\) must have a one-to-one correspondence. Assuming additive Gaussian corruption noise with standard deviation \(\sigma_n\), the equation we end up with looks like this: \[g(\tilde x) = \tilde x + \sigma_n^2 \nabla \log p (\tilde x) .\] Let’s look at an example below that illustrates all this in 2D.
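This relation between the optimal denoiser and the score \(\nabla \log p(\tilde x)\) can be checked numerically in a case where everything is available in closed form: one-dimensional Gaussian data with Gaussian noise. The sketch below (the parameter values are made up) compares the score-based formula against a brute-force Monte Carlo estimate of the posterior mean \(E[x~|~\tilde x]\):

```python
import numpy as np

# Illustrative (made-up) parameters: data x ~ N(mu, sigma^2), noise N(0, sigma_n^2).
mu, sigma, sigma_n = 1.0, 0.8, 0.5

def log_p_tilde(xt):
    # Gaussian data plus Gaussian noise: the corrupted marginal is
    # N(mu, sigma^2 + sigma_n^2).
    s2 = sigma**2 + sigma_n**2
    return -0.5 * np.log(2 * np.pi * s2) - (xt - mu) ** 2 / (2 * s2)

def g_via_score(xt, eps=1e-5):
    # g(x~) = x~ + sigma_n^2 * d/dx~ log p(x~); score by central differences.
    score = (log_p_tilde(xt + eps) - log_p_tilde(xt - eps)) / (2 * eps)
    return xt + sigma_n**2 * score

def g_posterior_mean(xt, n=200_000, seed=1):
    # Brute-force E[x | x~]: weight prior samples x ~ p(x) by the noise
    # likelihood p(x~ | x).
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, n)
    w = np.exp(-(xt - x) ** 2 / (2 * sigma_n**2))
    return np.sum(w * x) / np.sum(w)
```

The two estimates agree: denoising optimally is the same as climbing the (scaled) gradient of the log-density of the corrupted data.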

*Optimal denoising function (r.h.s. plot) captures properties of the data distribution (l.h.s. plot).*

The input distribution \(p(x)\) is two crescents shown on the left-hand side of the figure. In the middle, we show the corrupted version \(p(\tilde x)\), which is wider because of the added Gaussian noise. On the right-hand side, arrows show the best estimate of how to move from \(\tilde x\) towards the original \(x\), that is, what the optimal \(g(\tilde x)\) should do. One thing to notice here is that the arrows are always perpendicular to the contour lines of the corrupted data distribution and point towards the higher-density regions. Thus, the shape of the denoising function reflects the shape of the corrupted data distribution \(p(\tilde x)\) which, in turn, is descriptive of the data distribution \(p(x)\).

If we increase the width of the data distribution, it also affects \(g(\tilde x)\), which means that \(g(\tilde x)\) implicitly models the distribution width:

*Optimal denoising function (r.h.s. plot) captures properties of the data distribution (l.h.s. plot).*

In this visualization, you can play around with the width of the data distribution used in this example to see how it affects \(g(\tilde x)\).

# Why do we learn useful representations by denoising?

Let’s not forget that denoising is not the goal in itself: we are not interested in \(g(\tilde x)\) but in what structure, or which hidden variables, \(g(\tilde x)\) has to discover in order to do as good a job as possible at denoising. To show that, let’s start with a very simple data distribution, a Gaussian: \(p(x)={\mathcal{N}}(\mu, \sigma^2)\). Assuming a Gaussian corruption distribution \({\mathcal{N}}(0, \sigma_n^2)\), the optimal denoising can be shown to be linear: \[g(\tilde x) = \tilde x + a (\mu - \tilde x),\] with \(a = \sigma_n^2 / (\sigma^2 + \sigma_n^2)\). Thus, in order to perform accurate denoising, the model only has to learn two parameters (\(a\) and \(\mu\)), the same number of parameters (\(\mu\) and \(\sigma\)) as the data distribution has. This means that instead of parameterizing and learning \(p(x)\), we parameterize and learn \(g(\tilde x)\). Note that by minimizing the MSE \[||x - \hat x||^2\] one can only learn the mean \(\mu\) of the distribution. When we feed in \(\tilde x\) instead of \(x\), we learn the variance \(\sigma^2\), too.
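This linear form is easy to verify empirically: the best least-squares fit of \(x\) as a function of \(\tilde x\) should recover slope \(1 - a\) and intercept \(a\mu\). A quick NumPy sketch, with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, sigma_n = 2.0, 1.0, 0.5        # made-up values for illustration
a = sigma_n**2 / (sigma**2 + sigma_n**2)  # predicted shrinkage coefficient

x = rng.normal(mu, sigma, 1_000_000)
x_tilde = x + rng.normal(0.0, sigma_n, x.size)

# Least-squares fit x = slope * x_tilde + intercept. The optimal denoiser
# g(x~) = x~ + a (mu - x~) is linear, with slope 1 - a and intercept a * mu.
slope, intercept = np.polyfit(x_tilde, x, 1)
```

The fitted slope matches \(1 - a\): the noisier the observation relative to the data spread, the more the optimal denoiser shrinks towards the mean \(\mu\).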

Let’s now consider a distribution with more parameters, a mixture of two Gaussians: \[p(x) = w_1 {\mathcal{N}}(\mu_1, \sigma^2_1) + (1 - w_1) {\mathcal{N}}(\mu_2, \sigma^2_2) .\] If both Gaussians have the same variance \(\sigma_1=\sigma_2\), the denoising function looks like a rotated sigmoid:

*Optimal denoising (r.h.s. plot) for a mixture of Gaussians with \(w_1=0.5\) and \(\sigma_1=\sigma_2=0.3\) (l.h.s. plot). Red and green dashed lines correspond to the optimal denoising function for each individual Gaussian.*

However, when the widths of the two Gaussians are different, the denoising function actually has a linear component (as in the simple Gaussian case) for each of the individual Gaussians and smooth transitions between them:

*Optimal denoising (r.h.s. plot) for a mixture of Gaussians with \(w_1=0.5\) and \(\sigma_1=0.3\)*

It can be shown that the optimal denoising function for the mixture distribution is of the form \[g(\tilde x) = p(K=0~|~\tilde x) g_1(\tilde x) + p(K=1~|~\tilde x) g_2(\tilde x),\] where \(K\) is a binary latent variable that indicates which peak a sample \(\tilde x\) belongs to. Thus, the optimal denoising function contains five parameters (the same number as the modeled distribution) and it has to represent the posterior probability \(p(K~|~\tilde x)\) of the latent variable \(K\). In the same way, \(g\) can model other latent variables in the data.
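To make the formula concrete, here is a direct NumPy sketch of the mixture denoiser, with made-up parameter values (not the ones behind the figures). Each component contributes its own linear denoiser, and the posterior responsibilities \(p(K~|~\tilde x)\) blend them; far from the overlap region, \(g\) collapses to the single-Gaussian denoiser of the nearby component.

```python
import numpy as np

# Made-up mixture parameters (not the exact ones behind the figures).
w1, mu1, s1 = 0.5, -1.0, 0.3
mu2, s2 = 1.0, 0.6
sigma_n = 0.4

def gauss(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

def g_component(xt, mu, s):
    # Optimal denoiser for a single Gaussian: x~ + a (mu - x~),
    # with a = sigma_n^2 / (s^2 + sigma_n^2).
    a = sigma_n**2 / (s**2 + sigma_n**2)
    return xt + a * (mu - xt)

def g_mixture(xt):
    # Posterior responsibilities p(K | x~) are computed under the corrupted
    # marginals N(mu_k, s_k^2 + sigma_n^2) of each component.
    p1 = w1 * gauss(xt, mu1, np.hypot(s1, sigma_n))
    p2 = (1 - w1) * gauss(xt, mu2, np.hypot(s2, sigma_n))
    r1 = p1 / (p1 + p2)
    return r1 * g_component(xt, mu1, s1) + (1 - r1) * g_component(xt, mu2, s2)
```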

In this visualization, you can play around with the mixture-of-Gaussian example to see how the properties of the data distribution affect \(g(\tilde x)\). Here, you can explore the shapes of the optimal denoising functions for various one-dimensional distributions.

# Summary

So, the main takeaways of this blog post are:

- Denoising is a task which helps model the data distribution \(p(x)\).
- In contrast to simpler approaches, denoising is sensitive to higher-order features of the data distribution such as width (variance).
- A successful denoising function has to represent latent variables that describe the modeled data distribution.
