In this blog post, we complement our first post and examine denoising from a more analytic perspective with detailed mathematical derivations. We will show that there is a unique two-way connection between the uncorrupted data distribution \(p(x)\) and the optimal denoising function \(g(\tilde x)\), provided that the corruption noise is Gaussian. The corrupted distribution \(p(\tilde x)\) plays a central role as an intermediate stage, which is schematically shown below.
We will also derive the denoising function for two interesting cases of \(p(x)\): Gaussian and a mixture model.
Let’s keep in mind that when learning by denoising, data samples \(x\) are first corrupted by noise to obtain corrupted samples \(\tilde x\), and thereafter the original samples are reconstructed as \(\hat x = g(\tilde x)\) by minimizing the mean square error (MSE): \[C = \mathbb E_{p(x, \tilde x)} \left\{ \left\| x - g(\tilde x) \right\|^2 \right\}\,. \qquad(1)\] The denoising function \(g\) aims to cancel the effect of the corruption process as accurately as possible, and it is therefore the function that must be learned.
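This setup is easy to put in code. The sketch below is a minimal 1-D illustration (the data distribution, noise level, and sample count are all arbitrary choices for the example): it corrupts clean samples and scores a candidate denoiser by its empirical MSE. The trivial identity denoiser pays exactly the noise variance on average, which is the baseline any learned \(g\) must beat.

```python
import numpy as np

rng = np.random.default_rng(0)

# Learning-by-denoising setup in code (1-D; all parameters arbitrary):
# corrupt clean samples, then score a candidate denoiser by empirical MSE.
sigma_n = 1.0
x = rng.normal(1.0, 2.0, size=200_000)            # clean samples x ~ p(x)
x_tilde = x + rng.normal(0.0, sigma_n, x.shape)   # corrupted samples

def mse(g):
    """Monte Carlo estimate of C = E ||x - g(x_tilde)||^2."""
    return float(np.mean((x - g(x_tilde)) ** 2))

# The identity map pays the noise variance on average; any useful
# learned denoiser has to beat this baseline.
assert abs(mse(lambda t: t) - sigma_n**2) < 0.05
```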
Connection between \(p(x)\) and \(p(\tilde x)\)
The mapping \(p(x) \to p(\tilde x)\) is defined by the corruption process or, in other words, by the conditional distribution \(p(\tilde x \mid x)\):
\[p(\tilde x) = \int \!\! p(x, \tilde x) \,\text{d}x = \int \!\! p(\tilde x \mid x)\, p(x) \,\text{d}x \,.\] If the corruption process is additive, \[\tilde x = x + \eta\,,\] and the noise \(\eta\) has a Gaussian distribution \(\rho(\eta)\), the marginal distribution of \(\tilde x\) is a convolution: \[p(\tilde x) = \int \!\! \rho(\tilde x - x)\, p(x) \,\text{d}x \,.\] It is well known that the Fourier transform (denoted here by \(\mathcal{F}\)) of a convolution is a product: \[\mathcal{F}\{p_{\tilde X}\} = \mathcal{F}\{\rho\}\,\mathcal{F}\{p_X\}\,,\] from which it is clear that there is a one-to-one correspondence between \(\mathcal{F}\{p_{\tilde X}\}\) and \(\mathcal{F}\{p_X\}\) whenever the Fourier transform of the “corrupting” distribution, \(\mathcal{F}\{\rho\}\), has no zeros. This is indeed true for a Gaussian \(\rho\). Thus, using the Fourier inversion theorem, we can state that there is a one-to-one correspondence between \(p(\tilde x)\) and \(p(x)\).
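The convolution identity can be checked numerically. The sketch below (a hypothetical two-component Gaussian mixture for \(p(x)\); all parameters are arbitrary) discretizes \(p(x)\) on a grid, convolves it with a Gaussian noise density, and compares the result with the closed form, in which the component variances simply add.

```python
import numpy as np

xs = np.linspace(-10.0, 10.0, 2001)   # odd length keeps "same" mode aligned
dx = xs[1] - xs[0]

def gauss(x, mu, s):
    return np.exp(-(x - mu)**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

# A hypothetical two-component mixture p(x) and Gaussian noise density rho.
p = 0.5 * gauss(xs, -2.0, 0.5) + 0.5 * gauss(xs, 2.0, 0.5)
rho = gauss(xs, 0.0, 1.0)

# Numerical convolution; equivalently a product in the Fourier domain.
p_tilde = np.convolve(p, rho, mode="same") * dx

# Convolving Gaussians adds variances, so the exact answer is known:
s_t = np.sqrt(0.5**2 + 1.0**2)
p_tilde_exact = 0.5 * gauss(xs, -2.0, s_t) + 0.5 * gauss(xs, 2.0, s_t)

assert np.max(np.abs(p_tilde - p_tilde_exact)) < 1e-3
```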
Connection between \(p(\tilde x)\) and \(g(\tilde x)\)
Next, we derive the connection between the corrupted data distribution \(p(\tilde x)\) and the optimal denoising function \(g(\tilde x)\). We first use the general result that the minimizer of Eq. (1) is given by \[g(\tilde x) = \mathbb E_{p(x \mid \tilde x)}\left\{ x \right\}\,.\] Let’s again assume that the corruption noise is Gaussian: \[p(\tilde x \mid x) = \mathcal{N}(\tilde x \mid x, \sigma_n^2 I)
= \frac{1}{(2\pi\sigma_n^2)^{d/2}}\exp\left[-\frac{\|\tilde x - x\|^2}{2\sigma_n^2}\right]\,,\] and note that the gradient of \(p(\tilde x \mid x)\) with respect to \(\tilde x\) is \[\nabla_{\tilde x}\,p(\tilde x \mid x) = \frac{(x - \tilde x)}{\sigma_n^2}\, p(\tilde x \mid x)\,.\] We use this result, together with Bayes’ rule \(p(\tilde x \mid x)\,p(x) = p(x \mid \tilde x)\,p(\tilde x)\), to compute the gradient of \(p(\tilde x)\): \[\begin{aligned}
\nabla_{\tilde x}\,p(\tilde x) =& \int \!\!\nabla_{\tilde x}\,p(\tilde x \mid x)\, p(x) \,\text{d}x
\nonumber \\
=& \int \!\! \frac{(x - \tilde x)}{\sigma_n^2}\, p(\tilde x \mid x)\, p(x) \,\text{d}x
\nonumber \\
=& \int \!\! \frac{(x - \tilde x)}{\sigma_n^2}\, p(x \mid \tilde x)\, p(\tilde x) \,\text{d}x
\nonumber \\
=& \frac{1}{\sigma_n^2} \left( \int \!\! x\,p(x \mid \tilde x) \,\text{d}x - \tilde x \int \!\! p(x \mid \tilde x) \,\text{d}x \right) p(\tilde x)
\nonumber \\
=& \frac{1}{\sigma_n^2}\Big(\mathbb E_{p(x \mid \tilde x)} \left\{ x \right\} - \tilde x \Big)\, p(\tilde x)
\nonumber \\
=& \frac{1}{\sigma_n^2}\big(g(\tilde x) - \tilde x\big)\,p(\tilde x)\,.
\nonumber\end{aligned}\] From this expression, we can solve for \(g(\tilde x)\) in terms of \(p(\tilde x)\) to find
\[g(\tilde x) = \tilde x + \sigma_n^2 \nabla_{\tilde x} \log p(\tilde x)\,. \qquad(2)\]
This simple connection between the optimal denoising function \(g(\tilde x)\) and the corrupted data distribution \(p(\tilde x)\) is the main result of this blog post. To our knowledge, it has not been presented in the literature before. The finding generalizes the result by Alain and Bengio: \[g(\tilde x) = \tilde x + \sigma_n^2 \nabla_{x} \log p(x)\big|_{x = \tilde x} + o(\sigma_n^2)\quad{\rm as}\quad \sigma_n \to 0\,,\] which is formally valid only in the small-noise limit. In comparison, our result is valid for Gaussian noise of arbitrary noise level \(\sigma_n\), but with the crucial difference that the uncorrupted data distribution \(p(x)\) is replaced by the corrupted one, \(p(\tilde x)\). We therefore conclude that, at least for Gaussian noise, the vector field \(g(\tilde x) - \tilde x\) is in general proportional to the gradient of \(\log p(\tilde x)\) rather than that of \(\log p(x)\), and only aligns with the latter in the small-noise limit \(\sigma_n \to 0\).
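Eq. (2) can be sanity-checked numerically. The sketch below assumes a two-component Gaussian mixture for \(p(x)\) (all parameters are arbitrary illustrative choices) and compares \(\tilde x + \sigma_n^2 \nabla_{\tilde x}\log p(\tilde x)\), with the score taken by central finite differences, against the exact posterior mean \(\mathbb E_{p(x \mid \tilde x)}\{x\}\).

```python
import numpy as np

# Two-component Gaussian mixture p(x) with Gaussian corruption
# (all parameters here are arbitrary illustrative choices).
pis = np.array([0.3, 0.7])     # mixture weights P(k)
mus = np.array([-1.0, 2.0])    # component means
s, s_n = 0.5, 0.8              # component std, corruption noise std
s_t2 = s**2 + s_n**2           # variance of each corrupted component

def p_tilde(t):
    """Marginal corrupted density p(x_tilde) at scalar t."""
    comps = np.exp(-(t - mus)**2 / (2 * s_t2)) / np.sqrt(2 * np.pi * s_t2)
    return float(np.sum(pis * comps))

def g_posterior_mean(t):
    """Exact E[x | x_tilde = t]: posterior-weighted component estimates."""
    comps = pis * np.exp(-(t - mus)**2 / (2 * s_t2))
    w = comps / comps.sum()                       # p(k | x_tilde)
    means = (s_n**2 * mus + s**2 * t) / s_t2      # E[x | k, x_tilde]
    return float(w @ means)

# Eq. (2) with the score computed by central finite differences:
h = 1e-5
for t in [-2.0, 0.0, 1.5, 3.0]:
    score = (np.log(p_tilde(t + h)) - np.log(p_tilde(t - h))) / (2 * h)
    assert abs(t + s_n**2 * score - g_posterior_mean(t)) < 1e-5
```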
For completeness, we also present the relation \(g(\tilde x) \to p(\tilde x)\). By rewriting the equation for \(g(\tilde x)\) as \[\nabla_{\tilde x} \log p(\tilde x) = \big(g(\tilde x) - \tilde x\big)/\sigma_n^2\] and integrating both sides of this equation along an arbitrary contour between the points \(0\) and \(\tilde x\), we get \[\int_0^{\tilde x} \nabla_{x'} \log p(x')\cdot\text{d}x' = \log p(\tilde x) - \log p(0)
= \frac{1}{\sigma_n^2} \int_0^{\tilde x} \big(g(x') - x'\big)\cdot\text{d}x'\,.
\nonumber\] By solving for \(p(\tilde x)\) in terms of \(g(\tilde x)\), we get \[p(\tilde x) = \frac{1}{Z} \exp\left[\frac{1}{\sigma_n^2} \int_0^{\tilde x} \big(g(x') - x'\big)\cdot\text{d}x' \right]\,,\] where \(Z = 1/p_{\tilde X}(0)\) is a normalization constant fixed by \(\int p(\tilde x)\,\text{d}\tilde x = 1\).
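This inverse relation can also be checked numerically in one dimension. The sketch below takes the linear shrinkage denoiser of Gaussian data as the given \(g\) (parameters are arbitrary), cumulatively integrates \((g(x') - x')/\sigma_n^2\), exponentiates, and recovers \(p(\tilde x)\) up to the normalization \(Z\).

```python
import numpy as np

# 1-D sketch: recover p(x_tilde) from a given denoiser g. Here g is the
# Gaussian-data shrinkage denoiser (arbitrary parameters mu, sigma, sigma_n).
mu, sigma, sigma_n = 1.0, 2.0, 1.0
g = lambda t: t + sigma_n**2 / (sigma**2 + sigma_n**2) * (mu - t)

ts = np.linspace(-12.0, 14.0, 4001)
dt = ts[1] - ts[0]
integrand = (g(ts) - ts) / sigma_n**2

# Cumulative trapezoidal integral of the integrand; the lower limit only
# shifts the result by a constant, which the normalization Z absorbs.
cum = np.concatenate(
    [[0.0], np.cumsum((integrand[1:] + integrand[:-1]) / 2 * dt)]
)
p_rec = np.exp(cum)
p_rec /= p_rec.sum() * dt      # fix Z so that the density integrates to 1

# For this g the answer must be the Gaussian N(mu, sigma^2 + sigma_n^2):
var_t = sigma**2 + sigma_n**2
p_exact = np.exp(-(ts - mu)**2 / (2 * var_t)) / np.sqrt(2 * np.pi * var_t)
assert np.max(np.abs(p_rec - p_exact)) < 1e-4
```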
Denoising function for Gaussian data
As an example, consider the Gaussian data distribution \(p(x) = \mathcal{N}\big(x \mid \mu, \sigma^2\big)\) and Gaussian corruption noise as above, which give \[p(\tilde x) = \mathcal{N}\big(\tilde x \mid \mu, \sigma^2 + \sigma_n^2 \big) \,.\] Computing the optimal denoising function then gives \[g(\tilde x) = \tilde x - \sigma_n^2 \nabla_{\tilde x} \left(\frac{\|\tilde x - \mu\|^2}{2(\sigma^2 + \sigma_n^2)}\right) = \tilde x + \frac{\sigma_n^2}{\sigma^2 + \sigma_n^2}(\mu - \tilde x)\,.\] This example illustrates the behaviour of \(g(\tilde x)\) as a function of the noise level \(\sigma_n\). Although Eq. (2) suggests that \(g(\tilde x) - \tilde x\) grows with \(\sigma_n^2\), in practice the growth saturates because \(\nabla_{\tilde x} \log p(\tilde x)\) scales as \(1/(\sigma^2 + \sigma_n^2)\).
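A quick numerical look at this saturation (with arbitrary parameters \(\mu\), \(\sigma\) and query point): the correction \(g(\tilde x) - \tilde x\) never overshoots the distance to the prior mean, and as \(\sigma_n \to \infty\) the denoiser simply returns \(\mu\).

```python
# Saturation of the Gaussian-data denoiser with the noise level
# (mu, sigma and the query point t are arbitrary choices).
mu, sigma, t = 1.0, 2.0, 5.0
for s_n in [0.1, 1.0, 10.0, 100.0]:
    correction = s_n**2 / (sigma**2 + s_n**2) * (mu - t)
    # The correction is bounded by the distance to the prior mean...
    assert abs(correction) <= abs(mu - t)
# ...and for large sigma_n the denoiser essentially returns mu:
assert abs((t + correction) - mu) < 0.01 * abs(mu - t)
```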
Denoising function for mixture model
Let’s now derive the denoising function for a general mixture model with the probability distribution \[p(x) = \sum\limits_{k = 1}^N p(x, k) = \sum\limits_{k = 1}^N p(x \mid k)\, P(k)\,,\] where \(k = 1,\ldots,N\) is a discrete latent variable representing the mixture components and \(p(x \mid k) \equiv p_k(x)\) is the \(x\)-distribution of component \(k\). By a straightforward computation, the optimal denoising function can be decomposed in terms of the components as \[\begin{aligned}
g(\tilde x) =& \mathbb E_{p(x \mid \tilde x)}\left\{ x \right\} = \int \!\! p(x \mid \tilde x)\, x\, \text{d}x
\nonumber \\
=& \int \!\! \sum\limits_{k = 1}^N p(x \mid k, \tilde x)\, p(k \mid \tilde x)\, x\,\text{d}x
\nonumber \\
= & \sum\limits_{k = 1}^N p(k \mid \tilde x)\, \mathbb E_{p(x \mid k, \tilde x)}\left\{ x \right\}
\nonumber \\
= & \sum\limits_{k = 1}^N p(k \mid \tilde x)\, g_k(\tilde x)\,,
\nonumber\end{aligned}\] where \(g_k(\tilde x) = \mathbb E_{p(x \mid k, \tilde x)} \left\{ x \right\}\) is the optimal denoising function of component \(k\). We see that \(g(\tilde x)\) implicitly represents the posterior distribution \(p(k \mid \tilde x)\), just as in Bayesian inference.
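The decomposition is easy to make concrete for a Gaussian mixture, where each \(g_k\) is a linear shrinkage toward \(\mu_k\) and the weights are the posterior responsibilities \(p(k \mid \tilde x)\). A sketch with arbitrary illustrative parameters:

```python
import numpy as np

# Gaussian mixture denoiser as a posterior-weighted combination of
# per-component shrinkage denoisers (arbitrary illustrative parameters).
pis = np.array([0.5, 0.5])     # mixture weights P(k)
mus = np.array([-3.0, 3.0])    # component means
s, s_n = 0.5, 0.5              # component std, corruption noise std
s_t2 = s**2 + s_n**2

def responsibilities(t):
    """Posterior p(k | x_tilde = t) over the mixture components."""
    comps = pis * np.exp(-(t - mus)**2 / (2 * s_t2))
    return comps / comps.sum()

def g(t):
    """g(x_tilde) = sum_k p(k | x_tilde) g_k(x_tilde)."""
    g_k = (s_n**2 * mus + s**2 * t) / s_t2   # per-component denoisers
    return float(responsibilities(t) @ g_k)

# Deep inside component 2's territory the mixture denoiser collapses
# to that component's denoiser:
assert responsibilities(3.0)[1] > 0.999
assert abs(g(3.0) - 3.0) < 1e-3
# At the symmetric midpoint the components contribute equally:
assert abs(g(0.0)) < 1e-12
```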
Summary
So, the main takeaways of this blog post are:

Assuming that the corruption process is Gaussian, there is a one-to-one correspondence between the data distribution \(p(x)\) and the optimal denoising function \(g(\tilde x)\). The optimal denoising function can be computed as \[g(\tilde x) = \tilde x + \sigma_n^2 \nabla_{\tilde x} \log p(\tilde x)\,,\] where \(\sigma_n\) is the standard deviation of the corruption noise.

Understanding the precise mathematical form of the connections between \(p(x)\) and \(g(\tilde x)\) makes it possible, in principle, to learn the structure of the data distribution \(p(x)\) in great detail by performing a comprehensive denoising task, or vice versa.