Let $C(\tilde{x}|x)$ denote the conditional probability distribution of a corruption process. More precisely, $\tilde{x}\sim C_x$, where $C_x$ is the distribution of the corruption process conditioned on $x$ (with $x$ acting as a parameter of the density). In other words, given an input data point $x$, the process outputs some noisy version $\tilde{x}$. This is not deterministic: if you feed it the same point multiple times, it will return different values of $\tilde{x}$.
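For concreteness, here is a minimal sketch of one common choice of corruption process, additive Gaussian noise. The function name `corrupt` and the noise level `sigma` are illustrative choices, not something fixed above.

```python
import torch

def corrupt(x, sigma=0.5):
    """Sample x_tilde ~ C(. | x) by adding zero-mean Gaussian noise to x (illustrative choice)."""
    return x + sigma * torch.randn_like(x)

x = torch.tensor([0.2, 0.7, 0.1])
print(corrupt(x))  # a noisy version of x
print(corrupt(x))  # a different noisy version: the process is not deterministic
```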
Our goal is to get a reconstruction distribution $p_{\text{reconstruct}}(x|\tilde{x})$, which can tell us which $x$ is most likely, given $\tilde{x}$, as well as a way to recover $x$ from $\tilde{x}$.
How can we do this? The idea is to use a denoising autoencoder.
Take a training set $\mathcal{T}=\{(x_i,\tilde{x}_i)\}_i$ and train an autoencoder with reconstruction error. This gives us two things, ideally:
- An encoder function $f$, which takes a corrupted data point $\tilde{x}$ and returns an "intermediate" representation $h=f(\tilde{x})$.
- A decoder function $g$, which takes the encoded noisy data point and reconstructs the "real" denoised data point, $x\approx g(h)=g(f(\tilde{x}))$.
Then we have a function that can take noisy data and return clean data.
But how can we get this, practically? The exact form of the encoder is not so important, but the decoder is generally a feed-forward (deep) artificial neural network. These are usually trained by stochastic gradient descent (SGD), using an error function as the energy to be minimized.
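As a sketch, with illustrative layer sizes and activations (and a sigmoid output that assumes data scaled to $[0,1]$), the encoder $f$ and decoder $g$ might look like this:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim=784, hidden=128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())     # encoder f
        self.g = nn.Sequential(nn.Linear(hidden, dim), nn.Sigmoid())  # decoder g

    def forward(self, x_tilde):
        h = self.f(x_tilde)   # intermediate representation h = f(x_tilde)
        return self.g(h)      # reconstruction approximating the clean x

dae = DenoisingAutoencoder()
x_tilde = torch.rand(16, 784)  # a batch of corrupted inputs (stand-in data)
x_hat = dae(x_tilde)           # denoised reconstructions, shape (16, 784)
```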
Here, the energy $\mathcal{E}$ will be based on the negative log-likelihood of the data, $-\log p_{\text{decoder}}(x|h=f(\tilde{x}))$. Why? Because, given a noisy point $\tilde{x}$, we want the probability distribution defining the decoder to assign as high a probability as possible to the true value $x$ (which is known, at least in $\mathcal{T}$). Recall that $\log(a)<0$ when $0<a<1$, and here $a$ is a probability, so $a\le 1$. Thus, we take $-\log p_{\text{decoder}}(x|h=f(\tilde{x}))$ to get a non-negative number that we want to minimize.
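The concrete form of this negative log-likelihood depends on which distribution family one assumes for the decoder; two common (assumed) choices look like this, where `x_hat` stands for the decoder output $g(f(\tilde{x}))$:

```python
import torch
import torch.nn.functional as F

# Bernoulli decoder (data in [0, 1]): -log p_decoder(x | h) is the binary cross-entropy.
def nll_bernoulli(x_hat, x):
    return F.binary_cross_entropy(x_hat, x, reduction='sum')

# Gaussian decoder with fixed variance: -log p_decoder(x | h) reduces,
# up to an additive constant, to squared error.
def nll_gaussian(x_hat, x):
    return 0.5 * ((x_hat - x) ** 2).sum()
```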
As is normally done, training is by SGD, which minimizes the expected loss. If $L(x)$ is the loss (error/energy) function, SGD will minimize
$$ \mathcal{E} = \mathbb{E}_{x\sim \hat{p}_{\text{data}}(x)}[L(x)] $$
where $\hat{p}_{\text{data}}(x)$ is the empirical probability density of the data (e.g. from $\mathcal{T}$).
Here, $L(x)=-\log p_{\text{decoder}}(x|h=f(\tilde{x}))$, as described earlier.
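In practice, SGD approximates this expectation with minibatch averages; a toy sketch of the estimate, with stand-in data and a stand-in loss:

```python
import torch

# The minibatch average of the loss is a Monte Carlo estimate of
# E_{x ~ p_hat_data}[L(x)]. The data and loss below are stand-ins.
train_x = torch.rand(1000, 3)             # stand-in for the training set T
L = lambda x: (x ** 2).sum(dim=1)         # stand-in loss, applied per example (row-wise)
idx = torch.randint(len(train_x), (64,))  # draw a minibatch: x ~ p_hat_data
print(L(train_x[idx]).mean())             # approximately E[L(x)]
```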
However, there is an extra source of randomness that SGD averages over: the corruption process. Recall that the corrupter $C$ is random, so it makes sense to talk about the expected value of its output.
In other words, if you consider the random variable $\alpha\sim C_x$, which has density $C(\alpha|x)$, then it makes sense to ask what $\mathbb{E}_{\alpha\sim C_x}[\alpha]$ is.
For example, suppose $\tilde{x}\sim C_x$ is given by $\tilde{x}=x+\epsilon$, where $\epsilon\sim\mathcal{N}(0,\sigma^2)$. Then $\mathbb{E}[\tilde{x}]=x$.
Basically, this expectation is saying what the "expected" $\tilde{x}$ will be, given $x$.
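A quick numerical check of this, with an illustrative noise level:

```python
import torch

# Empirical check that E[x_tilde] = x for additive zero-mean Gaussian noise.
x = torch.tensor([0.2, 0.7, 0.1])
samples = x + 0.5 * torch.randn(10000, 3)  # x_tilde = x + eps, eps ~ N(0, 0.5^2)
print(samples.mean(dim=0))                 # close to x, since E[eps] = 0
```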
SGD always minimizes the expected value of the loss function as its energy. In this case, since both $x$ (which is distributed according to the empirical probability distribution of the known data) and $\tilde{x}$ (which is distributed according to the corruption process) are random, we need to take the expected value over both of them. Thus, we get:
$$
\mathcal{E} = -\mathbb{E}_{x\sim \hat{p}_{\text{data}}(x)}\; \mathbb{E}_{\tilde{x}\sim C_x}
\left[\log p_{\text{decoder}}(x|h=f(\tilde{x}))\right]
$$
where we can move the negative sign out because the expectation is linear.
So, basically, instead of just minimizing the expected loss over the empirical distribution of the data, we minimize the expected loss over both the empirical distribution and that of the corrupter. (Sorry for the slight change in notation, but I think it's clearer).
We are essentially minimizing the average error on a data point $x$, where the average is taken both over the noise produced by the corruption process and over the points we expect to draw from the training distribution.
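Putting the two expectations together, a minimal training-loop sketch might look like the following; the dimensions, noise level, random stand-in data, and the Gaussian (squared-error) decoder are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Draw x from the data, draw x_tilde ~ C_x, then take an SGD step on
# -log p_decoder(x | h = f(x_tilde)).
dim, hidden, sigma = 784, 128, 0.5
f = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())  # encoder f
g = nn.Sequential(nn.Linear(hidden, dim))              # decoder mean g
opt = torch.optim.SGD(list(f.parameters()) + list(g.parameters()), lr=1e-3)

train_x = torch.rand(1000, dim)                        # stand-in for T

for step in range(100):
    idx = torch.randint(len(train_x), (64,))
    x = train_x[idx]                                   # x ~ p_hat_data
    x_tilde = x + sigma * torch.randn_like(x)          # x_tilde ~ C_x
    x_hat = g(f(x_tilde))                              # reconstruction g(f(x_tilde))
    loss = 0.5 * ((x_hat - x) ** 2).sum(dim=1).mean()  # -log p_decoder (Gaussian), up to a constant
    opt.zero_grad()
    loss.backward()
    opt.step()
```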
Note that we can rewrite this as:
$$
\mathcal{E} = -\int_x \int_{\tilde{x}}
\log\big(p_{\text{decoder}}(x|h=f(\tilde{x}))\big)\,
\hat{p}_{\text{data}}(x)\, C(\tilde{x}|x)\, d\tilde{x}\, dx
$$
which you might notice is a weighted average error, where the weights are the probabilities of occurrence. In other words, errors on the more likely values count for more than errors on the rarer ones, where "more likely" is measured by the two probability distributions $\hat{p}_{\text{data}}(x)$ and $C(\tilde{x}|x)$.
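Seen through sampling, the same weighted average can be estimated by Monte Carlo; drawing $(x,\tilde{x})$ pairs and averaging the loss implicitly weights each pair by how often it occurs (the noise level and the stand-in loss below are illustrative):

```python
import torch

train_x = torch.rand(1000, 3)
idx = torch.randint(len(train_x), (4096,))
x = train_x[idx]                            # weighted by p_hat_data(x)
x_tilde = x + 0.5 * torch.randn_like(x)     # weighted by C(x_tilde | x)
toy_loss = ((x_tilde - x) ** 2).sum(dim=1)  # stand-in for -log p_decoder(x | f(x_tilde))
print(toy_loss.mean())                      # approximately the weighted-average energy
```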