Let $C(\tilde{x}|x)$ denote the conditional probability distribution of a corruption process. More precisely, $\tilde{x}\sim C_x$, where $C_x$ is the distribution of the corruption process conditioned on $x$ (with $x$ acting as a parameter of the density). In other words, given an input data point $x$, the process outputs some noisy version $\tilde{x}$. This is not deterministic: if you feed it the same point multiple times, it will return different values of $\tilde{x}$.
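For concreteness, here is a minimal sketch of one common choice of corruption process, additive Gaussian noise. The function name `corrupt` and the noise level `sigma` are illustrative choices, not something fixed above.

```python
import torch

def corrupt(x, sigma=0.5):
    """Sample x_tilde ~ C(. | x) by adding zero-mean Gaussian noise to x (illustrative choice)."""
    return x + sigma * torch.randn_like(x)

x = torch.tensor([0.2, 0.7, 0.1])
print(corrupt(x))  # a noisy version of x
print(corrupt(x))  # a different noisy version: the process is not deterministic
```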
Our goal is to get a reconstruction distribution $p_{\text{reconstruct}}(x|\tilde{x})$, which can tell us which $x$ is most likely, given $\tilde{x}$, as well as a way to recover $x$ from $\tilde{x}$.
How can we do this? The idea is to use a denoising autoencoder.
Take a training set $\mathcal{T}=\{(x_i,\tilde{x}_i)\}_i$ and train an autoencoder with reconstruction error. This gives us two things, ideally:
- An encoder function $f$, which takes a corrupted data point $\tilde{x}$ and returns an "intermediate" representation $h=f(\tilde{x})$.
- A decoder function $g$, which takes the encoded noisy data point and reconstructs the "real" denoised data point, $x\approx g(h)=g(f(\tilde{x}))$.
Then we have a function that can take noisy data and return clean data.
But how can we get this, practically? The exact form of the encoder is not so important, but the decoder is generally a feed-forward (deep) artificial neural network. These are usually trained by stochastic gradient descent (SGD), using an error function as the energy to be minimized.
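As a sketch, with illustrative layer sizes and activations (and a sigmoid output that assumes data scaled to $[0,1]$), the encoder $f$ and decoder $g$ might look like this:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim=784, hidden=128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())     # encoder f
        self.g = nn.Sequential(nn.Linear(hidden, dim), nn.Sigmoid())  # decoder g

    def forward(self, x_tilde):
        h = self.f(x_tilde)   # intermediate representation h = f(x_tilde)
        return self.g(h)      # reconstruction approximating the clean x

dae = DenoisingAutoencoder()
x_tilde = torch.rand(16, 784)  # a batch of corrupted inputs (stand-in data)
x_hat = dae(x_tilde)           # denoised reconstructions, shape (16, 784)
```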
Here, the energy $\mathcal{E}$ will be based on the negative log-likelihood of the data, $-\log p_{\text{decoder}}(x|h=f(\tilde{x}))$. Why? Because, given a noisy point $\tilde{x}$, we want the probability distribution defining the decoder to assign as high a probability as possible to the true value $x$ (which is known, at least in $\mathcal{T}$). Recall that $\log(a)<0$ when $0<a<1$, and here $a$ is a probability, so $a\le 1$. Thus, we take $-\log p_{\text{decoder}}(x|h=f(\tilde{x}))$ to get a non-negative number that we want to minimize.
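The concrete form of this negative log-likelihood depends on which distribution family one assumes for the decoder; two common (assumed) choices look like this, where `x_hat` stands for the decoder output $g(f(\tilde{x}))$:

```python
import torch
import torch.nn.functional as F

# Bernoulli decoder (data in [0, 1]): -log p_decoder(x | h) is the binary cross-entropy.
def nll_bernoulli(x_hat, x):
    return F.binary_cross_entropy(x_hat, x, reduction='sum')

# Gaussian decoder with fixed variance: -log p_decoder(x | h) reduces,
# up to an additive constant, to squared error.
def nll_gaussian(x_hat, x):
    return 0.5 * ((x_hat - x) ** 2).sum()
```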
As is normally done, training is by SGD, which minimizes the expected loss. If $L(x)$ is the loss (error/energy) function, SGD will minimize
$$ \mathcal{E} = \mathbb{E}_{x\sim \hat{p}_{\text{data}}(x)}[L(x)] $$
where $\hat{p}_{\text{data}}(x)$ is the empirical probability density of the data (e.g. from $\mathcal{T}$).
Here, $L(x)=-\log p_{\text{decoder}}(x|h=f(\tilde{x}))$, as described earlier.
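In practice, SGD approximates this expectation with minibatch averages; a toy sketch of the estimate, with stand-in data and a stand-in loss:

```python
import torch

# The minibatch average of the loss is a Monte Carlo estimate of
# E_{x ~ p_hat_data}[L(x)]. The data and loss below are stand-ins.
train_x = torch.rand(1000, 3)             # stand-in for the training set T
L = lambda x: (x ** 2).sum(dim=1)         # stand-in loss, applied per example (row-wise)
idx = torch.randint(len(train_x), (64,))  # draw a minibatch: x ~ p_hat_data
print(L(train_x[idx]).mean())             # approximately E[L(x)]
```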
However, there is an extra source of randomness that SGD averages over: the corruption process. Recall that the corrupter $C$ is random, so it makes sense to talk about the expected value of its output.
In other words, if you consider the random variable $\alpha\sim C_x$, which has density $C(\alpha|x)$, then it makes sense to ask what $\mathbb{E}_{\alpha\sim C_x}[\alpha]$ is.
For example, suppose $\tilde{x}\sim C_x$ is given by $\tilde{x}=x+\epsilon$, where $\epsilon\sim\mathcal{N}(0,\sigma^2)$. Then $\mathbb{E}[\tilde{x}]=x$.
Basically, this expectation is saying what the "expected" $\tilde{x}$ will be, given $x$.
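A quick numerical check of this, with an illustrative noise level:

```python
import torch

# Empirical check that E[x_tilde] = x for additive zero-mean Gaussian noise.
x = torch.tensor([0.2, 0.7, 0.1])
samples = x + 0.5 * torch.randn(10000, 3)  # x_tilde = x + eps, eps ~ N(0, 0.5^2)
print(samples.mean(dim=0))                 # close to x, since E[eps] = 0
```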
SGD always minimizes the expected value of the loss function as its energy. In this case, since both $x$ (which is distributed according to the empirical probability distribution of the known data) and $\tilde{x}$ (which is distributed according to the corruption process) are random, we need to take the expected value over both of them. Thus, we get:
$$
\mathcal{E} = -\mathbb{E}_{x\sim \hat{p}_{\text{data}}(x)}\; \mathbb{E}_{\tilde{x}\sim C_x}
\left[\log p_{\text{decoder}}(x|h=f(\tilde{x}))\right]
$$
where we can move the negative sign out because the expectation is linear.
So, basically, instead of just minimizing the expected loss over the empirical distribution of the data, we minimize the expected loss over both the empirical distribution and that of the corrupter. (Sorry for the slight change in notation, but I think it's clearer).
We are essentially minimizing the average error on a data point $x$, where the average is taken both over the noise produced by the corruption process and over the points we expect to draw from the training distribution.
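Putting the two expectations together, a minimal training-loop sketch might look like the following; the dimensions, noise level, random stand-in data, and the Gaussian (squared-error) decoder are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Draw x from the data, draw x_tilde ~ C_x, then take an SGD step on
# -log p_decoder(x | h = f(x_tilde)).
dim, hidden, sigma = 784, 128, 0.5
f = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())  # encoder f
g = nn.Sequential(nn.Linear(hidden, dim))              # decoder mean g
opt = torch.optim.SGD(list(f.parameters()) + list(g.parameters()), lr=1e-3)

train_x = torch.rand(1000, dim)                        # stand-in for T

for step in range(100):
    idx = torch.randint(len(train_x), (64,))
    x = train_x[idx]                                   # x ~ p_hat_data
    x_tilde = x + sigma * torch.randn_like(x)          # x_tilde ~ C_x
    x_hat = g(f(x_tilde))                              # reconstruction g(f(x_tilde))
    loss = 0.5 * ((x_hat - x) ** 2).sum(dim=1).mean()  # -log p_decoder (Gaussian), up to a constant
    opt.zero_grad()
    loss.backward()
    opt.step()
```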
Note that we can rewrite this as:
$$
\mathcal{E} = -\int_x \int_{\tilde{x}}
\log\big(p_{\text{decoder}}(x|h=f(\tilde{x}))\big)\,
\hat{p}_{\text{data}}(x)\, C(\tilde{x}|x)\, d\tilde{x}\, dx
$$
which you might notice is a weighted average error, where the weights are the probabilities of occurrence. In other words, errors on the more likely values count for more than errors on the rarer ones, where "more likely" is measured by the two probability distributions $\hat{p}_{\text{data}}(x)$ and $C(\tilde{x}|x)$.
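Seen through sampling, the same weighted average can be estimated by Monte Carlo; drawing $(x,\tilde{x})$ pairs and averaging the loss implicitly weights each pair by how often it occurs (the noise level and the stand-in loss below are illustrative):

```python
import torch

train_x = torch.rand(1000, 3)
idx = torch.randint(len(train_x), (4096,))
x = train_x[idx]                            # weighted by p_hat_data(x)
x_tilde = x + 0.5 * torch.randn_like(x)     # weighted by C(x_tilde | x)
toy_loss = ((x_tilde - x) ** 2).sum(dim=1)  # stand-in for -log p_decoder(x | f(x_tilde))
print(toy_loss.mean())                      # approximately the weighted-average energy
```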