
In Section 3.3 of the paper, they state that they use cross entropy. They then define the probability of a label being a false positive as $\theta_0$ and of it being a false negative as $\theta_1$.

They use these to somehow modify the loss function, but never actually state what this new loss is.

I was expecting something like

$f_\theta(\hat{m_i}, \tilde{m_i}) = \tilde{m_i} *g_{\theta_0}(\hat{m_i}) + (1-\tilde{m_i})*g_{\theta_1}(1-\hat{m_i})$

but $g$ is not given.

Tom M.

2 Answers


Section 3.3 simply gives the equation for the negative log-likelihood. They say that it takes the form of a cross entropy (because it just looks like a cross-entropy equation, perhaps?), but mathematically it seems to come from the fact that they define the model in equation 2 to follow a Bernoulli distribution, which takes the value 1 with probability $p$ and the value 0 with probability $q = 1 - p$:

$$ \Pr(X=1)=p=1-\Pr(X=0)=1-q $$

The probability mass function of the Bernoulli looks like this:

$$ f(x;p)=px+(1-p)(1-x)\!\quad {\text{for }}x\in \{0,1\} $$

and the likelihood for $n$ i.i.d. Bernoulli samples looks like this:

$$ L(p) = \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} $$

Taking the log (to get the log-likelihood that they mention) gives something of the form we see in Equation 3:

$$ \log{L(p)} = \log{p}\sum_{i=1}^n x_i + \log{(1-p)}\sum_{i=1}^n (1-x_i) $$
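To make the equivalence concrete, here is a minimal numpy check (my own sketch, not from the paper) that this Bernoulli log-likelihood is just the negative of the binary cross-entropy summed over the samples:

```python
import numpy as np

# Draw i.i.d. Bernoulli samples x with a single success probability p,
# then compare log L(p) with the (negative) binary cross-entropy.
rng = np.random.default_rng(0)
p = 0.7
x = rng.binomial(1, p, size=1000)      # realisations of the Bernoulli

log_lik = np.log(p) * x.sum() + np.log(1 - p) * (1 - x).sum()
cross_entropy = -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

print(log_lik, -cross_entropy)         # identical up to floating point error
```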

Then you can pull the sum out to the front, so that everything sits inside a single sum over the image patch:

$$ \sum_{i=1}^{w^2_m} \left( \tilde{m}_i \ln\hat{m}_i + (1-\tilde{m}_i)\ln(1- \hat{m}_i) \right) $$

This sums over all the pixel-wise output units in each image patch (map $m$), which is why the summation goes to $w_m^2$: the width of the image patch squared. $\tilde{m}$ is a realisation (possible outcome) of $\hat{m}$, just as $x$ is a realisation of the distribution with parameter $p$ in the equations above.

This looks like the binary cross-entropy loss function, which is why I think they say:

"For the model given in Equation 2 the negative log likelihood takes the form of a cross entropy between the patch $\tilde{m}$ derived from the given map and the predicted patch $\hat{m}$."
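As a sketch of what Equation 3 then computes per patch (my own illustration; the function name `patch_cross_entropy`, the clipping constant and the toy values are assumptions, not from the paper):

```python
import numpy as np

def patch_cross_entropy(m_tilde, m_hat, eps=1e-7):
    """Negative log-likelihood of Equation 3 for one patch.

    m_tilde: observed labels, shape (w_m, w_m), values in {0, 1}
    m_hat:   predicted probabilities, shape (w_m, w_m), values in (0, 1)
    Sums over all w_m**2 pixel-wise output units of the patch.
    """
    m_hat = np.clip(m_hat, eps, 1 - eps)   # avoid log(0)
    return -np.sum(m_tilde * np.log(m_hat) + (1 - m_tilde) * np.log(1 - m_hat))

# Toy 4x4 patch (w_m = 4), hypothetical values just for illustration.
w_m = 4
m_tilde = np.random.default_rng(1).integers(0, 2, size=(w_m, w_m))
m_hat = np.full((w_m, w_m), 0.8)
print(patch_cross_entropy(m_tilde, m_hat))
```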

n1k31t4

$m_i$ is the true, unobserved label for pixel $i$ (0 for background, 1 for building/road/whatever the model is segmenting),
$\tilde{m}_i$ is the observed label for pixel $i$, and
$\hat{m}_i$ is the prediction for pixel $i$.

$\theta_0$ and $\theta_1$ are the probabilities of false positives and false negatives in the labels:

\begin{equation} \label{thetas} \begin{split} \theta_0 &= p(\tilde{m}_i = 1 | m_i = 0) \\ \theta_1 &= p(\tilde{m}_i = 0 | m_i = 1) \end{split} \end{equation}

We no longer try to minimize the difference between the label and the prediction ($\epsilon = \tilde{m}_i - \hat{m}_i$), but rather the difference between the probability that the true, unobserved label is 1 and the prediction (for an input $s$: $\epsilon = p(m_i = 1 | \tilde{m}_i, s) - \hat{m}_i$).

Bayes' rule gives us:

\begin{equation} p(m_i = 1 | \tilde{m}_i) - \hat{m}_i = \frac{p(\tilde{m}_i | m_i = 1) * p(m_i=1)}{p(\tilde{m}_i)} -\hat{m}_i \end{equation}

and since $m_i$ can only be $0$ or $1$,

\begin{equation} p(\tilde{m}_i) = p(\tilde{m}_i | m_i=1) * p(m_i=1) + p(\tilde{m}_i | m_i=0) * p(m_i=0) \end{equation}

The definitions of $\theta_0$ and $\theta_1$, together with the Bernoulli form, give us these two distributions:

\begin{equation} \begin{split} p(\tilde{m}_i | m_i=0) & = \theta_0^{\tilde{m}_i} * (1-\theta_0)^{(1-\tilde{m}_i)} \\ p(\tilde{m}_i | m_i=1) & = \theta_1^{(1-\tilde{m}_i)} * (1-\theta_1)^{\tilde{m}_i} \end{split} \end{equation}

Since $p(m_i = 1) = 1 - p(m_i = 0)$ and $p(m_i = 1) = \hat{m}_i$, we get, if $\tilde{m}_i = 0$,

\begin{equation} \epsilon = \frac{\theta_1 * \hat{m}_i}{\theta_1 * \hat{m}_i + (1- \theta_0) * (1-\hat{m}_i)} - \hat{m}_i \end{equation}

and if $\tilde{m}_i = 1$

\begin{equation} \epsilon = \frac{(1-\theta_1) * \hat{m}_i}{(1-\theta_1) * \hat{m}_i + \theta_0 * (1-\hat{m}_i)} -\hat{m}_i \end{equation}

That is what is plotted in Figure 2.
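If it helps, here is a small numpy sketch of these two cases (my own, not from the paper; the noise rates `theta0`, `theta1` and the function name are made up for illustration), computing $\epsilon$ as a function of $\hat{m}_i$, i.e. the quantity traced by the curves in Figure 2:

```python
import numpy as np

def posterior_m1(m_hat, m_tilde, theta0, theta1):
    """p(m_i = 1 | observed label m_tilde), taking p(m_i = 1) = m_hat.

    theta0 = p(m_tilde = 1 | m = 0)  (false positive rate in the labels)
    theta1 = p(m_tilde = 0 | m = 1)  (false negative rate in the labels)
    """
    if m_tilde == 0:
        return theta1 * m_hat / (theta1 * m_hat + (1 - theta0) * (1 - m_hat))
    return (1 - theta1) * m_hat / ((1 - theta1) * m_hat + theta0 * (1 - m_hat))

theta0, theta1 = 0.05, 0.3             # hypothetical noise rates, not from the paper
m_hat = np.linspace(0.01, 0.99, 5)
for m_tilde in (0, 1):
    eps = np.array([posterior_m1(p, m_tilde, theta0, theta1) for p in m_hat]) - m_hat
    print(m_tilde, np.round(eps, 3))   # epsilon vs m_hat, one curve per observed label
```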

Borbag