
In Section 3.3 of the paper, they state that they use cross entropy. They then define the probability of a label being a false positive as $\theta_0$ and of it being a false negative as $\theta_1$.

They use these to somehow modify the loss function, but never actually state what this new loss is.

I was expecting something like

$f_\theta(\hat{m_i}, \tilde{m_i}) = \tilde{m_i} *g_{\theta_0}(\hat{m_i}) + (1-\tilde{m_i})*g_{\theta_1}(1-\hat{m_i})$

but $g$ is not given.

Tom M.

2 Answers


Section 3.3 simply gives the equation for the negative log-likelihood. They say that it takes the form of a cross entropy (because it just looks like a cross-entropy equation, perhaps?), but mathematically it seems to come from the fact that they define the model in equation 2 to follow a Bernoulli distribution, which takes the value 1 with probability $p$ and the value 0 with probability $q = 1 - p$:

$$ \Pr(X=1)=p=1-\Pr(X=0)=1-q $$

The probability mass function of the Bernoulli looks like this:

$$ f(x;p)=px+(1-p)(1-x)\!\quad {\text{for }}x\in \{0,1\} $$

and the likelihood for $n$ i.i.d. Bernoulli samples looks like this:

$$ L(p) = \prod_{i=1}^n p^{x_i} (1 - p)^{1 - x_i} $$

Taking the log (to get the log-likelihood that they mention) gives something of the form we see in Equation 3:

$$ \log{L(p)} = \log{p}\sum_{i=1}^n x_i + \log{(1-p)}\sum_{i=1}^n (1-x_i) $$
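To make the equivalence concrete, here is a minimal numpy check (my own sketch, not from the paper) that this Bernoulli log-likelihood is just the negative of the binary cross-entropy summed over the samples:

```python
import numpy as np

# Draw i.i.d. Bernoulli samples x with a single success probability p,
# then compare log L(p) with the (negative) binary cross-entropy.
rng = np.random.default_rng(0)
p = 0.7
x = rng.binomial(1, p, size=1000)      # realisations of the Bernoulli

log_lik = np.log(p) * x.sum() + np.log(1 - p) * (1 - x).sum()
cross_entropy = -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

print(log_lik, -cross_entropy)         # identical up to floating point error
```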

Then you can pull the sum out to the front, so that everything sits inside a single sum over the image patch:

$$ \sum_{i=1}^{w^2_m} \left( \tilde{m}_i \ln\hat{m}_i + (1-\tilde{m}_i)\ln(1- \hat{m}_i) \right) $$

This sums over all the pixel-wise output units in each image patch (map $m$), which is why the summation goes to $w_m^2$: the width of the image patch squared. $\tilde{m}$ is a realisation (possible outcome) of $\hat{m}$, just as $x$ is a realisation of the distribution with parameter $p$ in the equations above.

This looks like the binary cross-entropy loss function, which is why I think they say:

"For the model given in Equation 2 the negative log likelihood takes the form of a cross entropy between the patch $\tilde{m}$ derived from the given map and the predicted patch $\hat{m}$."
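As a sketch of what Equation 3 then computes per patch (my own illustration; the function name `patch_cross_entropy`, the clipping constant and the toy values are assumptions, not from the paper):

```python
import numpy as np

def patch_cross_entropy(m_tilde, m_hat, eps=1e-7):
    """Negative log-likelihood of Equation 3 for one patch.

    m_tilde: observed labels, shape (w_m, w_m), values in {0, 1}
    m_hat:   predicted probabilities, shape (w_m, w_m), values in (0, 1)
    Sums over all w_m**2 pixel-wise output units of the patch.
    """
    m_hat = np.clip(m_hat, eps, 1 - eps)   # avoid log(0)
    return -np.sum(m_tilde * np.log(m_hat) + (1 - m_tilde) * np.log(1 - m_hat))

# Toy 4x4 patch (w_m = 4), hypothetical values just for illustration.
w_m = 4
m_tilde = np.random.default_rng(1).integers(0, 2, size=(w_m, w_m))
m_hat = np.full((w_m, w_m), 0.8)
print(patch_cross_entropy(m_tilde, m_hat))
```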

n1k31t4

$m_i$ is the true, unobserved label for pixel $i$ (0 for background, 1 for building/road/whatever the model is segmenting),
$\tilde{m}_i$ is the observed label for pixel $i$, and
$\hat{m}_i$ is the prediction for pixel $i$.

$\theta_0$ and $\theta_1$ are the probabilities of false positives and false negatives in the labels:

\begin{equation} \label{thetas} \begin{split} \theta_0 &= p(\tilde{m}_i = 1 | m_i = 0) \\ \theta_1 &= p(\tilde{m}_i = 0 | m_i = 1) \end{split} \end{equation}

We no longer try to minimize the difference between the label and the prediction ($\epsilon = \tilde{m}_i - \hat{m}_i$), but rather the difference between the probability that the true, unobserved label is 1 and the prediction (for an input $s$: $\epsilon = p(m_i = 1 | \tilde{m}_i, s) - \hat{m}_i$).

Bayes' rule gives us:

\begin{equation} p(m_i = 1 | \tilde{m}_i) - \hat{m}_i = \frac{p(\tilde{m}_i | m_i = 1) * p(m_i=1)}{p(\tilde{m}_i)} -\hat{m}_i \end{equation}

and since $m_i$ can only be $0$ or $1$,

\begin{equation} p(\tilde{m}_i) = p(\tilde{m}_i | m_i=1) * p(m_i=1) + p(\tilde{m}_i | m_i=0) * p(m_i=0) \end{equation}

The definitions of $\theta_0$ and $\theta_1$, together with the Bernoulli form, give us these two distributions:

\begin{equation} \begin{split} p(\tilde{m}_i | m_i=0) & = \theta_0^{\tilde{m}_i} * (1-\theta_0)^{(1-\tilde{m}_i)} \\ p(\tilde{m}_i | m_i=1) & = \theta_1^{(1-\tilde{m}_i)} * (1-\theta_1)^{\tilde{m}_i} \end{split} \end{equation}

Since $p(m_i = 1) = 1 - p(m_i = 0)$ and $p(m_i = 1) = \hat{m}_i$, we get, if $\tilde{m}_i = 0$,

\begin{equation} \epsilon = \frac{\theta_1 * \hat{m}_i}{\theta_1 * \hat{m}_i + (1- \theta_0) * (1-\hat{m}_i)} - \hat{m}_i \end{equation}

and if $\tilde{m}_i = 1$

\begin{equation} \epsilon = \frac{(1-\theta_1) * \hat{m}_i}{(1-\theta_1) * \hat{m}_i + \theta_0 * (1-\hat{m}_i)} -\hat{m}_i \end{equation}

That is what is plotted in Figure 2.
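If it helps, here is a small numpy sketch of these two cases (my own, not from the paper; the noise rates `theta0`, `theta1` and the function name are made up for illustration), computing $\epsilon$ as a function of $\hat{m}_i$, i.e. the quantity traced by the curves in Figure 2:

```python
import numpy as np

def posterior_m1(m_hat, m_tilde, theta0, theta1):
    """p(m_i = 1 | observed label m_tilde), taking p(m_i = 1) = m_hat.

    theta0 = p(m_tilde = 1 | m = 0)  (false positive rate in the labels)
    theta1 = p(m_tilde = 0 | m = 1)  (false negative rate in the labels)
    """
    if m_tilde == 0:
        return theta1 * m_hat / (theta1 * m_hat + (1 - theta0) * (1 - m_hat))
    return (1 - theta1) * m_hat / ((1 - theta1) * m_hat + theta0 * (1 - m_hat))

theta0, theta1 = 0.05, 0.3             # hypothetical noise rates, not from the paper
m_hat = np.linspace(0.01, 0.99, 5)
for m_tilde in (0, 1):
    eps = np.array([posterior_m1(p, m_tilde, theta0, theta1) for p in m_hat]) - m_hat
    print(m_tilde, np.round(eps, 3))   # epsilon vs m_hat, one curve per observed label
```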

Borbag