
Let's say I'm trying to classify some data with logistic regression.

Before the weighted sum of the inputs is passed to the logistic function (which squashes it into the range $[0,1]$), the weights must be optimized for the desired outcome. To find optimal weights for classification, we need an error function that can be minimized; this can be cross entropy.
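For concreteness, here is a minimal sketch (Python/NumPy, with made-up weights and inputs chosen only for illustration) of the weighted sum being squashed by the logistic function:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights, bias, and a single input vector (illustration only).
w = np.array([0.4, -1.2, 0.7])
b = 0.1
x = np.array([1.0, 0.5, 2.0])

z = np.dot(w, x) + b   # weighted sum of the inputs
p = sigmoid(z)         # predicted probability of the positive class
print(p)               # ~0.786
```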

From my knowledge, cross entropy quantifies the difference between two probability distributions over the same set of events: it measures the average number of bits (or nats) needed to encode events from one distribution when using a code optimized for the other.

For some reason, cross entropy is equivalent to the negative log likelihood. The cross entropy loss between two probability distributions $p$ and $q$ is defined as:

$$H(p, q)=-\sum_{x}p(x)\,\log_e(q(x))$$
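As a sanity check on this definition, here is a small sketch with two made-up distributions over the same three events (values chosen only for illustration), using the natural log as in the formula above:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * ln(q(x))."""
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log(q))

p = [0.5, 0.3, 0.2]   # "true" distribution
q = [0.4, 0.4, 0.2]   # model's distribution

print(cross_entropy(p, p))  # ~1.0297, the entropy of p itself
print(cross_entropy(p, q))  # ~1.0549, always >= H(p, p)
```

The cross entropy is smallest when $q$ matches $p$ exactly, which is why it makes a sensible training objective.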

Again from my knowledge, if we expect a binary outcome from our function, it would be optimal to compute the cross entropy loss over Bernoulli random variables.

By definition, the probability mass function $g$ of the Bernoulli distribution over a possible outcome $x$ is:

$$g(x\mid p)=p^{x}(1-p)^{1-x} \ \textrm{for} \ x\in \{0, 1\}$$

This means the probability is $1-p$ if $x=0$ and $p$ if $x=1$.
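A tiny sketch of this PMF, with an arbitrary $p$ chosen only for illustration:

```python
def bernoulli_pmf(x, p):
    """g(x | p) = p**x * (1 - p)**(1 - x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

p = 0.7  # arbitrary success probability
print(bernoulli_pmf(1, p))  # 0.7  -> probability of x = 1 is p
print(bernoulli_pmf(0, p))  # ~0.3 -> probability of x = 0 is 1 - p
```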


The Bernoulli distribution is based on a binary outcome, and therefore the cross entropy computed over Bernoulli random variables is called binary cross entropy:

$$\mathcal{L}(\theta)= -\frac{1}{n}\sum_{i=1}^n \left[y_i \log(p_i) + (1-y_i)\log(1-p_i) \right]$$
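Here is a minimal sketch of this loss (Python/NumPy, with hypothetical labels and predicted probabilities), just to make the averaging over the $n$ samples explicit:

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical labels and predicted probabilities for four samples.
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])

print(binary_cross_entropy(y, p))      # ~0.30 (mostly correct, confident predictions)
print(binary_cross_entropy(y, 1 - p))  # ~1.51 (same confidence, wrong predictions)
```

In practice the predicted probabilities are usually clipped away from exactly 0 and 1 so that the logarithm never blows up.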

Is this true? Why is the negative log likelihood associated with cross entropy? Why does the Bernoulli random variable work so well here?

In short, how does binary cross entropy work?

ShellRox

1 Answer

  • When doing logistic regression you start by calculating a bunch of probabilities $p_i$, and your goal is to maximize the product of those probabilities (since the samples are treated as independent events). The higher the product, the better your model.
  • Since we are dealing with probabilities, we are multiplying numbers between 0 and 1, so multiplying many of them gives smaller and smaller results. We therefore need a way to move from a product of probabilities to a sum of other numbers.
  • This is where the $\ln$ function comes into play. We can use some of its properties, such as:
    • $\ln(ab) = \ln(a) + \ln(b)$.
    • When our prediction is perfect, i.e. equal to 1, then $\ln(1) = 0$.
    • The $\ln$ of numbers smaller than 1 is increasingly negative, e.g. $\ln(0.9) \approx -0.1$ and $\ln(0.5) \approx -0.69$.
  • So we can move from maximizing the product of the probabilities to minimizing the sum of the $-\ln$ of those probabilities (a short numeric sketch of this step follows after the list). The resulting cross-entropy formula is then:

$$ -\sum_{i=1}^m \left[ y_i \ln(p_i) + (1-y_i) \ln(1-p_i) \right] $$

  • If $y_i$ is 1, the second term of the sum is 0; likewise, if $y_i$ is 0, the first term goes away.
  • Intuitively, cross entropy asks the following: if I have a bunch of events and a bunch of probabilities, how likely is it that those events happen, given those probabilities? If it is likely, the cross-entropy will be small; otherwise, it will be big.
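As a quick numeric sanity check (made-up labels and predicted probabilities, Python/NumPy), here is the product-to-sum step mentioned above: minimizing the sum of $-\ln$ of the per-sample likelihoods is the same as maximizing their product.

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only.
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])

# Likelihood of each observed label: p_i when y_i = 1, (1 - p_i) when y_i = 0.
likelihoods = np.where(y == 1, p, 1 - p)

product = np.prod(likelihoods)              # the quantity we want to maximize
neg_log_sum = -np.sum(np.log(likelihoods))  # the cross-entropy sum we minimize instead

print(product)               # ~0.3024
print(neg_log_sum)           # ~1.196
print(np.exp(-neg_log_sum))  # recovers the product, since ln turns products into sums
```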
Alberto