
This question is taken from the book Neural Networks and Deep Learning by Michael Nielsen.

The question: For a single neuron, it is argued that the cross-entropy is small if σ(z) ≈ y for all training inputs. The argument relied on y being equal to either 0 or 1. This is usually true in classification problems, but for other problems (e.g., regression problems) y can sometimes take values intermediate between 0 and 1. Show that the cross-entropy is still minimized when σ(z) = y for all training inputs. When this is the case the cross-entropy has the value

$$ C = -\frac{1}{n}\sum_x \bigl[\, y\ln y + (1-y)\ln(1-y) \,\bigr] $$

where:

C is the cost function

σ is the sigmoid function

y is the expected output of the network


I am not sure if I understand the question correctly. If σ(z) = y, wouldn't the partial derivative of C with respect to the weights (or bias) be equal to zero? This would imply that the weight is already at the correct value and the cost function cannot be reduced any further by changing the weight.


z is the weighted input to the neuron (z = wx + b).

Here b is the bias and x is the activation of a neuron in the previous layer (in this case, the input to the network itself).

w is the weight.
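
For reference, the single-neuron cross-entropy cost and its partial derivatives that the argument above refers to are, as given in the book (writing $a=\sigma(z)$ for the neuron's output):

$$ C = -\frac{1}{n}\sum_x \bigl[\, y\ln a + (1-y)\ln(1-a) \,\bigr], \qquad \frac{\partial C}{\partial w} = \frac{1}{n}\sum_x x\,(\sigma(z)-y), \qquad \frac{\partial C}{\partial b} = \frac{1}{n}\sum_x (\sigma(z)-y). $$

So the gradient does indeed vanish when σ(z) = y for every input; the exercise asks you to show that this point is in fact the minimum, and to compute the value of C there.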

1 Answer


You are calculating the so-called binary cross-entropy. Let $f(\cdot)$ be a sigmoid function. The binary cross-entropy between $y$ and $f(t)$ is

$$ F(t,y) = H(y,f(t)) = -y\log f(t) - (1-y)\log(1-f(t)). $$

When you compute the partial derivative of $F(t,y)$ with respect to $t$, you get:

$$ \frac{\partial F(t,y)}{\partial t} = f(t) - y $$
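
To see where this comes from, expand the derivative and use the sigmoid identity $f'(t)=f(t)\,[1-f(t)]$:

$$ \frac{\partial F(t,y)}{\partial t} = -y\,\frac{f'(t)}{f(t)} + (1-y)\,\frac{f'(t)}{1-f(t)} = -y\,[1-f(t)] + (1-y)\,f(t) = f(t)-y. $$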

This derivative is negative for $f(t)<y$ and positive for $f(t)>y$, so you reach the minimum exactly when $f(t)=y$. In the case of multiple training inputs, let:

$$ F(\vec{t},\vec{y}) = \frac{1}{n}\sum_{i=1}^n F(t_i,y_i) $$

so

$$ \frac{\partial F(\vec{t},\vec{y})}{\partial t_i} = \frac{f(t_i) - y_i}{n} $$
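
At a minimum this gradient must vanish for every $i$, i.e. $f(t_i)=y_i$ (and, as in the single-variable case, each term really is minimized there). Substituting $f(t_i)=y_i$ gives the value quoted in the exercise:

$$ F(\vec{t},\vec{y})\,\Big|_{f(t_i)=y_i} = -\frac{1}{n}\sum_{i=1}^n \bigl[\, y_i\ln y_i + (1-y_i)\ln(1-y_i) \,\bigr], $$

the average binary entropy of the targets, which is zero when every $y_i$ is 0 or 1 and strictly positive for intermediate targets.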

Now, if $t_i=\sum_j w_{ij}x_j+b_i$ is the weighted input to neuron $i$, then by the chain rule:

$$ \frac{\partial F(\vec{t},\vec{y})}{\partial w_{ij}}=\sum_k\frac{\partial F(\vec{t},\vec{y})}{\partial t_k}\cdot\frac{\partial t_k}{\partial w_{ij}}=\frac{1}{n}\cdot x_j\cdot[f(t_i) - y_i], $$

since $\partial t_k/\partial w_{ij}=\delta_{ki}\,x_j$, only the $k=i$ term of the sum survives. This gradient also vanishes when $f(t_i)=y_i$, which is consistent with the observation in the question.
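
Not part of the original answer, but here is a quick numerical sanity check (a minimal sketch in Python/NumPy; the helper names are mine) that the per-input cross-entropy with an intermediate target $y$ is indeed smallest when $\sigma(z)=y$:

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(s, y):
    # binary cross-entropy between target y and prediction s
    return -y * np.log(s) - (1 - y) * np.log(1 - s)

y = 0.3                        # intermediate target, neither 0 nor 1
z = np.linspace(-6, 6, 10001)  # sweep the weighted input
s = sigmoid(z)
costs = cross_entropy(s, y)

i = np.argmin(costs)
print(f"sigma(z) at the minimum: {s[i]:.4f}  (target y = {y})")
print(f"minimum cost: {costs[i]:.4f}")
print(f"-y*ln(y) - (1-y)*ln(1-y) = {cross_entropy(y, y):.4f}")
# The minimizing sigma(z) is (numerically) y, and the minimum cost equals
# the binary entropy of y rather than zero.
```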
