I'm trying to understand the gradient derivation for the back-propagation algorithm.
I'm having trouble computing the explicit derivative of the mean squared error (MSE) loss with respect to the output value in a regression setting. I have only one output neuron.
Let,
- $n$ be the number of training examples
- $ y_i $ be the predicted output for training example $x_i$
- $ t_i $ be the actual target value (from the training data) for training example $x_i$
- $ L_i $ be the loss for sample $i$
I'm using the following definition of the loss function,
$$ E = \frac{1}{n} \sum_{i=1}^{n} L_i = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} ( y_i - t_i)^2 = \frac{1}{2n} \sum_{i=1}^{n} ( y_i - t_i)^2 $$
How do I compute $\frac{\partial E}{\partial y}$?
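To pin down the quantity I mean, here is a small numerical sketch I wrote (NumPy, with made-up values for $y_i$ and $t_i$), evaluating $E$ under the definition above and estimating the partial derivatives with finite differences:

```python
import numpy as np

# made-up predictions and targets, just to pin down the quantity I mean
y = np.array([0.9, 2.1, 3.2])   # y_i: network outputs (one output neuron, n = 3 samples)
t = np.array([1.0, 2.0, 3.0])   # t_i: targets from the training data
n = len(y)

def E(y):
    # E = (1/n) * sum_i 1/2 * (y_i - t_i)^2
    return np.mean(0.5 * (y - t) ** 2)

# finite-difference estimate of dE/dy_i for each i
eps = 1e-6
grad = np.zeros(n)
for i in range(n):
    y_plus = y.copy(); y_plus[i] += eps
    y_minus = y.copy(); y_minus[i] -= eps
    grad[i] = (E(y_plus) - E(y_minus)) / (2 * eps)

print(E(y), grad)   # grad[i] should match the analytic partial derivative I'm after
```

The finite-difference numbers are what I would like to obtain analytically.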
This is in a neural network setting, so $E$ is a function of the weights $w$. Bishop's book, in equation (5.11), gives what is as far as I can see the same expression, except that it is not divided by $n$:
$$ E(w) = \frac{1}{2} \sum_{i=1}^n \big( y(x_i, w) - t_i \big)^2 $$ So here $y$ is a function that depends on $x_i$ and $w$, so writing
$$ \frac{\partial E}{\partial y} $$ means differentiating with respect to a function?
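To make the dependence on $x$ and $w$ concrete for myself, here is a toy sketch (my own made-up single-output network, not Bishop's model) where $y(x, w)$ is just an ordinary function of an input vector and a set of weights:

```python
import numpy as np

def y(x, w):
    """Toy single-output network: one hidden tanh layer, then a linear output.
    Purely illustrative; not Bishop's architecture or notation."""
    hidden = np.tanh(w["W1"] @ x + w["b1"])    # hidden-layer activations
    return float(w["w2"] @ hidden + w["b2"])   # single output neuron

# made-up weights and input, for illustration only
rng = np.random.default_rng(0)
w = {"W1": rng.normal(size=(3, 2)), "b1": np.zeros(3),
     "w2": rng.normal(size=3), "b2": 0.0}
x = np.array([0.5, -1.0])

print(y(x, w))   # y is an ordinary real-valued function of x and w
```

So $y$ itself is a function, which is exactly what confuses me about taking $\frac{\partial E}{\partial y}$.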
And yet Bishop does exactly this in equation (5.19):
$$ \frac{\partial E}{\partial y_k} = y_k - t_k $$
where $y_k$ is the output of the $k$-th output neuron and $t_k$ the corresponding target value. But where have the training instances gone? They've disappeared from the equation! $y_k$ is predicted for an input $x$!
I don't understand the nature of $y$ here, and why it's legal to differentiate $E$ with respect to it.
Thanks for any help.