
I have a very basic question about cost functions. I'm studying gradient descent, and there we use partial derivatives with respect to the parameters "Theta". But isn't the cost function just a single value for each data point, i.e., $\big(h(x) -y\big)^2$? I'm a little confused about how we're differentiating this function. I'm trying to build intuition about this. Thanks!

Luca Anzalone
MLENGG

1 Answer


When computing gradients for gradient descent, the chain rule is the key. Back-propagation (which is used in neural networks to compute the gradient) basically means applying the chain rule over and over until all elements of the gradient are computed.

In your case, you have a prediction $h(x;\theta)$ and a cost function $C(\hat{y},y)$.
Note: I added $\theta$ as a parameter of $h$, since the values of $\theta$ influence the output of $h$ and we explicitly want to compute the derivative with respect to $\theta$. I hope this makes things clearer. In short, we can write: $$\begin{align} \hat{y}&=h(x;\theta)\\ C(\hat{y},y)&=(\hat{y}-y)^2 \end{align}$$

If you want to compute the derivative with respect to a single parameter $\theta_k$, the chain rule gives you $$\frac{\partial}{\partial\theta_k}C(\hat{y},y)=\frac{\partial}{\partial\hat{y}}C(\hat{y},y)\cdot \frac{\partial\hat{y}}{\partial\theta_k}$$

With your cost function you then get: $$\begin{align} \frac{\partial}{\partial\hat{y}}C(\hat{y},y) &= 2\cdot(\hat{y}-y) = 2\cdot(h(x;\theta)-y)\\ \frac{\partial\hat{y}}{\partial\theta_k} &= \frac{\partial}{\partial\theta_k}h(x;\theta)=\ldots \end{align}$$
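For instance, if $h$ is the linear hypothesis $h(x;\theta)=\theta^\top x=\sum_j\theta_j x_j$ (an assumption on my part, since you did not say which model you are using), that last factor works out to $$\frac{\partial\hat{y}}{\partial\theta_k}=\frac{\partial}{\partial\theta_k}\sum_j\theta_j x_j = x_k,$$ so the whole partial derivative becomes $\frac{\partial}{\partial\theta_k}C(\hat{y},y)=2\,(h(x;\theta)-y)\,x_k$.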

Depending on your concrete $h$, you can fill in that last factor and assemble the full gradient.
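To make that concrete, here is a minimal NumPy sketch assuming the linear hypothesis above and a tiny made-up data set (both are assumptions for illustration, not part of your setup); the `gradient` function is exactly the chain-rule product from the equations, averaged over the examples:

```python
import numpy as np

# Tiny made-up data set: 3 examples, 2 parameters (first column is the bias term).
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])
y = np.array([5.0, 7.0, 11.0])   # generated from y = 1 + 2*x, so theta should approach [1, 2]
theta = np.zeros(2)

def h(X, theta):
    """Linear hypothesis h(x; theta) = theta^T x, applied to every row of X."""
    return X @ theta

def cost(X, y, theta):
    """Squared-error cost, averaged over the data set."""
    return np.mean((h(X, theta) - y) ** 2)

def gradient(X, y, theta):
    """Chain rule: dC/dtheta_k = mean over examples of 2 * (y_hat - y) * x_k."""
    y_hat = h(X, theta)
    return (2.0 / len(y)) * (X.T @ (y_hat - y))

# Plain gradient descent on theta.
learning_rate = 0.05
for _ in range(2000):
    theta = theta - learning_rate * gradient(X, y, theta)

print(theta)               # roughly [1., 2.]
print(cost(X, y, theta))   # close to 0
```

If you want to sanity-check the derivative, you can also compare `gradient` against a finite-difference approximation of `cost`; both should agree up to numerical error.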

Broele