
This is related to the question Deriving cost function using MLE: Why use log function? I would like to simply comment and get an answer from @user76170, but as I am new here I am not yet allowed to comment. Hence this question:

I don't understand why @user76170 takes the likelihood to be $p(x_i\mid y_i)$. In my opinion it should be $p(y_i\mid x_i;\theta)$.

I also don't understand why in Andrew Ng's class the cost function is an average cost (we have $m$ observations, so we divide by $m$), like this: $$ -\frac{1}{m} \left(\sum_{i=1}^m y_i\log(h_\theta(x_i))+(1-y_i) \log(1-h_\theta(x_i))\right) $$ What happens if we don't take the average?
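For concreteness, here is a minimal NumPy sketch of that averaged cost; the function name `cost` and the toy data are my own illustration, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Averaged logistic-regression cost from the formula above."""
    h = sigmoid(X @ theta)  # h_theta(x_i) for every observation
    m = len(y)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# Toy data: intercept column plus one feature, three observations.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
theta = np.zeros(2)
print(cost(theta, X, y))  # log(2) ≈ 0.6931 at theta = 0
```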

1 Answer


The likelihood is $p(y_i\mid x_i;\theta)$. It should never be $p(x_i\mid y_i)$: in a discriminative model the $x_i$ are not treated as random.
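To connect this to the cost in the question (a standard derivation, spelled out here for clarity): for independent Bernoulli observations with $p(y_i=1\mid x_i;\theta)=h_\theta(x_i)$, the conditional likelihood and its negative log are
$$
\begin{aligned}
L(\theta) &= \prod_{i=1}^m p(y_i\mid x_i;\theta) = \prod_{i=1}^m h_\theta(x_i)^{y_i}\,(1-h_\theta(x_i))^{1-y_i},\\
-\log L(\theta) &= -\sum_{i=1}^m \Big( y_i\log h_\theta(x_i) + (1-y_i)\log(1-h_\theta(x_i)) \Big),
\end{aligned}
$$
which is exactly $m$ times the averaged cost in the question.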

Whether you take the average of the cost function does not matter for the optimum, because minimizing $L$ is the same as minimizing $\frac{1}{m}L$. The averaging sometimes makes matters simpler in a regularized MLE framework; for instance, with an averaged loss the regularization weight stays on a comparable scale as $m$ grows.
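A quick numerical sketch of the first point, with toy one-dimensional data of my own choosing: the total and averaged costs differ by the constant factor $\frac{1}{m}$, so they are minimized at the same $\theta$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny 1-D logistic regression (illustrative toy data).
x = np.array([-2.0, -1.0, 0.5, 1.5, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

thetas = np.linspace(-5, 5, 1001)
total = np.array([-(y * np.log(sigmoid(t * x)) +
                    (1 - y) * np.log(1 - sigmoid(t * x))).sum()
                  for t in thetas])
avg = total / len(y)  # dividing by m rescales the curve ...

# ... but does not move its minimizer.
print(thetas[total.argmin()], thetas[avg.argmin()])  # same theta
```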

Yining Wang
  • Can you suggest a website or book I could consult about the regularized MLE framework? – Ngok Chao HO Jun 29 '17 at 16:28
  • Sure. If you're looking for a Bayesian interpretation, "Machine Learning: A Probabilistic Perspective" would be a good reference. For a frequentist viewpoint, you may wish to read "Elements of Statistical Learning". – Yining Wang Jun 30 '17 at 01:46