
I am working through this post.

Here is a formula for the likelihood of a single data point:

$$ P(y_i ) = h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$

$P(y_i)$ is known as the likelihood of the single data point $\mathbf{x}_i$: given the value of $y_i$, it is the probability of $\mathbf{x}_i$ occurring, i.e. the conditional probability $P(\mathbf{x}_i \mid y_i)$.

What is the detailed reasoning by which this likelihood can be represented as a conditional probability?

JJJohn

1 Answer


I think the explanation you linked to has an error at this point. (This is surprising because the answer you linked to has 17 upvotes.) I would say that $$ P(y_i \mid x_i) = h_\theta(x_i)^{y_i }(1-h_\theta(x_i))^{1-y_i}. $$ This is just a compact way of writing a Bernoulli probability: since $y_i \in \{0,1\}$, the right-hand side equals $h_\theta(x_i)$ when $y_i = 1$ and $1 - h_\theta(x_i)$ when $y_i = 0$, so it says that $y_i$ is Bernoulli with success probability $h_\theta(x_i)$. The feature vectors $x_i$ are given, and the observed data are the corresponding labels $y_i$. We choose $\theta$ to make the likelihood of the observed labels as large as possible.
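For concreteness, here is a minimal Python sketch of this per-point likelihood. It assumes the usual logistic-regression choice $h_\theta(x) = 1/(1 + e^{-\theta^\top x})$; the excerpt never pins down $h_\theta$, so that form is an assumption:

```python
import numpy as np

def sigmoid(z):
    """Logistic function; a common (assumed) choice for h_theta."""
    return 1.0 / (1.0 + np.exp(-z))

def point_likelihood(theta, x_i, y_i):
    """P(y_i | x_i) = h(x_i)^y_i * (1 - h(x_i))^(1 - y_i), with y_i in {0, 1}."""
    h = sigmoid(theta @ x_i)  # assumption: h_theta(x) = sigmoid(theta . x)
    return h ** y_i * (1.0 - h) ** (1 - y_i)

# With y_i = 1 the expression reduces to h(x_i); with y_i = 0, to 1 - h(x_i).
theta = np.array([0.5, -0.25])
x_i = np.array([1.0, 2.0])
print(point_likelihood(theta, x_i, 1))  # equals sigmoid(theta @ x_i)
print(point_likelihood(theta, x_i, 0))  # equals 1 - sigmoid(theta @ x_i)
```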

The likelihood of all the observed labels, assuming the data points are independent, is \begin{align} L(\theta) &= P(y \mid x) \\ &= \prod_{i=1}^n P(y_i \mid x_i) \\ &=\prod_{i=1}^n h_\theta(x_i)^{y_i }(1-h_\theta(x_i))^{1-y_i}. \end{align} The log-likelihood function is $$ \log L(\theta) = \sum_{i=1}^n y_i \log(h_\theta(x_i)) + (1-y_i) \log(1-h_\theta(x_i)). $$ We choose $\theta$ to maximize $\log L(\theta)$; taking the log turns the product into a sum and does not change the maximizer, since $\log$ is increasing.
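And here is a sketch of maximizing $\log L(\theta)$ numerically by gradient ascent. The gradient $\nabla_\theta \log L = X^\top(y - h)$ used below is specific to the sigmoid choice of $h_\theta$, which again is an assumption on my part:

```python
import numpy as np

def log_likelihood(theta, X, y):
    """log L(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))  # assumed sigmoid form of h_theta
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def fit(X, y, lr=0.1, steps=1000):
    """Gradient ascent on log L; for the sigmoid model, grad = X^T (y - h)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        theta += lr * X.T @ (y - h)
    return theta

# Toy usage: a column of ones (intercept) plus one feature.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]])
y = np.array([1, 0, 1, 0])
theta_hat = fit(X, y)
print(theta_hat, log_likelihood(theta_hat, X, y))
```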

I think the post you linked to did not explain that part correctly.

littleO