
A Naive Bayes predictor makes its predictions using this formula:

$$P(Y=y|X=x) = \alpha P(Y=y)\prod_i P(X_i=x_i|Y=y)$$

where $\alpha$ is a normalizing factor. This requires estimating the parameters $P(X_i=x_i|Y=y)$ from the data. If we do this with $k$-smoothing, then we get the estimate

$$\hat{P}(X_i=x_i|Y=y) = \frac{\#\{X_i=x_i,Y=y\} + k}{\#\{Y=y\}+n_ik}$$

where there are $n_i$ possible values for $X_i$. I'm fine with this. However, for the prior, we have

$$\hat{P}(Y=y) = \frac{\#\{Y=y\}}{N}$$

where there are $N$ examples in the data set. Why don't we also smooth the prior? Or rather, do we smooth the prior? If so, what smoothing parameter do we choose? It seems slightly odd to reuse the same $k$, since we're estimating a different distribution. Is there a consensus? Or does it not matter much?
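To make this concrete, here is a rough NumPy sketch of the two estimators I have in mind (the function names and the toy data are just for illustration; the optional pseudocount on the prior is exactly the thing I'm asking about):

```python
import numpy as np

def smoothed_likelihood(x_col, y, value, cls, k, n_values):
    """Additive-k estimate of P(X_i = value | Y = cls)."""
    joint = np.sum((x_col == value) & (y == cls))   # #{X_i = x_i, Y = y}
    class_count = np.sum(y == cls)                  # #{Y = y}
    return (joint + k) / (class_count + n_values * k)

def prior(y, cls, k=0.0, n_classes=None):
    """Estimate of P(Y = cls); k = 0 gives the plain relative frequency."""
    if n_classes is None:
        n_classes = len(np.unique(y))
    return (np.sum(y == cls) + k) / (len(y) + n_classes * k)

# Toy data: 8 examples, one binary feature, two classes.
X1 = np.array([0, 1, 1, 0, 1, 0, 0, 1])   # feature X_1 with n_1 = 2 possible values
y  = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # class label Y

print(smoothed_likelihood(X1, y, value=1, cls=0, k=1, n_values=2))  # (2+1)/(3+2) = 0.6
print(prior(y, cls=0))        # 3/8 = 0.375, the unsmoothed prior above
print(prior(y, cls=0, k=1))   # (3+1)/(8+2) = 0.4, prior with the same pseudocount k
```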


1 Answer


The typical reason for smoothing in the first place is to handle cases where $\#\{X_i = x_i, Y = y\} = 0$. Without smoothing, a single zero count would force $\hat{P}(Y=y|X=x) = 0$ for that class, no matter what the other features say.

This happens, for example, when classifying text documents: you encounter a word that wasn't in your training data, or that simply never appeared in documents of some particular class.
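Here is a toy illustration of that failure mode, with made-up word counts for a single class:

```python
from math import prod

# Made-up word counts for one class; "zebra" never appeared in its training documents.
counts = {"the": 50, "ball": 7, "zebra": 0}
total = sum(counts.values())
vocab_size = len(counts)

def p_word(word, k=0.0):
    # Additive-k estimate of P(word | class); k = 0 is the raw relative frequency.
    return (counts[word] + k) / (total + vocab_size * k)

doc = ["the", "zebra", "ball"]
print(prod(p_word(w, k=0) for w in doc))  # 0.0 -- the single unseen word zeroes the product
print(prod(p_word(w, k=1) for w in doc))  # small but nonzero
```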

On the other hand, in the case of the class prior probability $P(Y = y)$, this situation should not occur: if it did, it would mean you were trying to assign objects to classes that did not appear in the training data at all.
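And even if you did decide to smooth the prior as well, with any reasonable amount of training data the effect is negligible, because the pseudocount is added to class counts that are typically large. A quick illustration with made-up class counts:

```python
import numpy as np

# Hypothetical class counts for a training set of N = 1000 examples, 2 classes.
counts = np.array([400, 600])
N, C = counts.sum(), len(counts)

mle_prior = counts / N                   # plain relative frequency, as in the question
smoothed_prior = (counts + 1) / (N + C)  # add-one smoothing, if you insisted on it

print(mle_prior)        # [0.4 0.6]
print(smoothed_prior)   # [0.4002 0.5998] -- essentially unchanged
```

For what it's worth, this also matches what common implementations do: scikit-learn's MultinomialNB, for instance, applies its alpha pseudocount to the feature counts only and estimates the class prior by relative frequency unless you override it.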

Also, I've never encountered the term $k$-smoothing; Laplace smoothing or additive smoothing are the much more common names.
