
I am learning machine learning and encountered the KL divergence: $$ \int p(x) \log\left(\frac{p(x)}{q(x)}\right) \, \text{d}x $$ I understand that this quantity measures the difference between two probability distributions. I have rewritten the formula as follows (where $p(x)$ is the true distribution and $q(x)$ is the approximating distribution): $$ \int p(x) \log\left(\frac{1}{q(x)}\right) \, \text{d}x - \int p(x) \log\left(\frac{1}{p(x)}\right) \, \text{d}x $$ I read this as the difference between the entropy of the approximating distribution and the entropy of the true distribution.

My question is: why is the $\log\left(\frac{1}{q(x)}\right)$ term, which I am reading as the entropy of the approximating distribution, weighted by $p(x)$ instead of $q(x)$?
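For concreteness, here is a minimal numerical sketch (NumPy, with made-up probabilities; not part of the original post). It checks that the decomposition above reproduces the KL divergence when both terms are averaged under $p$, and contrasts it with weighting the first term by $q$, which gives $H(q) - H(p)$ instead.

```python
import numpy as np

# Illustrative discrete distributions on the same 3-point support
# (the names p and q follow the question; the numbers are made up).
p = np.array([0.5, 0.3, 0.2])   # "true" distribution p(x)
q = np.array([0.4, 0.4, 0.2])   # approximating distribution q(x)

# KL(p || q) = sum_x p(x) * log(p(x) / q(x))
kl = np.sum(p * np.log(p / q))

# The same value written as the decomposition in the question,
# with BOTH terms averaged under p:
first_term  = np.sum(p * np.log(1.0 / q))   # E_p[-log q(x)]
second_term = np.sum(p * np.log(1.0 / p))   # E_p[-log p(x)] = H(p)

print(kl, first_term - second_term)          # identical up to rounding

# Weighting the first term by q instead of p gives a different quantity,
# H(q) - H(p), which can be negative and does not measure how well
# q approximates p.
alternative = np.sum(q * np.log(1.0 / q)) - second_term
print(alternative)
```

Note that the first term as written, $\mathbb{E}_p[-\log q(x)]$, is the cross-entropy of $q$ relative to $p$ rather than the entropy of $q$; the two coincide only when $q = p$.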

  • Are you asking why the Kullback-Leibler divergence is defined the way it is? It doesn’t rely on one distribution being true and the other an approximation. Also: MathJax. – A rural reader Jun 08 '24 at 16:52
  • Maybe here: https://math.stackexchange.com/questions/90537/what-is-the-motivation-of-the-kullback-leibler-divergence – Konstruktor Jun 08 '24 at 17:54
  • It is measuring the average effect of using $q$ when $p$ is the "right" distribution, relative to the effect of using $p$ when $p$ is the "right" distribution. Exactly what is meant by "effect" and "right" depends on which of many interpretations is given to the KL divergence. In all cases, this results in a nonnegative value. – r.e.s. Jun 08 '24 at 19:22
  • Thanks all for the replies. – Dmitry_IT_03 Jun 08 '24 at 19:50
  • From a pragmatic point of view, since $-\log(x)$ is a convex function, it follows easily by Jensen's inequality that this is an easy way to detect whether two random variables are identically distributed (the Jensen step is sketched after these comments). Moreover, if you set $q$ to the Gaussian density, then the positivity of the KL-divergence demonstrates that for a given variance, the Gaussian is the unique distribution that maximizes Shannon entropy. – Deane Feb 09 '25 at 19:44
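To fill in the Jensen step mentioned in the last comment (a reconstruction, not part of the original thread): since $-\log$ is convex, $\mathbb{E}[-\log X] \ge -\log \mathbb{E}[X]$, and taking $X = q(x)/p(x)$ with the expectation under $p$ gives

$$ \int p(x) \log\left(\frac{p(x)}{q(x)}\right) \, \text{d}x = \mathbb{E}_p\!\left[-\log \frac{q(x)}{p(x)}\right] \ge -\log \mathbb{E}_p\!\left[\frac{q(x)}{p(x)}\right] = -\log \int q(x) \, \text{d}x = -\log 1 = 0, $$

with equality exactly when $q(x) = p(x)$ almost everywhere, which is the identical-distribution test the comment refers to.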

0 Answers