The Kullback-Leibler Divergence is defined as
$$K(f:g) = \int \left(\log \frac{f(x)}{g(x)} \right) \ dF(x)$$
It measures the distance between two distributions $f$ and $g$. Why would this be better than the Euclidean distance in some situations?
The short answer is that the KL divergence has a probabilistic/statistical meaning (a lot of them, in fact) while the Euclidean distance does not. For example, a given difference $f(x)-g(x)$ has a completely different meaning depending on the absolute sizes of $f(x)$ and $g(x)$.
The Wikipedia page on the subject is a must-read, naturally. Let me explain only one interpretation of the KL divergence. Assume a random i.i.d. sample $\mathfrak X=(x_k)_{1\leqslant k\leqslant n}$ follows the distribution $f$ and a random i.i.d. sample $\mathfrak Y=(y_k)_{1\leqslant k\leqslant n}$ follows the distribution $g$. A way to distinguish $\mathfrak X$ from $\mathfrak Y$ is to ask for the likelihood that $\mathfrak Y$ behaves like $\mathfrak X$, that is, that $\mathfrak Y$ behaves like a typical sample from $f$.
More precisely, one wants to estimate how small the likelihood of such a typical $f$ configuration is under $g$, compared with the ordinary likelihood it has under $f$ itself, that is, the likelihood achieved by the genuine $f$ sample $\mathfrak X$.

The computation is rather simple and based on the following. Assume that $N(x,x+\mathrm dx)$ values from the sample fall in each interval $(x,x+\mathrm dx)$. Then the likelihood of this configuration under $g$ scales like
$$ \prod g(x)^{N(x,x+\mathrm dx)}=\exp\left(\sum N(x,x+\mathrm dx)\log g(x)\right). $$
For a typical $f$ sample, $N(x,x+\mathrm dx)\approx nf(x)\,\mathrm dx$ when $n\to\infty$, for every $x$, hence the likelihood under $g$ of $\mathfrak Y$ masquerading as an $f$ sample scales like
$$ \ell_n(f\mid g)\approx\exp\left(n\int f(x)\log g(x)\,\mathrm dx\right). $$
On the other hand, the likelihood of the same typical $f$ configuration under $f$ itself scales like
$$ \ell_n(f\mid f)\approx\exp\left(n\int f(x)\log f(x)\,\mathrm dx\right). $$
By Gibbs' inequality, $\ell_n(f\mid g)\ll\ell_n(f\mid f)$, as was to be expected, and the ratio $\dfrac{\ell_n(f\mid g)}{\ell_n(f\mid f)}$ decreases exponentially fast when $n\to\infty$, approximately like $\mathrm e^{-nH}$, where
$$ H=\int f(x)\log f(x)\,\mathrm dx-\int f(x)\log g(x)\,\mathrm dx=K(f:g). $$
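For concreteness, here is a minimal numerical sketch of this rate (not part of the original argument, and assuming NumPy/SciPy are available): $f$ and $g$ are taken to be two Gaussians, an arbitrary choice for which $K(f:g)$ has a closed form, and the per-observation log of $\ell_n(f\mid g)/\ell_n(f\mid f)$ computed from a large $f$ sample is compared with $-K(f:g)$.

```python
# A minimal numerical sketch (not from the answer above): f = N(0,1) and
# g = N(1,2) are arbitrary choices for which K(f:g) has a closed form.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 200_000

mu_f, sd_f = 0.0, 1.0        # f = N(0, 1)
mu_g, sd_g = 1.0, 2.0        # g = N(1, 2)

x = rng.normal(mu_f, sd_f, size=n)    # a typical f sample

# (1/n) * log of the likelihood ratio ell_n(f|g) / ell_n(f|f)
empirical_rate = np.mean(norm.logpdf(x, mu_g, sd_g) - norm.logpdf(x, mu_f, sd_f))

# Closed-form K(f:g) for two Gaussians
kl = np.log(sd_g / sd_f) + (sd_f**2 + (mu_f - mu_g)**2) / (2 * sd_g**2) - 0.5

print(f"(1/n) log likelihood ratio: {empirical_rate:.4f}")
print(f"-K(f:g)                   : {-kl:.4f}")
```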
The Kullback-Leibler divergence can be regarded as better in the following sense:

For two probability measures $P$ and $Q$, Pinsker's inequality states that $$ |P-Q|\le [2\, KL(P\|Q)]^{\frac{1}{2}},$$ where the left-hand side is the total variation distance (which corresponds to the $\ell_1$-norm). So convergence in the KL-divergence sense is stronger than convergence in total variation. The motivation comes from information theory, as Jeff pointed out.
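A quick numerical illustration of the inequality (a sketch only, assuming NumPy; the random categorical distributions drawn from a Dirichlet are arbitrary choices):

```python
# A minimal numerical sketch of Pinsker's inequality; the Dirichlet draws are
# just a convenient way to generate random pmfs P and Q on 10 points.
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    """KL(P||Q) in nats for strictly positive pmfs on a common support."""
    return float(np.sum(p * np.log(p / q)))

for _ in range(5):
    p = rng.dirichlet(np.ones(10))        # random probability vector P
    q = rng.dirichlet(np.ones(10))        # random probability vector Q
    l1 = np.abs(p - q).sum()              # |P - Q| in the ell_1 convention
    print(f"|P-Q| = {l1:.3f}  <=  sqrt(2 KL(P||Q)) = {np.sqrt(2 * kl(p, q)):.3f}")
```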
Following a calculation similar to the great answer already posted here, the KL divergence is essential to maximum likelihood estimation (MLE) given a set of discrete data points. The probability distribution that minimizes the KL divergence to the observed frequencies is the one evaluated at the parameter value(s) maximizing the likelihood. This is in contrast to the case where the measurements entering the fit have errors that follow a normal distribution, in which case minimizing the Euclidean distance is equivalent to performing MLE.
Suppose that you have $N$ i.i.d. events that each yield one of $M$ different results, and you measure result $i$ a total of $n_i$ times. You have access to a theory that tells you the predicted probability of getting result $i$ given some underlying parameter(s) $\theta$, namely $p(i|\theta)$, and you are trying to construct an estimator $\hat{\theta}$ that maximizes the likelihood
$$\mathcal{L}({\theta}|n_1,\cdots,n_M)=\prod_{k=1}^N p(i_k|{\theta})=\prod_{i=1}^M [p(i|{\theta})]^{n_i}.$$
Straightforward calculations provide us with an alternate quantity to maximize:
$$\begin{aligned} \hat{\theta}&=\arg\max_{\theta}\mathcal{L}({\theta}|n_1,\cdots,n_M)\\ &=\arg\max_{\theta}\ln\mathcal{L}({\theta}|n_1,\cdots,n_M)\\ &=\arg\max_{\theta}\sum_{i=1}^M n_i \ln p(i|\theta)\\ &=\arg\max_{\theta}\sum_{i=1}^M q_i \ln p(i|\theta), \end{aligned}$$
where we have defined the measured frequencies as $q_i=n_i/N$ (dividing by $N>0$ does not change the maximizer). We continue with the manipulations, using that $\sum_{i=1}^M q_i\ln q_i$ does not depend on $\theta$:
$$\begin{aligned} \hat{\theta}&=\arg\max_{\theta}\left[\sum_{i=1}^M q_i \ln p(i|\theta)- \sum_{i=1}^M q_i\ln q_i\right]\\ &=\arg\max_{\theta}\sum_{i=1}^M q_i \ln \frac{p(i|\theta)}{q_i}\\ &=\arg\min_{\theta}\sum_{i=1}^M q_i \ln \frac{q_i}{p(i|\theta)}\\ &=\arg\min_{\theta}K[Q(n_1,\cdots,n_M):P(\theta)]. \end{aligned}$$
We immediately see that the estimator $\hat{\theta}$ arises from minimizing the KL divergence between the measured frequencies $q_i$ and the theoretical probabilities $p(i|\theta)$ that depend on the underlying parameter(s). This is preferable to forming an estimator by minimizing the Euclidean distance between $Q(n_1,\cdots,n_M)$ and $P(\theta)$, because that estimator only maximizes the likelihood function when the errors are normally distributed.
Ref: "Weighing the odds" by Williams.
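As a quick sanity check of this equivalence, here is a minimal sketch (assuming NumPy/SciPy; the binomial model $p(i|\theta)$, the sample size, and the grid search are illustrative choices, not part of the answer): maximizing the log-likelihood over a grid of $\theta$ values and minimizing $K[Q:P(\theta)]$ over the same grid pick out the same $\theta$.

```python
# Illustrative sketch: MLE over theta coincides with minimizing K[Q:P(theta)].
# The binomial model p(i|theta) = Binom(M-1, theta) pmf is an arbitrary choice.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(2)

def kl_discrete(q, p):
    """K[Q:P] = sum_i q_i log(q_i / p_i), with the convention 0 log 0 = 0."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

M = 6                                     # results i = 0, ..., M-1
i = np.arange(M)
theta_true = 0.3                          # used only to simulate the data
counts = np.bincount(rng.binomial(M - 1, theta_true, size=1000), minlength=M)
q = counts / counts.sum()                 # measured frequencies q_i = n_i / N

thetas = np.linspace(0.01, 0.99, 981)
log_lik = [np.sum(counts * binom.logpmf(i, M - 1, t)) for t in thetas]
kl_vals = [kl_discrete(q, binom.pmf(i, M - 1, t)) for t in thetas]

print("theta maximizing the likelihood:", thetas[np.argmax(log_lik)])
print("theta minimizing K[Q:P(theta)] :", thetas[np.argmin(kl_vals)])
```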
[This is similar to the answer here]
Let ${ Y _1, Y _2 , \ldots }$ be an iid sample with true pdf ${ f . }$ Let ${ g }$ be another pdf. By the Strong Law of Large Numbers, with ${ \mathbb{P} _f }$ probability ${ 1 }$ we have
$${ \frac{1}{n} \ln \frac{\text{lhd}(g; Y _1, \ldots, Y _n)}{\text{lhd}(f; Y _1, \ldots, Y _n)} = \frac{1}{n} \sum _{i = 1} ^{n} \ln \frac{g(Y _i)}{f(Y _i)} \longrightarrow \mathbb{E} _f \ln \frac{g(Y)}{f(Y)} . }$$
Note that by Jensen's inequality we have
$${ \mathbb{E} _f \ln \frac{g(Y)}{f(Y)} \leq \ln \mathbb{E} _f \frac{g(Y)}{f(Y)} = \ln \int g(y) \, \mathrm{d}y = \ln 1 = 0 . }$$
We define the relative entropy as
$${ \text{App}(f \leftarrow g) := - \mathbb{E} _f \ln \frac{g(Y)}{f(Y)} = \mathbb{E} _f \ln \frac{f(Y)}{g(Y)} \geq 0 , }$$ which is exactly the KL divergence ${ K(f : g) }$ defined in the question.
So roughly speaking, the typical likelihood ratio ${ \frac{\text{lhd}(g; Y _1, \ldots, Y _n)}{\text{lhd}(f; Y _1, \ldots, Y _n)} }$ of a sample ${ Y _1, \ldots, Y _n }$ is of the order of magnitude ${ e ^{- n \text{App}(f \leftarrow g) } . }$ The larger the relative entropy ${ \text{App}(f \leftarrow g) , }$ the smaller the typical likelihood ratio of a sample, and the larger the deviation of the pdf ${ g }$ from the true pdf ${ f . }$
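Here is a minimal numerical sketch of this order-of-magnitude statement (assuming NumPy; the two categorical pmfs are arbitrary choices): for i.i.d. samples from ${ f }$ of increasing size, the per-observation log likelihood ratio settles near ${ -\text{App}(f \leftarrow g) . }$

```python
# Minimal sketch: for an i.i.d. sample from f, the log likelihood ratio
# log[lhd(g)/lhd(f)] grows like -n * App(f <- g). The pmfs are arbitrary.
import numpy as np

rng = np.random.default_rng(3)

f = np.array([0.5, 0.3, 0.2])             # true pmf
g = np.array([0.2, 0.3, 0.5])             # competing pmf
app = np.sum(f * np.log(f / g))           # App(f <- g) = E_f log(f/g)

for n in (100, 1_000, 10_000, 100_000):
    y = rng.choice(len(f), size=n, p=f)   # i.i.d. sample from f
    log_ratio = np.sum(np.log(g[y]) - np.log(f[y]))
    print(f"n = {n:>6}:  log lhd-ratio / n = {log_ratio / n:+.4f}   "
          f"-App(f <- g) = {-app:+.4f}")
```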