
Let $p_1(\cdot), p_2(\cdot)$ be two discrete distributions on $\mathbb{Z}$. The total variation distance is defined as $d_{TV}(p_1,p_2)= \frac{1}{2} \displaystyle \sum_{k \in \mathbb{Z}}|p_1(k)-p_2(k)|$, and Shannon entropy is defined in the usual way, i.e. $$ H(p_1)=\sum_k p_1(k) \log\frac{1}{p_1(k)}. $$ The binary entropy function $h(\cdot)$ is defined by $h(x)=x \log(1/x)+(1-x)\log\bigl(1/(1-x)\bigr)$ for all $x \in (0,1)$.

I am trying to prove that $H(\frac{p_1+p_2}{2})-\frac{1}{2}H(p_1)-\frac{1}{2}H(p_2) \leq h(d_{TV}(p_1,p_2)/2)$. Can anyone guide me in this direction?
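
(A quick numerical sanity check of the claimed inequality, not part of the question: the Python snippet below uses ad hoc helper functions `entropy` and `h`, works in nats, and samples strictly positive distributions on a small finite support.)

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Shannon entropy (in nats) of a probability vector, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def h(x):
    """Binary entropy (in nats), with h(0) = h(1) = 0 by continuity."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * np.log(x) - (1 - x) * np.log(1 - x)

for _ in range(10_000):
    k = rng.integers(2, 10)              # support size
    p1, p2 = rng.dirichlet(np.ones(k)), rng.dirichlet(np.ones(k))
    lhs = entropy((p1 + p2) / 2) - 0.5 * entropy(p1) - 0.5 * entropy(p2)
    tv = 0.5 * np.sum(np.abs(p1 - p2))   # total variation distance
    assert lhs <= h(tv / 2) + 1e-12      # the claimed inequality (with slack for rounding)

print("inequality held on all random trials")
```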

  • Out of curiosity, where did that question arise? – Clement C. Oct 15 '15 at 23:47
  • I would write of a function $h$ rather than of a function $h(\cdot)$, reserving the parentheses to express a value of the function at some argument, as in $\text{“}h(x)=\text{some expression depending on }x\text{''}$. However, if you feel strongly that you need the parentheses, the proper notation is $h(\cdot)$ rather than $h(.)$. I edited accordingly. ${}\qquad{}$ – Michael Hardy Oct 15 '15 at 23:53
  • @ClementC.: The exact problem statement is as follows: $X \sim \mathrm{Bern}(0.5)$, $\mathbb{P}(Y=k\mid X=0)=p_1(k)$, $\mathbb{P}(Y=k\mid X=1)=p_2(k)$. I am trying to prove $I(X;Y) \leq h(d_{TV}(p_1,p_2))$ – pikachuchameleon Oct 16 '15 at 02:37
  • @AshokVardhan I am deleting my previous comments, since they are no longer relevant to the question after the correction/edit you made. On a side note, I wonder if looking as the other expression of TV, namely $\sup_S (p_1(S) - p_2(S))$, would help as a first step. – Clement C. Oct 17 '15 at 14:22
  • @Ashok Vardhan. Perhaps in this way. Let $P^n =\{p^n_1,\dots,p^n_n\}$ and $Q^n =\{q^n_1,\dots,q^n_n\}$, with $\sum_i p^n_i = \sum_i q^n_i=1$. First we put $n=2$ and show that the inequality is true. Then we can try to prove it in general using the following recurrence: $H(P^n)=(1-p^n_n) H(P^{n-1})- p^n_n\log (p^n_n)$, where $p^{n-1}_i=p^n_i/(1-p^n_n)$ for $i \in \{1,\dots,n-1\}$ (and analogously for $Q$). – Yog Urt Oct 22 '15 at 21:20
  • Without some assumptions on the entropies of $p_1,p_2$, it seems that what you are trying to prove may lead to trouble, because the right-hand side of the inequality, namely $h(d_{TV}(p_1,p_2)/2)$, is always finite (clearly $d_{TV}(p_1,p_2)\leq 1$), but the left-hand side can be infinite. For example, take $p_1$ with $H(p_1)=\infty$. Now choose a second distribution $p_2$ for which $H(p_2)$ is finite. You get: $$\infty-\frac{1}{2}\infty-\frac{1}{2}H(p_2)\leq C$$ for some positive number $C>0$. This does not make much sense. – uniquesolution Oct 23 '15 at 21:44
  • @uniquesolution Support them on some finite set. The quantities are then well defined. I think an elegant solution even for Bernoulli $p_1$ and $p_2$ would be pretty cool. – stochasticboy321 Oct 24 '15 at 02:42
  • @AshokVardhan Did this come up in your own work, or is this from a textbook? I have a long-winded proof strategy in mind, which I'll post unless it is a problem that is expected to be solved fairly easily. In the latter case, I'm probably thinking along the wrong lines. – stochasticboy321 Oct 24 '15 at 18:42
  • @uniquesolution Here's a nicer argument why the LHS is bounded: From the third comment to this question, it's the mutual information of a Bernoulli random variable with something else, and is hence upper bounded by $1$. – stochasticboy321 Oct 24 '15 at 18:44
  • Thank you. It would be nice if people asked what they mean to ask, so that they are not asked what they mean to mean :) – uniquesolution Oct 24 '15 at 19:10
  • @stochasticboy321 : I do not have an elegant solution. However, the inequality is true for a pair of Bernoulli distributions. If their parameters are $\alpha$ and $\beta$, respectively, equality holds for $\alpha = \beta$ or $\alpha = 1-\beta$. (This is an edit. Swapped parity when writing the last clause.) – Eric Towers Oct 24 '15 at 22:05
  • The inequality does hold, in general, for two-letter distributions, and I can show this through some horrible optimisation that made my stomach crawl as I went through it. To see the cases you mentioned in particular, note that they render $X$ and $Y$ independent, making the LHS above $0$. – stochasticboy321 Oct 25 '15 at 00:38
  • @stochasticboy321 Can you post that as an answer? I have 11h to award the bounty, or it goes to oblivion... even a partial answer is better than nothing. – Clement C. Oct 25 '15 at 15:43
  • @stochasticboy321: It came up in my work; it's not from a textbook. I really need the proof of the above statement. – pikachuchameleon Oct 26 '15 at 01:43
  • @ClementC. The bounty isn't that important, it's just that it's a horrible solution that leads to no real insight on the problem. – stochasticboy321 Oct 26 '15 at 05:39
  • @AshokVardhan I'll try to work on the idea I had if I find a bit of time. If I don't seem to make much progress in a day or two, or can't find the time, I'll type it up and post it here. – stochasticboy321 Oct 26 '15 at 05:39
  • You're going to need more assumptions. Rearranging the sum on the left side (maybe this is what you started with?), you get $$\frac{1}{2}D_{KL}\!\left(P_1 \,\middle\|\, \frac{P_1+P_2}{2}\right) + \frac{1}{2}D_{KL}\!\left(P_2 \,\middle\|\, \frac{P_1+P_2}{2}\right).$$ This can clearly be made bigger than 1 (the max of $h(\cdot)$). Reading the comments I am amazed that this is even true for Bernoulli RVs. – Christian Chapman Oct 28 '16 at 18:02

1 Answer


The second part below is rather inelegant, and I think it can probably be improved. Suggestions are welcome.


Note that the LHS is the Jensen-Shannon divergence ($\mathrm{JSD}$) between $P_1$ and $P_2$, and that $\mathrm{JSD}$ is an $f$-divergence (as is $d_{TV}$). For a pair of $f$-divergences $D_f, D_g$, generated by $f$ and $g$ respectively, the joint ranges are defined as \begin{align} \mathbb{R}^2 \supset \mathcal{R} :=& \{ (D_f(P\|Q), D_g(P\|Q)): P, Q \textrm{ are distributions on some measurable space} \}, \\ \mathcal{R}_k :=& \{ (D_f(P\|Q), D_g(P\|Q)): P, Q \textrm{ are distributions on } ([1:k], 2^{[1:k]} ) \}. \end{align}

A remarkable theorem of Harremoës and Vajda (see also these notes by Wu) states that for any pair of $f$-divergences, $$\mathcal{R} = \mathrm{co}(\mathcal{R}_2),$$ where $\mathrm{co}$ denotes the convex hull.

Now, we want to show the relation $\mathrm{JSD} \le h(d_{TV}/2)$. Since both $\mathrm{JSD}$ and $d_{TV}$ are $f$-divergences, we may apply the theorem to the pair $(d_{TV}, \mathrm{JSD})$. The set $\mathcal{S} := \{ (x,y) \in [0,1]\times\mathbb{R} : y \le h(x/2)\}$ is convex (because $h$ is concave), so it suffices to show the inequality for distributions on $2$ symbols: since the convex hull of a set is the intersection of all convex sets containing it, $\mathcal{R}_2 \subset \mathcal{S} \implies \mathcal{R} = \mathrm{co}(\mathcal{R}_2) \subset \mathcal{S}$. The remainder of this answer will thus concentrate on showing $\mathcal{R}_2 \subset \mathcal{S}$.


Let $p := \pi + \delta$ and $q := \pi - \delta$, where $\delta \in [0,1/2]$ and $\pi \in [\delta, 1- \delta]$. Every pair of distributions on two symbols can be written this way (up to swapping $p$ and $q$), so it suffices to show that $$ \mathrm{JSD}(\mathrm{Bern}(p)\|\mathrm{Bern}(q) ) \le h\left(\tfrac{1}{2}\,d_{TV}(\mathrm{Bern}(p),\mathrm{Bern}(q) )\right) = h(\delta), \tag{1}$$ where we used $d_{TV}(\mathrm{Bern}(p),\mathrm{Bern}(q)) = p - q = 2\delta$. Note that $p\ge q$ throughout, but this doesn't matter since both $\mathrm{JSD}$ and $d_{TV}$ are symmetric in their arguments.

For conciseness I'll represent the $\mathrm{JSD}$ above by $J$. All '$\log$'s in the following are natural, and we will make use of the simple identities, for $p \in (0,1)$: $$ \frac{\mathrm{d}}{\mathrm{d}p} h(p) = \log \frac{1-p}{p}, \qquad \frac{\mathrm{d}^2}{\mathrm{d}p^2} h(p) = -\frac{1}{p} - \frac{1}{1-p}. $$
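
(These identities are standard; if you want to double-check them, a short SymPy computation does it. This is just a convenience, not part of the argument.)

```python
import sympy as sp

p = sp.symbols('p', positive=True)
h = -p * sp.log(p) - (1 - p) * sp.log(1 - p)   # binary entropy, natural log

d1 = sp.diff(h, p)       # should be log(1-p) - log(p), i.e. log((1-p)/p)
d2 = sp.diff(h, p, 2)    # should be -1/p - 1/(1-p)

assert sp.simplify(d1 - (sp.log(1 - p) - sp.log(p))) == 0
assert sp.simplify(d2 - (-1 / p - 1 / (1 - p))) == 0
print("h'(p) =", d1, "   h''(p) =", sp.simplify(d2))
```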

By the expansion in the question (the mixture $\frac{p+q}{2}$ is $\mathrm{Bern}(\pi)$), $$J(\pi, \delta) = h( \pi) - \frac{1}{2} h(\pi + \delta) - \frac{1}{2} h(\pi - \delta).$$

It is trivial to see that the relation $(1)$ holds if $\delta = 0$. Let us thus assume that $\delta > 0.$ For $\pi \in (\delta, 1-\delta),$ we have

\begin{align} \frac{\partial}{\partial \pi} J &= \log \frac{1-\pi}{\pi} - \frac{1}{2} \left( \log \frac{1 - \pi - \delta}{\pi + \delta} + \log \frac{1 - \pi +\delta}{\pi - \delta}\right) \end{align} and \begin{align} \frac{\partial^2}{\partial \pi^2} J &= \frac{1}{2} \left( \frac{1}{\pi + \delta} + \frac{1}{\pi - \delta} + \frac{1}{1 - \pi - \delta} + \frac{1}{1 - \pi + \delta} \right) - \frac{1}{\pi} - \frac{1}{1-\pi} \\ &= \frac{\pi}{\pi^2 - \delta^2} - \frac{1}{\pi} + \frac{1 - \pi}{( 1-\pi)^2 - \delta^2} - \frac{1}{1-\pi} \\ &= \frac{\delta^2}{\pi(\pi^2 - \delta^2)} + \frac{\delta^2}{(1-\pi)( (1-\pi)^2 - \delta^2)} > 0, \end{align}
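
(A short SymPy check of the algebra above, confirming that $\partial^2 J/\partial\pi^2$ does simplify to the final expression; again this is only a sanity check, and the symbol names are placeholders.)

```python
import sympy as sp

pi, delta = sp.symbols('pi delta', positive=True)

def h(x):
    # binary entropy, natural log
    return -x * sp.log(x) - (1 - x) * sp.log(1 - x)

J = h(pi) - sp.Rational(1, 2) * h(pi + delta) - sp.Rational(1, 2) * h(pi - delta)

# Claimed closed form of the second derivative in pi.
claimed = delta**2 / (pi * (pi**2 - delta**2)) \
        + delta**2 / ((1 - pi) * ((1 - pi)**2 - delta**2))

assert sp.simplify(sp.diff(J, pi, 2) - claimed) == 0
print("second-derivative simplification verified")
```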

where the final inequality uses $\delta > 0,$ and that $ \pi \in (\delta, 1-\delta).$

As a consequence, for every fixed $\delta >0,$ $J$ is strictly convex in $\pi$ on $(\delta, 1-\delta).$ Since the maximum of a continuous convex function on a closed interval is attained at an endpoint, we have $$ J(\pi ,\delta) \le \max( J(\delta, \delta), J(1- \delta, \delta) ).$$

But $$J(\delta, \delta) = h(\delta) - \frac{1}{2} \left( h(2\delta) + h(0) \right) = h(\delta) - \frac{1}{2} h(2\delta),$$ and similarly $$J(1-\delta, \delta) = h(1-\delta) - \frac{1}{2} \left( h(1) + h(1-2\delta) \right) = h(\delta) - \frac{1}{2} h(2\delta),$$ by the symmetry of $h$ and the convention $h(0)=h(1)=0$. We immediately get that for every $\delta \in [0,1/2]$ and $\pi \in [\delta, 1-\delta]$, $$J(\pi, \delta) \le h(\delta) - \frac{1}{2} h(2\delta) \le h(\delta),$$ finishing the argument.
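
(For the skeptical reader, here is a small numerical check, over a grid of $(\pi,\delta)$, of the resulting two-letter bound $J(\pi,\delta) \le h(\delta) - \tfrac{1}{2} h(2\delta)$; the helper `h` below is the binary entropy in nats, extended by continuity.)

```python
import numpy as np

def h(x):
    """Binary entropy (in nats), with h(0) = h(1) = 0 by continuity."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * np.log(x) - (1 - x) * np.log(1 - x)

def J(pi, delta):
    """J(pi, delta) = h(pi) - h(pi + delta)/2 - h(pi - delta)/2."""
    return h(pi) - 0.5 * h(pi + delta) - 0.5 * h(pi - delta)

worst = -np.inf
for delta in np.linspace(0.0, 0.5, 201):
    bound = h(delta) - 0.5 * h(2 * delta)
    for pi in np.linspace(delta, 1 - delta, 201):
        worst = max(worst, J(pi, delta) - bound)

# Expect a value <= 0, up to floating-point error.
print("max of J(pi, delta) - (h(delta) - h(2*delta)/2) over the grid:", worst)
```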


Note that the last line indicates something stronger for $2$-symbol distributions: $J(\pi, \delta) \le h(\delta) - h(2\delta)/2$. Unfortunately, as a function of $\delta$ the RHS is convex rather than concave, so the joint-range argument above does not let this sharper bound extend directly to all alphabets. It would be interesting to know whether a bound with this kind of improvement can be shown in general.