
Let $\mathcal P$ be the set of probability mass functions (pmfs) on $\mathbb Z_{>0}$, i.e. for $p=(p(x))_{x\in\mathbb Z_{>0}}\in\mathcal P$ we have $p\ge 0$ and $\sum_{x=1}^\infty p(x)=1$. Let $H(p)=-\sum_{x=1}^\infty p(x)\ln(p(x))$ be the entropy and $E(p)=\sum_{x=1}^\infty p(x)x$ the expectation. Further, let $s(p)=|\{x\in\mathbb Z_{>0}:p(x)>0\}|$ be the size of the support of $p$.

For $s\in\mathbb Z_{>0}$ let $\mathcal P_s=\{p\in\mathcal P:s(p)=s\}$, and further let $\mathcal P_\infty=\{p\in\mathcal P:s(p)=\infty,E(p)<\infty\}$.

Question: What are the best bounds for $r(s)=\sup_{p\in\mathcal P_s}H(p)/E(p)$?

Motivation: I want good upper bounds for the entropy in terms of the support size and the expectation.

Background: The question only makes sense if we consider strictly positive random variables; otherwise $r(s)$ would be infinite, as can be seen by taking limits towards the point mass at $0$.

For this question we can assume that $p$ is supported on $\{1,\dots,s\}$ (respectively on $\mathbb Z_{>0}$ for $s=\infty$) and non-increasing, since sorting the weights in non-increasing order minimizes the expectation while preserving the entropy.

For $s<\infty$ we have $r(s)\le\ln(s)$, because $H(p)\le\ln(s)$ is maximal for the uniform distribution and $E(p)\ge 1$. Of course, the uniform distribution also gives a lower bound, namely $r(s)\ge 2\ln(s)/(s+1)$, which is not tight because the entropy is stationary at the uniform distribution while the expectation is not. Also, we know that the supremum is attained, by continuity and compactness. Of course, identifying the maximizers would be highly desirable.
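As a quick sanity check (a minimal sketch of my own using numpy, not part of the original post), the uniform distribution on $\{1,\dots,s\}$ indeed attains the stated ratio $2\ln(s)/(s+1)$:

```python
import numpy as np

# Ratio H/E for the uniform distribution on {1, ..., s}; it equals 2*ln(s)/(s+1).
for s in [2, 3, 5, 10, 100]:
    p = np.full(s, 1.0 / s)                   # uniform pmf on {1, ..., s}
    H = -np.sum(p * np.log(p))                # entropy in nats, equals ln(s)
    E = np.sum(p * np.arange(1, s + 1))       # expectation, equals (s + 1)/2
    print(s, H / E, 2 * np.log(s) / (s + 1))  # the last two columns agree
```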

For $s=\infty$ it is known that $H(p)=E(p)=\infty$ is possible, as discussed here. As can be seen here, there are also quite a few follow-up questions. Unfortunately, I am not convinced by the given answer, and I am still not aware of an answer to the question of whether $H(p)<\infty$ for all $p\in\mathcal P_\infty$. Should this be true, we may of course still have $r(\infty)=\infty$, and in any case explicit maximizing sequences would be highly desirable.

Finally, a similar question regarding a lower bound can be found here.

Update: A limiting argument directly yields that $r(s+1)\ge r(s)$. As discussed here, we have $H(p)\le\ln(E(p)+0.5)+1$ given by Theorem 8 in this preprint. The map $f(x)=(\ln(x+0.5)+1)/x$ is decreasing on $[1,\infty)$, with $f(x)=1$ for $x\approx 1.858$. Since we can assume that $p$ is non-increasing, we have $E(p)\le\frac{1+s}{2}$.
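For reference, a tiny numerical check of the crossover point of $f$ (my own sketch; numpy and scipy are my choice of tools, not taken from the linked sources):

```python
import numpy as np
from scipy.optimize import brentq

# f(x) = (ln(x + 0.5) + 1)/x, i.e. the bound from the preprint divided by the expectation.
f = lambda x: (np.log(x + 0.5) + 1.0) / x
x0 = brentq(lambda x: f(x) - 1.0, 1.0, 5.0)  # f(1) > 1 and f(5) < 1, so a root exists
print(x0)  # approximately 1.858
```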

We clearly have $r(1)=0$. For $s=2$, maximizing $\frac{H(p_1)}{p_1+2(1-p_1)}$ over $p_1$ (where $H(p_1)$ denotes the binary entropy in nats) gives the maximizer $p_1=\frac{1}{2}(\sqrt 5-1)\approx 0.618$, the expectation $E(p)\approx 1.382$ and $r(2)\approx 0.481$.

For $s=3$ we fix $\mu\in(1,2]$ and consider $p_1\ge p_2\ge p_3\ge 0$ with $p_1+p_2+p_3=1$ and $E(p)=\mu$. Set $p_3=x$ and observe that $p_1=2-\mu+x$, $p_2=\mu-1-2x$, and that $\max(0,\frac{2}{3}\mu-1)\le x\le\frac{1}{3}(\mu-1)$. The derivative $\ln(\frac{p_2^2}{p_1p_3})$ of the entropy on this restriction is decreasing with exactly one root $x=\frac{1}{2}\mu-\frac{1}{3}-\frac{1}{6}\sqrt{4-3(2-\mu)^2}$. Numerical optimization over $\mu$ then gives \begin{align*} p_1&\approx 0.544\\ p_2&\approx 0.296\\ p_3&\approx 0.161\\ E(p)&\approx 1.617\\ r(3)&\approx 0.609. \end{align*}
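These two values can be reproduced by a brute-force search (a rough sketch of my own in numpy; the grid resolution limits the accuracy to roughly three decimals):

```python
import numpy as np

def ratio(p):
    """Entropy (in nats) divided by expectation for a pmf p on {1, ..., len(p)}."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p)) / np.sum(p * np.arange(1, len(p) + 1))

# s = 2: one-dimensional search over p_1; the maximizer is p_1 = (sqrt(5) - 1)/2.
p1 = np.linspace(1e-6, 1 - 1e-6, 200001)
print(np.max(-(p1 * np.log(p1) + (1 - p1) * np.log(1 - p1)) / (2 - p1)))  # ~0.481

# s = 3: crude grid search over the simplex.
best = 0.0
grid = np.linspace(1e-4, 1.0, 400)
for a in grid:
    for b in grid:
        c = 1.0 - a - b
        if c > 1e-4:
            best = max(best, ratio([a, b, c]))
print(best)  # ~0.609
```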

Matija
  • Based on a crude Lagrange multipliers approach, I conjecture that $r(s)\le 1$ for all $s$, with a fair geometric distribution as the maximizing distribution for $s=\infty$, and a truncated geometric distribution for $s< \infty$ – leonbloy Feb 06 '23 at 21:25
  • @leonbloy That sounds like a reasonable conjecture so far, and $H(p)\le E(p)$ for all positive integer valued random variables would be a nice result. – Matija Feb 07 '23 at 00:16
  • in Golomb et al.'s book, the result $H-E \le 0$ is indeed shown to be true, but with two caveats: the entropy is in bits (for other bases, replace the $0$ by $-\log_b(b-1)$, where $b$ is the logarithm base), and, of course, the expectation must be finite. – leonbloy Feb 07 '23 at 13:22
  • @leonbloy Thanks for the info! Yes, it's a hassle with the missing base. Do you mean the book from kodlu's answer? I haven't checked it out yet, but the bound in Theorem 4.1.2 is worse. – Matija Feb 07 '23 at 14:07

2 Answers


Finite mean implies finite entropy. The proof is in Golomb, Scholtz and Peile's book Information Theory: Adventures of Agent 00111 (Theorem 4.1). There are copies online if you dig around; I have no time to type in the details, so with apologies I only give the indications below:

[Image from the book (theorem statement); not transcribed.]

Main idea of the proof:

[Image from the book (proof sketch); not transcribed.]

The authors also note that, since the mean is sensitive to rearrangements of the probability distribution sequence but the entropy is not, one can also show that if $\{p_n\}$ has infinite entropy then all its rearrangements have infinite mean; the converse does not hold.

kodlu
  • +1 I took the idea and translated it to nats, yielding $H(p)\le E(p)+\frac{1}{e-1}$, which gives a slightly worse bound than the one obtained from Rioul's bound $\ln(E(p)+\frac{1}{2})+1$. – Matija Feb 06 '23 at 19:43

If we weaken the condition to $s(p) \le t,$ i.e., study $\tilde{r}(t) := \sup\{H(p)/E(p) : s(p) \le t\},$ then observe that to minimise $E$ for a given $t$, the support must lie in $[1:t]$. Now, if we fix a value of $E(p) = \mu \in [1,t]$ and consider the maximiser of $H(p)$ for such $p$, then it's just Lagrange multipliers to argue that this is optimised by a geometric law supported on $[1:t]$. Since this is true for each $\mu,$ this is also the form of the maximiser of $\tilde{r}(t)$. Further, since this law has full support, it is also the form of the maximiser of $r(t)$ for any $t \ge 1$ under the condition that $E(p) = \mu \in [1,t]$. Let the maximum entropy for a given mean $\mu$ and support $[1:t]$ be $h(\mu,t)$. Then it follows that $$r(t) = \max_p H(p)/E(p) = \max_{\mu \in [1,t]} \max_{p : E(p) = \mu} H(p)/\mu = \max_{\mu \in [1,t]} h(\mu,t)/\mu.$$ In general, then, the optimal law is a geometric law on $[1:t]$.
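Here is a quick numerical cross-check of the Lagrange-multiplier claim (my own sketch using scipy's SLSQP solver; the values of $t$ and $\mu$ are arbitrary examples): maximizing the entropy over pmfs on $\{1,\dots,t\}$ with a fixed mean indeed returns a law whose consecutive probability ratios are constant, i.e. a truncated geometric law.

```python
import numpy as np
from scipy.optimize import minimize

t, mu = 6, 2.5                                  # example support size and target mean
n = np.arange(1, t + 1)

def neg_entropy(p):
    # negative entropy in nats; clip to avoid log(0) at the lower bound
    return np.sum(p * np.log(np.clip(p, 1e-12, None)))

cons = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
        {"type": "eq", "fun": lambda p: np.sum(n * p) - mu}]
res = minimize(neg_entropy, np.full(t, 1.0 / t), bounds=[(1e-9, 1.0)] * t,
               constraints=cons, method="SLSQP")
p = res.x
print(p[1:] / p[:-1])   # approximately constant ratios => truncated geometric law
```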


Going beyond this gets messy because the geometric law on a finite set is unwieldy, but nevertheless, this does lead to interesting conclusions.

Setup. Consider a geometric law $$ p(n; \eta, t) := \frac{1-\eta}{1- \eta^t}\eta^{n-1} \mathbf{1}\{n \in [1:t]\}.$$ Note that each $\eta > 0$ gives a valid law. $\eta = 1$ (understood as a limit) gives the uniform law, $\eta > 1$ gives laws skewed towards $t$ and $\eta < 1$ gives laws skewed towards $1$, with the limiting laws supported entirely on $1$ (as $\eta \to 0$) and on $t$ (as $\eta \to \infty$). The mean can be computed as $$ \mu(\eta,t) := \frac{1}{1-\eta} -\frac{t\eta^t}{1 - \eta^t},$$ while the entropy of $p(n;\eta,t)$ is $$ H(\eta,t) := - \log \frac{1-\eta}{1-\eta^t} - (\mu(\eta,t)- 1)\log\eta.$$
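As a sanity check on these closed forms (a small sketch of my own; $\eta=0.7$ and $t=9$ are arbitrary example values), one can compare them against direct sums over the pmf:

```python
import numpy as np

def trunc_geom(eta, t):
    """pmf of the truncated geometric law on {1, ..., t} with parameter eta != 1."""
    n = np.arange(1, t + 1)
    return (1 - eta) / (1 - eta**t) * eta**(n - 1)

def mu_closed(eta, t):
    return 1.0 / (1 - eta) - t * eta**t / (1 - eta**t)

def H_closed(eta, t):
    return -np.log((1 - eta) / (1 - eta**t)) - (mu_closed(eta, t) - 1) * np.log(eta)

eta, t = 0.7, 9                                  # example parameters
p, n = trunc_geom(eta, t), np.arange(1, t + 1)
print(np.sum(n * p), mu_closed(eta, t))          # means agree
print(-np.sum(p * np.log(p)), H_closed(eta, t))  # entropies agree
```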

Since all of the optimal laws are parametrised by $\eta$, maximising the ratio $h(\mu,t)/\mu$ over all $\mu$ is equivalent to the program maximising $H(\eta,t)/\mu(\eta,t)$ over all $\eta$, i.e., $$ r(t) = \max_\eta \frac{H(\eta,t)}{\mu(\eta,t)}.$$ The first order condition for this program is $$ H'(\eta,t) \mu(\eta,t) - H(\eta,t) \mu'(\eta,t) = 0,$$ where $H'(\eta,t) = \partial_\eta H(\eta,t)$ and $\mu'(\eta,t) = \partial_\eta \mu(\eta,t).$


Developing the first-order condition. Computing the derivative, we find that $$H'(\eta,t) = \frac{1}{1-\eta} - \frac{t\eta^{t-1}}{1-\eta^t} - \frac{1}{\eta}(\mu(\eta, t) - 1) - \log \eta \cdot \mu'(\eta, t) = - \mu'(\eta,t) \cdot \log \eta,$$ where the last equality holds because $\frac{1}{\eta}(\mu(\eta,t)-1) = \frac{1}{1-\eta} - \frac{t\eta^{t-1}}{1-\eta^t}$, so the first three terms cancel.

Plugging this in, the first order condition is $$ \mu'(\eta,t)( \mu(\eta, t) \log \eta + H(\eta, t)) = 0. $$
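A quick finite-difference check of the derivative identity (my own sketch; the parameter values are arbitrary examples):

```python
import numpy as np

def mu(eta, t):
    return 1.0 / (1 - eta) - t * eta**t / (1 - eta**t)

def H(eta, t):
    return -np.log((1 - eta) / (1 - eta**t)) - (mu(eta, t) - 1) * np.log(eta)

eta, t, d = 0.8, 7, 1e-6                       # example values; central differences
H_prime  = (H(eta + d, t) - H(eta - d, t)) / (2 * d)
mu_prime = (mu(eta + d, t) - mu(eta - d, t)) / (2 * d)
print(H_prime, -mu_prime * np.log(eta))        # the two numbers agree
```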

Now, observe that $$ \mu'(\eta, t) = \frac{1}{(1 - \eta)^2} - \frac{t^2 \eta^{t-1}}{(1-\eta^t)^2}.$$ I claim that this has no roots except at $\eta = 1$. This should be intuitive, but more formally, note that $\mu'(\eta,t) = 0$ is equivalent to $$ g(t) := \frac{(1-\eta^t)^2}{t^2 \eta^{t-1}} = 4\eta (\sinh(t\log(\eta)/2)/t)^2 $$ being equal to $(1-\eta)^2 = g(1)$, and that for any fixed $\eta > 0$ with $\eta \neq 1$, the function $g$ is strictly increasing in $t$ for $t \ge 1$. Indeed, $g(t) = 4\eta (\log(\eta)/2)^2 h(t \log(\eta)/2)^2$ for $h(z) := \sinh(z)/z,$ and a simple derivative argument along with the fact that $\tanh(x) \le x$ shows that $h$ is positive and increasing in $|z|$. So $g(t) > g(1) = (1-\eta)^2$ for $t \ge 2$ whenever $\eta \neq 1$, and the only way that $\mu'(\eta, t) = 0$ for $t \ge 2$ is if $\eta = 1$.
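A crude numerical confirmation (my own sketch; the grid and the values of $t$ are arbitrary) that $\mu'(\eta,t)$ stays positive away from $\eta = 1$:

```python
import numpy as np

def mu_prime(eta, t):
    return 1.0 / (1 - eta)**2 - t**2 * eta**(t - 1) / (1 - eta**t)**2

# scan eta on both sides of 1 for a few support sizes; the minimum stays positive,
# i.e. mu' has no root away from eta = 1 (a numerical check, not a proof)
etas = np.concatenate([np.linspace(0.01, 0.99, 99), np.linspace(1.01, 3.0, 100)])
for t in [2, 3, 5, 10, 25]:
    print(t, min(mu_prime(e, t) for e in etas) > 0)   # True for each t
```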


Optimal $\eta$. This leaves the condition $ H(\eta,t)+ \log(\eta) \mu(\eta,t) = 0.$ This translates to the equation $$ \log \frac{1-\eta}{1-\eta^t} = \log \eta \iff 1-2\eta + \eta^{t+1} = 0.$$ This equation has exactly two solutions for $t \ge 2,$ one at $1$, and one in $(1/2, 1)$. Let $\eta_*(t)$ denote the solution in $(1/2,1)$. I claim that this is the optimal choice of $\eta$, at least for $t \ge 8$. Indeed, $\eta = 1$ yields the uniform law, and this does not maximise $r$ for $t \ge 8$ (the uniform law on $\{1,2\}$ achieves a greater ratio). Further, the limits $\eta \to 0$ and $\eta \to \infty$ both yield laws that concentrate on $1$ point, and so have $0$ entropy (and thus $0$ ratio), while the above solution has non-zero ratio. Thus, no other point can be optimal for $t \ge 8$ (in general, of course, we can just check the curvature, but I don't want to deal with that :P).


The behaviour of $r$. Notice, interestingly, that since $H(\eta_*(t),t) + \log\eta_*(t) \cdot \mu(\eta_*(t),t) = 0$ at the optimal $\eta_*(t)$, we can immediately conclude that $$ r(t) = -\log \eta_*(t).$$
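Numerically, $\eta_*(t)$ and hence $r(t)$ are easy to compute (a sketch of my own using scipy's brentq; the bracket works because $g_t(1/2) > 0$ while $g_t$ is negative just below $1$). For $t=2$ and $t=3$ this reproduces $r(2)\approx 0.481$ and $r(3)\approx 0.609$ from the question:

```python
import numpy as np
from scipy.optimize import brentq

def eta_star(t):
    # unique root of g_t(eta) = 1 - 2*eta + eta**(t+1) in (1/2, 1)
    return brentq(lambda e: 1 - 2 * e + e**(t + 1), 0.5 + 1e-12, 1 - 1e-9)

for t in [2, 3, 5, 10, 20]:
    e = eta_star(t)
    print(t, e, -np.log(e))   # r(t) = -log(eta_*(t)): 0.481..., 0.609..., ...
```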

While it's hard in general to say anything more, since $\eta_*(t)$ is difficult to nail down, we can study asymptotics quite cleanly. Indeed, let $g_t(\eta) := 1 - 2\eta + \eta^{t+1}$. Then notice that for large $t$, taking a Taylor expansion near $1/2,$ $$ g_t(1/2 + \varepsilon) = 2^{-(t+1)} + (-2 + (t+1) 2^{-t}) \varepsilon + t(t+1) 2^{-t} \varepsilon^2 + O(\varepsilon^3), $$ which means that for $t \gg 1,$ $\eta_*(t) \approx 2^{-1} + 2^{-(t+2)},$ which yields $r(t) = - \log\eta_*(t) \approx \log(2) - 2^{-(t+1)}.$
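A quick numerical comparison of the exact value with this approximation (a sketch of my own; the asymptote $\log(2) - 2^{-(t+1)}$ is the expression derived above):

```python
import numpy as np
from scipy.optimize import brentq

for t in [5, 10, 15, 20]:
    e = brentq(lambda x: 1 - 2 * x + x**(t + 1), 0.5 + 1e-15, 1 - 1e-9)
    exact, asym = -np.log(e), np.log(2) - 2.0**(-(t + 1))
    print(t, exact, asym, exact - asym)   # the gap shrinks rapidly with t
```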

Further, we can show that $r(t) < \log(2)$ for all $t$ (if the entropy is measured in bits, then this is $r(t) < 1$, leonbloy's conjecture from the comments).

To show this, it suffices to argue that $\eta_*(t) > 1/2,$ i.e., that $g_t(\eta) = 0$ does not have any roots $\le 1/2.$ Indeed, $$ g_t'(\eta) = -2 + (t+1) \eta^{t}$$ is strictly increasing with $\eta$, and since $2^x > x$ for every $x$ (so that $t+1 < 2^{t+1}$), it follows that $g_t'(1/2) < 0,$ which in turn implies that $ \forall \eta \in [0,1/2], g_t'(\eta) < 0.$ Therefore, $$\min_{\eta \in [0,1/2]} g_t(\eta) = g_t(1/2) = 2^{-(t+1)} > 0,$$ and thus there is no root in $[0,1/2]$, i.e., $\eta_*(t) > 1/2.$

  • Thank you very much for the answer! I will work through it asap. The results definitely look very promising! – Matija Feb 07 '23 at 06:50
  • Of course, do let me know if I've left too many holes. Also it might be possible to clean this up a little using some exponential family tricks, haven't really thought about it. – stochasticboy321 Feb 07 '23 at 08:29
  • Very good answer. – leonbloy Feb 07 '23 at 17:09
  • Thanks a lot for the great explanation! Unfortunately, I still did not have the time to reconcile every detail, but I did implement the optimization and arrived at the same conclusion. And just to be clear, this answer gives way more than I asked for, namely sharp bounds for a given expectation and support size. – Matija Feb 15 '23 at 15:40