23

For a discrete probability distribution, the entropy is defined as: $$H(p) = -\sum_i p(x_i) \log(p(x_i))$$ I'm trying to use the entropy as a measure of how "flat / noisy" vs. "peaked" a distribution is, where smaller entropy corresponds to more "peakedness". I want to use a cutoff threshold to decide which distributions are "peaked" and which are "flat". The problem with this approach is that for "same shaped" distributions, the entropy differs with the sample size. As a simple example, take the uniform distribution; its entropy is $$p_i = \frac{1}{n}\ \ \to \ \ H = \log n,$$ which grows with $n$. To make things worse, there doesn't seem to be a general rule for more complex distributions.
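
A quick numerical illustration of the problem (a minimal sketch in NumPy; the helper name is mine):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                    # 0 * log(0) is treated as 0
    return -np.sum(nz * np.log(nz))

# The "same shaped" (uniform) distribution has a different entropy
# for every sample size n: it equals log(n).
for n in (10, 100, 1000):
    print(n, entropy(np.full(n, 1.0 / n)), np.log(n))
```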

So, the question is:

How should I normalize the entropy so that I get the same "scaled entropy" for "same" distributions irrespective of the sample size?

  • 3
    If I may give a tip that does not directly help with your question: you should consider using Rényi entropies of which Shannon entropy is just a special case. It gives a more refined reflection of distributions. It's like the Fourier analysis for fractals. – Raskolnikov May 18 '13 at 02:37
  • @Raskolnikov - that is very interesting! Unfortunately, one would still run into scaling problems for any of the Rényi entropies. – Nathaniel Bubis May 18 '13 at 02:40
  • Why is "entropy different for different sample sizes", and why would we expect it to be the same? – develarist Dec 14 '20 at 07:32

2 Answers

18

Use the normalized entropy:

$$H_n(p) = -\sum_i \frac{p_i \log_b p_i}{\log_b n}.$$

For the uniform vector $p_i = \frac{1}{n}\ \ \forall \ \ i = 1,...,n$ with $n>1$, the Shannon entropy is maximized, so dividing by $\log_b n$ gives $H_n(p) \in [0, 1]$. This is simply a change of base: one may equivalently drop the normalization term and set $b = n$. This quantity is sometimes called the efficiency; you can read more about normalized entropy here and here.
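
A minimal sketch of the computation (NumPy; the function name is mine):

```python
import numpy as np

def normalized_entropy(p):
    """Shannon entropy of p divided by its maximum, log(n); base-independent, in [0, 1]."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                  # ensure p sums to 1
    nz = p[p > 0]                    # 0 * log(0) is treated as 0
    return -np.sum(nz * np.log(nz)) / np.log(len(p))   # undefined for len(p) == 1

# The uniform distribution scores 1 regardless of the sample size,
# while a peaked distribution scores closer to 0:
print(normalized_entropy(np.ones(10)))                # 1.0
print(normalized_entropy(np.ones(1000)))              # 1.0
print(normalized_entropy([0.9, 0.05, 0.03, 0.02]))    # well below 1
```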

  • Thank you! I'll read up about efficiency. – Nathaniel Bubis Sep 25 '14 at 12:13
  • @nbubis, I came across this information while trying to solve a problem similar to yours. However, I'm not able to find much more on "efficiency" beyond the link I posted; I did find more results on normalized entropy, though. – a Data Head Sep 25 '14 at 15:06
  • Is it defined for n=1? – Gregor Sturm Mar 30 '20 at 09:46
  • @GregorSturm, good catch! In practical terms, there is no uncertainty for n = 1. But mathematically, a change of base of n = 1 is undefined. It is an indeterminate form, but the denominator is a constant, so I don't think L'Hôpital's rule is applicable. – a Data Head May 03 '21 at 21:41
6

A partial answer for further reference:

In short, use the integral formulation of the entropy and pretend that the discrete distribution is sampling a continuous one.

Thus, construct a continuous density $p(x)$ on $[0,1]$ whose integral is approximated by the Riemann sum of the $p_i$'s: $$\int_0^1 p(x)dx \sim \sum_i p_i\cdot \frac{1}{N} = 1$$ Since this integral must equal $1$, the $p_i$'s must first be rescaled so that $\sum_i p_i = N$; each $p_i$ is then a density value on a grid of spacing $1/N$.

After normalization, we calculate the entropy: $$H=-\int_0^1 p(x)\log\left(p(x)\right)dx \sim -\sum_i p_i \log(p_i)\cdot \frac{1}{N}$$

As $N\to \infty$ this gives an entropy which depends only on the shape of the distribution and not on $N$. For small $N$, the discrepancy depends on how well the Riemann sum approximates the integral at that $N$.
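
A rough numerical sketch of this recipe (my own illustration; the triangular shape is just an arbitrary example):

```python
import numpy as np

def shape_entropy(p):
    """Rescale p_i to density values on a grid of spacing 1/N over [0, 1],
    then take the Riemann-sum analogue of the integral of -p log p."""
    p = np.asarray(p, dtype=float)
    N = len(p)
    dens = p / p.sum() * N           # rescale so that sum(dens) = N
    nz = dens[dens > 0]              # 0 * log(0) is treated as 0
    return -np.sum(nz * np.log(nz)) / N

# The same shape sampled at different resolutions gives nearly the same value;
# a uniform shape gives 0, and more peaked shapes give negative values.
for N in (10, 100, 10000):
    x = np.linspace(0, 1, N)
    print(N, shape_entropy(1 - np.abs(2 * x - 1)))   # triangular shape on [0, 1]
```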

  • 2
    Great thought there. I wish I had thought of that. – Raskolnikov May 20 '13 at 21:54
  • 1
    This gives $H=0$ for a uniform distribution, and you were looking for a measure of flatness. No ? – Stéphane Laurent May 31 '13 at 20:55
  • @StéphaneLaurent - yes it does, but peaked distributions will still have lower entropies $H < 0$ – Nathaniel Bubis May 31 '13 at 21:30
  • 2
    How to reconcile the fact that the approach in this answer gives $H=0$ for a uniform distribution, whereas in the unnormalized case $H=0$ is a low-entropy state with only one certain outcome (one spike/peak), far removed from the uniform distribution? Is it just shifting down the scale then? Is there a source article where this approach was taken from? How can it additionally be restricted to a domain of $H\in [-1,1]$ if it doesn't already? (no mention was made) – develarist Dec 14 '20 at 07:39