I'm interested in estimating Shannon's entropy for a discrete multivariate random variable $X$ that has high dimensionality (i.e. $X=(X_1,\dots,X_n)$, where $n$ is in the hundreds).
I can efficiently sample from $X$ and evaluate $P(X=x)$ for any given $x$. But due to the high dimensionality, I cannot enumerate all possible values to calculate the entropy exactly.
A straightforward Monte Carlo estimator seems to be:
Sample a large number ($m$) of observations $\{o_1,\dots,o_m\}$ from $X$ and take the sample average of their negative log-probabilities: $\hat{H}(X)=-\frac{1}{m}\sum_{i=1}^{m}{\log{p(o_i)}}$.
This approximates $H(X)=-\sum_{x\in\mathcal{X}}{p(x)\log{p(x)}}$, where the sum runs over $\mathcal{X}$, the (intractably large) set of all possible values of $X$.
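For concreteness, here is a minimal sketch of what I mean in Python; `sample_x` and `log_p` are hypothetical placeholders for the two operations assumed above (drawing one observation from $X$ and evaluating $\log P(X=x)$):

```python
import numpy as np

def mc_entropy(sample_x, log_p, m=100_000):
    """Monte Carlo entropy estimate: the mean of -log p(o_i) over m draws."""
    log_probs = np.array([log_p(sample_x()) for _ in range(m)])
    h_hat = -log_probs.mean()                   # \hat{H}(X)
    se = log_probs.std(ddof=1) / np.sqrt(m)     # Monte Carlo standard error of the mean
    return h_hat, se
```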
- Is this a correct approach? Am I missing something?
- Is this estimator biased? And if so, why?
- If this is an unbiased estimator, what issue do entropy estimation methods such as those mentioned in this question address? Does the bias arise in the setting where we cannot evaluate $\log{p(x)}$ and must resort to noisy occurrence counts or histograms (as in the sketch after this list)?
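To make that last distinction concrete, this is the count-based setting I have in mind (a minimal sketch; `plugin_entropy` is just an illustrative name): the plug-in estimator computes the entropy of the empirical distribution built from occurrence counts, with no access to the exact $\log p(x)$.

```python
from collections import Counter

import numpy as np

def plugin_entropy(observations):
    """Plug-in estimate: entropy of the empirical distribution \\hat{p}.

    `observations` must be hashable values (e.g. tuples), since the
    estimator only uses how often each distinct value occurs.
    """
    counts = Counter(observations)
    m = len(observations)
    p_hat = np.array([c / m for c in counts.values()])
    return float(-(p_hat * np.log(p_hat)).sum())   # H(\hat{p})
```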