I would like some clarification on a few points regarding Shannon's definition of entropy for a random variable and his notion of the self-information of a state of that variable.
- We define the self-information of a particular state $v$ of a random variable as $h(v) = -\log_2 P(v)$.
As far as I understand, Shannon arrived at this definition because it satisfies some intuitive properties (for instance, we want the states with the highest probability to convey the least information, and so on).
I have read somewhere that, given a random variable, the self-information of each of its states gives the minimal number of bits needed to encode that state.
I understand this when we are talking about variables with a uniform probability distribution. In that case, the self-information of each state $v$ gives the number of bits needed to transmit that state efficiently, under the assumption that, since all states have the same probability, we should not "prioritize" one over the others.
But I do not understand how this holds for a random variable with a non-uniform probability distribution: why should $h(v) = -\log_2 P(v)$ be the number of bits needed to encode exactly that state $v$?
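To make the question concrete, here is the kind of non-uniform example I have in mind (my own toy example, so I may be misreading the claim). Take a variable with four states and probabilities
$$P(v_1) = \tfrac{1}{2}, \quad P(v_2) = \tfrac{1}{4}, \quad P(v_3) = P(v_4) = \tfrac{1}{8},$$
so that $h(v_1) = 1$, $h(v_2) = 2$ and $h(v_3) = h(v_4) = 3$ bits. I can see that a prefix code such as $0, 10, 110, 111$ achieves exactly these lengths, but I do not see why these should be the minimal numbers of bits in general.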
- As far as I understand, the entropy of a random variable $X$ should be the mean amount of self-information over the states of the variable. For instance, if we have a variable with 4 states $\{v_1, v_2, v_3, v_4\}$, then $H(X) = \frac{1}{4}\left(h(v_1) + h(v_2) + h(v_3) + h(v_4)\right)$.
How is this consistent with the entropy of the variable defined as $H(X) = -\sum_{x \in X} P_X(x) \cdot \log_2\left(P_X(x)\right)$?
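In the uniform case the two expressions clearly coincide, since $P_X(x) = \tfrac{1}{4}$ for every state:
$$-\sum_{x \in X} P_X(x) \log_2 P_X(x) = \sum_{x \in X} \tfrac{1}{4}\, h(x) = \tfrac{1}{4}\big(h(v_1) + h(v_2) + h(v_3) + h(v_4)\big).$$
For a non-uniform distribution, however, the second formula weights each $h(x)$ by $P_X(x)$ rather than by $\tfrac{1}{4}$, so I suspect the "mean" should really be a probability-weighted mean (an expected value). Is that the right reading?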
- In case we do not know the probability distribution of the variable, can $H(X) = -\sum_{x \in X} P_X(x) \cdot \log_2\left(P_X(x)\right)$ be approximated by estimating the probabilities from the outcomes of repeated realizations of that random variable?
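Concretely, I mean something like the following plug-in estimate (a rough sketch in Python; the function name, the use of numpy and the example distribution are just my own choices):

```python
import numpy as np

def empirical_entropy(samples):
    """Plug-in estimate of H(X) in bits: estimate P_X(x) by the relative
    frequency of each observed state, then apply Shannon's formula."""
    _, counts = np.unique(samples, return_counts=True)
    p_hat = counts / counts.sum()           # empirical probabilities
    return -np.sum(p_hat * np.log2(p_hat))  # -sum_x p(x) * log2 p(x)

# 10,000 draws from a hypothetical non-uniform 4-state variable
rng = np.random.default_rng(0)
samples = rng.choice(["v1", "v2", "v3", "v4"], size=10_000,
                     p=[0.5, 0.25, 0.125, 0.125])
print(empirical_entropy(samples))  # close to the true entropy of 1.75 bits
```

Is an estimate like this a valid way to approximate $H(X)$ when the distribution is unknown?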
Thanks, and sorry for the lengthy list of questions.