I would like some clarification on a few points regarding Shannon's definition of entropy for a random variable and his notion of the self-information of a state of that variable.
- We define the self-information of a particular state $v$ of a random variable as $h(v) = -\log_2 P(v)$.
As far as I understand, Shannon arrived at this definition because it satisfies some intuitive properties (for instance, we want the states with the highest probability to convey the least information, and so on).
I have read somewhere that, given a random variable, the self-information of each of its states gives the minimal number of bits needed to encode that state.
I understand this when we are talking about variables with a uniform probability distribution. In that case, the self-information of each state $v$ gives the number of bits needed to transmit that state efficiently, under the assumption that, since all states have the same probability, we should not "prioritize" one over the others.
But I do not understand how this holds for a random variable with a non-uniform probability distribution: why should $h(v) = -\log_2 P(v)$ be the number of bits needed to encode exactly that state $v$?
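To make the question concrete, here is the kind of non-uniform example I have in mind (my own toy example, so I may be misreading the claim). Take a variable with four states and probabilities
$$P(v_1) = \tfrac{1}{2}, \quad P(v_2) = \tfrac{1}{4}, \quad P(v_3) = P(v_4) = \tfrac{1}{8},$$
so that $h(v_1) = 1$, $h(v_2) = 2$ and $h(v_3) = h(v_4) = 3$ bits. I can see that a prefix code such as $0, 10, 110, 111$ achieves exactly these lengths, but I do not see why these should be the minimal numbers of bits in general.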
- As far as I understand, the entropy of a random variable $X$ should be the mean amount of self-information over the states of the variable. For instance, if we have a variable with 4 states $\{v_1, v_2, v_3, v_4\}$, then $H(X) = \frac{1}{4}\left(h(v_1) + h(v_2) + h(v_3) + h(v_4)\right)$.
How is this consistent with the entropy of the variable defined as $H(X) = -\sum_{x \in X} P_X(x) \cdot \log_2\left(P_X(x)\right)$?
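In the uniform case the two expressions clearly coincide, since $P_X(x) = \tfrac{1}{4}$ for every state:
$$-\sum_{x \in X} P_X(x) \log_2 P_X(x) = \sum_{x \in X} \tfrac{1}{4}\, h(x) = \tfrac{1}{4}\big(h(v_1) + h(v_2) + h(v_3) + h(v_4)\big).$$
For a non-uniform distribution, however, the second formula weights each $h(x)$ by $P_X(x)$ rather than by $\tfrac{1}{4}$, so I suspect the "mean" should really be a probability-weighted mean (an expected value). Is that the right reading?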
- In case we do not know the probability distribution of the variable, can $H(X) = -\sum_{x \in X} P_X(x) \cdot \log_2\left(P_X(x)\right)$ be approximated by estimating the probabilities from the outcomes of repeated realizations of that random variable?
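Concretely, I mean something like the following plug-in estimate (a rough sketch in Python; the function name, the use of numpy and the example distribution are just my own choices):

```python
import numpy as np

def empirical_entropy(samples):
    """Plug-in estimate of H(X) in bits: estimate P_X(x) by the relative
    frequency of each observed state, then apply Shannon's formula."""
    _, counts = np.unique(samples, return_counts=True)
    p_hat = counts / counts.sum()           # empirical probabilities
    return -np.sum(p_hat * np.log2(p_hat))  # -sum_x p(x) * log2 p(x)

# 10,000 draws from a hypothetical non-uniform 4-state variable
rng = np.random.default_rng(0)
samples = rng.choice(["v1", "v2", "v3", "v4"], size=10_000,
                     p=[0.5, 0.25, 0.125, 0.125])
print(empirical_entropy(samples))  # close to the true entropy of 1.75 bits
```

Is an estimate like this a valid way to approximate $H(X)$ when the distribution is unknown?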
Thanks, and sorry for the lengthy list of questions.