I am trying to understand the logic behind the mathematical formula for Entropy:
$$ \text{ Entropy for a Discrete Random Variable:} \quad H(X) = -\sum_{x \in X} p(x) \log_2 p(x) $$ $$ \text{ Entropy for a Continuous Random Variable:} \quad h(X) = -\int_{-\infty}^{\infty} f(x) \log_2 f(x) \, dx $$
From a practical view, the formulas make sense. For example:
- Coin 1: Suppose I flip a coin 100 times and observe 55 Heads.
- Coin 2: I then flip another coin 100 times and observe 98 Heads.
The first coin has a lot more uncertainty than the second coin. If I gave my results for both coins to a friend, and my friend had to trust my results (without flipping the coins themselves), my friend would believe that there is more randomness in the first coin than in the second. Thus, the first coin should have more entropy than the second coin.
I wrote the formulas for both scenarios:
$$ H(X_1) = -\left[0.55 \log_2(0.55) + (1-0.55) \log_2(1-0.55)\right] \approx 0.9928 $$ $$ H(X_2) = -\left[0.98 \log_2(0.98) + (1-0.98) \log_2(1-0.98)\right] \approx 0.1414 $$
Based on these calculations, we can see that the first coin indeed has higher entropy than the second coin.
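To double-check the arithmetic, here is a quick sketch in Python (the helper name `binary_entropy` is just something I made up for this post):

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli random variable with P(Heads) = p."""
    if p in (0.0, 1.0):
        return 0.0  # by convention, 0 * log(0) is taken to be 0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(binary_entropy(0.55))  # ~0.9928
print(binary_entropy(0.98))  # ~0.1414
```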
However, I have still not fully understood the Entropy formula (e.g. in the discrete case).
I understand the summation term: for a given random variable, we are interested in the randomness/uncertainty over each possible outcome the variable can take, which is why we need a $p(x)$ term.
But what I don't understand is why we need the $\log p(x)$ term in this formula. In my opinion, isn't all the uncertainty and information about the random variable already encoded in $x$ and $p(x)$, so that there should be no need to multiply by $\log p(x)$?
I can see that when $p(x)$ is large (i.e. close to $1$), $\log p(x)$ is close to 0. I guess someone could say that a coin which almost always lands heads (i.e. high $p(x)$) contains very little surprise.
On the other hand, when $p(x)$ is small (i.e. close to 0), then $\log p(x)$ approaches negative infinity… I guess this is a way of saying "high surprise" for observing an event with low probability.
And finally, when $p(x)$ is 0.5, then $\log p(x)$ (base 2) is -1.
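To make this pattern concrete, I tabulated the (positive) quantity $-\log_2 p$, which people seem to call the "surprise" or self-information, for a few probabilities (again just a quick Python sketch):

```python
import math

# Surprise (self-information, in bits) for a few probabilities:
# close to 0 bits when p is near 1, growing without bound as p shrinks.
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:<4}  surprise = {-math.log2(p):.4f} bits")
```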
It seems to me that the log probabilities are being used as weighting factors that multiply the probabilities, but I don't understand why this weighting is necessary.
Can someone please help me out?