
I am trying to understand the logic behind the mathematical formula for Entropy:

$$ \text{ Entropy for a Discrete Random Variable:} \quad H(X) = -\sum_{x \in X} p(x) \log_2 p(x) $$ $$ \text{ Entropy for a Continuous Random Variable:} \quad h(X) = -\int_{-\infty}^{\infty} f(x) \log_2 f(x) \, dx $$
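For concreteness, here is a minimal numerical sketch of both definitions (my own illustration, not part of any standard library; the helper names `discrete_entropy` and `differential_entropy` are made up, and the continuous case is approximated by numerical integration):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def discrete_entropy(pmf):
    """H(X) = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0."""
    return -sum(p * np.log2(p) for p in pmf if p > 0)

def differential_entropy(pdf, lo=-10.0, hi=10.0):
    """h(X) = -integral f(x) log2 f(x) dx, approximated numerically."""
    def integrand(x):
        fx = pdf(x)
        return -fx * np.log2(fx) if fx > 0 else 0.0
    value, _ = quad(integrand, lo, hi)
    return value

print(discrete_entropy([0.5, 0.5]))          # 1.0 bit: a fair coin
print(discrete_entropy([0.25] * 4))          # 2.0 bits: a fair four-sided die
print(differential_entropy(norm(0, 1).pdf))  # ~2.05 bits: standard normal, 0.5*log2(2*pi*e)
```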

From a practical view, the formulas make sense. For example:

  • Coin 1: Suppose I flip a coin 100 times and observe 55 heads.
  • Coin 2: I then flip another coin 100 times and observe 98 heads.

The first coin is a lot more uncertain than the second coin. If I gave my results for both coins to a friend, and my friend had to trust my results (without flipping the coins themselves), my friend would believe that there is more randomness in the first coin than in the second. Thus, the first coin would have more entropy than the second coin.

I evaluated the formula for both scenarios, using the observed frequencies as the probabilities:

$$ H(X_1) = -\left[0.55 \log_2(0.55) + (1-0.55) \log_2(1-0.55)\right] \approx 0.9928 $$ $$ H(X_2) = -\left[0.98 \log_2(0.98) + (1-0.98) \log_2(1-0.98)\right] \approx 0.1414 $$

Based on these calculations, we can see that the first coin indeed has higher entropy than the second coin.
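For reference, the same two numbers can be reproduced with a few lines of Python (a quick sketch; `binary_entropy` is just a throwaway helper name):

```python
from math import log2

def binary_entropy(p):
    """Entropy (in bits) of a coin that lands heads with probability p."""
    return -(p * log2(p) + (1 - p) * log2(1 - p))

print(round(binary_entropy(0.55), 4))  # 0.9928  (coin 1)
print(round(binary_entropy(0.98), 4))  # 0.1414  (coin 2)
```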

However, I still have not fully understood the entropy formula (e.g. in the discrete case).

I understand the summation term - for a given random variable, we are interested in studying the randomness/uncertainty over each possible outcome the random variable can take. This shows us why we need a $p(x)$ term.

But what I don't understand is why we need a $\log p(x)$ term in this formula. Isn't all the uncertainty and information about the random variable already encoded within $x$ and $p(x)$ ... so that there should be no need to multiply by $\log p(x)$?

  • I can see that when $p(x)$ is large (i.e. close to $1$), then $\log p(x)$ is close to $0$. I guess someone could say: a coin that almost always lands heads (i.e. high $p(x)$) contains very little surprise.

  • On the other hand, when $p(x)$ is small (i.e. close to 0), then $\log p(x)$ approaches negative infinity… I guess this is a way of saying "high surprise" for observing an event with low probability.

  • And finally, when $p(x) = 0.5$, then $\log_2 p(x) = -1$.

It seems to me that they are weighting each probability by a factor given by its log probability, but I don't understand why this weighting is necessary.
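One way to see what that weighting does: treat $-\log_2 p(x)$ as the "surprise" of each outcome, and the entropy as the probability-weighted average (i.e. the expected value) of that surprise. Below is a small sketch reusing coin 2 from above (the dictionary and variable names are just illustrative):

```python
from math import log2

pmf = {"heads": 0.98, "tails": 0.02}   # coin 2 from the example above

# "Surprise" (information content) of each outcome: -log2 p(x)
surprise = {x: -log2(p) for x, p in pmf.items()}
print(surprise)   # heads: ~0.029 bits, tails: ~5.644 bits

# Entropy = probability-weighted average (expected value) of the surprise
entropy = sum(p * surprise[x] for x, p in pmf.items())
print(entropy)    # ~0.1414, matching H(X_2) above
```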

Can someone please help me out?

wulasa
  • For another interesting perspective, have a look at this question and its answer – Jean Marie May 11 '24 at 06:30
  • Highly recommended answer to a similar question on mathoverflow: https://mathoverflow.net/a/146579 – littleO May 11 '24 at 07:08
  • See also: https://math.stackexchange.com/questions/331103/intuitive-explanation-of-entropy – littleO May 11 '24 at 07:10
  • Where have you seen this expression "mathematical surprise"? – Jean Marie May 11 '24 at 15:01
  • "It seems to me that they are trying to weight the probabilities based on a multiplication factor using log probabilities" - or, are they trying to weight the information content of each event by the probability of that event? – Filip Milovanović May 11 '24 at 20:09
  • Information content is not about uncertainty, as it is about quantifying how much information an event carries. Imagine there's a beacon hidden in a large hotel, and you have a scanner that can tell you if it's in the scanned area, but not the exact room. The event of scanning the entire building has 100% prob of detection, but gives you no info. If you scan just the left wing, each result has 50% prob. Suddenly you've eliminated half the options. That's one bit of info (you need one bit to distinguish left from right): $-\log_2(p(x))$ is really $\log_2(1/p(x))$, in this case $\log_2(2) = 1$ – Filip Milovanović May 11 '24 at 20:24

3 Answers

3

I assume you are referring to Shannon's entropy, which is used in information theory. There, Shannon defined a way of measuring the amount of uncertainty using probability theory.

To answer your question, let me start with an example that I have always found insightful. If I ask you what the probability is that the sun will rise tomorrow, I am convinced that you will tell me that it is 99.99999%, and I am equally convinced that you will not be very surprised: the sun rises every day, and it is practically impossible that it will not rise tomorrow. Now suppose that the sun does not rise. The probability of this event is extremely low, but should it occur, its information content would be extremely high. These facts lead us to the conclusion that the information content of observing the random variable $x$, and the expected uncertainty about $x$ prior to the observation, should be represented by a decreasing function of the probability $p(x)$: the more likely $x$ is to occur, the less information its actual observation carries.

Shannon proved that, up to the choice of constants, there is a single meaningful functional that quantifies the degree of uncertainty conveyed by a probability distribution $p$ on a finite set. This functional has the form
\begin{equation} -k \sum_{x} p(x) \log_{b} p(x) \end{equation}
where $b$ and $k$ are positive constants with $b \neq 1$. The choice of $b$ and $k$ determines the unit of uncertainty. The most common convention is to define the unit by requiring that the amount of uncertainty equal $1$ for a choice between two equally likely alternatives, i.e. when the uncertainty of a binary choice is maximal. The units are "nats" when the natural logarithm is used and "bits" for base-2 logarithms. The resulting functional is the entropy of a discrete random variable, $H(X)$. It can also be rewritten as an expected value:
\begin{equation} H(X) = E\left\{\log_{b} \frac{1}{p(X)}\right\} \end{equation}
where I set $k=1$, and we define $0 \log 0$ to be $0$.

There are numerous axiomatic approaches to quantifying information and uncertainty in probability theory. Several solid axiomatic characterizations support the claim that Shannon entropy is the only useful function in probability theory for this purpose, and the required axioms are quite intuitive. For example, Shannon entropy satisfies expansibility (if an outcome with probability zero is added to the distribution, the uncertainty does not change), symmetry (the uncertainty is invariant under a permutation of the probabilities), continuity, maximality (the uncertainty is maximal when all outcomes are equally probable), and so on.

The entropy of a continuous random variable is not exactly the equivalent of the discrete form. The proof is not lengthy, but I suggest you take a look at the text "Elements of Information Theory" by Cover and Thomas.
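To make those axioms concrete, here is a small numerical illustration (my own sketch, not from any particular text; `H` is just a throwaway helper) of expansibility, symmetry, and maximality for base-2 entropy:

```python
import numpy as np

def H(p):
    """Base-2 entropy of a finite distribution, with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(H([0.5, 0.3, 0.2]))       # ~1.485 bits
print(H([0.5, 0.3, 0.2, 0.0]))  # expansibility: same value
print(H([0.2, 0.5, 0.3]))       # symmetry: same value
print(H([1/3, 1/3, 1/3]))       # maximum on 3 outcomes: log2(3) ~ 1.585 bits
```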

Upax
2

According to the Wikipedia page on entropy, it is defined as $\mathbb{E}(I(X))$, where $I(X)$ is the information content of $X$, a random variable. If you then follow the link to information content, https://en.wikipedia.org/wiki/Information_content, it says that it was designed to satisfy three axioms:

  1. An event with probability 100% is perfectly unsurprising and yields no information.
  2. The less probable an event is, the more surprising it is and the more information it yields.
  3. If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.

So the $\log$ factor is there to satisfy the third axiom, which is a useful additivity property: for independent events the probabilities multiply, and the logarithm turns that product into a sum, $-\log_2\big(p(A)\,p(B)\big) = -\log_2 p(A) - \log_2 p(B)$.
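To see the additivity at work numerically, here is a short sketch (my own illustration; the two distributions are arbitrary) showing that for independent variables the entropy of the joint distribution $p(x,y)=p(x)\,p(y)$ equals the sum of the individual entropies:

```python
import numpy as np

def H(p):
    """Base-2 entropy, with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

px = np.array([0.5, 0.5])   # a fair coin
py = np.array([0.9, 0.1])   # a biased coin

# Independence: the joint pmf is the outer product p(x, y) = p(x) p(y)
pxy = np.outer(px, py)

print(H(px) + H(py))  # ~1.469 bits
print(H(pxy))         # the same: the log turns the product of probabilities into a sum
```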

  • Would I be wrong to say that the element of surprise is identical to the minimum number of steps required to narrow down a list of possibilities from n to 1? Halving (binary) seems to be most efficient, at least in terms of groups left to scan. Correct/incorrect/both/neither? – Hudjefa May 12 '24 at 09:45
1

Entropy is an estimate for the expected number of symbols required to encode a random variable’s sample in an alphabet of a fixed size.

A string of $n$ symbols from an alphabet of size $k$ represents a choice of one out of $k^n$ possibilities. If the possibilities are all equally likely, for example if you have $n$ independent, uniformly distributed variables, each of which can assume one of $k$ values, you can simply assign one string of length $n$ to each possible outcome (say, so that the symbol at position $j$ represents the value of the $j$-th variable). However, if you want to efficiently represent a non-uniformly distributed variable, you will need to partition the space of strings differently. Ideally, you would make it so that an outcome of probability $p$ is represented by a string of length $\log_{1/k}(p) = -\log_k(p)$. This ideal value is usually not even a rational number, never mind an integer; as such, it is not always achievable directly with a single variable, but can be asymptotically approached by choosing ever more complex encodings to communicate a large number of i.i.d. samples in bulk.
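To illustrate this coding view, here is a small sketch (my own example; the distributions are arbitrary) comparing the ideal code lengths $-\log_2 p(x)$ with rounded-up integer lengths, in the spirit of Shannon coding:

```python
from math import log2, ceil

# A dyadic distribution: every -log2 p(x) is an integer, so the ideal lengths are achievable
pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
ideal = {x: -log2(p) for x, p in pmf.items()}      # a:1, b:2, c:3, d:3 bits
print(sum(p * ideal[x] for x, p in pmf.items()))   # expected ideal length = entropy = 1.75 bits

# A non-dyadic distribution: rounding up to ceil(-log2 p) stays within 1 bit of the entropy
pmf2 = {"x": 0.6, "y": 0.4}
H2 = -sum(p * log2(p) for p in pmf2.values())
L2 = sum(p * ceil(-log2(p)) for p in pmf2.values())
print(H2, L2)   # ~0.971 bits vs 1.4 symbols: close, but not achievable exactly per sample
```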

Entropy is then the expected value of the number of symbols you will need to communicate a sample in this idealized encoding. "Surprise" is then characterized as the length of a message communicating the sample: intuitively, the less surprising an outcome, the less you need to say to correct your expectations. If the outcome was one you expected, you can just say "everything as expected", no matter how large the probability space; if there were minute differences, you can describe those, then say "otherwise there was nothing out of the ordinary", and so on. If it deviated a lot, however, you may need to say "disregard all expectations" and describe the sample from scratch. Since the short encodings were assigned to the more probable outcomes, this may take a lot of symbols.