
The title is really just an example of the question I want to ask; I can't think of a more general way to put it in a single sentence. The main difficulty I have with learning math is understanding how symbols are supposed to be parsed in math notation. Take, for example, the right-hand side of the definition of information entropy from Wikipedia:

$$\mathrm{H}(X) = \mathbb{E}[-\log p(X)]$$

I was lucky enough to have just learned what the $\mathbb{E}$ means in this context (otherwise I wouldn't even have been able to google it, since the whole thing is just an image). But even after learning what all the symbols mean, my main issue is with $p(X)$ (with the uppercase $X$). I think it's pretty clear what they're trying to say with the formula: the whole thing is meant to expand to $p(x_1)\cdot(-\log p(x_1)) + p(x_2)\cdot(-\log p(x_2)) + \cdots$, which is the entropy formula. But if I were a computer program that needs to parse the formula, assuming I already have all the symbol definitions, how would I do it? I'm not asking because I want to write such a program, but because I'd like some clear rules I can follow to understand other formulae, which I am often unable to parse even when I know what all the symbols mean.

I have some idea about how it may work, but I'd much rather hear an explanation from someone with more math experience than me. My idea is that there is an implicit assumption in math that when you have a function typed A -> B but you instead pass it, for instance, [A, A], then you get back [B, B]; more generally, you find the instances of the expected type inside the "shape" you pass, apply the function to each of them, and return the same shape.
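To make that concrete in code (a toy Haskell sketch of my own, nothing to do with probability yet): this "apply inside the shape" rule is exactly what Haskell calls `fmap`.

    import Data.Char (toUpper)

    -- A function typed Char -> Char ...
    shout :: Char -> Char
    shout = toUpper

    main :: IO ()
    main = do
      print (shout 'h')             -- applied to the expected type: 'H'
      print (fmap shout "heads")    -- lifted over a list "shape": "HEADS"
      print (fmap shout (Just 'h')) -- lifted over a Maybe "shape": Just 'H'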

In the case above, $X$ could be thought of as a pair of lists $(C, P)$, where $C$ is the list of possible cases (such as heads or tails) and $P$ their probabilities, and $\mathbb{E}$ just returns the dot product of these lists. Since the function $-\log p(x)$ takes a possible case as its input (because that's what $p$ takes), when we give it the pair $(C, P)$ it operates only over the items in $C$ and returns a new pair of lists in which each case has been replaced by the function applied to it. For instance, if the lists were $([H, T], [1/2, 1/2])$, the function would return $([-\log p(H), -\log p(T)], [1/2, 1/2])$, which we can then pass to $\mathbb{E}$ to get $-\log p(H)\cdot 1/2 + -\log p(T)\cdot 1/2$.
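Here is that idea as a toy Haskell program (the names `cases`, `probs`, and `e` are just my own for illustration); it computes exactly the sum above and prints 1.0 bit for the fair coin:

    data Outcome = H | T deriving (Eq, Show)

    -- X as the pair of lists (C, P): possible cases and their probabilities
    cases :: [Outcome]
    cases = [H, T]

    probs :: [Double]
    probs = [0.5, 0.5]

    -- p takes a possible case and looks up its probability
    p :: Outcome -> Double
    p x = head [q | (c, q) <- zip cases probs, c == x]

    -- "E" as the dot product of the transformed case list with the probabilities
    e :: (Outcome -> Double) -> Double
    e f = sum (zipWith (*) (map f cases) probs)

    main :: IO ()
    main = print (e (\x -> negate (logBase 2 (p x))))  -- prints 1.0 (bits)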

In this specific case that seems to work, but I wonder whether my reasoning is actually correct in general, or whether there's some more general rule one can follow to parse function application in math when the function is given the "wrong" datatype.

Juan
  • It should be parsed as something like this: $\mathbb{E}$ is the expectation operator. In its square brackets, there must be a random variable. $X$ is previously bound to be a random variable. In this context, all functions are to be lifted to the random variable monad. See https://toywiki.xyz/random_variable_monads.html for more details. – CrabMan Sep 08 '23 at 13:06
  • There was a great answer by @QiaochuYuan about how this notation is pretty bad. I don't know why he deleted it, but I wish he'd put it back. It does let us split your question into two sub-questions with different answers. Q1: How would a symbol-processing computer understand this definition? A: It wouldn't. It's not rigorously valid. Q2: How could I understand this definition? A: Step one is to read up on and understand the basics of random variables. (The Wikipedia article is probably a fine place to start.) R.V.s use their own language and symbols. You'll need to learn them. – JonathanZ Sep 08 '23 at 16:16
  • @CrabMan I really like that idea and I think it's more natural and general than the one I proposed (replacing elements of matching types). – Juan Sep 08 '23 at 16:34
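A toy version of the "random variable monad" suggested in CrabMan's comment (the `RV` type, `coin`, and `expectation` are hypothetical names, and only the functor part, i.e. the lifting, is actually needed here):

    newtype RV a = RV [(a, Double)]

    -- Lifting an ordinary function to random variables: this is the
    -- "find the expected type inside the shape" rule from the question.
    instance Functor RV where
      fmap f (RV xs) = RV [(f x, q) | (x, q) <- xs]

    expectation :: RV Double -> Double
    expectation (RV xs) = sum [v * q | (v, q) <- xs]

    coin :: RV Char
    coin = RV [('H', 0.5), ('T', 0.5)]

    -- p_X written as an ordinary function on values
    p :: Char -> Double
    p _ = 0.5

    main :: IO ()
    main = print (expectation (fmap (negate . logBase 2 . p) coin))  -- 1.0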

1 Answer


Edit: Okay, I was confused; this actually parses, but I don't like it.

Generally speaking, if $X$ is a random variable taking values in $\mathbb{R}$, and $f : \mathbb{R} \to \mathbb{R}$ is a function, then $f(X)$ is the new random variable resulting from applying $f$ to the values of $X$, and $\mathbb{E}(f(X))$ is the expectation of that random variable. That is, for example, what this notation means in the definition of the moment generating function $\mathbb{E}(e^{tX})$.
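Concretely (a fair-coin illustration, not part of the original definition): if $\mathbb{P}(X = 0) = \mathbb{P}(X = 1) = \frac{1}{2}$, then

$$\mathbb{E}(f(X)) = \sum_x f(x)\,\mathbb{P}(X = x) = \tfrac{1}{2}f(0) + \tfrac{1}{2}f(1).$$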

$p(x)$ is the function which takes as input $x \in \mathbb{R}$ and returns as output the probability $\mathbb{P}(X = x)$ that $X$ takes on that value. So $p(X)$ is the random variable which takes on the value $p(x)$ with probability $p(x)$, as desired. But it's worth noting that $p$ is not some fixed function but itself depends on $X$, so it really should be written $p_X$ or something.
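On the coin example from the question: $p_X(H) = p_X(T) = \frac{1}{2}$, so $p_X(X)$ is the constant random variable $\frac{1}{2}$, and

$$\mathbb{E}\big({-\log_2 p_X(X)}\big) = -\log_2 \tfrac{1}{2} = 1 \text{ bit},$$

which is indeed the entropy of a fair coin.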

Personally I don't think this is a good way to think about entropy; entropy doesn't take a random variable as input at all, but a discrete probability measure.
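Spelled out: for a discrete probability measure with weights $p_1, \dots, p_n$, the entropy is

$$H(p_1, \dots, p_n) = -\sum_{i=1}^{n} p_i \log p_i,$$

and no random variable appears at all.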

Qiaochu Yuan
  • Honestly, I think you were right the first time - it fails parsing. As you note, $p(X)$ cannot be understood like we understand $f(X)$. $f(\cdot)$ produces a new r.v. without knowledge of the underlying probabilities, but $p(X)$ needs to "reach under" to get at the underlying probabilities. I'd formalize it as: given an r.v. $X$, we can create a new r.v. - let's call it $P_X$ - whose probabilities are the same, but whose values are the probabilities themselves. Yes, $X \mapsto P_X$ is a function from r.v.s to r.v.s, but not one implementable in the $f(X)$ way. (cont...) – JonathanZ Sep 08 '23 at 16:28
  • ... So writing it as $p(X)$ is going to cause confusion. If we're willing to use my $P_X$ notation, then I think we do have $\mathbb E[-\log(P_X)]$ as valid notation. – JonathanZ Sep 08 '23 at 16:32
  • It is at the very least confusing, I agree! – Qiaochu Yuan Sep 08 '23 at 16:40
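To illustrate the $X \mapsto P_X$ construction from JonathanZ's comments, here is a self-contained Haskell sketch (all names hypothetical, and a toy that assumes the listed outcomes have distinct values). Note that `pX` has to read the weights, so unlike $f(X)$ it cannot be written as `fmap f` for any ordinary `f`:

    newtype RV a = RV [(a, Double)]

    instance Functor RV where
      fmap f (RV xs) = RV [(f x, q) | (x, q) <- xs]

    -- P_X: same weights, but each value replaced by its own probability.
    pX :: RV a -> RV Double
    pX (RV xs) = RV [(q, q) | (_, q) <- xs]

    expectation :: RV Double -> Double
    expectation (RV xs) = sum [v * q | (v, q) <- xs]

    coin :: RV Char
    coin = RV [('H', 0.5), ('T', 0.5)]

    main :: IO ()
    main = print (expectation (fmap (negate . logBase 2) (pX coin)))
    -- prints 1.0, i.e. E[-log(P_X)] for a fair coin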