
I've been reading John Baez's series of posts on Information Geometry. I'm currently on part 6... Midway through the post he discusses Radon-Nikodym derivatives:

The formula for information gain looks more slick: $$\int_\Omega \log\left(\frac{d\mu}{d\nu}\right)d\mu$$ And by the way, in case you're wondering, the $d$ here doesn't actually mean much: we're just so brainwashed into wanting a $dx$ in our integrals that people often use $d\mu$ for a measure even though the simpler notation $\mu$ might be more logical. So, the function $\frac{d\mu}{d\nu}$ is really just a ratio of probability measures, but people call it a Radon-Nikodym derivative, because it looks like a derivative (and in some important examples it actually is). So, if I were talking to myself, I could have shortened this blog entry immensely by working directly with probability measures, leaving out the $d$'s, and saying:

Suppose $\mu$ and $\nu$ are probability measures; then the entropy of $\mu$ relative to $\nu$, or information gain, is $$S(\mu,\nu) = \int_\Omega \log\left(\frac{\mu}{\nu}\right)\mu$$
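For concreteness, here is the standard finite special case: if $\Omega = \{1,\dots,n\}$ and $\mu$, $\nu$ assign point masses $p_i$ and $q_i$ with every $q_i > 0$, then $\frac{d\mu}{d\nu}(i) = p_i/q_i$, and the formula reduces to the familiar Kullback-Leibler divergence $$S(\mu,\nu) = \sum_{i=1}^n p_i \log\frac{p_i}{q_i}.$$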

I understand the first integral, formulated with (the log of) the Radon-Nikodym derivative: since $\frac{d\mu}{d\nu}$ is just a function on elements of $\Omega$, the integral is the ordinary Lebesgue integral with respect to $\mu$. However, I don't understand how the second integral is defined: $\log\left(\frac{\mu}{\nu}\right)$ isn't a function of elements of $\Omega$, but, if anything, a function of subsets of $\Omega$ (and it clearly isn't itself a measure). What's the right way of thinking about this integral?
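To make the part I do understand concrete, suppose (purely for illustration) that $\mu$ and $\nu$ have densities $f$ and $g$ with respect to Lebesgue measure, with $g > 0$ wherever $f > 0$. Then $\frac{d\mu}{d\nu} = f/g$ almost everywhere, and the first integral is just $$\int_\Omega \log\left(\frac{f(x)}{g(x)}\right) f(x)\,dx.$$ It's the second, $d$-less formulation that I can't parse.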

Intuitively, my first instinct is to break $\Omega$ into a bunch of disjoint subsets whose $\mu$-measures are uniformly small, and take the limit of a sum over these sets as the bound decreases. Let's say for now that all the measures involved are dominated by Lebesgue measure. Something like this: let $\{A_i\}$ be a countable collection of subsets of $\Omega$ such that

  • $\cup_i A_i = \Omega$
  • $A_i \cap A_j = \emptyset$ when $i \ne j$
  • $\sup_i \mu(A_i) < \varepsilon$

Then $$ \int_\Omega \log\left(\frac{\mu}{\nu}\right)\mu \equiv \lim_{\varepsilon \to 0} \sum_i \log\left(\frac{\mu(A_i)}{\nu(A_i)}\right)\mu(A_i) $$

Clearly that isn't terribly rigorous, but is this on the right track conceptually? Or am I just deeply confused?
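Here is a quick numerical sketch of that limiting sum, under illustrative assumptions of my own choosing: $\mu$ and $\nu$ are taken to be normal distributions (so the relative entropy has a closed form to compare against), the partition is a grid of equal-width intervals, and SciPy supplies the CDFs.

```python
import numpy as np
from scipy.stats import norm

# Illustrative choice: two normal distributions, both dominated
# by Lebesgue measure, standing in for mu and nu.
mu = norm(loc=0.0, scale=1.0)
nu = norm(loc=1.0, scale=2.0)

def partition_sum(n_bins, lo=-20.0, hi=20.0):
    """Approximate S(mu, nu) by sum_i log(mu(A_i)/nu(A_i)) * mu(A_i)
    over a partition of [lo, hi] into n_bins equal-width intervals."""
    edges = np.linspace(lo, hi, n_bins + 1)
    mu_mass = np.diff(mu.cdf(edges))  # mu(A_i) for each interval
    nu_mass = np.diff(nu.cdf(edges))  # nu(A_i) for each interval
    # Intervals carrying zero mass under either measure are skipped;
    # this is a sketch, not a treatment of the degenerate cases.
    ok = (mu_mass > 0) & (nu_mass > 0)
    return np.sum(mu_mass[ok] * np.log(mu_mass[ok] / nu_mass[ok]))

# Closed-form relative entropy between two normals (in nats).
m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
exact = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

for n in (10, 100, 1_000, 10_000):
    print(f"{n:6d} bins: {partition_sum(n):.6f}   (exact: {exact:.6f})")
```

For these choices the partition sums do approach the closed-form value as the bins shrink, which is at least consistent with the proposed definition for measures with well-behaved densities.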

  • I disagree with Baez's simplification. The $d$ does mean something: it's the rate of change of one measure with respect to another. The Radon-Nikodym derivative is not just the ratio of two probability measures (perhaps a ratio of probability densities, but not of measures); note that it need not be a measure at all, just a function. http://mathworld.wolfram.com/Radon-NikodymDerivative.html –  Dec 01 '14 at 05:24
  • I have recently answered a very similar question – SBF Dec 01 '14 at 12:44
  • @William That the OP was reading about information geometry when the question arose doesn't make that tag pertinent; this is a pure integration and measure theory question. – Daniel Fischer May 01 '16 at 17:47
  • You are right; I apologize for the mistake on my part. – Chill2Macht May 01 '16 at 17:49

0 Answers