This is more a comment than an answer, intended to address question 2 through a rather rudimentary presentation of the thermodynamic formalism. Others are welcome to contribute.
Postulate A. A thermodynamic system is equivalent to a measure space $(X,\mathscr{B},\mu)$. $X$ is called the phase space; $\mu$ is a $\sigma$-finite measure.
A dynamical law (or rather an autonomous-in-time dynamical law) in a thermodynamic system is described by a collection of measurable transformations $S=\{S_t:t\in\mathbb{T}\}$. The index set $\mathbb{T}$, denoting time, may be either discrete ($\mathbb{Z}$ or $\mathbb{N}\cup\{0\}$, for example) or continuous ($\mathbb{R}$ or $[0,\infty)$). The dynamical law satisfies the following semigroup properties:
- $S_0(x)=x$ for all $x$
- $S_{t+t'}(x)= S_t(S_{t'}(x))$ for all $t,t'\in\mathbb{T}$ and $x\in X$.
- When $\mathbb{T}=\mathbb{Z}$ or $\mathbb{T}=\mathbb{R}$, the system $S$ is invertible, i.e. time reversible: $S_{t}\circ S_{-t}=S_0=S_{-t}\circ S_t$. If not all the $S_t$ are invertible ($\mathbb{T}=\mathbb{N}\cup\{0\}$ or $\mathbb{T}=[0,\infty)$), we say that $S$ is a noninvertible system (one cannot go back in time).
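As a minimal example of such a dynamical law (my own illustration, not part of the formalism above): take $X=\mathbb{R}$ with its Borel sets and, for a fixed $\alpha\in\mathbb{R}$,
$$S_t(x)=e^{\alpha t}x,\qquad t\in\mathbb{T}.$$
Then $S_0(x)=x$, $S_{t+t'}(x)=e^{\alpha(t+t')}x=S_t(S_{t'}(x))$, and $S_{-t}$ inverts $S_t$, so with $\mathbb{T}=\mathbb{R}$ this is an invertible system; taking $\mathbb{T}=[0,\infty)$ simply removes the inverses from the family. A genuinely noninvertible example is the doubling map $S_n(x)=2^n x \bmod 1$ on $[0,1)$ with $\mathbb{T}=\mathbb{N}\cup\{0\}$.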
For any $x\in X$, $\{S_t(x):t\in\mathbb{T}\}$ is called the trajectory of $x$. To study the way in which the dynamics change over time one may consider the individual trajectories of each point $x$ in the phase space; or, as in ergodic theory, one can study the way in which the dynamics affect an infinite number of points. This is done, in probabilistic terms, by studying how the system alters densities. A density $f$ is a measurable function $f\geq0$ such that $\int_Xf\,d\mu=1$.
Postulate B. A thermodynamic system has, at any given time $t$, a state characterized by a density $f_t$.
At any given time, for any $A\in\mathscr{B}$
$$\mu_t(A)=\int_A f_t(x)\,\mu(dx)$$
denotes the probability that at time $t$ the state of the system is in $A$. Typically the state at time $t$ is determined by the initial state $f_0$ through $$\int_X(\mathbb{1}_{A}\circ S_t)\, f_0\,d\mu=\int_X\mathbb{1}_A\, f_t\,d\mu,\qquad A\in\mathscr{B}.$$
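A quick sanity check of this evolution rule (my own illustration): take $X=\mathbb{R}$ with Lebesgue measure and the translation flow $S_t(x)=x+vt$ for a fixed speed $v$. Changing variables,
$$\int_{\mathbb{R}}\mathbb{1}_A(x+vt)\,f_0(x)\,dx=\int_{\mathbb{R}}\mathbb{1}_A(y)\,f_0(y-vt)\,dy,$$
so $f_t(y)=f_0(y-vt)$: the initial density is simply transported along the flow.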
An observable $\mathcal{O}$ is a measurable function $\mathcal{O}:X\rightarrow\mathbb{R}$. $\mathcal{O}(x)$ characterizes some aspect of the thermodynamic system. The average value of the observable at time $t$ is
$$\langle \mathcal{O}\rangle_{f_t} = \int_X\mathcal{O}(x)f_t(x)\,\mu(dx)$$
If for some density $f$ the measure $f\cdot\mu$ is invariant under the dynamical law, i.e. $\int_X(\mathbb{1}_A\circ S_t)\, f\,d\mu=\int_X\mathbb{1}_A\,f\,d\mu$ for all $t\in\mathbb{T}$ and $A\in\mathscr{B}$, then one expects some ergodicity properties:
$$\lim_{t\rightarrow\infty}\frac{1}{t}\int^t_0 g(S_u(x))\,du =\int_X g\,f\,d\mu\qquad f\cdot\mu-\text{a.e. }x$$
Such $f$ describes a state of thermodynamic equilibrium.
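A small numerical sketch of this (my own illustration; the map, observable, and constants are arbitrary choices): for the rotation $S(x)=x+\alpha \bmod 1$ on $[0,1)$ with $\alpha$ irrational, Lebesgue measure is invariant, so $f\equiv 1$ is an equilibrium density and the time average of an observable along a trajectory should converge to its space average.

```python
import numpy as np

# Irrational rotation S(x) = x + alpha (mod 1) on [0, 1) preserves Lebesgue
# measure, so f = 1 is an invariant (equilibrium) density.  The Birkhoff time
# average of an observable g along a single trajectory should converge to the
# space average \int_0^1 g(x) dx, which is 1/3 for g(x) = x^2.
alpha = np.sqrt(2) - 1        # an irrational rotation number
g = lambda x: x**2            # observable with space average 1/3

x = 0.1                       # arbitrary starting point
n_steps = 200_000
total = 0.0
for _ in range(n_steps):
    total += g(x)
    x = (x + alpha) % 1.0

print("time average :", total / n_steps)   # close to 1/3
print("space average:", 1 / 3)
```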
In his celebrated work Gibbs introduced the concept of index of probability for a system in state $\{f_t:t\in\mathbb{T}\}$ as $\log(f_t)$. Now the quantity
\begin{align}
H(f_t):=-\int_X\log(f_t(x))f_t(x)\,\mu(dx)\tag{0}\label{BG}
\end{align}
is called the Boltzmann-Gibbs entropy of the density $f_t$. To illustrate the intuition behind this quantity, suppose there are two thermodynamical systems $(X_j,\mathscr{F}_j,\mu_j)$, $j=1,2$, each in a state (density) $f^j$. We combine these two systems to form the system $(X_1\times X_2,\mathscr{F}_1\otimes\mathscr{F}_2,\mu_1\otimes\mu_2)$ with density $f(x_1,x_2)=f^1(x_1)f^2(x_2)$ (all this means that systems 1 and 2 do not interact with each other). Then it is expected that the entropy of the combined system equals the sum of the entropies of systems 1 and 2. It is easy to check that definition \eqref{BG} satisfies this property.
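Spelling out the check: by Fubini's theorem and $\int_{X_j}f^j\,d\mu_j=1$,
\begin{align}
H(f)&=-\int_{X_1\times X_2}\big(\log f^1(x_1)+\log f^2(x_2)\big)\,f^1(x_1)f^2(x_2)\,\mu_1(dx_1)\,\mu_2(dx_2)\\
&=-\int_{X_1}\log(f^1)\,f^1\,d\mu_1-\int_{X_2}\log(f^2)\,f^2\,d\mu_2=H(f^1)+H(f^2).
\end{align}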
Formally, the Boltzmann-Gibbs entropy of the density $f$ w.r.t. $\mu$ is defined as
$$H(f)=\int_X\eta(f(x))\,\mu(dx),\qquad \eta(w)=-w\log(w)\mathbb{1}_{(0,\infty)}(w)$$
The function $\eta$ is concave ($-\eta$ is convex) over $[0,\infty)$ and so,
$$\eta(w)\leq(w-v)\eta'(v)+\eta(v)=-w\log v-(w-v),\qquad w,v>0$$
Then, for any pair of densities $f$ and $g$ such that $\eta\circ f$ and $\eta\circ g$ are $\mu$-integrable (i.e., in $L_1(\mu)$), taking $w=f(x)$, $v=g(x)$ and integrating (the term $\int_X(f-g)\,d\mu$ vanishes because both are densities), we have that
$$\begin{align}
-\int_Xf(x)\log(f(x))\,\mu(dx)\leq -\int_Xf(x)\log(g(x))\,\mu(dx)\tag{1}\label{gibbs-ineq}
\end{align}$$
- It follows from \eqref{gibbs-ineq} that if $\mu(X)<\infty$, then the density $f_*(x)=\frac{1}{\mu(X)}$ maximizes the entropy amongst all densities. The density $f_*$ is a generalization of what Gibbs called the microcanonical ensemble.
- If $\nu$ and $\gamma$ are probability measures and $\nu\ll\gamma$, the relative entropy of $\nu$ relative to $\gamma$ is defined as
$$H(\nu|\gamma):=\int_X\log\big(\frac{d\nu}{d\gamma}\big)\,d\nu=
\int_X\log\big(\frac{d\nu}{d\gamma}\big)\frac{d\nu}{d\gamma}\,d\gamma=-\int_X\eta\big(\frac{d\nu}{d\gamma}\big)\,d\gamma$$
Since $-\eta$ is convex,
$$H(\nu|\gamma)=-\int_X\eta\big(\frac{d\nu}{d\gamma}\big)\,d\gamma\geq -\eta\Big(\int_X\frac{d\nu}{d\gamma}\,d\gamma\Big)=-\eta(1)=0$$
If in addition, $\gamma\ll\mu$ and $d\nu=f\,d\mu$, $d\gamma=g\,d\mu$
$$H(f|g):=H(f\,d\mu| g\,d\mu):=\int_X\log\big(\frac{f(x)}{g(x)}\big)\,f(x)\,\mu(dx)=-\int_X\eta\big(\frac{f}{g}\big)\,g\,d\mu
$$
In statistics, $H(\nu|\gamma)$ is known as the Kullback-Leibler divergence and is denoted by $K(\nu|\gamma)$.
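A small numerical sanity check of \eqref{gibbs-ineq} and of the nonnegativity of the relative entropy (my own illustration; here $\mu$ is the counting measure on a four-point space, so densities are just probability vectors):

```python
import numpy as np

# mu = counting measure on a 4-point space; densities are probability vectors.
f = np.array([0.1, 0.2, 0.3, 0.4])
g = np.array([0.25, 0.25, 0.25, 0.25])   # the "microcanonical" density 1/mu(X)

def H(p):
    """Boltzmann-Gibbs entropy -sum p log p (terms with p = 0 contribute 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def rel_entropy(p, q):
    """Relative entropy H(p|q) = sum p log(p/q), assuming p << q."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(H(f), "<=", np.log(4))             # entropy is bounded by log mu(X)
print(H(g), "=", np.log(4))              # the uniform density attains the bound
print(rel_entropy(f, g), ">= 0")         # Gibbs' inequality (1): H(nu|gamma) >= 0
```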
When $\mu$ is not finite, there are no entropy-maximizing densities. However, under some additional constraints we can find densities that maximize entropy. More concretely, suppose that for real constants $c_1,\ldots,c_k$ and observables $\mathcal{O}_1,\ldots,\mathcal{O}_k$, there are constants $\nu_1,\ldots, \nu_k$ such that
- $\exp\big(-\sum^k_{j=1}\nu_j\mathcal{O}_j\big)\in L_1(\mu)$,
- $c_j=Z^{-1}\int_X\mathcal{O}_j\exp\big(-\sum^k_{i=1}\nu_i\mathcal{O}_i\big)\,d\mu$ for each $j=1,\ldots,k$, where $Z=\int_X\exp\big(-\sum^k_{i=1}\nu_i\mathcal{O}_i\big)\,d\mu$.
Then, it follows from another application of \eqref{gibbs-ineq} that the density
$$\begin{align}
f_*=\frac{1}{Z}\exp\big(-\sum^k_{j=1}\nu_j\mathcal{O}_j\big)\tag{2}\label{gibbs-2}
\end{align}$$
maximizes the entropy $f\mapsto H(f)$ among all densities such that $c_j=\langle \mathcal{O}_j\rangle_f$.
The normalizing factor $Z$ is known as the partition function, and the density $f_*$ generalizes the canonical ensemble of Gibbs. Notice that time does not appear in the definition of the canonical ensemble.
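A numerical sketch of how the multipliers and the partition function are pinned down by the constraints (my own illustration; the six-point phase space, the single observable $\mathcal{O}(x)=x$, and the value $c=1.5$ are arbitrary choices):

```python
import numpy as np
from scipy.optimize import brentq

# Phase space X = {0,...,5} with counting measure mu, one observable O(x) = x,
# and the constraint <O>_f = c.  The entropy-maximizing density (2) is the
# Gibbs density f_*(x) = exp(-nu * O(x)) / Z; we solve numerically for nu.
X = np.arange(6)
O = X.astype(float)
c = 1.5                       # prescribed average of the observable

def gibbs_mean(nu):
    w = np.exp(-nu * O)       # unnormalized Gibbs weights
    return np.sum(O * w) / np.sum(w)

nu = brentq(lambda t: gibbs_mean(t) - c, 0.0, 10.0)   # solve <O>_{f_*} = c
Z = np.sum(np.exp(-nu * O))                           # partition function
f_star = np.exp(-nu * O) / Z

print("nu =", nu, " Z =", Z)
print("constraint check :", np.sum(O * f_star))       # ~ 1.5
print("entropy H(f_*)   :", -np.sum(f_star * np.log(f_star)))
```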
Postulate C: There exists a one-to-one correspondence between states of thermodynamic equilibrium and the states of maximal entropy.
Postulate D: Given a (nonnegative) observable $\mathcal{O}$ and a constant $c>0$, the entropy maximizing density given by \eqref{gibbs-2} satisfying $c=\langle \mathcal{O}\rangle_{f_*}$ corresponds to a state of thermodynamic equilibrium attained physically.
If there is only one state of thermodynamic equilibrium that is attained regardless of the way in which the system starts, then it is called a globally stable equilibrium (this is related to the second law of thermodynamics).
What follows addresses the differences and connections between the relative entropy introduced by the OP and the conditional entropies of random variables used in information theory.
- Recall that if $\nu$ is a probability measure on $(S,\mathscr{S})$ and $\nu\ll m$ for some $\sigma$-finite measure $m$ on $\mathscr{S}$, then the entropy of $\nu$ (w.r.t. $m$) is defined as
$$H_m(\nu)=-\mathbb{E}_\nu\big[\log\big(\frac{d\nu}{dm}\big)\big]=-\int_S\log\big(\frac{d\nu}{dm}\big)\,d\nu=-\int_S\log\big(\frac{d\nu}{dm}\big)\,\frac{d\nu}{dm}\,dm$$
Since $g(x)=\log(x)\mathbb{1}_{(0,\infty)}(x)$ is concave and $\frac{d\nu}{dm}>0$ $\nu$-a.s.,
$$H_m(\nu)=\mathbb{E}_\nu\big[\log\big(1/\tfrac{d\nu}{dm}\big)\big]\leq \log\big(\mathbb{E}_\nu\big[1/\tfrac{d\nu}{dm}\big]\big)=\log\Big(\int_S\frac{1}{\tfrac{d\nu}{dm}}\frac{d\nu}{dm}\,dm\Big)\leq\log(m(S))
$$
- Relative entropy refers to probability measures $\nu$ and $\gamma$ on a common measurable space $(S,\mathscr{S})$: If $\nu\ll\gamma$ then
$$H(\nu|\gamma):=\mathbb{E}_\nu\big[\log\big(\frac{d\nu}{d\gamma}\big)\big]=\int_S\log\big(\frac{d\nu}{d\gamma}\big)\,d\nu=\int_S \log\big(\frac{d\nu}{d\gamma}\big)\,\frac{d\nu}{d\gamma}\,d\gamma.$$
The convexity of $h(x)=x\log(x)\mathbb{1}_{(0,\infty)}(x)$ implies that $H(\nu|\gamma)\geq0$, with equality iff $\frac{d\nu}{d\gamma}=1$ $\gamma$-a.s., i.e., iff $\nu=\gamma$.
- If in addition, $\gamma\ll m$, then
as $\frac{d\nu}{dm}=\frac{d\nu}{d\gamma}\frac{d\gamma}{dm}$ $m$-a.s. we have that
$$H(\nu|\gamma)=\int_S\log\left(\frac{\tfrac{d\nu}{dm}}{\tfrac{d\gamma}{dm}}\right)\frac{d\nu}{dm}\,dm
$$
- Notice that if $m$ is itself a probability measure, then
$$H_m(\nu)=-H(\nu|m)$$
More generally, if $m(S)<\infty$ and $\overline{m}=\frac{1}{m(S)}m$, then
$$H(\nu|\overline{m})=\int_S\log\big(m(S)\frac{d\nu}{dm}\big)\frac{d\nu}{dm}\,dm=\log(m(S))-H_m(\nu)
$$
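A discrete sanity check of these identities (my own illustration; $S$ is a three-point space and the masses of $m$ and the weights of $\nu$ are arbitrary):

```python
import numpy as np

# Discrete check: S = {0,1,2}, reference measure m with masses m_i, and a
# probability measure nu with weights nu_i.  Then dnu/dm = nu_i / m_i.
m  = np.array([1.0, 2.0, 3.0])           # a finite (not probability) measure
nu = np.array([0.2, 0.3, 0.5])           # a probability measure on S

dnu_dm = nu / m
H_m_nu = -np.sum(np.log(dnu_dm) * nu)    # entropy of nu w.r.t. m

m_bar = m / m.sum()                      # normalized reference measure
rel_ent = np.sum(np.log(nu / m_bar) * nu)  # relative entropy H(nu | m_bar)

print("H_m(nu)                  :", H_m_nu)
print("log m(S) - H_m(nu)       :", np.log(m.sum()) - H_m_nu)
print("H(nu | m_bar)            :", rel_ent)     # matches the line above
print("bound H_m(nu) <= log m(S):", H_m_nu <= np.log(m.sum()))
```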
Conditional entropies relate random variables taking values in possibly different measure spaces. Suppose $(\Omega,\mathscr{F},\mathbb{P})$ is a probability space and $X:(\Omega,\mathscr{F})\rightarrow(S,\mathscr{S})$ and $Y:(\Omega,\mathscr{F})\rightarrow(T,\mathscr{T})$ are two random variables. For simplicity, assume there are $\sigma$-finite measures $m_S$ and $m_T$ on $(S,\mathscr{S})$ and $(T,\mathscr{T})$ respectively, and that
the joint law $P_{X,Y}$ of $(X,Y)$ on $(S\times T,\mathscr{S}\otimes\mathscr{T})$ satisfies $P_{X,Y}\ll m:=m_S\otimes m_T$. Set $p_{XY}:=\frac{dP_{X,Y}}{dm}$. If $P_X$ and $P_Y$ are the laws of $X$ and $Y$ respectively, then $P_X\ll m_S$, $P_Y\ll m_T$ and
\begin{align}
P_X(A)&:=\mathbb{P}[X\in A]=\int_S\mathbb{1}_A(x)\Big(\int_T p_{XY}(x,y)\,m_T(dy)\Big)\,m_S(dx)\\
P_Y(B)&:=\mathbb{P}[Y\in B]=\int_T\mathbb{1}_B(y)\Big(\int_S p_{XY}(x,y)\,m_S(dx)\Big)\,m_T(dy)
\end{align}
It follows that $p_X(x):=\frac{dP_X}{dm_S}(x)=\int_Tp_{XY}(x,y)\,m_T(dy)$ and $p_Y(y):=\frac{dP_Y}{dm_T}(y)=\int_Sp_{XY}(x,y)\,m_S(dx)$.
The conditional distribution of $X$ given $Y$ admits a regular version and, for each $y\in T$ with $p_Y(y)>0$, we have $\frac{dP_{X|Y=y}}{dm_S}(x)=\frac{p_{XY}(x,y)}{p_Y(y)}$, that is,
$$\mathbb{P}[X\in A|\sigma(Y)](y)=\int_S\mathbb{1}_A(x)\frac{p_{XY}(x,y)}{p_Y(y)}\,m_S(dx), \qquad A\in\mathscr{S}.$$
We use $H(X)$ to denote the entropy of $P_X$ w.r.t. $m_S$, that is $H(X)=H_{m_S}(P_X)$. Similarly, $H(Y)=H_{m_T}(P_Y)$.
The conditional entropy of $X$ given $Y$ is defined as the entropy of the conditional distribution $P_{X|Y}$ (relative to the $\sigma$-finite measure $m_S$), that is,
\begin{align}
H(P_{X|Y})(y)&:=-\int_S \log\Big(\frac{dP_{X|Y=y}}{dm_S}(x)\Big)\,\frac{dP_{X|Y=y}}{dm_S}(x)\,m_S(dx)\\
&=-\int_S\log\Big(\frac{p_{XY}(x,y)}{p_Y(y)}\Big)\frac{p_{XY}(x,y)}{p_Y(y)}\,m_S(dx)
\end{align}
The mean conditional entropy, $H(X|Y)$, of $X$ given $Y$ is then defined as
\begin{align}
H(X|Y)&:=\mathbb{E}[H(P_{X|Y})]\\
&=-\int_T\Big(\int_S\log\Big(\frac{p_{XY}(x,y)}{p_Y(y)}\Big)\frac{p_{XY}(x,y)}{p_Y(y)}\,m_S(dx)\Big)\,p_Y(y)\,m_T(dy)
\end{align}
The identity $\log(p_{X,Y}(x,y))=\log(p_Y(y))+\log(p_{X|Y=y}(x))$ yields
\begin{align}
H(X,Y)&:=H_m(P_{X,Y})=-\int_{S\times T} \log\big(p_{X,Y}(x,y)\big)\, p_{X,Y}(x,y)\,m_S\otimes m_T(dx,dy)\\
&=-\int_T \log(p_Y(y))\Big(\int_S p_{X,Y}(x,y)\,m_S(dx)\Big) \,m_T(dy)\\
&\qquad\qquad -\int_T p_Y(y)\Big(\int_S\log\big(\frac{p_{X,Y}(x,y)}{p_Y(y)}\big)\frac{p_{X,Y}(x,y)}{p_Y(y)}\,m_S(dx)\Big)\,m_T(dy)\\
&=H_{m_T}(P_Y)+H(X|Y)=: H(Y)+H(X|Y)
\end{align}
The assumptions on $X$ and $Y$ imply that $P_{X,Y}\ll P_X\otimes P_Y$ (see this posting for a proof). Thus,
\begin{align}
0\leq H(P_{XY}|P_X\otimes P_Y)&=\int_{S\times T}\log\Big(\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\Big)\,p_{XY}(x,y)\,m_S\otimes m_T(dx,dy)\\
&=\int_{S\times T}\log\Big(\frac{p_{XY}(x,y)}{p_Y(y)}\Big)\,p_{XY}(x,y)\,m_S\otimes m_T(dx,dy)\\
&\qquad - \int_{S\times T}\log(p_X(x))p_{X,Y}(x,y)\,m_S\otimes m_T(dx,dy)\\
&=-H(X|Y)+ H(X)
\end{align}
and so, $H(X|Y)\leq H(X)$ with equality iff $p_{XY}(x,y)=p_X(x)p_Y(y)$ $m$-a.e., i.e., iff $X$ and $Y$ are independent.
Combining the last two computations we have that
\begin{align}
H(P_{X,Y}|P_X\otimes P_Y)= H(X)+ H(Y) - H(X,Y)
\end{align}
The quantity $H(P_{X,Y}|P_X\otimes P_Y)$, typically denoted as $I(X,Y)$, is known as the mutual information of $X$ and $Y$.
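To tie these pieces together, here is a discrete numerical check (my own illustration; the $2\times 3$ joint table is an arbitrary choice) of the chain rule $H(X,Y)=H(Y)+H(X|Y)$, of $H(X|Y)\leq H(X)$, and of $I(X,Y)=H(X)+H(Y)-H(X,Y)$:

```python
import numpy as np

# Joint pmf of (X, Y) on a 2 x 3 grid; the reference measures m_S, m_T are
# counting measures, so all densities are just probability tables.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

p_x = p_xy.sum(axis=1)   # marginal density of X
p_y = p_xy.sum(axis=0)   # marginal density of Y

def H(p):
    """Entropy -sum p log p of a probability vector (w.r.t. counting measure)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_X, H_Y, H_XY = H(p_x), H(p_y), H(p_xy.ravel())

# Mean conditional entropy H(X|Y): average the entropy of p_{X|Y=y} over p_Y.
cond = p_xy / p_y                                     # column y holds p_{X|Y=y}
H_X_given_Y = sum(p_y[j] * H(cond[:, j]) for j in range(p_xy.shape[1]))

# Mutual information I(X,Y) = H(P_{X,Y} | P_X x P_Y).
I_XY = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

print("chain rule    :", H_XY, "=", H_Y + H_X_given_Y)
print("H(X|Y) <= H(X):", H_X_given_Y, "<=", H_X)
print("I(X,Y)        :", I_XY, "=", H_X + H_Y - H_XY)
```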