This is more a comment than an answer, intended to address question 2 through a rather rudimentary presentation of the thermodynamic formalism. Others are welcome to contribute.
Postulate A. A thermodynamic system is equivalent to a measure space $(X,\mathscr{B},\mu)$. $X$ is called the phase space; $\mu$ is a $\sigma$-finite measure.
A dynamical law (or rather an autonomous-in-time dynamical law) in a thermodynamic system is described by a collection of measurable transformations $S=\{S_t:t\in\mathbb{T}\}$. The index set $\mathbb{T}$, denoting time, may be either discrete ($\mathbb{Z}$ or $\mathbb{N}\cup\{0\}$, for example) or continuous ($\mathbb{R}$ or $[0,\infty)$). The dynamical law satisfies the following semigroup properties:
- $S_0(x)=x$ for all $x$
- $S_{t+t'}(x)= S_t(S_{t'}(x))$ for all $t,t'\in\mathbb{T}$ and $x\in X$.
- When $\mathbb{T}=\mathbb{Z}$ or $\mathbb{T}=\mathbb{R}$, the system $S$ is invertible, i.e. time reversible: $S_{t}\circ S_{-t}=S_0=S_{-t}\circ S_t$. If not all the $S_t$ are invertible ($\mathbb{T}=\mathbb{N}\cup\{0\}$ or $\mathbb{T}=[0,\infty)$), we say that $S$ is a noninvertible system (one cannot go back in time).
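As a minimal example of such a dynamical law (my own illustration, not part of the formalism above): take $X=\mathbb{R}$ with its Borel sets and, for a fixed $\alpha\in\mathbb{R}$,
$$S_t(x)=e^{\alpha t}x,\qquad t\in\mathbb{T}.$$
Then $S_0(x)=x$, $S_{t+t'}(x)=e^{\alpha(t+t')}x=S_t(S_{t'}(x))$, and $S_{-t}$ inverts $S_t$, so with $\mathbb{T}=\mathbb{R}$ this is an invertible system; taking $\mathbb{T}=[0,\infty)$ simply removes the inverses from the family. A genuinely noninvertible example is the doubling map $S_n(x)=2^n x \bmod 1$ on $[0,1)$ with $\mathbb{T}=\mathbb{N}\cup\{0\}$.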
For any $x\in X$, $\{S_t(x):t\in\mathbb{T}\}$ is called the trajectory of $x$. To study the way in which the dynamics change over time one may consider the individual trajectories of each point $x$ in the phase space; or, as in ergodic theory, one can study the way in which the dynamics affect an infinite number of points. This is done, in probabilistic terms, by studying how the system alters densities. A density $f$ is a measurable function $f\geq0$ such that $\int_Xf\,d\mu=1$.
Postulate B. A thermodynamic system has, at any given time $t$, a state characterized by a density $f_t$.
At any given time, for any $A\in\mathscr{B}$
$$\mu_t(A)=\int_A f_t(x)\,\mu(dx)$$
denotes the probability that at time $t$ the state of the system is in $A$. Typically the state at time $t$ is determined by the initial state $f_0$ through $$\int_X(\mathbb{1}_{A}\circ S_t)\, f_0\,d\mu=\int_X\mathbb{1}_A\, f_t\,d\mu,\qquad A\in\mathscr{B}.$$
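A quick sanity check of this evolution rule (my own illustration): take $X=\mathbb{R}$ with Lebesgue measure and the translation flow $S_t(x)=x+vt$ for a fixed speed $v$. Changing variables,
$$\int_{\mathbb{R}}\mathbb{1}_A(x+vt)\,f_0(x)\,dx=\int_{\mathbb{R}}\mathbb{1}_A(y)\,f_0(y-vt)\,dy,$$
so $f_t(y)=f_0(y-vt)$: the initial density is simply transported along the flow.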
An observable $\mathcal{O}$ is a measurable function $\mathcal{O}:X\rightarrow\mathbb{R}$. $\mathcal{O}(x)$ characterizes some aspect of the thermodynamic system. The average value of the observable at time $t$ is
$$\langle \mathcal{O}\rangle_{f_t} = \int_X\mathcal{O}(x)f_t(x)\,\mu(dx)$$
If for some density $f$ the measure $f\cdot\mu$ is invariant under the dynamical law, i.e. $\int_X(\mathbb{1}_A\circ S_t)\, f\,d\mu=\int_X\mathbb{1}_A\,f\,d\mu$ for all $t\in\mathbb{T}$ and $A\in\mathscr{B}$, then one expects some ergodicity properties:
$$\lim_{t\rightarrow\infty}\frac{1}{t}\int^t_0 g(S_u(x))\,du =\int_X g\,f\,d\mu\qquad f\cdot\mu-\text{a.e. }x$$
Such $f$ describes a state of thermodynamic equilibrium.
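A small numerical sketch of this (my own illustration; the map, observable, and constants are arbitrary choices): for the rotation $S(x)=x+\alpha \bmod 1$ on $[0,1)$ with $\alpha$ irrational, Lebesgue measure is invariant, so $f\equiv 1$ is an equilibrium density and the time average of an observable along a trajectory should converge to its space average.

```python
import numpy as np

# Irrational rotation S(x) = x + alpha (mod 1) on [0, 1) preserves Lebesgue
# measure, so f = 1 is an invariant (equilibrium) density.  The Birkhoff time
# average of an observable g along a single trajectory should converge to the
# space average \int_0^1 g(x) dx, which is 1/3 for g(x) = x^2.
alpha = np.sqrt(2) - 1        # an irrational rotation number
g = lambda x: x**2            # observable with space average 1/3

x = 0.1                       # arbitrary starting point
n_steps = 200_000
total = 0.0
for _ in range(n_steps):
    total += g(x)
    x = (x + alpha) % 1.0

print("time average :", total / n_steps)   # close to 1/3
print("space average:", 1 / 3)
```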
In his celebrated work Gibbs introduced the concept of index of probability for a system in state $\{f_t:t\in\mathbb{T}\}$ as $\log(f_t)$. Now the quantity
\begin{align}
H(f_t):=-\int_X\log(f_t(x))f_t(x)\,\mu(dx)\tag{0}\label{BG}
\end{align}
is called the Boltzmann-Gibbs entropy of the density $f_t$. To illustrate the intuition behind this quantity, suppose there are two thermodynamical systems $(X_j,\mathscr{F}_j,\mu_j)$, $j=1,2$, each in a state (density) $f^j$. We combine these two systems to form the system $(X_1\times X_2,\mathscr{F}_1\otimes\mathscr{F}_2,\mu_1\otimes\mu_2)$ with density $f(x_1,x_2)=f^1(x_1)f^2(x_2)$ (all this means that systems 1 and 2 do not interact with each other). Then it is expected that the entropy of the combined system equals the sum of the entropies of systems 1 and 2. It is easy to check that definition \eqref{BG} satisfies this property.
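Spelling out the check: by Fubini's theorem and $\int_{X_j}f^j\,d\mu_j=1$,
\begin{align}
H(f)&=-\int_{X_1\times X_2}\big(\log f^1(x_1)+\log f^2(x_2)\big)\,f^1(x_1)f^2(x_2)\,\mu_1(dx_1)\,\mu_2(dx_2)\\
&=-\int_{X_1}\log(f^1)\,f^1\,d\mu_1-\int_{X_2}\log(f^2)\,f^2\,d\mu_2=H(f^1)+H(f^2).
\end{align}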
Formally, the Boltzmann-Gibbs entropy of the density $f$ w.r.t. $\mu$ is defined as
$$H(f)=\int_X\eta(f(x))\,\mu(dx),\qquad \eta(w)=-w\log(w)\mathbb{1}_{(0,\infty)}(w)$$
The function $\eta$ is concave ($-\eta$ is convex) over $[0,\infty)$ and so,
$$\eta(w)\leq(w-v)\eta'(v)+\eta(v)=-w\log v-(w-v),\qquad w,v>0$$
Then, for any pair of densities $f$ and $g$ such that $\eta\circ f$ and $\eta\circ g$ are $\mu$-integrable (i.e., in $L_1(\mu)$), taking $w=f(x)$, $v=g(x)$ and integrating (the term $\int_X(f-g)\,d\mu$ vanishes because both are densities), we have that
$$\begin{align}
-\int_Xf(x)\log(f(x))\,\mu(dx)\leq -\int_Xf(x)\log(g(x))\,\mu(dx)\tag{1}\label{gibbs-ineq}
\end{align}$$
- It follows from \eqref{gibbs-ineq} that if $\mu(X)<\infty$, then the density $f_*(x)=\frac{1}{\mu(X)}$ maximizes the entropy amongst all densities. The density $f_*$ is a generalization of what Gibbs called the microcanonical ensemble.
- If $\nu$ and $\gamma$ are probability measures and $\nu\ll\gamma$, the relative entropy of $\nu$ relative to $\gamma$ is defined as
$$H(\nu|\gamma):=\int_X\log\big(\frac{d\nu}{d\gamma}\big)\,d\nu=
\int_X\log\big(\frac{d\nu}{d\gamma}\big)\frac{d\nu}{d\gamma}\,d\gamma=-\int_X\eta\big(\frac{d\nu}{d\gamma}\big)\,d\gamma$$
Since $-\eta$ is convex,
$$H(\nu|\gamma)=-\int_X\eta\big(\frac{d\nu}{d\gamma}\big)\,d\gamma\geq -\eta\Big(\int_X\frac{d\nu}{d\gamma}\,d\gamma\Big)=-\eta(1)=0$$
If in addition, $\gamma\ll\mu$ and $d\nu=f\,d\mu$, $d\gamma=g\,d\mu$
$$H(f|g):=H(f\,d\mu| g\,d\mu):=\int_X\log\big(\frac{f(x)}{g(x)}\big)\,f(x)\,\mu(dx)=-\int_X\eta\big(\frac{f}{g}\big)\,g\,d\mu
$$
In statistics, $H(\nu|\gamma)$ is known as the Kullback-Leibler divergence and is denoted by $K(\nu|\gamma)$.
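A small numerical sanity check of \eqref{gibbs-ineq} and of the nonnegativity of the relative entropy (my own illustration; here $\mu$ is the counting measure on a four-point space, so densities are just probability vectors):

```python
import numpy as np

# mu = counting measure on a 4-point space; densities are probability vectors.
f = np.array([0.1, 0.2, 0.3, 0.4])
g = np.array([0.25, 0.25, 0.25, 0.25])   # the "microcanonical" density 1/mu(X)

def H(p):
    """Boltzmann-Gibbs entropy -sum p log p (terms with p = 0 contribute 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def rel_entropy(p, q):
    """Relative entropy H(p|q) = sum p log(p/q), assuming p << q."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(H(f), "<=", np.log(4))             # entropy is bounded by log mu(X)
print(H(g), "=", np.log(4))              # the uniform density attains the bound
print(rel_entropy(f, g), ">= 0")         # Gibbs' inequality (1): H(nu|gamma) >= 0
```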
When $\mu$ is not finite, there are no entropy-maximizing densities. However, under some additional constraints we can find densities that maximize entropy. More concretely, suppose that for real constants $c_1,\ldots,c_k$ and observables $\mathcal{O}_1,\ldots,\mathcal{O}_k$, there are constants $\nu_1,\ldots, \nu_k$ such that
- $\exp\big(-\sum^k_{j=1}\nu_j\mathcal{O}_j\big)\in L_1(\mu)$,
- $c_j=Z^{-1}\int_X\mathcal{O}_j\exp\big(-\sum^k_{i=1}\nu_i\mathcal{O}_i\big)\,d\mu$ for each $j=1,\ldots,k$, where $Z=\int_X\exp\big(-\sum^k_{i=1}\nu_i\mathcal{O}_i\big)\,d\mu$.
Then, it follows from another application of \eqref{gibbs-ineq} that the density
$$\begin{align}
f_*=\frac{1}{Z}\exp\big(-\sum^k_{j=1}\nu_j\mathcal{O}_j\big)\tag{2}\label{gibbs-2}
\end{align}$$
maximizes the entropy $f\mapsto H(f)$ among all densities such that $c_j=\langle \mathcal{O}_j\rangle_f$.
The normalizing factor $Z$ is known as the partition function, and the density $f_*$ generalizes the canonical ensemble of Gibbs. Notice that time does not appear in the definition of the canonical ensemble.
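A numerical sketch of how the multipliers and the partition function are pinned down by the constraints (my own illustration; the six-point phase space, the single observable $\mathcal{O}(x)=x$, and the value $c=1.5$ are arbitrary choices):

```python
import numpy as np
from scipy.optimize import brentq

# Phase space X = {0,...,5} with counting measure mu, one observable O(x) = x,
# and the constraint <O>_f = c.  The entropy-maximizing density (2) is the
# Gibbs density f_*(x) = exp(-nu * O(x)) / Z; we solve numerically for nu.
X = np.arange(6)
O = X.astype(float)
c = 1.5                       # prescribed average of the observable

def gibbs_mean(nu):
    w = np.exp(-nu * O)       # unnormalized Gibbs weights
    return np.sum(O * w) / np.sum(w)

nu = brentq(lambda t: gibbs_mean(t) - c, 0.0, 10.0)   # solve <O>_{f_*} = c
Z = np.sum(np.exp(-nu * O))                           # partition function
f_star = np.exp(-nu * O) / Z

print("nu =", nu, " Z =", Z)
print("constraint check :", np.sum(O * f_star))       # ~ 1.5
print("entropy H(f_*)   :", -np.sum(f_star * np.log(f_star)))
```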
Postulate C: There exists a one-to-one correspondence between states of thermodynamic equilibrium and the states of maximal entropy.
Postulate D: Given a (nonnegative) observable $\mathcal{O}$ and a constant $c>0$, the entropy maximizing density given by \eqref{gibbs-2} satisfying $c=\langle \mathcal{O}\rangle_{f_*}$ corresponds to a state of thermodynamic equilibrium attained physically.
If there is only one state of thermodynamic equilibrium that is attained regardless of the way in which the system starts, then it is called a globally stable equilibrium (this is related to the second law of thermodynamics).
What follows addresses the differences and connections between the relative entropy introduced by the OP and the conditional entropies of random variables used in information theory.
- Recall that if $\nu$ is a probability measure on $(S,\mathscr{S})$ and $\nu\ll m$ for some $\sigma$-finite measure $m$ on $\mathscr{S}$, then the entropy of $\nu$ (w.r.t. $m$) is defined as
$$H_m(\nu)=-\mathbb{E}_\nu\big[\log\big(\frac{d\nu}{dm}\big)\big]=-\int_S\log\big(\frac{d\nu}{dm}\big)\,d\nu=-\int_S\log\big(\frac{d\nu}{dm}\big)\,\frac{d\nu}{dm}\,dm$$
Since $g(x)=\log(x)\mathbb{1}_{(0,\infty)}(x)$ is concave and $\frac{d\nu}{dm}>0$ $\nu$-a.s.,
$$H_m(\nu)=\mathbb{E}_\nu\big[\log\big(1/\tfrac{d\nu}{dm}\big)\big]\leq \log\big(\mathbb{E}_\nu\big[1/\tfrac{d\nu}{dm}\big]\big)=\log\Big(\int_S\frac{1}{\tfrac{d\nu}{dm}}\frac{d\nu}{dm}\,dm\Big)\leq\log(m(S))
$$
- Relative entropy refers to probability measures $\nu$ and $\gamma$ on a common measurable space $(S,\mathscr{S})$: If $\nu\ll\gamma$ then
$$H(\nu|\gamma):=\mathbb{E}_\nu\big[\log\big(\frac{d\nu}{d\gamma}\big)\big]=\int_S\log\big(\frac{d\nu}{d\gamma}\big)\,d\nu=\int_S \log\big(\frac{d\nu}{d\gamma}\big)\,\frac{d\nu}{d\gamma}\,d\gamma.$$
The convexity of $h(x)=x\log(x)\mathbb{1}_{(0,\infty)}(x)$ implies that $H(\nu|\gamma)\geq0$, with equality iff $\frac{d\nu}{d\gamma}=1$ $\gamma$-a.s., i.e., iff $\nu=\gamma$.
- If in addition, $\gamma\ll m$, then
as $\frac{d\nu}{dm}=\frac{d\nu}{d\gamma}\frac{d\gamma}{dm}$ $m$-a.s. we have that
$$H(\nu|\gamma)=\int_S\log\left(\frac{\tfrac{d\nu}{dm}}{\tfrac{d\gamma}{dm}}\right)\frac{d\nu}{dm}\,dm
$$
- Notice that if $m$ is itself a probability measure, then
$$H_m(\nu)=-H(\nu|m)$$
More generally, if $m(S)<\infty$ and $\overline{m}=\frac{1}{m(S)}m$, then
$$H(\nu|\overline{m})=\int_S\log\big(m(S)\frac{d\nu}{dm}\big)\frac{d\nu}{dm}\,dm=\log(m(S))-H_m(\nu)
$$
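A discrete sanity check of these identities (my own illustration; $S$ is a three-point space and the masses of $m$ and the weights of $\nu$ are arbitrary):

```python
import numpy as np

# Discrete check: S = {0,1,2}, reference measure m with masses m_i, and a
# probability measure nu with weights nu_i.  Then dnu/dm = nu_i / m_i.
m  = np.array([1.0, 2.0, 3.0])           # a finite (not probability) measure
nu = np.array([0.2, 0.3, 0.5])           # a probability measure on S

dnu_dm = nu / m
H_m_nu = -np.sum(np.log(dnu_dm) * nu)    # entropy of nu w.r.t. m

m_bar = m / m.sum()                      # normalized reference measure
rel_ent = np.sum(np.log(nu / m_bar) * nu)  # relative entropy H(nu | m_bar)

print("H_m(nu)                  :", H_m_nu)
print("log m(S) - H_m(nu)       :", np.log(m.sum()) - H_m_nu)
print("H(nu | m_bar)            :", rel_ent)     # matches the line above
print("bound H_m(nu) <= log m(S):", H_m_nu <= np.log(m.sum()))
```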
Conditional entropies relate random variables taking values in possibly different measure spaces. Suppose $(\Omega,\mathscr{F},\mathbb{P})$ is a probability space and $X:(\Omega,\mathscr{F})\rightarrow(S,\mathscr{S})$ and $Y:(\Omega,\mathscr{F})\rightarrow(T,\mathscr{T})$ are two random variables. For simplicity, assume there are $\sigma$-finite measures $m_S$ and $m_T$ on $(S,\mathscr{S})$ and $(T,\mathscr{T})$ respectively, and that
the joint law $P_{X,Y}$ of $(X,Y)$ on $(S\times T,\mathscr{S}\otimes\mathscr{T})$ satisfies $P_{X,Y}\ll m:=m_S\otimes m_T$. Set $p_{XY}:=\frac{dP_{X,Y}}{dm}$. If $P_X$ and $P_Y$ are the laws of $X$ and $Y$ respectively, then $P_X\ll m_S$, $P_Y\ll m_T$ and
\begin{align}
P_X(A)&:=\mathbb{P}[X\in A]=\int_S\mathbb{1}_A(x)\Big(\int_T p_{XY}(x,y)\,m_T(dy)\Big)\,m_S(dx)\\
P_Y(B)&:=\mathbb{P}[Y\in B]=\int_T\mathbb{1}_B(y)\Big(\int_S p_{XY}(x,y)\,m_S(dx)\Big)\,m_T(dy)
\end{align}
It follows that $p_X(x):=\frac{dP_X}{dm_S}(x)=\int_Tp_{XY}(x,y)\,m_T(dy)$ and $p_Y(y):=\frac{dP_Y}{dm_T}(y)=\int_Sp_{XY}(x,y)\,m_S(dx)$.
The conditional distribution of $X$ given $Y$ admits a regular version and, for each $y\in T$ with $p_Y(y)>0$, we have $\frac{dP_{X|Y=y}}{dm_S}(x)=\frac{p_{XY}(x,y)}{p_Y(y)}$, that is,
$$\mathbb{P}[X\in A|\sigma(Y)](y)=\int_S\mathbb{1}_A(x)\frac{p_{XY}(x,y)}{p_Y(y)}\,m_S(dx), \qquad A\in\mathscr{S}.$$
We use $H(X)$ to denote the entropy of $P_X$ w.r.t. $m_S$, that is $H(X)=H_{m_S}(P_X)$. Similarly, $H(Y)=H_{m_T}(P_Y)$.
The conditional entropy of $X$ given $Y$ is defined as the entropy of the conditional distribution $P_{X|Y}$ (relative to the $\sigma$-finite measure $m_S$), that is,
\begin{align}
H(P_{X|Y})(y)&:=-\int_S \log\Big(\frac{dP_{X|Y=y}}{dm_S}(x)\Big)\,\frac{dP_{X|Y=y}}{dm_S}(x)\,m_S(dx)\\
&=-\int_S\log\Big(\frac{p_{XY}(x,y)}{p_Y(y)}\Big)\frac{p_{XY}(x,y)}{p_Y(y)}\,m_S(dx)
\end{align}
The mean conditional entropy, $H(X|Y)$, of $X$ given $Y$ is then defined as
\begin{align}
H(X|Y)&:=\mathbb{E}[H(P_{X|Y})]\\
&=-\int_T\Big(\int_S\log\Big(\frac{p_{XY}(x,y)}{p_Y(y)}\Big)\frac{p_{XY}(x,y)}{p_Y(y)}\,m_S(dx)\Big)\,p_Y(y)\,m_T(dy)
\end{align}
The identity $\log(p_{X,Y}(x,y))=\log(p_Y(y))+\log(p_{X|Y=y}(x))$ yields
\begin{align}
H(X,Y)&:=H_m(P_{X,Y})=-\int_{S\times T} \log\big(p_{X,Y}(x,y)\big)\, p_{X,Y}(x,y)\,m_S\otimes m_T(dx,dy)\\
&=-\int_T \log(p_Y(y))\Big(\int_S p_{X,Y}(x,y)\,m_S(dx)\Big) \,m_T(dy)\\
&\qquad\qquad -\int_T p_Y(y)\Big(\int_S\log\big(\frac{p_{X,Y}(x,y)}{p_Y(y)}\big)\frac{p_{X,Y}(x,y)}{p_Y(y)}\,m_S(dx)\Big)\,m_T(dy)\\
&=H_{m_T}(P_Y)+H(X|Y)=: H(Y)+H(X|Y)
\end{align}
The assumptions on $X$ and $Y$ imply that $P_{X,Y}\ll P_X\otimes P_Y$ (see this posting for a proof). Thus,
\begin{align}
0\leq H(P_{XY}|P_X\otimes P_Y)&=\int_{S\times T}\log\Big(\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\Big)\,p_{XY}(x,y)\,m_S\otimes m_T(dx,dy)\\
&=\int_{S\times T}\log\Big(\frac{p_{XY}(x,y)}{p_Y(y)}\Big)\,p_{XY}(x,y)\,m_S\otimes m_T(dx,dy)\\
&\qquad - \int_{S\times T}\log(p_X(x))p_{X,Y}(x,y)\,m_S\otimes m_T(dx,dy)\\
&=-H(X|Y)+ H(X)
\end{align}
and so, $H(X|Y)\leq H(X)$ with equality iff $p_{XY}(x,y)=p_X(x)p_Y(y)$ $m$-a.e., i.e., iff $X$ and $Y$ are independent.
Combining the last two computations we have that
\begin{align}
H(P_{X,Y}|P_X\otimes P_Y)= H(X)+ H(Y) - H(X,Y)
\end{align}
The quantity $H(P_{X,Y}|P_X\otimes P_Y)$, typically denoted as $I(X,Y)$, is known as the mutual information of $X$ and $Y$.
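To tie these pieces together, here is a discrete numerical check (my own illustration; the $2\times 3$ joint table is an arbitrary choice) of the chain rule $H(X,Y)=H(Y)+H(X|Y)$, of $H(X|Y)\leq H(X)$, and of $I(X,Y)=H(X)+H(Y)-H(X,Y)$:

```python
import numpy as np

# Joint pmf of (X, Y) on a 2 x 3 grid; the reference measures m_S, m_T are
# counting measures, so all densities are just probability tables.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

p_x = p_xy.sum(axis=1)   # marginal density of X
p_y = p_xy.sum(axis=0)   # marginal density of Y

def H(p):
    """Entropy -sum p log p of a probability vector (w.r.t. counting measure)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_X, H_Y, H_XY = H(p_x), H(p_y), H(p_xy.ravel())

# Mean conditional entropy H(X|Y): average the entropy of p_{X|Y=y} over p_Y.
cond = p_xy / p_y                                     # column y holds p_{X|Y=y}
H_X_given_Y = sum(p_y[j] * H(cond[:, j]) for j in range(p_xy.shape[1]))

# Mutual information I(X,Y) = H(P_{X,Y} | P_X x P_Y).
I_XY = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

print("chain rule    :", H_XY, "=", H_Y + H_X_given_Y)
print("H(X|Y) <= H(X):", H_X_given_Y, "<=", H_X)
print("I(X,Y)        :", I_XY, "=", H_X + H_Y - H_XY)
```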