
Moment projection is defined as $\text{arg min}_{q\in Q} D(p||q)$, while information projection is defined as $\text{arg min}_{q\in Q} D(q||p)$. Aside from the difference in the formulas, how should one interpret the difference between the two measures intuitively? And when should one use moment projection rather than information projection, and vice versa?

1 Answer


Both the M-projection and the I-projection are projections of a probability distribution $p$ onto a set of distributions $Q$. Each can be defined as the distribution $q$, chosen among all those in the set $Q$, that is "closest" to $p$. Here "closest" refers to the distribution that minimizes the relative entropy, a well-known (though asymmetric) measure of the dissimilarity between two distributions, also called the Kullback–Leibler divergence and commonly denoted $D(p||q)$; the two projections differ only in the direction in which the divergence is taken. In particular, since the relative entropy expresses the information gained when shifting from $q$ to $p$, the M-projection and the I-projection can both be interpreted as distributions that minimize the amount of information lost when $q$ is used as a surrogate for $p$.
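
As a quick illustration of this asymmetry, here is a minimal sketch (assuming NumPy is available; the two discrete distributions are arbitrary illustrative choices, not from the answer) that computes the divergence in both directions:

```python
# Minimal sketch of the asymmetry of relative entropy (assumes NumPy).
import numpy as np

def kl(p, q):
    """Relative entropy D(p||q) for discrete distributions, in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.4, 0.1]  # illustrative distributions
q = [0.3, 0.3, 0.4]
print(kl(p, q))  # ~0.232
print(kl(q, p))  # ~0.315: a different value, so D is not symmetric
```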

Since the relative entropy is not symmetric, the M-projection and the I-projection are generally different. The main differences between them can be understood by looking at what each of them minimizes in terms of entropy and cross entropy. The M-projection is the distribution $q$ that minimizes

$$D(p||q)=-H_p+E_p(-\log q)$$

where $H_p$ is the entropy of the distribution $p$ and $E_p(-\log q)$ is the cross entropy between $p$ and $q$. Since $H_p$ does not depend on $q$, only the cross-entropy term matters for the minimization. The distribution $q$ that minimizes this distance tends to have high density in all regions that are probable according to $p$, because a small $-\log q$ in these regions yields a smaller second term. It also tends to extend over regions with intermediate probability according to $p$ (i.e., it is not strictly concentrated in the peaks of $p$), because the penalty for assigning low density to these regions is considerable. The net result is that the M-projection commonly has a relatively large variance; it is sometimes described as "mass-covering".
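
To see this behavior concretely, here is a minimal numerical sketch (assuming NumPy and SciPy; the bimodal target $p = 0.5\,N(-2,1) + 0.5\,N(2,1)$ and the brute-force grid search are illustrative choices, not from the answer). For a Gaussian family the M-projection matches the mean and variance of $p$, so the search should return a single broad Gaussian that covers both modes:

```python
# Minimal numerical sketch of the M-projection (assumes NumPy and SciPy):
# project a bimodal mixture p onto the family of single Gaussians by a
# brute-force grid search over (mu, sigma).
import numpy as np
from scipy.stats import norm

xs = np.linspace(-10, 10, 4001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, -2, 1) + 0.5 * norm.pdf(xs, 2, 1)  # bimodal target

def kl_pq(q):
    # Riemann-sum approximation of D(p||q) on the grid
    mask = (p > 0) & (q > 0)
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

best = min(((mu, s) for mu in np.linspace(-3, 3, 61)
                    for s in np.linspace(0.5, 4, 71)),
           key=lambda ms: kl_pq(norm.pdf(xs, ms[0], ms[1])))
print(best)  # ~(0.0, 2.25): the mean and std of p (std = sqrt(5) ~ 2.24)
```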

On the other hand, the I-projection is the distribution $q$ that minimizes

$$D(q||p)=-H_q+E_q(-\log p)$$

where $H_q$ is the entropy of the distribution $q$ and $E_q(-\log p)$ is the cross entropy between $q$ and $p$. Although the first term penalizes a low entropy of $q$, the effect of the second term often predominates: any mass that $q$ places where $p$ is small makes $-\log p$, and hence the second term, very large. The minimizing distribution $q$ therefore tends to have very high density where $p$ is large and very low density where $p$ is small; in other words, the mass of $q$ tends to concentrate in a peak region of $p$. The net result is that the I-projection commonly has a relatively small variance; it is sometimes described as "mode-seeking".
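
For comparison, the same setup (same assumptions and illustrative choices as the sketch above) can be used to approximate the I-projection by minimizing the divergence in the other direction; the optimum now collapses onto one of the two peaks:

```python
# I-projection counterpart of the sketch above: same bimodal target,
# same grid, but now we minimize D(q||p) (assumes NumPy and SciPy).
import numpy as np
from scipy.stats import norm

xs = np.linspace(-10, 10, 4001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, -2, 1) + 0.5 * norm.pdf(xs, 2, 1)  # bimodal target

def kl_qp(q):
    # Riemann-sum approximation of D(q||p) on the grid
    mask = (q > 0) & (p > 0)
    return np.sum(q[mask] * np.log(q[mask] / p[mask])) * dx

best = min(((mu, s) for mu in np.linspace(-3, 3, 61)
                    for s in np.linspace(0.5, 4, 71)),
           key=lambda ms: kl_qp(norm.pdf(xs, ms[0], ms[1])))
print(best)  # roughly (-2.0, 1.0): q locks onto one peak of p,
             # with small spread; by symmetry (+2.0, 1.0) is equally good
```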

As regards the main applications, both the M-projection and the I-projection play important roles in graphical models. The M-projection is fundamental for learning problems, where we have to find a distribution that is closest to the empirical distribution of the data set from which we want to learn. In contrast, the I-projection, which is easier from a computational point of view, has important applications in information geometry (e.g., through the information-geometric version of the Pythagorean theorem, in which the relative entropy plays the role of a squared Euclidean distance) and in the analysis of error exponents in information-theoretic problems such as hypothesis testing, source coding, and channel coding. It can also be used for answering probability queries, particularly when a distribution $p$ is too complex to allow an efficient answering process. In this case, using an I-projection as an approximation of $p$ can make query answering considerably more efficient.
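
To make the link with learning concrete, note that if $\hat p$ is the empirical distribution of samples $x_1,\dots,x_N$ and $q_\theta$ ranges over a parametric family, then, by the same decomposition used above,

$$D(\hat p||q_\theta)=-H_{\hat p}+E_{\hat p}(-\log q_\theta)=-H_{\hat p}-\frac{1}{N}\sum_{i=1}^{N}\log q_\theta(x_i)$$

Since $H_{\hat p}$ does not depend on $\theta$, the M-projection $\text{arg min}_\theta D(\hat p||q_\theta)$ coincides with the maximum-likelihood estimate.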

Anatoly
  • As a further note, the M-projection is used to find the most likely underlying distribution of an observation with empirical distribution $p$, given that $q$ lies in the set being minimized over. The I-projection, on the other hand, gives the most likely outcome when some measurement of the outcome is already known, restricting the empirical distribution to some set, while the underlying distribution is $q$. For a finite alphabet and convex constraints, computing the I-projection and the M-projection are both convex problems, so they are rather easy. – Deniz Sargun Jun 10 '18 at 23:35
  • Do you know why there is no Pythagorean inequality for the M-projection? Is there any counter-example? – yprobnoob Jul 17 '22 at 13:32