
I'm currently trying to learn Markov chain Monte Carlo (MCMC), and I'm having trouble grasping what it is good for; I'm hoping someone can help me. I'm after a general understanding of MCMC, so the questions are a bit general.

My textbook explains it as follows:

Given a probability distribution $\pi$, the goal of MCMC is to simulate a random variable $X$ whose distribution is $\pi$. Often, one wants to estimate an expectation or some other functional of a joint distribution on a high-dimensional space. The MCMC algorithm constructs an ergodic Markov chain whose limiting distribution is the desired $\pi$. One then runs the chain long enough for it to converge, and outputs the final element or elements of the Markov sequence as a sample from $\pi$. MCMC relies on the fact that the limiting properties of ergodic Markov chains have some similarities to independent and identically distributed sequences.

The lecture summarizes it briefly as:

- Start with a particular probability distribution.
- Define a Markov chain that has this probability distribution as its limiting distribution.
- After many steps, a value in the simulation is distributed according to the limiting distribution.
- The idea is that it can be difficult to obtain samples otherwise.

My questions:
Why use a Markov chain to sample from a known distribution instead of sampling from it directly? Isn't that unnecessary? If direct sampling is difficult, wouldn't constructing a Markov chain with the desired limiting distribution be even harder? How is this done, especially for the high-dimensional distributions to which MCMC is applied?

I would really be grateful if someone could help me understand this; I've searched both Google and YouTube for "simple enough" answers before diving into the theory.

Best regards,

uoiu
  • Often you know that $\pi(x)\propto f(x)$ for some function $f$. You "know" $\pi$ in theory ($\pi(x)=\frac{f(x)}{\sum_x f(x)}$), but in some cases you can't easily sample from it directly, because that requires computing the normalizing constant $\sum_x f(x)$, which can be intractable. It's much easier to draw samples from the chain you've constructed, because that doesn't require computing the normalizing constant. – DDD Feb 10 '25 at 00:11

1 Answer


I will give you an example, and I hope it clears the matter up for you. Let $G=(V,E)$ be a graph, and let a configuration on this graph be an assignment $\sigma\colon V\to \{\pm 1\}$ of spin values to $G$'s vertices. Let $\Omega$ be the set of such configurations. Define the probability distribution $\pi$ on $\Omega$ by $$\pi(\sigma)=\frac{1}{Z(\beta)}\exp(-\beta H(\sigma))$$ where $H(\sigma)$ (called the Hamiltonian) is the number of edges whose endpoints are assigned dissimilar spins, $\beta$ is a positive real number, and $Z(\beta)$ is a normalizing constant. This is the ferromagnetic Ising model with zero external field. Suppose you want to compute $\pi(\sigma)$ for some configuration $\sigma$. While computing $\exp(-\beta H(\sigma))$ is easy, computing $Z(\beta)=\sum_\sigma\exp(-\beta H(\sigma))$ may require going over all $2^{|V|}$ possible configurations.
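To see the bottleneck concretely, here is a minimal Python sketch (my own illustration, not part of the model above; the 4-cycle graph and $\beta=0.5$ are arbitrary choices) that computes $Z(\beta)$ by exhaustive enumeration:

```python
import itertools
import math

def hamiltonian(spins, edges):
    """Number of edges whose endpoints carry dissimilar spins."""
    return sum(1 for u, v in edges if spins[u] != spins[v])

def partition_function(n, edges, beta):
    """Brute-force Z(beta): sums over all 2^n spin configurations."""
    return sum(
        math.exp(-beta * hamiltonian(spins, edges))
        for spins in itertools.product((+1, -1), repeat=n)
    )

# Toy example: a 4-cycle. Feasible for 4 vertices, hopeless for 4000.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(partition_function(4, edges, beta=0.5))
```

The loop over `itertools.product((+1, -1), repeat=n)` is exactly the $2^{|V|}$-term sum that makes $Z(\beta)$ intractable on large graphs.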

MCMC comes to the rescue here. Start with an arbitrary configuration. At each step, choose a vertex $v\in V$ uniformly at random and resample its spin according to the distribution $\pi$ conditioned on the spins at the neighbors of $v$. It is an easy exercise to verify that if $d^+$ and $d^-$ are the numbers of $v$'s neighbors with, respectively, a positive and a negative spin, then the conditional probability that $v$ has a positive spin is $$ \frac{\lambda^{d^-}}{\lambda^{d^+}+\lambda^{d^-}} $$ where $\lambda=\exp(-\beta)$ (resampling $v$ changes only the edges incident to $v$, so spin $+1$ carries weight $\lambda^{d^-}$ and spin $-1$ carries weight $\lambda^{d^+}$). So up to this point everything is computationally tractable. One can verify that this gives an ergodic Markov chain with stationary distribution $\pi$. The process I described is called Glauber dynamics, and it generalizes to many other spin systems.
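A short Python sketch of these Glauber updates (again my own illustration, reusing the toy 4-cycle and $\beta=0.5$ from above; the number of steps is an arbitrary placeholder):

```python
import math
import random

def glauber_step(spins, neighbors, beta):
    """One Glauber update: pick a uniformly random vertex and resample
    its spin from pi conditioned on its neighbors' current spins."""
    v = random.randrange(len(spins))
    d_plus = sum(1 for u in neighbors[v] if spins[u] == +1)
    d_minus = len(neighbors[v]) - d_plus
    lam = math.exp(-beta)
    # P(sigma(v) = +1 | neighbors) = lam^{d^-} / (lam^{d^+} + lam^{d^-})
    p_plus = lam**d_minus / (lam**d_plus + lam**d_minus)
    spins[v] = +1 if random.random() < p_plus else -1

# Toy example: the same 4-cycle as before.
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
spins = [random.choice((+1, -1)) for _ in range(4)]
for _ in range(10_000):  # "long enough" -- that is the mixing-time question
    glauber_step(spins, neighbors, beta=0.5)
print(spins)  # approximately a sample from pi
```

Note that each update only inspects the neighbors of a single vertex, so $Z(\beta)$ never has to be computed.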

The only caveat is how fast your Markov chain converges to its stationary distribution. This is another story and a very active area of research. But I hope the above example demonstrates how powerful the MCMC paradigm is.