
Suppose we are given a Markov chain $A_0 \in \mathbb{R}^{n \times n}$ and a desired stationary probability vector $\pi_0 \in \mathbb{R}^n$. I would like to find a Markov chain that is as close as possible to $A_0$ and whose stationary probability vector is as close as possible to $\pi_0$. For example, I would like to solve

$$ \min_{{\rm stochastic}~ A} ||A-A_0||_F + ||\pi(A)-\pi_0||_2$$

Is this doable in polynomial time? For example, for any fixed $\epsilon > 0$, can I find something $\epsilon$-close to the optimal solution in time polynomial in $n$? If not, what sort of approximation guarantees can we obtain for this?

I am pretty flexible on the problem statement; ultimately, the solution will be validated based on performance on a particular application. So if the problem is solvable by replacing the $2$- and Frobenius norms with something else, or via formalizing the problem in some other way, that would also be perfect.

1 Answer


You could use gradient descent. It is possible to compute the gradient of the objective function $\varphi(A)$, i.e., the derivative of the objective function with respect to each entry of the matrix $A$, and then apply gradient descent.

In particular, let $\varphi(A) = \|A-A_0\|_F + \|\pi(A) - \pi_0\|_2$ denote your objective function. Applying gradient descent to this objective function requires us to be able to compute $\nabla \varphi(A)$, i.e., the derivative ${\partial \over \partial A_{ij}} \varphi(A)$ of the objective function with respect to each individual entry $A_{ij}$ of the matrix $A$. Let me explain how to do that.

One approach is to approximate the gradient numerically: to estimate $\nabla \varphi(A)$, perturb each element of $A$ by a small amount and measure how much $\varphi(A)$ changes. This might be fast enough if the matrix dimensions are sufficiently small, and it has the benefit of being easy to code. However, it will scale poorly for large matrices: we need $n^2$ perturbations per gradient, and each perturbation requires computing the stationary distribution of a perturbed matrix, which naively takes $O(n^3)$ time, for a total of $O(n^5)$ time per gradient. Thus each iteration of gradient descent will take $O(n^5)$ time, and you need to do many iterations of gradient descent, so this might take a while. (You can speed this up by computing the stationary distribution of the perturbed matrix using a few iterations of the power method, initialized with the stationary distribution of $A$; each perturbation then takes $O(n^2)$ time, but that is still $O(n^4)$ time per iteration of gradient descent.)
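Here is a minimal sketch of this finite-difference approach (my own illustration, not part of the original question: names like `stationary` and `numerical_gradient` are made up, and the power iteration for the perturbed matrices is warm-started with $\pi(A)$ as suggested above):

```python
import numpy as np

def stationary(A, pi_init=None, iters=200):
    """Approximate the stationary distribution of a row-stochastic A by
    power iteration on A^T, optionally warm-started from a previous estimate."""
    n = A.shape[0]
    pi = np.full(n, 1.0 / n) if pi_init is None else pi_init.copy()
    for _ in range(iters):
        pi = A.T @ pi
        pi /= pi.sum()          # keep it a probability vector
    return pi

def phi(A, A0, pi0, pi_init=None, iters=200):
    """Objective phi(A) = ||A - A0||_F + ||pi(A) - pi0||_2."""
    return (np.linalg.norm(A - A0, 'fro')
            + np.linalg.norm(stationary(A, pi_init, iters) - pi0))

def numerical_gradient(A, A0, pi0, h=1e-6):
    """Finite-difference estimate of grad phi(A): n^2 perturbations, each
    warm-started from pi(A) so only a few power iterations are needed."""
    pi_A = stationary(A)
    base = np.linalg.norm(A - A0, 'fro') + np.linalg.norm(pi_A - pi0)
    grad = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            Ap = A.copy()
            Ap[i, j] += h
            grad[i, j] = (phi(Ap, A0, pi0, pi_init=pi_A, iters=10) - base) / h
    return grad
```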

A better way is to find an explicit, symbolic formula for the gradient $\nabla \varphi(A)$, and then use that with gradient descent. That is possible but it gets messy, so I'll explain the details below.

It remains to handle the constraint that $A$ must be stochastic. There are multiple ways to do that:

  • One way is to use projected gradient descent: in each iteration you update $A$ in the direction indicated by the gradient, then project the resulting matrix to the nearest stochastic matrix ($A' = A - \gamma \nabla \varphi(A)$, then let $A''$ be the stochastic matrix nearest to $A'$ and replace $A$ with $A''$); see the sketch after this list.

  • Another way is to adjust the objective function, letting $\varphi'(A) = \varphi(A) + c \cdot p(A)$ where $p(A)$ is a penalty term that measures how badly $A$ fails to be stochastic (a simple choice might be $p(A) = \sum_i |(\sum_j A_{ij}) - 1|^2 + \sum_{i,j} \min(A_{ij},0)^2$), then choose $c$ sufficiently large and apply gradient descent to minimize $\varphi'(A)$. For instance, you could perform several rounds of gradient descent, where the solution from the previous round is the initial value for the next round, and where you double $c$ after each round.

There may be other ways as well.
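To illustrate the projected-gradient variant, here is a hedged sketch (the function names are mine, and `gradient` stands for any routine returning $\nabla \varphi(A)$, e.g. the finite-difference one above or the symbolic formula below). Because the Frobenius norm decomposes row by row, projecting onto the set of row-stochastic matrices amounts to projecting each row onto the probability simplex, for which the standard sort-and-threshold algorithm works:

```python
import numpy as np

def project_row_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}
    (the usual sort-and-threshold algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def project_to_stochastic(A):
    """Frobenius-nearest row-stochastic matrix: project each row separately."""
    return np.vstack([project_row_to_simplex(row) for row in A])

def projected_gradient_descent(A0, pi0, gradient, step=1e-2, iters=500):
    """Repeat: A' = A - step * grad phi(A), then replace A' by the nearest stochastic matrix."""
    A = A0.copy()
    for _ in range(iters):
        A = project_to_stochastic(A - step * gradient(A))
    return A
```

For example, `projected_gradient_descent(A0, pi0, lambda A: numerical_gradient(A, A0, pi0))` starts from $A_0$ itself, which is a natural initialization for this problem.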

OK. That's an overview of the approach that I am proposing. On to explaining how to compute the gradient symbolically, as promised.

An explicit formula for the gradient

We can write the objective function in the form $\varphi(A) = \varphi_1(A) + \varphi_2(A)$ where $\varphi_1(A) = \|A-A_0\|_F$ and $\varphi_2(A) = \|\pi(A) - \pi_0\|_2$. It's easy to compute $\nabla \varphi_1(A)$; it is simply

$$\nabla \varphi_1(A) = {1 \over \|A-A_0\|_F} (A - A_0).$$

Computing $\nabla \varphi_2(A)$ is more difficult, as this requires us to compute the gradient of the stationary distribution $\pi(A)$ as a function of $A$. Since the stationary distribution $\pi(A)$ is the principal left eigenvector of $A$, i.e., the principal eigenvector of $A^\top$ (when it is unique), this reduces to computing the gradient of a principal eigenvector as a function of the matrix, for which there are techniques available. In particular, we have

$$\nabla \varphi_2(A) = {1 \over \|\pi(A) - \pi_0\|_2} \sum_k (\pi_k(A) - {\pi_0}_k) \; \nabla \pi_k(A),$$

and we need to find an expression for $\nabla \pi_k(A)$, where $\pi_k(A) = \pi(A)_k$ is the $k$th element of $\pi(A)$, and $\nabla \pi_k(A)$ is the gradient of that with respect to the entries of $A$.

Fortunately, there are nice formulas for how to compute the gradient of the eigenvector of a matrix. In particular,

$$\nabla \pi_k(A) = (\text{Id} - A^\top)^\dagger b_k \pi(A)^\top,$$

where $b_k$ is the $k$th basis vector (1 in the $k$th entry and 0 elsewhere), and $\dagger$ indicates the Moore-Penrose pseudoinverse. (Here I have used that the stationary distribution $\pi(A)$ is the eigenvector of $A^\top$ that corresponds to eigenvalue 1.) Plugging in, we find that
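To see where a formula of this shape comes from (this is only a sketch of mine, ignoring the normalization constraint $\sum_k \pi_k(A) = 1$, which contributes an additional rank-one correction), differentiate the stationarity condition $A^\top \pi(A) = \pi(A)$:

$$ (dA)^\top \pi(A) + A^\top \, d\pi = d\pi \quad\Longrightarrow\quad (\text{Id} - A^\top)\, d\pi = (dA)^\top \pi(A), $$

so $d\pi = (\text{Id} - A^\top)^\dagger (dA)^\top \pi(A)$, and reading off the coefficient of $dA_{ij}$ in the $k$th component of $d\pi$ yields the entries of $\nabla \pi_k(A)$ (up to the indexing convention one adopts for the gradient).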

$$\nabla \varphi_2(A) = {1 \over \|\pi(A) - \pi_0\|_2} (\text{Id} - A^\top)^\dagger \, (\pi(A) - \pi_0) \; \pi(A)^\top.$$

This then gives a symbolic expression for $\nabla \varphi(A) = \nabla \varphi_1(A) + \nabla \varphi_2(A)$.
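Putting the pieces together, here is a minimal numpy sketch of this symbolic gradient (the function names are mine; it computes $\pi(A)$ from the eigendecomposition of $A^\top$, uses `np.linalg.pinv` for the pseudoinverse, and follows the formulas above literally, so it assumes $A \ne A_0$ and $\pi(A) \ne \pi_0$ so that neither norm vanishes):

```python
import numpy as np

def stationary_from_eig(A):
    """pi(A): the eigenvector of A^T for eigenvalue 1, normalized to sum to 1."""
    w, V = np.linalg.eig(A.T)
    k = np.argmin(np.abs(w - 1.0))
    pi = np.real(V[:, k])
    return pi / pi.sum()

def symbolic_gradient(A, A0, pi0):
    """grad phi(A) for phi(A) = ||A - A0||_F + ||pi(A) - pi0||_2,
    following the symbolic formulas derived above."""
    pi = stationary_from_eig(A)
    diff = A - A0
    grad1 = diff / np.linalg.norm(diff, 'fro')          # gradient of phi_1
    P = np.linalg.pinv(np.eye(A.shape[0]) - A.T)        # (Id - A^T)^dagger
    r = pi - pi0
    grad2 = np.outer(P @ r, pi) / np.linalg.norm(r)     # gradient of phi_2
    return grad1 + grad2
```

In practice I would sanity-check this against the finite-difference gradient above on small random matrices before trusting it, since it is easy to drop a transpose in this kind of derivation.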

D.W.