Gradient through Cholesky decomposition

Question

For a positive definite matrix $\Sigma$, its Cholesky decomposition is defined as follows:

$$\Sigma = R^T R$$

where $R$ is an upper-triangular matrix with real entries. I want to compute the gradient of $\Sigma$ with respect to $R$.

Here's what I tried (I'm not very familiar with matrix calculus):

\begin{align} \frac{d}{d R}\{R^T R\} &= \frac{d}{d R}\{R^T\} R + R^T \frac{d}{d R}\{ R\} \\ &= \left(\frac{d}{d R}\{R\}\right)^T R + R^T I \\ &= I^T R + R^T I \\ &= R + R^T \end{align}

However, since $R$ is upper-triangular, does that mean I should simply remove the lower-triangular part of the gradient, which is $R^T$?

In other words, the correct gradient looks like below, right?

$$\frac{d}{d R}\{R^T R\} = R$$

Intuitively, this seems to say: "Yes, we do compute the gradient with respect to the entire $R$, since all entries participate in the computation, but we then realize that the entries below the diagonal are just constants that do not need to be changed by any optimization procedure."

The derivative of a function $f: \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ in general involves $n^4$ partial derivatives. If the input is upper-triangular, it has only $n(n+1)/2$ free entries, so you can cut this down to $n^3(n+1)/2$. — angryavian, Feb 11 '22 at 23:40
@angryavian Since my goal is to use that gradient to update $R$, I think that its shape must be $n \times n$. Maybe that involves summing up some stuff in the $n^4$ partial derivatives? — Adam Wilson, Feb 11 '22 at 23:46
@Wiza That would be the case for a function $f: \mathbb{R}^{n \times n} \to \mathbb{R}$ with a scalar output. — angryavian, Feb 12 '22 at 02:18

score 1 · Answer 1 · answered Feb 12 '22 at 10:15

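Let's use the very definition of the derivative. For upper triangular $R, H \in \mathrm{Mat}_n(\mathbf R)$ we have

\begin{align*} \Sigma(R+H) &= (R+H)^t(R+H)\\ &= R^tR + H^tR + R^t H + H^tH\\ &= \Sigma(R) + H^t R + R^t H + H^t H\\ &= \Sigma(R) + D\Sigma(R)H + o(\|H\|). \end{align*}

The last step holds because $\|H^tH\| \le \|H\|^2$ for any submultiplicative norm, so the quadratic term is $o(\|H\|)$. So the derivative at $R$ is the linear map $D\Sigma(R) \colon H \mapsto H^tR+ R^tH$, defined on upper triangular $H$. Note that it is a linear map between matrix spaces, not a single $n \times n$ matrix such as $R$ or $R + R^T$.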
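If you want to convince yourself numerically, here is a minimal sketch in plain Python (no libraries) that compares $H^tR + R^tH$ with a forward finite difference of $\Sigma$ along a random upper triangular direction $H$. The helper names (`sigma`, `dsigma`, `random_upper`, and so on) are made up for this illustration, and the step size `eps` is an arbitrary choice.

```python
import random

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)  # rows of Bt are the columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def add(A, B, scale=1.0):
    # Entrywise A + scale * B
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def random_upper(n, rng):
    # Random upper triangular matrix; entries below the diagonal are zero
    return [[rng.uniform(-1.0, 1.0) if j >= i else 0.0 for j in range(n)]
            for i in range(n)]

def sigma(R):
    # Sigma(R) = R^t R
    return matmul(transpose(R), R)

def dsigma(R, H):
    # Candidate derivative applied to H: H^t R + R^t H
    return add(matmul(transpose(H), R), matmul(transpose(R), H))

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

rng = random.Random(0)
n = 4
R = random_upper(n, rng)
H = random_upper(n, rng)

eps = 1e-6  # finite-difference step (assumed; small but not too small)
base = sigma(R)
shifted = sigma(add(R, H, eps))
finite_diff = [[(p - q) / eps for p, q in zip(rp, rq)]
               for rp, rq in zip(shifted, base)]

# The gap should be of order eps, coming from the dropped H^t H term
error = max_abs_diff(finite_diff, dsigma(R, H))
print("max abs difference:", error)
```

The printed difference should be on the order of `eps`, which matches the $H^tH$ term that the derivative drops.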
