The biggest challenge with this derivative is that matrices do not necessarily commute. This post should give you a taste of the difficulties one may encounter.
Nevertheless, let's give it a try. But first, let's generalize the problem a bit by considering the map
$$ f\colon\mathbb S^n_+\longrightarrow\mathbb S^n_+,\, X\longmapsto X^{\alpha} \overset{\text{def}}{=} \exp(\alpha\log(X))$$
which maps a symmetric, positive definite matrix to its matrix power, defined via matrix exponential and matrix logarithm. One can easily check that $\sqrt{X} = X^{1/2}$ by diagonalizing.
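Concretely, writing $X = U D U^\top$ with $U$ orthogonal and $D = \operatorname{diag}(\lambda_1,\dots,\lambda_n)$, $\lambda_i > 0$, we get
$$X^{\alpha} = \exp(\alpha\log(X)) = U \operatorname{diag}(\lambda_1^{\alpha},\dots,\lambda_n^{\alpha})\, U^\top,$$
and in particular $X^{1/2}\cdot X^{1/2} = U\operatorname{diag}(\lambda_1,\dots,\lambda_n)\,U^\top = X$.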
1. The commutative case
If $\Delta X$ commutes with $X$, then everything is easy, because for commuting matrices $A$ and $B$ we have $\log(A\cdot B)=\log(A) + \log(B)$ and $\exp(A+B) = \exp(A)\cdot\exp(B)$, identities which do not hold in the general case.
In the commuting case (note that then $X^{-1}$ and $X+\Delta X$ commute as well, which justifies combining the logarithms below) we have
$$\begin{aligned}
\log(X+\Delta X)-\log(X)
&= \log(X+\Delta X) + \log(X^{-1})
\\&= \log((X+\Delta X)X^{-1})
\\&= \log(I+X^{-1}\Delta X)
\\&= \sum_{k=1}^{\infty}(-1)^{k+1} \frac{(X^{-1}\Delta X)^{k}}{k}
\sim X^{-1}\Delta X \\
\exp(X+\Delta X) - \exp(X)
&= \exp(X)\exp(\Delta X) - \exp(X)
\\&= \exp(X)\big(\exp(\Delta X) -I\big)\\
&= \exp(X)\sum_{k=1}^{\infty} \frac{(\Delta X)^k}{k!} \sim \exp(X)\Delta X
\end{aligned}$$
Hence $\partial_{\Delta X}\log(X) = X^{-1}$ and $\partial_{\Delta X} \exp(X) = \exp(X)$ in the commuting case $[X, \Delta X]=0$. In particular, the chain rule gives $\partial_{\Delta X} X^\alpha = \alpha X^{\alpha-1}$, which matches our expectation.
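Indeed, writing $X^\alpha = \exp(\alpha\log(X))$ and composing the two derivatives above (everything commutes here, so the factors may be rearranged freely):
$$\partial_{\Delta X} X^{\alpha} = \exp(\alpha\log(X))\cdot \alpha X^{-1} = \alpha X^{\alpha} X^{-1} = \alpha X^{\alpha-1}.$$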
2. The non-commutative case
This one is much harder. One can try to use the Baker-Campbell-Hausdorff formula and/or the Zassenhaus formula, but the end result won't be pretty. In general, the nice formula from before does not hold anymore. For example, if $\alpha=2$ then one can easily check that
$$(X+\Delta X)^2-X^2 \sim X\cdot\Delta X + \Delta X \cdot X \neq 2X\cdot\Delta X$$
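This follows from expanding the square and dropping the second-order term:
$$(X+\Delta X)^2 - X^2 = X\cdot\Delta X + \Delta X\cdot X + (\Delta X)^2 \sim X\cdot\Delta X + \Delta X\cdot X,$$
and the symmetrized term equals $2X\cdot\Delta X$ only if $X$ and $\Delta X$ commute.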
We can verify this numerically as well, for example with the Python (3.7) program below. In the commutative case the maximal residuals stay small as you increase the sample size $N$, but in the general case they keep growing (randomly sampling matrices with an extremely large commutator is rather rare, so larger samples keep turning up worse cases).
import numpy as np
from scipy.linalg import norm, fractional_matrix_power as powm
from scipy.stats import ortho_group, wishart  # to sample orthogonal / SPD matrices

alphas = np.linspace(0.5, 5, 10)
N = 100       # sample size
eps = 10**-8  # perturbation size
n = 6         # matrix size

print("Commutative case")
# simultaneously diagonalizable => commuting
for a in alphas:
    r = 0
    for _ in range(N):
        D = np.diag(np.random.rand(n))
        S = np.diag(np.random.rand(n))
        U = ortho_group.rvs(n)
        X = U.T @ D @ U         # X and DX share the eigenbasis U,
        DX = eps * U.T @ S @ U  # hence [X, DX] = 0
        num = powm(X + DX, a) - powm(X, a)  # numerical derivative
        ana = a * powm(X, a - 1) @ DX       # formula
        r = max(r, norm(num - ana))
    print(f"alpha: {a:.2f}, max residual {r}")
# residuals should be numerically close to zero

print("General case")
for a in alphas:
    r = 0
    for _ in range(N):
        X = wishart(scale=np.eye(n)).rvs()         # random SPD matrix
        DX = eps * wishart(scale=np.eye(n)).rvs()  # perturbation, generally not commuting with X
        num = powm(X + DX, a) - powm(X, a)  # numerical derivative
        ana = a * powm(X, a - 1) @ DX       # formula
        r = max(r, norm(num - ana))
    print(f"alpha: {a:.2f}, max residual {r}")
# residuals should be much larger
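As a sanity check for the $\alpha=2$ example above, here is a minimal sketch (reusing the Wishart sampling from the program above) comparing the commutative formula $2X\cdot\Delta X$ with the symmetrized first-order term $X\cdot\Delta X + \Delta X\cdot X$:

import numpy as np
from scipy.linalg import norm
from scipy.stats import wishart

n, eps = 6, 10**-8
X = wishart(scale=np.eye(n)).rvs()         # random SPD matrix
DX = eps * wishart(scale=np.eye(n)).rvs()  # perturbation, generally not commuting with X

num = (X + DX) @ (X + DX) - X @ X  # exact difference for alpha = 2
naive = 2 * X @ DX                 # commutative formula
sym = X @ DX + DX @ X              # symmetrized first-order term

print(f"naive residual:       {norm(num - naive):.2e}")  # order eps (commutator term)
print(f"symmetrized residual: {norm(num - sym):.2e}")    # order eps**2

The symmetrized residual drops to the order of $\varepsilon^2$, while the commutative formula leaves a residual of order $\varepsilon$ coming from the commutator $[\Delta X, X]$.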