Matrix derivative formula using the matrix chain rule

Question

Let $X \in \mathbb{C}^{m \times n}$ be a matrix. Let $F(X) \in \mathbb{C}^{m \times m}$ be a matrix-valued function of $X$, e.g. $F(X) = I_m + X X^{\dagger}$, where $^\dagger$ denotes the conjugate transpose and $I_m$ is the identity matrix of dimension $m$. Finally, let $\mathbf{g}(X)$ be a (column-)vector-valued function of $X$, e.g. $\mathbf{g}(X) = u - Xv$, with $u,v$ column vectors of appropriate dimensions. Then $$ Q(X) = \mathbf{g}(X)^\dagger F(X) \mathbf{g}(X) $$ is clearly a scalar. What I want to find is a formula for $$ \frac{\partial \mathbf{g}(X)^\dagger F(X) \mathbf{g}(X)}{\partial X} = \ ? $$

Edit: By Leibniz's rule, $$ \frac{\partial Q(X)}{\partial X} = \frac{\partial \mathbf{g}^{\dagger}(X)}{\partial X} F(X) \mathbf{g}(X) + \mathbf{g}^{\dagger}(X) \frac{\partial F(X)}{\partial X} \mathbf{g}(X) + \mathbf{g}^{\dagger}(X) F(X) \frac{\partial \mathbf{g}(X)}{\partial X} $$

Anyway, the Leibniz rule can solve this problem (and many others). — Amitai Yuval, Jun 29 '17 at 15:38
Quadratic form as in: $x^T A x$. If by ''Leibniz rule'' you mean the product rule of differential calculus, or more generally the chain rule, then you have to differentiate a vector by a matrix, since you have to compute $d g(X)/dX$, which is exactly what stopped me. — PseudoRandom, Jun 29 '17 at 15:41
The term "quadratic" is not appropriate in this context, as $A$ is also a function of $X$, but never mind that. If you don't know what $/dX$ means, then what are you looking for anyway? I mean, you have $.../dX$ in your question. What kind of a formula are you looking for? — Amitai Yuval, Jun 29 '17 at 15:46
Yes, I know it is not appropriate, that's why I wrote ''essentially'', as in the limiting case ''A(X)=A''. I do know what $/dX$ means, I don't understand your point. The difficulty comes from making sense of this: $dF(X) g(X)/dX$. — PseudoRandom, Jun 29 '17 at 15:49
More specifically: $d(u^T v)/dx = \frac{du}{dx} v + \frac{dv}{dx} u$. By applying this to $Q(X)$, you get: $dQ/dX = \frac{dg(X)}{dX}F(X)g(X) + \frac{d(F(X) g(X))}{dX} g(X)$. And this is a ''vector-by-matrix'' derivative. — PseudoRandom, Jun 29 '17 at 15:52
I still don't understand your question. Why ask about the product of three different matrix-valued functions? I get the impression that what's stopping you is not the complexity of the expression in the numerator, but the $dX$ in the denominator. Wouldn't it make more sense to ask what $dF/dX$ means in this context? — Amitai Yuval, Jun 29 '17 at 15:59
It's both, actually; that's why I asked for clarification. I am also not 100% sure that I am not missing any $^T$ when applying the chain rule, or that the chain rule applies as I wrote it, etc. So I wanted to double-check it. Let me update the question with the (partial) ''answer''. — PseudoRandom, Jun 29 '17 at 16:11
Question updated and fixed (I will fix the title shortly to remove ''quadratic form''). — PseudoRandom, Jun 29 '17 at 16:27

score 1 · Accepted Answer · answered Jun 29 '17 at 17:28

To begin with, as discussed in the comments, one should understand what $dX$ in the denominator means. The space of matrices is a vector space, and so, all maps in question are multi-variable maps. Hence, every map from the space of matrices to another space has a differential which can be thought of as a bunch of partial derivatives. In other words, describing the differential of such a map is equivalent to specifying all the partial derivatives.

So, let $e_i$ be a basis of the space of matrices, and let $\frac{\partial}{\partial x^i}$ denote the directional derivative in the $e_i$ direction. By the Leibniz rule,$$\frac{\partial}{\partial x^i}(f_1\cdot\ldots\cdot f_k)=\frac{\partial f_1}{\partial x^i}f_2\ldots f_k+\ldots+f_1\ldots f_{k-1}\frac{\partial f_k}{\partial x^i}.$$ Note that if the $f$'s are matrix-valued (and they are in your example), then you can't change the order in the above equation, as $AB\neq BA$ for general matrices. Taking transpose and/or conjugation commutes with differentiating, and so, transpose and $\dagger$ simply carry through.
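As a minimal sketch of this recipe applied to the question's example (assuming $g(X) = u - Xv$ and $F(X) = I_m + XX^\dagger$, and reading the directional derivative as the derivative along a real parameter $t$ in $X + tE$), the three Leibniz terms can be compared with a central finite difference. The function names, the random test data, and the choice of direction `E` are illustrative assumptions, not part of the answer itself.

```python
import numpy as np

def g(X, u, v):
    return u - X @ v

def F(X):
    return np.eye(X.shape[0]) + X @ X.conj().T

def Q(X, u, v):
    gx = g(X, u, v)
    return (gx.conj().T @ F(X) @ gx).item()

def leibniz_directional(X, E, u, v):
    """Derivative of Q along direction E, built from the three product-rule terms."""
    gx, Fx = g(X, u, v), F(X)
    dg = -E @ v                            # derivative of g along E
    dF = E @ X.conj().T + X @ E.conj().T   # derivative of F along E
    term1 = dg.conj().T @ Fx @ gx          # (dg)^dagger F g
    term2 = gx.conj().T @ dF @ gx          # g^dagger (dF) g
    term3 = gx.conj().T @ Fx @ dg          # g^dagger F (dg)
    return (term1 + term2 + term3).item()

rng = np.random.default_rng(0)
m, n = 3, 2
X = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))
u = rng.normal(size=(m, 1)) + 1j * rng.normal(size=(m, 1))
v = rng.normal(size=(n, 1)) + 1j * rng.normal(size=(n, 1))

# One basis direction e_i of the matrix space (hypothetical choice of entry).
E = np.zeros((m, n), dtype=complex)
E[1, 0] = 1.0

t = 1e-6
errors = []
for direction in (E, 1j * E):   # real and imaginary perturbation of the same entry
    fd = (Q(X + t * direction, u, v) - Q(X - t * direction, u, v)) / (2 * t)
    lz = leibniz_directional(X, direction, u, v)
    errors.append(abs(fd - lz))
print(errors)
```

The order of the factors in each term is kept exactly as in the product, which is the point made above about non-commuting matrices.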

Amitai Yuval

Ok, now I get your point. Just a question: what do the central dots mean, like in: $f_1 \cdot f_2$ ? Do you mean scalar product? If you notice, this is exactly the rule I used in my first passage, since I said that: $d ( u \cdot v)/dx = du/dx v + dv/dx u$ ($u^Tv$ is the scalar product). And it didn't work. – PseudoRandom Jun 29 '17 at 17:35
My answer is general and can be applied for any matrix-or-vector-valued functions. So the dots can be either matrix multiplication or scalar product. I don't understand why you change the order of $u$ and $v$. – Amitai Yuval Jun 29 '17 at 17:44
I see, I swapped them. Weird, if you go to the Wikipedia page (https://en.wikipedia.org/wiki/Matrix_calculus#Scalar-by-vector_identities) you will notice they are swapped (look ''denominator layout''). – PseudoRandom Jun 29 '17 at 17:48
I didn't read the whole Wiki page, but maybe it concerns real vectors. Note that in the real context scalar product is commutative, whereas in the complex context it is not. – Amitai Yuval Jun 29 '17 at 17:51
However, the Leibniz rule also works for matrix-valued functions, and in general, you cannot change the order of matrices because then the product is not even defined. – Amitai Yuval Jun 29 '17 at 17:52
If you look at the last answer in this question (https://math.stackexchange.com/questions/1621948/derivative-of-a-vector-with-respect-to-a-matrix), it appears that deriving a vector with respect to a matrix will get you a tensor. This is why I asked the meaning of $d F(X) g(X)/dX$, not in the sense of differentials, as I very well know that, but in the sense of what kind of object the result is. Because I know the definition only of $df(X)/dX$, where $f$ is a scalar function, and Wiki claims extensions of such are not trivial. – PseudoRandom Jun 29 '17 at 18:04
In your definitions, what kind of object is $df/dX$, where $f$ is a scalar function? – Amitai Yuval Jun 29 '17 at 20:13
It is a matrix, where each $(i,j)$-entry is like this: $\partial f/ \partial x_{ij}$, where $x_{ij}$ is the $(i,j)$ element of $X$. – PseudoRandom Jun 29 '17 at 20:16
@PseudoRandom Great, so if $f$ is vector-valued, then $df/dX$ is a matrix whose entries are vectors. If you want to multiply two such matrices, it goes like any two matrices, where two entries can be multiplied by scalar product. – Amitai Yuval Jun 29 '17 at 20:26
Very informative and clear, thank you very much! – PseudoRandom Jun 29 '17 at 20:28
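To make the "matrix whose entries are vectors" picture from the comments concrete, here is a small sketch that stores $dg/dX$ for $g(X) = u - Xv$ as a three-index array. It assumes real data for simplicity, and the array layout `dg_dX[i, j, :]` is an illustrative convention, not one fixed by the discussion above.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 2
u = rng.normal(size=m)
v = rng.normal(size=n)
X = rng.normal(size=(m, n))

def g(X):
    return u - X @ v

# dg_dX[i, j, :] holds the vector d g / d x_ij, so the array has shape (m, n, m).
dg_dX = np.zeros((m, n, m))
t = 1e-6
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n))
        E[i, j] = 1.0
        dg_dX[i, j] = (g(X + t * E) - g(X - t * E)) / (2 * t)

# Closed form for this particular g: d g_k / d x_ij = -delta_{ki} * v_j.
closed = -np.einsum("ki,j->ijk", np.eye(m), v)

print(dg_dX.shape)
print(np.allclose(dg_dX, closed, atol=1e-8))
```

Each `(i, j)` slot of the array is a length-$m$ vector, which is the kind of object the linked question calls a tensor.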

score 1 · Answer 2 · answered Jun 30 '17 at 19:32

First, let's find the differentials of the intermediate variables $$\eqalign{ g &= (u-Xv) &\implies dg = -dX\,v\cr F &= I+XX^\dagger &\implies dF = dX\,X^\dagger+X\,dX^\dagger \cr }$$ Then write the function in terms of the double-contraction product, i.e. $$A:B={\rm tr}(A^TB)$$ and find its differential $$\eqalign{ Q &= F:g^*g^T \cr dQ &= (g^*g^T):dF + F:d(g^*g^T) \cr &= (g^*g^T):(dX\,X^\dagger+X\,dX^\dagger) + F:(dg^*\,g^T+g^*\,dg^T) \cr &= (g^*g^T):(dX\,X^\dagger+X\,dX^\dagger) - F:(dX^*\,v^*g^T + g^*v^T\,dX^T) \cr &= g^*g^TX^*:dX + X^Tg^*g^T:dX^\dagger - Fgv^\dagger:dX^* - vg^\dagger F:dX^T \cr &= g^*g^TX^*:dX + X^Tg^*g^T:dX^\dagger - v^*g^TF^T:dX^\dagger - F^Tg^*v^T:dX \cr &= (g^*g^TX^* - F^Tg^*v^T):dX + (X^Tg^*g^T - v^*g^TF^T):dX^\dagger \cr }$$ Treating $X$ and $X^\dagger$ as independent variables, we obtain the gradient with respect to each $$\eqalign{ \frac{\partial Q}{\partial X} &= g^*g^TX^* - F^Tg^*v^T \cr\cr \frac{\partial Q}{\partial X^\dagger} &= X^Tg^*g^T - v^*g^TF^T \cr\cr }$$
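The two gradients can be sanity-checked numerically. The sketch below assumes the convention used in the derivation, namely $dQ = \frac{\partial Q}{\partial X}:dX + \frac{\partial Q}{\partial X^\dagger}:dX^\dagger$ with $A:B = {\rm tr}(A^TB) = \sum_{ij} A_{ij}B_{ij}$ (no conjugation), and compares that prediction with a central finite difference along a random direction. The helper names and the test data are assumptions made for illustration.

```python
import numpy as np

def Q(X, u, v):
    g = u - X @ v
    F = np.eye(X.shape[0]) + X @ X.conj().T
    return (g.conj().T @ F @ g).item()

def gradients(X, u, v):
    """Closed-form dQ/dX (m x n) and dQ/dX^dagger (n x m) from the derivation."""
    g = u - X @ v
    F = np.eye(X.shape[0]) + X @ X.conj().T
    gg = g.conj() @ g.T                           # g^* g^T, shape (m, m)
    G_X = gg @ X.conj() - F.T @ g.conj() @ v.T    # g^* g^T X^* - F^T g^* v^T
    G_Xd = X.T @ gg - v.conj() @ g.T @ F.T        # X^T g^* g^T - v^* g^T F^T
    return G_X, G_Xd

rng = np.random.default_rng(2)
m, n = 3, 2
X = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))
u = rng.normal(size=(m, 1)) + 1j * rng.normal(size=(m, 1))
v = rng.normal(size=(n, 1)) + 1j * rng.normal(size=(n, 1))
E = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))   # perturbation direction

G_X, G_Xd = gradients(X, u, v)

# Predicted first-order change: G_X : E + G_Xd : E^dagger.
predicted = np.sum(G_X * E) + np.sum(G_Xd * E.conj().T)

t = 1e-6
finite_diff = (Q(X + t * E, u, v) - Q(X - t * E, u, v)) / (2 * t)

print(abs(predicted - finite_diff))
```

Because $F$ is Hermitian, the second gradient is the conjugate transpose of the first, so the predicted change is real up to rounding.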

score -1 · Answer 3 · edited Dec 07 '17 at 18:15

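There is a nice answer in Seber (2008), page 360, result 17.25, for differentiation with respect to a vector.

Suppose $y = w^T A z$ is a scalar, where $w$ is $m \times 1$, $A$ is $m \times n$, $z$ is $n \times 1$, and all three are functions of a vector $x$. Let $\otimes$ denote the Kronecker product and $^T$ the transpose. We want $\partial y/\partial x^T$. Note that $y = \operatorname{vec}(y) = (z^T \otimes w^T)\operatorname{vec}(A) = (z \otimes w)^T\operatorname{vec}(A) = [\operatorname{vec}(wz^T)]^T\operatorname{vec}(A)$. Since $y = y^T$, we have $w^T A z = z^T A^T w$. Hence $$\frac{\partial y}{\partial x^T} = z^T A^T \frac{\partial w}{\partial x^T} + [\operatorname{vec}(wz^T)]^T \frac{\partial \operatorname{vec}(A)}{\partial x^T} + w^T A \frac{\partial z}{\partial x^T}.$$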
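A quick numerical check of this identity, as a sketch under stated assumptions: real data, column-major $\operatorname{vec}$, and hypothetical test functions $w(x) = Wx$, $z(x) = Zx$, $A(x) = A_0 + x_1 A_1$ chosen only because their Jacobians are easy to write down.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p = 3, 2, 4
W = rng.normal(size=(m, p))
Z = rng.normal(size=(n, p))
A0 = rng.normal(size=(m, n))
A1 = rng.normal(size=(m, n))

def w(x):
    return W @ x

def z(x):
    return Z @ x

def A(x):
    return A0 + x[0] * A1

def y(x):
    return w(x) @ A(x) @ z(x)

def vec(M):
    # Column-major stacking, matching the usual definition of vec.
    return M.reshape(-1, order="F")

def grad_formula(x):
    """Row vector dy/dx^T from the three-term identity."""
    dw = W                           # d w / d x^T, shape (m, p)
    dz = Z                           # d z / d x^T, shape (n, p)
    dvecA = np.zeros((m * n, p))     # d vec(A) / d x^T, shape (mn, p)
    dvecA[:, 0] = vec(A1)            # A depends on x only through x[0]
    wx, zx, Ax = w(x), z(x), A(x)
    return zx @ Ax.T @ dw + vec(np.outer(wx, zx)) @ dvecA + wx @ Ax @ dz

x = rng.normal(size=p)
t = 1e-6
finite_diff = np.array([(y(x + t * e) - y(x - t * e)) / (2 * t) for e in np.eye(p)])

print(np.allclose(finite_diff, grad_formula(x), rtol=1e-5, atol=1e-6))
```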

Seber, G. A. F. A Matrix Handbook for Statisticians. Wiley, 2008.
