
The definition of the matrix-by-matrix derivative is:

$$ \frac{\partial X_{kl}}{\partial X_{ij}}=\delta_{ik}\delta_{lj} $$

If the matrices are $n\times n$, then the resulting matrix will be $n^2 \times n^2$.
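For concreteness, here is a small NumPy sketch (illustrative, not part of the original post) that builds the fourth-order array of partials $\delta_{ik}\delta_{lj}$ and flattens the index pairs $(k,l)$ and $(i,j)$ into an $n^2 \times n^2$ matrix; the row-major flattening convention here is an assumption:

```python
import numpy as np

n = 3
I = np.eye(n)

# Fourth-order array T[k, l, i, j] = dX_{kl} / dX_{ij} = delta_{ik} * delta_{lj}
T = np.einsum('ik,lj->klij', I, I)

# Flatten the index pairs (k, l) and (i, j) row-major into an n^2 x n^2 matrix.
M = T.reshape(n**2, n**2)
print(M.shape)                        # (9, 9)
print(np.allclose(M, np.eye(n**2)))   # True: dX/dX flattens to the identity
```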


Is the following identity valid for the matrix-by-matrix derivative?

$$ \frac{\partial}{\partial A} AB = \frac{\partial A}{\partial A} B + A\frac{\partial B}{\partial A} $$


If so, I do not understand how we can multiply an $n^2 \times n^2$ matrix by an $n \times n$ matrix:

$$ \frac{\partial}{\partial A} AB = \underbrace{\frac{\partial A}{\partial A}}_{n^2\times n^2} \overbrace{B}^{n\times n} + \overbrace{A}^{n\times n} \underbrace{\frac{\partial B}{\partial A}}_{n^2 \times n^2} $$

Anon21
    Except the derivative of that matrix with respect to itself is not an $n^2 \times n^2$ matrix, it's an $n \times n \times n \times n$ 4-dimensional array. – Ninad Munshi Oct 19 '19 at 12:58
  • @Ninad so it's even worse than I thought. Can you multiply an $n \times n \times n \times n$ array with an $n \times n$ matrix? Is the $n \times n$ the equivalent of a 'scalar' to the $n \times n \times n \times n$ array? – Anon21 Oct 19 '19 at 13:07
  • Kind of but not quite. There is no matrix operation equivalent description I can give you because matrix multiplication operations are only defined for 2d arrays. You'll have to go to tensor calculus and use indices to get an exact answer. In this case, $$\frac{\partial A_m^jB_j^n}{\partial A_k^l} = \delta_m^k\delta_l^j B_j^n + A_m^j \frac{\partial B_j^n}{\partial A_k^l} = \delta_m^k B_l^n + A_m^j \frac{\partial B_j^n}{\partial A_k^l}$$ – Ninad Munshi Oct 19 '19 at 13:10
  • @Ninad Would you happen to know the indices/tensor notation for $\partial X/\partial Y$ matrix-by-matrix derivative? I gather from your answer that in the case $X=Y$, then $\partial X^j_m/\partial X^l_k=\delta^k_m \delta^j_l$. But what about $\partial X^j_m/\partial Y_k^l$? – Anon21 Oct 19 '19 at 13:33
  • That's like asking what $\frac{dy}{dx}$ is when you don't know what $x$ or $y$ are or their relationship to each other. – Ninad Munshi Oct 19 '19 at 13:35
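A quick numerical check of the index formula from the comment above, treating $B$ as constant so the second term vanishes (this sketch and its variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))   # held constant, so dB/dA = 0

# Finite-difference d(AB)[m, p] / dA[k, l], stored as D[m, p, k, l]
eps = 1e-6
D = np.zeros((n, n, n, n))
for k in range(n):
    for l in range(n):
        Ap = A.copy()
        Ap[k, l] += eps
        D[:, :, k, l] = (Ap @ B - A @ B) / eps

# The comment's formula with B constant: delta_{mk} * B_{lp}
expected = np.einsum('mk,lp->mpkl', np.eye(n), B)
print(np.allclose(D, expected, atol=1e-4))   # True
```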

1 Answer


As suggested in the comments, computing the gradient of a matrix with respect to a matrix will result in a fourth-order tensor.

The product rule holds if you consider differentials. For example:

$$ \begin{align} F &= AB \\ dF &= dA\,B + A\,dB \end{align} $$
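As a sanity check, here is a small NumPy sketch (illustrative, not from the original answer) confirming that $dA\,B + A\,dB$ captures the first-order change in $F$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
dA = 1e-6 * rng.standard_normal((n, n))
dB = 1e-6 * rng.standard_normal((n, n))

dF_actual = (A + dA) @ (B + dB) - A @ B   # exact change in F = AB
dF_linear = dA @ B + A @ dB               # first-order differential
print(np.allclose(dF_actual, dF_linear, atol=1e-10))   # agree up to O(dA dB)
```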

Now, you may not want to work with fourth-order tensors, so you can instead look for a "flattened" matrix representation of the tensor. Suppose you want to compute this representation for $\frac{\partial F}{\partial A}$.

You can proceed by vectorizing both sides:

$$ \begin{align} \rm{vec}(dF) &= \rm{vec}(dA\,B)\\ &= \rm{vec}(I\,dA\,B) \end{align} $$ where $I$ is the identity matrix.

And using the Kronecker-vec relation $\rm{vec}(AXB) = (B^T \otimes A)\,\rm{vec}(X)$:

$$ \begin{align} \rm{vec}(dF) &= (B^T \otimes I) \rm{vec}(dA) \\ df &= (B^T \otimes I) ~da \end{align} $$

Thus:

$$ \begin{align} \frac{\partial f}{\partial a} = B^T \otimes I \end{align} $$

This matrix is $n^2 \times n^2$ instead of $n \times n \times n \times n$.
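Here is a short NumPy sketch (illustrative only) verifying the flattened result, assuming the standard column-stacking $\rm{vec}$ convention:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
B = rng.standard_normal((n, n))
dA = rng.standard_normal((n, n))

# Column-stacking vec, as used in the Kronecker-vec identity
vec = lambda X: X.reshape(-1, order='F')

lhs = vec(dA @ B)                          # vec(dF) with dB = 0
rhs = np.kron(B.T, np.eye(n)) @ vec(dA)    # (B^T kron I) vec(dA)
print(np.allclose(lhs, rhs))               # True
print(np.kron(B.T, np.eye(n)).shape)       # (9, 9), i.e. n^2 x n^2
```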

If you do want to work with fourth-order tensors, without the vectorization trick, then you can proceed as in this answer.

Traws