I have been trying for the past few days to find a solution to this very simple problem, yet every time I seem to hit a wall. (PS: please excuse the poor notation; it's my first time using this website.)
Let's say we have $C = AB$, where $A$ and $B$ are matrices such that the product is defined.
What is $\frac{\partial C}{\partial A}$ and $\frac{\partial C}{\partial B}$, i.e., the derivatives of $C$ w.r.t $A$ and $B$ respectively?
Let us assume $A$ is $m \times n$, $B$ is $n \times p$, and $C$ is $m \times p$.
By definition, $C_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$ for $i = 1, \dots, m$ and $j = 1, \dots, p$.
Then $\frac{\partial C_{ij}}{\partial a_{ls}} = \frac{\partial (\sum_{k=1}^{n} a_{ik} b_{kj})}{\partial a_{ls}} = b_{sj}$ when $i = l$, and $0$ otherwise.
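To convince myself of that component formula, here is the small finite-difference check I use (a quick NumPy sketch; the sizes and the helper name `numerical_partial` are just my own, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 4, 3, 5
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

def numerical_partial(i, j, l, s, eps=1e-6):
    """Finite-difference estimate of dC_ij / da_ls for C = A B."""
    A_pert = A.copy()
    A_pert[l, s] += eps
    return ((A_pert @ B)[i, j] - (A @ B)[i, j]) / eps

# Formula to check: dC_ij / da_ls = b_sj when i == l, and 0 otherwise.
print(numerical_partial(2, 4, 2, 1), B[1, 4])   # i == l: both equal b_sj
print(numerical_partial(0, 4, 2, 1))            # i != l: approximately 0
```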
Then shouldn't $\frac{\partial C}{\partial A}$ be the rate of change for each $C_{ij}$ w.r.t each element of $A$?
So that $\frac{\partial C}{\partial A}$ is a matrix $D$ of size $C$ where each $D_{ij}$ is a matrix of size $A$.
For instance $C = AB$, given $A$ of size $1 \times 3$ and $B$ of size $3 \times 2$.
$A = $$\begin{pmatrix}a_1 & a_2 & a_3 \end{pmatrix}$
$B = $$\begin{pmatrix}b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32}\end{pmatrix}$
$C = \begin{pmatrix} a_1 b_{11} + a_2 b_{21} + a_3 b_{31} & a_1 b_{12} + a_2 b_{22} + a_3 b_{32} \end{pmatrix}$
Then $\frac{\partial C_1}{\partial B} = \frac{\partial (a_1 b_{11} + a_2 b_{21} + a_3 b_{31})}{\partial B} = \begin{pmatrix} a_1 & 0 & a_2 & 0 & a_3 & 0 \end{pmatrix}$ and $\frac{\partial C_2}{\partial B} = \frac{\partial (a_1 b_{12} + a_2 b_{22} + a_3 b_{32})}{\partial B} = \begin{pmatrix} 0 & a_1 & 0 & a_2 & 0 & a_3 \end{pmatrix}$, where I have listed the entries of $B$ in the order $b_{11}, b_{12}, b_{21}, b_{22}, b_{31}, b_{32}$.
So that $\frac{\partial C}{\partial B} = \begin{pmatrix} a_1 & 0 & a_2 & 0 & a_3 & 0 \\ 0 & a_1 & 0 & a_2 & 0 & a_3 \end{pmatrix}$, which is of size $2 \times 6$.
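Here is the sanity check I ran on that $2 \times 6$ matrix (a NumPy sketch; the numbers in $B$ are placeholders, and I compare against a finite-difference Jacobian of the entries of $C$ with respect to the entries of $B$, taken row by row):

```python
import numpy as np

a1, a2, a3 = 1.0, 2.0, 3.0
A = np.array([[a1, a2, a3]])                   # 1 x 3
B = np.array([[1., 4.], [2., 5.], [3., 6.]])   # 3 x 2, placeholder values

# Hand-computed 2 x 6 Jacobian from above (entries of B taken row by row).
D_hand = np.array([[a1, 0., a2, 0., a3, 0.],
                   [0., a1, 0., a2, 0., a3]])

# Finite-difference Jacobian of the entries of C w.r.t. the entries of B.
eps = 1e-6
D_num = np.zeros((2, 6))
for col in range(6):
    B_pert = B.flatten()           # row-major copy of B
    B_pert[col] += eps
    D_num[:, col] = ((A @ B_pert.reshape(3, 2) - A @ B) / eps).flatten()

print(np.allclose(D_num, D_hand, atol=1e-4))        # True
print(np.allclose(D_hand, np.kron(A, np.eye(2))))   # True
```

The second `print` is just an observation: my hand-computed matrix happens to equal the Kronecker product $A \otimes I_2$.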
Doing the same computation for $A$, I get $\frac{\partial C}{\partial A} = B^T$.
However, in the machine learning equations, it is given that $\frac{\partial C}{\partial B} = A^T$, which does not even have the same size as the $2 \times 6$ matrix I computed above.
To be more precise, the context of this question is machine learning and backpropagation.
We have $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial \text{out}} \frac{\partial \text{out}}{\partial W}$, where $L$ is a scalar function (the loss), $\hat{Y}$ is a $1 \times 2$ matrix (the activated output), $\text{out}$ is a $1 \times 2$ vector (pre-activation), and $W$ is a $3 \times 2$ matrix (the weights).
If you could also explain why we actually write
$\frac{\partial L}{\partial W} = \frac{\partial \text{out}}{\partial W} \frac{\partial \hat{Y}}{\partial \text{out}} \frac{\partial L}{\partial \hat{Y}}$
instead of the forward order used earlier, $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial \text{out}} \frac{\partial \text{out}}{\partial W}$, I would appreciate it.
Is there an explanation beyond "because the matrix dimensions fit if we rearrange the terms in the chain rule"?
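For reference, my own shape bookkeeping for the three factors (using the sizes above) is
$$
\underbrace{\frac{\partial L}{\partial \hat{Y}}}_{1 \times 2}, \qquad
\underbrace{\frac{\partial \hat{Y}}{\partial \text{out}}}_{2 \times 2}, \qquad
\underbrace{\frac{\partial \text{out}}{\partial W}}_{\text{sends a } 3 \times 2 \text{ change in } W \text{ to a } 1 \times 2 \text{ change in } \text{out}}
$$
so the last factor is not an ordinary matrix that can simply be multiplied on the right, which seems to be exactly where I get confused.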
Asking ChatGPT these questions yields a different answer every time.
Thank you for your time.
Edit: the formatting was done using ChatGPT. I also checked the linked question Not understanding derivative of a matrix-matrix product, and it seems that the result there for the derivative of $AX$ is a Kronecker product, which also does not seem to have the right size.
The part where I usually get stuck is computing every $D_{ij}$ for $D = \frac{\partial C}{\partial B}$.
Second EDIT: here is more context for the chain rule derivation above.
$L(Y, \hat{Y}) = \frac{1}{2m} \sum_{i=1}^{m}(Y^{(i)} - \hat{Y}^{(i)})^2$, where $\hat{Y}$ is the predicted output.
$\hat{Y} = \sigma_{o}(\text{out})$

$\text{out} = h_{2} W$
where $\sigma_{o}$ is the activation function for the out layer,
$h_{2}$ is of size $1 \times 3$, $W$ of size $3 \times 2$, $\text{out}$ of size $1 \times 2$, and $\hat{Y}$ of size $1 \times 2$.
Here the activation function $\sigma_{o}(X)$ is applied element-wise to $X$, a matrix.
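To make the question concrete, here is the numerical check I use for the reversed-order formula (a sketch only: I am assuming a sigmoid for $\sigma_{o}$, treating the sum in $L$ as running over the entries of $\hat{Y}$ so that $m = 2$ here, and all numbers are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
h2 = rng.standard_normal((1, 3))   # 1 x 3 hidden activation
W  = rng.standard_normal((3, 2))   # 3 x 2 weights
Y  = rng.standard_normal((1, 2))   # 1 x 2 target

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))   # assumed choice of sigma_o

def loss(W):
    out  = h2 @ W                  # 1 x 2 pre-activation
    Yhat = sigmoid(out)            # 1 x 2 activated output
    m    = Yhat.size               # assumption: sum runs over the entries of Yhat
    return np.sum((Y - Yhat) ** 2) / (2 * m)

# Backpropagation written in the "reversed" order.
out   = h2 @ W
Yhat  = sigmoid(out)
m     = Yhat.size
dL_dYhat   = (Yhat - Y) / m            # 1 x 2
dYhat_dout = Yhat * (1 - Yhat)         # element-wise sigmoid derivative
delta      = dL_dYhat * dYhat_dout     # 1 x 2
dL_dW      = h2.T @ delta              # 3 x 2: the "A^T on the left" pattern

# Finite-difference check of dL/dW, entry by entry.
eps = 1e-6
dL_dW_num = np.zeros_like(W)
for k in range(3):
    for l in range(2):
        W_pert = W.copy()
        W_pert[k, l] += eps
        dL_dW_num[k, l] = (loss(W_pert) - loss(W)) / eps

print(np.allclose(dL_dW, dL_dW_num, atol=1e-4))   # True
```

The `h2.T @ delta` line is the reversed order I am asking about: it is the arrangement whose shape comes out as $3 \times 2$, the size of $W$.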
In reply to the comment "I suggest that you start by computing the gradient of the loss function w.r.t. $\hat{Y}$ and pursue until $W$": isn't that exactly what we are doing with the chain rule $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial \text{out}} \frac{\partial \text{out}}{\partial W}$? We first compute $\frac{\partial L}{\partial \hat{Y}}$ and continue until we reach $\frac{\partial \text{out}}{\partial W}$. Then why is the order reversed in the actual equations?