Let $W$ be an $n\times m$ matrix and $\textbf{x}$ an $m\times1$ vector. How do we then calculate the following?
$$\frac{dW\textbf{x}}{dW}$$
Thanks in advance.
The quantity in question is a third-order tensor.
One approach is to use index notation $$\eqalign{ f_i &= W_{ij} x_j \cr\cr \frac{\partial f_i}{\partial W_{mn}} &= \frac{\partial W_{ij}}{\partial W_{mn}} \,x_j \cr &= \delta_{im}\delta_{jn} \,x_j \cr &= \delta_{im}\,x_n \cr }$$ Another approach is vectorization $$\eqalign{ f &= W\,x \cr &= I\,W\,x \cr &= (x^T\otimes I)\,{\rm vec}(W) \cr &= (x^T\otimes I)\,w \cr\cr \frac{\partial f}{\partial w} &= (x^T\otimes I) \cr }$$
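As a quick sanity check, both results can be verified numerically with automatic differentiation. A minimal sketch in JAX (the shapes $n=3$, $m=4$ and the random test values are arbitrary choices, not part of the answer):

```python
import jax
import jax.numpy as jnp

n, m = 3, 4
W = jax.random.normal(jax.random.PRNGKey(0), (n, m))
x = jax.random.normal(jax.random.PRNGKey(1), (m,))

# Full third-order derivative df_i/dW_rj, shape (n, n, m).
J = jax.jacobian(lambda W: W @ x)(W)

# Index-notation result: delta_{ir} * x_j.
J_index = jnp.einsum('ir,j->irj', jnp.eye(n), x)
print(jnp.allclose(J, J_index))  # True

# Vectorization result: df/dvec(W) = x^T kron I.
# vec() stacks columns, so reorder the W-axes to column-major before flattening.
J_vec = J.transpose(0, 2, 1).reshape(n, n * m)
print(jnp.allclose(J_vec, jnp.kron(x[None, :], jnp.eye(n))))  # True
```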
While it's true that the full answer is a third-order tensor, in the context of a (feed-forward) neural network, where this gradient is taken as part of a chain rule ending in a scalar loss, the calculations simplify enormously and can be represented as an outer product: $\frac{\partial\,\textbf{x}W}{\partial W} = \textbf{x}^T \cdot \_\_$, where the blank is filled by the gradient flowing in from downstream.
Specifically, if $L$ is the loss and $z=\textbf{x}W+b$ (or $\textbf{a}W+b$ for any downstream "inputs"/activations), where $\textbf{x}$ (or $\textbf{a}$) is a row vector, then:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial W} =\frac{\partial L}{\partial z} \frac{\partial (\textbf{x}W+b)}{\partial W} = \textbf{x}^T\cdot\frac{\partial L}{\partial z} $$
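A small numerical check of this identity (a minimal sketch; the toy loss $L=\sum_k z_k^2$ and the shapes are arbitrary stand-ins for a real network):

```python
import jax
import jax.numpy as jnp

m, n = 4, 3
x = jax.random.normal(jax.random.PRNGKey(0), (1, m))  # row vector
W = jax.random.normal(jax.random.PRNGKey(1), (m, n))
b = jnp.zeros((1, n))

def loss(W):
    z = x @ W + b
    return jnp.sum(z ** 2)

dL_dW = jax.grad(loss)(W)       # what reverse-mode autodiff computes
dL_dz = 2 * (x @ W + b)         # dL/dz for this toy loss
print(jnp.allclose(dL_dW, x.T @ dL_dz))  # True: the outer-product rule
```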
I'm in the process of making a YouTube video with more explanation. Will update this as soon as it gets published.
EDIT: here is the video; the relevant part starts at 09:00.
For the independent case:
If $\mathbf{x}$ is independent of $W$, the problem can be calculated as follows.
$$\cfrac{\partial W\mathbf{x}}{\partial W}= \cfrac{\partial}{\partial W} \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1m} \\ w_{21} & w_{22} & \cdots & w_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nm} \end{bmatrix} \mathbf{x} $$
$$ = \begin{bmatrix} \cfrac{\partial w_{11}}{\partial w_{11}} & \cfrac{\partial w_{12}}{\partial w_{12}} & \cdots & \cfrac{\partial w_{1m}}{\partial w_{1m}} \\ \cfrac{\partial w_{21}}{\partial w_{21}} & \cfrac{\partial w_{22}}{\partial w_{22}} & \cdots & \cfrac{\partial w_{2m}}{\partial w_{2m}} \\ \vdots & \vdots & \ddots & \vdots \\ \cfrac{\partial w_{n1}}{\partial w_{n1}} & \cfrac{\partial w_{n2}}{\partial w_{n2}} & \cdots & \cfrac{\partial w_{nm}}{\partial w_{nm}} \end{bmatrix} \mathbf{x} $$
Every element of this matrix is $1$, i.e. it is the $n\times m$ all-ones matrix, and multiplying it by $\mathbf{x}$ sums the entries of $\mathbf{x}$. The result is therefore
$$ \cfrac{\partial W\mathbf{x}}{\partial W}= (\mathbf{x}^{\text{T}}\mathbf{1_{m}}) \mathbf{1_{n}} $$
where $\mathbf{1_{k}} \in \mathbf{R}^{k}$ denotes the all-ones vector
$$\mathbf{1_{k}}=[1 \ 1 \ \cdots \ 1]^{\text{T}}$$
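This convention collapses the third-order tensor from the tensor answer above by summing over both $W$-indices, which is easy to confirm numerically (a minimal sketch; the shapes and test values are arbitrary):

```python
import jax
import jax.numpy as jnp

n, m = 3, 4
W = jax.random.normal(jax.random.PRNGKey(0), (n, m))
x = jax.random.normal(jax.random.PRNGKey(1), (m,))

J = jax.jacobian(lambda W: W @ x)(W)        # third-order tensor, shape (n, n, m)
collapsed = J.sum(axis=(1, 2))              # sum over both W-indices
claimed = (x @ jnp.ones(m)) * jnp.ones(n)   # (x^T 1_m) 1_n
print(jnp.allclose(collapsed, claimed))     # True
```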
For the dependent case:
If $\mathbf{x}$ depends on $W$, the problem is harder than the independent case. Suppose $\mathbf{x}$ can be written as
$$\mathbf{x}=F(W)\mathbf{x}_{0}$$
where $F(W) \in \mathbf{R}^{m \times m}$ is a matrix function whose parameter is $W$, and $\mathbf{x}_{0} \in \mathbf{R}^{m}$ is independent of $W$. Then, by the product rule,
$$ \cfrac{\partial W\mathbf{x}}{\partial W}= (\mathbf{x}^{\text{T}}\mathbf{1_{m}}) \mathbf{1_{n}} + W \cfrac{\partial F(W) }{ \partial W}\mathbf{x}_{0} $$
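In a concrete dependent case, automatic differentiation sidesteps the higher-order tensor bookkeeping in the second term entirely. A minimal sketch with the hypothetical choice $F(W) = W^{\text{T}}W$ (so that $F(W)$ is $m \times m$; any other differentiable $F$ would do):

```python
import jax
import jax.numpy as jnp

n, m = 3, 4
W  = jax.random.normal(jax.random.PRNGKey(0), (n, m))
x0 = jax.random.normal(jax.random.PRNGKey(1), (m,))

def f(W):
    x = (W.T @ W) @ x0   # x = F(W) x0 now depends on W
    return W @ x

J = jax.jacobian(f)(W)   # full derivative; the product rule is handled internally
print(J.shape)           # (3, 3, 4), i.e. (n, n, m)
```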