28

Let $W$ be an $n\times m$ matrix and $\textbf{x}$ an $m\times 1$ vector. How do we then calculate the following?

$$\frac{dW\textbf{x}}{dW}$$

Thanks in advance.

  • 3
    That totally depends on the definition being used. – jimjim Jan 22 '16 at 03:49
  • Also, I like your thinking; just make it explicit what the definition should be. – jimjim Jan 22 '16 at 03:50
  • 6
    The wiki has "?" for this type of matrix differentiation, under "Result of differentiating various kinds of aggregates with other kinds of aggregates" in https://en.wikipedia.org/wiki/Matrix_calculus#Other_matrix_derivatives. I encountered this in the context of neural networks and I'm not sure either how it's defined. – arindam mitra Jan 22 '16 at 04:00

3 Answers

35

The quantity in question is a $3^{rd}$ order tensor.

One approach is to use index notation, with the Einstein convention that the repeated index $j$ is summed over: $$\eqalign{ f_i &= W_{ij} x_j \cr\cr \frac{\partial f_i}{\partial W_{mn}} &= \frac{\partial W_{ij}}{\partial W_{mn}} \,x_j \cr &= \delta_{im}\delta_{jn} \,x_j \cr &= \delta_{im}\,x_n \cr }$$ Another approach is vectorization, writing $w={\rm vec}(W)$: $$\eqalign{ f &= W\,x \cr &= I\,W\,x \cr &= (x^T\otimes I)\,{\rm vec}(W) \cr &= (x^T\otimes I)\,w \cr\cr \frac{\partial f}{\partial w} &= (x^T\otimes I) \cr }$$
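
Since the comments below question the order of the Kronecker factors and the implicit summation, a quick numerical check may help. This is only an illustrative sketch (NumPy, column-major `vec`, sizes and seed chosen arbitrarily), not part of the original answer:

```python
# Sketch of a numerical check (not part of the original answer).
# Verifies vec(W x) = (x^T kron I) vec(W) with column-major vec, and the
# index result df_i/dW_{mn} = delta_{im} x_n via finite differences.
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)

f = W @ x                              # f = W x, shape (n,)
J = np.kron(x[None, :], np.eye(n))     # x^T kron I, shape (n, n*m)
w = W.flatten(order="F")               # column-major vec(W)

assert np.allclose(f, J @ w)           # vectorization identity holds

# Finite-difference check of df_i/dW_{ab} = delta_{ia} * x_b
eps = 1e-6
for i in range(n):
    for a in range(n):
        for b in range(m):
            Wp = W.copy()
            Wp[a, b] += eps
            numeric = ((Wp @ x)[i] - f[i]) / eps
            analytic = (i == a) * x[b]
            assert abs(numeric - analytic) < 1e-4
```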

lynn
  • 1,836
  • Are you missing any $\sum_{}{}$ symbol? – arindam mitra Jan 22 '16 at 20:40
  • 3
    @arindammitra I guess lynn uses the Einstein summation notation in the expression of $f_i$ – Surb Jan 15 '18 at 13:49
  • I don't think the rewriting of f is correct. At least it doesn't produce the same result when I try on a small example. – Sandi Nov 16 '18 at 12:44
  • 1
    I think the order of your $\mathbf{I}$ and $\mathbf{x}^T$ should be changed. – Sandi Nov 16 '18 at 12:52
  • 10
    @Sandi Your edits to lynn's answer are wrong. The correct vectorization formula is $${\rm vec}(IWx)=(x^T\otimes I){\rm vec}(W)$$ Please read the Wikipedia entry. This question must be cursed. The accepted answer is (still) wrong, and (now) lynn's answer has been corrupted. – greg Dec 21 '18 at 04:30
  • Can someone plz fix this? In the answer, I thought that vectorization assumes vectorization by columns, like @greg suggested in his correction, but the current answer uses vectorization by rows and it is not mentioned explicitly! – John Mar 30 '19 at 18:37
11

While it's true that the full derivative is a 3rd-order tensor, in the context of a (feed-forward) NN, taking this gradient as part of the chain rule, where the final output is a scalar loss, the calculation simplifies enormously and can be represented as an outer product: $\frac{d\textbf{x}W}{dW} = \textbf{x}^T \cdot \_\_$

Specifically, if $L$ is the loss and $z=\textbf{x}W+b$ (or $\textbf{a}W+b$ for any downstream "inputs"/activations), where $\textbf{x}$ (or $\textbf{a}$) is a row vector, then:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial W} =\frac{\partial L}{\partial z} \frac{\partial (\textbf{x}W+b)}{\partial W} = \textbf{x}^T\cdot\frac{\partial L}{\partial z} $$
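For a concrete check of this outer-product form, here is a small sketch (not from the answer); the squared-error loss and the sizes are illustrative choices of mine, and any scalar loss would do:

```python
# Sketch of a numerical check (not part of the original answer).
# Convention: x is a 1 x m row vector, W is m x n, z = x W + b is 1 x n.
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
x = rng.standard_normal((1, m))
W = rng.standard_normal((m, n))
b = rng.standard_normal((1, n))
t = rng.standard_normal((1, n))        # arbitrary target for the illustrative loss

def loss(W):
    z = x @ W + b
    return 0.5 * np.sum((z - t) ** 2)

# Chain rule: dL/dz = z - t for this loss, so dL/dW = x^T (dL/dz)
z = x @ W + b
dL_dz = z - t                          # shape (1, n)
grad_analytic = x.T @ dL_dz            # outer product, shape (m, n)

# Finite-difference gradient for comparison
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        Wp = W.copy()
        Wp[i, j] += eps
        grad_numeric[i, j] = (loss(Wp) - loss(W)) / eps

assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
```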

I'm in the process of making a YouTube video with more explanation. I will update this as soon as it gets published.

EDIT: here is the video; the relevant part starts at 09:00.

  • 1
    Thanks for the explanation in your video. It makes clear why most neural net literature writes the derivative $\partial z/\partial W$ as $\mathbf x^T$. However, I am trying to calculate the Hessian by applying the second-order chain rule. That involves computing $(\partial z/\partial W)^2$; should it be $(\mathbf x^2)^T$ or $\mathbf{xx}^T$? – Phoenix Dec 26 '23 at 07:38
  • 1
    Great video, by the way!! – Gregor Hartl Watters Feb 29 '24 at 01:24
-14

For the independent case:

If $\mathbf{x}$ is independent of $W$, this problem can be calculated as follows.

$$\cfrac{\partial W\mathbf{x}}{\partial W}= \cfrac{\partial}{\partial W} \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1m} \\ w_{21} & w_{22} & \cdots & w_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nm} \end{bmatrix} \mathbf{x} $$

$$ = \begin{bmatrix} \cfrac{\partial w_{11}}{\partial w_{11}} & \cfrac{\partial w_{12}}{\partial w_{12}} & \cdots & \cfrac{\partial w_{1m}}{\partial w_{1m}} \\ \cfrac{\partial w_{21}}{\partial w_{21}} & \cfrac{\partial w_{22}}{\partial w_{22}} & \cdots & \cfrac{\partial w_{2m}}{\partial w_{2m}} \\ \vdots & \vdots & \ddots & \vdots \\ \cfrac{\partial w_{n1}}{\partial w_{n1}} & \cfrac{\partial w_{n2}}{\partial w_{n2}} & \cdots & \cfrac{\partial w_{nm}}{\partial w_{nm}} \end{bmatrix} \mathbf{x} $$

Therefore, all elements are $1$, and the result is as follows.

$$ \cfrac{\partial W\mathbf{x}}{\partial W}= (\mathbf{x}^{\text{T}}\mathbf{1_{m}}) \mathbf{1_{n}} $$

Here, $\mathbf{1_{k}} \in \mathbf{R}^{k}$ denotes the all-ones vector

$$\mathbf{1_{k}}=[1 \ 1 \ \cdots 1]^{\text{T}}$$

For the dependent case:

If $\mathbf{x}$ is dependent on $W$, it is more difficult than the independent case. Likewise,

$$ \cfrac{\partial W\mathbf{x}}{\partial W}= (\mathbf{x}^{\text{T}}\mathbf{1_{m}}) \mathbf{1_{n}} + W \cfrac{\partial F(W) }{ \partial W}\mathbf{x}_{0} $$

Here, $\mathbf{x}$ has been replaced as follows:

$$\mathbf{x}=F(W)\mathbf{x}_{0}$$

where $F(W) \in \mathbf{R}^{m \times n}$ is a matrix function whose parameter is $W$, and $\mathbf{x}_{0} \in \mathbf{R}^{m}$ is independent of $W$.

  • 1
    Hi, can you please point me to a book where I can learn more about derivatives w.r.t matrix? – arindam mitra Feb 01 '16 at 20:12
  • 17
    This may be the accepted answer, but it's just plain wrong. The problem is that the gradient $$\frac{\partial W}{\partial W} \ne 1_n1_m^T$$ as asserted. It is instead a 4th order tensor which can be written in index notation as $$\frac{\partial W_{ij}}{\partial W_{kl}}=\delta_{ik}\,\delta_{jl}$$ – greg Nov 06 '17 at 01:52
  • 4
    I fully agree with @greg. This answer is simply plain wrong... I am a bit scared to see that it has 4 upvotes... – Surb Jan 15 '18 at 13:50