
From ref 1 it is clear that when you have an element-wise operation on a vector, the Jacobian matrix of the function with respect to its input vector is a diagonal matrix.

For an input vector $\textbf{x} = \{x_1, x_2, \dots, x_n\}$ to which an element-wise function is applied, say the sigmoid activation $\sigma$, giving the output vector $\textbf{a} = \{a_1, a_2, \dots, a_n\}$ with $a_i = f(x_i)$: what is $\frac{\partial a}{\partial x}$?

In the scalar case this is simply $\frac{\partial f(x)}{\partial x} = f'(x)$.

In the vector case, that is, when we take the derivative of a vector with respect to another vector, we get the following (square) Jacobian matrix.

Example from ref 2

$$ \begin{aligned} \text{The Jacobian, } J = \frac {\partial a}{\partial x} = \begin{bmatrix} \frac{\partial a_{1}}{\partial x_{1}} & \frac{\partial a_{1}}{\partial x_{2}} & \dots & \frac{\partial a_{1}}{\partial x_{n}} \\ \frac{\partial a_{2}}{\partial x_{1}} & \frac{\partial a_{2}}{\partial x_{2}} & \dots & \frac{\partial a_{2}}{\partial x_{n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial a_{n}}{\partial x_{1}} & \frac{\partial a_{n}}{\partial x_{2}} & \dots & \frac{\partial a_{n}}{\partial x_{n}} \\ \end{bmatrix} \end{aligned} $$

In the matrix above, the diagonal entries of $J$ are the only ones that can be nonzero:

$$ \begin{aligned} J = \begin{bmatrix} \frac{\partial a_{1}}{\partial x_{1}} & 0 & \dots & 0 \\ 0 & \frac{\partial a_{2}}{\partial x_{2}} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \frac{\partial a_{n}}{\partial x_{n}} \\ \end{bmatrix} \end{aligned} $$

$$ \text{ As } \left(\frac{\partial a}{\partial x}\right)_{ij} = \frac{\partial a_i}{\partial x_j} = \frac { \partial f(x_i)}{ \partial x_j} = \begin{cases} f'(x_i) & \text{if $i=j$} \\ 0 & \text{otherwise} \end{cases} $$
The authors go on to explain that $\frac{\partial a}{\partial x}$ can be written as $\text{diag}(f'(x))$, and that the Hadamard or element-wise product ($\odot$ or $\circ$) with $f'(x)$ can be used in place of multiplication by this Jacobian matrix when applying the chain rule and converting from index notation to matrix notation.
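To make the vector case concrete, here is a quick NumPy check (my own illustrative snippet, using the sigmoid as $f$) that the Jacobian of an element-wise function is $\text{diag}(f'(x))$, and that multiplying by it is the same as taking a Hadamard product:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([0.5, -1.0, 2.0])
v = np.array([0.1, 0.2, 0.3])      # an arbitrary vector to multiply against

# Jacobian of the element-wise map a = sigmoid(x): only the diagonal survives
J = np.diag(sigmoid_prime(x))

# Multiplying by the diagonal Jacobian equals the element-wise (Hadamard) product
print(np.allclose(J @ v, sigmoid_prime(x) * v))   # True
```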

Sorry for the rather long explanation; it was mostly to make the context clear. On to the real question.

While implementing a neural network in practice, however, the input is not a vector but an $M \times N$ matrix, with $M, N > 1$.

Take a simple $2 \times 2$ input matrix to which the sigmoid activation is applied element-wise; stacking the derivatives of each of the four output entries with respect to the input matrix gives an $8 \times 2$ array, which is no longer a square matrix.

Does it make sense to regard the derivative of the matrix $a_{i,j}$ (to which an element-wise function is applied) with respect to the input matrix $x_{i,j}$ as a Jacobian?

$$ \frac{\partial a_{i,j}}{\partial x_{k,l}} = J_{(i,j),(k,l)} $$

Even if so, there is no guarantee that this will be a square matrix, so how can we generalize to the diagonal? Am I correct in the statements above?

However, all the articles treat this matrix case as a generalization of the vector case, write $\frac{\partial a}{\partial x}$ as $\text{diag}(f'(x))$, and then use the element-wise/Hadamard product in the chain rule; implementations do the same. But there is no meaning of "diagonal" for a non-square matrix. What am I missing?
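For example, the backward pass through a sigmoid layer is typically coded as an element-wise product with $\sigma'(x)$, whatever the shape of the input (a minimal NumPy sketch of the usual pattern, not taken from any particular library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# x can be a vector or an M x N matrix; the code is identical either way
x = np.random.randn(2, 2)               # layer input (here a 2 x 2 matrix)
upstream_grad = np.random.randn(2, 2)   # gradient flowing back from the next layer

a = sigmoid(x)                          # forward pass, element-wise
grad_x = upstream_grad * a * (1 - a)    # backward pass: Hadamard product with sigmoid'(x)
```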

1 Answer


$ \def\d{\vec\delta} \def\D{{\mathbb D}} \def\M{{\mathbb M}} \def\o{{\tt1}}\def\p{\partial} \def\L{\left}\def\R{\right} \def\LR#1{\L(#1\R)} \def\vecc#1{\operatorname{vec}\LR{#1}} \def\diag#1{\operatorname{diag}\LR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\TDiag#1{\operatorname{TensorDiag}\LR{#1}} \def\Reshape#1{\operatorname{Reshape}\LR{#1}} \def\qiq{\quad\implies\quad} \def\qif{\quad\iff\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\m#1{\left[\begin{array}{r}#1\end{array}\right]} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} $
A function and its derivative $(f,f')$ can be applied element-wise to a vector or a matrix argument $(x,X)$, producing a vector or matrix result
$$\eqalign{ a &= f(x),\qquad \;a'&=f'(x) \\ A &= f(X),\qquad A'&=f'(X) \\ }$$
The differentials of such functions require element-wise (aka Hadamard) products
$$\eqalign{ da &= a'\odot dx,\qquad dA = A'\odot dX \\ }$$
The differentials are related to the gradients/Jacobians via dot products
$$\eqalign{ da &= \gradLR{a}{x}\cdot dx &\qif \grad{a_{i}}{x_{p}} = \gradLR{a}{x}_{ip} \\ dA &= \gradLR{A}{X}:dX &\qif \grad{A_{ij}}{X_{pq}} = \gradLR{A}{X}_{ijpq} \\ }$$
In the vector case, a diagonal matrix can be substituted for the Hadamard product, so that the corresponding gradient is easily identified as this diagonal matrix, i.e.
$$\eqalign{ da &= a'\odot dx = \c{\Diag{a'}}\cdot dx = \c{\gradLR ax}\cdot dx \\ }$$
To make the connection to the matrix case, first introduce the following tensors
$$\eqalign{ \D_{ijk} &= \begin{cases} \o\quad{\rm if}\;(i=j=k) \\ 0\quad{\rm otherwise} \\ \end{cases} \\ \M_{ip\,jq\,kr} &= \begin{cases} \o\quad{\rm if}\;(i=j=k)\;{\rm and}\;(p=q=r) \\ 0\quad{\rm otherwise} \\ \end{cases} \\ \M_{ip\,jq\,kr} &= \D_{ijk}\,\D_{pqr} \\ }$$
which allow Hadamard products to be written using dot products
$$\eqalign{ a\odot b &= \;a\cdot\D\cdot b \\ A\odot B &= A:\M:B \\ }$$
In the vector case, the tensor is the diagonalization operator
$$\eqalign{ \Diag{a} \;=\; \D\cdot a \;=\; a\cdot\D \\ }$$
This same tensor can also be used to create a vector from the diagonal of a matrix, i.e.
$$\eqalign{ \diag{B} \;=\; \D:B \;=\; B:\D \\ \\ }$$
In the matrix case, you could define a similar operation
$$\eqalign{ \TDiag{A} \;=\; \M:A \;=\; A:\M \\ }$$
and use this to express the gradient of a function
$$\eqalign{ dA &= A'\odot dX = \c{\TDiag{A'}}:dX = \c{\gradLR AX}:dX \\ }$$
However, nobody does this in practice, because most people are unfamiliar with tensors and index notation.
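These tensor identities are easy to check numerically. Here is a small sketch with `np.einsum` (my own illustration of the definitions above, using the sigmoid as the element-wise function):

```python
import numpy as np

def sigmoid(X):
    return 1.0 / (1.0 + np.exp(-X))

m, n = 2, 2
rng = np.random.default_rng(0)
X  = rng.normal(size=(m, n))
dX = rng.normal(size=(m, n))       # an arbitrary perturbation of X

A  = sigmoid(X)
Ap = A * (1 - A)                   # A' = sigmoid'(X), element-wise

# Fourth-order TensorDiag(A'): T[i,p,j,q] = A'[i,p] if (i==j and p==q), else 0
T = np.einsum('ij,pq,ip->ipjq', np.eye(m), np.eye(n), Ap)

# Double contraction T : dX reproduces the Hadamard product A' ⊙ dX
dA_tensor   = np.einsum('ipjq,jq->ip', T, dX)
dA_hadamard = Ap * dX
print(np.allclose(dA_tensor, dA_hadamard))   # True
```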

Update

Another often-used trick is vectorization
$$\eqalign{ dA &= A'\odot dX \\ \vecc{dA} &= \vecc{A'} \odot \vecc{dX} \\ &= \Diag{\vecc{A'}} \cdot \vecc{dX} \\ \grad{\vecc{A}}{\vecc{X}} &= \Diag{\vecc{A'}} \\ }$$
which flattens all of the matrices into vectors.
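A quick numerical illustration of the vectorization trick (again my own sketch, with NumPy's column-major `flatten(order='F')` playing the role of $\operatorname{vec}$):

```python
import numpy as np

def sigmoid(X):
    return 1.0 / (1.0 + np.exp(-X))

m, n = 2, 2
rng = np.random.default_rng(1)
X  = rng.normal(size=(m, n))
dX = rng.normal(size=(m, n))

A  = sigmoid(X)
Ap = A * (1 - A)                       # element-wise derivative sigmoid'(X)

vec = lambda M: M.flatten(order='F')   # column-major vectorization
J   = np.diag(vec(Ap))                 # Diag(vec(A')) -- a 4 x 4 diagonal matrix here

dA_flat = J @ vec(dX)                  # matrix-vector product with the flattened Jacobian
dA_hada = vec(Ap * dX)                 # Hadamard product, then flatten
print(np.allclose(dA_flat, dA_hada))   # True
```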

The components of the flattened gradient are exactly the same as those of the fourth-order tensor
$$\eqalign{ \grad AX &= \Reshape{\grad{\vecc{A}}{\vecc{X}},\;m,n,\;m,n} \\ \grad{\vecc{A}}{\vecc{X}} &= \Reshape{\grad AX,\;mn,\;mn} \\ }$$
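This reshape correspondence can also be verified directly (a sketch; `order='F'` keeps the reshape consistent with the column-major $\operatorname{vec}$ convention):

```python
import numpy as np

m, n = 2, 3
rng = np.random.default_rng(2)
Ap  = rng.normal(size=(m, n))        # stands in for the element-wise derivative f'(X)

vec = lambda M: M.flatten(order='F') # column-major vectorization
J_flat = np.diag(vec(Ap))            # (mn x mn) flattened gradient Diag(vec(A'))

# Fourth-order gradient tensor: G[i,p,j,q] = Ap[i,p] if (i==j and p==q), else 0
G = np.einsum('ij,pq,ip->ipjq', np.eye(m), np.eye(n), Ap)

# Reshaping converts one form into the other, component for component
print(np.allclose(J_flat.reshape(m, n, m, n, order='F'), G))   # True
print(np.allclose(G.reshape(m*n, m*n, order='F'), J_flat))     # True
```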

greg
  • Here is an earlier attempt to explain the same concept – greg Mar 16 '22 at 14:55
  • Thanks for this @greg; I got an explanation on another channel and posted the answer there - https://stats.stackexchange.com/a/567537/191675 ; what do you think of it? – Alex Punnen Mar 16 '22 at 15:04
  • @AlexPunnen Personally, I find that whuber's posts about Matrix Calculus are often unclear (although his posts about statistics are usually great). But the important thing is that you understand his post. Machine Learning texts and tutorials contain good computer code but terrible math, so I highly encourage you to keep searching StackExchange and other sites for the math underpinning the code. – greg Mar 16 '22 at 16:57
  • In practice which way is used? Like in auto differentiation. – lovetl2002 Aug 05 '22 at 04:36