
Let $x_{1}, \dots, x_{N}$ be a sequence of vectors in $\mathbb{R}^{n}$ and $A$ be an $n \times n$ matrix. Let $f: \mathbb{R} \to \mathbb{R}$ be a smooth (as smooth as you want) function. We define $$ h_{i} = f(Ax_{i}) $$ for $i = 1, \dots, N$, where $f$ is applied to the vector $Ax_{i}$ element-wise.

Suppose $G$ is a scalar function of the vectors $h_{1}, \dots, h_{N}$ (i.e. it maps to $\mathbb{R}$), and we want to find $\nabla_{A} G$, i.e. the gradient of $G$ with respect to the matrix $A$: an $n \times n$ matrix whose $(i, j)$ entry is $\frac{\partial G}{\partial A_{ij}}$. Suppose we already know all the gradients $\nabla_{h_{i}} G$ and want to use the chain rule along with these to find $\nabla_{A} G$. How would we go about doing that?

I know that the chain rule says that for generic vectors $x, y$ and a function $F$ mapping to $\mathbb{R}$, $$ \nabla_{x} F = \left(\frac{dy}{dx}\right)^{T} \nabla_{y} F $$ where $\frac{dy}{dx}$ is the Jacobian of $y$ w.r.t. $x$. But how do I apply this to the setup above?
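(As a quick sanity check of that identity, here is a minimal NumPy sketch for a linear map $y = Bx$, whose Jacobian is simply $B$; the test function $F$ and the shapes are arbitrary choices, not part of the question.)

```python
import numpy as np

# Numerical check of grad_x F = (dy/dx)^T grad_y F for y = B x,
# where the Jacobian dy/dx is just B. F(y) = sum(sin(y)) is arbitrary.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))
x = rng.standard_normal(3)

F = lambda x_: np.sum(np.sin(B @ x_))
grad_y = np.cos(B @ x)     # gradient of F with respect to y = B x
grad_x = B.T @ grad_y      # chain rule: J^T grad_y

# Compare against central finite differences.
eps = 1e-6
fd = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps)
               for e in np.eye(3)])
print(np.max(np.abs(grad_x - fd)))  # ~1e-10: the identity holds
```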

gtoques
  • 1,484
  • So you use the map $f^n : \mathbb R^n \to \mathbb R^n$ which is $f$ componentwise. You want to study the functions $\phi : (\mathbb R^n)^N \to (\mathbb R^n)^N, \phi(x_1,\ldots,x_N) = (f^n(Ax_1), \ldots, f^n(Ax_N))$ and compose it with $G : (\mathbb R^n)^N \to \mathbb R$. – Paul Frost Nov 15 '21 at 10:54

2 Answers


$ \def\a{\alpha}\def\b{\beta}\def\g{\gamma}\def\t{\theta} \def\l{\lambda}\def\s{\sigma}\def\e{\varepsilon} \def\n{\nabla}\def\o{{\tt1}}\def\p{\partial} \def\G{{\cal G}} \def\L{\left}\def\R{\right}\def\LR#1{\L(#1\R)} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\deriv#1#2{\frac{d #1}{d #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} \def\h{h^\prime} \def\H{H^\prime} $The Frobenius product is a concise notation for the trace $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \|A\|^2_F \\ }$$ This is also called the double-dot or double contraction product.
When applied to vectors $(n=\o)$ it reduces to the standard dot product.

The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many different ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:\LR{AB} &= \LR{CB^T}:A \\&= \LR{A^TC}:B \\ }$$ A function and its derivative are defined for a scalar argument $(\l)$
$$\eqalign{ f(\l),\qquad f^\prime(\l) = \deriv{f}{\l} \\ }$$ but can be applied elementwise to a vector argument $(w)$ to create the following set of variables $$\eqalign{ w &= Ax &\qiq &dw = dA\;x \\ h &= f(w) \\ \h &= f^\prime(w) \\ \H &= \Diag{\h} &\qiq &dh = \H\,dw = \H\,dA\;x \\ \n_hG &= \grad{G}{h} &&\n_AG = \grad{G}{A} \\ }$$
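As a sanity check, the rearrangement identities above are easy to verify numerically. This is a minimal sketch implementing $A:B$ as an elementwise sum of products; the shapes are chosen arbitrarily.

```python
import numpy as np

# Verify the Frobenius-product rearrangements C:(AB) = (CB^T):A = (A^T C):B.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
C = rng.standard_normal((4, 5))

frob = lambda U, V: np.sum(U * V)  # U:V = Tr(U^T V)

print(np.isclose(frob(C, A @ B), frob(C @ B.T, A)))  # C:(AB) = (CB^T):A
print(np.isclose(frob(C, A @ B), frob(A.T @ C, B)))  # C:(AB) = (A^T C):B
```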

Setting $N=1$ makes subscripts unnecessary, so we'll temporarily drop them for typing convenience and calculate the gradient of $G$ $$\eqalign{ G &= G(h) \\ dG &= \gradLR{G}{h}:dh \\ &= \gradLR{G}{h}:\LR{\H\,dA\;x} \\ &= \H\gradLR{G}{h}x^T:dA \\ \grad{G}{A} &= \H\gradLR{G}{h}x^T \\ }$$ For $N>1$ the gradient is a sum of such terms $$\eqalign{ w_k &= Ax_k,\quad h_k = f(w_k),\quad \h_k = f^\prime(w_k),\quad\H_k = \Diag{\h_k} \\ \grad{G}{A} &= \sum_{k=1}^N \H_k\gradLR{G}{h_k}x_k^T \\ }$$
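For a concrete check, here is a minimal NumPy sketch that compares the final formula against finite differences. The choices $f=\tanh$ and $G(h_1,\dots,h_N)=\sum_k\|h_k\|^2$ (so that $\nabla_{h_k}G=2h_k$) are illustrative assumptions, not part of the answer.

```python
import numpy as np

# Check grad_A G = sum_k Diag(f'(w_k)) (dG/dh_k) x_k^T numerically.
rng = np.random.default_rng(1)
n, N = 4, 3
A = rng.standard_normal((n, n))
xs = [rng.standard_normal(n) for _ in range(N)]

f = np.tanh
fprime = lambda w: 1.0 - np.tanh(w) ** 2

def G_of_A(A):
    # G = sum_k ||h_k||^2 with h_k = f(A x_k)
    return sum(np.sum(f(A @ x) ** 2) for x in xs)

# Analytic gradient from the formula above.
grad = np.zeros_like(A)
for x in xs:
    w = A @ x
    dG_dh = 2.0 * f(w)                       # gradient of G w.r.t. h_k
    grad += np.outer(fprime(w) * dG_dh, x)   # Diag(h') g x^T = (h' o g) x^T

# Central finite differences over every entry of A.
eps = 1e-6
fd = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        fd[i, j] = (G_of_A(A + E) - G_of_A(A - E)) / (2 * eps)

print(np.max(np.abs(grad - fd)))  # ~1e-8: the formulas agree
```

Note that the elementwise product $\h_k \circ \n_{h_k}G$ stands in for the explicit $\Diag{\h_k}$, which is how this expression is typically implemented in practice.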

greg
  • 40,033

I suggest using the tensor contraction mentioned in the question "Chain rule with differentiation by vectors and matrices?"

For example, let $X=\nabla_x F$, $Y=\nabla_y F$, and $J=\frac{dy}{dx}$. Then $X_i=Y_j J^j_i$, where $a_ib^i$ denotes $\sum_i a_ib^i$ by the Einstein summation convention. So if you write the gradient as a row vector, we have $\nabla_x F=\nabla_y F\left(\frac{dy}{dx}\right)$, with no need for a transpose.

Back to the question. Formally, by the chain rule we have $\nabla_A G=\nabla_{h} G\,\frac{\partial h}{\partial (Ax)}\,\frac{\partial (Ax)}{\partial A}$, where $h=(h_1,\cdots,h_n)$ and $x=(x_1,\cdots,x_n)$ are the $n\times n$ matrices whose columns are the $h_k$ and $x_k$ (taking $N=n$ for notational convenience). Write this as $P=QHX$.

Then $P_{ij}=Q_{kl}H^{kl}_{rt}X^{rt}_{ij}$, where $P_{ij}=\frac{\partial G}{\partial A_{ij}}$, $Q_{kl}=\frac{\partial G}{\partial h_{kl}}=(\nabla_{h_l}G)_k$, and $H^{kl}_{rt}=\frac{\partial h_{kl}}{\partial (Ax)_{rt}}=\frac{\partial f((Ax_l)_k)}{\partial (Ax_t)_r}=\delta_{r}^k\delta_t^l f'((Ax_l)_k)$, with $\delta_i^j=\begin{cases}1,&i=j\\0,&i\neq j\end{cases}$ the Kronecker delta. Finally, $X^{rt}_{ij}=\frac{\partial (Ax_t)_r}{\partial A_{ij}}=\delta^r_i (x_t)_j$.

So you can simply write $\nabla_A G=\nabla_{h} G\,\frac{\partial h}{\partial (Ax)}\,\frac{\partial (Ax)}{\partial A}$, with the specific contractions spelled out above. Note that $H$ and $X$ are not ordinary matrices but higher-order tensors.
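To make the contraction concrete, here is a NumPy sketch that materializes the four-index tensors $H$ and $X$ from their Kronecker-delta expressions and contracts them with np.einsum. The choices $f=\tanh$ and $G=\sum h^2$ (so $Q=2h$) are illustrative assumptions; the deltas collapse the contraction to $(f'(W)\odot Q)\,x^T$, which matches the other answer's sum of outer products.

```python
import numpy as np

# Contraction P_ij = Q_kl H^{kl}_{rt} X^{rt}_{ij}, with H and X built
# explicitly from their Kronecker-delta expressions.
rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
x = rng.standard_normal((n, n))   # column t is the vector x_t
W = A @ x                         # W[r, t] = (A x_t)_r

Q  = 2.0 * np.tanh(W)             # Q_kl = dG/dh_{kl} for G = sum(h^2)
Fp = 1.0 - np.tanh(W) ** 2        # f'((A x_l)_k)

I = np.eye(n)
H = np.einsum('kr,lt,kl->klrt', I, I, Fp)  # delta_r^k delta_t^l f'((Ax_l)_k)
X = np.einsum('ri,jt->rtij', I, x)         # delta^r_i (x_t)_j

P = np.einsum('kl,klrt,rtij->ij', Q, H, X)

# The deltas collapse the contraction to an ordinary matrix product,
# recovering sum_k Diag(f'(w_k)) (dG/dh_k) x_k^T from the other answer:
print(np.allclose(P, (Fp * Q) @ x.T))      # True
```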

DreamAR
  • 986