
In recurrent neural network backpropagation (BPTT), we have the equations: \begin{align} e_t &= E^T x_t \\ a_t &= W_{hx}^T e_t + W_{hh}^T h_{t-1} \\ h_t &= \text{tanh}(a_t) \\ s_t &= W_{yh}^T h_t \\ \hat{y}_t &= \text{softmax}(s_t) \\ L_t &= \text{CE}(\hat{y}_t, y_t) \end{align} where $e_t$ has shape $(n_i, 1)$, $a_t, h_t$ have shape $(n_h,1)$, and $s_t, \hat{y}_t$ have shape $(K,1)$. The matrices $W_{hx}, W_{hh}, W_{yh}$ have respective shapes $(n_i, n_h), (n_h, n_h), (n_h, K)$. $L_t$ is the scalar cross-entropy loss, and the subscript $t$ denotes the time step.
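For concreteness, here is a minimal NumPy sketch of one forward step that reproduces these shapes (the sizes, the one-hot input $x_t$, and the $(n_v, n_i)$ shape I assume for $E$ are my own illustrative choices, not given above):

```python
import numpy as np

# One forward step, just to pin down the shapes above.
n_v, n_i, n_h, K = 10, 4, 5, 3             # assumed vocab, embedding, hidden, output sizes

rng = np.random.default_rng(0)
E    = rng.standard_normal((n_v, n_i))     # embedding matrix (assumed shape)
W_hx = rng.standard_normal((n_i, n_h))
W_hh = rng.standard_normal((n_h, n_h))
W_yh = rng.standard_normal((n_h, K))

x_t    = np.eye(n_v)[:, [2]]               # assumed one-hot input, (n_v, 1)
h_prev = np.zeros((n_h, 1))                # h_{t-1}

e_t = E.T @ x_t                            # (n_i, 1)
a_t = W_hx.T @ e_t + W_hh.T @ h_prev       # (n_h, 1)
h_t = np.tanh(a_t)                         # (n_h, 1)
s_t = W_yh.T @ h_t                         # (K, 1)
y_hat_t = np.exp(s_t - s_t.max())
y_hat_t /= y_hat_t.sum()                   # softmax, (K, 1)

print(e_t.shape, a_t.shape, h_t.shape, s_t.shape, y_hat_t.shape)
```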

I would like to express the loss gradient with respect to the transposed weight matrix $W_{hx}^T$ at the first time step, that is: \begin{align} \frac{\partial L_1}{\partial W_{hx}^T} &= \frac{\partial h_1^T}{\partial W_{hx}^T} \frac{\partial L_1}{\partial h_1} \end{align} In particular, $\frac{\partial h_1^T}{\partial W_{hx}^T}$ requires vectorising the matrix in the denominator. Since $W_{hx}^T$ has shape $(n_h, n_i)$, $\text{vec}(W_{hx}^T)$ has shape $(n_h n_i, 1)$.

I am having trouble with:

  1. expressing $\frac{\partial h_1^T}{\partial \text{vec}(W_{hx}^T)}$.
  2. and expressing $\frac{\partial L_1}{\partial W_{hx}^T}$ from $\frac{\partial L_1}{\partial \text{vec}(W_{hx}^T)}$, that is, the de-vectorisation operation.

My attempt so far:

  1. $\frac{\partial h_1}{\partial \text{vec}(W_{hx}^T)}$. Let $f = W_{hx}^T e_1 = I_{n_h} W_{hx}^T e_1$, so $f$ also has shape $(n_h,1)$. Using the identity $\text{vec}(ABC) = (C^T \otimes A)\,\text{vec}(B)$: \begin{align} df &= I_{n_h}\, dW_{hx}^T\, e_1 \\ \text{vec}(df) &= (e_1^T \otimes I_{n_h})\, d\,\text{vec}(W_{hx}^T) \\ \frac{\partial f}{\partial \text{vec}(W_{hx}^T)^T} &= e_1^T \otimes I_{n_h} \\ &= \frac{\partial a_1}{\partial \text{vec}(W_{hx}^T)^T} \end{align}

Then \begin{align} \frac{\partial h_1}{\partial \text{vec}(W_{hx}^T)} &= \frac{\partial a_1^T}{\partial \text{vec}(W_{hx}^T)} \frac{\partial h_1}{\partial a_1} \\ &= (e_1^T \otimes I_{n_h})^T \text{ diag}(1- h_1^2) \\ &= (e_1 \otimes I_{n_h}) \text{ diag}(1- h_1^2) \end{align} this has shape $(n_in_h, n_h)$.

  1. $\frac{\partial h_1^T}{\partial \text{vec}(W_{hx}^T)}$ seems to have the same result and shape as $\frac{\partial h_1}{\partial \text{vec}(W_{hx}^T)} $. Can anyone confirm if I am right here? \begin{align} \frac{\partial h_1^T}{\partial \text{vec}(W_{hx}^T)}&= \frac{\partial a_1^T}{\partial \text{vec}(W_{hx}^T)} \frac{\partial h_1^T}{\partial a_1} \\ &= (e_1 \otimes I_{n_h}) \text{ diag}(1- h_1^2) \end{align}

  2. Finally the first-step (vectorised) loss gradient: \begin{align} \frac{\partial L_1}{\partial \text{vec}(W_{hx}^T)} &= \frac{\partial h_1^T}{\partial \text{vec}(W_{hx}^T)} \frac{\partial L_1}{\partial h_1} \\ &= (e_1 \otimes I_{n_h}) \text{ diag}(1- h_1^2) \frac{\partial L_1}{\partial h_1} \end{align} where $\frac{\partial L_1}{\partial h_1}$ has shape $(n_h, 1)$.

So $\frac{\partial L_1}{\partial \text{vec}(W_{hx}^T)}$ has shape $(n_in_h, 1)$. But how to express $\frac{\partial L_1}{\partial W_{hx}^T}$?

\begin{align} \frac{\partial L_1}{\partial W_{hx}^T} &= \text{ diag}(1- h_1^2) \frac{\partial L_1}{\partial h_1} e_1^T \end{align} will have the right shape $(n_h, n_i)$. But how to proceed from the result of $\frac{\partial L_1}{\partial \text{vec}(W_{hx}^T)}$ to this result, if it is correct?
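As a sanity check on the vectorised expression above, here is a quick finite-difference sketch in NumPy. It is only a sketch with random data: the small dimensions, the zero initial state, and the standard softmax + cross-entropy gradient $\frac{\partial L_1}{\partial h_1} = W_{yh}(\hat{y}_1 - y_1)$ are my own assumptions, not stated in the question.

```python
import numpy as np

rng = np.random.default_rng(1)
n_i, n_h, K = 4, 5, 3                       # illustrative sizes
W_hx = rng.standard_normal((n_i, n_h))
W_hh = rng.standard_normal((n_h, n_h))
W_yh = rng.standard_normal((n_h, K))
e1   = rng.standard_normal((n_i, 1))
h0   = np.zeros((n_h, 1))                   # assumed zero initial state
y1   = np.eye(K)[:, [1]]                    # assumed one-hot target

def forward(W):
    a1 = W.T @ e1 + W_hh.T @ h0
    h1 = np.tanh(a1)
    s1 = W_yh.T @ h1
    yh = np.exp(s1 - s1.max()); yh = yh / yh.sum()
    L1 = float(-(y1 * np.log(yh)).sum())    # cross-entropy loss
    return L1, h1, yh

L1, h1, yh1 = forward(W_hx)

# Analytical vectorised gradient: (e1 (x) I) diag(1 - h1^2) dL1/dh1,
# with dL1/dh1 = W_yh (yhat1 - y1) assumed for softmax + cross-entropy.
dL_dh1 = W_yh @ (yh1 - y1)                                              # (n_h, 1)
g_vec  = np.kron(e1, np.eye(n_h)) @ (np.diagflat(1 - h1**2) @ dL_dh1)   # (n_i*n_h, 1)

# Central-difference gradient w.r.t. the column-major vec of W_hx^T.
eps  = 1e-5
g_fd = np.zeros_like(g_vec)
for k in range(n_i * n_h):
    vp = W_hx.T.flatten(order="F"); vp[k] += eps
    vm = W_hx.T.flatten(order="F"); vm[k] -= eps
    Wp = vp.reshape(n_h, n_i, order="F").T
    Wm = vm.reshape(n_h, n_i, order="F").T
    g_fd[k] = (forward(Wp)[0] - forward(Wm)[0]) / (2 * eps)

print(np.max(np.abs(g_vec - g_fd)))         # should agree to roughly 1e-9
```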

Thanks in advance for any help.


For anyone interested in BPTT gradients, I will state the result for $\frac{\partial L_2}{\partial W_{hx}}$ obtained with @greg's method.

In particular, $a_2 = W_{hx}^T e_2 + W_{hh}^T h_1$, where $h_1$ also depends on $W_{hx}$. Therefore $da_2 = dW_{hx}^T\, e_2 + W_{hh}^T\, dh_1$.

This requires one additional property of the Frobenius product, bilinearity: $(A+C):(B+D) = A:B + A:D + C:B + C:D$. Starting from the loss $L_2$ and working backwards, this property comes into play once we reach a term of the form $\cdots : da_2$.

The final result: \begin{align} \frac{\partial L_2}{\partial W_{hx}} &= (e_2 \hat{y}_2^T - e_2 y_2^T) W_{yh}^T (I_{n_h} - H_2^2) + (e_1 \hat{y}_2^T - e_1 y_2^T) W_{yh}^T (I_{n_h} - H_2^2) W_{hh}^T (I_{n_h} - H_1^2) \end{align} with shape $(n_i, n_h)$.
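A quick central-difference sketch in NumPy that checks this two-step result (a sketch with random data; the dimensions, the zero initial state $h_0$, and the one-hot target $y_2$ are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n_i, n_h, K = 4, 5, 3                       # illustrative sizes
W_hx = rng.standard_normal((n_i, n_h))
W_hh = rng.standard_normal((n_h, n_h))
W_yh = rng.standard_normal((n_h, K))
e1 = rng.standard_normal((n_i, 1))
e2 = rng.standard_normal((n_i, 1))
h0 = np.zeros((n_h, 1))                     # assumed zero initial state
y2 = np.eye(K)[:, [0]]                      # assumed one-hot target at step 2

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

def L2_of(W):
    h1  = np.tanh(W.T @ e1 + W_hh.T @ h0)
    h2  = np.tanh(W.T @ e2 + W_hh.T @ h1)
    yh2 = softmax(W_yh.T @ h2)
    return float(-(y2 * np.log(yh2)).sum()), h1, h2, yh2

L2, h1, h2, yh2 = L2_of(W_hx)

# Stated result, shape (n_i, n_h); I - H_t^2 built as diagonal matrices.
D1 = np.diagflat(1 - h1**2)
D2 = np.diagflat(1 - h2**2)
G  = e2 @ (yh2 - y2).T @ W_yh.T @ D2 \
   + e1 @ (yh2 - y2).T @ W_yh.T @ D2 @ W_hh.T @ D1

# Central-difference check, entry by entry.
eps, G_fd = 1e-5, np.zeros_like(W_hx)
for i in range(n_i):
    for j in range(n_h):
        Wp, Wm = W_hx.copy(), W_hx.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        G_fd[i, j] = (L2_of(Wp)[0] - L2_of(Wm)[0]) / (2 * eps)

print(np.max(np.abs(G - G_fd)))             # should agree to roughly 1e-9
```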

siegfried

1 Answer


$ \def\o{{\tt1}}\def\p{\partial}\def\l{{\cal L}} \def\L{\left}\def\R{\right}\def\LR#1{\L(#1\R)} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\s#1{\operatorname{softmax}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} $For ease of typing, drop the hats and subscripts, and give each variable a unique name, e.g. $$\eqalign{ a &= W^Te + U^Th &\qiq da = dW^Te \\ h &= \tanh(a) \\ H &= \Diag{h} &\qiq dh = \LR{I-H^2}da \\ s &= V^Th &\qiq ds = V^Tdh = V^T\LR{I-H^2}da \\ z &= \s{s} \\ Z &= \Diag{z} &\qiq dz = \LR{Z-zz^T}ds \\ }$$ Then starting with the loss function, calculate its differential and back-substitute $$\eqalign{ \l &= -y : \log(z) \\ d\l &= -y : Z^{-1}dz \\ &= -Z^{-1}y : \c{dz} \\ &= -Z^{-1}y : \c{\LR{Z-zz^T}ds} \\ &= \LR{zz^T-Z}Z^{-1}y : ds \\ &= \LR{z\o^T-I}y : ds \\ &= \LR{z-y} : \c{ds} \\ &= \LR{z-y} : \c{V^T\LR{I-H^2}da} \\ &= \LR{I-H^2}V\LR{z-y} : \c{da} \\ &= \LR{I-H^2}V\LR{z-y} : \CLR{dW^Te} \\ &= \LR{z-y}^TV^T\LR{I-H^2} : \CLR{e^TdW} \\ &= \LR{ez^T-ey^T}\LR{V^T-V^TH^2} : dW \\ \grad{\l}{W} &= \LR{ez^T-ey^T}\LR{V^T-V^TH^2} \\ \\ }$$ So you can do the entire calculation using differentials without resorting to vectorization or awkward higher-order tensors. However, this method almost necessitates the use of the Frobenius product, which is a really concise notation for the trace $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \|A\|^2_F \\ }$$ The properties of the underlying trace function (or the double summation) allow the terms in such a product to be rearranged in many different but equivalent ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ \LR{AB}:C &= A:\LR{CB^T} \\&= B:\LR{A^TC} \\ }$$
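For readers who want to sanity-check these Frobenius-product rules numerically, a small NumPy sketch (random matrices; the shapes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p = 3, 4, 2                            # arbitrary shapes

frob = lambda X, Y: np.trace(X.T @ Y)        # A : B  =  Tr(A^T B)

A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))
assert np.isclose(frob(A, B), np.sum(A * B))                 # double-sum definition
assert np.isclose(frob(A, A), np.linalg.norm(A, 'fro')**2)   # A:A = ||A||_F^2
assert np.isclose(frob(A, B), frob(B, A))                    # A:B = B:A
assert np.isclose(frob(A, B), frob(A.T, B.T))                # A:B = A^T:B^T

X = rng.standard_normal((m, n))
Y = rng.standard_normal((n, p))
Z = rng.standard_normal((m, p))
assert np.isclose(frob(X @ Y, Z), frob(X, Z @ Y.T))          # (XY):Z = X:(Z Y^T)
assert np.isclose(frob(X @ Y, Z), frob(Y, X.T @ Z))          # (XY):Z = Y:(X^T Z)
print("all Frobenius-product identities hold")
```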

greg