I have the following problem: I am trying to derive final expressions for the error gradients in a simple recurrent neural network (backpropagation through time, BPTT). The parameters and state-update equations are the following:
$\mathbf{x}_t \in \mathbb{R}^n, \mathbf{y}_t \in \mathbb{R}^m$
$\mathbf{z}_t \in \mathbb{R}^m, \mathbf{h}_t \in \mathbb{R}^{s}, \mathbf{W}_h \in \mathbb{R}^{s\times s}, \mathbf{W}_x \in \mathbb{R}^{s\times n}, \mathbf{b}_h \in \mathbb{R}^{s}, \mathbf{b}_x \in \mathbb{R}^{s}, \mathbf{W}_p \in \mathbb{R}^{m\times s}, \mathbf{b}_p \in \mathbb{R}^{m}$
Hidden state update: $\mathbf{h}_t = \tanh(\mathbf{W}_{h} \cdot \mathbf{h}_{t-1} + \mathbf{b}_h + \mathbf{W}_{x} \cdot \mathbf{x}_t + \mathbf{b}_x)$
Prediction: $\mathbf{z}_t = \mathbf{W}_p \cdot \mathbf{h}_t + \mathbf{b}_p$
Total network error: $E = \sum_{t=1}^{N}E_t$, with $E_t = \frac{1}{2}\sum_{i=1}^m ((\mathbf{z}_t)_i - (\mathbf{y}_t)_i)^{2}$
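To make my conventions concrete, here is a minimal NumPy sketch of a single step of the forward pass as I understand it; all vectors are column vectors (1-D arrays), and the sizes and the random parameters/data are just placeholders for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, m = 3, 4, 2                      # placeholder sizes for illustration

W_h = rng.normal(size=(s, s)); b_h = rng.normal(size=s)
W_x = rng.normal(size=(s, n)); b_x = rng.normal(size=s)
W_p = rng.normal(size=(m, s)); b_p = rng.normal(size=m)

x_t = rng.normal(size=n)               # input at time t
y_t = rng.normal(size=m)               # target at time t
h_prev = np.zeros(s)                   # h_{t-1}

h_t = np.tanh(W_h @ h_prev + b_h + W_x @ x_t + b_x)   # hidden state update
z_t = W_p @ h_t + b_p                                  # prediction
E_t = 0.5 * np.sum((z_t - y_t) ** 2)                   # per-step error
```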
For a training dataset with samples $\mathbf{x}_t$ and targets $\mathbf{y}_t$, $t=1,\dots,N$, I arrived at this expression for the gradient of the total error with respect to $\mathbf{W}_h$:
\begin{align} {\frac{\partial E}{\partial \mathbf{W}_{h}}} & = \sum_{t=1}^{N}\sum_{\tau=1}^t \frac{\partial E_{t}}{\partial \mathbf{h}_{t}} \frac{\partial \mathbf{h}_{t}}{\partial \mathbf{h}_{\tau}}\frac{\partial\mathbf{h}_\tau}{\partial\mathbf{W}_h}\\ &=\sum_{t=1}^{N}\sum_{\tau=1}^{t} \frac{\partial E_{t}}{\partial \mathbf{h}_{t}} \biggl(\prod_{j=\tau+1}^{t} \frac{\partial \mathbf{h}_{j}}{\partial \mathbf{h}_{j-1}}\biggr)\frac{\partial\mathbf{h}_\tau}{\partial\mathbf{W}_h}\\ &=\sum_{t=1}^{N}\sum_{\tau=1}^t \frac{\partial E_{t}}{\partial \mathbf{z}_{t}} \frac{\partial \mathbf{z}_{t}}{\partial \mathbf{h}_{t}}\biggl(\prod_{j=\tau+1}^{t} \frac{\partial \mathbf{h}_{j}}{\partial \mathbf{h}_{j-1}}\biggr)\frac{\partial\mathbf{h}_\tau}{\partial\mathbf{W}_h} \end{align}
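If I write out the individual factors (treating $\frac{\partial E_t}{\partial \mathbf{z}_t}$ as a row vector, i.e. numerator layout), I believe they are
\begin{align}
\frac{\partial E_{t}}{\partial \mathbf{z}_{t}} = (\mathbf{z}_t-\mathbf{y}_t)^{\intercal} \in \mathbb{R}^{1\times m},\qquad
\frac{\partial \mathbf{z}_{t}}{\partial \mathbf{h}_{t}} = \mathbf{W}_p \in \mathbb{R}^{m\times s},\qquad
\frac{\partial \mathbf{h}_{j}}{\partial \mathbf{h}_{j-1}} = \mathrm{diag}(\mathbf{1}-\mathbf{h}_{j}^{2})\,\mathbf{W}_h \in \mathbb{R}^{s\times s},
\end{align}
so these factors chain to a $1\times s$ row vector; my trouble starts with the last factor $\frac{\partial\mathbf{h}_\tau}{\partial\mathbf{W}_h}$, which is a third-order object, and with how the whole product is supposed to come out as an $s\times s$ matrix.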
A solution I found suggests the final result should be \begin{align} \sum_{t=1}^{N}\sum_{\tau=1}^t((\mathbf{z}_t-\mathbf{y}_t) \cdot \mathbf{W}_p)(\prod_{j=\tau+1}^{t}(1-\mathbf{h}_{j}^{2}) \cdot \mathbf{W}_{h})((1-\mathbf{h}_{\tau}^{2}) \otimes \mathbf{h}_{\tau}) \end{align} where $\otimes$ is the outer product. But I am confused about the dimensions involved and about what is meant by multiplying a row vector with a matrix, for example $(\mathbf{z}_t-\mathbf{y}_t) \cdot \mathbf{W}_p$. Is this proposed solution correct and I am just missing some subtlety of the notation, or is it wrong?

Do you have any suggestions as to what this final expression should actually look like? I tried it myself and arrived at \begin{align} \sum_{t=1}^{N}\sum_{\tau=1}^t \biggl(\prod_{j=\tau+1}^{t} \mathbf{W}_h^{\intercal} \cdot \mathrm{diag}(\mathbf{1}-\mathbf{h}_{j}^{2})\biggr) \mathbf{W}_p^{\intercal} (\mathbf{z}_t-\mathbf{y}_t) ((\mathbf{1}-\mathbf{h}_{\tau}^{2}) \circ \mathbf{h}_{\tau-1}^{\intercal}) \end{align} where $\circ$ is elementwise multiplication of vectors. I am pretty sure it is wrong because I changed the multiplication order, but this is the only way I found to get the dimensions right, i.e. to come out with a matrix in $\mathbb{R}^{s\times s}$. Any help is highly appreciated.
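In case it helps to pin down which expression is right, this is the finite-difference check I am using to test candidate formulas numerically. It only relies on the forward pass defined above (placeholder sizes, random parameters and data) and approximates $\partial E/\partial (\mathbf{W}_h)_{ij}$ by central differences, so any correct closed-form expression should reproduce its output:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, m, N = 3, 4, 2, 5                # placeholder sizes and sequence length

W_x = rng.normal(size=(s, n)); b_x = rng.normal(size=s)
W_p = rng.normal(size=(m, s)); b_p = rng.normal(size=m)
b_h = rng.normal(size=s)
W_h = rng.normal(size=(s, s))
xs = rng.normal(size=(N, n)); ys = rng.normal(size=(N, m))

def total_error(W_h):
    """Run the whole sequence and return E = sum_t 0.5 * ||z_t - y_t||^2."""
    h = np.zeros(s)
    E = 0.0
    for t in range(N):
        h = np.tanh(W_h @ h + b_h + W_x @ xs[t] + b_x)
        z = W_p @ h + b_p
        E += 0.5 * np.sum((z - ys[t]) ** 2)
    return E

# Central finite differences, one entry of W_h at a time.
eps = 1e-6
num_grad = np.zeros_like(W_h)
for i in range(s):
    for j in range(s):
        W_plus = W_h.copy();  W_plus[i, j] += eps
        W_minus = W_h.copy(); W_minus[i, j] -= eps
        num_grad[i, j] = (total_error(W_plus) - total_error(W_minus)) / (2 * eps)

print(num_grad)   # any candidate analytic expression should match this s x s matrix
```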