I have the following problem: I am trying to derive final expressions for the error gradients in a simple recurrent neural network (backpropagation through time, BPTT). The parameters and state-update equations are the following:
$\mathbf{x}_t \in \mathbb{R}^n, \mathbf{y}_t \in \mathbb{R}^m$
$\mathbf{z}_t \in \mathbb{R}^m, \mathbf{h}_t \in \mathbb{R}^{s}, \mathbf{W}_h \in \mathbb{R}^{s\times s}, \mathbf{W}_x \in \mathbb{R}^{s\times n}, \mathbf{b}_h \in \mathbb{R}^{s}, \mathbf{b}_x \in \mathbb{R}^{s}, \mathbf{W}_p \in \mathbb{R}^{m\times s}, \mathbf{b}_p \in \mathbb{R}^{m}$
Hidden state update: $\mathbf{h}_t = \tanh(\mathbf{W}_{h} \cdot \mathbf{h}_{t-1} + \mathbf{b}_h + \mathbf{W}_{x} \cdot \mathbf{x}_t + \mathbf{b}_x)$
Prediction: $\mathbf{z}_t = \mathbf{W}_p \cdot \mathbf{h}_t + \mathbf{b}_p$
Total network error: $E = \sum_{t=1}^{N}E_t$, with $E_t = \frac{1}{2}\sum_{i=1}^m ((\mathbf{z}_t)_i - (\mathbf{y}_t)_i)^{2}$
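To make my conventions concrete, here is a minimal NumPy sketch of a single step of the forward pass as I understand it; all vectors are column vectors (1-D arrays), and the sizes and the random parameters/data are just placeholders for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, m = 3, 4, 2                      # placeholder sizes for illustration

W_h = rng.normal(size=(s, s)); b_h = rng.normal(size=s)
W_x = rng.normal(size=(s, n)); b_x = rng.normal(size=s)
W_p = rng.normal(size=(m, s)); b_p = rng.normal(size=m)

x_t = rng.normal(size=n)               # input at time t
y_t = rng.normal(size=m)               # target at time t
h_prev = np.zeros(s)                   # h_{t-1}

h_t = np.tanh(W_h @ h_prev + b_h + W_x @ x_t + b_x)   # hidden state update
z_t = W_p @ h_t + b_p                                  # prediction
E_t = 0.5 * np.sum((z_t - y_t) ** 2)                   # per-step error
```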
For a training dataset with samples $\mathbf{x}_t$ and targets $\mathbf{y}_t$, $t=1,\dots,N$, I arrived at this expression for the gradient of the total error with respect to $\mathbf{W}_h$:
\begin{align} {\frac{\partial E}{\partial \mathbf{W}_{h}}} & = \sum_{t=1}^{N}\sum_{\tau=1}^t \frac{\partial E_{t}}{\partial \mathbf{h}_{t}} \frac{\partial \mathbf{h}_{t}}{\partial \mathbf{h}_{\tau}}\frac{\partial\mathbf{h}_\tau}{\partial\mathbf{W}_h}\\ &=\sum_{t=1}^{N}\sum_{\tau=1}^{t} \frac{\partial E_{t}}{\partial \mathbf{h}_{t}} \biggl(\prod_{j=\tau+1}^{t} \frac{\partial \mathbf{h}_{j}}{\partial \mathbf{h}_{j-1}}\biggr)\frac{\partial\mathbf{h}_\tau}{\partial\mathbf{W}_h}\\ &=\sum_{t=1}^{N}\sum_{\tau=1}^t \frac{\partial E_{t}}{\partial \mathbf{z}_{t}} \frac{\partial \mathbf{z}_{t}}{\partial \mathbf{h}_{t}}\biggl(\prod_{j=\tau+1}^{t} \frac{\partial \mathbf{h}_{j}}{\partial \mathbf{h}_{j-1}}\biggr)\frac{\partial\mathbf{h}_\tau}{\partial\mathbf{W}_h} \end{align}
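If I write out the individual factors (treating $\frac{\partial E_t}{\partial \mathbf{z}_t}$ as a row vector, i.e. numerator layout), I believe they are
\begin{align}
\frac{\partial E_{t}}{\partial \mathbf{z}_{t}} = (\mathbf{z}_t-\mathbf{y}_t)^{\intercal} \in \mathbb{R}^{1\times m},\qquad
\frac{\partial \mathbf{z}_{t}}{\partial \mathbf{h}_{t}} = \mathbf{W}_p \in \mathbb{R}^{m\times s},\qquad
\frac{\partial \mathbf{h}_{j}}{\partial \mathbf{h}_{j-1}} = \mathrm{diag}(\mathbf{1}-\mathbf{h}_{j}^{2})\,\mathbf{W}_h \in \mathbb{R}^{s\times s},
\end{align}
so these factors chain to a $1\times s$ row vector; my trouble starts with the last factor $\frac{\partial\mathbf{h}_\tau}{\partial\mathbf{W}_h}$, which is a third-order object, and with how the whole product is supposed to come out as an $s\times s$ matrix.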
A solution I found suggests the final result should be \begin{align} \sum_{t=1}^{N}\sum_{\tau=1}^t((\mathbf{z}_t-\mathbf{y}_t) \cdot \mathbf{W}_p)(\prod_{j=\tau+1}^{t}(1-\mathbf{h}_{j}^{2}) \cdot \mathbf{W}_{h})((1-\mathbf{h}_{\tau}^{2}) \otimes \mathbf{h}_{\tau}) \end{align} where $\otimes$ is the outer product. But I am confused about the dimensions involved and about what is meant by multiplying a row vector with a matrix, for example $(\mathbf{z}_t-\mathbf{y}_t) \cdot \mathbf{W}_p$. Is this proposed solution correct and I am just missing some subtlety of the notation, or is it wrong?

Do you have any suggestions as to what this final expression should actually look like? I tried it myself and arrived at \begin{align} \sum_{t=1}^{N}\sum_{\tau=1}^t \biggl(\prod_{j=\tau+1}^{t} \mathbf{W}_h^{\intercal} \cdot \mathrm{diag}(\mathbf{1}-\mathbf{h}_{j}^{2})\biggr) \mathbf{W}_p^{\intercal} (\mathbf{z}_t-\mathbf{y}_t) ((\mathbf{1}-\mathbf{h}_{\tau}^{2}) \circ \mathbf{h}_{\tau-1}^{\intercal}) \end{align} where $\circ$ is elementwise multiplication of vectors. I am pretty sure it is wrong because I changed the multiplication order, but this is the only way I found to get the dimensions right, i.e. to come out with a matrix in $\mathbb{R}^{s\times s}$. Any help is highly appreciated.
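In case it helps to pin down which expression is right, this is the finite-difference check I am using to test candidate formulas numerically. It only relies on the forward pass defined above (placeholder sizes, random parameters and data) and approximates $\partial E/\partial (\mathbf{W}_h)_{ij}$ by central differences, so any correct closed-form expression should reproduce its output:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, m, N = 3, 4, 2, 5                # placeholder sizes and sequence length

W_x = rng.normal(size=(s, n)); b_x = rng.normal(size=s)
W_p = rng.normal(size=(m, s)); b_p = rng.normal(size=m)
b_h = rng.normal(size=s)
W_h = rng.normal(size=(s, s))
xs = rng.normal(size=(N, n)); ys = rng.normal(size=(N, m))

def total_error(W_h):
    """Run the whole sequence and return E = sum_t 0.5 * ||z_t - y_t||^2."""
    h = np.zeros(s)
    E = 0.0
    for t in range(N):
        h = np.tanh(W_h @ h + b_h + W_x @ xs[t] + b_x)
        z = W_p @ h + b_p
        E += 0.5 * np.sum((z - ys[t]) ** 2)
    return E

# Central finite differences, one entry of W_h at a time.
eps = 1e-6
num_grad = np.zeros_like(W_h)
for i in range(s):
    for j in range(s):
        W_plus = W_h.copy();  W_plus[i, j] += eps
        W_minus = W_h.copy(); W_minus[i, j] -= eps
        num_grad[i, j] = (total_error(W_plus) - total_error(W_minus)) / (2 * eps)

print(num_grad)   # any candidate analytic expression should match this s x s matrix
```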