In recurrent neural network backpropagation (BPTT), we have the equations: \begin{align} e_t &= E^T x_t \\ a_t &= W_{hx}^T e_t+ W_{hh}^T h_{t-1}\\ h_t &= \text{tanh}(a_t) \\ s_t &= W_{yh}^T h_t \\ \hat{y}_t &= \text{softmax}(s_t) \\ L_t &= \text{CE}(\hat{y}_t, y_t) \end{align} where $e_t$ has shape $(n_i, 1)$, $a_t, h_t$ have shape $(n_h,1)$, and $s_t, \hat{y}_t$ have shape $(K,1)$. The matrices $W_{hx}, W_{hh}, W_{yh}$ have respective shapes $(n_i, n_h), (n_h, n_h), (n_h, K)$. $L_t$ is the scalar cross-entropy loss, and the subscript $t$ denotes the time step.
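For concreteness, here is a minimal NumPy sketch of this forward pass with the stated shapes; the vocabulary size $V$, the one-hot $x_t$, the zero initial state $h_0$ and the concrete sizes are assumptions of mine, since they are not fixed above.

```python
import numpy as np

# Minimal forward-pass sketch with the shapes above. The vocabulary size V,
# the one-hot x_t, and the zero initial state h_0 are my own assumptions,
# since the shapes of E and x_t are not fixed in the question.
rng = np.random.default_rng(0)
n_i, n_h, K, V = 4, 3, 5, 7

E    = rng.standard_normal((V, n_i))        # so e_t = E^T x_t has shape (n_i, 1)
W_hx = rng.standard_normal((n_i, n_h))
W_hh = rng.standard_normal((n_h, n_h))
W_yh = rng.standard_normal((n_h, K))

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

def step(x_t, h_prev):
    e_t = E.T @ x_t                         # (n_i, 1)
    a_t = W_hx.T @ e_t + W_hh.T @ h_prev    # (n_h, 1)
    h_t = np.tanh(a_t)                      # (n_h, 1)
    s_t = W_yh.T @ h_t                      # (K, 1)
    return e_t, a_t, h_t, softmax(s_t)      # y_hat_t has shape (K, 1)

x_1 = np.eye(V)[:, [2]]                     # arbitrary one-hot input at t = 1
h_0 = np.zeros((n_h, 1))
e_1, a_1, h_1, y_hat_1 = step(x_1, h_0)
y_1 = np.eye(K)[:, [0]]                     # arbitrary one-hot target
L_1 = float(-(y_1 * np.log(y_hat_1)).sum()) # cross-entropy loss at t = 1
```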
I would like to express the loss gradient with respect to the weight matrix transpose $W_{hx}^T$ at the first step, that is: \begin{align} \frac{\partial L_1}{\partial W_{hx}^T} &= \frac{\partial h_1^T}{\partial W_{hx}^T} \frac{\partial L_1}{\partial h_1} \end{align} In particular, $\frac{\partial h_1^T}{\partial W_{hx}^T}$ requires vectorising the matrix in the denominator. Since $W_{hx}^T$ has shape $(n_h, n_i)$, $\text{vec}(W_{hx}^T)$ has shape $(n_h n_i, 1)$.
I am having trouble with:
- expressing $\frac{\partial h_1^T}{\partial \text{vec}(W_{hx}^T)}$.
- and expressing $\frac{\partial L_1}{\partial W_{hx}^T}$ from $\frac{\partial L_1}{\partial \text{vec}(W_{hx}^T)}$, that is de-vectorisation operation.
My attempt so far:
- $\frac{\partial h_1}{\partial \text{vec}(W_{hx}^T)}$. Let $f = W_{hx}^T e_1 = I_{n_h} W_{hx}^T e_1$, so $f$ also has shape $(n_h,1)$. Using the identity $\text{vec}(AXB) = (B^T \otimes A)\,\text{vec}(X)$, \begin{align} df &= I_{n_h}\, dW_{hx}^T\, e_1 \\ d\,\text{vec}(f) &= (e_1^T \otimes I_{n_h})\, d\,\text{vec}(W_{hx}^T) \\ \frac{\partial f}{\partial \text{vec}(W_{hx}^T)^T} &= e_1^T \otimes I_{n_h} \\ &= \frac{\partial a_1}{\partial \text{vec}(W_{hx}^T)^T} \end{align}
Then \begin{align} \frac{\partial h_1}{\partial \text{vec}(W_{hx}^T)} &= \frac{\partial a_1^T}{\partial \text{vec}(W_{hx}^T)} \frac{\partial h_1}{\partial a_1} \\ &= (e_1^T \otimes I_{n_h})^T \text{ diag}(1- h_1^2) \\ &= (e_1 \otimes I_{n_h}) \text{ diag}(1- h_1^2) \end{align} this has shape $(n_in_h, n_h)$.
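As a numerical sanity check of the Kronecker step (just a test of the identity $\text{vec}(I_{n_h}\, dW_{hx}^T\, e_1) = (e_1^T \otimes I_{n_h})\, \text{vec}(dW_{hx}^T)$ with column-major $\text{vec}$, on arbitrary sizes):

```python
import numpy as np

# Numerical check of the vec/Kronecker step: for any perturbation D of W_hx^T,
# vec(I D e_1) = kron(e_1^T, I_{n_h}) @ vec(D), with vec = column-major stacking.
rng = np.random.default_rng(1)
n_i, n_h = 4, 3
e_1 = rng.standard_normal((n_i, 1))
D   = rng.standard_normal((n_h, n_i))           # plays the role of dW_hx^T

vec = lambda M: M.reshape(-1, 1, order="F")     # column-major vectorisation
lhs = vec(D @ e_1)
rhs = np.kron(e_1.T, np.eye(n_h)) @ vec(D)
print(np.allclose(lhs, rhs))                    # True
```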
$\frac{\partial h_1^T}{\partial \text{vec}(W_{hx}^T)}$ seems to have the same result and shape as $\frac{\partial h_1}{\partial \text{vec}(W_{hx}^T)} $. Can anyone confirm if I am right here? \begin{align} \frac{\partial h_1^T}{\partial \text{vec}(W_{hx}^T)}&= \frac{\partial a_1^T}{\partial \text{vec}(W_{hx}^T)} \frac{\partial h_1^T}{\partial a_1} \\ &= (e_1 \otimes I_{n_h}) \text{ diag}(1- h_1^2) \end{align}
Finally the first-step (vectorised) loss gradient: \begin{align} \frac{\partial L_1}{\partial \text{vec}(W_{hx}^T)} &= \frac{\partial h_1^T}{\partial \text{vec}(W_{hx}^T)} \frac{\partial L_1}{\partial h_1} \\ &= (e_1 \otimes I_{n_h}) \text{ diag}(1- h_1^2) \frac{\partial L_1}{\partial h_1} \end{align} where $\frac{\partial L_1}{\partial h_1}$ has shape $(n_h, 1)$.
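To check this numerically, I compared the expression above against central finite differences of $L_1$ over $\text{vec}(W_{hx}^T)$. The sketch below assumes $h_0 = 0$, skips the embedding step (it treats $e_1$ as a given input column), uses a one-hot target $y_1$, and uses the standard softmax cross-entropy gradient $\frac{\partial L_1}{\partial h_1} = W_{yh}(\hat{y}_1 - y_1)$.

```python
import numpy as np

# Finite-difference check of dL_1/dvec(W_hx^T). Assumptions on my part:
# h_0 = 0, the embedding step is skipped (e_1 is a given input column),
# y_1 is a one-hot target, and dL_1/dh_1 = W_yh (y_hat_1 - y_1) from the
# standard softmax + cross-entropy gradient.
rng = np.random.default_rng(2)
n_i, n_h, K = 4, 3, 5
W_hh = rng.standard_normal((n_h, n_h))
W_yh = rng.standard_normal((n_h, K))
e_1  = rng.standard_normal((n_i, 1))
h_0  = np.zeros((n_h, 1))
y_1  = np.eye(K)[:, [0]]

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

def forward(WT):                                # WT plays the role of W_hx^T, shape (n_h, n_i)
    h_1 = np.tanh(WT @ e_1 + W_hh.T @ h_0)
    y_hat_1 = softmax(W_yh.T @ h_1)
    L_1 = float(-(y_1 * np.log(y_hat_1)).sum())
    return L_1, h_1, y_hat_1

WT = rng.standard_normal((n_h, n_i))
L_1, h_1, y_hat_1 = forward(WT)
dL_dh1   = W_yh @ (y_hat_1 - y_1)               # dL_1/dh_1, shape (n_h, 1)
analytic = np.kron(e_1, np.eye(n_h)) @ np.diagflat(1 - h_1**2) @ dL_dh1

numeric = np.zeros_like(analytic)               # shape (n_i*n_h, 1)
eps = 1e-6
for k in range(n_i * n_h):
    P = np.zeros((n_i * n_h, 1)); P[k] = eps    # unit perturbation of vec(W_hx^T)
    dW = P.reshape(n_h, n_i, order="F")         # de-vectorise it (column-major)
    numeric[k, 0] = (forward(WT + dW)[0] - forward(WT - dW)[0]) / (2 * eps)

print(np.allclose(analytic, numeric))           # True
```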
So $\frac{\partial L_1}{\partial \text{vec}(W_{hx}^T)}$ has shape $(n_i n_h, 1)$, which the numerical check above agrees with. But how do I express $\frac{\partial L_1}{\partial W_{hx}^T}$?
\begin{align} \frac{\partial L_1}{\partial W_{hx}^T} &= \text{ diag}(1- h_1^2) \frac{\partial L_1}{\partial h_1} e_1^T \end{align} has the right shape $(n_h, n_i)$. But how do I get from the vectorised result $\frac{\partial L_1}{\partial \text{vec}(W_{hx}^T)}$ to this expression, assuming it is correct?
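My guess is that the de-vectorisation is just a column-major reshape, because $(e_1 \otimes I_{n_h})\, v = \text{vec}(v\, e_1^T)$ for any column $v$. A quick numerical check of that guess, with $v$ standing in for $\text{diag}(1- h_1^2) \frac{\partial L_1}{\partial h_1}$:

```python
import numpy as np

# Check of the de-vectorisation guess: (e_1 kron I_{n_h}) v = vec(v e_1^T), so
# reshaping the vectorised gradient back to (n_h, n_i) in column-major order
# should give the matrix expression above. Here v stands for diag(1-h_1^2) dL_1/dh_1.
rng = np.random.default_rng(3)
n_i, n_h = 4, 3
e_1 = rng.standard_normal((n_i, 1))
v   = rng.standard_normal((n_h, 1))

vec_grad = np.kron(e_1, np.eye(n_h)) @ v            # shape (n_i*n_h, 1)
mat_grad = vec_grad.reshape(n_h, n_i, order="F")    # de-vectorise
print(np.allclose(mat_grad, v @ e_1.T))             # True
```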
Thanks in advance for any help.
For anyone interested in BPTT gradients, here is the result for $\frac{\partial L_2}{\partial W_{hx}}$ obtained with @greg's method.
In particular, $a_2 = W_{hx}^T e_2 + W_{hh}^T h_1$, where $h_1$ itself depends on $W_{hx}$. Therefore $d a_2 = dW_{hx}^T\, e_2 + W_{hh}^T\, dh_1$.
This requires an additional identity of the Frobenius product (it is bilinear): $(A+C):(B+D) = A:B+A:D+C:B+C:D$. Starting from the loss $L_2$ and working backwards, this identity comes into play when we reach a term of the form $\cdots : da_2$.
The final result: \begin{align} \frac{\partial L_2}{\partial W_{hx}} &= (e_2 \hat{y}_2^T - e_2 y_2^T) W_{yh}^T (I_{n_h} - H_2^2) + (e_1 \hat{y}_2^T - e_1 y_2^T) W_{yh}^T (I_{n_h} - H_2^2) W_{hh}^T (I_{n_h} - H_1^2) \end{align} with shape $(n_i, n_h)$, where $H_t = \text{diag}(h_t)$.
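A finite-difference sanity check of this expression (a sketch under my own assumptions: $h_0 = 0$, the embedding step skipped so $e_1, e_2$ are given input columns, and a one-hot target $y_2$):

```python
import numpy as np

# Finite-difference check of the stated dL_2/dW_hx (sketch under my own
# assumptions: h_0 = 0, embedding skipped so e_1, e_2 are given inputs,
# one-hot target y_2, and H_t = diag(h_t)).
rng = np.random.default_rng(4)
n_i, n_h, K = 4, 3, 5
W_hx = rng.standard_normal((n_i, n_h))
W_hh = rng.standard_normal((n_h, n_h))
W_yh = rng.standard_normal((n_h, K))
e_1, e_2 = rng.standard_normal((n_i, 1)), rng.standard_normal((n_i, 1))
h_0 = np.zeros((n_h, 1))
y_2 = np.eye(K)[:, [1]]

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

def loss_L2(W):
    h_1 = np.tanh(W.T @ e_1 + W_hh.T @ h_0)
    h_2 = np.tanh(W.T @ e_2 + W_hh.T @ h_1)
    y_hat_2 = softmax(W_yh.T @ h_2)
    return float(-(y_2 * np.log(y_hat_2)).sum()), h_1, h_2, y_hat_2

_, h_1, h_2, y_hat_2 = loss_L2(W_hx)
D_1 = np.diagflat(1 - h_1**2)                   # I_{n_h} - H_1^2
D_2 = np.diagflat(1 - h_2**2)                   # I_{n_h} - H_2^2
g   = y_hat_2 - y_2
analytic = e_2 @ g.T @ W_yh.T @ D_2 + e_1 @ g.T @ W_yh.T @ D_2 @ W_hh.T @ D_1

numeric = np.zeros_like(W_hx)
eps = 1e-6
for i in range(n_i):
    for j in range(n_h):
        Wp, Wm = W_hx.copy(), W_hx.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss_L2(Wp)[0] - loss_L2(Wm)[0]) / (2 * eps)

print(np.allclose(analytic, numeric))           # True
```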