I am doing a multi-part homework assignment about differentiating neural networks. The first part asked me to derive the gradient of the log softmax, indexed at output dimension $w$, with respect to one of the inputs $y_i$.
The log softmax indexed at output dimension $w$ was written as:
$\log([S(y)]_w)$ where $[S(y)]_w = \frac{e^{y_w}}{\sum_j e^{y_j}}$
I found:
$\frac{\partial \log([S(y)]_w)}{\partial y_i} = \begin{cases} -[S(y)]_i & w\neq i \\ 1 - [S(y)]_i & w = i \end{cases}$
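To make sure I had the two cases right, I checked them against finite differences in numpy. This is just my own sanity check, not part of the assignment; the vector size and the choice of $w$ are arbitrary.

```python
import numpy as np

def log_softmax(y):
    # log S(y)_w = y_w - log sum_j exp(y_j), shifted for numerical stability
    y = y - y.max()
    return y - np.log(np.exp(y).sum())

rng = np.random.default_rng(0)
y = rng.normal(size=5)
w = 2                              # output dimension I differentiate at
eps = 1e-6

S = np.exp(y - y.max()) / np.exp(y - y.max()).sum()   # softmax vector S(y)

for i in range(len(y)):
    y_plus = y.copy();  y_plus[i]  += eps
    y_minus = y.copy(); y_minus[i] -= eps
    fd = (log_softmax(y_plus)[w] - log_softmax(y_minus)[w]) / (2 * eps)
    analytic = (1.0 if i == w else 0.0) - S[i]        # the two cases above
    print(i, fd, analytic)                            # agree to ~1e-9 for me
```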
Then I was asked to find:
$\frac{\partial}{\partial B} \log(p_w)$ where $p_w = [S(Bh)]_w$
So I did the following:
$=\frac{\partial}{\partial B} \log([S(Bh)]_w)$ (substitution)
$=\frac{\partial}{\partial Bh}\log([S(Bh)]_w)\frac{\partial}{\partial B} Bh$ (chain rule)
I calculated: $\frac{\partial}{\partial B} Bh = h$
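I convinced myself of that last step numerically: perturbing a single entry $B_{kj}$ only moves the $k$-th component of $Bh$, and it moves by $h_j$ per unit of perturbation. Again, just a quick numpy check with made-up sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 3))   # made-up sizes: 4 outputs, 3 inputs
h = rng.normal(size=3)
eps = 1e-6

k, j = 2, 1                   # perturb a single entry B[k, j]
B_pert = B.copy()
B_pert[k, j] += eps

delta = (B_pert @ h - B @ h) / eps
print(delta)                  # only component k changes, and by h[j]
print(h[j])
```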
Now for my trouble: I have already calculated $\frac{\partial}{\partial y_i}\log([S(y)]_w)$, so I should be able to just use this for $\frac{\partial}{\partial Bh}\log([S(Bh)]_w)$. But my result has cases. How can I handle these? Have I made an error? Also, I need the vector result from that to be a row vector to multiply by $h$ so that my answer has the right dimensionality. Do I have to transpose it?
Edit*
I found a one-hot vector notation that might work. Would it be OK to represent it like so?
$\frac{\partial}{\partial Bh}\log([S(Bh)]_w) = e_w - [S(Bh)]_w$
Edit**
So my final answer is:
$\frac{\partial}{\partial B} \log(p_w) = (e_w - [S(Bh)]_w)h^T$
Is this notation correct?
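For what it's worth, here is the finite-difference check I ran on that final expression. To make the shapes work out I had to treat the softmax term as the full vector $S(Bh)$ rather than the scalar $[S(Bh)]_w$, which is exactly the part of the notation I'm unsure about. Numpy again, with arbitrary sizes.

```python
import numpy as np

def log_p_w(B, h, w):
    z = B @ h
    z = z - z.max()                       # stable log-softmax
    return z[w] - np.log(np.exp(z).sum())

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 3))
h = rng.normal(size=3)
w = 1
eps = 1e-6

# Candidate gradient: outer product (e_w - S(Bh)) h^T, same shape as B.
z = B @ h
S = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
e_w = np.zeros(4); e_w[w] = 1.0
analytic = np.outer(e_w - S, h)

# Finite differences over every entry of B.
numeric = np.zeros_like(B)
for a in range(B.shape[0]):
    for b in range(B.shape[1]):
        Bp = B.copy(); Bp[a, b] += eps
        Bm = B.copy(); Bm[a, b] -= eps
        numeric[a, b] = (log_p_w(Bp, h, w) - log_p_w(Bm, h, w)) / (2 * eps)

print(np.max(np.abs(numeric - analytic)))   # ~1e-9 for me
```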