
I am doing a multi-part homework assignment about differentiating neural networks. The first part asked me to derive the gradient of the log softmax indexed at output dimension $w$ with respect to one of the inputs $y_i$.

The log of the softmax function indexed at output dimension $w$ was written as:

$\log([S(y)]_w)$, where $[S(y)]_w = \frac{e^{y_w}}{\sum_j e^{y_j}}$

I found:

$\frac{\partial \log([S(y)]_w)}{\partial y_i} = \begin{cases} -[S(y)]_i & w\neq i \\ 1 - [S(y)]_i & w = i \end{cases}$
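For a quick sanity check of these two cases, here is a small numerical sketch of my own (the vector $y$, the index $w$, and the step size are arbitrary choices, not part of the assignment) comparing the formula, written as the one-hot vector at $w$ minus the softmax vector, against central finite differences:

```python
import numpy as np

def softmax(y):
    """Softmax of a 1-D array, shifted by the max for numerical stability."""
    e = np.exp(y - y.max())
    return e / e.sum()

def log_softmax_w(y, w):
    """log([S(y)]_w), i.e. y_w minus the log-sum-exp of y."""
    m = y.max()
    return y[w] - (m + np.log(np.exp(y - m).sum()))

rng = np.random.default_rng(0)
y = rng.normal(size=5)
w, eps = 2, 1e-6

# Analytic gradient: one-hot at w minus the softmax vector
# (equivalently, 1 - [S(y)]_i when i == w and -[S(y)]_i otherwise).
analytic = np.eye(len(y))[w] - softmax(y)

# Central finite differences, one input component at a time.
numeric = np.zeros_like(y)
for i in range(len(y)):
    e_i = np.eye(len(y))[i]
    numeric[i] = (log_softmax_w(y + eps * e_i, w) - log_softmax_w(y - eps * e_i, w)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # expect True
```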

Then I was asked to find:

$\frac{\partial}{\partial B} \log(p_w)$, where $p_w = [S(Bh)]_w$

So I did the following:

$=\frac{\partial}{\partial B} \log([S(Bh)]_w)$ substitution

$=\frac{\partial}{\partial Bh}\log([S(Bh)]_w)\,\frac{\partial}{\partial B} Bh$ chain rule

I calculated: $\frac{\partial}{\partial B} Bh = h$

Now for my trouble. I have already calculated $\frac{\partial}{\partial y_i}\log([S(y)]_w)$, so I should be able to just use this for $\frac{\partial}{\partial Bh}\log([S(Bh)]_w)$. But my result has cases. How can I handle these? Have I made an error? Also, I need the vector result from that derivative to be a row vector to multiply by $h$ to get the right dimensionality in my answer. Do I have to transpose it?

Edit*

I found a one-hot vector notation that might work. Would it be OK to represent it like so?

$\frac{\partial}{\partial Bh}\log([S(Bh)]_w) = e_w - [S(Bh)]_w$

Edit**

So my final answer is:

$\frac{\partial}{\partial B} \log(p_w) = (e_w - [S(Bh)]_w)h^T$

Is this notation correct?
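As a check that this gives the right values as well as the right shape, here is a small numerical sketch of mine (reading $e_w - [S(Bh)]_w$ as the one-hot column vector $e_w$ minus the full softmax vector $S(Bh)$, so the product with $h^T$ is an outer product the same size as $B$; the dimensions, seed, and helper names are my own assumptions):

```python
import numpy as np

def softmax(y):
    """Softmax of a 1-D array, shifted by the max for numerical stability."""
    e = np.exp(y - y.max())
    return e / e.sum()

rng = np.random.default_rng(1)
m, n = 4, 3                       # output and input dimensions (arbitrary)
B = rng.normal(size=(m, n))
h = rng.normal(size=n)
w, eps = 1, 1e-6

def log_p_w(B):
    """log(p_w) with p_w = [S(Bh)]_w."""
    return np.log(softmax(B @ h)[w])

# Proposed gradient: (e_w - S(Bh)) h^T, an m-by-n matrix like B.
analytic = np.outer(np.eye(m)[w] - softmax(B @ h), h)

# Central finite differences over each entry of B.
numeric = np.zeros_like(B)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(B)
        E[i, j] = eps
        numeric[i, j] = (log_p_w(B + E) - log_p_w(B - E)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # expect True
```

Entry $(i,j)$ of the outer product is $(\delta_{wi} - [S(Bh)]_i)\,h_j$, which is exactly what differentiating $\log(p_w)$ with respect to $B_{ij}$ should give, since $\partial (Bh)_i / \partial B_{ij} = h_j$.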

  • You can use $\delta$ notation to combine the cases: $\delta_{ij} = 0$ if $i \neq j$ and $\delta_{ij} = 1$ if $i = j$. $$\frac{\partial \log([S(y)]_w)}{\partial y_i} = \delta_{wi}-S(y)_i$$ – Guangliang Feb 26 '17 at 18:20
  • Thank you! I had not heard of this notation. For the whole vector can I just drop the i? – user5505266 Feb 26 '17 at 18:28
  • It's called Kronecker delta. Here is the wiki page of it: https://en.wikipedia.org/wiki/Kronecker_delta. – Guangliang Feb 26 '17 at 18:31
  • You're probably better off dealing with this as a matrix instead of a vector, since it has two subscripts, $i$ and $w$. – Guangliang Feb 26 '17 at 18:32
  • Found a related post: http://math.stackexchange.com/questions/2074217/how-to-compute-the-gradient-of-the-softmax-function-w-r-t-matrix?rq=1 – Guangliang Feb 26 '17 at 18:33
  • But $Bh$ is a vector; how can I deal with it as a matrix? – user5505266 Feb 26 '17 at 18:34
  • But when you take a partial derivative, you do it with respect to one component of the vector at a time. Think of the parallel between your $Bh$ and $y$. By the way, is $h$ a scalar? – Guangliang Feb 26 '17 at 18:45
  • Well, I know each element of the vector is $\delta_{wi} - [S(Bh)]_i$; how can I express this as a vector? $w$ is fixed, so I don't think it is a matrix. – user5505266 Feb 26 '17 at 18:48
  • When $w$ is fixed, $\delta_{wi}$ can be seen as a vector with a 1 at the $w$-th element and 0 everywhere else. – Guangliang Feb 26 '17 at 19:18
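To make the last comment concrete, here is a tiny sketch (the dimension and the fixed index are arbitrary choices of mine) showing that the slice $\delta_{wi}$ over $i$, with $w$ fixed, is exactly the one-hot vector $e_w$ used in the edits above:

```python
import numpy as np

n, w = 5, 2  # arbitrary dimension and fixed output index
delta_w = np.array([1.0 if i == w else 0.0 for i in range(n)])  # the slice (delta_{wi})_i
print(np.array_equal(delta_w, np.eye(n)[w]))  # True: it is the one-hot vector e_w
```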
