
Can someone explain step by step how to find the derivative of this softmax loss function/equation.

\begin{equation} L_i=-log(\frac{e^{f_{y_{i}}}}{\sum_j e^{f_j}}) = -f_{y_i} + log(\sum_j e^{f_j}) \end{equation}

where each score is given by: \begin{equation} f_j = w_j \cdot x_i \end{equation} let:

\begin{equation} p = \frac{e^{f_{y_{i}}}}{\sum_j e^{f_j}} \end{equation}

The code shows that the derivative of $L_i$ with respect to the weights $w_j$ when $j = y_i$ is:

\begin{equation} (p-1) * x_i \end{equation}

and when $j \neq y_i$ the derivative is:

\begin{equation} p * x_i \end{equation}

It seems related to this post, where the OP says the derivative of:

\begin{equation} p_j = \frac{e^{o_j}}{\sum_k e^{o_k}} \end{equation}

is:

\begin{equation} \frac{\partial p_j}{\partial o_i} = p_i(1 - p_i),\quad i = j \end{equation}

But I couldn't figure it out. I'm used to taking derivatives with respect to variables, but not with respect to indexed components like these.
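For concreteness, here is a minimal numpy sketch of what I understand those two expressions to be computing (the class count, dimensions, and variable names are made up purely for illustration):

```python
import numpy as np

# Illustrative setup: one training example, a few classes.
np.random.seed(0)
num_classes, dim = 4, 3
W = np.random.randn(num_classes, dim)   # one row of weights w_j per class
x_i = np.random.randn(dim)              # a single training example
y_i = 2                                 # its correct class index

f = W @ x_i                             # scores f_j = w_j . x_i
p = np.exp(f) / np.sum(np.exp(f))       # softmax probabilities
L_i = -np.log(p[y_i])                   # the loss above

# The gradients the code seems to use:
#   dL_i/dw_j = (p_j - 1) * x_i   if j == y_i
#   dL_i/dw_j =  p_j      * x_i   otherwise
dW = np.outer(p, x_i)
dW[y_i] -= x_i
```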

Khon
    I had a similar question and got an amazing answer: http://stackoverflow.com/questions/37790990/derivative-of-a-softmax-function-explanation/37791611#37791611 – Roshini Jun 13 '16 at 14:10
  • Cool, thanks @roshini I'll check that out! – Khon Jun 14 '16 at 17:20

2 Answers


We have a softmax-based loss function component given by: $$L_i=-log\left(\frac{e^{f_{y_i}}}{\sum_{j=0}^ne^{f_j}}\right)$$

Where:

  1. $f$ is the vector of class scores obtained during classification, so $f_j$ is the score for class $j$
  2. $y_i$ is the index of the correct label for example $i$, where $y$ is the column vector of correct labels for all training examples

Objective is to find: $$\frac{\partial L_i}{\partial f_k}$$

Let's break $L_i$ down into two separate expressions. The loss component: $$L_i=-log(p_{y_i})$$

And the vector of normalized probabilities:

$$p_k=\frac{e^{f_{k}}}{\sum_{j=0}^ne^{f_j}}$$

Let's abbreviate the sum as:

$$\sigma=\sum_{j=0}^ne^{f_j}$$

For $k={y_i}$, using the quotient rule:

$$\frac{\partial p_{y_i}}{\partial f_k} = \frac{e^{f_{y_i}}\sigma-e^{2f_{y_i}}}{\sigma^2}$$

For $k\neq{y_i}$, the numerator $e^{f_{y_i}}$ is treated as a constant during differentiation:

$$\frac{\partial p_{y_i}}{\partial f_k} = \frac{-e^{f_{y_i}}e^{f_k}}{\sigma^2}$$
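If it helps, both cases can be verified symbolically. A minimal sympy sketch with three classes (the number is arbitrary), where class 0 plays the role of $y_i$:

```python
import sympy as sp

# Three arbitrary score variables; class 0 plays the role of y_i.
f0, f1, f2 = sp.symbols('f0 f1 f2')
sigma = sp.exp(f0) + sp.exp(f1) + sp.exp(f2)
p_yi = sp.exp(f0) / sigma               # p_{y_i}

# Case k = y_i: quotient rule gives (e^{f_{y_i}}*sigma - e^{2 f_{y_i}}) / sigma^2
lhs = sp.diff(p_yi, f0)
rhs = (sp.exp(f0)*sigma - sp.exp(2*f0)) / sigma**2
print(sp.simplify(lhs - rhs))           # prints 0

# Case k != y_i: derivative is -e^{f_{y_i}} * e^{f_k} / sigma^2
lhs = sp.diff(p_yi, f1)
rhs = -sp.exp(f0)*sp.exp(f1) / sigma**2
print(sp.simplify(lhs - rhs))           # prints 0
```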

Going further:

$$\frac{\partial L_i}{\partial p_{y_i}}=-\left(\frac {1}{p_{y_i}}\right)$$

Using the chain rule:

$$\frac{\partial L_i}{\partial f_k}=\frac{\partial L_i}{\partial p_{y_i}}\,\frac{\partial p_{y_i}}{\partial f_k}=-\left(\frac {1}{\frac{e^{f_{y_i}}}{\sigma}}\right)\frac{\partial p_{y_i}}{\partial f_k}=-\left(\frac {\sigma}{e^{f_{y_i}}}\right)\frac{\partial p_{y_i}}{\partial f_k}$$

Considering the two cases of $k$ relative to $y_i$: for $k=y_i$, after simplification:

$$\frac{\partial L_i}{\partial f_k}=\frac{e^{f_k}-\sigma}{\sigma}=\frac{e^{f_k}}{\sigma}-1=p_k-1$$

And for $k\neq y_i$:

$$\frac{\partial L_i}{\partial f_k}=\frac{e^{f_k}}{\sigma}=p_k$$

These two cases can be combined using the Kronecker delta:

$$\frac{\partial L_i}{\partial f_k}=p_k-\delta_{ky_i}$$
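As a sanity check, this combined expression can be compared against a numerical gradient of $L_i$. A minimal numpy sketch (class count and random seed are arbitrary):

```python
import numpy as np

np.random.seed(1)
n = 5                                     # number of classes (arbitrary)
f = np.random.randn(n)                    # scores
y_i = 3                                   # correct class index

def loss(f):
    return -f[y_i] + np.log(np.sum(np.exp(f)))

p = np.exp(f) / np.sum(np.exp(f))
analytic = p - (np.arange(n) == y_i).astype(float)   # p_k - delta_{k, y_i}

# Central-difference numerical gradient
eps = 1e-6
numeric = np.zeros(n)
for k in range(n):
    e = np.zeros(n)
    e[k] = eps
    numeric[k] = (loss(f + e) - loss(f - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny (finite-difference error only)
```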


Instead of index notation, apply the $\log(\cdot)$ and $\exp(\cdot)$ functions elementwise to vector arguments, define the all-ones vector $\mathbf 1$, and introduce the following variables:

$$\begin{aligned}
x &= \exp(f) &&\implies dx = x\odot df \\
p &= \frac{x}{\mathbf 1^Tx} &&\implies dp=\frac{\left(\mathbf 1^Tx\right)dx-x\left(\mathbf 1^Tdx\right)}{\left(\mathbf 1^Tx\right)^2} \;=\; p\odot df\,-\,pp^T\,df \\
P &= \operatorname{Diag}(p) &&\implies dp=\left(P-pp^T\right)df \;\implies\; \frac{\partial p}{\partial f} = P-pp^T \\
\end{aligned}$$

where $\odot$ is the elementwise (aka Hadamard) product.
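That Jacobian identity admits a quick finite-difference check. A minimal numpy sketch (the dimension is chosen arbitrarily):

```python
import numpy as np

np.random.seed(2)
n = 4
f = np.random.randn(n)

x = np.exp(f)
p = x / x.sum()
J_analytic = np.diag(p) - np.outer(p, p)   # P - p p^T

# Finite-difference Jacobian of p with respect to f
eps = 1e-6
J_numeric = np.zeros((n, n))
for k in range(n):
    e = np.zeros(n)
    e[k] = eps
    p_plus = np.exp(f + e) / np.exp(f + e).sum()
    p_minus = np.exp(f - e) / np.exp(f - e).sum()
    J_numeric[:, k] = (p_plus - p_minus) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))  # tiny (finite-difference error only)
```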

Now write the loss function, where $y$ is the one-hot vector for the correct class (so $\mathbf 1^Ty=1$), and differentiate:

$$\begin{aligned}
\mathcal L &= -y^T\log(p) \\
d\mathcal L &= -y^T\,d\log(p) \\
&= -y^TP^{-1}\,dp \\
&= -y^TP^{-1}\left(P-pp^T\right)df \\
&= y^T\left(\mathbf 1p^T-I\right)df \\
&= \left(p-y\right)^Tdf \\
\frac{\partial\mathcal L}{\partial f} &= p-y \\
\end{aligned}$$
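In code the final result is very compact. A minimal numpy sketch, assuming a one-hot label vector $y$ as in the derivation above (sizes and seed are arbitrary):

```python
import numpy as np

np.random.seed(3)
n = 4                                     # number of classes (arbitrary)
f = np.random.randn(n)                    # scores
y = np.zeros(n)
y[1] = 1.0                                # one-hot label vector

p = np.exp(f) / np.exp(f).sum()
L = -y @ np.log(p)                        # scalar cross-entropy loss
grad_f = p - y                            # dL/df from the derivation above
```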

greg