
I am currently teaching myself the basics of neural networks and backpropagation, but there are some steps in the derivation of the derivative of the cross-entropy loss function with the softmax activation function that I do not understand. Given the loss function: $$ L=-\sum_{k}y_k\ln{(\hat{y}_k)} $$ and the activation function: $$ \hat{y}_k=\mathrm{softmax}{(\mathbf{z})}_{k}=\frac{\exp{(z_k)}}{\sum_{j}\exp{(z_j)}} $$ If I take the partial derivative of $L$ w.r.t. the weight $w_{r,i}$, I get: \begin{align*} \frac{\partial L}{\partial w_{r,i}}&=\frac{\partial L}{\partial \hat{y}_r}\frac{\partial \hat{y}_r}{\partial z_r}\frac{\partial z_r}{\partial w_{r,i}} \end{align*}

Applying the chain rule: \begin{align*} \frac{\partial L}{\partial w_{r,i}}&=\frac{\partial }{\partial \hat{y}_r}\left[-\sum_{k}y_k\ln{(\hat{y}_k)}\right]\\ &=-\frac{y_r}{\hat{y}_r}\frac{\partial}{\partial z_r}\left[\frac{\exp{(z_r)}}{\sum_{j}\exp{(z_j)}}\right]\tag{*}\\ &=-\frac{y_r}{\hat{y}_r}\hat{y}_r(1-\hat{y}_r)\frac{\partial z_r}{\partial w_{r,i}}\\ &=-y_r(1-\hat{y}_r)x_i \end{align*}

My mistake seems to happen right at the beginning, at (*), because I take the derivative of $L$ w.r.t. $\hat{y}_r$, which eliminates the sum for every $k\ne r$. In all the sources I looked at, the derivative is distributed inside the sum w.r.t. $\hat{y}_k$ instead of $\hat{y}_r$. But why is that? If one takes the derivative of the MSE function, the sum typically vanishes by the same argument as here. Why do we have to distribute the derivative inside the sum of the cross-entropy loss function, while we do not have to do this when dealing with the MSE function?
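To make the discrepancy concrete, here is a quick numerical sanity check (just a sketch with NumPy; the sizes, seed, and max-shift for stability are my own choices) that compares my formula $-y_r(1-\hat{y}_r)x_i$ with a central-difference estimate of $\partial L/\partial w_{r,i}$ for a class $r$ whose target $y_r$ is zero:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 4                                # 3 classes, 4 inputs (arbitrary)
W = rng.normal(size=(K, D))
x = rng.normal(size=D)
y = np.array([0.0, 1.0, 0.0])              # one-hot target

def softmax(z):
    e = np.exp(z - z.max())                # shift by max for numerical stability
    return e / e.sum()

def loss(W):
    return -np.sum(y * np.log(softmax(W @ x)))   # z_r = sum_i w_{r,i} x_i

r, i, eps = 0, 2, 1e-6                     # pick a class with y_r = 0
Wp, Wm = W.copy(), W.copy()
Wp[r, i] += eps
Wm[r, i] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)

mine = -y[r] * (1 - softmax(W @ x)[r]) * x[i]    # my formula gives exactly 0 here

print(numeric, mine)                       # the two values disagree
```

My formula returns zero for every class with $y_r=0$, while the numerical gradient does not, which is exactly the part that confuses me.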

Thanks in advance for your help!

  • $\def\p{\partial} \def\y{\widehat y}$ Contrary to popular belief, the gradient of the softmax function is actually $$ \frac{\p\y}{\p z} = {\rm Diag}(\y) - \y\y^T $$ So, assuming $\,\displaystyle\sum_k y_k\equiv 1\,$, the gradient of $L$ is $$\eqalign{ \frac{\p L}{\p w} = \big(\y-y\big)\,x^T }$$ – greg Aug 02 '24 at 16:01
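The Jacobian formula in this comment can also be checked numerically; the following small sketch (NumPy, with an arbitrary logit vector) compares ${\rm Diag}(\hat{y})-\hat{y}\hat{y}^T$ with a finite-difference Jacobian of the softmax:

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=4)                     # arbitrary logits

def softmax(z):
    e = np.exp(z - z.max())                # shift by max for numerical stability
    return e / e.sum()

y_hat = softmax(z)
jac_formula = np.diag(y_hat) - np.outer(y_hat, y_hat)   # Diag(y_hat) - y_hat y_hat^T

# finite-difference Jacobian: column i holds d y_hat / d z_i
eps = 1e-6
jac_numeric = np.column_stack([
    (softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps)
    for e in np.eye(len(z))
])

print(np.allclose(jac_formula, jac_numeric, atol=1e-6))  # should print True
```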

1 Answer


I was doing some research on the question you posed and came across some good reading material on how to solve it. You will find this question helpful:

Derivative of Softmax loss function

To solve your problem, I would first write it out in these three terms:

$$z=Wx^T+b$$ $$\hat{y}_k=\mathrm{softmax}{(z)}_{k}=\frac{\exp{(z_k)}}{\sum\limits_{j}\exp{(z_j)}}$$ $$L=-\sum_{k}y_k\ln{(\hat{y}_k)}$$
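As a side note, these three steps translate directly into code; here is a minimal sketch (NumPy, treating $x$ as a plain 1-D vector, with a max-shift added purely for numerical stability):

```python
import numpy as np

def forward(W, x, b, y):
    """Compute z = Wx + b, y_hat = softmax(z), and the cross-entropy loss L."""
    z = W @ x + b                          # pre-activations
    exp_z = np.exp(z - z.max())            # subtract max for numerical stability
    y_hat = exp_z / exp_z.sum()            # softmax probabilities
    L = -np.sum(y * np.log(y_hat))         # cross-entropy
    return z, y_hat, L
```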

So I would begin by doing the following:

\begin{align} \frac{\partial L}{\partial W}&=\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial W} \\ &= \left(-y_i(1-\hat{y}_i)-\sum\limits_{k\ne i}\frac{y_k}{\hat{y}_k}(-\hat{y}_k\hat{y}_i)\right)x^T \\ &= \left(-y_i+y_i\hat{y}_i+\sum\limits_{k\ne i}y_k\hat{y}_i\right)x^T \end{align}

Since the labels satisfy $\sum_k y_k=1$, the last line simplifies to $-y_i+\hat{y}_i\big(y_i+\sum_{k\ne i}y_k\big)=\hat{y}_i-y_i$, so the final answer is: \begin{equation} (\hat{y}_i -y_i)x^T \end{equation}
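If you want to convince yourself of this result, here is a quick gradient check (a sketch in NumPy; the sizes, seed, and one-hot target are arbitrary) comparing $(\hat{y}-y)x^T$ with central finite differences of $L$ with respect to $W$:

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 3, 4
W, b = rng.normal(size=(K, D)), rng.normal(size=K)
x = rng.normal(size=D)
y = np.array([1.0, 0.0, 0.0])              # one-hot target

def softmax(z):
    e = np.exp(z - z.max())                # shift by max for numerical stability
    return e / e.sum()

def loss(W):
    return -np.sum(y * np.log(softmax(W @ x + b)))

# analytic gradient: (y_hat - y) x^T, an outer product of shape (K, D)
analytic = np.outer(softmax(W @ x + b) - y, x)

# central finite differences, one weight at a time
numeric, eps = np.zeros_like(W), 1e-6
for r in range(K):
    for i in range(D):
        Wp, Wm = W.copy(), W.copy()
        Wp[r, i] += eps
        Wm[r, i] -= eps
        numeric[r, i] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # should print True
```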

Jose M Serra