I am trying to calculate the gradient of the softmax $$p_j=[f(\vec{x})]_j = \frac{e^{W_jx+b_j}}{\sum_k e^{W_kx+b_k}}$$ with the cross-entropy loss $$L = -\sum_j y_j \log p_j.$$ Using this question, I get $$\frac{\partial L}{\partial o_i} = p_i - y_i,$$ where $o_i=W_ix+b_i$ and $W_i$ denotes the $i$-th row of $W$.
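As a sanity check on that identity, here is a small finite-difference comparison in NumPy (the toy sizes and variable names are my own, not from the linked question):

```python
# Finite-difference check of dL/do_i = p_i - y_i (toy sizes, my own names).
import numpy as np

rng = np.random.default_rng(0)
o = rng.normal(size=4)           # logits o_i = W_i x + b_i
y = np.array([0., 1., 0., 0.])   # one-hot target

def loss(o):
    p = np.exp(o - o.max())      # shift by max for numerical stability
    p /= p.sum()
    return -np.sum(y * np.log(p))

p = np.exp(o - o.max())
p /= p.sum()
analytic = p - y                 # the claimed gradient w.r.t. the logits

eps = 1e-6
numeric = np.array([(loss(o + eps * e) - loss(o - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(analytic, numeric, atol=1e-6))  # should print True
```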
So, applying the chain rule, I get $$\frac{\partial L}{\partial b_i}=\frac{\partial L}{\partial o_i}\frac{\partial o_i}{\partial b_i} = (p_i - y_i)\cdot 1=p_i - y_i,$$ which makes sense dimension-wise, but $$\frac{\partial L}{\partial W_i}=\frac{\partial L}{\partial o_i}\frac{\partial o_i}{\partial W_i} = (p_i - y_i)\,\vec{x}$$ seems to have a dimensionality mismatch
(for example, with dimensions $W_{4\times 3}$, $\vec{b}_4$, $\vec{x}_3$: the full gradient $\frac{\partial L}{\partial W}$ should be $4\times 3$ like $W$, but the formula pairs the 4-vector $p - y$ with the 3-vector $\vec{x}$, and that product is not defined).
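To see the shapes concretely, here is a minimal NumPy shape check with those sizes (illustrative values and names of my own); the outer product on the last line is what I suspect the per-row pieces $(p_i - y_i)\vec{x}$ should assemble into, but I am not sure:

```python
# Shape bookkeeping with W 4x3, b in R^4, x in R^3 (illustrative values).
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))      # 4 classes, 3 input features
b = rng.normal(size=4)
x = rng.normal(size=3)
y = np.array([0., 0., 1., 0.])   # one-hot target

o = W @ x + b                    # logits, shape (4,)
p = np.exp(o - o.max())
p /= p.sum()

print((p - y).shape)             # (4,)  -- one entry per class
print(x.shape)                   # (3,)  -- one entry per feature
print(np.outer(p - y, x).shape)  # (4, 3) -- the only pairing that matches W
```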
What am I doing wrong, and what is the correct gradient?