
I am trying to understand the answer to this question. How do you get the following?

$$\nabla_{\mathrm W}\left(\mbox{tr} \left( \mathrm W^{\top} \mathrm X^{\top} \mathrm X \mathrm W - \mathrm Y^{\top} \mathrm X \mathrm W - \mathrm W^{\top} \mathrm X^{\top} \mathrm Y + \mathrm Y^{\top} \mathrm Y \right)\right)$$ $$= 2 \, \mathrm X^{\top} \mathrm X \mathrm W - 2 \, \mathrm X^{\top} \mathrm Y$$

Specifically, I want to know what kind of magic happens to these:

$$-\mathrm Y^{\top} \mathrm X \mathrm W - \mathrm W^{\top} \mathrm X^{\top} \mathrm Y$$

Thank you so much.

CaTx

3 Answers


Alternative approach

The Frobenius inner product, denoted by a colon, is defined as $A : B := {\rm Tr}\left( A^T B \right)$, so in particular \begin{align} {\rm Tr}\left( A^T B C \right) = A: BC \end{align}

We will use the cyclic property of the trace, which lets us move factors from one side of the colon to the other, e.g., \begin{align} A: BCD = B^T A: CD = B^TAD^T: C \end{align}
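The colon identities above can be checked numerically. The following is only a sketch with random matrices; the helper name `frob` and the shapes `m, n, p, q` are assumptions chosen so that every product is well defined, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)

def frob(P, Q):
    """Frobenius inner product P : Q = Tr(P^T Q)."""
    return np.trace(P.T @ Q)

# Hypothetical shapes: A is m x q, B is m x n, C is n x p, D is p x q,
# so that BCD has the same shape as A.
m, n, p, q = 4, 3, 5, 2
A = rng.standard_normal((m, q))
B = rng.standard_normal((m, n))
C = rng.standard_normal((n, p))
D = rng.standard_normal((p, q))

lhs = frob(A, B @ C @ D)        # A : BCD
mid = frob(B.T @ A, C @ D)      # B^T A : CD
rhs = frob(B.T @ A @ D.T, C)    # B^T A D^T : C

# All three numbers should agree up to floating-point rounding.
identity_holds = bool(np.isclose(lhs, mid) and np.isclose(mid, rhs))
print(identity_holds)
```

The same helper also shows that the product is symmetric, `frob(P, Q) == frob(Q, P)` up to rounding, which is what the derivation relies on when it merges the two cross terms.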

To find the gradient, we will use differentials. To this end, we can rewrite the problem at hand as \begin{align} f &:= {\rm Tr}\left( W^TX^TXW - Y^TXW - W^TX^TY + Y^TY \right) \\ &\equiv XW : XW - Y:XW - XW:Y + Y:Y \end{align}

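Compute the differential and then the gradient. Since $Y$ is constant, the $Y:Y$ term drops out, and since the colon product is symmetric, the two cross terms are equal. \begin{align} df &= XdW : XW + XW : XdW - Y:XdW - XdW:Y \\ &= 2X^TXW : dW - X^TY:dW - X^TY:dW \\ &= \left(2X^TXW - 2X^TY\right):dW \end{align}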

The gradient is \begin{align} \frac{\partial f}{\partial W} &= 2X^TXW - 2X^TY . \end{align}
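As a sanity check, the closed-form gradient can be compared with a central finite-difference approximation. This is a minimal sketch on random data; the sizes `n, d, k`, the step `eps`, and the helper `numerical_gradient` are assumptions introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: X is n x d, Y is n x k, W is d x k.
n, d, k = 6, 4, 3
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, k))
W = rng.standard_normal((d, k))

def f(W):
    """The objective from the question, written term by term."""
    return np.trace(W.T @ X.T @ X @ W - Y.T @ X @ W - W.T @ X.T @ Y + Y.T @ Y)

def numerical_gradient(func, W, eps=1e-6):
    """Central finite differences, one entry of W at a time."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = eps
            G[i, j] = (func(W + E) - func(W - E)) / (2 * eps)
    return G

analytic = 2 * X.T @ X @ W - 2 * X.T @ Y
numeric = numerical_gradient(f, W)

# The largest entrywise discrepancy should be at the level of rounding error.
max_abs_error = np.max(np.abs(analytic - numeric))
print(max_abs_error)
```

Because $f$ is quadratic in $W$, the central difference has no truncation error, so any remaining discrepancy comes from floating-point rounding only.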

user550103
  • Thanks, but I am looking to uncover the magic with those two terms. – CaTx Jun 13 '21 at 18:19
  • As commented by mathcounterexamples.net, let $M=Y^T XW$, then $\operatorname{Tr}\left( M \right) = \operatorname{Tr}\left( M^T\right)$, which explains your mystery... – user550103 Jun 13 '21 at 21:50

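The trace of a matrix is equal to the trace of its transpose, and $W^TX^TY = (Y^TXW)^T$. So

$$\operatorname{tr}(Y^T XW+ W^TX^TY)= 2\operatorname{tr}(Y^TXW)$$ and

$$W \mapsto 2\operatorname{tr}(Y^TXW)$$ is linear, so its derivative at every point is the map itself; written as a gradient with respect to $W$, that is $2X^TY$. With the minus signs from the question, the two terms together contribute $-2X^TY$. See derivative in product in trace if required.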
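The steps that are not shown can be sketched entry by entry. The shorthand $A := X^TY$ is introduced here only for illustration, and the trace is applied to each term separately because it is linear. \begin{align} \operatorname{tr}(W^TX^TY) &= \operatorname{tr}\left((W^TX^TY)^T\right) = \operatorname{tr}(Y^TXW), \\ \operatorname{tr}(Y^TXW) &= \operatorname{tr}(A^TW) = \sum_{i,j} A_{ij} W_{ij}, \\ \frac{\partial}{\partial W_{ij}} \operatorname{tr}(A^TW) &= A_{ij} \quad\Longrightarrow\quad \nabla_W \operatorname{tr}(Y^TXW) = A = X^TY. \end{align}

Hence $\nabla_W\left(-\operatorname{tr}(Y^TXW) - \operatorname{tr}(W^TX^TY)\right) = -2X^TY$, which is the second term of the result in the question.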

  • For the first line, you take the individual trace? Can you provide the details that are not shown? – CaTx Jun 13 '21 at 18:21

As for the other part, $\operatorname{tr}(AB)=\operatorname{tr}(BA)$, so $$\operatorname{tr}(W^{\top}X^{\top} X W)=\operatorname{tr}(X^{\top} X W W^{\top})$$ and the author seems to be differentiating inside the trace and pulling the constant $X^{\top}X$ out: $$\nabla_W(\operatorname{tr}(X^{\top} X W W^{\top}))=X^{\top} X \nabla_W(\operatorname{tr}(W W^{\top}))=2X^{\top} X W$$ The final result is correct, but I think the middle step needs some justification, since pulling a constant matrix out of the trace like this is not a general rule.
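One way to justify the result without pulling the constant out of the trace is to take the differential directly. In this sketch, $S := X^{\top}X$ is shorthand introduced here, and the only property used is that $S$ is symmetric. \begin{align} d\operatorname{tr}(W^{\top} S W) &= \operatorname{tr}(dW^{\top} S W) + \operatorname{tr}(W^{\top} S \, dW) \\ &= \operatorname{tr}(W^{\top} S^{\top} dW) + \operatorname{tr}(W^{\top} S \, dW) \\ &= \operatorname{tr}\left(\left((S + S^{\top})W\right)^{\top} dW\right) = \operatorname{tr}\left((2SW)^{\top} dW\right) \end{align}

So $\nabla_W \operatorname{tr}(W^{\top}X^{\top}XW) = 2X^{\top}XW$. For a general square constant $M$ in place of $S$, the same computation gives $(M + M^{\top})W$, which is why the shortcut only happens to work here because $X^{\top}X$ is symmetric.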