
The equation $$\nabla_A \operatorname{tr}(ABA^TC) = CAB + C^TAB^T$$ appears without proof on page 9 (equation 3) of Andrew Ng's notes on Machine Learning.

I have tried various approaches to prove this, to no avail. From the notes it seems that it should be provable from first principles. I tried using the ordinary chain rule on an element-by-element basis, but this quickly gets unwieldy. Any hint on how to approach this would help a lot.

2 Answers


The Matrix Cookbook gives rules for differentiating the trace of a matrix in Section 2.4. Use those in conjunction with the fact that the trace is invariant under cyclic permutation of its factors, i.e. tr(ABCD)=tr(DABC), to derive the above expression.
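
A quick numerical sanity check of the cyclic property (a NumPy sketch, not part of the original answer; the matrix sizes are arbitrary):

```python
import numpy as np

# Sanity check: the trace is invariant under cyclic permutation of its factors.
rng = np.random.default_rng(0)
A, B, C, D = (rng.standard_normal((4, 4)) for _ in range(4))

t_abcd = np.trace(A @ B @ C @ D)
t_dabc = np.trace(D @ A @ B @ C)   # one cyclic shift
t_cdab = np.trace(C @ D @ A @ B)   # two cyclic shifts

print(np.allclose(t_abcd, t_dabc), np.allclose(t_abcd, t_cdab))  # True True
```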

elexhobby

Since the Frobenius inner product on matrices is defined by $$ \langle A,B \rangle = \operatorname{tr}(A B^T), $$ it follows, by the general definition of the gradient, that $\nabla_A \operatorname{tr}(ABA^T C)$ is the unique matrix satisfying $$ \operatorname{tr}((A+h)B(A+h)^T C) = \operatorname{tr}(ABA^TC) + \langle \nabla_A \operatorname{tr}(ABA^T C),h\rangle + o(\|h\|)\\ = \operatorname{tr}(ABA^TC) + \operatorname{tr}(\nabla_A \operatorname{tr}(ABA^T C)h^T)+ o(\|h\|). $$ However, by repeated application of the cyclic identity for traces and the invariance of the trace under transposition, $$ \operatorname{tr}((A+h)B(A+h)^T C) = \operatorname{tr}(ABA^TC + hBA^TC + ABh^TC+hBh^TC)\\ = \operatorname{tr}(ABA^TC) +\operatorname{tr}(hBA^TC) + \operatorname{tr}(ABh^TC) + \operatorname{tr}(hBh^TC)\\ = \operatorname{tr}(ABA^TC) + \operatorname{tr}(C^TAB^Th^T) + \operatorname{tr}(CABh^T) + \operatorname{tr}(hBh^TC)\\ = \operatorname{tr}(ABA^TC) + \operatorname{tr}((C^TAB^T+CAB)h^T) + \operatorname{tr}(hBh^TC). $$ Since the last term satisfies $\operatorname{tr}(hBh^TC) = O(\|h\|^2) = o(\|h\|)$, comparing the two expansions gives $$ \nabla_A \operatorname{tr}(ABA^T C) = C^TAB^T + CAB, $$ which is the desired result.
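
As a numerical cross-check of the closed form (a NumPy sketch with assumed random square matrices, not part of the original answer), a central finite-difference approximation of $\nabla_A \operatorname{tr}(ABA^TC)$ should match $C^TAB^T + CAB$ to roundoff:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))

f = lambda X: np.trace(X @ B @ X.T @ C)

# Closed-form gradient from the derivation above
grad = C.T @ A @ B.T + C @ A @ B

# Central finite differences, one entry of A at a time
eps = 1e-6
num = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        num[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.max(np.abs(num - grad)))  # should be near machine precision
```

Because $f$ is quadratic in $A$, the central difference is exact up to floating-point rounding, so any visible discrepancy would indicate an error in the closed form.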