I am looking at the Matrix Cookbook, which says

$$\frac{\partial \text{Tr}(f(X))}{\partial X} = f'(X)^T$$

where $f'$ is just the derivative of the scalar function $f$. This is from Section 2.5. How does one prove this?

Andrews
user1936752

1 Answer

$\DeclareMathOperator{\trace}{\text{Tr}}$

By the chain rule for the Fréchet derivative,

$$ d(\trace(f(X)))[H] = (d\trace)(f(X))[ df(X) [H] ]$$

Since $\trace$ is linear, $(d\trace)(Y)[V] = \trace V$ for any $Y$, so

$$ d(\trace(f(X)))[H] = \trace( df(X) [H] )$$

Recall that if you usually write vectors as columns, the matrix representation of a linear map $\mathbb R^n \to \mathbb R$ should be a row vector; thus if $\nabla u(x)$ is a column vector, we have $d u(x) h = \nabla u(x)^T h$, i.e. the matrix representation of $du(x)$ is $\nabla u(x)^T$. The "same" is true for functions with matrix input: the matrix representative of $dU(X)$ is $\left(\frac{\partial U}{\partial X}(X)\right)^T$, with

$$ dU(X)[H] = \trace \left(\left(\frac{\partial U}{\partial X}(X)\right)^T H\right) = \frac{\partial U}{\partial X}(X) : H $$

Here, the double contraction $A:B \overset{\Delta}{=} \trace (A^T B) = A_{ij}B_{ij}$ (summed over $i,j$) was introduced. For $f$ given by a power series (the usual meaning of a scalar function applied to a matrix), $d(X^k)[H] = \sum_{j=0}^{k-1} X^j H X^{k-1-j}$, so cyclicity of the trace gives $\trace(d(X^k)[H]) = k\,\trace(X^{k-1}H)$ and hence $\trace(df(X)[H]) = \trace(f'(X)H)$. Thus,

$$ \frac{\partial }{\partial X}\trace(f(X)) : H = \trace \left(f'(X) H \right) = f'(X)^T : H $$

Comparing LHS and RHS gives the result.
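
As a sanity check, the identity can be tested numerically. The sketch below is only an illustration under assumptions that are not in the answer: it picks $f(X) = X^3$ (so $f'(X) = 3X^2$), uses NumPy, and approximates each partial derivative by central differences. The helper names `trace_f` and `numerical_gradient` are hypothetical.

```python
import numpy as np

def trace_f(X):
    """Tr(f(X)) for the illustrative choice f(X) = X^3 (an assumption)."""
    return np.trace(X @ X @ X)

def numerical_gradient(func, X, eps=1e-6):
    """Central-difference estimate of d func / d X_ij, stored at entry (i, j)."""
    grad = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps  # perturb a single entry of X
            grad[i, j] = (func(X + E) - func(X - E)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))  # deliberately non-symmetric

numeric = numerical_gradient(trace_f, X)
f_prime = 3 * X @ X  # f'(X) = 3 X^2 for f(X) = X^3

# Compare against f'(X)^T and, for contrast, against f'(X) itself.
matches_transpose = np.allclose(numeric, f_prime.T, atol=1e-5)
matches_untransposed = np.allclose(numeric, f_prime, atol=1e-5)
print(matches_transpose, matches_untransposed)
```

Using a non-symmetric $X$ matters here: for symmetric $X$ the matrices $f'(X)$ and $f'(X)^T$ coincide, so the check could not tell the two apart.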

Calvin Khor
  • Thank you. Just a couple of clarifying questions: 1) Could you explain the meaning of the notation $[H]$ or $[V]$ that you use? 2) The convention you've adopted uses, for example, $d u(x) h = \nabla u(x)^T h$. Could one use a different convention, e.g. $d u(x) h = \nabla u(x)^\dagger h$, where $\dagger$ represents the conjugate transpose? – user1936752 Feb 26 '19 at 23:29
  • @user1936752 The brackets $[\,]$ and $(\,)$ are, strictly speaking, just brackets, but I consistently used $df(x)[h]$ to mean the linear map $df(x)$ evaluated at $h$ (cf. Fréchet derivative). You could have used the other convention, but this one seems to be the norm – Calvin Khor Feb 26 '19 at 23:40
  • @user1936752 Actually, no, I misunderstood, sorry. If your functions are complex valued, you should definitely be using the conjugate transpose. I assumed everything was real valued – Calvin Khor Feb 26 '19 at 23:51
  • Thank you, it works very nicely with complex conjugation - I failed to realize it was a convention to pick transpose! – user1936752 Feb 26 '19 at 23:55
  • Sorry, I was rereading this answer and somehow missed asking this, but why is $(d\,\text{Tr})(Y)[V] = \text{Tr}(V)$ for any $Y$? Even using the fact that the trace is linear and bounded (so its derivative is equal to itself), I'm not sure how you got rid of the $Y$. – user1936752 Mar 14 '19 at 17:38
  • @user1936752 That's exactly it: the derivative is equal to the map itself (and doesn't depend on the point where you differentiated). Perhaps you should try to understand why this also holds in dimension one: in what sense is the derivative of the linear map $2x$ equal to itself? – Calvin Khor Mar 14 '19 at 18:12
  • Ah I see. Correct me if I'm wrong but you're using the argument that since it's the same at every point, $Y$ may as well be replaced by any other function, in particular the identity? – user1936752 Mar 14 '19 at 18:40
  • @user1936752 I guess you could do that but I don't see any simplification coming from choosing a special $Y$? – Calvin Khor Mar 14 '19 at 19:08
  • I thought choosing $Y = I$ was how you got rid of $Y$? In the example where the map is $2x$, the derivative of this linear map is equal to the map at $x = 1$. Or perhaps I am missing the point you are trying to make? – user1936752 Mar 14 '19 at 19:17
  • No, there is just no $Y$; nothing needs to be done to remove it. If $L$ is linear, then the derivative map $dL\colon X \to \mathcal{L}(X,Y)$ is the constant map $dL(x)=L$ for every $x$, and I don't think it's made simpler by choosing a particular $x$? – Calvin Khor Mar 14 '19 at 19:35
  • @user1936752 I was a little busy, and you seem to get it, but let me expand slightly anyway. (a) I wanted you to see that the Fréchet derivative of $2x$ is the linear map $h\mapsto 2h$. (On $\mathbb R$, linear maps and constants are the ‘same’.) (b) The fact that the best affine approximation to an affine function is itself should sound obvious, and in symbols the required approximation $$f(x+h)-f(x)\approx df(x)h$$ holds exactly, with no error term, by choosing $df(x)=f$, which is a complete proof of the statement due to uniqueness – Calvin Khor Mar 15 '19 at 12:24
  • (...that should have been "linear function", not "affine") – Calvin Khor Mar 15 '19 at 12:46
  • Yes, I think I get it now. Thank you very much for the explanation! – user1936752 Mar 15 '19 at 14:04
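
To spell out the point settled in the comments, here is a short worked computation (a sketch, using only linearity of the trace and the one-dimensional example $L(x) = 2x$ from the thread). The increment of a linear map is already linear in the perturbation, so there is no remainder term and the base point drops out:

$$\begin{aligned}
\operatorname{Tr}(Y + V) - \operatorname{Tr}(Y) &= \operatorname{Tr}(V) \quad \text{exactly, for every } Y \text{ and } V,\\
\text{hence}\quad (d\operatorname{Tr})(Y)[V] &= \operatorname{Tr}(V) \quad \text{independently of } Y;\\
L(x+h) - L(x) &= 2h \quad\Longrightarrow\quad dL(x)[h] = 2h = L(h) \quad \text{for } L(x) = 2x.
\end{aligned}$$

Nothing is done to "remove" $Y$: it simply never appears on the right-hand side.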