I'm studying the EM algorithm, and at one point my reference takes the derivative of a function with respect to a matrix. Could someone explain how one takes the derivative of a function with respect to a matrix? I don't understand the idea. For example, let's say we have the multidimensional Gaussian density:
$$f(\textbf{x}, \Sigma, \boldsymbol \mu) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\exp\left( -\frac{1}{2}(\textbf{x}-\boldsymbol \mu)^T\Sigma^{-1}(\textbf{x}-\boldsymbol \mu)\right),$$
where $\textbf{x} = (x_1, ..., x_n)$, $\;\;x_i \in \mathbb R$, $\;\;\boldsymbol \mu = (\mu_1, ..., \mu_n)$, $\;\;\mu_i \in \mathbb R$ and $\Sigma$ is the $n\times n$ covariance matrix.
How would one calculate $\displaystyle \frac{\partial f}{\partial \Sigma}$? What about $\displaystyle \frac{\partial f}{\partial \boldsymbol \mu}$ or $\displaystyle \frac{\partial f}{\partial \textbf{x}}$? (Aren't these two actually just special cases of the first, with $n\times 1$ matrices?)
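To at least pin down what the object is, my current understanding is that the derivative with respect to a matrix is taken entrywise, $\left(\frac{\partial f}{\partial \Sigma}\right)_{ij} = \frac{\partial f}{\partial \Sigma_{ij}}$. Here is a small numerical sketch of that interpretation (the function names and the finite-difference check are my own, not from the reference):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density. Sigma is not required to be
    symmetric here, which lets us perturb a single entry at a time."""
    n = len(x)
    d = x - mu
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / norm

def numerical_matrix_derivative(f, Sigma, h=1e-6):
    """Entrywise interpretation: (df/dSigma)[i, j] = df/dSigma_ij,
    approximated by a central difference in each single entry."""
    G = np.zeros_like(Sigma)
    for i in range(Sigma.shape[0]):
        for j in range(Sigma.shape[1]):
            E = np.zeros_like(Sigma)
            E[i, j] = h
            G[i, j] = (f(Sigma + E) - f(Sigma - E)) / (2 * h)
    return G

# Example: n = 2, x = mu = 0, Sigma = I.
x = np.zeros(2)
mu = np.zeros(2)
Sigma = np.eye(2)
G = numerical_matrix_derivative(lambda S: gaussian_pdf(x, mu, S), Sigma)
print(G)  # G[i, j] is the partial derivative of f w.r.t. Sigma[i, j]
```

If this entrywise reading is right, then $\frac{\partial f}{\partial \Sigma}$ is an $n\times n$ matrix, and the vector derivatives would indeed be the $n\times 1$ case of the same idea.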
Thanks for any help. If you're wondering where this question came from, I got it from reading this reference (page 14):
http://ptgmedia.pearsoncmg.com/images/0131478249/samplechapter/0131478249_ch03.pdf
UPDATE:
I've added the particular part from my reference here, in case anyone is interested :) I highlighted the parts where I got confused, namely where the author takes the derivative with respect to a matrix (the $\Sigma$ in the picture is also a covariance matrix; the author is estimating the optimal parameters of a Gaussian mixture model using the EM algorithm):
$Q(\theta|\theta_n)\equiv E_Z\{\log p(Z,X|\theta)|X,\theta_n\}$
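To show where the matrix derivative enters (this framing is my own reading of the standard EM setup, not a quote from the reference): $Z$ are the latent component labels, $X$ is the observed data, and each iteration maximizes $Q$ in the M-step,

$$\theta_{n+1} = \underset{\theta}{\operatorname{arg\,max}}\; Q(\theta|\theta_n),$$

so setting $\displaystyle \frac{\partial Q}{\partial \Sigma} = 0$ for the covariance matrix of each mixture component is, as far as I can tell, exactly the step where the derivative with respect to a matrix appears.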
