I'm trying to follow 'Advanced Multivariate Statistics with Matrices', chapter 1.4. I know this book is quite old; however, I'm rather constrained on time, and the papers I'm reading reference this book.
Here, we define the derivative of a matrix $Y$ w.r.t. $X$ as $$\frac{dY}{dX} = \frac{d}{d\text{vec}(X)}\text{vec}^T(Y),$$ which is rather straightforward.
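To make sure I have the convention right, here's a quick finite-difference check I wrote in numpy (my own sketch, not from the book; `book_derivative` is a name I made up). If I've done the vec-algebra right, the definition gives $\frac{d(AXB)}{dX} = B \otimes A^T$:

```python
import numpy as np

def book_derivative(f, X, eps=1e-6):
    """Finite-difference estimate of dY/dX in the book's convention:
    row i, column j holds d vec(Y)_j / d vec(X)_i."""
    y0 = f(X).flatten(order='F')              # vec(Y), column-major
    x0 = X.flatten(order='F')                 # vec(X)
    J = np.zeros((x0.size, y0.size))
    for i in range(x0.size):
        xp = x0.copy()
        xp[i] += eps
        Yp = f(xp.reshape(X.shape, order='F'))
        J[i] = (Yp.flatten(order='F') - y0) / eps
    return J

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((4, 5))
X = rng.standard_normal((3, 4))

num = book_derivative(lambda X: A @ X @ B, X)
print(np.allclose(num, np.kron(B, A.T), atol=1e-4))   # True: d(AXB)/dX = B (x) A'
```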
Now, back in chapter 1.3, the book talked about 'structured matrices'. Roughly, given a matrix $A$, we take the subset of unique elements of $A$ to build a 'structured matrix', say $A(K)$. As it stands, $A(K)$ isn't a rectangular matrix, but rather 'just an array'.
A core result in chapter 1.3 is the construction of a transformation matrix $T(K)$, where $\text{vec}(A(K)) = T(K)\text{vec}(A)$. This is exploited in chapter 1.4, where a critical definition arises, $$\frac{dY(K_2)}{dX(K_1)} = \frac{d}{d\text{vec}(X(K_1))}\text{vec}^T(Y(K_2)) = T(K_1) \frac{dY}{dX} T^T(K_2).$$
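To make this concrete, here's my attempt at writing down $T(K)$ for the symmetric pattern on an $n \times n$ matrix, taking $A(K)$ to be the lower triangle stacked column-wise (i.e. $\text{vech}(A)$). This is my own reading of the construction, not code from the book:

```python
import numpy as np

def elimination_matrix(n):
    """My candidate T(K) for the symmetric pattern: each row selects one
    lower-triangular entry of vec(A), so that T @ vec(A) = vech(A)."""
    m = n * (n + 1) // 2
    T = np.zeros((m, n * n))
    row = 0
    for j in range(n):                 # columns of A, left to right
        for i in range(j, n):          # lower triangle, including diagonal
            T[row, j * n + i] = 1.0    # column-major vec index of a_ij
            row += 1
    return T

n = 3
S = np.array([[2., 1., 0.],
              [1., 3., 4.],
              [0., 4., 5.]])           # a symmetric A
T = elimination_matrix(n)
print(T @ S.flatten(order='F'))        # vec(A(K)) = [2, 1, 0, 3, 4, 5]
```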
I should also mention that $T$ has a pseudo-inverse $T^+$, such that $TT^+ = I$, and the above result is extended to $$\frac{dY}{dX} = T^+(K_1)\,\frac{dY(K_2)}{dX(K_1)}\,T^{+T}(K_2).$$
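Continuing the snippet above, here's my candidate for $T^+(K)$ under the same pattern; I believe it reduces to the classical duplication matrix, and both stated properties, $TT^+ = I$ and $\text{vec}(A) = T^+\text{vec}(A(K))$ for symmetric $A$, check out numerically:

```python
import numpy as np

def duplication_matrix(n):
    """Candidate T^+(K): maps the unique elements back to the full matrix,
    writing each off-diagonal value into both mirror positions of vec(A)."""
    m = n * (n + 1) // 2
    D = np.zeros((n * n, m))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[j * n + i, col] = 1.0    # position (i, j)
            D[i * n + j, col] = 1.0    # mirror position (j, i); same cell if i == j
            col += 1
    return D

n = 3
T = elimination_matrix(n)              # from the previous snippet
Tplus = duplication_matrix(n)
print(np.allclose(T @ Tplus, np.eye(n * (n + 1) // 2)))   # T T^+ = I

S = np.array([[2., 1., 0.],
              [1., 3., 4.],
              [0., 4., 5.]])
print(np.allclose(Tplus @ (T @ S.flatten(order='F')),     # T^+ vec(A(K))
                  S.flatten(order='F')))                   # recovers vec(A)
```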
These are nifty derivations, but how am I supposed to use them? Given symmetric matrices $A, B$, how do I obtain the derivative $\frac{dB}{dA}$? This part was less clear to me in the book. Going through a couple of examples in the book, I think the idea is to pretend that $A$ and $B$ are unstructured, obtain $\frac{dY}{dX}$ as usual, and then pre/post-multiply by the transformation matrices. However, I'm having a hard time translating that into the theory developed in the book; it might be that I don't grasp the whole 'structured matrices' thing yet.
Edit: I might be wrong here, but this is what I can currently come up with, based on re-reading the book carefully.
$X$ and $Y$ already have structure (Toeplitz, symmetric, etc.), and $X(K_1)$ and $Y(K_2)$ are their unique elements, respectively. When we say we want the derivative of the structured matrix $Y$ w.r.t. the structured matrix $X$, what we actually want is $\frac{dY(K_2)}{dX(K_1)}$. Thus, a simple calculation of $T(K_1) \frac{dY}{dX} T^T(K_2)$ is all that is needed.
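Mechanising that reading on a concrete case, $Y = X^2$ with $X$ symmetric: pretending all $n^2$ entries are free gives $\frac{d(XX)}{dX} = X \otimes I + I \otimes X^T$ (book convention, if my vec-algebra is right), and then I sandwich with the transformation matrices from the snippets above. I'm only checking shapes here; whether the book's $T(K)$ is this plain 0/1 selection or carries halves on the off-diagonals is exactly the part I'm unsure about:

```python
import numpy as np

n = 3
rng = np.random.default_rng(1)
G = rng.standard_normal((n, n))
X = G + G.T                                   # symmetric X
Y = X @ X                                     # also symmetric

# Step 1: unstructured derivative, pretending all n^2 entries are free.
dYdX = np.kron(X, np.eye(n)) + np.kron(np.eye(n), X.T)

# Step 2: sandwich with the transformation matrices (K1 = K2 = symmetric).
T = elimination_matrix(n)                     # from the snippet above
structured = T @ dYdX @ T.T                   # T(K1) dY/dX T'(K2)
print(structured.shape)                       # (6, 6): one row/col per unique element
```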
Intuitively this kind of makes sense to me. I'm using these matrix derivatives in the context of maximizing likelihoods. If I would like to maximize the likelihood with respect to some symmetric matrix $\Sigma \in \mathbb{R}^{n \times n}$, then my likelihood isn't a function of $n^2$ variables, but of $\frac{n(n+1)}{2}$. So, when we maximize the likelihood, we only maximize over the smaller set of 'unique' variables.
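To connect this to the likelihood setting, here's a small check I did (my own example, not from the book): for $\ell(\Sigma) = -\tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}x^T\Sigma^{-1}x$, the chain rule through $\text{vec}(\Sigma) = T^+\,\text{vec}(\Sigma(K))$ says the gradient over the $\frac{n(n+1)}{2}$ unique variables is $T^{+T}$ times the unstructured gradient, and that matches finite differences:

```python
import numpy as np

n = 3
rng = np.random.default_rng(2)
x = rng.standard_normal(n)
Sigma = np.eye(n) + 0.2 * np.ones((n, n))     # symmetric positive definite

def loglik(S):
    return -0.5 * np.linalg.slogdet(S)[1] - 0.5 * x @ np.linalg.solve(S, x)

# Unstructured gradient, treating all n^2 entries of Sigma as free:
# d l / d Sigma = -1/2 Sigma^{-1} + 1/2 Sigma^{-1} x x' Sigma^{-1}
Si = np.linalg.inv(Sigma)
grad_vec = (-0.5 * Si + 0.5 * np.outer(Si @ x, Si @ x)).flatten(order='F')

# Chain rule through vec(Sigma) = T^+ vec(Sigma(K)):
T, Tplus = elimination_matrix(n), duplication_matrix(n)   # helpers from above
grad_vech = Tplus.T @ grad_vec

# Finite-difference check over the n(n+1)/2 unique variables
m = n * (n + 1) // 2
v0 = T @ Sigma.flatten(order='F')             # vec(Sigma(K))
eps, fd = 1e-6, np.zeros(m)
for k in range(m):
    v = v0.copy()
    v[k] += eps
    S = (Tplus @ v).reshape((n, n), order='F')
    fd[k] = (loglik(S) - loglik(Sigma)) / eps
print(np.allclose(fd, grad_vech, atol=1e-4))  # True (up to fd error)
```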