
For a set of vector observations $x_i$ stored as the columns of a matrix $X$, I would like to obtain the gradient of the pairwise Mahalanobis distance matrix $D$ with respect to the Mahalanobis transformation matrix $M$, i.e. $\frac{\partial D}{\partial M}$, given that:

$D_{ij} = (x_i - x_j)^TM(x_i - x_j)$

Edit: $M$ should be positive semi-definite; alternatively, take $M = A^T A$.

I have found how the Euclidean pairwise distance matrix can be expressed in vectorized form (Pairwise distance matrix), and how the Mahalanobis distance between two elements can be differentiated with respect to the matrix $M$ (Differentiating mahalanobis distance), but I have not been able to put everything together.
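For concreteness, here is a small NumPy sketch of the matrix $D$ I am talking about (the function and variable names are only for illustration; the columns of $X$ hold the observations $x_i$):

```python
import numpy as np

def pairwise_mahalanobis(X, M):
    """D[i, j] = (x_i - x_j)^T M (x_i - x_j), with x_i the i-th column of X."""
    G = X.T @ M @ X                  # G[i, j] = x_i^T M x_j
    g = np.diag(G)                   # g[i]   = x_i^T M x_i
    return g[:, None] + g[None, :] - G - G.T

# small sanity check against the element-wise definition
rng = np.random.default_rng(0)
m, n = 3, 5
X = rng.standard_normal((m, n))
A = rng.standard_normal((m, m))
M = A.T @ A                          # a positive semi-definite M
D = pairwise_mahalanobis(X, M)
assert np.allclose(D[1, 2], (X[:, 1] - X[:, 2]) @ M @ (X[:, 1] - X[:, 2]))
```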

Rayamon

1 Answer


$ \def\o{{\tt1}}\def\p{\partial} \def\L{\left}\def\R{\right} \def\LR#1{\L(#1\R)} \def\BR#1{\Big(#1\Big)} \def\bR#1{\big(#1\big)} \def\vecc#1{\operatorname{vec}\LR{#1}} \def\diag#1{\operatorname{diag}\LR{#1}} \def\Sym#1{\operatorname{Sym}\!\BR{#1}} \def\sym#1{\operatorname{Sym}\LR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\m#1{\left[\begin{array}{r}#1\end{array}\right]} \def\c#1{\color{red}{#1}} $First, define some new variables $$\eqalign{ B &= \tfrac 12D \\ y_k &= Mx_k \\ X &= \m{x_1&x_2&\ldots&x_n}\;\in{\mathbb R}^{m\times n} \\ Y &= \m{y_1&y_2&\,\ldots&\,y_n}\;= MX \qiq \c{dY = dM\,X} \\ J &=X\oslash X \qiq J_{ij}=\o\quad\big({\rm all\!-\!ones\;matrix}\big) \\ X &= J\odot X \\ {\cal D}_Y &= {\rm Diag}\BR{\!\vecc{Y}}\,\in{\mathbb R}^{mn\times mn} \\ }$$ where $\{\odot,\oslash\}$ denote elementwise/Hadamard multiplication and division.

Then rewrite the problem in pure matrix notation and calculate its differential. $$\eqalign{ 2B &= (X\odot Y)^TJ + J^T(X\odot Y) - X^TY - Y^TX \\ &= 2\,\Sym{J^T(X\odot{Y}) - X^T{Y}} \\ dB &= \Sym{J^T(X\odot{\c{dY}}) - X^T{\c{dY}}} \\ &= \Sym{J^T(X\odot\LR{\c{dM\,X}}) - X^T{\c{dM\,X}}} \\ }$$ where $\;\sym{A} \doteq \tfrac 12\LR{A+A^T}$
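If you want to check this numerically, here is a self-contained NumPy sketch (the names are arbitrary; recall that $J$ is just the $m\times n$ all-ones matrix). It verifies that $2B$ reproduces the element-wise pairwise distances, and that the differential formula is exact, since $B$ is linear in $M$:

```python
import numpy as np

def sym(Z):
    return 0.5 * (Z + Z.T)

rng = np.random.default_rng(1)
m, n = 3, 5
X = rng.standard_normal((m, n))           # columns are the observations
J = np.ones_like(X)                       # J = X ⊘ X, the all-ones matrix
A0 = rng.standard_normal((m, m))
M = A0.T @ A0                             # a positive semi-definite M, as in the question

def B_of(M):
    Y = M @ X
    return sym(J.T @ (X * Y) - X.T @ Y)   # B = Sym(J^T (X ∘ Y) - X^T Y)

# 2B reproduces the element-wise pairwise distances
D = np.array([[(X[:, i] - X[:, j]) @ M @ (X[:, i] - X[:, j])
               for j in range(n)] for i in range(n)])
assert np.allclose(2 * B_of(M), D)

# differential: dB = Sym(J^T (X ∘ (dM X)) - X^T (dM X))
dM = 1e-6 * rng.standard_normal((m, m))
dB = sym(J.T @ (X * (dM @ X)) - X.T @ (dM @ X))
assert np.allclose(B_of(M + dM) - B_of(M), dB)   # exact: B is linear in M
```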

At this point, one can calculate a component-wise gradient $$\eqalign{ \grad{B}{M_{ij}} &= \Sym{J^T(X\odot\LR{E_{ij}\,X}) - X^T{E_{ij}\,X}} \\ }$$ where $E_{ij}$ is a single-entry matrix whose $(i,j)$ element is $\o$ and all others are zero.
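Again, a self-contained NumPy sketch (arbitrary names) that checks this component-wise formula against a finite difference, which is exact here because $B$ is linear in $M$:

```python
import numpy as np

def sym(Z):
    return 0.5 * (Z + Z.T)

rng = np.random.default_rng(2)
m, n = 3, 5
X = rng.standard_normal((m, n))
J = np.ones_like(X)                              # all-ones matrix
A0 = rng.standard_normal((m, m))
M = A0.T @ A0

def B_of(M):
    return sym(J.T @ (X * (M @ X)) - X.T @ (M @ X))

i, j = 1, 2
E = np.zeros((m, m)); E[i, j] = 1.0              # single-entry matrix E_ij

# dB/dM_ij = Sym(J^T (X ∘ (E_ij X)) - X^T E_ij X)
grad_B_ij = sym(J.T @ (X * (E @ X)) - X.T @ (E @ X))

eps = 1e-6
fd = (B_of(M + eps * E) - B_of(M)) / eps
assert np.allclose(grad_B_ij, fd)
```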

In terms of the original variable $$\eqalign{ \grad{B}{M_{ij}} &= \frac 12 \LR{\grad{D}{M_{ij}}} \\ }$$ The problem with the requested gradient $\LR{\grad DM}$ is that it's a fourth-order tensor, which cannot be written in standard matrix notation.
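In code, however, you can simply store that fourth-order object as a 4-D array, e.g. of shape $n\times n\times m\times m$ with entries $\frac{\partial D_{kl}}{\partial M_{ij}}$. Here is a sketch (my own construction, arbitrary names) which also confirms the entries you would expect directly from $D_{kl}=(x_k-x_l)^TM(x_k-x_l)$, namely $(x_k-x_l)_i\,(x_k-x_l)_j$:

```python
import numpy as np

def sym(Z):
    return 0.5 * (Z + Z.T)

rng = np.random.default_rng(3)
m, n = 3, 5
X = rng.standard_normal((m, n))
J = np.ones_like(X)

# gradD[k, l, i, j] = ∂D_kl / ∂M_ij = 2 ∂B_kl / ∂M_ij
gradD = np.zeros((n, n, m, m))
for i in range(m):
    for j in range(m):
        E = np.zeros((m, m)); E[i, j] = 1.0
        gradD[:, :, i, j] = 2 * sym(J.T @ (X * (E @ X)) - X.T @ (E @ X))

# expected from the definition: ∂D_kl/∂M_ij = (x_k - x_l)_i (x_k - x_l)_j
diff = X[:, :, None] - X[:, None, :]             # diff[a, k, l] = (x_k - x_l)_a
expected = np.einsum('ikl,jkl->klij', diff, diff)
assert np.allclose(gradD, expected)
```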

Update

You've added a constraint to your original question, i.e. $$M=A^TA \qiq \c{dM=dA^TA+A^TdA}$$ The procedure to find the new gradient is straightforward. Write the differential, change the independent variable from $M\to A$, then isolate the gradient.
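As a quick sanity check of that product rule (just a sketch with arbitrary names):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 3
A  = rng.standard_normal((m, m))
dA = 1e-6 * rng.standard_normal((m, m))           # a small perturbation of A

# M = A^T A  =>  dM = dA^T A + A^T dA  (to first order in dA)
dM_exact  = (A + dA).T @ (A + dA) - A.T @ A
dM_linear = dA.T @ A + A.T @ dA
assert np.allclose(dM_exact, dM_linear)           # difference is only the O(|dA|^2) term dA^T dA
```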

Here is the calculation for the component-wise gradient $$\eqalign{ dB &= \Sym{J^T(X\odot\LR{\c{dM\,X}}) - X^T\LR{\c{dM\,X}}} \\ &= \Sym{J^T(X\odot\LR{\c{dA^TAX+A^TdA\,X}}) - X^T\LR{\c{dA^TAX+A^TdA\,X}}} \\ &= 2 \Sym{J^T\bR{\!\LR{AX}\odot\LR{dA\,X}\!} - X^TA^TdA\,X} \\ \grad{B}{A_{ij}} &= 2 \Sym{J^T\bR{\!\LR{AX}\odot\LR{E_{ij}X}\!}-X^TA^TE_{ij}X} \\ }$$
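And a self-contained NumPy check of this gradient with respect to $A$ (arbitrary names; the central difference below is exact because $B$ is quadratic in $A$):

```python
import numpy as np

def sym(Z):
    return 0.5 * (Z + Z.T)

rng = np.random.default_rng(5)
m, n = 3, 5
X = rng.standard_normal((m, n))
J = np.ones_like(X)
A = rng.standard_normal((m, m))

def B_of_A(A):
    M = A.T @ A
    return sym(J.T @ (X * (M @ X)) - X.T @ (M @ X))

i, j = 0, 2
E = np.zeros((m, m)); E[i, j] = 1.0              # single-entry matrix E_ij

# dB/dA_ij = 2 Sym(J^T ((A X) ∘ (E_ij X)) - X^T A^T E_ij X)
grad_B_Aij = 2 * sym(J.T @ ((A @ X) * (E @ X)) - X.T @ A.T @ E @ X)

eps = 1e-6
fd = (B_of_A(A + eps * E) - B_of_A(A - eps * E)) / (2 * eps)   # central difference
assert np.allclose(grad_B_Aij, fd)
```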

The linked paper is pretty good when compared to most Machine Learning papers, but the mathematics is still far too "hand-wavy". It's geared towards an audience who just want to code the gradient expression directly into their Python programs without bothering to check the math. Deriving their gradient result using matrix notation would be a herculean effort. Even converting their result into matrix notation would be very difficult, because the nested sum violates the usual summation convention since it contains an expression with four $i$-indices and four $j$-indices. Conventional index notation only permits two of any single index.

If you want someone to decode that mess for you then, as Steph suggested, you should post a new question $-$ I'll let him answer that one.

greg
  • Thanks a lot for the answer. Would it be easier to express if M was not a matrix but a vector? – Rayamon Jan 13 '22 at 12:36
  • @Rayamon Unfortunately, a matrix-by-vector gradient is a third-order tensor, so it still cannot be written using standard matrix notation. – greg Jan 13 '22 at 14:22
  • How do they do it then when they learn Mahalanobis distances in metric learning algorithms (e.g. http://proceedings.mlr.press/v2/weinberger07a/weinberger07a.pdf)? – Rayamon Jan 13 '22 at 15:36
  • I think the statement of the question is different from what you need... In the article you mention, there is NO need to compute the derivative of the Mahalanobis pairwise distance matrix with respect to the transformation matrix, but rather the derivative of the loss function with respect to it, i.e. $\frac{\partial \phi}{\partial \mathbf{M}}$. Such a sensitivity will depend on the kernel and so on... I think you should post another question to address this topic. – Steph Jan 13 '22 at 18:51
  • Thanks a lot to both of you! Amazing community ;) – Rayamon Jan 14 '22 at 08:32
  • If you are interested, the Python library metric-learn has an implementation of the method in the paper (http://contrib.scikit-learn.org/metric-learn/generated/metric_learn.MLKR.html). In their code, they use ordinary matrix operations to compute the gradient, but I have not been able to completely connect the implementation with the expressions in the article. – Rayamon Jan 14 '22 at 08:54