
In the matrix cookbook there is an identity $$\frac{\partial (a^TX^T b)}{\partial X} = ba^T$$

I recently ran into a problem where I had to compute $$\frac{\partial (X^T b)}{\partial X}$$

but I couldn't find a formula for this. However, it seems that, at least for my example,

$$\frac{\partial (X^T b)}{\partial X} = b$$.

Does this formula hold in general?

Does it even make sense to take the derivative $$\frac{\partial (X^T b)}{\partial X}?$$

The problem where this came up was Section 3.1.5 of Pattern Recognition and Machine Learning, specifically taking the derivative with respect to $W$ of (3.33):

$$\ln p(T|X,W,\beta)=\frac{NK}{2}\ln\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2}\sum_{n=1}^N \| t_n -W^T \phi (x_n)\|^2$$ where I used the chain rule to compute:

$$\frac{\partial}{\partial W}ln(p(T|X,W,\beta))=- \frac{\beta}{2}\sum_{n=1}^N \frac{\partial}{\partial W}(t_n -W^T \phi (x_n))^T(t_n -W^T \phi (x_n)) $$

$$=- \frac{\beta}{2}\sum_{n=1}^N \frac{\partial}{\partial (t_n-W^T \phi (x_n))}(t_n -W^T \phi (x_n))^T(t_n -W^T \phi (x_n))\frac{\partial}{\partial W} (t_n-W^T \phi (x_n)) $$

Then I used $$\frac{\partial (x^Tx)}{\partial x}=2x,$$ and to compute the derivative

$$\frac{\partial}{\partial W} (t_n-W^T \phi (x_n))$$ I used

$$\frac{\partial (X^T b)}{\partial X} = b$$

which seems to give the correct results.

Furthermore, a proof in a similar vein to this seems to work, although I'm not sure whether it is valid.
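The gradient my chain-rule argument produces can at least be sanity-checked numerically. The sketch below (my own check, not from the book) compares the claimed gradient $\frac{\partial}{\partial W}\|t_n - W^T\phi(x_n)\|^2 = -2\,\phi(x_n)(t_n - W^T\phi(x_n))^T$ against central finite differences for one term of the sum:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 4, 3                      # phi(x_n) in R^M, t_n in R^K, W in R^{M x K}
W = rng.standard_normal((M, K))
phi = rng.standard_normal(M)
t = rng.standard_normal(K)

def sq_err(W):
    """One summand of the log-likelihood: ||t - W^T phi||^2."""
    r = t - W.T @ phi
    return r @ r

# Gradient suggested by the chain-rule argument: -2 phi (t - W^T phi)^T
grad_analytic = -2.0 * np.outer(phi, t - W.T @ phi)

# Central finite-difference check of every entry of W
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(M):
    for j in range(K):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        grad_numeric[i, j] = (sq_err(Wp) - sq_err(Wm)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```

So even if the intermediate step abuses notation, the resulting $M \times K$ gradient matrix matches the numerical derivative.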

Saad
  • 35,369
  • This seems possibly relevant https://math.stackexchange.com/q/2044191/280789 – tail_recursion Jun 04 '21 at 06:42
  • 1
    Be careful that for the first formula to make sense (the one with $a^T X^Tb$), you need $X$ to be a matrix, while for the second (the one you are asking to prove) $X$ is a vector – SolubleFish Jun 04 '21 at 07:01
  • @SolubleFish Why does $X$ need to be a vector in the second formula? – tail_recursion Jun 04 '21 at 11:53
  • 1
    Because if $X$ is a matrix and $b$ a vector, then $f(X) = X^Tb$ is a vector value function, therefore the gradient $\nabla_Xf$ is a $(2,1)$-tensor, which cannot be equal to $b$ – SolubleFish Jun 04 '21 at 13:10

3 Answers


In this case it is much clearer to think of the derivative as a linearization of the map $$ F\colon \mathbb R^{m\times n} \to \mathbb R^n,\, X\mapsto X^T b. $$ Indeed since $$ F(X+V)=(X+V)^Tb = X^Tb+V^Tb=F(X)+V^Tb $$ we find that $$ D_XF(V)=\frac{\partial (X^Tb)}{\partial X}(V)=V^Tb. $$

With this there is no need for partial derivatives or for writing out the matrices at play. Generally, this definition of the derivative (see here) is often helpful in similar settings.
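Because $F$ is linear in $X$, the linearization is exact: $F(X+V) - F(X)$ equals $V^Tb$ with no remainder term. A quick numerical illustration of this (dimensions $m, n$ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
X = rng.standard_normal((m, n))
b = rng.standard_normal(m)

F = lambda X: X.T @ b            # F : R^{m x n} -> R^n

# For a linear map, the difference along any direction V is exactly
# the derivative applied to V: F(X + V) - F(X) = V^T b.
V = rng.standard_normal((m, n))
print(np.allclose(F(X + V) - F(X), V.T @ b))  # True
```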

Alex
  • 943

Yes, this formula is true. Assuming $X\in\mathbb R^n$ and $b\in\mathbb R^n$, we obtain $$X^T b = \sum_{i=1}^n X_i b_i.$$

The partial derivative of this linear combination with respect to $X_k$ is $b_k$, which proves your formula:

\begin{align*} \frac{\partial(X^T b)}{\partial X} = b. \end{align*}
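In this vector case the claim is easy to verify numerically: the finite-difference gradient of the scalar $X^Tb$ with respect to the vector $X$ recovers $b$ componentwise. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
X = rng.standard_normal(n)       # here X is a vector in R^n
b = rng.standard_normal(n)

f = lambda X: X @ b              # X^T b is a scalar

# Central difference in each coordinate direction e_k
eps = 1e-6
e = np.eye(n)
grad = np.array([(f(X + eps * e[k]) - f(X - eps * e[k])) / (2 * eps)
                 for k in range(n)])
print(np.allclose(grad, b))  # True
```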

L. Milla
  • 785

Let's consider $X \in \mathbb{R}^{m \times n}$, so $b \in \mathbb{R}^m$ and $X^Tb \in \mathbb{R}^n$.

The derivative $\frac{\partial X^Tb}{\partial X}$ has $n \times m \times n$ entries, so it is helpful to compute each slice $\frac{\partial (X^Tb)_k}{\partial X}$ separately:

$$\frac{\partial (X^Tb)_k}{\partial X} = \begin{bmatrix} \frac{\partial (X^Tb)_k}{\partial X_{11}} & \frac{\partial (X^Tb)_k}{\partial X_{12}} & \cdots & \frac{\partial (X^Tb)_k}{\partial X_{1n}} \\ \frac{\partial (X^Tb)_k}{\partial X_{21}} & \frac{\partial (X^Tb)_k}{\partial X_{22}} & \cdots & \frac{\partial (X^Tb)_k}{\partial X_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial (X^Tb)_k}{\partial X_{m1}} & \frac{\partial (X^Tb)_k} {\partial X_{m2}} & \cdots & \frac{\partial (X^Tb)_k}{\partial X_{mn}} \\ \end{bmatrix}$$

If we compute for each element:

$$\frac{\partial (X^Tb)_k}{\partial X_{ij}} = \frac{\partial}{\partial X_{ij}} \sum_{l = 1}^{m}X^T_{kl}b_l = \frac{\partial}{\partial X_{ij}} \sum_{l = 1}^{m}X_{lk}b_l = \begin{cases} 0 & \text{if $k \ne j$} \\ b_i & \text{if $k = j$} \end{cases}$$

So the columns of the matrix are all zero except the $k$th one:

$$\frac{\partial (X^Tb)_k}{\partial X} = \begin{bmatrix} 0 & \cdots & b_1 & \cdots & 0 \\ \vdots & & \vdots & & \vdots \\ 0 & \cdots & b_m & \cdots & 0 \\ \end{bmatrix}$$

As you mentioned, in this example the results come out correct if we simply take $\frac{\partial X^Tb}{\partial X} = b$, but a derivative of a vector with respect to a matrix is not guaranteed to collapse to a vector in general.

To avoid dealing with tensors, I recommend not using the chain rule and instead computing each element of the derivative separately.
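The tensor structure derived above can be confirmed numerically: the slice $T[k,i,j] = \partial (X^Tb)_k / \partial X_{ij}$ should equal $b_i$ when $j = k$ and $0$ otherwise. A sketch (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 3
X = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Build the full n x m x n derivative tensor by central differences:
# T[k, i, j] = d(X^T b)_k / dX_{ij}
eps = 1e-6
T = np.zeros((n, m, n))
for i in range(m):
    for j in range(n):
        Xp = X.copy(); Xp[i, j] += eps
        Xm = X.copy(); Xm[i, j] -= eps
        T[:, i, j] = ((Xp.T @ b) - (Xm.T @ b)) / (2 * eps)

# Expected structure: the k-th slice has b in its k-th column, zeros elsewhere
expected = np.zeros((n, m, n))
for k in range(n):
    expected[k, :, k] = b
print(np.allclose(T, expected, atol=1e-5))  # True
```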

fraxea
  • 23