How is the closed-form solution to linear regression derived using matrix derivatives, as opposed to the trace method Andrew Ng uses in his machine learning lectures? Specifically, I am trying to understand how Nando de Freitas does it here.

We want to find the value of $ \theta $ that minimizes $ J(\theta)=(X\theta-Y)^{T}(X\theta-Y) $, where $\theta \in \mathbb{R}^{N \times 1}$, $X \in \mathbb{R}^{M \times N}$, and $Y \in \mathbb{R}^{M \times 1}$.

$\nabla_{\theta}J(\theta) = \nabla_{\theta} (X\theta-Y)^{T}(X\theta-Y)$

$ = \nabla_{\theta} (\theta^{T} X^{T}-Y^{T})(X\theta-Y)$

$ = \nabla_{\theta} (\theta^{T} X^{T}X\theta-\theta^{T} X^{T}Y - Y^{T}X\theta + Y^{T}Y) $

Note that $\theta^{T} X^{T}Y$ is a scalar, so $\theta^{T} X^{T}Y = (\theta^{T} X^{T}Y)^{T} = Y^{T} X \theta$

$\nabla_{\theta}J(\theta) = \nabla_{\theta}(\theta^{T} X^{T}X\theta-Y^{T} X \theta - Y^{T}X\theta + Y^{T}Y)$

$ = \nabla_{\theta}(\theta^{T} X^{T}X\theta- 2 Y^{T} X \theta + Y^{T}Y)$

$ = \nabla_{\theta} \theta^{T} X^{T}X\theta - \nabla_{\theta} 2 Y^{T} X \theta + \nabla_{\theta} Y^{T}Y$

$ = \nabla_{\theta} \theta^{T} X^{T}X\theta - \nabla_{\theta} 2 Y^{T} X \theta $

How do I apply the matrix derivatives described in that video to solve this? He skips steps.

Edit: Below is my attempt at the suggested strategy: differentiate to isolate $\theta$, then invert to solve for it. Looking at one term at a time, we have

$ \nabla_{\theta} \theta^{T} X^{T}X\theta = ? $ How do I differentiate this? It is like differentiating $x\alpha_{1} \alpha_{2} x$ with respect to $x$ in the scalar case. I need to combine those $\theta$ terms to hit them with the derivative. Transposing seems to result in the same expression: $$ (\nabla_{\theta} \theta^{T} X^{T}X\theta)^{T} = \nabla_{\theta} \theta^{T} X^{T}X\theta$$

Looking at the second term, we have

$ \nabla_{\theta} 2 Y^{T} X \theta = 2 X^{T} Y$.

Putting these together and setting the gradient to zero, we have: $$\nabla_{\theta} \theta^{T} X^{T}X\theta = 2 X^{T} Y$$

Knowing the solution is $\theta = (X^{T}X)^{-1}X^{T}Y$, we can try to reverse-engineer the problem, but I am just not seeing it. And how do we get rid of that factor of 2?


1 Answer

Matrix derivatives work a bit differently than regular ones. The scalar parallel of $\nabla_{\theta} \theta^{T} X^{T}X\theta$ that you make should be more like $\theta x^2 \theta$, which in turn is just $x^2 \theta^2$. (Note that I changed $\alpha$ to $x$ to avoid confusion with $X$.) You do not need to modify $\theta^{T} X^{T}X\theta$ so that the two $\theta$'s are next to each other; this form is actually what you want.

Just as you usually have $\frac{d}{d\theta} (x^2 \theta^2) = 2 x^2 \theta$, in matrix notation the rule is $$\nabla_{\theta} \theta^{T} X^{T}X\theta = 2 X^{T}X \theta.$$
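For completeness, here is one way to check that rule; this is just the standard componentwise computation, not something taken from the video. For any constant matrix $A$,

$$\frac{\partial}{\partial \theta_{k}}\,\theta^{T}A\theta = \frac{\partial}{\partial \theta_{k}}\sum_{i,j}\theta_{i}A_{ij}\theta_{j} = \sum_{j}A_{kj}\theta_{j} + \sum_{i}\theta_{i}A_{ik},$$

so $\nabla_{\theta}\,\theta^{T}A\theta = (A + A^{T})\theta$. With $A = X^{T}X$, which is symmetric, this reduces to $2X^{T}X\theta$.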

Hence, I feel he's not really "skipping steps", but applying a different step than the one you expected.
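To spell out the remaining algebra that the question's "reverse engineering" is after: setting the full gradient to zero,

$$\nabla_{\theta}J(\theta) = 2X^{T}X\theta - 2X^{T}Y = 0 \;\Longrightarrow\; X^{T}X\theta = X^{T}Y \;\Longrightarrow\; \theta = (X^{T}X)^{-1}X^{T}Y,$$

assuming $X^{T}X$ is invertible. The factor of 2 appears in both terms, so it cancels as soon as you move $2X^{T}Y$ to the other side and divide by 2, which answers the last part of the question.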
