How is the closed-form solution to linear regression derived using matrix derivatives, as opposed to the trace method Andrew Ng uses in his machine learning lectures? Specifically, I am trying to understand how Nando de Freitas does it here.

We want to find the value of $ \theta $ that minimizes $ J(\theta)=(X\theta-Y)^{T}(X\theta-Y) $, where $\theta \in \mathbb{R}^{N \times 1}$, $X \in \mathbb{R}^{M \times N}$, and $Y \in \mathbb{R}^{M \times 1}$.

$\nabla_{\theta}J(\theta) = \nabla_{\theta} (X\theta-Y)^{T}(X\theta-Y)$

$ = \nabla_{\theta} (\theta^{T} X^{T}-Y^{T})(X\theta-Y)$

$ = \nabla_{\theta} (\theta^{T} X^{T}X\theta-\theta^{T} X^{T}Y - Y^{T}X\theta + Y^{T}Y) $

Note that $\theta^{T} X^{T}Y$ is a scalar, so $\theta^{T} X^{T}Y = (\theta^{T} X^{T}Y)^{T} = Y^{T} X \theta$

$\nabla_{\theta}J(\theta) = \nabla_{\theta}(\theta^{T} X^{T}X\theta-Y^{T} X \theta - Y^{T}X\theta + Y^{T}Y)$

$ = \nabla_{\theta}(\theta^{T} X^{T}X\theta- 2 Y^{T} X \theta + Y^{T}Y)$

$ = \nabla_{\theta} \theta^{T} X^{T}X\theta - \nabla_{\theta} 2 Y^{T} X \theta + \nabla_{\theta} Y^{T}Y$

$ = \nabla_{\theta} \theta^{T} X^{T}X\theta - \nabla_{\theta} 2 Y^{T} X \theta $

How do I apply the matrix derivatives described in that video to solve this? He skips steps.

Edit: Below is my attempt at the suggested strategy: differentiate to isolate $\theta$, then invert to solve for it. Looking at one term at a time, we have

$ \nabla_{\theta} \theta^{T} X^{T}X\theta = ? $ How do I differentiate this? It is like differentiating $x\alpha_{1} \alpha_{2} x$ with respect to $x$ in the scalar case. I need to combine those $\theta$ terms to hit them with the derivative. Transposing seems to result in the same expression: $$ (\nabla_{\theta} \theta^{T} X^{T}X\theta)^{T} = \nabla_{\theta} \theta^{T} X^{T}X\theta$$

Looking at the second term, we have

$ \nabla_{\theta} 2 Y^{T} X \theta = 2 X^{T} Y$.

Putting these together and setting the gradient to zero, we have: $$\nabla_{\theta} \theta^{T} X^{T}X\theta = 2 X^{T} Y$$

Knowing the solution is $\theta = (X^{T}X)^{-1}X^{T}Y$, we can try to reverse-engineer the problem, but I am just not seeing it. And how do we get rid of that factor of 2?


1 Answer

Matrix derivatives work a bit differently than regular ones. The scalar parallel of $\nabla_{\theta} \theta^{T} X^{T}X\theta$ that you make should be more like $\theta x^2 \theta$, which in turn is just $x^2 \theta^2$. (Note that I changed $\alpha$ to $x$ to avoid confusion with $X$.) You do not need to modify $\theta^{T} X^{T}X\theta$ so that the two $\theta$'s are next to each other; this form is actually what you want.

Just as you usually have $\frac{d}{d\theta} (x^2 \theta^2) = 2 x^2 \theta$, in matrix notation the rule is $$\nabla_{\theta} \theta^{T} X^{T}X\theta = 2 X^{T}X \theta.$$
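For completeness, here is one way to check that rule; this is just the standard componentwise computation, not something taken from the video. For any constant matrix $A$,

$$\frac{\partial}{\partial \theta_{k}}\,\theta^{T}A\theta = \frac{\partial}{\partial \theta_{k}}\sum_{i,j}\theta_{i}A_{ij}\theta_{j} = \sum_{j}A_{kj}\theta_{j} + \sum_{i}\theta_{i}A_{ik},$$

so $\nabla_{\theta}\,\theta^{T}A\theta = (A + A^{T})\theta$. With $A = X^{T}X$, which is symmetric, this reduces to $2X^{T}X\theta$.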

Hence, I feel he's not really "skipping steps", but applying a different step than the one you expected.
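To spell out the remaining algebra that the question's "reverse engineering" is after: setting the full gradient to zero,

$$\nabla_{\theta}J(\theta) = 2X^{T}X\theta - 2X^{T}Y = 0 \;\Longrightarrow\; X^{T}X\theta = X^{T}Y \;\Longrightarrow\; \theta = (X^{T}X)^{-1}X^{T}Y,$$

assuming $X^{T}X$ is invertible. The factor of 2 appears in both terms, so it cancels as soon as you move $2X^{T}Y$ to the other side and divide by 2, which answers the last part of the question.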
