
I am trying to understand why $\theta_{MLE} = (X^TX)^{-1}X^Ty$ for multivariate linear regression, where we minimize the squared (Frobenius) norm $\min_\theta \|y - X\theta\|^2$. Looking at this tutorial, I have a hard time following the steps. Is there a better link you could suggest, or a better way to derive it?

Also, using this solution, I have
$$\|X\theta-y\|^2 = \operatorname{tr}\big((X\theta-y)^T(X\theta-y)\big) = \operatorname{tr}\big(\theta^TX^TX\theta - \theta^TX^Ty - y^TX\theta + y^Ty\big).$$

Differentiating with respect to $\theta$:
$$\nabla_\theta f(\theta) = 2X^TX\theta - 2X^Ty = 2X^T(X\theta-y).$$

I am not sure how I could continue from here to end up with $(X^TX)^{-1}X^Ty$.
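As a sanity check of the target formula (my own sketch, not from the tutorial; the sizes and noise level are arbitrary), the closed form matches a generic least-squares solver on synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 3                                   # 50 samples, 3 features (arbitrary)
    X = rng.normal(size=(n, d))
    theta_true = np.array([1.0, -2.0, 0.5])
    y = X @ theta_true + 0.1 * rng.normal(size=n)  # add Gaussian noise

    # Closed form: theta = (X^T X)^{-1} X^T y
    theta_closed = np.linalg.inv(X.T @ X) @ X.T @ y

    # Generic least-squares solver for comparison
    theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(theta_closed, theta_lstsq))  # True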

1 Answer


To understand why $\theta_{MLE} = (X^TX)^{-1}X^Ty$, we have to derive the MLE starting from the model
\begin{equation} y = X\theta + \epsilon \end{equation}
where
\begin{equation} \epsilon \sim \mathcal{N}(0,\sigma^2 I). \end{equation}
Then the PDF of $y$ given all unknown parameters is
\begin{equation} P(y \vert \theta,\sigma^2) = \frac{1}{\sqrt {(2\pi)^n{\det( \sigma^2 I)}}} \exp\left(-\frac{1}{2}(y - X\theta)^T(\sigma^2 I)^{-1}(y - X\theta)\right). \end{equation}
We have that
\begin{equation} \det( \sigma^2 I) = \sigma^{2n} \end{equation}
and
\begin{equation} (\sigma^2 I)^{-1} = \frac{1}{\sigma^2} I, \end{equation}
so
\begin{equation} P(y \vert \theta, \sigma^2) = \frac{1}{\sqrt {(2\pi)^n \sigma^{2n}}} \exp\left(-\frac{1}{2\sigma^2}(y - X\theta)^T(y - X\theta)\right). \end{equation}
The above is the likelihood function. Take the log-likelihood and maximize it with respect to $\theta$:
\begin{equation} l(\theta) = \log P(y \vert \theta, \sigma^2) = -\frac{1}{2}\log\left((2\pi)^n \sigma^{2n}\right) -\frac{1}{2\sigma^2}(y - X\theta)^T(y - X\theta). \end{equation}
Since we optimize with respect to $\theta$, the first term does not affect the optimization, so differentiating w.r.t. $\theta$ cancels the first term:
\begin{equation} \frac{\partial}{\partial \theta} l(\theta) = -\frac{1}{2\sigma^2} (- 2X^Ty + 2 X^TX \theta ) = 0,
\end{equation}
which is equivalent to
\begin{equation} - 2X^Ty + 2 X^TX \theta = 0, \end{equation}
i.e.
\begin{equation} X^Ty - X^TX \theta = 0 \end{equation}
or
\begin{equation} X^Ty = X^TX \theta. \end{equation}
If $X^TX$ is invertible, then
\begin{equation} \theta = (X^TX)^{-1}X^Ty. \end{equation}
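As a quick numerical illustration (a sketch on arbitrary synthetic data, not part of the derivation), maximizing the log-likelihood numerically recovers the same $\theta$ as the closed form; the gradient supplied to the optimizer is exactly the one computed above:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n, d, sigma = 100, 4, 0.3            # arbitrary problem size and noise level
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + sigma * rng.normal(size=n)

    # Negative log-likelihood in theta (constant term dropped,
    # since it does not affect the maximizer)
    def neg_log_lik(theta):
        r = y - X @ theta
        return 0.5 / sigma**2 * (r @ r)

    # Gradient: (1/sigma^2) (X^T X theta - X^T y), as in the derivation
    def grad(theta):
        return (X.T @ (X @ theta - y)) / sigma**2

    theta_numeric = minimize(neg_log_lik, x0=np.zeros(d), jac=grad).x
    theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

    print(np.allclose(theta_numeric, theta_closed, atol=1e-4))  # True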


Why $\det( \sigma^2 I) = \sigma^{2n}$?

Because $\sigma^2 I$ is a diagonal matrix with every diagonal entry equal to $\sigma^2$, the determinant is the product of the diagonal entries, i.e. $(\sigma^2)^n = \sigma^{2n}$.
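A one-line numerical check of this fact (the values of $\sigma^2$ and $n$ below are arbitrary):

    import numpy as np

    sigma2, n = 2.5, 4
    M = sigma2 * np.eye(n)                              # diagonal matrix with sigma^2 on the diagonal
    print(np.isclose(np.linalg.det(M), sigma2 ** n))    # True: det = (sigma^2)^n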

Ahmad Bazzi
  • can you please share a link explaining how you obtained the PDF of $y$? thanks – Mona Jalal Sep 16 '18 at 21:28
  • also, a link on why $\det(\sigma^2 I)$ is what you wrote would be really helpful – Mona Jalal Sep 16 '18 at 21:29
  • I have edited the answer, @MonaJalal, to explain at the end why the determinant is such, and here https://en.wikipedia.org/wiki/Multinomial_distribution is the link you need to get the PDF of a multivariate normal distribution – Ahmad Bazzi Sep 16 '18 at 21:39