To understand why $\theta_{MLE} = (X^TX)^{-1}X^Ty$, we derive the MLE starting from the likelihood function. The linear model is
\begin{equation}
y = X\theta + \epsilon
\end{equation}
where
\begin{equation}
\epsilon \sim \mathcal{N}(0,\sigma^2 I)
\end{equation}
Since $\epsilon$ is Gaussian, $y \sim \mathcal{N}(X\theta, \sigma^2 I)$, so the PDF of $y$ given the unknown parameters is
\begin{equation}
P(y \vert \theta,\sigma^2)
=
\frac{1}{\sqrt{(2\pi)^n \det(\sigma^2 I)}}
\exp\left(-\frac{1}{2}(y - X\theta)^T(\sigma^2 I)^{-1}(y - X\theta)\right)
\end{equation}
For the covariance $\sigma^2 I$, we have
\begin{equation}
\det( \sigma^2 I) = \sigma^{2n}
\end{equation}
and
\begin{equation}
(\sigma^2 I)^{-1} = \frac{1}{\sigma^2} I
\end{equation}
So
\begin{equation}
P(y \vert \theta, \sigma^2)
=
\frac{1}{\sqrt{(2\pi)^n \sigma^{2n}}}
\exp\left(-\frac{1}{2\sigma^2}(y - X\theta)^T(y - X\theta)\right)
\end{equation}
The above is the likelihood function. Taking its logarithm gives the log-likelihood
\begin{equation}
l(\theta) = \log P(y \vert \theta, \sigma^2)
=
-\frac{1}{2}\log\left((2\pi)^n \sigma^{2n}\right)
-\frac{1}{2\sigma^2}(y - X\theta)^T(y - X\theta)
\end{equation}
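Before differentiating, note that the quadratic term expands as
\begin{equation}
(y - X\theta)^T(y - X\theta) = y^Ty - 2\theta^T X^T y + \theta^T X^T X \theta,
\end{equation}
using the fact that the scalar $y^TX\theta$ equals its transpose $\theta^TX^Ty$. The standard matrix-calculus identities $\frac{\partial}{\partial \theta} b^T\theta = b$ and $\frac{\partial}{\partial \theta} \theta^T A \theta = 2A\theta$ (for symmetric $A$) then give the gradient below.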
Since we maximize with respect to $\theta$, the first term does not depend on $\theta$ and vanishes when we differentiate. Setting the derivative to zero gives
\begin{equation}
\frac{\partial}{\partial \theta} l(\theta)
=
-\frac{1}{2\sigma^2}
(- 2X^Ty + 2 X^TX \theta )
= 0
\end{equation}
which is equivalent to
\begin{equation}
- 2X^Ty + 2 X^TX \theta
=
0
\end{equation}
i.e.
\begin{equation}
X^Ty - X^TX \theta
=
0
\end{equation}
or
\begin{equation}
X^Ty
=
X^TX \theta
\end{equation}
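This stationary point is indeed a maximum: the Hessian of $l(\theta)$ is $-\frac{1}{\sigma^2}X^TX$, which is negative semidefinite because $v^T X^TX v = \|Xv\|^2 \geq 0$ for any $v$.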
If $X^TX$ is invertible, then
\begin{equation}
\theta
=
(X^TX)^{-1}X^Ty
\end{equation}
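As a sanity check, here is a minimal numerical sketch in Python with NumPy (the data, dimensions, and variable names are illustrative, not from the derivation above), comparing the closed-form MLE against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3                      # n observations, d features
X = rng.standard_normal((n, d))    # design matrix
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(n)  # y = X theta + noise

# Closed-form MLE via the normal equations: X^T X theta = X^T y.
# Solving the linear system is preferred over explicitly inverting X^T X.
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

# Compare against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_mle, theta_lstsq))  # True
```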
Why does $\det(\sigma^2 I) = \sigma^{2n}$? Because $\sigma^2 I$ is a diagonal matrix with entries $\sigma^2$, its determinant is the product of the diagonal entries.
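Concretely, for the $n \times n$ identity $I$,
\begin{equation}
\det(\sigma^2 I) = \prod_{i=1}^{n} \sigma^2 = \sigma^{2n}
\end{equation}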