Let's say we want to solve a linear regression problem by choosing the slope and bias that minimize the sum of squared errors. As an example, let the points be $x=[1,2,3]$ and $y=[1,2,2]$.
To solve this problem with linear algebra, I would find the orthogonal projection of $b$ onto the column space of $A$: $p = A\hat{x}$, where $\hat{x} = (A^TA)^{-1}A^Tb$. You can see the solution here.
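For the example points above, writing the data values $1,2,3$ into the first column of $A$ and ones into the second (this column ordering is my own convention; the linked solution may arrange it differently), I get

$$A = \begin{pmatrix}1 & 1\\ 2 & 1\\ 3 & 1\end{pmatrix},\quad b = \begin{pmatrix}1\\ 2\\ 2\end{pmatrix},\quad A^TA = \begin{pmatrix}14 & 6\\ 6 & 3\end{pmatrix},\quad A^Tb = \begin{pmatrix}11\\ 5\end{pmatrix},\quad \hat{x} = (A^TA)^{-1}A^Tb = \begin{pmatrix}1/2\\ 2/3\end{pmatrix},$$

so the slope would be $1/2$ and the bias $2/3$.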
However, this quadratic minimization problem can also be solved by taking partial derivatives and finding the local minimum of the error function.
These steps are shown in this answer, although I wasn't able to understand the mechanism.
I know that the minimization problem can be written as $||Ax-b||^2$, where the columns of $A$ form a basis for the column space, $x$ holds the unknown linear coefficients (since we are working with a linear function), and $b$ is the vector of outputs of the function. Thus $Ax-b$ represents the error between the "predicted" values and the actual values.
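Written out component-wise (again assuming data values $a_i$ in the first column of $A$ and ones in the second), this objective is just the sum of squared errors:

$$||Ax-b||^2 = \sum_{i=1}^{n}(x_1 a_i + x_2 - b_i)^2,$$

which is the function whose partial derivatives appear below.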
In the answer linked above, the author states that two partial derivatives must be set to zero and solved:
$\frac{\partial}{\partial x_1}||Ax-b||^2 = 0$ and $\frac{\partial}{\partial x_2}||Ax-b||^2 = 0$.
Suppose we have $n$ datapoints $a_i$; then, if $x_1$ represents the slope of the function and $x_2$ represents the bias, we would have:
$\frac{\partial}{\partial x_1}||Ax-b||^2 = 2\sum_{i=1}^{n}a_i(x_1a_i+x_2-b_i) = 0$
and
$\frac{\partial}{\partial x_2}||Ax-b||^2 = 2\sum_{i=1}^{n}(x_1a_i+x_2-b_i) = 0$
Multiplying the first equation by $x_1$, the second by $x_2$, and adding them finally implies:
$x_1\sum_{i=1}^{n}a_i(x_1a_i+x_2-b_i)+x_2\sum_{i=1}^{n}(x_1a_i+x_2-b_i) = 0$
$\implies \sum_{i=1}^{n} (x_1a_i+x_2)(x_1a_i+x_2-b_i)=0=Ax\cdot (Ax-b)$
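To convince myself numerically, here is a small sketch (Python with NumPy; the variable names are mine) that solves the two partial-derivative equations for the example data, which for these points reduce to $14x_1 + 6x_2 = 11$ and $6x_1 + 3x_2 = 5$, and checks the orthogonality condition $Ax\cdot (Ax-b)=0$:

```python
import numpy as np

# Example data: a_i are the inputs, b_i the observed outputs.
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 2.0])

# Design matrix: first column holds a_i (slope term), second column holds ones (bias term).
A = np.column_stack([a, np.ones_like(a)])

# Setting both partial derivatives to zero gives the 2x2 linear system A^T A x = A^T b
# (for this data: 14*x1 + 6*x2 = 11 and 6*x1 + 3*x2 = 5), which we solve directly.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
print("slope x1 =", x_hat[0], ", bias x2 =", x_hat[1])   # expect 0.5 and 0.666...

# Same result as the projection formula (A^T A)^{-1} A^T b.
x_proj = np.linalg.inv(A.T @ A) @ A.T @ b
print("matches projection formula:", np.allclose(x_hat, x_proj))

# At the minimum the residual Ax - b is orthogonal to Ax, i.e. Ax . (Ax - b) = 0.
residual = A @ x_hat - b
print("Ax . (Ax - b) =", np.dot(A @ x_hat, residual))   # approximately 0
```

At least for this data, both routes give the same slope $1/2$ and bias $2/3$.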
This solution leaves me with some questions:

1. Why are there two derivatives, one for each of $x_1$ and $x_2$?
2. Why are partial derivatives used (is it because we are working in higher dimensions)?
3. Why is there a constant $2$ in front of the summation in $\frac{\partial}{\partial x_1}||Ax-b||^2 = 2\sum_{i=1}^{n}a_i(x_1a_i+x_2-b_i) = 0$ (isn't this just the usual power rule for derivatives)?

In short, how can a minimization problem be solved with partial derivatives?
Thank you!