
I am a computer science researcher who has to learn some numerical linear algebra for my work. I have been struggling with the SVD and Moore-Penrose pseudoinverse as of late. I am trying to solve some problems to get more comfortable with what should probably be routine manipulations.

First of all, I have gone through similar questions on Stack Exchange, but I believe they were more general and not equivalent to mine. I am working in the framework where $A^{\dagger} = V\Lambda^{\dagger}U^T$; so, basically, I am using SVDs. The matrix $A$, of course, is identified with $U\Lambda V^T$.
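To keep myself honest about the conventions, here is a quick NumPy sanity check (variable names are mine) that $V\Lambda^{\dagger}U^T$ built from the SVD agrees with the library pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Thin SVD: A = U @ diag(s) @ Vt, i.e. A = U Lambda V^T with V = Vt.T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Lambda^dagger: invert the nonzero singular values (all nonzero here).
Lambda_pinv = np.diag(np.where(s > 1e-12 * s[0], 1.0 / s, 0.0))

A_pinv = Vt.T @ Lambda_pinv @ U.T  # V Lambda^dagger U^T
assert np.allclose(A_pinv, np.linalg.pinv(A))
```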


Problem

Consider the matrix equation $Ax=y$, where $A\in \Bbb R^{m\times n}$. The corresponding least squares problem is to find a least squares solution $x_{\text{LS}}$ that minimizes the Euclidean norm of the residual, i.e.,

$$\|Ax_{\text{LS}}-y\| = \min_{x \in \Bbb R^n} \|Ax-y\| = \min_{z \in \mbox{Ran}(A)}\|z-y\|$$

a) Show that $A^{\dagger}y$ is a least-squares solution and satisfies the normal equation $A^TAx=A^Ty$. Why is this solution special?

b) Show that $\ker(A^TA) = \ker(A)$.

c) Use the above results to deduce that $x \in \Bbb R^n$ is a least-squares solution if and only if it satisfies the normal equation.
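(Not a proof, but as a sanity check I verified a) numerically with NumPy before attempting the algebra: the pseudoinverse solution does satisfy the normal equation, and perturbing it never reduces the residual.)

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
y = rng.standard_normal(6)

x_ls = np.linalg.pinv(A) @ y  # candidate least-squares solution A^dagger y

# a) Normal equation: A^T A x = A^T y.
assert np.allclose(A.T @ A @ x_ls, A.T @ y)

# No perturbation of x_ls decreases the residual norm.
for _ in range(100):
    h = rng.standard_normal(4)
    assert np.linalg.norm(A @ (x_ls + h) - y) >= np.linalg.norm(A @ x_ls - y) - 1e-12
```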


Help on any or all of these parts is appreciated. I'd also appreciate links to relevant posts. Like I said, I've read similar questions but did not understand them as they were in a more general framework.

Edit: I have solved b). It didn't depend on a) as I had initially thought, and it is pretty straightforward to solve; see e.g. here: Prove that for a real matrix $A$, $\ker(A) = \ker(A^TA)$

Edit: I realize that part a) might be more involved than I had expected... Assuming parts a) and b), can someone help me with part c)?

2 Answers


The answers to a) and b) follow from Why does $\operatorname{null}(A) = \operatorname{null}(A^TA)$, intuitively? and Why does SVD provide the least squares and least norm solution to $Ax=b$?, as pointed out in the comments.

Let $\tilde{f}$ be a fixed minimizer of $$Q(f)=\|Af-g\|_K^2,$$ in which $$A\in \mathbb{R}^{m\times n},\quad f\in \mathbb{R}^{n}=H,\quad g\in \mathbb{R}^{m}=K.$$

Let $h\in H$, and note that $$Q(f+h)=Q(f)+\langle Af-g,Ah\rangle_K+\langle Ah,Af-g\rangle_K+\|Ah\|_K^2,\tag{1}$$ in which $\langle\cdot,\cdot\rangle$ denotes the inner product. In particular, $$Q(f+h)=Q(f),\qquad f\in H, \quad h\in \ker(A).$$

This and $(1)$ mean that $$Q(\tilde{f})\leq Q(\tilde{f}+h)=Q(\tilde{f})+2\langle A^T(A\tilde{f}-g),h\rangle_H+\|Ah\|_K^2,\qquad h\in H.$$

Since $\tilde{f}$ is a critical point of $Q$, it follows that $$\nabla Q(\tilde{f})=2A^T(A\tilde{f}-g)=0,$$ and so $(A\tilde{f}-g)\in \ker(A^T)=\operatorname{range}(A)^\perp$.
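If it helps to see the gradient formula concretely, here is a minimal finite-difference check in NumPy (the random setup is mine, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 3))
g = rng.standard_normal(7)
f = rng.standard_normal(3)

Q = lambda v: np.linalg.norm(A @ v - g) ** 2
grad = 2 * A.T @ (A @ f - g)  # the claimed gradient of Q at f

# Central finite differences along each coordinate direction.
eps = 1e-6
fd = np.array([(Q(f + eps * e) - Q(f - eps * e)) / (2 * eps) for e in np.eye(3)])
assert np.allclose(fd, grad, atol=1e-5)
```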

You can find more details, in a more general setting, in Theorem 1.1 of The Mathematics of Computerized Tomography. You can also find related results by searching for "$f=A^+g$" on SearchOnMath.


Here we show that $x=A^+b$, where $A^+$ is the Moore-Penrose pseudoinverse of $A$, solves the ordinary least squares problem (OLS) $$x=\operatorname{arg.min}_x\|Ax-b\|_2$$ where $A\in M(\mathbb{C},m, n)$ and $b\in\mathbb{C}^m$, and that if $x'$ is another solution, then $\|A^+b\|_2\leq\|x'\|_2$ (this is one reason that makes $A^+$ special).

Notice that if $b\in \operatorname{range}(A)$, then $x_*=A^+b$ is a solution to $Ax=b$ and has the smallest $\|\cdot\|_2$-norm.


First some linear algebra facts:

Using simple linear algebra, we know that if $R$ is a linear subspace of $\mathbb{C}^m$, then

  • (A) The orthogonal projection $P_R:\mathbb{C}^m\rightarrow R$, defined for each $\mathbf{y}\in \mathbb{C}^m$ as the vector $P_R\mathbf{y}\in R$ such that $$\|P_R\mathbf{y}-\mathbf{y}\|_2\leq \|\mathbf{z}-\mathbf{y}\|_2,\qquad \mathbf{z}\in R,$$ exists and is unique.
  • (B) Furthermore, for each $\mathbf{y}\in\mathbb{C}^m$, $P_R\mathbf{y}$ is the unique vector that satisfies $$\mathbf{y}- P_R\mathbf{y}\perp R$$ that is, $\langle \mathbf{y}- P_R\mathbf{y},\mathbf{z}\rangle=0$ for all $\mathbf{z}\in R$, where $\langle\cdot,\cdot\rangle$ is the standard inner product in $\mathbb{C}^m$ (hence the name orthogonal projection). A numerical illustration of (A) and (B) follows below.
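Here is a small NumPy illustration of these two facts, using the standard fact that $AA^+$ is the orthogonal projector onto $\operatorname{range}(A)$ (the random matrices are stand-ins of my own):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))
P = A @ np.linalg.pinv(A)  # orthogonal projector onto R = range(A)

y = rng.standard_normal(6)
w = rng.standard_normal(3)

# (B): the residual y - P y is orthogonal to every vector A w in R.
assert abs((y - P @ y) @ (A @ w)) < 1e-10

# An orthogonal projector is idempotent and self-adjoint.
assert np.allclose(P @ P, P) and np.allclose(P, P.T)
```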

Existence of solutions to the OLS

Consider the linear space $R:=\{A\mathbf{w}:\mathbf{w}\in\mathbb{C}^n\}$, that is, the range of $A$. The geometrical digression made above shows that there exists a unique vector $\widehat{\mathbf{y}}\in R$ (here $\mathbf{y}\in\mathbb{C}^m$ plays the role of $b$ above) such that $$\widehat{\mathbf{y}}=\operatorname{arg.min}_{\mathbf{z}\in R}\|\mathbf{y}-\mathbf{z}\|_2$$ namely the orthogonal projection of $\mathbf{y}$ onto $R$, $\widehat{\mathbf{y}}=P_R\mathbf{y}$. In particular, this shows that there is at least one $\boldsymbol{\omega}\in\mathbb{C}^n$ such that $\widehat{\mathbf{y}}=A\boldsymbol{\omega}$. In fact, for any solution $\mathbf{w}_H$ to the homogeneous equation $A\mathbf{w}=\mathbf{0}$ we have that $\widehat{\mathbf{y}}=A(\boldsymbol{\omega}+\mathbf{w}_H)$.

The problem is now reduced to finding an $\boldsymbol{\omega}\in \mathbb{C}^n$ for which $$\begin{align} \widehat{\mathbf{y}}=A\boldsymbol{\omega}\tag{1}\label{one} \end{align}$$

Now, as the range of the linear transformation $\mathbf{w}\mapsto A\mathbf{w}$ is generated by the columns of $A$, statement (B) shows that $\widehat{\mathbf{y}}=A\boldsymbol{\omega}$ is the unique vector in $R$ satisfying $$\begin{align} A^*(\mathbf{y}-A\boldsymbol{\omega})=\mathbf{0}\tag{2}\label{two} \end{align}$$ where $A^*$ is the conjugate transpose of $A$ (when $A\in M(\mathbb{R},m,n)$, then $A^*=A^\intercal$). Equivalently, any $\boldsymbol{\omega}\in\mathbb{C}^n$ satisfying \eqref{one} (which we have already established exists) satisfies the equation $$\begin{align} A^* A\boldsymbol{\omega}=A^*\mathbf{y}\tag{2'}\label{twop} \end{align}$$

  • If the rank of $A$ is $n$ (i.e. $m\geq n$ and the columns of $A$ are linearly independent), then $A^*A$ is an $n\times n$ invertible matrix and so there is a unique solution $$\boldsymbol{\omega}=(A^* A)^{-1}A^*\mathbf{y}$$ Conversely, if $A^* A$ is invertible, then the rank of $A$ is $n$.

  • In any other case, there will be infinitely many solutions to \eqref{twop} (still, $\widehat{\mathbf{y}}=P_R\mathbf{y}$ is unique), since then $\operatorname{rank}(A^* A)<n$, so the homogeneous equation $A^* A\mathbf{w}=\mathbf{0}$, and thus the homogeneous equation $A\mathbf{w}=\mathbf{0}$, has infinitely many solutions.

Solutions to \eqref{twop} with different properties (minimal norm, for example) can be obtained by appealing to generalized inverse matrices, as sketched below.
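A NumPy sketch of the two cases above (the matrices are random stand-ins of mine; in the rank-deficient case a duplicated column makes a kernel vector explicit):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 3))  # full column rank with probability 1
y = rng.standard_normal(8)

# Full-rank case: the normal equations have the unique solution (A*A)^{-1} A* y.
w = np.linalg.solve(A.T @ A, A.T @ y)
assert np.allclose(w, np.linalg.lstsq(A, y, rcond=None)[0])

# Rank-deficient case: duplicating a column puts k below in ker(B),
# so w0 + k also satisfies the normal equations (2').
B = np.column_stack([A, A[:, 0]])
w0 = np.linalg.pinv(B) @ y
k = np.array([1.0, 0.0, 0.0, -1.0])  # B @ k = 0
assert np.allclose(B.T @ B @ (w0 + k), B.T @ y)
```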


Minimal solution to the OLS problem and the Moore-Penrose pseudoinverse:

Recall that for any $A\in M(\mathbb{C},m, n)$, a matrix $X\in M(\mathbb{C}, n, m)$ is a generalized inverse matrix of $A$ if $$AXA=A$$ Generalized inverse matrices exist and, when $A$ is invertible, there is only one such matrix, namely $X=A^{-1}$.

Amongst all generalized inverse matrices of a matrix $A$, there is a very special one, denoted by $A^+$, which satisfies $$\begin{align} AXA&=A\\ XAX&=X\\ (AX)^*&=AX\\ (XA)^*&=XA\end{align}$$
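These four Penrose conditions are easy to test numerically; a minimal check with NumPy's pinv on a random rank-deficient example of my own:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))  # 5x4, rank <= 3
X = np.linalg.pinv(A)

assert np.allclose(A @ X @ A, A)      # AXA = A
assert np.allclose(X @ A @ X, X)      # XAX = X
assert np.allclose((A @ X).T, A @ X)  # (AX)* = AX
assert np.allclose((X @ A).T, X @ A)  # (XA)* = XA
```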

In terms of the singular value decomposition (SVD), if $A= VDU$, where $V$, $U$ are unitary matrices and $D$ has zero entries except on the main diagonal (which contains the nonnegative square roots of the eigenvalues of the matrix $A^*A$), then it is easy to see that $U^*D^+V^*$ is the Moore-Penrose pseudoinverse $A^+$ of $A$. It can also be shown that $$A^+=(A^* A)^+A^*$$
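To make this concrete, here is a sketch in this answer's $A=VDU$ notation (NumPy's svd returns the factors in that order under this naming; it calls them u, s, vh), together with a check of the identity $A^+=(A^*A)^+A^*$ on a full-rank example:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 3))

# In this answer's notation A = V D U.
V, s, U = np.linalg.svd(A, full_matrices=False)
D_pinv = np.diag(1.0 / s)  # full rank here, so every sigma_k > 0

A_pinv = U.conj().T @ D_pinv @ V.conj().T  # A^+ = U* D^+ V*
assert np.allclose(A_pinv, np.linalg.pinv(A))

# The identity A^+ = (A* A)^+ A*.
assert np.allclose(A_pinv, np.linalg.pinv(A.T @ A) @ A.T)
```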

There are different methods to show that $A^+$ exists and is unique; this unique matrix is the Moore-Penrose inverse of $A$. The most efficient way to compute $A^+$ is through eigenvalue methods, for example the singular value decomposition. Many computer packages (R, Octave, etc.) compute $A^+$ this way (usually coded as pinv(A)).

The most important property of $A^+$ is the following:

Theorem: The minimal norm solution to the problem $$\begin{align} \operatorname{min}_{\mathbf{x}\in\mathbb{C}^n}\|A\mathbf{x}-\mathbf{b}\|_2\tag{3}\label{three} \end{align}$$ is $A^+\mathbf{b}$. That is,
$$A^+\mathbf{b}=\operatorname{arg.min}\{\|x\|_2: x=\operatorname{arg.min}\|Ax-b\|_2\} $$

Here is a short proof: suppose $r=\operatorname{rank}(A)$, and let $A=VDU$ be a singular value decomposition of $A$, so that $D\in M(\mathbb{R},m,n)$ has diagonal entries $\sigma_1\geq\sigma_2\geq\ldots\geq\sigma_r>0$. $V$ and $U$ are unitary matrices that preserve distances in $\mathbb{C}^m$ and $\mathbb{C}^n$ respectively (or $\mathbb{R}^m$ and $\mathbb{R}^n$ if you are dealing with real spaces only), that is, $$\| Vy\|_2=\|y\|_2,\qquad \|Ux\|_2=\|x\|_2$$ for all $y\in\mathbb{C}^m$ and $x\in\mathbb{C}^n$. Thus, writing $c=V^*b$ and $y=Ux$, $$\begin{align} \rho=\inf_x\|Ax-b\|_2&=\inf_x\|VDUx-b\|_2=\inf_x\|V^*(VDUx-b)\|_2\\ &=\inf_x\|DUx-V^*b\|_2=\inf_{y}\|Dy-c\|_2 \end{align} $$ Notice that $$\|Dy-c\|^2_2=\sum^r_{k=1}|\sigma_ky_k-c_k|^2+\sum^m_{k=r+1}|c_k|^2$$ This is minimized when $y_k=c_k/\sigma_k$ for $1\leq k\leq r$, the remaining $y_{r+1},\ldots,y_n$ being arbitrary. Therefore $$\rho^2=\sum^m_{k=r+1}|c_k|^2$$ Amongst all vectors $y$ that minimize \eqref{three}, the one with $y_{r+1}=\ldots=y_n=0$ has minimum Euclidean norm. Clearly, this vector is given by $$ y=D^+c$$ where $D^+\in M(\mathbb{R},n, m)$ is the Moore-Penrose pseudoinverse of $D$, the matrix with the entries $\sigma^{-1}_1,\ldots,\sigma^{-1}_r$ in the first $r$ diagonal positions and zeroes everywhere else.

Putting this together, the minimal norm solution to the least squares problem is $$x=U^*y=U^*D^+c=U^*D^+V^*b=A^+b$$
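Here is the theorem checked numerically on a rank-deficient example of my own construction: every least-squares solution differs from $A^+b$ by a kernel vector, and none is shorter:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))  # 5x4, rank 3
b = rng.standard_normal(5)

x_star = np.linalg.pinv(A) @ b  # candidate minimal-norm least-squares solution

# Rows of Vt beyond the rank span ker(A).
_, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10 * s[0]))
for k in Vt[r:]:
    x_other = x_star + rng.standard_normal() * k
    assert np.allclose(A @ x_other, A @ x_star)  # same residual
    assert np.linalg.norm(x_star) <= np.linalg.norm(x_other) + 1e-12
```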


Summary:

  • As we can see, if $(A^* A)^g$ is a generalized inverse of $A^*A$, then $$\boldsymbol{\omega}=(A^* A)^gA^* \mathbf{y}$$ is a solution to \eqref{one}. Unless $A^* A$ is invertible, there are infinitely many generalized inverses of $A^* A$ and thus many solutions to \eqref{one}. (The kernel of $A$ provides the infinitely many solutions; see the numerical sketch after this summary.)

  • The particular choice $(A^* A)^+$, however, provides the minimal norm solution $\boldsymbol{\omega}$ to \eqref{one}, namely $$\widehat{\boldsymbol{\omega}}=(A^* A)^+A^*\mathbf{y}=A^+\mathbf{y}$$ This is typically the solution that is implemented in OLS algorithms.

  • When $A$ has full column rank, $(A^* A)^+=(A^*A)^{-1}$, and in this case $\widehat{\boldsymbol{\omega}}=A^+\mathbf{y}=(A^* A)^{-1}A^*\mathbf{y}$ is the unique solution to \eqref{one}.
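A numerical sketch of the first two bullets, using the standard fact that $G=M^+ + (I-M^+M)W$ is a generalized inverse of $M=A^*A$ for any $W$ (the random $W$ is my own stand-in):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))  # rank-deficient
y = rng.standard_normal(6)

M = A.T @ A
Mp = np.linalg.pinv(M)

# Another generalized inverse of M: G = M^+ + (I - M^+ M) W satisfies M G M = M.
W = rng.standard_normal((4, 4))
G = Mp + (np.eye(4) - Mp @ M) @ W
assert np.allclose(M @ G @ M, M)

w_plus = Mp @ A.T @ y  # = A^+ y, the minimal-norm solution
w_g = G @ A.T @ y      # another solution of the normal equations

assert np.allclose(w_plus, np.linalg.pinv(A) @ y)
assert np.allclose(A @ w_g, A @ w_plus)  # both realize the projection in (1)
assert np.linalg.norm(w_plus) <= np.linalg.norm(w_g) + 1e-12
```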


References:

  1. Albert, A. Regression and the Moore-Penrose Pseudoinverse, Academic Press, New York, 1972.
  2. Rao, C. R. and Mitra, S. K. Generalized Inverse of Matrices and Its Applications, John Wiley and Sons, New York, 1971.
  3. Cheney, W. and Kincaid, D. Numerical Analysis, Brooks/Cole Publishing, Pacific Grove, California, 1991.
  • +1. Nice exhaustive answer. Maybe you could explain (one sentence) how to construct $D^+$ explicitly before introducing $U^*D^+V^*$? Also missed a word in "nonnegative square roots of the eigenvalues of the matrix $A^*A$". – G Frazao Dec 30 '22 at 11:29
  • @GFrazao: Thanks! There are already other postings that show how $D^+$ is constructed. Here is one of mine for example – Mittens Dec 30 '22 at 11:34