
In Deep Learning (Goodfellow et al.), the optimization objective of PCA is formulated as

$$D^* = \arg\min_D ||X - XDD^T||_F^2 \quad \text{s.t.} \quad D^T D=I$$

The book gives the proof for the $1$-dimensional case, i.e. that

$$\arg\min_{d} || X - X dd^T||_F^2 \quad \text{s.t.} \quad d^T d = 1$$

is solved by the eigenvector of $X^TX$ with the largest eigenvalue, and the author says the general case (when $D$ is an $m \times l$ matrix with $l>1$) can easily be proved by induction.

Could anyone please show me how I can prove that using induction?

I know that when $D^T D = I$:

$$D^* = \arg\min_D ||X - XDD^T||_F^2 = \arg\max_D \operatorname{tr}\left(D^T X^T X D\right)$$

and

$$\operatorname{tr}\left(D^T X^T X D\right) = \left(\sum_{i=1}^{l-1} \left(d^{(i)}\right)^T X^TX d^{(i)}\right) + \left(d^{(l)}\right)^T X^TX d^{(l)}$$

where, by the induction hypothesis, the first summand reaches its maximum when $d^{(i)}$ is the eigenvector of $X^T X$ with the $i$th largest eigenvalue. But how can I be sure that the sum as a whole is also maximal?
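As a sanity check (not a proof), both facts can be verified numerically. A minimal NumPy sketch, with arbitrary matrix sizes chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, l = 50, 6, 3                # samples, features, target dimension (arbitrary)
X = rng.normal(size=(m, n))

# Eigendecomposition of X^T X, eigenvalues sorted in decreasing order.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

def recon_error(D):
    return np.linalg.norm(X - X @ D @ D.T, 'fro') ** 2

def trace_obj(D):
    return np.trace(D.T @ X.T @ X @ D)

D_top = eigvecs[:, :l]            # top-l eigenvectors of X^T X

# Identity ||X - X D D^T||_F^2 = tr(X^T X) - tr(D^T X^T X D) for any D with D^T D = I.
Q, _ = np.linalg.qr(rng.normal(size=(n, l)))   # a random orthonormal D
assert np.isclose(recon_error(Q), np.trace(X.T @ X) - trace_obj(Q))

# The top-l eigenvectors should never do worse than a random orthonormal D.
assert recon_error(D_top) <= recon_error(Q) + 1e-9
print(recon_error(D_top), recon_error(Q))
```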

  • Perhaps the idea is this: define $X'^TX' = W'^T\Lambda' W'$, where $\Lambda'$ and $W'$ are identical to $\Lambda$ and $W$ from $X^TX = W^T\Lambda W$ except that the largest eigenvalue and its eigenvector are removed. Solving the $1$-dimensional problem for $X'$ then gives a $d'$ which is again the eigenvector of $X'^TX'$ with the largest eigenvalue, i.e. the eigenvector of the original matrix $X^TX$ with the second largest eigenvalue. – Kentzo Nov 28 '18 at 00:46
  • Is there a proof of the $l=1$ case as well? – user 6663629 Oct 14 '19 at 04:55

3 Answers


We will start from $$\begin{align} D^* &= \underset{D}{\arg\max}\;\operatorname{Tr}\left(D^TX^TXD\right)\\ &= \underset{D}{\arg\max}\left[\operatorname{Tr}\left(D_{l-1}^TX^TXD_{l-1}\right) + d^{(l)T}X^TXd^{(l)}\right] \end{align} $$ where we use the notation $D_{k}$ to denote the matrix formed by the first $k$ columns of $D$.

The two summands share no common columns of $D$ and hence can be maximized independently, subject to the constraints that $D_{l-1}$ has orthonormal columns and that $d^{(l)}$ is a unit vector orthogonal to all columns of $D_{l-1}$.

Using the induction hypothesis, we conclude that $\operatorname{Tr}\left(D_{l-1}^TX^TXD_{l-1}\right)$ (with the constraint that the columns of $D_{l-1}$ are orthonormal) is maximized when $D_{l-1}$ consists of the orthonormal eigenvectors corresponding to the $l-1$ largest eigenvalues.

Notation: Suppose $\lambda_1 \geqslant \dots \geqslant \lambda_n$ are the eigenvalues and $v_1, \dots, v_n$ are the corresponding orthonormal eigenvectors.
Denote $H_{l-1} = \operatorname{span}\{v_1, \dots,v_{l-1}\}$ and let $H_{l-1}^{\bot}$ be the orthogonal complement of $H_{l-1}$, i.e. $H_{l-1}^{\bot} = \operatorname{span}\{v_l,\dots,v_n\}$.

Lemma: $$\begin{align}\lambda_l &= \max_{d^{(l)}}\ d^{(l)T}X^TXd^{(l)} \quad \text{s.t.}\ \Vert d^{(l)}\Vert = 1,\ d^{(l)} \in H_{l-1}^\bot \\ &= v_l^TX^TXv_l \end{align}$$

Proof: Let $\Sigma = X^TX$. Because it is a symmetric positive semidefinite matrix, an eigendecomposition exists; let it be $\Sigma = V\Lambda V^T$, where the columns of $V$ are $v_1,\dots,v_n$ in that order, and hence $\Lambda=\operatorname{diag}(\lambda_1,\dots,\lambda_n)$.
$$ \begin{align} d^{(l)T}\Sigma d^{(l)} &= d^{(l)T} V\Lambda V^T d^{(l)}\\ &= q^T \Lambda\, q \qquad [\text{where } q = V^Td^{(l)}]\\ &= \sum_{i=1}^n q_i^2 \lambda_i \qquad [\text{where } q_i = (V^Td^{(l)})_i = v_i^T d^{(l)}]\\ &= \sum_{i=l}^n q_i^2 \lambda_i \qquad [\because d^{(l)} \in H_{l-1}^\bot \implies q_i = v_i^T d^{(l)} = 0\ \forall i < l]\\ \end{align} $$ Moreover, the coefficients $q_i^2$ sum to one: $$ \begin{align} \sum_{i=l}^n q_i^2 &= \sum_{i=1}^n q_i^2 = \Vert q \Vert^2 = \Vert V^T d^{(l)} \Vert^2 \\ &= \Vert d^{(l)} \Vert^2 \qquad [\because V, \text{ and hence } V^T\!, \text{ is orthogonal}] \\ &= 1 \qquad\qquad [\because \Vert d^{(l)} \Vert = 1] \end{align} $$

Therefore $d^{(l)T} \Sigma d^{(l)}$ is a convex combination of $\lambda_l,\dots,\lambda_n$, which is maximized by placing all the weight on $\lambda_l$, i.e. by taking $d^{(l)} = v_l$: $$\max_{d^{(l)}}\ d^{(l)T}\Sigma d^{(l)} = \max_{d^{(l)}}\ d^{(l)T}X^TXd^{(l)} = v_l^TX^TXv_l = \lambda_l \ \text{(qed)}$$
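Here is a small numerical illustration of the lemma (a NumPy sketch with an arbitrary random matrix; not part of the proof): among unit vectors in $H_{l-1}^\bot$, no random choice beats $v_l$, and $v_l$ attains $\lambda_l$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, l = 6, 3
X = rng.normal(size=(40, n))
Sigma = X.T @ X

lam, V = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]          # lambda_1 >= ... >= lambda_n, columns v_1, ..., v_n

# Sample random unit vectors in H_{l-1}^perp = span{v_l, ..., v_n} and check
# that none of them beats v_l on the quadratic form.
basis_perp = V[:, l-1:]                   # v_l, ..., v_n
best = -np.inf
for _ in range(10_000):
    c = rng.normal(size=basis_perp.shape[1])
    d = basis_perp @ (c / np.linalg.norm(c))   # unit vector in H_{l-1}^perp
    best = max(best, d @ Sigma @ d)

print(best, lam[l-1])                     # best <= lambda_l
assert best <= lam[l-1] + 1e-9
assert np.isclose(V[:, l-1] @ Sigma @ V[:, l-1], lam[l-1])   # v_l attains lambda_l
```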

We conclude that $D^*$ is obtained by augmenting $D_{l-1}$ with the column $v_l$, which completes the induction step.
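The induction also has a constructive reading, in the spirit of the deflation idea mentioned in the comments: build the columns of $D$ one at a time, each time taking the leading eigenvector of the covariance restricted to the orthogonal complement of the columns chosen so far. A NumPy sketch (the function name is mine, for illustration only):

```python
import numpy as np

def greedy_pca_directions(X, l):
    """Pick l directions one at a time, as in the induction step:
    at step k, maximize d^T X^T X d over unit d orthogonal to the
    previously chosen columns (equivalently, deflate X^T X)."""
    n = X.shape[1]
    Sigma = X.T @ X
    D = np.zeros((n, 0))
    for _ in range(l):
        # Project Sigma onto the orthogonal complement of span(D).
        P = np.eye(n) - D @ D.T
        lam, V = np.linalg.eigh(P @ Sigma @ P)
        d = V[:, np.argmax(lam)]          # leading eigenvector of the deflated matrix
        D = np.column_stack([D, d])
    return D

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
D = greedy_pca_directions(X, 3)

# The greedy choice attains the same objective as the top-3 eigenvectors of X^T X.
lam, _ = np.linalg.eigh(X.T @ X)
print(np.isclose(np.trace(D.T @ X.T @ X @ D), np.sort(lam)[::-1][:3].sum()))
```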


Suppose we need to project the data set $X$ onto a vector $u_{j}$, where $u_{j}$ is a unit vector, i.e. $u_{j}^{T}u_{j} = 1$. Discarding this dimension causes an error given by $J_{j} = \frac{1}{m}\left \| X^{T}u_{j} \right \|^{2} = \frac{1}{m}\left ( X^{T}u_{j} \right )^{T}\left ( X^{T}u_{j} \right ) = \frac{1}{m}u_{j}^{T}XX^{T}u_{j}$

Let $S$ denote $\frac{1}{m}XX^{T}$, and suppose we need to discard $t$ dimensions. The cost function is

$J = \sum_{j = n-t+1}^{n}u_{j}^{T}Su_{j} \quad \text{s.t.} \quad u_{j}^{T}u_{j} = 1 \text{ for each } j$

Using Lagrange multipliers (one $\lambda_j$ per constraint),

$\widetilde{J} = \sum_{j = n-t+1}^{n}\left[u_{j}^{T}Su_{j} + \lambda _{j}\left ( 1 - u_{j}^{T}u_{j} \right )\right]$

Setting the derivative with respect to $u_{j}$ to zero,

$\frac{\partial \widetilde{J}}{\partial u_{j}} = 2Su_{j} - 2\lambda _{j}u_{j} = 0$

Hence,

$Su_{j}=\lambda _{j}u_{j}$

Hence $u_{j}$ is an eigenvector of $S$ and $\lambda _{j}$ is the corresponding eigenvalue. Then,

$J = \sum_{j = n-t+1}^{n}u_{j}^{T}Su_{j} = \sum_{j = n-t+1}^{n}u_{j}^{T}\lambda _{j}u_{j} = \sum_{j = n-t+1}^{n}\lambda _{j}$

To minimize $J$, we should keep the directions with the large eigenvalues and discard those with the small ones (these are exactly the $\lambda _{j}$ in the previous equation).
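A quick numerical illustration of this conclusion (a NumPy sketch; the variable names are mine): discarding the $t$ eigenvectors of $S$ with the smallest eigenvalues never costs more than discarding a random orthonormal set of $t$ directions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, t = 5, 200, 2                      # features, samples, discarded dimensions
X = rng.normal(size=(n, m))              # columns are samples, as in this answer
S = X @ X.T / m

def J(U):
    """Cost of discarding the orthonormal directions in the columns of U."""
    return sum(u @ S @ u for u in U.T)

lam, V = np.linalg.eigh(S)               # eigenvalues in ascending order
U_small = V[:, :t]                       # eigenvectors with the t smallest eigenvalues

# Discarding the small-eigenvalue directions should never cost more than
# discarding a random set of t orthonormal directions.
U_rand, _ = np.linalg.qr(rng.normal(size=(n, t)))
print(J(U_small), lam[:t].sum(), J(U_rand))
assert np.isclose(J(U_small), lam[:t].sum())
assert J(U_small) <= J(U_rand) + 1e-9
```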

Hope this helps.

Reference: https://blog.csdn.net/Dark_Scope/article/details/53150883

– David

I am at the same point actually, and wondering if you found an answer.

My idea was to use Fan's theorem, which says:

Let $X \in S^n$ be a symmetric matrix with eigenvalues $\lambda_1 \geqslant \ldots \geqslant \lambda_n$; then

$\lambda_1+ \cdots + \lambda_k = \max \{\operatorname{Tr}(XDD^T): D \in \mathbb{R}^{n \times k},\ D^TD=I_k\}$
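A numerical check of this statement (a NumPy sketch; here a random symmetric $A$ stands in for the $X$ of the theorem, e.g. $X^TX$ in the original question): the top-$k$ eigenvectors attain $\lambda_1 + \cdots + \lambda_k$, and no random orthonormal $D$ exceeds it.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 3
A = rng.normal(size=(n, n))
A = (A + A.T) / 2                        # symmetric matrix playing the role of X in Fan's theorem

lam, V = np.linalg.eigh(A)
top_sum = np.sort(lam)[::-1][:k].sum()   # lambda_1 + ... + lambda_k

D_top = V[:, np.argsort(lam)[::-1][:k]]
assert np.isclose(np.trace(A @ D_top @ D_top.T), top_sum)

# No random orthonormal D should beat the eigenvector choice.
best = -np.inf
for _ in range(5_000):
    D, _ = np.linalg.qr(rng.normal(size=(n, k)))
    best = max(best, np.trace(A @ D @ D.T))
print(best, top_sum)                     # best <= top_sum
assert best <= top_sum + 1e-9
```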

– M. Ro