
I am not a mathematician, so I need to understand what SVD does and WHY, more than exactly how it works from the math perspective. (I do at least understand what the decomposition is, though.)

This guy on YouTube gave the only human explanation of SVD, saying that the U matrix maps the "user to concept" correlation, the Sigma matrix defines the strength of each concept, and V maps the "movie to concept" correlation, given that the initial matrix M has users in the rows and movie ratings in the columns.

He also mentioned two concepts specifically: "sci fi" and "romance" movies. See the picture below.

[figure: the example user-movie rating matrix M and its SVD factors from the video]

My questions are:

  1. How does SVD know the number of concepts? He, as a human, mentioned two (sci fi and romance), but in reality the resulting matrices contain 3 concepts (for example, matrix U, the one with blue titles, has 3 columns, not 2).

  2. How does SVD know what a concept is at all? I mean, if I shuffle the columns randomly, how does SVD then know what is sci fi and what is romance? I suppose there is no rule that the concepts must be grouped together in column order. What if the sci fi movies are the first and the last columns, and not the first 3 columns of the initial matrix M?

  3. What is the practical use of the U, Sigma, or V matrices on their own? (Other than that you can multiply them together to get back the initial matrix M.)

  4. Is there any other possible human explanation of SVD besides the one the guy above provided ("matrices of correlations"), or is that the only possible one?

luky
    There is an important aspect missing from the example presented: when applying principal component analysis (which is what the speaker is doing), it is important to subtract away the column mean so that $A$ has a mean column of zero. – Ben Grossmann Mar 23 '21 at 18:12
  • If you look at the picture in the top left of the page I linked (PCA of a multivariate Gaussian distribution), you get what I find to be a better idea of what the SVD actually detects. The vectors are really trend directions in the data. The fact that the SVD picks out the "scifi concept" means that there is a group of users that gives scifi movies relatively high ratings and that this effect (which happens to be tied to scifi movies) is the one that has the largest impact on the data. – Ben Grossmann Mar 23 '21 at 18:36
  • @BenGrossmann I fully agree. It must be said that principal component analysis is one way to look at the SVD, very useful for statisticians. But this concept is multi-faceted and you can arrive at it in different ways. One of them is to consider it as the best possible extension of the eigenvalue/eigenvector decomposition to rectangular matrices. – Jean Marie Mar 23 '21 at 18:39
  • @JeanMarie I don't mean to imply that this is how I think of SVD; I like the geometric (Rayleigh-Ritz/ellipsoid) interpretation of singular values. I just mean that my second comment is closer to the truth for the use of SVD in the statistical context. – Ben Grossmann Mar 23 '21 at 18:42
  • @BenGrossmann : In fact, I was agreeing with your first comment. – Jean Marie Mar 23 '21 at 18:43
  • @luky You might find the explanations on this post to be useful – Ben Grossmann Mar 23 '21 at 18:44
  • @JeanMarie I see now, thanks for clarifying – Ben Grossmann Mar 23 '21 at 18:44
  • @luky Also, you might like the "intuitive interpretations" section of the SVD wiki article. I particularly like the gif used in that section. – Ben Grossmann Mar 23 '21 at 18:47
  • @BenGrossmann Hi Ben, I will read everything you wrote. By the way, I tried to shuffle the columns randomly, and it seems that the user-to-concept correlation values and also the Sigma matrix stay the same, which is quite magical to me. It doesn't matter whether the "sci fi" movies are the first 2 columns or the first and the last one; the correlation numbers are still the same. That is quite amazing, if I am not mistaken; it would mean that SVD really is able to identify patterns in the data. – luky Mar 23 '21 at 18:49
  • @BenGrossmann Can you please say a little bit more about the mean column of zero? Does it mean each column needs to have mean zero, or the columns in total, and why? Thanks. – luky Mar 23 '21 at 18:57
  • @luky It is indeed able to identify these patterns. However, the strength of these patterns is underestimated because of the skewing that results from the mean column being non-zero. I would suggest that you repeat your experiment, but this time take each column and subtract the overall mean. – Ben Grossmann Mar 23 '21 at 18:57
  • @luky For some intuition, consider the picture that I reference in my second comment and shift all of the points away from $(0,0)$. The large arrow can be interpreted as a line of best fit for the data cloud, with a constraint that this line must pass through the point $(0,0)$. If the data cloud is shifted, note that it is possible to change the slope of this line of best fit and thereby change the first column of the matrix $U$ from the SVD. – Ben Grossmann Mar 23 '21 at 19:02
  • @luky How have you been computing these SVD's? Are you using Matlab? Python? – Ben Grossmann Mar 23 '21 at 19:18
  • @BenGrossmann Yes, Python; there it is easy, just two lines of code like M = np.array([[5,5,0,0,0,0], [0,0,0,0,5,5]]); np.linalg.svd(M) – luky Mar 23 '21 at 19:19
  • @BenGrossmann In my project involving NLP I was using TruncatedSVD from sklearn for dimensionality reduction (they say it is used for LSA, latent semantic analysis), but I wanted to understand how SVD works. I understand, though, that truncated SVD is somewhat different from classical SVD, and that the dimensionality reduction probably happens there, but first I wanted to understand the ordinary SVD as well. – luky Mar 23 '21 at 19:21
  • @BenGrossmann I see that sklearn mentions your idea about removing the mean from the data: "Contrary to PCA, this estimator does not center the data before computing the singular value decomposition." https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html – luky Mar 23 '21 at 19:26
  • @BenGrossmann It seems this video will be helpful: https://www.youtube.com/watch?v=yA66KsFqUAE – luky Mar 23 '21 at 20:09
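
To make the column-shuffling experiment from the comments concrete, here is a minimal sketch using the toy matrix from luky's comment (the particular permutation below is an arbitrary choice). Permuting the columns multiplies M on the right by a permutation matrix, which is orthogonal, so M M^T and hence the singular values are unchanged:

import numpy as np

# Toy user-movie matrix from the comments (rows = users, columns = movies)
M = np.array([[5, 5, 0, 0, 0, 0],
              [0, 0, 0, 0, 5, 5]])

# Singular values of the original matrix
s_original = np.linalg.svd(M, compute_uv=False)

# Shuffle the columns with some fixed permutation (any permutation works)
perm = [3, 0, 5, 1, 4, 2]
s_shuffled = np.linalg.svd(M[:, perm], compute_uv=False)

print(s_original)                            # [7.071..., 7.071...]
print(np.allclose(s_original, s_shuffled))   # True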

3 Answers


Here is a way to understand, from a different point of view, what the SVD means, using an algorithm based on a balanced weighting between rows and columns.

I will use two slides from the Linear Algebra lectures I have been giving for many years (adapted from their French version):

[First slide]

It deals with the following SVD:

$$\underbrace{\begin{pmatrix} 0&1\\1&2\\3&3\end{pmatrix}}_A = \underbrace{\begin{pmatrix} \color{red}{0.1595} & \ \ 0.7077 & \ \ 0.6882\\ \color{red}{0.4520} & \ \ 0.5675 & -0.6882\\ \color{red}{0.8776} & -0.4208 & \ \ 0.2294\end{pmatrix}}_U \underbrace{\begin{pmatrix} \ \ \color{red}{4.8146} & 0\\ 0 & 0.9054\\ 0 & 0\end{pmatrix}}_{\Sigma} \underbrace{\begin{pmatrix}\color{red}{0.6407} & -0.7678\\ \color{red}{0.7678} & \ \ 0.6407\end{pmatrix}^T}_{V^T}$$

This algorithm (not the most efficient) produces, when stopped after a certain number of steps, a numerical approximation of the first singular element $(U_1,V_1,\sigma_1)$ where $U_1$ and $V_1$ are the first columns of $U$ and $V$ resp. and $\sigma_1$ the dominant singular value.
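
(The slide itself is not reproduced here, so the following is only an assumed illustration, not necessarily the exact algorithm from the lecture: a minimal numpy sketch of the classical alternating scheme, which repeatedly sends a guess through $A$ and $A^T$ and renormalizes. Run on the matrix $A$ above, it converges to the dominant singular element.)

import numpy as np

A = np.array([[0., 1.],
              [1., 2.],
              [3., 3.]])

# Alternately balance between the column space and the row space of A:
#   u <- A v / ||A v||,   v <- A^T u / ||A^T u||
v = np.ones(A.shape[1]) / np.sqrt(A.shape[1])   # arbitrary starting vector
for _ in range(50):                             # a few dozen steps suffice here
    u = A @ v
    u /= np.linalg.norm(u)
    v = A.T @ u
    sigma1 = np.linalg.norm(v)                  # estimate of the dominant singular value
    v /= sigma1

print(sigma1)   # ~ 4.8146
print(u)        # ~ +/- (0.1595, 0.4520, 0.8776), the first column of U
print(v)        # ~ +/- (0.6407, 0.7678), the first column of V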

Why does this "alternate balancing" work? Here is an explanation.

[Second slide]

In fact, it is not surprising that the matrices $AA^T$ and $A^TA$ have "popped up". We recover a property that has long been used to introduce the SVD:

If $A$ is $m \times n$ with $m>n$, the $m \times m$ matrix $AA^T$ and the $n \times n$ matrix $A^TA$, being square symmetric positive semi-definite matrices, have the respective eigendecompositions:

$$AA^T=U \Sigma' U^T \ \ \ \& \ \ \ A^TA=V \Sigma'' V^T\tag{1}$$

with $U$ and $V$ orthogonal matrices and the same eigenvalues (apart from the zero eigenvalues). More precisely, $\Sigma''$ is the upper diagonal block of $\Sigma'$, the remaining (diagonal) entries of $\Sigma'$ being zero.

Remarks:

  1. Of course, (1) is an immediate consequence of multiplying

$$A=U \Sigma V^T \ \ \ \text{by} \ \ \ A^T=(U \Sigma V^T)^T=V \Sigma^T U^T$$

due to the fact that $U^TU=I$ and $V^TV=I$ (orthogonality property).

  2. Once $(U_1,V_1,\sigma_1)$ has been obtained, a so-called "deflation" $A'=A-\sigma_1U_1V_1^T$ is applied to matrix $A$; then the same algorithm is applied to $A'$ in order to get the second singular element $(U_2,V_2,\sigma_2)$ (a small numerical illustration is given after these remarks). From a numerical point of view, there is a drawback for big matrices: deflation accumulates errors. But efficiency and precision are not our main concerns here; our objective is a better insight into what the SVD really is...

  3. Matrix $\Sigma$ is a kind of "tradeoff" between $\Sigma'$ and $\Sigma''$: its nonzero diagonal entries are the square roots of their common nonzero eigenvalues.
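
As a quick numerical check of (1) and of the deflation described in remark 2 (this is just an illustration with numpy on the matrix $A$ from the first slide, not the lecture's algorithm):

import numpy as np

A = np.array([[0., 1.],
              [1., 2.],
              [3., 3.]])

U, s, Vt = np.linalg.svd(A)   # s ~ (4.8146, 0.9054)

# The eigenvalues of A A^T and A^T A are the squared singular values
# (A A^T has one extra zero eigenvalue since it is 3 x 3).
print(np.sort(np.linalg.eigvalsh(A @ A.T))[::-1])   # ~ (23.18, 0.82, 0.00)
print(np.sort(np.linalg.eigvalsh(A.T @ A))[::-1])   # ~ (23.18, 0.82)
print(s ** 2)                                       # ~ (23.18, 0.82)

# Deflation: subtract the first singular component; the dominant singular
# value of the deflated matrix is the second singular value of A.
A_deflated = A - s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.linalg.svd(A_deflated, compute_uv=False))  # ~ (0.9054, 0.0)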

Jean Marie

A matrix, $A \in \mathbb{R}^{M \times N}$, maps vectors in $\mathbb{R}^N$ to vectors in $\mathbb{R}^M$ via matrix multiplication, $$v \mapsto A v$$ If you multiply the matrix with every vector on the unit sphere in $\mathbb{R}^N$, the resulting vectors you get out will form an ellipsoid in $\mathbb{R}^M$. The primary axes of the ellipsoid in $\mathbb{R}^M$ are orthogonal. Also, it is possible to prove that the input vectors that get mapped to those primary axes are orthogonal in $\mathbb{R}^N$.

[figure: the unit sphere in $\mathbb{R}^N$ is mapped by $A$ to an ellipsoid in $\mathbb{R}^M$]

The singular value decomposition $A = U \Sigma V^T$, is an encoding of the sphere and the ellipsoid.

  • The left singular vectors, $u_i$, are the primary axes of the ellipsoid, scaled to unit length.
  • The singular values, $\sigma_i$, are the lengths of the primary axes of the ellipsoid.
  • The right singular vectors, $v_i$, are the unit vectors on the sphere that get mapped to the primary axes of the ellipsoid.

[figure: the SVD encodes the sphere and ellipsoid information]

This was all assuming the dimensions of the input and output are the same. But the idea extends to the case where dimensions of the input and output are different ($M \neq N$).

  • If the input dimension is larger than the output dimension ($N > M$), then there are $N-M$ orthogonal vectors on the unit sphere that get squashed to zero, but in the other $M$ complementary directions everything is the same as described above.

  • If the output dimension is larger than the input dimension ($M > N$), then the output is still an ellipsoid, but it is contained in an $N$-dimensional hyperplane and is totally flat in the $M-N$ directions complementary to that hyperplane (a small numerical sketch follows below).
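
As a small numerical sketch of this picture (an arbitrary random $3 \times 2$ matrix, so $M > N$; just an illustration of the statements above, using numpy's SVD):

import numpy as np

A = np.random.randn(3, 2)      # maps R^2 to R^3, so M = 3 > N = 2
U, s, Vt = np.linalg.svd(A)    # full U is 3 x 3, Vt is 2 x 2

for i in range(len(s)):
    # A sends the i-th right singular vector to a vector of length sigma_i
    # along the i-th primary axis u_i of the ellipsoid.
    print(np.allclose(A @ Vt[i], s[i] * U[:, i]))   # True

# The right singular vectors are orthonormal directions on the unit sphere in R^2,
# the left singular vectors are orthonormal directions in R^3, and the third
# column of U spans the direction in which the flattened ellipsoid has zero extent.
print(np.allclose(Vt @ Vt.T, np.eye(2)), np.allclose(U @ U.T, np.eye(3)))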

Nick Alger

I've made the following code to illustrate what I mean about the line of best fit. I hope you find it useful.

import numpy as np
import scipy.linalg as la
import matplotlib.pyplot as plt

# Adjustable parameters
N = 20           # Number of data points
var1 = 3         # Horizontal stretch factor
var2 = 5         # Vertical stretch factor
mean1 = 0        # Horizontal mean (distribution)
mean2 = 1        # Vertical mean
line_scale = 5   # Determines length of best fit line

# Generate data
R = np.random.rand(2, N)           # Random matrix to generate data
R = np.random.normal(size=(2, N))  # (overrides the line above with normally distributed data)
A = (R - .5) * np.asarray([[var1], [var2]]) + np.asarray([[mean1], [mean2]])

# Plot data
plt.scatter(A[0, :], A[1, :])                 # Plot data points
mu = np.average(A, axis=1)
plt.scatter(mu[0], mu[1], marker='*', s=250)  # Plot data mean (as orange star)

# Find best-fit line using SVD
U = la.svd(A)[0]
v = np.reshape(U[:, 0], (-1, 1))
B = np.hstack((-v, v)) * line_scale
plt.plot(B[0, :], B[1, :])

# Uncomment to set x and y scales equal
# plt.xlim((-5, 5))
# plt.ylim((-5, 5))

Here's an example output. The trendline inferred by the SVD slopes upward, whereas the true trendline should be more horizontal. This can be seen as the effect of the mean (the orange star).

[figure: example output]


Update: this version includes the true best-fit line (which requires zeroing the column mean):

import numpy as np
import scipy.linalg as la
import matplotlib.pyplot as plt

# Adjustable parameters
N = 20           # Number of data points
var1 = 3         # Horizontal stretch factor
var2 = 5         # Vertical stretch factor
mean1 = 0        # Horizontal mean (distribution)
mean2 = 1        # Vertical mean
line_scale = 3   # Determines length of best fit line

# Generate data
R = np.random.rand(2, N)           # Random matrix to generate data
R = np.random.normal(size=(2, N))  # (overrides the line above with normally distributed data)
A = (R - .5) * np.asarray([[var1], [var2]]) + np.asarray([[mean1], [mean2]])

# Plot data
plt.scatter(A[0, :], A[1, :])                 # Plot data points
mu = np.average(A, axis=1)
plt.scatter(mu[0], mu[1], marker='*', s=250)  # Plot data mean (as orange star)
mu = np.reshape(mu, (-1, 1))

# Find best-fit 1-D subspace using SVD
U = la.svd(A)[0]
v = np.reshape(U[:, 0], (-1, 1))
B = np.hstack((-v, v)) * line_scale
plt.plot(B[0, :], B[1, :])

# Find best-fit line (1-D affine subspace) using PCA
M = A - mu
U = la.svd(M)[0]
w = np.reshape(U[:, 0], (-1, 1))
C = mu + np.hstack((-w, w)) * line_scale
plt.plot(C[0, :], C[1, :])

# Uncomment to set x and y scales equal
# plt.xlim((-5, 5))
# plt.ylim((-5, 5))

Sample output:

[figure: sample output]

Ben Grossmann
  • thank you i will check it :) – luky Mar 23 '21 at 20:10
  • Very valuable comparison between SVD and PCA. On my side, I have attempted to convey an algorithmic feeling of what SVD is as well. – Jean Marie Mar 23 '21 at 23:02
  • @BenGrossmann Hi Ben, may I ask you something? I used TruncatedSVD to create a matrix of shape (400k, 100) (100 SVD components, I guess) and then I wanted to use cosine similarity. I even tried an optimized version from GitHub that runs multi-threaded, but it ran for 1.5 hours on 29 CPU cores. For comparison, without TruncatedSVD (LSA / PCA on text classification) it takes only 10 minutes. Do you have any idea why it is so slow on the dense matrix with 100 dimensions produced by SVD? It is extremely slow for 400k rows and 29 CPUs. Thanks. – luky Mar 25 '21 at 18:06
  • 100 columns, not 100 dimensions. (400k, 100) – luky Mar 25 '21 at 18:12
  • I don't understand your explanation of what you did so I don't have an explanation. I would suggest that you make a new post about it on Cross-validated and include the relevant code – Ben Grossmann Mar 25 '21 at 18:22