
We normally have fairly large datasets to model on, just to give you an idea:

  • over 1M features (sparse; on average only about 12% of features are populated);
  • over 60M rows.

A lot of modeling algorithms and tools don't scale to such wide datasets.

So we're looking for a dimensionality reduction implementation that runs in a distributed fashion (i.e., on Spark/Hadoop, etc.). We aim to bring the number of features down to several thousand.

Since PCA relies on matrix multiplications, which don't distribute well over a cluster of servers, we're looking at other algorithms, or possibly at other implementations of distributed dimensionality reduction.

Has anyone run into similar issues? How did you solve this?

There is a Cornell/Stanford paper, "Generalized Low-Rank Models" (http://web.stanford.edu/~boyd/papers/pdf/glrm.pdf), that speaks to this specifically:

  1. page 8, "Parallelizing alternating minimization", describes how it can be distributed;
  2. page 9, "Missing data and matrix completion", describes how sparse/missing data can be handled.

GLRM seems to be exactly what we are looking for, but we can't find good, usable implementations of these ideas.
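For intuition, the "parallelizing alternating minimization" idea from that paper can be sketched in plain NumPy for the quadratic-loss special case (i.e., plain ALS). This is an illustrative single-machine toy, not a distributed implementation; the point is that each row/column update is an independent least-squares problem, which is what makes the algorithm distributable:

```python
import numpy as np

def als_low_rank(A, k, n_iter=20, reg=0.1, seed=0):
    """Approximate A (m x n) as X @ Y with X (m x k) and Y (k x n)
    via alternating ridge-regression updates (quadratic-loss GLRM)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = rng.standard_normal((m, k))
    Y = rng.standard_normal((k, n))
    Ik = reg * np.eye(k)
    for _ in range(n_iter):
        # With Y fixed, each row of X is an independent least-squares
        # problem -- these could be solved in parallel across workers.
        X = np.linalg.solve(Y @ Y.T + Ik, Y @ A.T).T
        # Likewise, with X fixed, each column of Y is independent.
        Y = np.linalg.solve(X.T @ X + Ik, X.T @ A)
    return X, Y
```

The `X @ Y` product is the low-rank approximation; `X` itself is the reduced (k-dimensional) representation of the rows.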

Update 7/15/2018: Another paper is Facebook's fast randomized SVD (see http://tygert.com/spark.pdf), along with the idea of doing low-rank matrix approximation using ALS (http://tygert.com/als.pdf). There is no clear way to use them yet, though - see the discussion at https://github.com/facebook/fbpca/issues/6
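For reference, the core of the randomized SVD these papers build on (the standard randomized range-finder with subspace iteration) fits in a few lines of NumPy. This is a single-machine sketch; in a cluster setting, the `A @ Omega` and `A.T @ Q` products are the pieces that would be distributed over row blocks of `A`:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=4, seed=0):
    """Approximate rank-k SVD of A via a randomized range-finder."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random test matrix; A @ Omega captures the dominant column space.
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega
    # Subspace (power) iterations sharpen the estimate when the
    # spectrum decays slowly; re-orthonormalize for stability.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Y)
        Y = A @ (A.T @ Q)
    Q, _ = np.linalg.qr(Y)
    # Deterministic SVD on the small (k+oversample) x n matrix.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k, :]
```

Projecting the rows onto `Vt.T` (or just using `U * s`) gives the reduced representation.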

Any other ideas on how to tackle this? Any other available GLRM or other distributed dimensionality reduction implementations?

Tagar

2 Answers


There is principal component analysis (PCA) in Spark's Machine Learning Library (MLlib).

Brian Spiering

From the problem description, what strikes me as most relevant is an autoencoder (the classic hourglass-shaped, "X-wing"-like architecture). Basically you have two neural networks, the encoder and the decoder, and each can use any of the popular architectures: fully connected layers, convolutional and pooling layers, or even sequential units like LSTM/GRU. If the encoding dimension is much smaller than the original one, it can serve as a lower-dimensional representation of the input; the decoder is used to recover the original dimension/information. There are many types of autoencoders, but for this use case you could take a look at sparse and denoising autoencoders. You can read more about autoencoders in the deep learning book: https://www.deeplearningbook.org/contents/autoencoders.html
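To make the bottleneck idea concrete, here is a deliberately tiny linear autoencoder in plain NumPy (a sketch only; in practice you would build this in Keras/TensorFlow with nonlinearities, and the data shapes, latent size, and learning rate below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 50 features, generated from a 5-dim latent space.
Z_true = rng.standard_normal((200, 5))
W_true = rng.standard_normal((5, 50))
X = np.tanh(Z_true @ W_true)

n_in, n_code = X.shape[1], 5                      # bottleneck = reduced dim
W1 = rng.standard_normal((n_in, n_code)) * 0.1    # encoder weights
W2 = rng.standard_normal((n_code, n_in)) * 0.1    # decoder weights
lr = 0.02

def forward(X):
    code = X @ W1       # encoder: the low-dimensional representation
    recon = code @ W2   # decoder: reconstruct the original features
    return code, recon

losses = []
for _ in range(1000):
    code, recon = forward(X)
    err = recon - X                       # gradient of the squared error
    grad_W2 = code.T @ err / len(X)
    grad_W1 = X.T @ (err @ W2.T) / len(X)
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
    losses.append(float(np.mean(err ** 2)))
```

After training, `X @ W1` is the reduced representation; the reconstruction loss tells you how much information the chosen bottleneck size loses.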

I don't really understand why the training process definitely needs to be distributed, but even for that there are distributed implementations of TensorFlow, so you could do some research in the TensorFlow docs. Also, if you want to learn a new framework, Uber's Horovod is a distributed framework for training TensorFlow models: https://github.com/uber/horovod

One last comment I would like to make is about the underlying dimension of the data. You mentioned that an acceptable dimension would be in the thousands. In my experience, sparse data tend to lie on much lower-dimensional manifolds. So I would suggest treating the encoding dimension as a hyperparameter and optimizing for the corresponding loss function.