I have a PySpark ML pipeline that uses PCA for dimensionality reduction followed by an ANN. My understanding is that PCA performs best on standardized values, while neural networks perform best on normalized values. Does it make sense to standardize the values before PCA and then normalize the PCA outputs before the ANN?
1 Answer
Since PCA is based on variance, the data should indeed be standardized beforehand; otherwise features with larger scales dominate the principal components.
But as preprocessing for a deep neural network (DNN), either standardization or normalization can be used, depending on the data. With images we divide pixel values by 255 to get into the [0, 1] range, but for tabular data that roughly follows a Gaussian distribution, or that contains outliers, standardization usually makes more sense than normalization (you could also try both, out of curiosity, to see whether the final result differs much).
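As a minimal sketch of the two options in PySpark ML (assuming a DataFrame `train_df` with an already-assembled `features` vector column; the column names are placeholders):

```python
from pyspark.ml.feature import MinMaxScaler, StandardScaler

# Normalization: rescale each feature to [0, 1], analogous to dividing
# pixel values by 255 for images.
normalizer = MinMaxScaler(inputCol="features", outputCol="norm_features")

# Standardization: zero mean and unit variance per feature, usually the
# safer choice for Gaussian-like tabular data or data with outliers.
standardizer = StandardScaler(inputCol="features", outputCol="std_features",
                              withMean=True, withStd=True)

# Both are Estimators: fit on the training data, then transform.
# norm_df = normalizer.fit(train_df).transform(train_df)
# std_df = standardizer.fit(train_df).transform(train_df)
```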
To answer your question: yes, it makes sense to do standardization + PCA + scaling + DNN.
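A hedged sketch of such a pipeline in PySpark ML is below; the column names, `k=10`, and the MLP layer sizes are assumptions for illustration, not values from your pipeline:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.feature import MinMaxScaler, PCA, StandardScaler

# Assumed columns: "features" (assembled vector) and "label".
standardizer = StandardScaler(inputCol="features", outputCol="std_features",
                              withMean=True, withStd=True)
pca = PCA(k=10, inputCol="std_features", outputCol="pca_features")
rescaler = MinMaxScaler(inputCol="pca_features", outputCol="scaled_features")
mlp = MultilayerPerceptronClassifier(featuresCol="scaled_features",
                                     labelCol="label",
                                     layers=[10, 16, 8, 2],  # input = k, output = n classes
                                     seed=42)

pipeline = Pipeline(stages=[standardizer, pca, rescaler, mlp])
# model = pipeline.fit(train_df)
# predictions = model.transform(test_df)
```

Note that the first entry of `layers` must match the PCA output dimension `k`, and the last entry must match the number of classes.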
Note that PCA doesn't always give a better model (though it does make training faster), and if you are using it only to reduce the number of features, the DNN may be able to learn that reduction by itself. So I would also try a pipeline without the standardization + PCA steps.