6

I have a traditional prediction setting, with a training data set train and a test data set test. I do not know the outcome y of the test set.

I found that tsne separates my binary classification setting quite well. However, tsne cannot really be used for prediction, as in predict(tsne, newdata=test) which can be done for PCA.

What is the best approach here?

Should I combine my train and test set (i.e., rbind) and run tsne on the whole data set?

spore234
  • 613
  • 8
  • 14

3 Answers3

2

Here's an approach:

  1. Get the lower dimensional embedding of the training data using t-SNE model.
  2. Train a neural network or any other non-linear method, for predicting the t-SNE embedding of a data point. This will essentially be a regression problem.
  3. Use the model trained in step 2 to first predict the t-SNE embedding of a test data point and then assign it to a class using kNN.

This ensures that there is no data leakage between your training and test set. I believe this approach is a hacky way of bringing t-SNE into the binary classification picture.

Do note that t-SNE was mainly intended for visualization of high dimensional data points and not to extract good features for a classification model. The fact that you could observe a clear separation between classes using the t-SNE visualisation implies that the data can be easily modeled as a binary classification task using a non-linear classification algorithm.

If I were you, I would consider using ELMs, SVMs with non-linear kernels or good old Logistic Regression with regularisation.

Stephen Rauch
  • 1,831
  • 11
  • 23
  • 34
1

You could try performing t-SNE on the combined test and train data, then assigning the class of each test point based on $k$-NN with the training data in t-SNE coordinates.

However, I don't believe there are any guarantees with t-SNE, so you might need to find another method. My question on the possibility of using $k$-NN with t-SNE is here.

geometrikal
  • 533
  • 1
  • 5
  • 14
1

t-SNE is not really designed that way. Since t-SNE is non-parametric there isn't a function that maps data from an input space to the map. The standard approach usually is to train a multivariate regression to predict the map location from input data. You can read more about this in this paper t-SNE. In the paper you should note that the author takes the approach to minimize the t-SNE loss directly.

Tophat
  • 2,470
  • 12
  • 16