
I'm learning the GMM clustering algorithm, and I don't understand how it can be used as a classifier. Here are my thoughts:

1) GMM is an unsupervised ML algorithm. At least that's how sklearn categorizes it.

2) Unsupervised methods can cluster data, but can't make predictions.

However, sklearn's user guide clearly applies GMM as a classifier to the iris dataset.

If I had to guess, maybe after clustering, each cluster is assigned a class label based on some kind of majority voting. However, I can't find any documentation of this. Could someone shed more light on this step from unsupervised to supervised learning?
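To make my guess concrete, here is roughly the relabelling I have in mind (just my own sketch, not something I found in the documentation):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, y = load_iris(return_X_y=True)

# Fit a purely unsupervised GMM with as many components as classes.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
clusters = gmm.predict(X)

# Map each cluster to the majority class of the points assigned to it.
cluster_to_class = {c: np.bincount(y[clusters == c]).argmax()
                    for c in np.unique(clusters)}
y_pred = np.array([cluster_to_class[c] for c in clusters])
print((y_pred == y).mean())  # accuracy after relabelling the clusters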


A related question: when using GMM as a classifier, is it common practice to simply set n_components = n_classes, instead of choosing the number of components with AIC, BIC, etc.?


1 Answer

Some unsupervised models can make predictions, just not ones that necessarily match the original class labels. Once a GaussianMixture model has been fitted, it can predict which of the clusters a new example belongs to; this is exactly what the predict and predict_proba methods do in this case. Since the number of clusters is set to 3, the number of classes, predict will return a label from $\{0, 1, 2\}$.
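As a minimal sketch of what I mean (using the raw iris data rather than the exact train/test setup from the linked example):

from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, y = load_iris(return_X_y=True)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

print(gmm.predict(X[:5]))        # cluster labels, each drawn from {0, 1, 2}
print(gmm.predict_proba(X[:5]))  # soft cluster-membership probabilities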

However, this still raises the question of how the GaussianMixture assigns particular labels to the clusters. In general, the assignment is arbitrary, but in the example from sklearn that you linked, they cheat when initialising the cluster centers:

# Since we have class labels for the training data, we can
# initialize the GMM parameters in a supervised manner.
estimator.means_init = np.array([X_train[y_train == i].mean(axis=0)
                                for i in range(n_classes)])

The initial position of each cluster center is the mean of one class, which has the additional consequence of ordering the cluster labels so that they match the original class labels. This means that, in this case, the GMM predicting which cluster a new instance belongs to is equivalent to predicting which class it belongs to. I believe this was done to make it easy to visualise the different covariance matrix options.
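A condensed sketch of that idea, paraphrasing (not quoting) the linked example:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
n_classes = len(np.unique(y_train))

# Start each component's mean at the mean of one class, as in the sklearn
# example, so that cluster i begins on top of class i.
means_init = np.array([X_train[y_train == i].mean(axis=0)
                       for i in range(n_classes)])
gmm = GaussianMixture(n_components=n_classes, means_init=means_init,
                      random_state=0).fit(X_train)

# Because of the initialisation, cluster labels line up with class labels,
# so predict() can be scored directly against the true labels.
print((gmm.predict(X_test) == y_test).mean())

Because EM only refines the parameters from this starting point, each component usually stays associated with the class it was initialised on, which is what makes the direct comparison above meaningful.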
