9

I'm trying to use k-means clustering (using SQL Server + R), and my model doesn't seem to be stable: each time I run the k-means algorithm, it finds different clusters. But if I set nstart (in the R kmeans function) high enough (10 or more), it becomes stable.

The default value for this parameter is 1, but it seems that setting it to a higher value (25) is recommended (I think I saw this somewhere in the documentation).

So I'm a bit confused... Any advice?

irimias

4 Answers

9

The nstart option tries multiple initial configurations and reports the best one. For example, nstart=25 generates 25 random sets of initial centroids, runs the algorithm from each, and keeps the result with the lowest total within-cluster sum of squares. Hope this helps!
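To make this concrete, here is a minimal sketch comparing a single random start with 25 random starts. The built-in iris data and the choice of 3 clusters are assumptions made purely for illustration; they are not from the question.

```r
# Compare one random start with 25 random starts on the same data.
set.seed(42)                    # fix the RNG so the example is repeatable
x <- as.matrix(iris[, 1:4])     # stand-in dataset (assumption, not the asker's data)

fit_single <- kmeans(x, centers = 3, nstart = 1)
fit_multi  <- kmeans(x, centers = 3, nstart = 25)

# tot.withinss is the total within-cluster sum of squares (lower is better).
fit_single$tot.withinss
fit_multi$tot.withinss
```

With nstart = 25, kmeans returns only the best of the 25 runs, so repeated calls are far more likely to land on the same partition than repeated calls with nstart = 1.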

You can read more here...

FrlUn
4

The stability of the clusters is highly dependent on your dataset. For clear-cut cases, running it multiple times is a waste of resources; I think that is the rationale behind the default value of 1. But I agree that for most smaller cases, setting it much higher makes a lot of sense.
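As a rough illustration of a clear-cut case, the sketch below builds two well-separated synthetic blobs (made-up data, not from the question) and runs kmeans 20 times with a single start each. If the objective barely varies across runs, extra starts add little.

```r
set.seed(1)
# Two well-separated Gaussian blobs in 2D (synthetic, for illustration only).
blob_a <- matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2)
blob_b <- matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
x <- rbind(blob_a, blob_b)

# Run k-means 20 times with one random start each and record the objective.
objectives <- replicate(20, kmeans(x, centers = 2, nstart = 1)$tot.withinss)

# Spread of the objective across runs; a value near zero means every run
# converged to the same partition.
diff(range(objectives))
```

Running the same check on your own data with nstart = 1 is a cheap way to judge whether a higher nstart is worth the extra computation.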

Jan van der Vegt
0

Simply put, nstart creates multiple initial configurations and returns the best one.

For example, nstart = 25 creates 25 initial configurations.

Use the nstart argument to try multiple random initial values.

Lohith Arcot
-2

The explanation doesn't appear anywhere in the docs, but I think that setting nstart higher lets kmeans run nstart (say 50) random initializations of the centroids and choose the one that best minimizes the cost. You therefore end up with much more stable clusters, because kmeans always picks the best separation it found for your data.
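The behaviour described here can be approximated by hand. The following is a sketch under assumptions, not a reproduction of what kmeans does internally: it uses the built-in iris data as a stand-in, and n_starts is a hypothetical name chosen for this example.

```r
set.seed(123)
x <- as.matrix(iris[, 1:4])   # stand-in dataset (assumption)
n_starts <- 50                # hypothetical value, matching the "say 50" above

# Fit k-means once per random start, each with a single initialization.
fits <- lapply(seq_len(n_starts), function(i) kmeans(x, centers = 3, nstart = 1))

# Keep the fit with the smallest total within-cluster sum of squares.
costs <- vapply(fits, function(f) f$tot.withinss, numeric(1))
best  <- fits[[which.min(costs)]]

table(round(costs, 2))   # how often each local optimum was reached
best$tot.withinss
```

The table of costs shows why a single start can look unstable: different starts may settle in different local optima, and selecting the minimum filters those out.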