k-means in R, usage of nstart parameter?

Question

I'm trying to use k-means clustering (with SQL Server + R), and my model doesn't seem to be stable: each time I run the k-means algorithm, it finds different clusters. But if I set nstart (in R's kmeans function) high enough (10 or more), it becomes stable.

The default value for this parameter is 1, but it seems that setting it to a higher value (25) is recommended (I think I saw this somewhere in the documentation).

So I'm a bit confused... Any advice?

FrlUn · Answer 1 · 2017-05-05T23:31:46.550

9

The nstart option tries multiple initial configurations and reports the best one. For example, adding nstart=25 generates 25 random sets of initial centroids, runs the algorithm from each of them, and keeps the run with the lowest total within-cluster sum of squares. Hope this helps!

You can read more here...
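As a minimal sketch of the difference, the snippet below fits the same data once with a single random start and once with 25. The data set is synthetic and made up purely for illustration; only kmeans() and its nstart argument come from the discussion above.

```r
# Synthetic data: three 2-D Gaussian blobs, 50 points each (illustrative only)
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.5), ncol = 2)
)

# One random start: the result depends on which rows are drawn as initial centers
fit_single <- kmeans(x, centers = 3, nstart = 1)

# 25 random starts: kmeans() keeps the run with the lowest total within-cluster SS
fit_multi <- kmeans(x, centers = 3, nstart = 25)

# Compare the objective values; lower means tighter clusters
print(fit_single$tot.withinss)
print(fit_multi$tot.withinss)
```

Rerunning the last lines without resetting the seed should show the single-start value moving around more than the multi-start one, though how much depends on the data.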

edited May 05 '17 at 23:31

answered May 05 '17 at 23:12

FrlUn


score 4 · Accepted Answer · answered Apr 28 '16 at 16:16

Stability of the clusters is highly dependent on your dataset; for clear-cut cases, running it multiple times is a waste of resources. I think that is the rationale behind the default value of 1. But I agree that for most smaller cases, setting it much higher makes a lot of sense.
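One rough way to check whether a given dataset is a "clear-cut case" is to repeat single-start runs and look at how much the objective varies. The sketch below does this on two far-apart synthetic blobs; the data and the choice of 20 repetitions are assumptions for illustration, not part of the original answer.

```r
# Two well-separated synthetic blobs, 50 points each (illustrative only)
set.seed(7)
x <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 1), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 1), ncol = 2)
)

# Repeat k-means with a single random start and record the objective each time
costs <- replicate(20, kmeans(x, centers = 2, nstart = 1)$tot.withinss)

# A spread near zero suggests extra starts add little for this data;
# a wide spread suggests that a larger nstart is worth the extra cost
print(range(costs))
```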

score 0 · Answer 3 · answered Dec 23 '17 at 10:45


Simply put, nstart will create multiple initial configurations and return the best one.

For example, nstart = 25 will create 25 initial configurations.

Use the nstart argument to try multiple random initial values.

answered Dec 23 '17 at 10:45

Lohith Arcot


score -2 · Answer 4 · answered Apr 05 '17 at 10:50

I couldn't find the explanation anywhere in the docs, but I think that setting nstart higher lets kmeans run nstart (say 50) random initializations of the centroids and choose the one that best minimizes the cost. You therefore end up with much more stable clusters, because kmeans always picks the best separation it found for your data.
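The behaviour described here can be imitated by hand, which may make it easier to see what nstart is doing. This is only a sketch on made-up data, with n_runs as a hypothetical name; it mirrors the idea of "run several random starts and keep the cheapest", not the internal implementation of kmeans().

```r
# Synthetic data: three 2-D blobs, 50 points each (illustrative only)
set.seed(1)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)

n_runs <- 50  # hypothetical name; plays the role of nstart

# Run k-means n_runs times, each with a single random initialization
fits <- lapply(seq_len(n_runs), function(i) kmeans(x, centers = 3, nstart = 1))

# Cost of each run: total within-cluster sum of squares
costs <- vapply(fits, function(f) f$tot.withinss, numeric(1))

# Keep the run with the lowest cost, as nstart does
best <- fits[[which.min(costs)]]
print(best$tot.withinss)
```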
