This is a well-known problem in automatic clustering: how to choose (or adapt) the number of clusters so that it reflects the "real" clusters in the data.
Hierarchical clustering is more helpful in this regard. For algorithms like k-means it is not so easy, and research has tried various approaches to determine the optimal number of clusters, e.g. employing information-theoretic criteria such as the Akaike information criterion (AIC).
An overview is given in the Wikipedia article on determining the number of clusters in a data set and the references therein:
Determining the number of clusters in a data set, a quantity often
labelled k as in the k-means algorithm, is a frequent problem in data
clustering, and is a distinct issue from the process of actually
solving the clustering problem.
[..]The correct choice of k is often ambiguous, with interpretations
depending on the shape and scale of the distribution of points in a
data set and the desired clustering resolution of the user. In
addition, increasing k without penalty will always reduce the amount
of error in the resulting clustering, to the extreme case of zero
error if each data point is considered its own cluster (i.e., when k
equals the number of data points, n). Intuitively then, the optimal
choice of k will strike a balance between maximum compression of the
data using a single cluster, and maximum accuracy by assigning each
data point to its own cluster. If an appropriate value of k is not
apparent from prior knowledge of the properties of the data set, it
must be chosen somehow. There are several categories of methods for
making this decision.
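The trade-off described in the quote can be sketched with a small example using scikit-learn (assumed available). The data set is made up: three well-separated 2-D Gaussian blobs, so the "real" k is 3 by construction. The sketch shows that the raw k-means error (inertia) keeps shrinking as k grows, while a criterion that balances fit against complexity, here the silhouette score and the AIC of a Gaussian mixture, peaks or bottoms out at the underlying number of clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Toy data (made up for illustration): three tight blobs centered
# at (0,0), (5,5), and (10,10), 100 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in (0.0, 5.0, 10.0)])

inertia, silhouette, aic = {}, {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = km.inertia_              # always shrinks as k grows (no penalty)
    silhouette[k] = silhouette_score(X, km.labels_)
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    aic[k] = gm.aic(X)                    # log-likelihood penalized by model size

print(max(silhouette, key=silhouette.get))  # k with the best silhouette
print(min(aic, key=aic.get))                # k with the lowest (best) AIC
```

On data this clean, both the silhouette maximum and the AIC minimum land at the true k, while inertia alone would keep rewarding larger k, exactly the "increasing k without penalty" effect the quote describes.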