Could someone explain how to calculate the following 3 evaluative properties:

  • Intracluster Variability (IV) - how different the data points within the same cluster are
  • Extracluster Variability (EV) - how different the data points in distinct clusters are
  • The optimum value of $k$

For the third one, I believe it is found by minimizing $EV/IV$.

Anyway, please advise.

Tesla

3 Answers

Let $C$ be a cluster, $c$ its centroid, $d$ the given distance function, $\delta$ the indicator function, and $x$ an element.
$IV = \sum \limits_C \sum \limits_{x \in C} d(x, c)$, meaning: for each cluster, sum the distances of its elements from its centroid, then sum the results over all clusters.

$EV = \frac{1}{N} \sum \limits_i \sum \limits_j \delta(C(x_i) \ne C(x_j))\, d(x_i, x_j)$, meaning: sum the distances between elements belonging to different clusters and divide the result by the number of elements $N$.
One possibility for the optimal $k$ is to minimize $\frac{IV}{EV}$. But there are other techniques for determining the optimal number of clusters (one referenced discussion lists eight methods of finding $k$). In fact, I would not claim that any given method is better, preferable, or can be chosen blindly in advance: this is the part where you should define the optimum that makes sense for you, your data, and what you use the clusters for. A small code sketch of $IV$ and $EV$ follows below.
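
Here is a minimal sketch of these two quantities in Python, assuming NumPy/SciPy/scikit-learn and a Euclidean $d$; the function names `iv` and `ev` are illustrative, not a library API.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def iv(X, labels, centroids):
    """IV: for each cluster, sum the distances of its points to the
    cluster centroid, then sum over clusters (Euclidean d assumed)."""
    return sum(np.linalg.norm(X[labels == c] - centroids[c], axis=1).sum()
               for c in range(len(centroids)))

def ev(X, labels):
    """EV: sum of pairwise distances between points in different
    clusters, divided by the number of elements N."""
    D = cdist(X, X)                                  # d(x_i, x_j)
    different = labels[:, None] != labels[None, :]   # delta(C(x_i) != C(x_j))
    return (D * different).sum() / len(X)

# Toy usage: pick k by minimizing IV / EV over a small range.
X = np.random.default_rng(0).normal(size=(200, 2))
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, iv(X, km.labels_, km.cluster_centers_) / ev(X, km.labels_))
```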

Evil

How to get the optimal value for $k$?

You have to define a measure of optimality. The problem is that most measures get smaller (i.e. better) as $k$ grows. One measure that is independent of $k$ is the silhouette coefficient:

Let $C = (C_1, \dots, C_k)$ be the clusters. Then:

  • Average distance between an object $o$ and the other objects in its cluster: $$a(o) = \frac{1}{|C(o)|} \sum_{p \in C(o)} \text{dist}(o, p)$$
  • Average distance to the nearest other cluster: $$b(o) = \min_{C_i \neq C(o)} \frac{1}{|C_i|} \sum_{p \in C_i} \text{dist}(o, p)$$
  • Silhouette of an object: $$s(o) = \begin{cases}0 &\text{if } a(o) = 0, \text{ i.e. } |C(o)|=1\\ \frac{b(o)-a(o)}{\max(a(o), b(o))} &\text{otherwise}\end{cases}$$
  • Silhouette of a clustering $C$: $$\text{silh}(C) = \frac{1}{|C|} \sum_{C_i \in C} \frac{1}{|C_i|} \sum_{o \in C_i} s(o)$$

You can see that $s(o) \in [-1, 1]$ and $\text{silh}(C) \in [-1, 1]$. Higher is better; values below $0$ are very bad.

Now you can start with $k=2$ (for $k=1$ there is no other cluster, so $b(o)$ is undefined) and increase $k$ until $\text{silh}(C)$ gets smaller again.
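
A minimal sketch of these formulas in Python, assuming NumPy/SciPy and scikit-learn's KMeans for the clustering itself; `silhouette_of_clustering` is an illustrative name. Note that scikit-learn's built-in `silhouette_score` averages over all objects rather than per cluster, so it differs slightly from the macro-averaged $\text{silh}(C)$ above.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def silhouette_of_clustering(X, labels):
    """Macro-averaged silhouette: mean over clusters of the mean s(o)."""
    clusters = np.unique(labels)
    D = cdist(X, X)                          # dist(o, p) for all pairs
    cluster_means = []
    for ci in clusters:
        in_ci = labels == ci
        s_values = []
        for o in np.where(in_ci)[0]:
            if in_ci.sum() == 1:             # singleton cluster: s(o) = 0
                s_values.append(0.0)
                continue
            a = D[o, in_ci].sum() / in_ci.sum()        # a(o)
            b = min(D[o, labels == cj].mean()          # b(o)
                    for cj in clusters if cj != ci)
            s_values.append((b - a) / max(a, b))
        cluster_means.append(np.mean(s_values))
    return float(np.mean(cluster_means))

# Scan k; the silhouette needs at least two clusters, so start at k = 2.
X = np.random.default_rng(0).normal(size=(200, 2))
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_of_clustering(X, labels))
```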

However, note that there are also alternatives to $k$-means clustering worth considering.

Martin Thoma

There isn't a single optimum value for $k$; all you can do is pick one that is not too bad. To do so (a mathematical approach), I use PCA (principal component analysis), which gives the proportion of variance explained by each eigenvector axis. Suppose you have 10 variables and see that 4 axes explain 90% of the variance; then you can set $k$ to 4.
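
A minimal sketch of this heuristic, assuming scikit-learn's PCA; the 90% cutoff follows the example above and is not a fixed rule.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data with 10 variables; replace with your own matrix.
X = np.random.default_rng(1).normal(size=(500, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of axes whose cumulative explained variance reaches 90%.
k = int(np.searchsorted(cumulative, 0.90) + 1)
print("explained variance per axis:", np.round(pca.explained_variance_ratio_, 3))
print("suggested k:", k)
```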

galzra