What is the global function we are trying to Optimise with Clustering Algorithms?

Question

I am doing some reading (and implementation) of some Clustering Algorithms.

First I started with the well known K-Mean algorithm and implemented it directly from a paper. Got a kind of decent understanding of what I going on there.

What I am actually interested in is Particle Swarm Optimisation based clustering.

For Particle Swarm Optimisation, you need to have a global fitness function that will (in this case) tell you how well clusters this is.

The general notion of well clustered I have found is that each cluster is compact. But how to express this mathematically?

I thought that we would go though each cluster and check its positional variance (away from its centroid).

For some cluster C :

$$\mu(C)=\dfrac{1}{|C|}\sum_{\forall\tilde{x}\in C}\tilde{x}$$

$$\sigma^{2}(C)=\sum_{\forall\tilde{x}\in C}\left(\tilde{x}-\mu(C)\right)^{2}$$

No point converting to standard deviation, unless using a gradient based optimization approach, which PSO is not.

then the global fitness would be to minimize the average variance, which is proportional to minimizing the total variance since we have a fixed number of clusters.

if $S$ is the set of all clusters the we are to minimise some function $g$.

$g_{mine}(S)=\sum_{\forall C\in S}\sigma^{2}(C)$

ie $$g_{mine}(S)=\sum_{\forall C\in S}\sum_{\forall\tilde{x}\in C}\left(\tilde{x}-\mu(C)\right)^{2}$$

According to Chen, C.-Y. & Ye, F. Particle swarm optimization algorithm and its application to clustering analysis Networking, Sensing and Control, 2004 IEEE International Conference on, 2004, 2, 789-794

Equation 7 (Rewritten to use my notation)

$$g_{chen}(S)=\sum_{\forall C\in S}\sum_{\forall\tilde{x}\in X}(\tilde{x}-\mu(C)) $$

The difference from what I had is the $\forall\tilde{x}\in X$, for $X$ the whole spaceof data rather than $\forall\tilde{x}\in C$.

This is what got me wondering if I was correct in the first place. as it can be seen that $g_{mine}\ne g_{chen}$. It can't since the extra points are going to change it and will not cancel since all variances are positive. However they are equivalent if $\forall S, S'$ $g_{mine}(S)\ge g_{mine}(S') \iff g_{chen}(S)\ge g_{chen}(S')$

In: Ahmadyfard, A. & Modares, H. Combining PSO and k-means to enhance data clustering Telecommunications, 2008. IST 2008. International Symposium on, 2008, 688-691

Equation 6 has: (again with notation switched)

$$g_{ahmadfard}(S)=\dfrac{1}{|X|} \sum_{\forall C\in S}\sum_{\forall\tilde{x}\in C}\left(\tilde{x}-\mu(C)\right)^{2}$$

This is equivalent to mine (at least in proportionality) as: $|X| \times g_{ahmadfard}(S) = g_{mine}$, and since $|X|>0$ thus $\forall S, S'$ $g_{mine}(S)\ge g_{mine}(S') \iff g_{ahmadfard}(S)\ge g_{ahmadfard}(S')$

Ye and Chen also came up with another function in: Ye, Fun, and Ching-Yi Chen. "Alternative KPSO-clustering algorithm." Tamkang Journal of science and Engineering 8.2 (2005): 165. It is very complicated and beyond the scope of this question

What are the pros and cons of $g_{chen}$ vs $g_{mine}$ vs $g_{ahmadfard}$ ?

gaborous · Accepted Answer · 2015-07-17T08:05:33.833

About your first, general question: generally, the cost function we want to optimize in clustering algorithms is double: we want to both minimize intra-class variance while maximizing extra-class variance. Replace class by cluster and you understand easily what it means: inside a cluster, we want the elements to be as similar as possible, while elements between different clusters must be as dissimilar as possible. The basic idea is very simple and natural.

Different algorithms will only tweak the way they measure this intra-class similarity and the extra-class dissimilarity, usually in the goal to optimize for a particular kind of dataset with known properties.

About the second, more specific question: I did not read those papers, but from a first look, it seems gchen is using L1-norm (absolute value measure) while gmine is using L2-norm (euclidian measure). gahmadfard seems to only be the L2-norm plus feature scaling or some kind of tf-idf (anyway what is certain is that they are pondering the similarity measure with the number of elements in the group, but I don't know for what purpose).

I can't advise when to use which clustering algorithm, you should read the intro of the papers or find a survey, but usually the best way to find the best clustering algorithm for you is to try several on your dataset and manually select which one fits the best your purpose (since there's no objective way to find which clustering is the best, it all depends on your subjective needs, after all, clustering algorithms belong to the class of unsupervised learning).

What is the global function we are trying to Optimise with Clustering Algorithms?

1 Answers1