
I am a physicist with little formal training in computer science, so please don't assume I know even obvious things about the field!

Within the context of data analysis, I was interested in identifying clusters within a list of $n$ $d$-dimensional data-points, for which the dimensionality $d$ could be $\sim100$, whilst the number of data-points could be $\sim 1,000,000$, or perhaps more.

I wanted the points within a cluster to be close together, with distance measured in the Euclidean manner, $$ d(\vec x,\vec y) = \sqrt{\sum_{i=1}^d (x_i - y_i)^2} $$ As long as the clustering was reasonably accurate, I wasn't bothered about obtaining the exactly correct result; i.e. if, of my $\sim1,000,000$ points, $\sim1,000$ were wrongly categorized, it wouldn't matter much.
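(For concreteness, a minimal NumPy sketch of the distance above, assuming the points are stored as arrays of length $d$:)

```python
# Minimal sketch of the Euclidean distance above; x and y are assumed to be
# NumPy arrays of length d.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))
```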

I have written a short algorithm that typically runs in $\mathcal{O}(n)$ time (judging from trials of up to $n\sim5,000,000$ and some theoretical analysis) and $\mathcal{O}(n^2)$ in the worst case (from my theoretical evaluation of the algorithm). The nature of the algorithm sometimes (but not always) avoids the so-called chaining problem in clustering, where dissimilar clusters are chained together because a few of their data-points are close.

The complexity is, however, sensitive to the a priori unknown number of clusters in the data-set. The typical complexity is, in fact, $\mathcal{O}(n\times c)$, with $c$ the number of clusters.

Is that better than currently published algorithms? I know that naively it is an $\mathcal{O}(n^3)$ problem. I have read of SLINK, which reduces the complexity to $\mathcal{O}(n^2)$. If so, is my algorithm useful? Or do the major uses of clustering algorithms require exact solutions?

In real applications, is $c\propto n$, such that my algorithm has no advantage? My naive feeling is that for real problems, the number of "interesting" clusters (i.e. not including noise) is a property of the physical system/situation being investigated, and is in fact a constant, with no significant dependence on $n$, in which case my algorithm looks useful.

innisfree

1 Answer


To answer your last question: In real problems, the number of clusters $c$ is usually much less than $n$.

To compare to other algorithms, the running time of a naive implementation of the standard $k$-means algorithm is $O(ncdt)$ (here your $c$ is often called $k$, so $c=k$, and $t$ is the number of iterations needed for $k$-means to converge; $t$ is often much smaller than $n$ and roughly a small constant, but certainly not always). There are ways to speed it up further. It tends to be very fast in practice.
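To make that cost concrete, here is a minimal NumPy sketch of naive $k$-means (Lloyd's algorithm); the function and parameter names are illustrative rather than taken from any particular library, and the assignment step is the $O(ncd)$ part that gets repeated $t$ times:

```python
import numpy as np

def kmeans(X, c, t=100, seed=0):
    """Naive Lloyd's algorithm: X is an (n, d) array, c clusters, at most t iterations."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialise centres with c distinct random data points
    # (k-means++ initialisation would usually be better).
    centres = X[rng.choice(n, size=c, replace=False)].astype(float)
    for _ in range(t):
        # Assignment step, O(n*c*d): squared distance from every point to every centre.
        # This materialises an (n, c, d) array -- fine for a sketch, but for
        # n ~ 10^6 you would compute it in chunks instead.
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step, O(n*d): each centre moves to the mean of its assigned points.
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(c)
        ])
        if np.allclose(new_centres, centres):
            break  # converged
        centres = new_centres
    return labels, centres
```

In practice you would reach for an optimized implementation (e.g. scikit-learn's `KMeans`), which vectorizes the distance computation more carefully and uses better initialisation such as k-means++.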

For high-dimensional spaces, such as the one you are working in, one can also apply various dimension-reduction techniques to the data and then cluster the reduced-dimension values. This sometimes gives additional performance improvements.
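As an illustration of that combination, here is a sketch using scikit-learn; the component count and cluster count below are arbitrary choices for the example, not recommendations:

```python
# Sketch: reduce d ~ 100 dimensions with PCA, then cluster the projected data.
# The data here is random noise purely as a stand-in for a real (n, d) array.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100))                    # stand-in for real data

X_reduced = PCA(n_components=10).fit_transform(X)     # shape (10_000, 10)
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X_reduced)
```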

To answer your first question: We can't tell you whether your algorithm is useful if you don't show us the algorithm. Sorry, but that question just isn't well-suited for this site. Some questions aren't, no matter how much you'd love to know the answer.

Perhaps you'd like to ask a new question about what is the state of the art in clustering for Euclidean metrics, for some class of parameters and some distribution on the inputs. Except... make sure you go do your research first. There's a lot that has been written on the subject of clustering, so go read about standard clustering algorithms first, and then make sure to tell us what research you've done when you ask the new question. (In fact, you probably should have done that for this question, too...)

D.W.