3

I am looking for an incremental clustering algorithm. By incremental I mean an algorithm that builds clusters from an initial dataset and can then progressively ingest new items/observations, adding them to existing or new clusters.

The maximum number of clusters is unknown a priori and is expected to grow over time: after the algorithm has been run on the initial dataset, I expect to receive observations that belong to never-before-seen clusters.

I am quite new to this kind of problem, and all the clustering algorithms in SciPy's clustering library only provide methods for one-shot clustering.

The only incremental clustering algorithm I found in the Scikit-learn library is MiniBatchKMeans, which requires a fixed number of clusters and so does not fit my use case.

Are there incremental clustering algorithms that handle an unknown number of clusters? Are they already implemented somewhere?

Thanks a lot!

Sirion

2 Answers

2

One option is incremental hierarchical clustering.

Hierarchical clustering uses either an agglomerative (bottom-up) or a divisive (top-down) approach to organize the data into nested groups. Because the result is a hierarchy, the number of clusters does not have to be fixed in advance; it can be chosen while the clusters are being built. Incremental hierarchical clustering additionally allows data points to be added throughout the process. The paper "Incremental Clustering for Hierarchical Clustering" by Narita, Hochin, and Nomiya goes into greater detail.
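To make the idea concrete, here is a minimal sketch of one very simple incremental scheme. It is not the method from the paper above; it only maintains the flat clusters you would get by cutting a single-linkage hierarchy at a fixed distance. The class name `IncrementalSingleLinkage` and the `threshold` parameter are made up for this illustration, and the linear scan over all stored points is only reasonable for small datasets.

```python
import math


class IncrementalSingleLinkage:
    """Maintains the clusters obtained by cutting a single-linkage
    hierarchy at a fixed distance `threshold` (hypothetical helper)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.clusters = []  # each cluster is a list of points (tuples)

    def add(self, point):
        # Find every existing cluster with at least one member within the threshold.
        near = {
            i for i, members in enumerate(self.clusters)
            if any(math.dist(point, m) <= self.threshold for m in members)
        }
        # The new point bridges all of those clusters, so they merge into one.
        merged = [point]
        for i in near:
            merged.extend(self.clusters[i])
        # Keep the untouched clusters and append the merged (or brand-new) one.
        self.clusters = [c for i, c in enumerate(self.clusters) if i not in near]
        self.clusters.append(merged)


# Initial dataset: two well-separated groups.
model = IncrementalSingleLinkage(threshold=1.5)
for p in [(0, 0), (1, 0), (10, 10), (11, 10)]:
    model.add(p)
n_initial = len(model.clusters)

# A later observation far from everything opens a new cluster.
model.add((50, 50))
n_after_new = len(model.clusters)

# A later observation close to an existing group joins it.
model.add((1.5, 0.5))
n_after_join = len(model.clusters)
```

The number of clusters is never passed in: it grows when a point is far from everything seen so far, and it shrinks when a point connects two previously separate groups.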

Brian Spiering
0

IncrementalDBSCAN (Ester et al., 1998) is an incremental version of the classic DBSCAN clustering algorithm. It is designed to be updatable: you can add new points to a previously computed clustering (or remove points that are no longer needed). It doesn't require the expected number of clusters as input.

There is an open source implementation of it that can be installed and used as a Python package. (For full disclosure, I implemented this package.)
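To show why DBSCAN lends itself to incremental updates, here is a rough, insertion-only sketch in plain Python. It is not the package's code or API and it is not the full algorithm from the paper (deletion in particular is left out); the class `InsertOnlyDBSCAN` and its methods are hypothetical names for this illustration. It relies on the fact that DBSCAN clusters are connected components of core points, so inserting a point can only create core points and link existing clusters, which a union-find structure handles.

```python
import math


class InsertOnlyDBSCAN:
    """Insertion-only DBSCAN sketch (hypothetical helper, brute-force search)."""

    def __init__(self, eps, min_pts):
        self.eps = eps
        self.min_pts = min_pts
        self.points = []       # inserted points, in insertion order
        self.n_neighbors = []  # eps-neighbourhood size per point (counts itself)
        self.parent = []       # union-find forest over point indices

    def _find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def _is_core(self, i):
        return self.n_neighbors[i] >= self.min_pts

    def _neighbors(self, i):
        p = self.points[i]
        return [j for j, q in enumerate(self.points) if math.dist(p, q) <= self.eps]

    def insert(self, point):
        new = len(self.points)
        self.points.append(point)
        self.n_neighbors.append(0)
        self.parent.append(new)
        hood = self._neighbors(new)  # includes the new point itself
        self.n_neighbors[new] = len(hood)
        became_core = [new] if self._is_core(new) else []
        for j in hood:
            if j != new:
                self.n_neighbors[j] += 1
                if self.n_neighbors[j] == self.min_pts:
                    became_core.append(j)  # j just crossed the core threshold
        # Link every newly core point to all core points within eps of it.
        for i in became_core:
            for j in self._neighbors(i):
                if self._is_core(j):
                    self.parent[self._find(j)] = self._find(i)

    def label(self, i):
        """Cluster id of point i (a representative index), or -1 for noise."""
        if self._is_core(i):
            return self._find(i)
        for j in self._neighbors(i):
            if self._is_core(j):
                return self._find(j)  # border point: borrow a core neighbour's id
        return -1


db = InsertOnlyDBSCAN(eps=1.5, min_pts=3)
for p in [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (50, 50)]:
    db.insert(p)
labels_before = [db.label(i) for i in range(len(db.points))]

# Two more observations near (50, 50) turn the former noise point into a new cluster.
db.insert((51, 50))
db.insert((50, 51))
labels_after = [db.label(i) for i in range(len(db.points))]
```

Cluster ids here are just representative point indices and may change when clusters merge, so they should be treated as opaque labels.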