4

I have an array of images. And, there is a function that calculates the distance between two images.

I wish to cluster the images based on this distance. So the clusters contain images that are all at short distance to each other.

So only the distance between two images can be used to form the clusters. An image has no usable properties on their own. As many common clustering algorithms expect objects to have usable properties, it seems I can use these algorithms.

To limit computational complexity, I've introduced a few thresholds to decide if two images may end up in the same cluster:

  • calculated distance (I want to optimize for shortest distance, but above a certain distance, images are not close enough to ever share a cluster)
  • input array distance (e.g. an image at index 23 may only cluster with images up to index 123)
  • creation data interval (images created more than N hours apart, may not share a cluster)

Some details:

  • There can be up to 100k images.
  • Roughly 25-50% of images will be 'unique' and do not end up in a cluster with others. Of the remainder, most will be expected to end up in clusters with a size below 10, with 5% outliers above that.
  • The maximum cluster size is something I want to set somewhere between 10 and 50.

Given the above and the nature of the image set, a graph (image: edge, calculated distance: vertex) would consist of disjoint parts consisting of up to a few hundred images. So the algorithm needs to be able to cluster sets of well below 1000 images.

The clusters must be optimised for distance, such that closer images end up together in a cluster. I'm aware that parameters like minimum and maximum cluster size would be needed; it's a play between creating a few large clusters with a higher maximum distance, or more smaller clusters with images that are closer together.

What kind of algorithm(s) could I use?

2 Answers2

1

You can use Cluster Edge Deletion, however, the problem is NP-hard.

Create a graph where the vertex set is the set of your objects, and you create an edge between two objects if the distance is at most the threshold.

Now, the Cluster Edge Deletion problem is to find the set with fewest edges to delete such that the remaining graph is a cluster graph (a graph whose connected components are cliques). In this solution, every object belongs to a cluster in which the distance to every other object is bounded by the threshold.

Alternatively, you can run $k$-means (with increasing $k$) until you're satisfied with a solution. Take note that the $k$-means algorithm is randomized and gives an approximate solution (as also this problem is NP-hard), so you might want to run the algorithm several times per $k$.

Ainsley H.
  • 17,823
  • 3
  • 43
  • 68
1

There are several standard clustering algorithms you could try, including k-means and hierarchical clustering. The additional restrictions you have should make both implementable in a reasonably efficient way. In particular, build a graph with one vertex per image, and an edge between each pair of images that meet all the thresholds. Note that this can be done using at most $100 \times 100,000$ distance computations: you only need to compare each image to the 100 that precede it.

D.W.
  • 167,959
  • 22
  • 232
  • 500