I've coded a small clustering algorithm for time signals using k-means, which works OK (it gives acceptable results).

However, kmeans uses the sum of squared differences. I would like to be able to input instead my own measure of difference, but there doesn't seem to be a way provided by the library to do that.

What would be the easiest way to achieve this? Is there another Python library that would let me pass in my own distance function? I guess I could re-implement the algorithm myself, but I'd rather keep the scikit-learn one, since it provides functionality I want to use, such as parallel processing.

1 Answer

K-means cannot optimize arbitrary measures.

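The mean minimizes the sum of squared errors, i.e. squared Euclidean distances. It does not minimize other objectives such as the sum of (unsquared) Euclidean distances or Manhattan distances. K-means won't crash if you plug in another distance, but the solution will not be optimal (not even locally), because the centers are not well placed for that measure.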
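To make this concrete, here is a small one-dimensional check. It is only an illustration with made-up numbers (the array `points` and the two helper functions are hypothetical, not part of any library): the mean gives the lower sum of squared errors, while the median gives the lower sum of absolute errors.

```python
import numpy as np

def sum_squared(points, c):
    """Sum of squared differences between each point and center c."""
    return float(np.sum((points - c) ** 2))

def sum_absolute(points, c):
    """Sum of absolute differences between each point and center c."""
    return float(np.sum(np.abs(points - c)))

# Made-up data with one outlier, so that mean and median differ clearly.
points = np.array([0.0, 1.0, 2.0, 3.0, 100.0])
mean = float(points.mean())
median = float(np.median(points))

# The mean is the better center for squared error ...
print(sum_squared(points, mean), sum_squared(points, median))
# ... but the median is the better center for absolute error.
print(sum_absolute(points, mean), sum_absolute(points, median))
```

So if the assignment step used absolute differences while the update step still took the mean, the centers would not be the best ones for that objective.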

So it makes little sense to add support for another "inertia" (or for other distances) inside k-means, as it can't optimize that.

If you want to optimize other distances, there is for example the PAM algorithm (partitioning around medoids, i.e. k-medoids), which works with arbitrary dissimilarities, and k-medians (for Manhattan distance).
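As a rough sketch of the medoid idea, the code below clusters on a precomputed distance matrix, so any user-defined distance can be plugged in. Note that this is a simple alternating k-medoids (assign points, then re-pick each medoid), not the full PAM swap procedure, and it is not parallelized. The names `pairwise_distances`, `k_medoids`, `manhattan` and the toy `signals` array are assumptions made up for this example.

```python
import numpy as np

def pairwise_distances(signals, dist):
    """Build a symmetric distance matrix from a user-supplied function."""
    n = len(signals)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(signals[i], signals[j])
    return D

def k_medoids(D, k, max_iter=100, seed=0):
    """Alternating k-medoids on a precomputed distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # Update step: the member with the smallest total distance
            # to the other members becomes the new medoid.
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

def manhattan(a, b):
    """Example custom distance; replace with your own measure."""
    return float(np.sum(np.abs(a - b)))

# Toy "signals": two well-separated groups of three.
signals = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0],
                    [10, 10, 10], [10, 11, 10], [11, 10, 10]], dtype=float)
D = pairwise_distances(signals, manhattan)
medoids, labels = k_medoids(D, k=2)
print(medoids, labels)
```

Because the centers are always actual data points, only pairwise distances are needed and no mean has to be computed, which is why this family of methods does not depend on squared Euclidean distance. The cost is the O(n²) distance matrix.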

There is nothing wrong with computing such a quality measure afterwards, but then it belongs in the evaluation package, not in the KMeans class.

Has QUIT--Anony-Mousse