15

I have a large data set and a cosine similarity matrix between its items. I would like to cluster the items using this cosine similarity so that similar objects end up together, without needing to specify beforehand the number of clusters I expect.

I read the sklearn documentation of DBSCAN and Affinity Propagation, both of which require a distance matrix (not a cosine similarity matrix).

Really, I'm just looking for any algorithm that doesn't require (a) a distance metric or (b) a pre-specified number of clusters.

Does anyone know of an algorithm that would do that?

Smith Volka

5 Answers

13

First, every clustering algorithm uses some sort of distance metric. This actually matters, because every metric has its own properties and is suited to different kinds of problems.

You said you have the cosine similarity between your records. That is easily turned into a distance matrix (for example, distance = 1 - similarity), which you can then use as the input to a clustering algorithm.

Now, I'd suggest starting with hierarchical clustering - it does not require a predefined number of clusters, and you can either feed in the raw data and select a distance, or feed in a precomputed distance matrix (where you calculated the distance in some way).

Note that hierarchical clustering is expensive to compute, so if you have a lot of data, you can start with just a sample.
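
A minimal sketch of that route, assuming sim is your precomputed (n x n) cosine similarity matrix with values in [0, 1], and with an arbitrary cut height of 0.3:

import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# sim: (n x n) cosine similarity matrix, assumed symmetric with values in [0, 1]
dist = 1.0 - sim                             # turn similarity into dissimilarity
np.fill_diagonal(dist, 0.0)                  # linkage expects zero self-distance
condensed = squareform(dist, checks=False)   # condensed form expected by linkage

Z = hierarchy.linkage(condensed, method="average")
labels = hierarchy.fcluster(Z, t=0.3, criterion="distance")  # cut the tree at distance 0.3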

HonzaB
6

I'd use hierarchical clustering - scikit-learn for the TF-IDF vectorizing, SciPy for the clustering itself:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from scipy.cluster import hierarchy

# Vectorizing: docs is a list of raw text documents
X = CountVectorizer().fit_transform(docs)
X = TfidfTransformer().fit_transform(X)

# Clustering: average-linkage on cosine distances, then cut the dendrogram
# at a fixed distance threshold
X = X.toarray()          # linkage needs a dense array, not a sparse matrix
threshold = 0.1
Z = hierarchy.linkage(X, "average", metric="cosine")
C = hierarchy.fcluster(Z, threshold, criterion="distance")

C is your clustering of the documents docs.

You can use other metrics instead of cosine, and use a different threshold than 0.1.

Uri Goren
4

DBSCAN can trivially be implemented with a similarity measure instead of a distance. You just need to change the <= epsilon into a >= epsilon.
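
A minimal sketch of that idea, assuming sim is a precomputed (n x n) similarity matrix and that eps and min_pts are purely illustrative values:

import numpy as np

def dbscan_similarity(sim, eps=0.7, min_pts=5):
    """Plain DBSCAN on a similarity matrix: neighbours are points with
    similarity >= eps instead of distance <= eps. Returns labels, -1 = noise."""
    n = sim.shape[0]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbours = np.where(sim[p] >= eps)[0]   # the only change vs. distance-based DBSCAN
        if len(neighbours) < min_pts:
            continue                              # p stays noise (may later join a cluster as a border point)
        labels[p] = cluster
        seeds = list(neighbours)
        while seeds:
            q = seeds.pop()
            if not visited[q]:
                visited[q] = True
                q_neighbours = np.where(sim[q] >= eps)[0]
                if len(q_neighbours) >= min_pts:  # q is a core point: keep expanding
                    seeds.extend(q_neighbours)
            if labels[q] == -1:
                labels[q] = cluster               # core or border point joins the current cluster
        cluster += 1
    return labels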

HAC also works just fine with similarities (at least single-link, complete-link, UPGMA, WPGMA - don't use Ward), if you swap "min" and "max" (you want to merge with maximum similarity rather than minimum distance).

If you are lazy, you can also just transform your similarity into a distance. If you have a fixed maximum, dist=max-sim will often do.
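
For example, cosine similarity has a fixed maximum of 1, so dist = 1 - sim can be fed straight into scikit-learn's DBSCAN with a precomputed metric (the eps value here is arbitrary):

from sklearn.cluster import DBSCAN

# sim: (n x n) cosine similarity matrix; its maximum is 1, so max - sim = 1 - sim
dist = 1.0 - sim
labels = DBSCAN(eps=0.3, min_samples=5, metric="precomputed").fit_predict(dist)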

Has QUIT--Anony-Mousse
3

I think the clustMixType package might give you better results/insights.

With this package you can use a combination of categorical and numeric data directly; it doesn't need any kind of one-hot encoding.

You just need to feed in the data and it automatically separates it into categorical and numeric fields. If you run into issues at that stage, you can use functions like as.factor (to convert a field to categorical) and as.numeric (to convert it to numeric).

You can calculate lambda (the mean distance value) beforehand and feed it in as an input to the algorithm.

If you don't know the optimal number of clusters, you can plot the WSS (within sum of squares) as an elbow chart to decide on it.

Toros91
2

All clustering methods use a distance metric of some sort. And remember that distance is essentially a dissimilarity measure. So if you normalize your similarity to lie between 0 and 1, your distance is simply 1 - similarity.

As for algorithms that do not require the number of clusters to be specified, there are of course hierarchical clustering techniques, which essentially build a tree-like structure that you can "cut" wherever you please (you can use some performance metrics to do that automatically).

X-means is a version of K-means that tries several values of K and picks the one that maximizes some evaluation function.
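
X-means itself isn't in scikit-learn, but the idea it describes - try several values of K and keep the best-scoring one - can be sketched with plain K-means; the silhouette score below is just one possible evaluation function, and X is assumed to be a plain feature matrix:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X: (n_samples x n_features) feature matrix
best_k, best_score = None, -1.0
for k in range(2, 11):                         # candidate values of K
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)        # evaluation function to maximize
    if score > best_score:
        best_k, best_score = k, score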

Mean shift also "finds" a natural number of clusters, but it is sensitive to other parameters, such as the bandwidth.
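
For example, with scikit-learn (X is again assumed to be a plain feature matrix; the quantile used for the bandwidth estimate is arbitrary):

from sklearn.cluster import MeanShift, estimate_bandwidth

# X: (n_samples x n_features) feature matrix
bw = estimate_bandwidth(X, quantile=0.2)   # the bandwidth largely determines how many clusters you get
labels = MeanShift(bandwidth=bw).fit_predict(X)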

Valentin Calomme