
We want to use cosine similarity with hierarchical clustering, and we have the cosine similarities already calculated. The sklearn.cluster.AgglomerativeClustering documentation says:

A distance matrix (instead of a similarity matrix) is needed as input for the fit method.

So, we converted cosine similarities to distances as

distance = 1 - similarity
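
For illustration, the conversion above could be sketched as follows. The helper name similarity_to_distance and the small similarity matrix are assumptions made up for this example (the real matrix is much larger); clipping and resetting the diagonal are optional safeguards against floating-point rounding, not something the documentation requires.

```python
import numpy as np

def similarity_to_distance(similarity):
    """Convert a cosine similarity matrix to a distance matrix via 1 - similarity."""
    similarity = np.asarray(similarity, dtype=float)
    distance = 1.0 - similarity
    # Floating-point rounding can leave tiny negative values; clip them to zero.
    distance = np.clip(distance, 0.0, None)
    # Each item is at distance zero from itself.
    np.fill_diagonal(distance, 0.0)
    return distance

# Hypothetical 3x3 similarity matrix, chosen to match the small X used in the question.
similarity = np.array([[1.0, 0.7, 0.6],
                       [0.7, 1.0, 0.3],
                       [0.6, 0.3, 1.0]])
X = similarity_to_distance(similarity)
print(X)
```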

Our Python code raises an error at the fit() call at the end. (I am not including the real value of X in the code, since it is very large. X is just a cosine similarity matrix with its values converted to distances as described above. Notice that the diagonal is all 0.) Here is the code:

import pandas as pd
import numpy as np 
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0,0.3,0.4],[0.3,0,0.7],[0.4,0.7,0]])

cluster = AgglomerativeClustering(affinity='precomputed')  
cluster.fit(X)

The error is:

runfile('/Users/stackoverflowuser/Desktop/4.2/Pr/untitled0.py', wdir='/Users/stackoverflowuser/Desktop/4.2/Pr')
Traceback (most recent call last):

  File "<ipython-input-1-b8b98765b168>", line 1, in <module>
    runfile('/Users/stackoverflowuser/Desktop/4.2/Pr/untitled0.py', wdir='/Users/stackoverflowuser/Desktop/4.2/Pr')

  File "/anaconda2/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 704, in runfile
    execfile(filename, namespace)

  File "/anaconda2/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 100, in execfile
    builtins.execfile(filename, *where)

  File "/Users/stackoverflowuser/Desktop/4.2/Pr/untitled0.py", line 84, in <module>
    cluster.fit(X)

  File "/anaconda2/lib/python2.7/site-packages/sklearn/cluster/hierarchical.py", line 795, in fit
    (self.affinity, ))

ValueError: precomputed was provided as affinity. Ward can only work with euclidean distances.

Is there any additional information I can provide? Thanks in advance.

Simon Larsson
M. Kaan

1 Answer


According to sklearn's documentation:

If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix (instead of a similarity matrix) is needed as input for the fit method.

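The default linkage is ward, which is why the call fails when only affinity='precomputed' is passed. So you need to change the linkage to one of complete, average or single. With that change, the code works: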

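import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Precomputed distance matrix (1 - cosine similarity); the diagonal is zero.
X = np.array([[0,0.3,0.4],[0.3,0,0.7],[0.4,0.7,0]])

# Any linkage other than ward accepts a precomputed distance matrix.
# Uncomment one of the alternatives to try a different linkage.
#cluster = AgglomerativeClustering(affinity='precomputed', linkage='complete')
#cluster = AgglomerativeClustering(affinity='precomputed', linkage='average')
cluster = AgglomerativeClustering(affinity='precomputed', linkage='single')
cluster.fit(X)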
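
As a rough follow-up sketch, here is one way to read the cluster assignments back out. To my knowledge, newer scikit-learn releases renamed the affinity parameter to metric (deprecated in 1.2 and later removed), so this version picks whichever keyword the installed release accepts. The variable names key and labels, the choice of n_clusters=2, and the average linkage are illustrative assumptions, not part of the original code.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Same small distance matrix as above.
X = np.array([[0, 0.3, 0.4], [0.3, 0, 0.7], [0.4, 0.7, 0]])

# Use "metric" if this scikit-learn version has it, otherwise fall back to "affinity".
params = inspect.signature(AgglomerativeClustering).parameters
key = "metric" if "metric" in params else "affinity"

cluster = AgglomerativeClustering(
    n_clusters=2,
    linkage="average",  # any of complete, average or single works here
    **{key: "precomputed"},
)

# fit_predict fits the model and returns one cluster label per row of X.
labels = cluster.fit_predict(X)
print(labels)
```

With this matrix, rows 0 and 1 are the closest pair (distance 0.3), so they should land in the same cluster while row 2 forms its own. The label numbers themselves are arbitrary.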
Simon Larsson