40

Jaccard similarity and cosine similarity are two very common measures for comparing the similarity of items. However, I am not clear on which situations call for one rather than the other.

Can somebody help clarify the differences between these two measures (the difference in concept or principle, not in definition or computation) and the applications each is best suited to?

shihpeng

4 Answers

33

The answer from saq7 is wrong, and it also does not answer the question.

$\|A\|$ means the $L_2$ norm of $A$, i.e. the length of the vector in Euclidean space, not the dimensionality of the vector $A$. In other words, for a binary vector you don't count the 0 bits; you add up only the 1 bits and take the square root.
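To make the distinction concrete, here is a minimal sketch in plain Python. The helper `l2_norm` and the example vector `a` are made up for illustration; they are not from any particular library.

```python
import math

def l2_norm(vec):
    """Euclidean (L2) length of a vector given as a list of numbers."""
    return math.sqrt(sum(x * x for x in vec))

# Hypothetical binary vector: 9 dimensions, four 1 bits.
a = [1, 0, 1, 1, 0, 0, 0, 0, 1]

norm = l2_norm(a)   # length of the vector
ones = sum(a)       # number of 1 bits
dims = len(a)       # dimensionality, which is NOT the norm

# For a 0/1 vector x * x == x, so the norm equals sqrt(number of 1 bits).
print(norm, math.sqrt(ones), dims)
```

The norm comes out as the square root of 4, while the dimensionality is 9, so confusing the two changes the cosine denominator.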

Sorry, I don't have a real answer as to when you should use which metric, but I can't let the incorrect answer go unchallenged.

Stephen Rauch
user18596
21

Jaccard Similarity is given by $s_{ij} = \frac{p}{p+q+r}$

where

p = # of attributes 1 for both i and j
q = # of attributes 1 for i and 0 for j
r = # of attributes 0 for i and 1 for j

Cosine similarity, by contrast, is $\frac{A \cdot B}{\|A\|\|B\|}$, where $A$ and $B$ are the object vectors.

Simply put, in cases where the vectors A and B consist of 0s and 1s only, cosine similarity divides the number of common attributes by the product of the Euclidean lengths of A and B (each being the square root of that vector's number of 1s). In Jaccard similarity, the number of common attributes is instead divided by the number of attributes that exist in at least one of the two objects.
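As a rough sketch of the two formulas for 0/1 vectors, the following plain-Python snippet counts p, q and r and plugs them in. The function names and the example vectors are illustrative assumptions, and the code assumes neither vector is all zeros (otherwise the denominators are zero).

```python
import math

def binary_counts(a, b):
    """Return (p, q, r) for two equal-length 0/1 vectors."""
    p = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # 1 in both
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 only in a
    r = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 1 only in b
    return p, q, r

def jaccard(a, b):
    p, q, r = binary_counts(a, b)
    return p / (p + q + r)          # common / present in at least one

def cosine(a, b):
    p, q, r = binary_counts(a, b)
    # ||a|| = sqrt(p + q) and ||b|| = sqrt(p + r) for 0/1 vectors.
    return p / math.sqrt((p + q) * (p + r))

# Hypothetical example: two shared attributes, one unique to each object.
a = [1, 1, 1, 0, 0, 0]
b = [1, 1, 0, 1, 0, 0]

print(jaccard(a, b))  # 2 / (2 + 1 + 1)
print(cosine(a, b))   # 2 / sqrt(3 * 3)
```

Note that attributes that are 0 in both objects appear in neither formula.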

And there are many other measures of similarity, each with its own eccentricities. When deciding which one to use, try to think of a few representative cases and work out which index would give the most usable results to achieve your objective.

The cosine index could be used to identify plagiarism but will not be a good index for identifying mirror sites on the internet. The Jaccard index, by contrast, will be a good index for identifying mirror sites but not so great at catching copy-paste plagiarism (within a larger document).

When applying these indices, you must think about your problem thoroughly and figure out how to define similarity. Once you have a definition in mind, you can go about shopping for an index.

Edit: Earlier, I had an example included in this answer, which was ultimately incorrect. Thanks to the several users who have pointed that out, I have removed the erroneous example.

saq7
12

Jaccard similarity is used for two types of binary cases:

  1. Symmetric, where 1 and 0 have equal importance (gender, marital status, etc.)
  2. Asymmetric, where 1 and 0 have different levels of importance (testing positive for a disease)

Cosine similarity is usually used in text mining to compare documents or emails. The higher the cosine similarity between two document term vectors, the more words the two documents have in common.

Another difference is that 1 - Jaccard coefficient can be used as a dissimilarity or distance measure, whereas cosine similarity has no such construct. A related measure is the Tanimoto distance, which is used in taxonomy.
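One way to read this point is that 1 - cosine similarity, unlike 1 - Jaccard, can violate the triangle inequality. The sketch below checks a single hand-picked triple; the function names and the sets/vectors are illustrative assumptions, and one example is of course not a proof that 1 - Jaccard is a metric in general.

```python
import math

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two non-empty sets."""
    return 1 - len(a & b) / len(a | b)

def cosine_dissimilarity(u, v):
    """1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1 - dot / (norm_u * norm_v)

# The same three objects, once as sets and once as binary vectors.
A, B, C = {0}, {0, 1}, {1}
u, v, w = [1, 0], [1, 1], [0, 1]

jd_direct = jaccard_distance(A, C)
jd_via_b = jaccard_distance(A, B) + jaccard_distance(B, C)

cd_direct = cosine_dissimilarity(u, w)
cd_via_v = cosine_dissimilarity(u, v) + cosine_dissimilarity(v, w)

print(jd_direct <= jd_via_b)  # triangle inequality holds for this triple
print(cd_direct <= cd_via_v)  # triangle inequality fails for this triple
```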

Vikram Venkat
8

The answer by saq7 is wrong.

Where $\mathbf{a}$ and $\mathbf{b}$ are binary vectors, they can be interpreted as sets of indices with value 1. Let's therefore consider sets $A$ and $B$.

Jaccard similarity is then given by $$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A \cap B| + |A - B| + |B - A|}$$

Cosine similarity is then given by $$C(A, B) = \frac{|A \cap B|}{\sqrt{\left|A\right|\left|B\right|}} = \frac{|A \cap B|}{\sqrt{(\left|A\cap B\right| + |A - B|)(\left|A\cap B\right| + |B - A|)}}$$

Some comparisons:

  • The numerators here are the same.
  • In Jaccard the denominator is $|A \cup B| = |A| + |B| - |A \cap B|$, which grows additively with $|A|$ and $|B|$; in cosine it is their geometric mean, $\sqrt{|A||B|}$.
  • The denominator of cosine depends only on the number of items in $A$ and the number of items in $B$. It does not depend on their intersection.
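A small sketch of how the two denominators behave when the sets differ a lot in size. The case chosen here, one set fully contained in a larger one, is my own illustrative assumption, as are the function names; both sets are assumed non-empty.

```python
import math

def jaccard_sets(a, b):
    """|A & B| / |A | B|"""
    return len(a & b) / len(a | b)

def cosine_sets(a, b):
    """|A & B| / sqrt(|A| * |B|)"""
    return len(a & b) / math.sqrt(len(a) * len(b))

small = set(range(100))   # 100 items
large = set(range(400))   # 400 items, contains all of `small`

j = jaccard_sets(small, large)  # 100 / 400
c = cosine_sets(small, large)   # 100 / sqrt(100 * 400)

print(j, c)
```

When one set contains the other, Jaccard reduces to the size ratio $|A|/|B|$ and cosine to its square root, so cosine penalizes the size mismatch less. More generally, since $|A \cup B| \ge \sqrt{|A||B|}$, Jaccard never exceeds cosine on sets.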

I do not yet have a clear intuition on where one should be preferred over the other, except that, as Vikram Venkat noted, 1 - Jaccard corresponds to a true metric, unlike cosine; and cosine naturally extends to real-valued vectors.

Stephen Rauch
joeln