
I'm watching an NLP video on Coursera. It discusses how to calculate the similarity of two vectors: first Euclidean distance, then cosine similarity. It says that cosine similarity makes more sense when the corpora are of different sizes. That's effectively the same explanation as the one given here.

I don't see, however, why we can't simply scale the vectors according to the size of the corpora. Take the example from the linked question:

  • User 1 bought 1x eggs, 1x flour and 1x sugar.
  • User 2 bought 100x eggs, 100x flour and 100x sugar.
  • User 3 bought 1x eggs, 1x Vodka and 1x Red Bull.

Vectors 1 and 2 clearly have different norms. We could normalize both of them to length 1; the two vectors then become identical, the Euclidean distance becomes 0, and we get a result just as good as with cosine similarity.
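
For concreteness, here is a quick numerical check of that claim (a minimal sketch using NumPy; the feature ordering [eggs, flour, sugar] is just an assumed encoding):

```python
# Sketch: normalize both basket vectors to unit length, then compare
# the Euclidean distance between them with plain cosine similarity.
import numpy as np

user1 = np.array([1.0, 1.0, 1.0])        # 1x eggs, 1x flour, 1x sugar
user2 = np.array([100.0, 100.0, 100.0])  # 100x eggs, 100x flour, 100x sugar

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

euclidean_after_normalizing = np.linalg.norm(normalize(user1) - normalize(user2))
cosine_similarity = user1 @ user2 / (np.linalg.norm(user1) * np.linalg.norm(user2))

print(euclidean_after_normalizing)  # ~0.0 -> identical once normalized
print(cosine_similarity)            # 1.0  -> cosine similarity agrees
```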

Why is this not done?

Allure

3 Answers


Let $u, v$ be vectors. The "cosine distance" between them is given by

$$d_{\cos}(u, v) = 1 - \frac{u}{\|u\|} \cdot \frac{v}{\|v\|} = 1 - \cos \theta_{u,v},$$

and the proposed "normalized Euclidean distance" is given by

$$d_{NE}(u, v) = \left\| \frac{u}{\|u\|} - \frac{v}{\|v\|} \right\| = d_E\!\left(\frac{u}{\|u\|}, \frac{v}{\|v\|}\right).$$
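
For concreteness, here is a direct NumPy transcription of these two definitions (a sketch; the function names are my own):

```python
# Sketch: the two distances, written exactly as defined above.
import numpy as np

def cosine_distance(u, v):
    """d_cos(u, v) = 1 - (u/||u||) . (v/||v||)."""
    return 1.0 - (u / np.linalg.norm(u)) @ (v / np.linalg.norm(v))

def normalized_euclidean_distance(u, v):
    """d_NE(u, v) = || u/||u|| - v/||v|| ||."""
    return np.linalg.norm(u / np.linalg.norm(u) - v / np.linalg.norm(v))

# Quick check on random vectors: d_cos = (1/2) d_NE^2 (the identity in [1]).
rng = np.random.default_rng(0)
u, v = rng.normal(size=3), rng.normal(size=3)
assert np.isclose(cosine_distance(u, v),
                  0.5 * normalized_euclidean_distance(u, v) ** 2)
```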

By symmetry, both distance measures can be written as univariate functions of the angle $\theta_{u,v}$ between $u$ and $v$. [1] Let's then compare the two distances as functions of the angle $\theta_{u,v}$, measured in radians.
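
To see the reduction to $\theta_{u,v}$ explicitly, write $\hat u = u/\|u\|$ and $\hat v = v/\|v\|$. Then $\|\hat u - \hat v\|^2 = 2 - 2\,\hat u \cdot \hat v = 2 - 2\cos \theta_{u,v}$, so

$$d_{NE}(\theta_{u,v}) = \sqrt{2 - 2\cos \theta_{u,v}} = 2\left|\sin \tfrac{\theta_{u,v}}{2}\right|, \qquad d_{\cos}(\theta_{u,v}) = 1 - \cos \theta_{u,v},$$

which is also where the identity in footnote [1] comes from.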

[Figure: cosine distance and normalized Euclidean distance plotted as functions of the angle $\theta_{u,v}$.]

Evidently, both have the fundamental properties we desire: strictly increasing monotonicity for $\theta_{u,v} \in [0, \pi]$, and appropriate symmetry and periodicity in $\theta_{u,v}$.
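
Here is a small sketch (NumPy/Matplotlib; the function names are my own) that reproduces curves like those in the figure and prints the ratios used in the argument below:

```python
# Sketch: compare the two distances as functions of the angle between
# unit vectors, and see how much a "maximally wrong" pair costs relative
# to a slightly misaligned one under each distance.
import numpy as np
import matplotlib.pyplot as plt

def d_cos(theta):
    """Cosine distance between unit vectors at angle theta."""
    return 1.0 - np.cos(theta)

def d_ne(theta):
    """Euclidean distance between unit vectors at angle theta."""
    return np.sqrt(2.0 - 2.0 * np.cos(theta))  # = 2 * |sin(theta / 2)|

theta = np.linspace(0.0, 2.0 * np.pi, 500)

print(d_ne(np.pi) / d_ne(np.pi / 12))    # ~7.7  -> roughly 8 under Euclidean
print(d_cos(np.pi) / d_cos(np.pi / 12))  # ~58.7 -> nearly 60 under cosine

plt.plot(theta, d_cos(theta), label=r"$d_{\cos}$")
plt.plot(theta, 0.5 * d_ne(theta), label=r"$\frac{1}{2} d_{NE}$")
plt.xlabel(r"$\theta_{u,v}$ (radians)")
plt.ylabel("distance")
plt.legend()
plt.show()
```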

Their shapes are different, however. Euclidean distance punishes small angular deviations more heavily than is arguably necessary. Why does this matter? The training algorithm is trying to reduce the total error across the dataset. Under Euclidean distance, law-abiding vectors are punished fairly harshly ($\frac{1}{2} d_{NE}(\theta_{u,v} = \pi/12) \approx 0.13$), making it easier for the training algorithm to get away with much more serious crimes ($\frac{1}{2} d_{NE}(\theta_{u,v} = \pi) = 1.00$). That is, under Euclidean distance, roughly 8 law-abiding vectors incur as much error as one pair of maximally opposite-facing vectors.

Under cosine distance, justice is meted out more proportionately: it would take nearly 60 such mildly deviant vectors to match one opposite-facing pair, so society (the total error across the dataset) as a whole can get better.


[1] In fact, $d_{\cos}(u, v) = \frac{1}{2} (d_{NE}(u, v))^2$.

Mateen Ulhaq

In your example, User 1 and User 2 bought the same ingredients, but User 2 bought 100x the quantities. If you normalize and then use Euclidean distance, the distance is 0 (by the mathematical definition of that distance), whereas if you do not normalize, the two vectors are far apart. Cosine similarity, on the other hand, is 1 in either case, whether you normalize (i.e., 100x eggs $\to$ 1x eggs, 100x flour $\to$ 1x flour, 100x sugar $\to$ 1x sugar) or not. This answers your question.

Moreover, observe that User 3 bought eggs but also other items that the first two did not. Comparing User 3's vector with User 1's or User 2's would be like comparing apples with pears.
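
To make that concrete, here is a small sketch (the shared feature ordering [eggs, flour, sugar, vodka, red bull] is my own choice) comparing User 3's direction with the other two:

```python
# Sketch: cosine similarity of User 3 against Users 1 and 2 in a shared
# feature space [eggs, flour, sugar, vodka, red bull] (assumed ordering).
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

user1 = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
user2 = np.array([100.0, 100.0, 100.0, 0.0, 0.0])
user3 = np.array([1.0, 0.0, 0.0, 1.0, 1.0])

print(cosine_similarity(user1, user2))  # 1.0   -> same direction, scale ignored
print(cosine_similarity(user1, user3))  # ~0.33 -> mostly different items
print(cosine_similarity(user2, user3))  # ~0.33 -> scaling changes nothing
```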

Eduard