I am given a set of 10,000 journal articles, each with a corresponding 100-dimensional embedding vector. (How they were embedded is unknown, but my guess is word2vec; the vector values range from -0.5 to 0.5.) Among the 10,000 articles, 10 are my target articles. My objective is to find several articles that are 'similar' to my target articles.
After reading this post, it seems that text-based similarity measures such as tf-idf are unnecessary for my task, since I already have access to the embedding vectors. If so, how would I optimally calculate the similarity between my target articles and the rest, given the size of my dataset (10,000 × 100)?
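For concreteness, here is a minimal sketch of what I have in mind, assuming the vectors sit in a NumPy array (the names `embeddings` and `target_idx` are placeholders, and I use random data in place of my real vectors):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder data: in my case, a (10000, 100) matrix of embedding vectors
# and the row indices of the 10 target articles.
rng = np.random.default_rng(0)
embeddings = rng.uniform(-0.5, 0.5, size=(10000, 100))
target_idx = np.arange(10)

targets = embeddings[target_idx]

# Cosine similarity between every target and every article: shape (10, 10000).
sims = cosine_similarity(targets, embeddings)

# Mask out the target articles themselves so they are not returned as
# their own nearest neighbours.
sims[:, target_idx] = -np.inf

# Top-k most similar articles per target, by descending similarity.
k = 5
top_k = np.argsort(-sims, axis=1)[:, :k]
print(top_k)  # row i holds the indices of the k articles most similar to target i
```

Is brute-force cosine similarity like this the right approach here, or would it make more sense to average the 10 target vectors into a single centroid and rank all articles against that? I assume approximate nearest-neighbour indexes (e.g. Faiss or Annoy) only become necessary at a much larger scale than 10,000 × 100, but I would appreciate confirmation.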