
I'm going through this guide on semantic similarity and am using the code there as-is.

I'm applying it to a dataset where each row is typically a paragraph (3-4 sentences, over 100 words). Currently, I have over 100k observations, but this number is likely to grow to 500k.

I want to measure semantic similarity between all rows.

When I test BoW and TFIDF on a sample of around 20-30k rows, I don't get any performance issues (even without cleaning, stopword removal, etc.).

When I try Word2Vec/Universal Sentence Encoder, however, it takes a couple of hours to finish even on a 3-4k row sample.

I also get completely different results, but that's beside the point.

Is there a way to improve the performance of Word2Vec/Universal Sentence Encoder, especially the latter? (As far as I understand, in Word2Vec the words "good" and "bad" may cancel each other out, which is not good for my speech-like data.)

2 Answers


One approach would be to profile the code to empirically find the slowest parts.
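A minimal profiling sketch with cProfile, assuming the referenced pipeline has been wrapped in a function (build_similarity_matrix here is a hypothetical name):

import cProfile
import pstats

# Run the (hypothetical) end-to-end pipeline under the profiler and dump the stats to a file.
cProfile.run("build_similarity_matrix(headlines)", "similarity.prof")

# Show the 10 functions with the largest cumulative time.
pstats.Stats("similarity.prof").sort_stats("cumulative").print_stats(10)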

Separately, a quick visual scan of the code you referenced revealed inefficiencies. For example, there are several list comprehensions:

labels = [headline[:20] for headline in headlines]
docs = [nlp(headline) for headline in headlines]

One straightforward way to speed up the code is to convert those into generator expressions.
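A sketch of that conversion (with the caveat that a generator can only be consumed once and does not support indexing, so this only helps where each item is streamed through a single time):

labels = (headline[:20] for headline in headlines)   # lazily yields labels on demand
docs = (nlp(headline) for headline in headlines)     # lazily yields parsed docs on demand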

Additionally, there are nested for-loops:

similarity = []
for i in range(len(docs)):
    row = []
    for j in range(len(docs)):
        row.append(docs[i].similarity(docs[j]))
    similarity.append(row)

You may not need to do a doc-by-doc comparison.
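As a sketch of one alternative (not from the original post): stack the document vectors that spaCy already computes, the same vectors doc.similarity() compares, and obtain the whole cosine-similarity matrix with one NumPy matrix product instead of the nested Python loops.

import numpy as np

# Stack each document's vector into a single (n_docs, n_dims) matrix.
vectors = np.vstack([doc.vector for doc in docs])

# Row-normalize, then one matrix product yields the full cosine-similarity
# matrix, replacing the O(n^2) Python-level loop with vectorized NumPy.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
normalized = vectors / np.clip(norms, 1e-12, None)
similarity = normalized @ normalized.T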

Brian Spiering

Two ideas:

  • I understand that you calculate similarity between every pair of rows/documents, right? If so, the bottleneck is the quadratic processing of all the pairs. First, you should compare only $(d_1,d_2)$ and not $(d_2,d_1)$ (using indexes: only if $i<j$), which saves 50% of the time. I also assume that the goal is to capture pairs/groups of strongly similar documents. If so, one method would be to first apply the BoW/TFIDF method (simpler and faster), then apply the embeddings method only to the pairs that reach some similarity threshold with the first method (see the first sketch after this list).

  • A completely different approach: apply topic modelling (LDA, HDP, or another recent method) to the set of documents (see the second sketch below). This would likely be faster. It might also reveal a different kind of semantic similarity between documents.
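A sketch of the two-stage idea from the first bullet, assuming scikit-learn for the cheap TFIDF pass; the threshold and variable names are illustrative:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Cheap first pass: TFIDF cosine similarity over all documents (texts is the
# list of raw paragraphs, docs the corresponding embedded documents).
coarse = cosine_similarity(TfidfVectorizer().fit_transform(texts))

# Only look at i < j (the matrix is symmetric), which halves the work, and
# keep only the pairs above an illustrative coarse threshold.
threshold = 0.3
candidate_pairs = [(i, j)
                   for i, j in zip(*np.triu_indices(coarse.shape[0], k=1))
                   if coarse[i, j] >= threshold]

# Expensive second pass: embedding similarity only for the surviving pairs.
fine = {(i, j): docs[i].similarity(docs[j]) for i, j in candidate_pairs}

And a rough gensim sketch of the topic-modelling route from the second bullet (the tokenization and the number of topics are placeholders):

from gensim import corpora, models

# Placeholder tokenization; real preprocessing would handle stopwords, etc.
tokenized = [text.lower().split() for text in texts]
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Fit LDA; each document then becomes a topic distribution, and documents can
# be compared by the similarity of those (much lower-dimensional) vectors.
lda = models.LdaModel(bow_corpus, num_topics=20, id2word=dictionary)
topic_vectors = [lda.get_document_topics(bow, minimum_probability=0.0)
                 for bow in bow_corpus]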

Erwan