Text similarity with sentence embeddings

Question

I'm trying to calculate similarity between texts of various lengths. My current approach is as follows:

1. Using Universal Sentence Encoder, I convert text to a set of vectors.
2. I average these vectors to create the final feature vector.
3. I compare feature vectors using cosine similarity.
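
A minimal sketch of these three steps, in plain Python. The toy 3-dimensional vectors are made up and only stand in for real Universal Sentence Encoder outputs, and the names `mean_pool` and `cosine_similarity` are chosen for illustration:

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "sentence embeddings" (assumed values, not real encoder output):
# text_a has two sentences, text_b has one.
text_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_b = [[1.0, 0.0, 0.0]]

# Step 2 (averaging) followed by step 3 (cosine similarity)
similarity = cosine_similarity(mean_pool(text_a), mean_pool(text_b))
print(similarity)
```

Averaging collapses each text to a single vector regardless of how many sentences it has, which is the step I suspect loses information for longer texts.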

This gives me pretty good results for texts of roughly the same size, but I was wondering whether there is a better approach for step #2 when the texts have different lengths.

Brian Spiering · Accepted Answer · 2019-09-27T16:39:46.490

One approach is using Word Mover’s Distance (WMD). WMD is an algorithm for finding the distance between texts of different lengths, where each word is represented as a word embedding vector.

The WMD distance measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of another document.

[Figure not reproduced. Source: "From Word Embeddings To Document Distances" paper]

WMD can be modified into Sentence Mover’s Distance, which compares how far apart the sentence embeddings of two texts are from each other.
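
To make the "travel" idea concrete, here is a rough sketch in plain Python. It is not the exact WMD, which requires solving an optimal transport problem. Instead it computes the relaxed variant, in which every vector sends all of its weight to its nearest vector in the other text, giving a lower bound on the exact distance. It assumes uniform weights per vector and Euclidean distance, and the function names and toy 2-D vectors are made up for illustration. The same code applies whether the vectors are word embeddings or sentence embeddings:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_sided_cost(source, target):
    """Cost when each source vector moves all of its weight to its nearest
    target vector. Assumption: every source vector has weight 1/len(source)."""
    weight = 1.0 / len(source)
    return sum(weight * min(euclidean(s, t) for t in target) for s in source)

def relaxed_mover_distance(doc_a, doc_b):
    """Larger of the two one-sided costs: a lower bound on the exact
    mover's distance, not the exact value."""
    return max(one_sided_cost(doc_a, doc_b), one_sided_cost(doc_b, doc_a))

# Toy embeddings (assumed values): doc_b has one extra, far-away vector.
doc_a = [[0.0, 0.0], [1.0, 0.0]]
doc_b = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]

print(relaxed_mover_distance(doc_a, doc_a))  # identical texts
print(relaxed_mover_distance(doc_a, doc_b))  # texts of different lengths
```

Unlike averaging, this keeps every vector, so texts with different numbers of words or sentences can be compared directly. For the exact distance you would need an optimal transport solver.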

score 2 · Answer 2 · answered Sep 23 '19 at 16:57

With recent advances in language models, vector representations of sentences have been getting better, which might give good results in your case. For example, BERT can be used to obtain a sentence embedding. Consider the following use of BERT for sentence similarity:

You can take the pre-trained BERT model, pass in the two sentences, and feed the vector obtained at C (the final hidden state for the special [CLS] token) through a feed-forward neural network that decides whether the sentences are similar. This approach can work if you have a labelled data set. If you don't, consider the following:

Pass the variable-length sentences to the BERT network; the vector obtained at the token C becomes the vector for the sentence. You can then use cosine similarity the way you have been doing.
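
A rough sketch of this unsupervised variant, assuming the Hugging Face `transformers` library with PyTorch and the `bert-base-uncased` checkpoint (both are my choices for illustration, not the only option). The helper names `cls_vector`, `load_bert`, and `compare` are hypothetical:

```python
import math

def cls_vector(text, tokenizer, model):
    """Return the final hidden state at the first ([CLS]) position as a list of floats."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = model(**inputs)
    # last_hidden_state has shape (batch, tokens, hidden); [CLS] is token 0
    return outputs.last_hidden_state[0][0].detach().tolist()

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def load_bert(name="bert-base-uncased"):
    """Load a pre-trained tokenizer and model (downloads weights on first use)."""
    # Imported lazily so the helpers above do not require the library at import time
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    model.eval()  # disable dropout for deterministic embeddings
    return tokenizer, model

def compare(text_a, text_b):
    """Cosine similarity between the [CLS] vectors of two texts."""
    tokenizer, model = load_bert()
    return cosine_similarity(
        cls_vector(text_a, tokenizer, model),
        cls_vector(text_b, tokenizer, model),
    )
```

Calling `compare("A short sentence.", "A considerably longer sentence about the same topic.")` would return a single score. Note that `truncation=True` cuts off inputs longer than the model's maximum length, so very long texts would still need to be split.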
