9

I'm trying to calculate the similarity between texts of varying lengths. My current approach is as follows:

  1. Using Universal Sentence Encoder, I convert text to a set of vectors.
  2. I average these vectors to create the final feature vector.
  3. I compare feature vectors using cosine similarity.
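
The three steps above can be sketched roughly as follows. This is a minimal illustration rather than the actual pipeline: the small 2-D vectors are made-up stand-ins for Universal Sentence Encoder output, and `mean_pool` and `cosine_similarity` are hypothetical helper names written in plain Python.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise (step 2)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (step 3)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for per-sentence embeddings (assumption: real ones would
# come from the encoder in step 1 and have many more dimensions).
text_a = [[1.0, 0.0], [0.0, 1.0]]              # a text with 2 sentences
text_b = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # a text with 3 sentences

vec_a = mean_pool(text_a)
vec_b = mean_pool(text_b)
similarity = cosine_similarity(vec_a, vec_b)
print(similarity)
```

In this toy case the two averaged vectors point in the same direction, so the similarity is close to 1 even though no individual sentence vector of `text_a` matches one of `text_b`. That is one way averaging can hide differences between texts.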

This gives me pretty good results for texts of roughly the same size, but I was wondering whether there is a better approach for step 2 when the texts have different lengths.

Gyan Ranjan

2 Answers

9

One approach is Word Mover’s Distance (WMD), an algorithm for measuring the distance between texts of different lengths in which each word is represented by a word embedding vector.

WMD measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to "travel" to reach the embedded words of the other document.

For example:

[Image] Source: "From Word Embeddings To Document Distances" paper

WMD can be adapted into Sentence Mover’s Distance, which compares how far apart the sentence embeddings of two texts are from each other.
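
To give a feel for the idea, here is a minimal sketch of a relaxed variant applied to sentence vectors. It is not exact WMD, which requires solving an optimal transport problem. Instead, each vector moves all of its weight to its nearest neighbour in the other text, and the larger of the two one-sided costs is used, which gives a lower bound on the exact distance. The uniform weights, the Euclidean metric, the toy vectors, and the function names are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_sided_cost(source, target):
    """Cost of moving every source vector (weight 1/len(source))
    entirely to its nearest vector in target."""
    return sum(min(euclidean(s, t) for t in target) for s in source) / len(source)

def relaxed_movers_distance(doc_a, doc_b):
    """Larger of the two one-sided costs; a lower bound on the exact
    mover's distance under uniform weights."""
    return max(one_sided_cost(doc_a, doc_b), one_sided_cost(doc_b, doc_a))

# Toy sentence embeddings (assumption: real ones come from an encoder).
doc_a = [[0.0, 0.0], [1.0, 0.0]]
doc_b = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]

distance = relaxed_movers_distance(doc_a, doc_b)
print(distance)
```

Because the cost is computed per vector and normalised by the number of vectors, the two texts do not need to contain the same number of sentences. Unlike cosine similarity, a smaller value here means the texts are more alike.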

Brian Spiering
2

Language models have recently become much better at representing sentences as vectors, which may give good results in your case. For example, BERT can be used to obtain a sentence embedding. The following figure shows BERT used for sentence similarity:

[Image: BERT for sentence similarity]

You can pass two sentences to a pre-trained BERT model and feed the vector obtained at C (the output at the [CLS] token) through a feed-forward neural network that decides whether the sentences are similar. This approach works if you have a labelled data set. If you don't, consider the following:

[Image: BERT for single sentence]

You pass each variable-length sentence to the BERT network, and the vector obtained at C (the [CLS] token output) becomes the vector for that sentence. You can then compare these vectors with cosine similarity, as you have been doing.

Gyan Ranjan