
I am working on a project where I need to find similar job titles. For example, "Software Engineer", "Soft. Engineer" and "Software Eng" should all be marked as similar.

So far I have used the Standard Occupational Classification dataset and tried LSA, Levenshtein distance, and unsupervised FastText with Word Mover's Distance. The last option works but isn't great.

Are there more comprehensive datasets or better approaches for this problem? Any lead would be helpful!

1 Answer


You can compute text similarity with Transformer-based sentence embeddings, which often give better results than edit-distance or bag-of-words approaches. Try the following code:

pip install sentence-transformers==1.2.1

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-uncased')

sen = ["Software Engineer", "Soft. Engineer", "Software Eng", "Senior Software Engineer"]

sen_embeddings = model.encode(sen)

from sklearn.metrics.pairwise import cosine_similarity
# let's calculate the cosine similarity between sentence 0 and the rest:
cosine_similarity([sen_embeddings[0]], sen_embeddings[1:])

If the similarity score is greater than 0.6 (or 0.7), you can treat the two titles as similar. Take that threshold as a starting point and tune it on your own data.
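To apply the threshold to every pair of titles rather than just sentence 0, something like the sketch below could work. It is only an illustration: `cosine` and `similar_pairs` are hypothetical helper names, and the two-dimensional vectors are toy placeholders standing in for real embeddings, not model output.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_u * norm_v)


def similar_pairs(titles, embeddings, threshold=0.7):
    """Return (title_a, title_b, score) for every pair scoring >= threshold."""
    pairs = []
    for i, j in combinations(range(len(titles)), 2):
        score = cosine(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((titles[i], titles[j], score))
    return pairs


# Toy placeholder vectors (assumption: real embeddings have hundreds of dimensions).
titles = ["Software Engineer", "Software Eng", "Accountant"]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

pairs = similar_pairs(titles, toy_embeddings, threshold=0.7)
for a, b, score in pairs:
    print(f"{a!r} ~ {b!r}: {score:.3f}")
```

With real data you would pass `sen` and `sen_embeddings` from the code above in place of the toy lists, then adjust `threshold` after checking a sample of matched and unmatched pairs by hand.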

Shrinidhi M