Data Set and guidance for Occupations/Roles classification problem

Question

I am working on a project where I need to find similar roles. For example, Software Engineer, Soft. Engineer, and Software Eng should all be marked as similar.

So far, I have tried the Standard Occupational Classification dataset with LSA, Levenshtein distance, and unsupervised FastText with Word Mover's Distance. The last option works but isn't great.

Are there more comprehensive datasets, or better ways to solve this problem? Any lead would be helpful!

score 0 · Answer 1 · answered Aug 19 '21 at 13:53

You can compute text similarity with Transformer-based sentence embeddings, which tend to give better accuracy. Try the following code:

# Install first (run in a shell, not in Python):
#   pip install sentence-transformers==1.2.1
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model and embed each job title as a fixed-size vector.
model = SentenceTransformer('distilbert-base-uncased')
sen = [
    "Software Engineer",
    "Soft. Engineer",
    "Software Eng",
    "Senior Software Engineer",
]
sen_embeddings = model.encode(sen)

# Cosine similarity between title 0 and each of the remaining titles.
scores = cosine_similarity(
    [sen_embeddings[0]],
    sen_embeddings[1:]
)
print(scores)

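If the similarity score is greater than about 0.6 (or 0.7), you can treat the two titles as similar. The right cutoff depends on your data, so check it against a few pairs you have labelled by hand.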
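The threshold rule above can be turned into groups of equivalent titles. The following is only a minimal sketch: `group_similar` is a hypothetical helper (not part of any library), it links titles transitively with a small union-find, and the similarity matrix in the demo holds made-up placeholder values rather than real model output. In practice you would pass in something like `cosine_similarity(sen_embeddings)`.

```python
from typing import List, Sequence


def group_similar(titles: Sequence[str],
                  sim: Sequence[Sequence[float]],
                  threshold: float = 0.7) -> List[List[str]]:
    """Group titles whose pairwise similarity exceeds `threshold`.

    Titles are linked transitively (single linkage): if A~B and B~C,
    all three land in one group even if A and C score below threshold.
    """
    parent = list(range(len(titles)))

    def find(i: int) -> int:
        # Follow parent links to the group's root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair that scores above the threshold.
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if sim[i][j] > threshold:
                parent[find(j)] = find(i)

    # Collect titles by their root, preserving input order.
    groups = {}
    for i, title in enumerate(titles):
        groups.setdefault(find(i), []).append(title)
    return list(groups.values())


# Hypothetical titles for the demo.
titles = ["Software Engineer", "Soft. Engineer", "Software Eng", "Data Analyst"]
# Placeholder scores for illustration only, NOT real model output.
# In practice: sim = cosine_similarity(sen_embeddings)
sim = [
    [1.00, 0.85, 0.90, 0.30],
    [0.85, 1.00, 0.80, 0.25],
    [0.90, 0.80, 1.00, 0.35],
    [0.30, 0.25, 0.35, 1.00],
]
print(group_similar(titles, sim, threshold=0.7))
```

Because the linking is transitive, a chain of near-duplicates can pull loosely related titles into one group; if that happens on your data, a higher threshold or a proper clustering method may work better.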
