I'm looking for a way to rank the tens or hundreds of named entities present in a document by their importance or relevance in context.
Any thoughts?
Thanks in advance!
Ranking by occurrence is easy: just count how often each entity appears in the document. Ranking by importance is harder, because the importance metric depends on the task you are performing. Which brings me to the main question: what do you want to do with the ranking? I need to know that to help further.
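For the occurrence-based ranking, a minimal sketch could look like the following. It assumes you already have the entity strings from some NER step (the `entities` list here is made up for illustration) and it uses naive case-insensitive substring matching, so it will overcount entities that appear inside longer words.

```python
from collections import Counter


def rank_by_frequency(text, entities):
    """Return (entity, count) pairs, most frequent first."""
    lowered = text.lower()
    # Naive substring counting; a real pipeline would count NER spans instead.
    counts = Counter({entity: lowered.count(entity.lower()) for entity in entities})
    return counts.most_common()


document = (
    "Apple opened a new office in Berlin. Apple said the Berlin office "
    "will work with Siemens."
)
entities = ["Apple", "Berlin", "Siemens"]  # hypothetical NER output
frequency_ranking = rank_by_frequency(document, entities)
```

Here `frequency_ranking` puts "Apple" and "Berlin" (two mentions each) ahead of "Siemens" (one mention).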
An easy way would be to use TF-IDF (term frequency–inverse document frequency). It measures how much a term stands out in a document compared with the rest of your corpus, and you can use that score to rank your entities.
scikit-learn provides a TfidfVectorizer for this.
Just note that TfidfVectorizer works at the word level by default, so some preprocessing will be needed if your entities can consist of more than one word.
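One possible way to handle multi-word entities is to join each one into a single underscore-separated token before vectorizing, and to restrict the vocabulary to the entity tokens. This is only a sketch under assumptions: the corpus and entity list are invented, `merge_entities` is a hypothetical helper, and the plain string replacement stands in for whatever span information your NER tool gives you.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def merge_entities(text, entities):
    """Join each multi-word entity into one underscore-separated token."""
    text = text.lower()
    # Replace longer entities first so they win over entities they contain.
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity.lower(), entity.lower().replace(" ", "_"))
    return text


corpus = [
    "New York hosted the summit. Angela Merkel praised New York.",
    "Angela Merkel met officials in Paris.",
    "Paris and London signed a trade deal.",
]
ner_entities = ["New York", "Angela Merkel", "Paris", "London"]  # made-up NER output

# Only score the entity tokens; every other word is ignored.
vocabulary = [entity.lower().replace(" ", "_") for entity in ner_entities]
vectorizer = TfidfVectorizer(vocabulary=vocabulary)
tfidf = vectorizer.fit_transform([merge_entities(doc, ner_entities) for doc in corpus])

# Rank the entities for the first document by their TF-IDF score.
doc_scores = tfidf[0].toarray().ravel()
tfidf_ranking = sorted(zip(ner_entities, doc_scores), key=lambda pair: pair[1], reverse=True)
```

In the first document, "New York" ranks above "Angela Merkel" because it occurs twice there and appears in no other document, while "Angela Merkel" also shows up in the second one.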
Alternatively, you could use a model that lets you produce a heatmap over the words of the document, and then look up your named entities in that heatmap. The paper "A Structured Self-Attentive Sentence Embedding" could give you some ideas.
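Once you have a per-token weight from such a model, the lookup step could be sketched roughly like this. The weights below are invented numbers, not the output of any real model, and `score_entities` is a hypothetical helper. Summing over the span favors longer entities, so you may prefer an average or a maximum depending on your task.

```python
def score_entities(tokens, weights, entities):
    """Sum per-token weights over every occurrence of each entity."""
    lowered = [token.lower() for token in tokens]
    scores = {}
    for entity in entities:
        parts = entity.lower().split()
        total = 0.0
        # Slide a window over the tokens and match the entity's words in order.
        for start in range(len(lowered) - len(parts) + 1):
            if lowered[start:start + len(parts)] == parts:
                total += sum(weights[start:start + len(parts)])
        scores[entity] = total
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


tokens = ["Angela", "Merkel", "visited", "Paris", "yesterday"]
weights = [0.30, 0.35, 0.05, 0.25, 0.05]  # made-up attention weights, one per token
heatmap_ranking = score_entities(tokens, weights, ["Paris", "Angela Merkel"])
```

With these made-up weights, "Angela Merkel" ends up ahead of "Paris".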