
I want to use Latent Dirichlet Allocation for a project, and I am using Python with the gensim library. After finding the topics I would like to cluster the documents using an algorithm such as k-means (ideally I would like to use a good one for overlapping clusters, so any recommendation is welcome). I managed to get the topics, but they are in the form of:

0.041*Minister + 0.041*Key + 0.041*moments + 0.041*controversial + 0.041*Prime

In order to apply a clustering algorithm (and correct me if I'm wrong), I believe I should find a way to represent each word as a number, using either tf-idf or word2vec.

Do you have any ideas of how I could "strip" the textual information from, e.g., a list in order to do so, and then put it back so I can perform the appropriate multiplication?

For instance, the way I see it, if the word Minister has a tf-idf weight of 0.042, and so on for every other word within the same topic, I should be able to compute something like:

0.041*0.042 + ... + 0.041*tfidf(Prime) and get a result that will later be used to cluster the documents.
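
To make the idea concrete, here is a rough sketch of what I picture doing; the tf-idf values below are made up just for illustration:

    # parse one topic string of the form shown above and combine the topic
    # weights with (made-up) tf-idf weights into a single number
    topic_string = "0.041*Minister + 0.041*Key + 0.041*moments + 0.041*controversial + 0.041*Prime"
    tfidf_weights = {"Minister": 0.042, "Key": 0.030, "moments": 0.025,
                     "controversial": 0.018, "Prime": 0.042}  # placeholder values

    score = 0.0
    for term in topic_string.split(" + "):
        weight, word = term.split("*")
        score += float(weight) * tfidf_weights.get(word, 0.0)

    print(score)  # one number per topic, to feed into the clustering step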

Thank you for your time.

Swan87

3 Answers


Assuming that LDA produced a list of topics and put a score against each topic for each document, you could represent each document and its scores as a vector:

Document | Topic 1 (Prime) | Topic 2 (Minister) | Topic 3 (Controversial) | ... | Topic N
   1           0.041             0.042                 0.041                ...
   2           0.052             0.011                 0.042                ...

To get the scores for each document, you can run the document, as a bag of words, through a trained LDA model. From the gensim documentation:

>>> from gensim.models import LdaModel
>>> lda = LdaModel(corpus, num_topics=100)  # train the model
>>> print(lda[doc_bow])  # get the topic probability distribution for a document
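
Building on that snippet, a minimal sketch of how you could collect these scores into a dense document-by-topic matrix (assuming lda and corpus are the trained model and bag-of-words corpus from above):

    import numpy as np

    # one row per document, one column per topic
    doc_topic = np.zeros((len(corpus), lda.num_topics))
    for i, bow in enumerate(corpus):
        # minimum_probability=0.0 so every topic gets a value, even tiny ones
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            doc_topic[i, topic_id] = prob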

Then, you could run k-means on this matrix, and it should group similar documents together. K-means is by default a hard clustering algorithm, meaning that it assigns each document to exactly one cluster. If you want a probability score for how well a document fits each cluster, you could use a soft clustering method such as fuzzy k-means; https://gist.github.com/mblondel/1451300 is a Python gist showing how to do it with scikit-learn.
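
As a minimal sketch, the hard-clustering case with scikit-learn's KMeans could look like this (n_clusters=10 is just a placeholder; the fuzzy k-means gist above covers the soft case):

    from sklearn.cluster import KMeans

    # hard assignment: one cluster id per document
    kmeans = KMeans(n_clusters=10, random_state=0)
    labels = kmeans.fit_predict(doc_topic)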

ps: I can't post more than 2 links

Ash

Complementary to the previous answer: rather than running k-means directly on the compositional data derived from the LDA topic-document distribution, it is better to first apply a compositional data transformation, such as the isometric log-ratio (ilr) or centred log-ratio (clr) transform, to project the data into Euclidean space.
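
For example, a rough sketch of a clr transform applied to the document-topic matrix from the previous answer, followed by k-means (the small pseudocount is only there to avoid log(0) for topics with zero probability):

    import numpy as np
    from sklearn.cluster import KMeans

    eps = 1e-9
    X = doc_topic + eps                       # doc_topic: documents x topics matrix
    X = X / X.sum(axis=1, keepdims=True)      # re-normalise rows to the simplex
    clr = np.log(X) - np.log(X).mean(axis=1, keepdims=True)  # centred log-ratio

    labels = KMeans(n_clusters=10, random_state=0).fit_predict(clr)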


Stephen Rauch

Another approach would be to use the document-topic matrix that you obtain by training the LDA model, extract the topic with the maximum probability for each document, and let that topic be the document's label.

This will give a result that is somewhat interpretable to the degree your topics are.
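
A minimal sketch, assuming doc_topic is the document-topic matrix described above:

    import numpy as np

    # label each document with its single most probable topic
    labels = np.argmax(doc_topic, axis=1)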