I am working on Twitter data analysis. I want to find trending topics in tweets that carry certain hashtags, such as #finance or #technology. I have a huge data set of tweets, and now I need to analyze them.

I need to recognize topics, if there are any. The way I'm approaching this is to first build a vector representation of each tweet using TF-IDF, and then group the tweets based on their cosine similarity.
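A minimal sketch of the approach described above, using only the Python standard library. The sample tweets, the smoothed IDF formula, the greedy grouping rule, and the `threshold` value are all assumptions made for illustration; a real pipeline would more likely use a library vectorizer and a proper clustering algorithm.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_by_similarity(vectors, threshold=0.2):
    """Greedy grouping: a tweet joins the first group whose seed is similar enough."""
    groups = []  # each group is a list of tweet indices; the first is the seed
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Hypothetical sample data, tokenized by lowercasing and splitting on whitespace.
tweets = [
    "stocks rally as markets open higher #finance",
    "markets slide and stocks fall #finance",
    "new phone chip doubles battery life #technology",
    "phone makers race to improve battery life #technology",
]
docs = [tweet.lower().split() for tweet in tweets]
groups = group_by_similarity(tfidf_vectors(docs))
print(groups)
```

The groups are lists of tweet indices. Note that this only says which tweets resemble each other; it does not label the groups with a topic.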

Are there common techniques for tweet analysis?

Federico Caccia

2 Answers

I believe the algorithm you want is a Latent Dirichlet Allocation (LDA) model. This model is designed to uncover the topics in a corpus of documents.

scikit-learn has an implementation.

They even have a tutorial that shows how to extract topics. The tutorial also describes Non-negative Matrix Factorization (NMF) as a method for extracting topics. I can't vouch for that algorithm because I haven't used it personally (unlike LDA, which I have used before), but judging from the tutorial, NMF does seem to give reasonable results.
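As a rough illustration of topic extraction with scikit-learn's LDA, here is a minimal sketch. The sample tweets, the choice of two topics, and the number of top words shown are assumptions for the example, not values taken from the tutorial.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data; a real run would use the full tweet collection.
tweets = [
    "stocks rally as markets open higher #finance",
    "markets slide and stocks fall after bank earnings #finance",
    "investors watch bank stocks and bond markets #finance",
    "new phone chip doubles battery life #technology",
    "phone makers race to improve battery life #technology",
    "chip startup unveils faster phone processor #technology",
]

# LDA works on raw term counts, so use a count vectorizer rather than TF-IDF.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(tweets)

n_topics = 2  # assumed; in practice this needs tuning
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per tweet

# Map column indices back to terms, then list the heaviest terms per topic.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
top_words = [
    [terms[i] for i in topic.argsort()[::-1][:5]]
    for topic in lda.components_
]
for k, words in enumerate(top_words):
    print(f"topic {k}: {' '.join(words)}")
```

Trying NMF instead should mostly be a matter of swapping in `sklearn.decomposition.NMF` and feeding it TF-IDF weights rather than raw counts.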

Cosine similarity will help you group the tweets that are most similar to each other, but it won't give you their topics. That may be what you want; it really is hard to say, because only you know how the system should behave. Unfortunately, none of this tells you directly what is trending, and you will need to do some heavy post-processing to make whichever algorithm you use produce something useful to you.

Good luck!

Ryan

As @Ryan mentioned, LDA is one way to go, but I am not sure it will give robust results on documents that are fundamentally limited to 140 characters. I tried it in the past on summaries of news articles and got mixed results. One alternative might be to test how well a supervised model such as an SVM or KNN performs when the hashtags are used as the classes.
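A minimal sketch of that supervised idea with scikit-learn, assuming each tweet carries exactly one hashtag that serves as its label. The sample tweets and the TF-IDF plus linear SVM pipeline are illustrative choices, not something I have evaluated on real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data: (tweet text with the hashtag removed, hashtag).
# The hashtag is stripped from the text so the model cannot simply read the label.
labeled = [
    ("stocks rally as markets open higher", "#finance"),
    ("markets slide and stocks fall after bank earnings", "#finance"),
    ("investors watch bank stocks and bond markets", "#finance"),
    ("new phone chip doubles battery life", "#technology"),
    ("phone makers race to improve battery life", "#technology"),
    ("chip startup unveils faster phone processor", "#technology"),
]
texts = [text for text, _ in labeled]
labels = [tag for _, tag in labeled]

# TF-IDF features feeding a linear SVM, with hashtags as the classes.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

predicted = model.predict(["chip makers unveil faster phone"])
print(predicted[0])
```

Swapping `LinearSVC` for `sklearn.neighbors.KNeighborsClassifier` would give the KNN variant. Either way, a held-out test set is needed before trusting the results.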

As an aside, if you are committed to LDA, check out the gensim and pyLDAvis packages in Python.

Dan Temkin