Algorithm to find the probability of a given text to be about a large topic

Question

I want the conditional probability that a text is about a given topic, where the topic is a word supplied as input. For example, take the following text:


I have seen and reviewed your requirements you posted here. If you can give me the fix criteria/category of your data mining then I can do this job. If you want me to define and allot criteria and categorize it in then charges will be extra for per categorization included.

Suppose I give the word "research" as input. I want to know:

What is the likelihood/probability that the text relates to research?

Which algorithms should we use to compute this?

Bitwise · Answer 1 · 2013-03-10T19:38:38.950

You can try simple probabilistic graphical models, the simplest one being Naive Bayes.

One way to do this is to represent each piece of text as a word frequency vector that is associated with a topic (the "class variable"). You then train the model on many such texts with known topics, i.e. you model the probability of a frequency vector given a certain topic. Finally, given a new text, you can ask for the probability of each topic, and in particular for the most likely topic assignment.

Naive Bayes, the simplest graphical model, would miss dependencies between the frequencies of the various words, but it is worth a shot as it is easy to implement. More complicated models could be used to capture these dependencies.
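As a rough illustration of the steps above, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. The two topics ("research" and "other"), the tiny training set, and the function names tokenize, train and topic_probabilities are all made up for this example; a real model would need many labeled texts per topic. Add-one smoothing is an assumption on my part, not something the approach requires.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_texts):
    """Count documents per topic and word occurrences per topic."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, topic in labeled_texts:
        doc_counts[topic] += 1
        for word in tokenize(text):
            word_counts[topic][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def topic_probabilities(text, model):
    """Return P(topic | text) for every topic seen in training."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    # Words never seen in training carry no information here, so skip them.
    words = [w for w in tokenize(text) if w in vocab]
    log_scores = {}
    for topic, n_docs in doc_counts.items():
        total_words = sum(word_counts[topic].values())
        score = math.log(n_docs / total_docs)  # log prior P(topic)
        for word in words:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            smoothed = (word_counts[topic][word] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        log_scores[topic] = score
    # Normalize in log space so the posteriors sum to 1.
    peak = max(log_scores.values())
    norm = sum(math.exp(s - peak) for s in log_scores.values())
    return {t: math.exp(s - peak) / norm for t, s in log_scores.items()}


# Hypothetical toy training data: (text, topic) pairs.
TRAINING = [
    ("we study the data and publish the analysis in a paper", "research"),
    ("the experiment tests a hypothesis about data mining", "research"),
    ("buy cheap tickets for the football game tonight", "other"),
    ("the recipe needs two eggs and some flour", "other"),
]

model = train(TRAINING)
print(topic_probabilities("data mining requirements and analysis criteria", model))
```

The returned dictionary maps each topic to a posterior probability, so the value under "research" is one possible answer to "how likely is it that this text relates to research?". It is only as meaningful as the labeled training texts behind it.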

score 1 · Answer 2 · edited Mar 17 '13 at 19:56

Try Latent Semantic Analysis (or the closely related Latent Semantic Indexing), which reduces documents to vectors and typically finds principal components via singular value decomposition (SVD). These components roughly correspond to intrinsic or latent "subjects". SVD-based factorization had major success in the Netflix data mining contest. Similarity is then computed using vector algebra, e.g. cosine similarity or a similar metric.
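To make this concrete, here is a small sketch of the idea using NumPy: build a term-document count matrix, take a truncated SVD, project both the topic word and the text onto the top latent axes, and compare them with cosine similarity. The four-document corpus, the choice of k=2, and the helper names (lsa_fit, embed, cosine) are assumptions for illustration only. Note that the result is a similarity score, not a probability.

```python
import re

import numpy as np


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lsa_fit(docs, k=2):
    """Build a term-document count matrix and keep the top-k left singular vectors."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in tokenize(doc):
            counts[index[word], j] += 1
    # Thin SVD: counts = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    return U[:, :k], index


def embed(text, U_k, index):
    """Project a text's word-count vector onto the k latent axes."""
    q = np.zeros(len(index))
    for word in tokenize(text):
        if word in index:  # words outside the corpus vocabulary are ignored
            q[index[word]] += 1
    return U_k.T @ q


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Hypothetical toy corpus; a real one would be far larger.
CORPUS = [
    "research paper on data mining methods",
    "data mining research and analysis of data",
    "football game tickets for tonight",
    "cheap tickets for the football game",
]

U_k, index = lsa_fit(CORPUS, k=2)
topic = embed("research", U_k, index)
print(cosine(topic, embed("data mining analysis", U_k, index)))
print(cosine(topic, embed("football tickets", U_k, index)))
```

In this toy corpus the first query never contains the word "research", yet it lands close to it in the latent space because "data" and "mining" co-occur with "research" in the corpus. That is the main advantage over plain keyword matching.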
