4

Is there a relatively simple way of telling if two pieces of text are semantically similar?

Some assumptions that are valid:

  • It is all English
  • I have a list of all the important nouns

Are there any strategies that I should pursue? I am looking for something relatively computationally cheap, though an approach that could be scaled up to improve accuracy at the expense of more computation would be a bonus.

Note:

Assume that there are not enough posts for some type of probabilistic analysis, but some type of NN might be feasible (I think so, but I don't know enough about it).

Juho
soandos

1 Answer

4

Here's a simple technique.

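Train an LDA (latent Dirichlet allocation) topic model over your collection of texts, using something like MALLET. For each pair of documents you want to compare, obtain their topic distributions and compute the Hellinger distance between them. A smaller distance means the two documents are more similar in topic.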
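
As a rough illustration of the comparison step, here is a minimal Python sketch of the Hellinger distance between two topic distributions. It assumes you have already extracted a per-document topic distribution from your LDA tool; the function name `hellinger` and the example vectors `doc_a` and `doc_b` are made up for illustration and are not MALLET output.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    p and q are sequences of non-negative numbers that each sum to 1
    (for example, per-document topic proportions). The result lies in
    [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


# Hypothetical topic proportions for two documents over three topics.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]

print(hellinger(doc_a, doc_a))  # identical documents give 0.0
print(hellinger(doc_a, doc_b))  # larger value means less similar
```

Because the distance is bounded, `1 - hellinger(p, q)` can serve as a similarity score if that is more convenient.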

Things you can tweak include term weighting, the LDA hyperparameters, and the metric for comparing distributions. Term weighting would remove both the need for a list of important words and the restriction to English.
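
To illustrate swapping the comparison metric, here is a small sketch of one common alternative, the Jensen-Shannon divergence. This is my choice of example rather than something the technique requires, and the function names `kl_divergence` and `js_divergence` are hypothetical helpers. It assumes each input is a list of topic proportions summing to 1.

```python
import math


def kl_divergence(p, m):
    """Kullback-Leibler divergence KL(p || m) in bits.

    Terms where p is zero contribute nothing, by the usual convention.
    """
    return sum(a * math.log2(a / b) for a, b in zip(p, m) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    Symmetric, and bounded in [0, 1] when using base-2 logarithms.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # Mixture distribution; it is positive wherever p or q is positive,
    # so the KL terms below never divide by zero.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic proportions for two documents over three topics.
print(js_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))
```

Which metric ranks document pairs best is an empirical question, so it is worth checking a few against pairs you already know to be similar or dissimilar.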

jogloran