
I am looking to design a system that given a paragraph of text will be able to categorize it and identify the context:

  1. Is trained with user generated text paragraphs (like comments/questions/answers)
  2. Each item in the training set will be tagged with a category, e.g. ("category 1", "text paragraph")
  3. There will be hundreds of categories

What would be the best approach to build such a system? I have been looking at a few different options and the following is a list of possible solutions. Is Word2Vec/NN the best solution at the moment?

  1. Recursive Neural Tensor Network fed with averaged Word2Vec data
  2. RNTN with the Paragraph Vector (https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
  3. TF-IDF used in a Deep Belief Network
  4. TF-IDF and Logistic Regression
  5. Bag of words and Naive Bayes classification

1 Answer


1) Max-Entropy (logistic regression) on TF-IDF vectors is a good starting point for many NLP classification tasks.
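A rough sketch of that baseline with scikit-learn (the variable names, sample data, and hyperparameters here are illustrative, not prescriptive):

```python
# Minimal TF-IDF + logistic regression baseline, assuming scikit-learn is installed.
# train_texts / train_labels are placeholders for your tagged paragraphs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["example paragraph one ...", "example paragraph two ..."]
train_labels = ["category 1", "category 2"]

# TF-IDF features (unigrams + bigrams) fed into a logistic regression classifier.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)

# Predict the category of an unseen paragraph.
print(clf.predict(["a new paragraph to categorize"]))
```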

2) Word2vec is definitely something worth trying and comparing to model 1. I would suggest using the Doc2Vec flavor for looking at sentences/paragraphs.

Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf

Gensim (Python) has a nice Doc2Vec implementation.
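A minimal sketch of how that could look (assuming Gensim 4.x; the toy corpus, tags, and hyperparameters are placeholders):

```python
# Train a paragraph-vector (Doc2Vec) model with Gensim on a toy corpus.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

raw_docs = [
    ("category 1", "example paragraph one ..."),
    ("category 2", "example paragraph two ..."),
]

# Each paragraph becomes a TaggedDocument with a unique tag.
corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[str(i)])
    for i, (_, text) in enumerate(raw_docs)
]

model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for an unseen paragraph; these vectors can then be fed to
# any downstream classifier (e.g. the logistic regression above).
vec = model.infer_vector(simple_preprocess("a new paragraph to categorize"))
```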
