
I am looking to design a system that given a paragraph of text will be able to categorize it and identify the context:

  1. Is trained with user generated text paragraphs (like comments/questions/answers)
  2. Each item in the training set will be tagged with a category, e.g. ("category 1", "text paragraph")
  3. There will be hundreds of categories

What would be the best approach to build such a system? I have been looking at a few different options and the following is a list of possible solutions. Is Word2Vec/NN the best solution at the moment?

  1. Recursive Neural Tensor Network fed with averaged Word2Vec data
  2. RNTN with the Paragraph Vector (https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
  3. TF-IDF used in a Deep Belief Network
  4. TF-IDF and Logistic Regression
  5. Bag of words and Naive Bayes classification

1 Answer


1) Max-Entropy (logistic regression) on TF-IDF vectors is a good starting point for many NLP classification tasks.
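A rough sketch of that baseline with scikit-learn (the variable names, sample data, and hyperparameters here are illustrative, not prescriptive):

```python
# Minimal TF-IDF + logistic regression baseline, assuming scikit-learn is installed.
# train_texts / train_labels are placeholders for your tagged paragraphs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["example paragraph one ...", "example paragraph two ..."]
train_labels = ["category 1", "category 2"]

# TF-IDF features (unigrams + bigrams) fed into a logistic regression classifier.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)

# Predict the category of an unseen paragraph.
print(clf.predict(["a new paragraph to categorize"]))
```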

2) Word2vec is definitely something worth trying and comparing to model 1. I would suggest using the Doc2Vec flavor for looking at sentences/paragraphs.

Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf

Gensim (Python) has a nice Doc2Vec implementation.
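A minimal sketch of how that could look (assuming Gensim 4.x; the toy corpus, tags, and hyperparameters are placeholders):

```python
# Train a paragraph-vector (Doc2Vec) model with Gensim on a toy corpus.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

raw_docs = [
    ("category 1", "example paragraph one ..."),
    ("category 2", "example paragraph two ..."),
]

# Each paragraph becomes a TaggedDocument with a unique tag.
corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[str(i)])
    for i, (_, text) in enumerate(raw_docs)
]

model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for an unseen paragraph; these vectors can then be fed to
# any downstream classifier (e.g. the logistic regression above).
vec = model.infer_vector(simple_preprocess("a new paragraph to categorize"))
```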
