TFIDF for very short sentences

Question

I'm trying to build a regression model in which one of the features contains text data. I was thinking of using scikit-learn's sklearn.feature_extraction.text.TfidfVectorizer. The issue, however, is that the actual strings contain very few words: 1.8 on average. Here's a sample:

 print(df.keyword)

0 fre lifeproof
1 car stereo
2 analog clock
3 refrigerator

So my questions are:

Is TfidfVectorizer suited to such a case, or will the resulting sparse matrix be of little benefit to the model?
Is there a more suitable approach for such a case?

score 1 · Answer 1 · answered Sep 07 '19 at 00:00

It depends on what it's going to be used for, but in general it can make sense to use TF-IDF with short sentences. The main difference from the more standard case of longer texts is that TF (term frequency) plays almost no role, since the frequency of a word within a sentence will almost always be 1. IDF can still be useful though, assuming it's relevant to assign more weight to rare words than to frequent words.
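As a rough sketch of this point, the snippet below fits scikit-learn's vectorizers (default settings, assuming scikit-learn is installed) on the four sample keywords from the question. One extra keyword, "car charger", is a hypothetical addition that is not in the original data; it is there only so that one word occurs in more than one string.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample keywords from the question, plus "car charger", which is a
# hypothetical addition so that "car" appears in two strings.
keywords = ["fre lifeproof", "car stereo", "analog clock", "refrigerator", "car charger"]

# Raw term counts: no word repeats within such a short string,
# so every non-zero count is 1 and TF carries no information.
counts = CountVectorizer().fit_transform(keywords)
max_count = counts.max()

# IDF still differs between words: the rarer the word, the larger its weight.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(keywords)
idf = {term: tfidf.idf_[index] for term, index in tfidf.vocabulary_.items()}

print("largest term count:", max_count)
print("idf('car')    =", round(idf["car"], 3))
print("idf('stereo') =", round(idf["stereo"], 3))
print("matrix shape:", X.shape)
```

Here "car" gets a lower IDF than "stereo" because it occurs in two of the five strings rather than one. Whether that weighting helps the regression is something only validation on the real data can tell.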

However, the problem is that similarity measures such as cosine similarity are often going to be zero, since there is little chance that two sentences have words in common.
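To illustrate, here is a small plain-Python sketch that computes cosine similarity between every pair of keywords. It uses unweighted bag-of-words vectors for simplicity (an assumption; IDF weighting would change the non-zero values, but a pair with no shared word scores 0 either way). "car charger" is again a hypothetical keyword added so that at least one pair overlaps.

```python
import math
from collections import Counter
from itertools import combinations

# Sample keywords from the question, plus the hypothetical "car charger".
keywords = ["fre lifeproof", "car stereo", "analog clock", "refrigerator", "car charger"]

def cosine(a: str, b: str) -> float:
    """Cosine similarity between unweighted bag-of-words vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

pairs = list(combinations(keywords, 2))
similarities = {pair: cosine(*pair) for pair in pairs}
zero_pairs = [pair for pair, sim in similarities.items() if sim == 0.0]

print(f"{len(zero_pairs)} of {len(pairs)} pairs have cosine similarity 0")
for pair, sim in similarities.items():
    if sim > 0:
        print(pair, round(sim, 3))
```

Only the two "car" keywords get a non-zero score; every other pair is orthogonal.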

score 1 · Answer 2 · edited Jun 07 '20 at 07:15

As I see it, TF-IDF is a very poor fit here: it behaves more or less like a categorical feature, and it does not take word order into account (approximately 80% of cases have 2 or more words).

One possible approach is to go from word embeddings to sentence embeddings (Word2Vec => Doc2Vec) and embed every sentence. A good implementation is in the gensim library: https://radimrehurek.com/gensim/models/doc2vec.html

Doc2Vec will provide an n-feature vector (n being the dimension of the embedding) that can be used in your regression model.

Another, more unusual approach is to see what happens if you feed every sentence as a sequence of characters into an LSTM layer, which will also deliver an n-feature result (n being the dimension of the LSTM layer).

score 0 · Answer 3 · answered Sep 06 '19 at 09:48

Do the strings represent disjoint categories? If so, you may want to use them as a categorical feature for your model. If there are too many separate categories, you may want to group them into higher-level categories, e.g. refrigerator maps to kitchen products.
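A small plain-Python sketch of that grouping, followed by one-hot encoding, is shown below. The mapping is hypothetical: only the refrigerator entry comes from this answer, and the other category names are made up for illustration.

```python
# Sample keywords from the question.
keywords = ["fre lifeproof", "car stereo", "analog clock", "refrigerator"]

# Hypothetical keyword -> category mapping; only the refrigerator entry
# comes from the answer, the rest are invented for illustration.
CATEGORY_OF = {
    "refrigerator": "kitchen products",
    "car stereo": "car electronics",
    "analog clock": "home decor",
    "fre lifeproof": "phone accessories",
}

# Keywords missing from the mapping fall back to a catch-all category.
categories = [CATEGORY_OF.get(k, "other") for k in keywords]
levels = sorted(set(categories))

# One-hot encode: one column per category, in the order given by `levels`.
one_hot = [[int(c == level) for level in levels] for c in categories]

for keyword, row in zip(keywords, one_hot):
    print(f"{keyword:15s} {row}")
```

With many keywords per category, the one-hot matrix stays narrow no matter how many distinct strings there are, which is the point of grouping.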
