I'm trying to build a regression model in which one of the features contains text data. I was thinking of using scikit-learn's sklearn.feature_extraction.text.TfidfVectorizer. The issue, however, is that the actual strings contain very few words: exactly 1.8 on average. Here's a sample:
print(df.keyword)
0 fre lifeproof
1 car stereo
2 analog clock
3 refrigerator
So my questions are:
- Is the TfidfVectorizer also suited for such a case? Or will the resulting sparse matrix not benefit the model?
- Is there a better-suited approach for such a case?