
I have 40,000 rows of text data from the health care domain. The data has one column for the text (2-5 sentences) and one column for its category. I want to classify the text into 300 categories. Some categories are independent, while others are somewhat related. The distribution of data among categories is not uniform either, i.e. some of the categories (around 40 of them) have very little data, only about 2-3 rows each.

I am attaching the log probability of each class/category (i.e. the distribution of classes) here: class prior log probabilities (log class distribution of the data).

Alok Nayak

1 Answer


In general, a decent starting point for problems like these is Naive Bayes (NB) classification using a simple bag of words model. Here are some slides describing NB as applied to natural language processing. There's nothing especially fancy about this approach, but it's pretty easy to implement and will give you a starting point to expand from.
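To make that concrete, here is a minimal sketch of multinomial NB over a bag of words, written in plain Python so nothing is hidden. The toy texts, the two category names, the whitespace tokenizer, and the helper names `train_nb` / `predict_nb` are all my own assumptions for illustration; with real data you would more likely reach for a library implementation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_nb(texts, labels, alpha=1.0):
    """Collect class priors and per-class word counts for multinomial NB."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n = len(labels)
    log_prior = {c: math.log(k / n) for c, k in class_counts.items()}
    return {"log_prior": log_prior, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict_nb(model, text):
    """Return the label with the highest log posterior (up to a constant)."""
    # Words never seen in training are ignored.
    tokens = [t for t in tokenize(text) if t in model["vocab"]]
    v = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        # Laplace (add-alpha) smoothing keeps unseen word/class pairs non-zero.
        score = log_prior + sum(
            math.log((counts[t] + alpha) / (total + alpha * v)) for t in tokens
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, only to show the shape of the inputs.
texts = [
    "patient reports chest pain and shortness of breath",
    "chest pain radiating to left arm",
    "itchy red rash on both arms",
    "skin rash with mild itching",
]
labels = ["cardiology", "cardiology", "dermatology", "dermatology"]

model = train_nb(texts, labels)
print(predict_nb(model, "severe chest pain"))
print(predict_nb(model, "rash on arms"))
```

Note that the class prior term matters a lot with a skewed distribution like yours: categories with only 2-3 rows start with a very low prior, so it is worth checking how often they are ever predicted.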

Once you have some initial results from a model that assumes independence among your features and among your output labels, you'll probably have a better sense of where it is weak. From there you can apply some feature engineering (TF-IDF weighting, for example) as well as some post-processing to deal with samples that get assigned to related categories.
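As a rough illustration of the TF-IDF idea, the sketch below reweights raw word counts so that words appearing in almost every document count for less. The example documents and the `tfidf` helper are hypothetical, and the smoothed IDF formula used here is just one common variant, not the only definition.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return per-document TF-IDF weights and the IDF table."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Smoothed IDF (one common variant): log((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + k)) + 1.0 for t, k in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append({t: c * idf[t] for t, c in tf.items()})
    return vectors, idf


# Hypothetical toy documents.
docs = [
    "patient has chest pain",
    "patient has skin rash",
    "patient reports chest tightness",
]
vectors, idf = tfidf(docs)

# "patient" occurs in every document, so it gets the lowest weight.
for word in ("patient", "chest", "rash"):
    print(word, round(idf[word], 3))
```

The resulting weights can be used in place of raw counts as features for the classifier; whether that actually helps on your data is something to verify on a held-out set.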

Ryan J. Smith