
I have a dataset with 4 categorical features (cholesterol, systolic blood pressure, diastolic blood pressure, and smoking rate). I use a decision tree classifier to find the probability of stroke, and I am trying to verify my understanding of the splitting procedure done by Python's sklearn. Since it is a binary tree, there are three possible ways to split the first feature: group categories {0, 1} into one child and {2} into the other, or {0, 2} and {1}, or {0} and {1, 2}. What I know (please correct me here) is that the chosen split is the one with the highest information gain.

I have calculated the information gain for each of the three grouping scenarios:

{0 + 1 , 2} --> 0.17

{0 + 2 , 1} --> 0.18

{1 + 2 , 0} --> 0.004

However, sklearn's decision tree chose the first scenario instead of the third (please check the picture).

Can anyone please help clarify the reason for selecting the first scenario? Is there a priority for splits that result in pure nodes, so that such a scenario is selected even though it has less information gain?

I have added the frequency of each class/feature/category so it would be easy to calculate the Gini index (see the attached image).
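For reference, here is a small sketch of how the information gain for one grouping can be computed. The class counts below are placeholders, not the real frequencies from the attached table:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy of a class-count vector."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def information_gain(parent_counts, left_counts, right_counts):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = sum(parent_counts)
    n_left, n_right = sum(left_counts), sum(right_counts)
    child = (n_left / n) * entropy(left_counts) + (n_right / n) * entropy(right_counts)
    return entropy(parent_counts) - child

# Hypothetical stroke / no-stroke counts -- replace with the frequencies
# from the table in the image.
parent = [60, 40]          # all samples at the node
left   = [50, 20]          # e.g. categories {0, 1}
right  = [10, 20]          # e.g. category {2}
print(information_gain(parent, left, right))
```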

GYSHIDO

2 Answers


sklearn doesn't know that your feature is categorical; it's treating it as continuous, for which only splits of the form $x \leq \alpha$ are checked, so your second listed split candidate isn't actually a candidate.

In general, sklearn doesn't support categorical variables (yet?), and you'll need to encode it differently (one-hot?) if you want different behavior.
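As a sketch of that workaround (column names and pipeline details are placeholders, not from the original post), one-hot encoding the categorical columns before the tree means every candidate split separates exactly one category from the rest:

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Placeholder column names -- adjust to match your actual dataset.
categorical_cols = ["cholesterol", "systolic_bp", "diastolic_bp", "smoking"]

preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)]
)

# Each category becomes its own 0/1 column, so the tree's threshold splits
# act on individual categories rather than on the raw integer codes.
model = make_pipeline(preprocess, DecisionTreeClassifier(criterion="entropy"))
# model.fit(X, y)  # X is a DataFrame containing the columns above
```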

See also:

- https://datascience.stackexchange.com/a/52103/55122
- How to make a decision tree when I have both continuous and categorical variables in my dataset?

Ben Reiniger

Do you mean the first scenario instead of the second, not the third?

In any case, a possible explanation:

What are your parameters for the decision tree? For example, for different values of min_samples_split you can expect different Gini (or information gain) values. You very likely calculated the information gain over all of the samples (rows) of your dataset, but that is not necessarily how the decision tree calculates it, especially once you set that parameter.
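A minimal sketch of this point, using a synthetic dataset as a stand-in for the real one, shows how the grown tree (and hence which split "wins") can change with min_samples_split:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny synthetic stand-in for the real encoded features and stroke labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 4))
y = rng.integers(0, 2, size=200)

# The same data can yield different splits and impurity values once
# min_samples_split or other stopping parameters change.
for min_split in (2, 50, 100):
    tree = DecisionTreeClassifier(criterion="entropy",
                                  min_samples_split=min_split,
                                  random_state=0)
    tree.fit(X, y)
    print(f"min_samples_split={min_split}")
    print(export_text(tree, max_depth=2))
```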

Noah Weber