
A decision tree, while performing recursive binary splitting, selects an independent variable (say $X_j$) and a threshold (say $t$) such that the predictor space is split into the regions $\{X \mid X_j < t\}$ and $\{X \mid X_j \ge t\}$, choosing the pair that leads to the greatest reduction in the cost function.

Now suppose that one of the variables in $X$ is categorical, and that we have label-encoded it so that its values are integers in the range 0 to 9 (10 categories).

  1. If the DT splits a node with the above algorithm and treats those 10 values as true numeric values, will it not lead to wrong/misinterpreted splits?
  2. Should it rather perform the split based on == and != for this variable? But then, how will the algorithm know that it is a categorical feature?
  3. Also, will one-hot encoded values make more sense in this case?
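
For concreteness, here is a small sketch of the situation I mean, using scikit-learn (whose `DecisionTreeClassifier`, as far as I know, treats every feature as numeric):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# One categorical feature, label-encoded as the integers 0..9.
X = rng.integers(0, 10, size=(200, 1))
# The target depends on category membership, not on the encoded order.
y = np.isin(X[:, 0], [2, 5, 7]).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# The printed rules have the form "category <= 1.50", i.e. ordinal
# comparisons on codes whose order is arbitrary.
print(export_text(tree, feature_names=["category"]))
```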
Supratim Haldar

3 Answers


You are right on all counts:

  1. If the DT splits a node with the above algorithm and treats those 10 values as true numeric values, will it not lead to wrong/misinterpreted splits?

Yes, absolutely, for exactly the reason you mention below:

  2. Should it rather perform the split based on == and != for this variable? But then, how will the algorithm know that it is a categorical feature?

Yes: as you correctly assume, a (true) categorical variable should be compared only for equality, not order.

In general the algorithm cannot guess the nature of a feature, so there has to be a parameter in the implementation which provides it with this information. Some implementations allow this: for example, in Weka every feature is typed as either "numeric" or "nominal" (categorical).
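
For instance (a rough sketch, not Weka itself; I am assuming LightGBM's scikit-learn API here, which exposes such a parameter), you can declare which columns are categorical so the tree uses category-subset splits rather than `<= t` splits:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 1))
y = np.isin(X[:, 0], [2, 5, 7]).astype(int)

# Declaring column 0 as categorical tells the library to use
# equality/subset splits instead of ordinal "<= t" splits on it.
model = lgb.LGBMClassifier(n_estimators=10)
model.fit(X, y, categorical_feature=[0])
```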

  3. Also, will one-hot encoded values make more sense in this case?

Correct again: that is what should be done for a categorical feature when the implementation treats all features as numeric values.
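
For example, a minimal sketch with scikit-learn, whose trees treat all inputs as numeric (`sparse_output` assumes scikit-learn >= 1.2; older versions use `sparse=False`):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 1))
y = np.isin(X[:, 0], [2, 5, 7]).astype(int)

# Each category becomes its own 0/1 column, so any split on these
# columns is effectively an == / != test on a single category.
X_ohe = OneHotEncoder(sparse_output=False).fit_transform(X)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_ohe, y)
```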

Erwan
  1. Yes, it will add a certain bias, because we are imposing an ordering that is not intrinsic to the categories.

  2. Not really. The natural way to deal with a categorical feature that has $L$ classes would be to explore ALL possible partitions: that means $2^{L-1}-1$ of them!

  3. Only partially. OHE makes sense in theory but does not work well for high-cardinality features. In general, for regression and binary classification problems, the optimal solution is target encoding, as shown by Breiman in his original book on Classification and Regression Trees (1984). Indeed, he proves that by ordering the categories by mean response value (or probability), one can find the optimal split among the $2^{L-1}-1$ possible ones by evaluating only the $L-1$ splits of the so-ordered categories (see the sketch below).

One-hot encoding is not that good for trees, as it forces them to make many sparse splits, each of which can only separate a single category from the rest, and it is particularly detrimental in the case of high cardinalities. In that sense, binary encoding or even a numerical encoding might help achieve better separations with less depth, even though they do insert a bias towards certain types of splits.
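
Here is a rough sketch of the ordering trick for a binary/regression target (the function name and the variance-based impurity are my own choices for illustration, not from Breiman's book):

```python
import numpy as np

def best_categorical_split(x, y):
    """Best binary split of categorical x for a regression/binary target y,
    checking only L-1 candidates thanks to Breiman's ordering result.
    Impurity: weighted sum of within-group variances."""
    order = sorted(np.unique(x), key=lambda c: y[x == c].mean())
    best, best_score = None, np.inf
    for i in range(1, len(order)):            # only L-1 candidate splits
        left = np.isin(x, order[:i])
        score = (y[left].var() * left.sum()
                 + y[~left].var() * (~left).sum())
        if score < best_score:
            best, best_score = {int(c) for c in order[:i]}, score
    return best, best_score

rng = np.random.default_rng(0)
x = rng.integers(0, 10, size=500)
y = np.isin(x, [2, 5, 7]).astype(float) + rng.normal(0, 0.1, size=500)
# Prints one side of the optimal split; its complement is {2, 5, 7}.
print(best_categorical_split(x, y))
```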

Davide ND

A decision tree has to discretize continuous variables into ranges anyway, and there are different ways to find the best splits for numeric variables. In a 0-9 range, the values still have meaning and will need to be split anyway, just like a regular continuous variable. If you considered each value as a separate category, you would basically just be splitting at every possible point.
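
To illustrate, here is a small sketch (assuming a CART-style implementation such as scikit-learn's, which takes its candidate thresholds at the midpoints between consecutive observed values):

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # label-encoded values
# Midpoints between consecutive unique values: the candidate thresholds
# a CART-style tree evaluates for this "numeric" feature.
vals = np.unique(x)
thresholds = (vals[:-1] + vals[1:]) / 2
print(thresholds)  # [0.5 1.5 ... 8.5] -> a candidate split between every code
```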

fractalnature