
I am having a look at this material and I have found the following statement:

For this class of models [Gradient Boosting Machine algorithms] [...] it is both safe and significantly more computationally efficient to use an arbitrary integer encoding [also known as Numeric Encoding] for the categorical variable even if the ordering is arbitrary [instead of One-Hot encoding].

Do you know of any references that support this statement? I get that Numeric Encoding is more computationally efficient than One-Hot Encoding, but I would like to know more about their supposed equivalence for encoding unordered categorical variables in Gradient Boosting Methods.

Thanks!

carlo_sguera

1 Answer


This is actually a feature of tree-based models in general, not just gradient boosting trees.

Not exactly a reference, but this Medium article explains why ordinal encoding is often more computationally efficient than one-hot encoding for tree-based models.

On the topic of safety, I think the author should have said that ordinal encoding is safer for tree-based models than it is for linear models, but still not perfectly safe. Decision-tree methods can still find spurious rules in an arbitrary ordinal encoding, but they don't bake in the strong assumptions about numeric semantics (ordering and spacing of the codes) that linear methods do.
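
To make that contrast concrete, here is a minimal sketch (assuming scikit-learn is available; the toy data and the code assignment are made up for illustration) showing how a linear model is forced to treat arbitrary integer codes as magnitudes, while a tree is not:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical arbitrary ordinal encoding: spam = 1, ham = 2, eggs = 3.
# Only the "middle" category (ham) carries any signal.
foo = np.array([[1], [1], [2], [2], [3], [3]])
y = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])

linear = LinearRegression().fit(foo, y)
tree = DecisionTreeRegressor().fit(foo, y)

# The linear fit is a flat line (~0.33 everywhere): the arbitrary
# ordering hides the ham effect entirely.
print(linear.predict([[1], [2], [3]]))

# The tree recovers [0, 1, 0] exactly, using threshold splits.
print(tree.predict([[1], [2], [3]]))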

. . . I would like to know more about their supposed equivalence for encoding unordered categorical variables . . .

Any rule derived with one-hot encoding can also be represented with an ordinal encoding; it just might take more splits.

To illustrate, suppose you have a categorical variable foo with possible values spam, ham, eggs. A one-hot encoding would create three dummy variables: is_spam, is_ham, is_eggs. Let's say an arbitrary ordinal encoding assigns spam = 1, ham = 2, and eggs = 3.
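
As a sketch (pandas assumed; the integer mapping is just the arbitrary assignment above), the two encodings look like this:

import pandas as pd

df = pd.DataFrame({"foo": ["spam", "ham", "eggs"]})

# One-hot: one dummy column per category (columns come out alphabetical).
one_hot = pd.get_dummies(df["foo"], prefix="is")

# Ordinal: the arbitrary integer assignment from the example above.
ordinal = df["foo"].map({"spam": 1, "ham": 2, "eggs": 3})

print(one_hot)   # is_eggs, is_ham, is_spam dummies
print(ordinal)   # 1, 2, 3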

Suppose the OHE decision tree splits on is_eggs = 1. This can be represented in the ordinal decision tree by the single split foo > 2. Now suppose the OHE tree splits on is_ham = 1. The ordinal tree requires two splits to express the same rule: foo > 1 followed by foo < 3.
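
You can check this with a quick sketch (again assuming scikit-learn, with a toy target that depends only on ham):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical arbitrary ordinal encoding: spam = 1, ham = 2, eggs = 3
foo = np.array([[1], [1], [2], [2], [3], [3]])
y = (foo.ravel() == 2).astype(int)  # 1 only for ham

tree = DecisionTreeClassifier().fit(foo, y)

# The printed tree isolates ham with two thresholds (around 1.5 and 2.5),
# mirroring the foo > 1 / foo < 3 rule described above.
print(export_text(tree, feature_names=["foo"]))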

zachdj