I would like to understand all the methods available in CatBoost for encoding categorical features.

Unfortunately, the articles published by Yandex ("CatBoost: gradient boosting with categorical features support" and "CatBoost: unbiased boosting with categorical features") do not go into much detail in this regard.

The official documentation and a tutorial on their GitHub provide some clues, but I still have a few questions:

  1. Do all methods require the conversion of labels to integers (stage 2)?
    • It is not clear to me whether FloatTargetMeanValue actually uses these discretised labels.
  2. For the methods Buckets and Borders, does one apply another quantisation on top of stage 2?
    • It seems to me that they could be applied directly to the targets in a regression task, without the need for stage 2.
  3. What is the difference between CtrBorderType and TargetBorderType? The documentation says the latter is the "quantization type for the label value", while the former applies to the categorical features themselves. When do you need to apply quantisation to categorical features?
    • EDIT: I seem to have found an answer to this. According to their GitHub (see the section "Default value of simple_ctr and combinations_ctr"), CtrBorderType can be used to discretise the encoded categorical values; the first snippet after this list shows how I pass these settings.
  4. How are categorical features encoded during inference?
    • I suspect the latest encoded occurrence of each category (in the shuffled training set) is used for each tree, but I cannot find anything to substantiate this hypothesis; my mental model is sketched in the second snippet after this list.
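
To make question 3 concrete, here is how I have been passing these settings through the Python package. The toy data and the particular border counts are arbitrary, and the comments describe my reading of which knob does what, which is exactly what I would like confirmed:

```python
from catboost import CatBoostRegressor, Pool

# Toy regression data with a single categorical column.
X = [["a"], ["b"], ["a"], ["c"], ["b"], ["a"], ["c"], ["b"]]
y = [1.0, 0.2, 0.9, 0.4, 0.1, 1.1, 0.5, 0.3]
pool = Pool(X, y, cat_features=[0])

# My reading of the two pairs of settings:
# - TargetBorderType / TargetBorderCount: how the continuous label is
#   discretised before the counters are accumulated (stage 2).
# - CtrBorderType / CtrBorderCount: how the resulting encoded (CTR)
#   values are quantised when the trees search for splits.
model = CatBoostRegressor(
    iterations=10,
    depth=2,
    simple_ctr=[
        "Borders:TargetBorderCount=4:TargetBorderType=Uniform"
        ":CtrBorderCount=15:CtrBorderType=Uniform"
    ],
    verbose=False,
)
model.fit(pool)
```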
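
And for question 4, here is my current mental model of one "ordered" pass of the Borders statistic in plain Python, using the formula (countInClass + prior) / (totalCount + 1) from the documentation. The prior value, the single fixed permutation, and the binary label comparison are my own simplifications, not something I have verified against the source:

```python
import random

def ordered_target_statistic(categories, targets, prior=0.5, seed=0):
    """My understanding of one 'ordered' pass: shuffle once, then encode
    each row using only the rows that precede it in the permutation,
    with the formula (countInClass + prior) / (totalCount + 1)."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    stats = {}                      # category -> (countInClass, totalCount)
    encoded = [0.0] * len(categories)
    for i in order:
        cat, y = categories[i], targets[i]
        in_class, total = stats.get(cat, (0, 0))
        encoded[i] = (in_class + prior) / (total + 1)
        # Update the counters *after* encoding, so row i never sees its
        # own label. The y > 0.5 test stands in for the label-vs-border
        # comparison of stage 2.
        stats[cat] = (in_class + (1 if y > 0.5 else 0), total + 1)
    return encoded

cats = ["a", "b", "a", "c", "b", "a"]
ys = [1, 0, 1, 0, 1, 0]
print(ordered_target_statistic(cats, ys))
```

Under this sketch, a new observation at inference time has no "preceding" rows, so presumably the final state of the counters (i.e. over the whole shuffled training set) would be used, which is what I mean by the "latest encoded occurrence" above.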