
I have a dataset with many ordered features, most of which have 3 levels (e.g., 0, 1, 2), and my outcome variable is censored. I’m debating whether to treat these ordinal features as numeric or categorical.

If I treat them as categorical, I’m considering options like one-hot encoding or target encoding. However, I’m unsure what factors to take into account to help me decide the best way to handle these features. Should I focus on preserving their ordinal nature, or does encoding them as categorical variables provide better flexibility for modeling? Any guidance would be greatly appreciated.

Seydou GORO

1 Answer


There are two types of categorical variables (as opposed to numerical variables):

  • Ordinal: the values have a natural order, like "low"/"medium"/"high". You can encode them with ordinal encoding (e.g., low/medium/high becomes 0/1/2).
  • Nominal: the values cannot be ordered, like "blue"/"red"/"yellow". You can encode them with one-hot encoding, binary encoding, etc.

In your case, it seems your ordinal variables have already been encoded (assuming the numbers follow the same logic as the first point above), so there is no need for one-hot encoding or similar schemes, which would discard the information that low < medium < high.
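To make the difference concrete, here is a minimal sketch in plain Python. The level names, the sample values, and the helper functions `ordinal_encode` and `one_hot_encode` are made up for illustration; in practice you would probably use your library's encoders instead.

```python
# Hypothetical levels, listed in increasing order (illustrative only).
LEVELS = ["low", "medium", "high"]


def ordinal_encode(values, levels=LEVELS):
    """Map each level to its rank, so low < medium < high becomes 0 < 1 < 2."""
    rank = {level: i for i, level in enumerate(levels)}
    return [rank[v] for v in values]


def one_hot_encode(values, levels=LEVELS):
    """Map each level to an indicator vector with one slot per level."""
    return [[int(v == level) for level in levels] for v in values]


# Made-up sample feature column.
feature = ["low", "high", "medium", "low"]

ordinal = ordinal_encode(feature)
one_hot = one_hot_encode(feature)

print(ordinal)  # [0, 2, 1, 0]
print(one_hot)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

The ordinal version stays a single column whose values can be compared (0 < 1 < 2), while the one-hot version spreads the same feature over three indicator columns that a model sees as unrelated to each other.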

rehaqds