
I have a categorical variable that measures the income of a family:

A: no income
B: Up to $500
C: $500-$700
…
P: $5000-$6000
Q: More than $6000

It seems odd to me that I have to create dummy variables for this feature, since it's ordered. I wonder if it's better to map the values, e.g. {'A': 0, 'B': 1, …, 'Q': 16}, so I can feed them to the algorithm as integers.

What's the proper way of preprocessing this variable to feed an algorithm such as Random Forest or a simple neural network?

marcus

2 Answers


It depends on the algorithm.

For Random Forest, encoding the categories as ordered integers can help, because the decision trees can then capture the feature's effect with fewer splits, but it isn't critical. Given enough data, the forest will work out the ordering on its own.
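As a minimal sketch of the integer mapping for a tree-based model, assuming the labels are the 17 consecutive letters 'A' through 'Q' in increasing order of income (the full list is elided in the question, so this is an assumption), and using hypothetical names `income_rank` and `encode_income`:

```python
import string

# Assumption: income bands are labelled with consecutive letters,
# 'A' (no income) through 'Q' (more than $6000), in increasing order.
categories = string.ascii_uppercase[:17]  # 'ABCDEFGHIJKLMNOPQ'

# Map each letter to its zero-based rank: 'A' -> 0, 'B' -> 1, ..., 'Q' -> 16.
income_rank = {letter: rank for rank, letter in enumerate(categories)}


def encode_income(values):
    """Replace each income label with its integer rank."""
    return [income_rank[v] for v in values]


# Hypothetical sample column, for illustration only.
sample = ["A", "C", "Q", "B"]
encoded = encode_income(sample)
print(encoded)  # [0, 2, 16, 1]
```

The resulting integer column can be passed to the model as a single numeric feature. A tree only needs thresholds such as "rank <= 5", so the uneven width of the income bands does not matter much here.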

For Neural Networks, you can add an embedding layer at the input to handle this automatically.
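To show what an embedding layer does with such a feature, here is a framework-free sketch. It only illustrates the lookup step: each integer-coded category selects one vector from a table. In a real network you would use a framework layer (for example `torch.nn.Embedding` or a Keras `Embedding` layer) so that the vectors are trained by backpropagation. The names `embedding_table` and `embed`, the 17 categories, and the dimension of 4 are all assumptions for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

NUM_CATEGORIES = 17  # assumption: labels 'A' through 'Q'
EMBEDDING_DIM = 4    # hypothetical size of each category vector

# An embedding layer is a lookup table holding one vector per category.
# Here the vectors are only randomly initialised; a framework would
# update them during training.
embedding_table = [
    [random.uniform(-0.1, 0.1) for _ in range(EMBEDDING_DIM)]
    for _ in range(NUM_CATEGORIES)
]


def embed(category_indices):
    """Return the embedding vector for each integer-coded category."""
    return [embedding_table[i] for i in category_indices]


batch = [0, 2, 16]  # integer codes for 'A', 'C', 'Q'
vectors = embed(batch)
print(len(vectors), len(vectors[0]))  # 3 4
```

Because the vectors are learned, the network is free to place neighbouring income bands close together or far apart, rather than being forced to treat the bands as equally spaced.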

Yair Beer

One way to do this is to use target encoding.

(There are plenty of resources for learning about target encoding.)

This way your categories are ordered not just by their label but by the target value, which is ultimately what you want in order to get better predictions.
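A minimal sketch of the simplest form of target encoding, where each category is replaced by the mean target value observed for it in the training data. The function names and the toy data are hypothetical, and the target is assumed to be numeric (for example a 0/1 label).

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Compute the mean target value observed for each category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


def apply_target_encoding(categories, encoding, default):
    """Replace each category with its mean target; unseen ones get `default`."""
    return [encoding.get(cat, default) for cat in categories]


# Hypothetical training data, for illustration only.
train_income = ["A", "A", "B", "B", "Q", "Q"]
train_target = [0, 1, 0, 0, 1, 1]

encoding = fit_target_encoding(train_income, train_target)
global_mean = sum(train_target) / len(train_target)

# 'C' never appears in the training data, so it falls back to the global mean.
encoded = apply_target_encoding(["A", "Q", "C"], encoding, default=global_mean)
print(encoded)  # [0.5, 1.0, 0.5]
```

The encoding should be fitted on the training split only, since computing it on the full data leaks the target into the features. With rare categories it is also common to smooth each category mean towards the global mean, which this sketch does not do.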

Carlos Mougan