8

How do we use one-hot encoding if the number of values a categorical variable can take is large?

In my case it is 56 values, so under the usual approach I would have to add 56 columns (56 binary features) to the training dataset, which would greatly increase the complexity and hence the training time.

So how do we deal with such cases?

mach

4 Answers

10

If you really care about the number of dimensions, you can still apply a dimensionality-reduction algorithm, such as PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis), after your one-hot encoding.

But know that 56 features isn't really large; in industry it is quite common to work with thousands, millions, or even billions of features.
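
A minimal sketch of this suggestion, assuming a single categorical column named "category" with 56 levels and an arbitrary choice of 10 output components (both are assumptions, not from the answer):

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    # Synthetic column with 56 distinct categorical values
    df = pd.DataFrame({"category": np.random.choice(
        [f"c{i}" for i in range(56)], size=1000)})

    # One-hot encode: 56 binary columns
    one_hot = pd.get_dummies(df["category"], dtype=float)

    # Project the 56 binary features down to 10 dimensions
    pca = PCA(n_components=10)
    reduced = pca.fit_transform(one_hot)

    print(one_hot.shape)  # (1000, 56)
    print(reduced.shape)  # (1000, 10)

Keep in mind that PCA on one-hot features trades away the sparsity and per-category interpretability of the dummies in exchange for fewer dimensions.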

Dawny33
jmvllt
2

You could try reducing the dimension of the resulting 56 dummy features: if some categories represent only a small proportion of the data compared to the majority, you can give them all the same label before encoding.
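
As a rough sketch of this idea (the 5% threshold and the toy data are assumptions, not from the answer), rare categories can be collapsed into a single "other" label before encoding:

    import pandas as pd

    s = pd.Series(["a"] * 500 + ["b"] * 450 + ["c"] * 30 + ["d"] * 15 + ["e"] * 5)

    # Categories below an assumed 5% frequency threshold are relabeled "other"
    freq = s.value_counts(normalize=True)
    rare = freq[freq < 0.05].index
    grouped = s.where(~s.isin(rare), "other")

    # Only 3 dummy columns remain instead of 5
    print(pd.get_dummies(grouped).columns.tolist())  # ['a', 'b', 'other']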

1

It depends on the problem you are working on. If the number of values a categorical variable can take is very large, it is better to use label encoding. But the label encoding should be meaningful, i.e. categories that are close to each other should get similar labels. Say you are creating a model that has a Month feature, and there is a periodicity in your target variable: every x months, say 3 months, the trends are similar. Then it does not make sense to use the labels 1, 2, ..., 12 for the months; instead, it is better to use labels like 0, 1, 2, 0, 1, 2, .... So Jan is 0, Feb is 1, Mar is 2, Apr is 0 again, and so on.

You can use LabelEncoder from sklearn.preprocessing for this problem, but it does not take care of the semantics described above. For that, you can do the label encoding manually.
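
A minimal sketch contrasting the two options, using the 3-month periodicity from the example above (the dictionary-based mapping is just one possible way to do the manual encoding):

    from sklearn.preprocessing import LabelEncoder

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

    # LabelEncoder assigns labels alphabetically, ignoring any semantics:
    # Apr=0, Aug=1, Dec=2, Feb=3, ...
    le = LabelEncoder()
    print(le.fit_transform(months))

    # Manual encoding that respects the assumed 3-month periodicity:
    # Jan=0, Feb=1, Mar=2, Apr=0, May=1, ...
    manual = {m: i % 3 for i, m in enumerate(months)}
    print([manual[m] for m in months])  # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]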

Ricky
-1

When a categorical variable can take a large number of values, it is advisable to use a one-versus-rest approach.
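
The answer does not elaborate, so the following is only a guess at what is meant: scikit-learn's OneVsRestClassifier, which fits one binary classifier per class and therefore scales to targets with many categories. The data here is synthetic:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    # Synthetic multi-class problem standing in for a high-cardinality target
    X, y = make_classification(n_samples=500, n_features=20,
                               n_informative=10, n_classes=5)

    # One binary LogisticRegression is trained per class under the hood
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    print(clf.predict(X[:5]))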