8

How do we use one-hot encoding if the number of values a categorical variable can take is large?

In my case it is 56 values, so under the usual approach I would have to add 56 columns (56 binary features) to the training dataset, which would greatly increase the complexity and hence the training time.

So how do we deal with such cases?

mach

4 Answers

10

If you really care about the number of dimensions, you can still apply a dimensionality-reduction algorithm, such as PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis), after your one-hot encoding.

But know that 56 features isn't really large; in industry it is quite common to work with thousands, millions, or even billions of features.
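
A minimal sketch of this suggestion, assuming a single categorical column named "category" with 56 levels and an arbitrary choice of 10 output components (both are assumptions, not from the answer):

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    # Synthetic column with 56 distinct categorical values
    df = pd.DataFrame({"category": np.random.choice(
        [f"c{i}" for i in range(56)], size=1000)})

    # One-hot encode: 56 binary columns
    one_hot = pd.get_dummies(df["category"], dtype=float)

    # Project the 56 binary features down to 10 dimensions
    pca = PCA(n_components=10)
    reduced = pca.fit_transform(one_hot)

    print(one_hot.shape)  # (1000, 56)
    print(reduced.shape)  # (1000, 10)

Keep in mind that PCA on one-hot features trades away the sparsity and per-category interpretability of the dummies in exchange for fewer dimensions.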

Dawny33
jmvllt
2

You could try reducing the dimension of the resulting 56 dummy features: if some categories represent only a small proportion of the data compared to the majority, you can give them all the same label before encoding.
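
As a rough sketch of this idea (the 5% threshold and the toy data are assumptions, not from the answer), rare categories can be collapsed into a single "other" label before encoding:

    import pandas as pd

    s = pd.Series(["a"] * 500 + ["b"] * 450 + ["c"] * 30 + ["d"] * 15 + ["e"] * 5)

    # Categories below an assumed 5% frequency threshold are relabeled "other"
    freq = s.value_counts(normalize=True)
    rare = freq[freq < 0.05].index
    grouped = s.where(~s.isin(rare), "other")

    # Only 3 dummy columns remain instead of 5
    print(pd.get_dummies(grouped).columns.tolist())  # ['a', 'b', 'other']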

1

It depends on the problem you are working on. If the number of values a categorical variable can take is very large, it is better to use label encoding. But the label encoding should be meaningful, i.e. categories that are close to each other should get similar labels. Say you are creating a model that has a Month feature, and there is a periodicity in your target variable: every x months, say 3 months, the trends are similar. Then it does not make sense to use the labels 1, 2, ..., 12 for the months; instead, it is better to use labels like 0, 1, 2, 0, 1, 2, .... So Jan is 0, Feb is 1, Mar is 2, Apr is 0 again, and so on.

You can use LabelEncoder from sklearn.preprocessing for this problem, but it does not take care of the semantics described above. For that, you can do the label encoding manually.
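
A minimal sketch contrasting the two options, using the 3-month periodicity from the example above (the dictionary-based mapping is just one possible way to do the manual encoding):

    from sklearn.preprocessing import LabelEncoder

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

    # LabelEncoder assigns labels alphabetically, ignoring any semantics:
    # Apr=0, Aug=1, Dec=2, Feb=3, ...
    le = LabelEncoder()
    print(le.fit_transform(months))

    # Manual encoding that respects the assumed 3-month periodicity:
    # Jan=0, Feb=1, Mar=2, Apr=0, May=1, ...
    manual = {m: i % 3 for i, m in enumerate(months)}
    print([manual[m] for m in months])  # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]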

Ricky
-1

When a categorical variable can take a large number of values, it is advisable to use a one-versus-rest approach.
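
The answer does not elaborate, so the following is only a guess at what is meant: scikit-learn's OneVsRestClassifier, which fits one binary classifier per class and therefore scales to targets with many categories. The data here is synthetic:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    # Synthetic multi-class problem standing in for a high-cardinality target
    X, y = make_classification(n_samples=500, n_features=20,
                               n_informative=10, n_classes=5)

    # One binary LogisticRegression is trained per class under the hood
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    print(clf.predict(X[:5]))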