3

I have one categorical variable of string type in my dataset, and I need to convert it to a numerical value for further processing. I know the standard way to represent categorical data is one-hot encoding, but that converts each entry of the variable to a vector.

LabelEncoder from sklearn converts each entry to a scalar value. I realise this is a very naive and possibly stupid question, but which representation is more commonly used, and is there a reason for the bias?

SHASHANK GUPTA

2 Answers

7

The main difference I can think of is that with one-hot encoding all your strings will be at the same (Hamming) distance from each other, while with a scalar value the distances between the resulting features will be meaningless: it may encode "red" as 1, "blue" as 2 and "green" as 3, but there is no reason why red should be more similar to blue than to green.
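To make the distance argument concrete, here is a minimal sketch using sklearn and NumPy. The three colours are the ones from the example above; the `hamming` helper is a hypothetical name introduced only for illustration. Note that LabelEncoder assigns codes in sorted order of the class names, so the actual integers differ from the 1, 2, 3 used in the text.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(["red", "blue", "green"], dtype=object)

# LabelEncoder sorts the classes, so here blue=0, green=1, red=2.
label_codes = LabelEncoder().fit_transform(colors)

# OneHotEncoder expects a 2D array; it returns a sparse matrix by default,
# so convert to a dense array for easy comparison.
onehot = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()


def hamming(u, v):
    """Hypothetical helper: number of positions where two vectors differ."""
    return int(np.sum(u != v))


pairs = [(0, 1), (0, 2), (1, 2)]

# Gap between the scalar codes for each pair of colours.
label_gaps = {
    (str(colors[i]), str(colors[j])): abs(int(label_codes[i]) - int(label_codes[j]))
    for i, j in pairs
}

# Hamming distance between the one-hot vectors for each pair of colours.
onehot_gaps = {
    (str(colors[i]), str(colors[j])): hamming(onehot[i], onehot[j])
    for i, j in pairs
}

print(label_gaps)
print(onehot_gaps)
```

With the scalar codes, some pairs end up closer than others purely because of alphabetical order, whereas every pair of one-hot vectors differs in exactly two positions.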

Jérémie Clos
2

When to use label encoding versus one-hot encoding.

Tree based methods:

When a categorical feature is ordinal, label encoding can lead to better quality, provided it preserves the correct order of the values. In this case a split made by a tree will divide the feature into values 'lower' and 'higher' than the value chosen for this split.
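As a rough illustration of the ordinal case, the sketch below uses sklearn's OrdinalEncoder with an explicitly supplied category order, then fits a depth-1 decision tree. The size labels and the target are made-up toy data, not anything from the question.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy ordinal feature (assumed data for illustration only).
sizes = np.array(
    [["small"], ["medium"], ["large"], ["small"], ["large"], ["medium"]],
    dtype=object,
)

# Pass the order explicitly so that small=0, medium=1, large=2.
# Without it, the encoder would sort alphabetically (large, medium, small).
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
X = encoder.fit_transform(sizes)

# Hypothetical target: only "large" items are positive.
y = np.array([0, 0, 1, 0, 1, 0])

# A single split is enough because the encoding preserves the order.
tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
threshold = tree.tree_.threshold[0]
accuracy = tree.score(X, y)

print(X.ravel(), threshold, accuracy)
```

The learned threshold falls between the codes for "medium" and "large", which is exactly the 'lower' versus 'higher' division described above.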

Non-tree based methods:

One-hot encoding or embeddings should be used. Unless there is a linear relationship between the label encoding and the dependent variable, non-tree based methods will have a hard time with label encoding.
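A small sketch of why a linear model struggles here, assuming a made-up target that depends on the category but is not linear in the label codes. The data and the `target_by_color` mapping are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(["red", "green", "blue"] * 4, dtype=object)

# Hypothetical target: blue and red share a value, green differs.
# With codes blue=0, green=1, red=2 this is not a linear function of the code.
target_by_color = {"blue": 10.0, "green": 0.0, "red": 10.0}
y = np.array([target_by_color[c] for c in colors])

# Same feature, two encodings.
X_label = LabelEncoder().fit_transform(colors).reshape(-1, 1)
X_onehot = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()

# R^2 on the training data for each encoding.
r2_label = LinearRegression().fit(X_label, y).score(X_label, y)
r2_onehot = LinearRegression().fit(X_onehot, y).score(X_onehot, y)

print(r2_label, r2_onehot)
```

On this toy data the label-encoded model explains essentially none of the variance, while the one-hot model can give each category its own coefficient and fits the target.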

One-hot encoding a categorical feature with a huge number of values can lead to high memory consumption. You can use sparse matrices to deal with this problem. You can also ignore a subset of the categories that are rare, to decrease the number of new features.
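One possible way to combine both suggestions is sketched below: count the categories, keep only the frequent ones, and let OneHotEncoder ignore everything else while returning its default sparse output. The city names and the `min_count` threshold are assumptions chosen for illustration.

```python
from collections import Counter

import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy feature with two frequent and three rare categories (assumed data).
cities = np.array(
    ["paris"] * 5 + ["tokyo"] * 4 + ["lima", "oslo", "cairo"],
    dtype=object,
).reshape(-1, 1)

# Hypothetical cutoff: categories seen fewer times than this are ignored.
min_count = 2
counts = Counter(cities.ravel())
frequent = sorted(c for c, n in counts.items() if n >= min_count)

# Only the frequent categories get a column; with handle_unknown="ignore",
# any other value is encoded as an all-zero row instead of raising an error.
encoder = OneHotEncoder(categories=[frequent], handle_unknown="ignore")
X_sparse = encoder.fit_transform(cities)

is_sparse = sparse.issparse(X_sparse)
print(frequent, X_sparse.shape, X_sparse.nnz, is_sparse)
```

The result has one column per frequent category rather than one per distinct value, and only the non-zero entries are stored.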