
I am working on a binary classification problem where I have a mix of continuous and categorical variables.

I created the categorical dummy variables myself using the get_dummies function in pandas.

Now my questions are,

1) I see that there is a parameter called drop_first which is usually given the value True. Why do we have to do this? Say, for example, the gender column has 2 values, Male and Female. If I use drop_first=True, it returns only one column, gender_male, with binary values 1 and 0. If my feature importance returns gender_male as an important feature, am I right to infer that only the Male gender influences the outcome (because Male is denoted as 1 and Female as 0), and that Female (the 0s) doesn't impact the model outcome? Or do 0s in general play no role in ML model predictions?

2) Let's say my gender column has 3 values, for example Male, Female, and Transgender. In this case, if I use drop_first=True, it returns only two columns:

gender_male with 1 and 0 - here 0 represents Transgender, right?

gender_female with 1 and 0 - here 0 represents Transgender, right?

3) What's the disadvantage of not using drop_first=True? Is it only the increase in the number of columns?

Can you help me with the above queries?

The Great

1 Answer


1) Using drop_first=True is more common in statistics and is often referred to as "dummy encoding", while drop_first=False gives you "one-hot encoding", which is more common in ML. For algorithmic approaches like Random Forests it does not make a difference. See also "Introduction to Machine Learning with Python"; Mueller, Guido; 2016:

The one-hot encoding we use is quite similar, but not identical, to the dummy encoding used in statistics. For simplicity, we encode each category with a different binary feature. In statistics, it is common to encode a categorical feature with k different possible values into k–1 features (the last one is represented as all zeros). This is done to simplify the analysis (more technically, this will avoid making the data matrix rank-deficient).
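The difference is easy to see with get_dummies itself. A minimal sketch (note that pandas drops the alphabetically first category when drop_first=True, not a particular one you choose):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Transgender", "Male"]})

# One-hot encoding: one binary column per category (k columns)
one_hot = pd.get_dummies(df, columns=["gender"], drop_first=False)
print(one_hot.columns.tolist())
# ['gender_Female', 'gender_Male', 'gender_Transgender']

# Dummy encoding: k-1 columns; the dropped category ('Female' here,
# the alphabetically first one) is represented by all zeros
dummy = pd.get_dummies(df, columns=["gender"], drop_first=True)
print(dummy.columns.tolist())
# ['gender_Male', 'gender_Transgender']
```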

However, using dummy encoding on a binary variable does not mean that a 0 has no relevance. If gender_male has high importance, that does not in general say anything about the importance of gender_male==0 vs. gender_male==1. It is variable importance and is accordingly calculated per variable. If, for example, you use impurity-based estimates in trees, it only gives you the average reduction in impurity achieved by splitting on this very variable.
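A minimal sketch of this with scikit-learn's impurity-based importances (the data below is made up purely for illustration): even though half of the rows have gender_male == 0, all of the importance is attributed to the gender_male variable as a whole, because importance is computed per variable, not per value.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Hypothetical data: the target depends only on gender_male; age is pure noise
X = pd.DataFrame({
    "gender_male": rng.integers(0, 2, n),
    "age": rng.normal(40, 10, n),
})
y = X["gender_male"]

model = RandomForestClassifier(max_features=None, random_state=0).fit(X, y)
importances = dict(zip(X.columns, model.feature_importances_))
print(importances)
# gender_male gets essentially all the importance; the rows with
# gender_male == 0 contributed to those splits just as much as the 1s
```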

Moreover, if your gender variable is binary, gender_male==1 is equivalent to gender_female==0. Therefore, from a high variable importance of gender_male you cannot infer that being female (or not) is irrelevant.

2) In this case, gender_male==0 AND gender_female==0 together mean Transgender is true. A single 0 in either column alone does not identify the dropped category.
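As a small sketch of this (dropping gender_Transgender explicitly, since pandas' drop_first=True would instead drop gender_Female, the alphabetically first category):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Transgender"]})

# One-hot encode, then drop the Transgender column so the two
# remaining columns match the question's setup
encoded = pd.get_dummies(df, columns=["gender"]).drop(columns="gender_Transgender")

# Only rows where BOTH remaining dummies are 0 are the dropped category
is_transgender = (encoded["gender_Male"] == 0) & (encoded["gender_Female"] == 0)
print(is_transgender.tolist())  # [False, False, True]
```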

3) See 1). For algorithmic approaches in ML there is no statistical disadvantage to using one-hot encoding. (As pointed out in the comments, it might even be advantageous, since tree-based models can directly split on all features when none is dropped.)

Jonathan