Should I do one hot encoding before feature selection and how should I perform feature selection on a dataset with both categorical and numerical data

Question

Newbie here. I am currently teaching myself data science. I am working on a dataset that has both categorical and numerical (continuous and discrete) features (26 columns, 30244 rows). The target is numerical (1, 2, 3). I have several questions.

I have not yet applied any encoding or scaling. As far as I know, since my categorical data are unordered, I have to perform one hot encoding right? As it will increase the number of columns, I am hoping to do that after feature selection. Is that okay?
How can I perform feature selection for this dataset? (Because this has both numerical and categorical data) Should I first do one-hot encoding and then go for checking correlation or t-scores or something like that?

(I am currently focusing on EDA only. I don't have a model in my mind)

Any help is much appreciated. Thank you!

Devashish Prasad · Accepted Answer · 2021-05-31T13:48:53.310

I have to perform one hot encoding right?

Yes

As it will increase the number of columns, I am hoping to do that after feature selection. Is that okay?

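No. Do basic preprocessing first (for example, dealing with missing values), then handle the categorical data, and only then move on to feature selection. Most feature selection methods expect numeric input, so the encoding has to come first. Beware of nominal vs ordinal features: nominal (unordered) features suit one-hot encoding, while ordinal (ordered) features are usually better served by an integer encoding that preserves the order.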
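To make that order of operations concrete, here is a minimal sketch using pandas. The toy DataFrame and its column names (`city`, `size`, `age`) are made up for illustration and are not from the asker's dataset. Filling with the mode and median is just one simple choice among many.

```python
import pandas as pd

# Hypothetical toy data: "city" is nominal, "size" is ordinal, "age" is numeric.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", None, "Pune"],
    "size": ["small", "large", "medium", "small"],
    "age": [25.0, None, 40.0, 31.0],
})

# 1. Basic preprocessing: fill missing values first.
df["city"] = df["city"].fillna(df["city"].mode()[0])  # most frequent category
df["age"] = df["age"].fillna(df["age"].median())      # median of observed ages

# 2. Ordinal feature: map categories to integers that preserve their order.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size"] = df["size"].map(size_order)

# 3. Nominal feature: one-hot encode, producing one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

print(encoded)
```

After this step every column is numeric, so the result can be passed to whichever feature selection method you pick.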

How can I perform feature selection for this dataset?

There are many ways to perform feature selection. You can use the methods you mentioned, as well as many others, such as:

L1 and L2 regularization
Sequential feature selection
Random forests
More techniques in the blog
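As a rough illustration of two of the methods listed above (L1 regularization and random forests), here is a sketch with scikit-learn. The data is synthetic, generated by `make_classification` as a stand-in for an already-encoded dataset with a 3-class target. The choice of `LinearSVC` as the L1-penalized model, the value `C=0.05`, and the "top 5" cutoff are all assumptions for the example, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for an already-encoded dataset with a 3-class target.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=2,
    n_classes=3, random_state=0,
)
X = StandardScaler().fit_transform(X)  # L1 penalties are sensitive to feature scale

# L1 regularization: features whose coefficients shrink to (near) zero are dropped.
l1_model = LinearSVC(C=0.05, penalty="l1", dual=False, random_state=0)
l1_selector = SelectFromModel(l1_model)
X_l1 = l1_selector.fit_transform(X, y)
l1_mask = l1_selector.get_support()  # boolean mask of the kept columns

# Random forest: rank features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]  # most important first
top5 = ranking[:5]

print("Features kept by L1:", np.flatnonzero(l1_mask))
print("Top 5 by forest importance:", top5)
```

Note that after one-hot encoding, a single original categorical feature is spread over several columns, so a selector may keep some of its dummy columns and drop others.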

Should I first do one-hot encoding and then go for checking correlation or t-scores or something like that?

There is a great answer on this issue here.
