
I'm a newbie here and am currently teaching myself data science. I am working on a dataset that has both categorical and numerical (continuous and discrete) features (26 columns, 30244 rows). The target is numerical (1, 2, 3). I have several questions.

  1. I have not performed any encoding or scaling yet. As far as I know, since my categorical data are unordered, I have to perform one hot encoding right? As it will increase the number of columns, I am hoping to do that after feature selection. Is that okay?

  2. How can I perform feature selection for this dataset? (Because this has both numerical and categorical data) Should I first do one-hot encoding and then go for checking correlation or t-scores or something like that?

(I am currently focusing on EDA only. I don't have a model in my mind)

Any help is much appreciated. Thank you!

leahnanno

1 Answer


I have to perform one hot encoding right?

Yes

As it will increase the number of columns, I am hoping to do that after feature selection. Is that okay?

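No. Do basic preprocessing first, such as handling missing values, then encode the categorical data, and only then do feature selection. Be careful to distinguish nominal from ordinal features: one-hot encoding suits nominal (unordered) categories, while ordinal categories have an order that an ordinal encoding can preserve.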
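As a rough illustration of that order (impute, then encode), here is a minimal scikit-learn sketch. The toy DataFrame and its column names (`age`, `city`, `size`) are made up for the example and are not from the asker's dataset; the choice of median / most-frequent imputation is just one reasonable assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],               # numeric, one missing value
    "city": ["Paris", "Oslo", "Oslo", "Paris"],      # nominal (no order)
    "size": ["small", "large", "medium", "small"],   # ordinal (has an order)
})

numeric_cols = ["age"]
nominal_cols = ["city"]
ordinal_cols = ["size"]

preprocess = ColumnTransformer(
    [
        # Numeric: fill missing values with the median, then standardize.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Nominal: fill any missing category, then one-hot encode.
        ("nom", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), nominal_cols),
        # Ordinal: map categories to integers in the stated order.
        ("ord", OrdinalEncoder(categories=[["small", "medium", "large"]]),
         ordinal_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)  # 1 numeric + 2 one-hot + 1 ordinal column
```

The resulting matrix `X` is what you would then pass to a feature selection step.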

How can I perform feature selection for this dataset?

There are many ways to perform feature selection. You can use the methods you mentioned, as well as many others, such as:

  1. L1 and L2 regularization
  2. Sequential feature selection
  3. Random forests
  4. More techniques in the blog
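To make the first three options more concrete, here is a hedged sketch using scikit-learn. It runs on synthetic data from `make_classification` standing in for an already-encoded feature matrix with three target classes; the estimators (`LinearSVC`, `KNeighborsClassifier`) and all hyperparameters are assumptions chosen for illustration, not a recommendation for your data. Note that of the two penalties, only L1 drives coefficients to exactly zero, so it is the one used for selection here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for an encoded feature matrix with 3 target classes.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X = StandardScaler().fit_transform(X)

# 1. L1 regularization: features whose coefficients shrink to (near) zero
#    across all classes are dropped.
l1_model = LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000)
l1_selector = SelectFromModel(l1_model).fit(X, y)
print("L1 keeps:", np.flatnonzero(l1_selector.get_support()))

# 2. Sequential (forward) feature selection: greedily add the feature that
#    improves the cross-validated score the most, until 3 are chosen.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=5),
    n_features_to_select=3, direction="forward", cv=3,
).fit(X, y)
print("SFS keeps:", np.flatnonzero(sfs.get_support()))

# 3. Random forest: rank features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
print("Forest ranking (best first):", ranking)
```

The three methods will not necessarily agree on which features to keep, so it is worth comparing their outputs rather than trusting any single one.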

Should I first do one-hot encoding and then go for checking correlation or t-scores or something like that?

There is a great answer on this issue here.

Devashish Prasad