
I have a dataset with 5K records for a binary classification problem.

My features are min_blood_pressure, max_blood_pressure, min_heart_rate, max_heart_rate, etc. I have more than 15 such measurements, and each of them has a min and a max column, amounting to around 30 variables.

When I ran a correlation analysis on the data, I saw that these input features are highly correlated: min_blood_pressure is highly correlated (>80%) with max_blood_pressure, and each measurement's min and max features are highly correlated with each other, even though their individual correlations with the target variable are low.
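(For context, the correlation check I ran looks roughly like this; the DataFrame `df`, the file name, and the column names are placeholders for my actual data.)

```python
import pandas as pd

# df holds the ~30 min/max feature columns plus the binary "target" column
df = pd.read_csv("vitals.csv")  # placeholder file name
feature_cols = [c for c in df.columns if c != "target"]

# Pairwise correlation between the input features
corr = df[feature_cols].corr()
print(corr.loc["min_blood_pressure", "max_blood_pressure"])  # e.g. > 0.8

# Correlation of each feature with the target (much weaker in my case)
print(df[feature_cols].corrwith(df["target"]).sort_values())
```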

So in this case, which one should I drop, or how should I handle this scenario?

I guess the min and max variables exist for a reason. What would you do in a situation like this?

Should we find the average of all the measurements and create a new feature?

Can anyone help me with this?

The Great

2 Answers


I'd start here. The most basic idea is to run statistical tests to see how the target variable depends on each feature; these include tests like chi-square or ANOVA. Tree-based models can also output feature importances. Check this post. There are also plenty of posts on Kaggle with code that might be worth checking.
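As a rough sketch of both ideas (univariate ANOVA F-tests and tree-based importances), assuming your features are in a DataFrame `X` and the binary target in `y`:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# X: DataFrame with the 30 min/max features, y: binary target (assumed names)
selector = SelectKBest(score_func=f_classif, k="all").fit(X, y)
f_scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(f_scores.head(10))  # features ranked by ANOVA F-value

# Tree-based importances as a second opinion
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```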

As your data set isn't drastically large, you could run a grid search and check how your model behaves for different numbers of PCA components.
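A minimal sketch of that grid search, using a scikit-learn pipeline with a placeholder classifier (logistic regression) and ROC AUC scoring; the component counts are arbitrary:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# X, y assumed as before (30 numeric features, binary target)
pipe = Pipeline([
    ("scale", StandardScaler()),   # PCA is scale-sensitive
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Try different numbers of retained components
param_grid = {"pca__n_components": [5, 10, 15, 20, 30]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```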

It's hard to tell a priori whether you should drop some features. Trying every combination of 30 features is completely out of scope, though you might try dropping the most redundant ones.
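One common way to drop the most redundant ones is to remove one feature from each pair whose absolute correlation exceeds a threshold; a sketch using the 0.8 cut-off mentioned in the question:

```python
import numpy as np

# corr: the feature correlation matrix from before; X: the feature DataFrame (assumed)
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any feature correlated > 0.8 with an earlier-listed feature
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
X_reduced = X.drop(columns=to_drop)
print("dropped:", to_drop)
```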

If your data contains categorical features, it might be a good idea to give CatBoost a try. Its authors claim it handles categorical features better than other gradient boosters. Just keep in mind that its default number of estimators is 10 times that of XGBoost, so you might lower it for experiments.
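If you try it, a minimal CatBoost sketch could look like the following; the lowered `iterations` reflects the advice above, and a `cat_features` argument would only be needed if you actually have categorical columns (your min/max measurements all look numeric):

```python
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# X, y assumed as before
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = CatBoostClassifier(
    iterations=200,       # far below the default of 1000, for quick experiments
    learning_rate=0.1,
    eval_metric="AUC",
    verbose=0,
)
model.fit(X_train, y_train, eval_set=(X_val, y_val))
print(model.get_best_score())
```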

First, I'd create a baseline model with all the features. Then comes the question: which method to choose? Gradient boosters learn feature importances themselves, so redundant features get little weight and you might not see much improvement when dropping them. You might get more insight using more vanilla methods, but if in the end you'll certainly be deploying gradient boosting to production, I don't see much sense in that. I'd stick with XGBoost or CatBoost and perform the experiments using the same parameters.
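Such a baseline could simply be a cross-validated XGBoost model on all features, compared against the reduced feature set under identical (here arbitrary) parameters:

```python
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# X, y, X_reduced assumed from the sketches above; fixed params keep runs comparable
params = dict(n_estimators=200, max_depth=4, learning_rate=0.1, random_state=0)

auc_all = cross_val_score(XGBClassifier(**params), X, y, cv=5, scoring="roc_auc")
auc_red = cross_val_score(XGBClassifier(**params), X_reduced, y, cv=5, scoring="roc_auc")
print("all features:", auc_all.mean(), "reduced set:", auc_red.mean())
```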

Please keep in mind: though some features might be highly redundant, they may still contribute some knowledge to your model.

Piotr Rarus

You said:

Yes, I already ran a few feature selection algorithms like SelectKBest, SelectFromModel, RFE, feature importance, etc., which output both the min and max features, for example min_bp and max_bp. When I did a sanity check by running correlation, I saw that they are all correlated.

In general, you have two options.

  1. You can remove features that are not predictive of the target variable. This involves statistical tests such as ANOVA (see here). Then, based on the F-values, you keep only the features with the highest F-values, meaning they have high predictive power for the target variable.

  2. If you want to remove correlated features, for example when using a regression (where you ideally need uncorrelated variables), then dimensionality reduction such as PCA can be used. In this case, the new features will not be correlated, but you will not be able to project back to the original features, since each principal component is a linear combination of the original features. A short sketch of both options follows below.
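A minimal sketch of both options with scikit-learn (the feature matrix `X`, target `y`, and the choice of 10 features/components are assumptions):

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: the 30 numeric min/max features, y: binary target (assumed)

# Option 1: keep only the 10 features with the highest ANOVA F-values
X_best = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Option 2: replace the correlated features with uncorrelated principal components
X_pca = PCA(n_components=10).fit_transform(StandardScaler().fit_transform(X))
```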
seralouk