I’m building two models (one for a regression problem, the other for a classification task), but I am facing low correlation in the data (lower in the classification problem than in the regression one). Are there any resources or key considerations for building models suitable for such data? Decision trees and related models seem less sensitive to correlation. Are there any models that inherently don't rely on correlation, or is the real issue that the data is non-linear?
3 Answers
I guess that when you say "correlation matrix" you mean one built from Pearson's coefficients, which can only capture linear correlation. Even if you use Spearman's coefficients, they only measure monotonic association. So two variables can be strongly "correlated" in a non-linear way even when these coefficients are very low.
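To illustrate the difference, here is a minimal sketch (the cubic relation is invented for the example): a monotonic but non-linear dependence is seen perfectly by Spearman but understated by Pearson.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical example: y = x**3 is monotonic but not linear in x.
x = np.linspace(-1, 1, 1001)
y = x ** 3

print(round(pearsonr(x, y)[0], 3))   # noticeably below 1: the linear fit is imperfect
print(round(spearmanr(x, y)[0], 3))  # 1.0: the relation is perfectly monotonic
```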
Also, it is possible for two features to each have low correlation with the target and yet, when used together, be good predictors of it.
So in theory you could find a good model even with low correlations between your features and the target.
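A minimal sketch of that last point, using an invented XOR-style target: each feature alone is essentially uncorrelated with the target, yet the pair determines it exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Hypothetical data: y = x1 XOR x2, so y depends on both features jointly.
x1 = rng.choice([0, 1], size=n)
x2 = rng.choice([0, 1], size=n)
y = np.logical_xor(x1, x2).astype(int)

print(round(np.corrcoef(x1, y)[0, 1], 3))  # close to 0
print(round(np.corrcoef(x2, y)[0, 1], 3))  # close to 0
# ...yet (x1, x2) together determine y exactly.
```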
That's the general case. But you are trying to predict stock price volatility, which seems extremely difficult if not impossible (many great minds around the planet have tried without success). Moreover, you are only using past prices; why should past prices predict future volatility? A single piece of news that surprises traders and investors can cause a big jump in volatility.
Well, you may consider the following points:
- The chosen variables may simply have no meaningful relationship with each other or with the target.
- Look for outliers. They could be distorting the correlation estimates and affecting the output.
- Correlation is good for bivariate analysis, but your data might have a more complex relationship. (Think about the sine curve!)
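The sinusoid point can be sketched quickly (a hypothetical example using a cosine over a symmetric interval): y is fully determined by x, yet Pearson's r is numerically zero because the relation is neither linear nor monotonic.

```python
import numpy as np

# Hypothetical sketch: a perfect deterministic relationship with zero
# linear correlation. The positive and negative halves cancel exactly.
x = np.linspace(-np.pi, np.pi, 1001)
y = np.cos(x)

r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 1e-8)  # True: zero linear correlation despite full dependence
```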
A few things here. Correlation is a metric that, given two sets of numbers, tells you how strongly linearly associated their values are. But you need to consider a couple of things beforehand:
- The type of data: methods like `df.corr()` default to calculating the sample (Pearson) correlation, which is only valid when both columns are numeric. If one of the variables you are comparing is not numeric, other measures like Cramér's V must be used to estimate the association.
- When you are dealing with a regression/classification problem, you have to carry out two separate correlation analyses: correlation among the predictors, and correlation between the predictors and the target. For the former, if two predictors are highly correlated, you should consider eliminating one of them, as they contain redundant information; low correlation among predictors is not an indicator that a regression model will perform poorly, as they may simply affect the target independently. For the latter, any variable that is very highly correlated (|r_xy| > 0.6) with the target is not recommended for inclusion in the model, as it would essentially define the predictions by itself.
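A quick sketch of the two analyses with pandas (the column names and data-generating process here are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: x2 is a near-duplicate of x1 (redundant predictor),
# x3 is independent, and y depends on x1 and x3.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
y = 2 * x1 + x3 + rng.normal(size=n)

df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "y": y})
corr = df.corr()  # Pearson by default; numeric columns only

# Predictor-predictor block: the x1/x2 redundancy shows up here.
print(corr.loc[["x1", "x2", "x3"], ["x1", "x2", "x3"]].round(2))

# Predictor-target column: how each feature relates to y on its own.
print(corr.loc[["x1", "x2", "x3"], "y"].round(2))
```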
Finally, answering your questions: models do not rely on correlation by themselves. Correlation is a metric that informs you of the linear association between two variables; there can be other, non-linear relations between them. There are models that can take this into account, and there are transformations you can apply to your features to linearize the problem.
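As a sketch of the transformation idea (an invented exponential example): when y grows exponentially in x, Pearson's r on the raw values understates the dependence, while a log transform makes the relationship exactly linear.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 1000)
y = np.exp(x)  # strongly non-linear in x

pearson_raw = np.corrcoef(x, y)[0, 1]
pearson_log = np.corrcoef(x, np.log(y))[0, 1]  # log linearizes the relation
print(round(pearson_raw, 2), round(pearson_log, 2))  # pearson_log is 1.0
```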
A couple of resources to start with, over here: