1

Title.

I have a dataset that's highly imbalanced, say the output variable I want to predict is restricted within the range from 0 to 1, but almost all of the datapoints sit around 0.7-0.9, while my prediction set is mostly values from 0 to 0.4.

I see there is a huge gap. So far, my prediction is completely off. I believed it is caused by imbalanced dataset, but due to the nature of my data, I can't add any more points. I tried some data augmentation strats but did not work very well. I am also aware of oversampling methods like ADASYN and SMOTE, but since my dataset are purely numerical, those methods are not applicable.

Is there any way to deal with this dilemma, or I have to admit that due to the distribution of my dataset, such regression issue is unsolvable?


update 2024-06-07

I am using Random Forest from Scikit-learn package. I did try some benchmark models like linear regression, but the result is even worse (x-reality, y-prediction): enter image description here below is my prediction vs reality result using random forest, as you can see, most of the predictions is far from correct. R2 is less than 0, RMSE is about 0.5.

prediction vs reality plot

"When you refer to the prediction set, do you mean that your predictions are between 0 and 0.4 or that the true values are between 0 and 0.4?"

I meants the true values are between 0 and 0.4.
Yuuya
  • 51
  • 5

1 Answers1

0

Your setup has you in a position where you’re trying to predict something you are not modeling. When you train on one group and then test on a considerably different group, of course the predictions are terrible when you have not trained to detect relationships for that group.

You need to model your group of interest. Since you performed the experiments to collect data for that group, you have the information available to make predictions about your group of interest. It also makes sense to include the current training group, as that gives the model more data and also puts you in a position be able to predict higher outcomes.

Your manager wanting to exclude the group of interest from the training has made a mistake with conditional probability. You go into a prediction problem not knowing what the outcome is, so your manager’s argument has an implicit conditioning on the unknown. That’s a problem when the whole point is to develop a system to make predictions about whether or not the results of a chemical reaction will be a high value or not. It is time to argue with the manager and win that argument.

Dave
  • 4,542
  • 1
  • 10
  • 35