Title.
I have a highly imbalanced dataset. The output variable I want to predict is restricted to the range 0 to 1, but almost all of my data points sit around 0.7-0.9, while my prediction set mostly has values from 0 to 0.4.
I see there is a huge gap. So far, my predictions are completely off. I believe this is caused by the imbalanced dataset, but due to the nature of my data, I can't add any more points. I tried some data augmentation strategies, but they did not work very well. I am also aware of oversampling methods like ADASYN and SMOTE, but since my dataset is purely numerical, those methods are not applicable.
Is there any way to deal with this dilemma, or do I have to accept that, given the distribution of my dataset, this regression problem is unsolvable?
Update 2024-06-07
I am using Random Forest from the Scikit-learn package. I did try some benchmark models like linear regression, but the result is even worse (x: reality, y: prediction):
Below is my prediction vs. reality result using random forest. As you can see, most of the predictions are far from correct. R2 is less than 0, and RMSE is about 0.5.
"When you refer to the prediction set, do you mean that your predictions are between 0 and 0.4 or that the true values are between 0 and 0.4?"
I meant that the true values are between 0 and 0.4.
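A minimal sketch of this situation, using synthetic data rather than the real dataset: the single feature, the relationship `y = x`, the noise level, and the sample sizes are all assumptions made up for illustration. It fits a `RandomForestRegressor` on targets around 0.7-0.9 and evaluates on true values between 0 and 0.4, then computes R2 and RMSE the same way as described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Hypothetical training data: targets concentrated around 0.7-0.9.
# The relationship y = x (plus small noise) is an assumption for illustration.
X_train = rng.uniform(0.7, 0.9, size=(500, 1))
y_train = X_train[:, 0] + rng.normal(0.0, 0.01, size=500)

# Hypothetical prediction set: true values between 0 and 0.4.
X_test = rng.uniform(0.0, 0.4, size=(200, 1))
y_test = X_test[:, 0]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"training target range: [{y_train.min():.3f}, {y_train.max():.3f}]")
print(f"prediction range:      [{y_pred.min():.3f}, {y_pred.max():.3f}]")
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

In this sketch the predictions stay inside the range of the training targets, because each tree predicts an average of training target values. That is one possible reason for a negative R2 when the true values lie outside that range, though it may not be the only factor in the real data.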
