Suppose I have several datasets for the same prediction task, for example housing price prediction:

dataset 1: 100 training and 100 testing samples, 50 features

dataset 2: 100 training and 100 testing samples, 120 features

dataset 3: 1000 training and 1000 testing samples, 50 features

dataset 4: 1000 training and 1000 testing samples, 5000 features

For each of these datasets, how should I choose the best method for estimating the unknown parameters of a linear regression model (to predict price) from the following?

  • Ordinary least squares

  • Stepwise regression

  • Principal component regression

  • Partial least squares regression

Should I try each of these one by one and compare the results, or is there a rule of thumb for when to use each of them based on the dataset?

Please help

Ethan
Sreejithc321

1 Answer

After data munging, choosing the method is the most difficult task in building a prediction model. However, to answer the question we need more details. What do you mean by "best"? Do you want high accuracy at the cost of a long training time? Do you need something really fast with lower accuracy? Something in between? What are your features? Did you use them as given, or did you create new features from them?

In any case, I suggest you spend some time reading Microsoft's excellent tutorial on Machine Learning. Here is an excerpt from it to show what I mean:

Regression

  1. Ordinal regression: Data in rank ordered categories
  2. Poisson regression: Predicting event counts
  3. Fast forest quantile regression: Predict a distribution
  4. Linear regression: Fast training, linear model
  5. Bayesian linear regression: Linear model, small data sets
  6. Neural Network regression: Accuracy, long training time
  7. Decision forest regression: Accuracy, fast training
  8. Boosted decision tree regression: Accuracy, fast training

When I have a similar question and don't know which algorithm to choose, I usually narrow it down to 3-4 candidates based on the Microsoft cheat sheet or the one from scikit-learn, try them all, and keep the one or two with the best results.

Tasos