Best-practice advice for linear regression: if the training data contains entries that do not need predictions, is it common to remove them? For example, if you are predicting a fare amount but some fares are flat-fee fares (which do not need to be predicted, since they are predetermined), is it best practice to remove these from a sampled data set before training? Or does removing them bias the data?

This is the shortened version of the question I asked here: Best practice advice for known target values before training a linear regression model?

ssou

1 Answer

Remove your irrelevant observations.

In your example, I certainly would remove the flat-fee observations. If you're looking for the effect of some variable on the fare, then including the flat-fee observations will distort the estimated relationship, because for those observations the fare does not depend on that variable at all.

Your regression model estimates the relationship between your independent (explanatory) variables and your outcome variable. You can interpret the parameter estimate for an independent variable $X$ as the average change in the outcome associated with a one-unit change in $X$, holding all other variables constant.

Because it is an average effect, including observations for which there is no effect will pull your parameter estimates closer to 0 than if you removed them: a "bias", if you like.
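To make the attenuation concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the fare formula, the flat fare of 52, the 30% share of flat-fee rides, and the choice to make flat-fee status independent of distance. If flat fees are tied to distance in your real data (for example, mostly long trips), the distortion could look different.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process (all values are assumptions).
n = 10_000
true_slope = 2.5   # metered fare added per unit of distance
base_fare = 3.0    # metered fare at distance 0
flat_fare = 52.0   # predetermined fare, unrelated to distance
flat_share = 0.3   # fraction of rides charged the flat fare

distance = rng.uniform(1.0, 20.0, size=n)
# Flat-fee status is drawn independently of distance (an assumption).
is_flat = rng.random(n) < flat_share

metered = base_fare + true_slope * distance + rng.normal(0.0, 1.0, size=n)
fare = np.where(is_flat, flat_fare, metered)


def ols_slope(x, y):
    """Slope of a one-variable least-squares fit of y on x."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope


slope_all = ols_slope(distance, fare)                        # flat fees kept
slope_metered = ols_slope(distance[~is_flat], fare[~is_flat])  # flat fees removed

print(f"slope with flat fees included: {slope_all:.2f}")
print(f"slope with flat fees removed:  {slope_metered:.2f}")
```

Under these assumptions, the fit on metered rides only should recover a slope close to `true_slope`, while the pooled fit should land near `(1 - flat_share) * true_slope`, i.e. shrunk toward 0 in proportion to the share of flat-fee rides.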

Your estimates will also be harder to interpret with these irrelevant observations included. It's much easier to say you're predicting only the fares that are not flat fees, because I assume you don't need a model to predict flat fees anyway!

Jamie