What is the best practice for the following scenario? Suppose the data used to train a linear regression model contains a subset of observations whose target is already known and does not need to be predicted. What should be done with these observations before training?
For example, suppose the goal is to estimate the fare for a taxi ride that hasn't taken place yet, and some of those rides have a predetermined flat fare (like an airport ride). I would expect the predictions for those rides, and their residuals, to weaken the model's accuracy. Should these flat-fare observations:
- be removed before training?
- be removed before training and then be brought back in with their known rate amount prior to model evaluation?
- or be left in for training, with their predicted values then overwritten with the correct flat fare before the model is evaluated?
Part B: If option #2 is the right approach, how is it done in practice? I suspect it becomes harder if the data has been scaled.
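To make Part B concrete, here is a rough sketch of how I imagine option #2 working. Everything in it is an assumption for illustration: the ride data, the `FLAT_FARE` value, the single distance feature, and the helper names are all made up, and a hand-rolled one-feature least-squares fit stands in for whatever regression library would really be used. The model is fitted on metered rides only, the feature is standardized using statistics from those rides, and the known flat fare is written over the prediction just before evaluation.

```python
from statistics import mean, pstdev

FLAT_FARE = 52.0  # hypothetical flat airport rate, in original fare units

# Hypothetical rides: (distance_miles, is_flat_fare, actual_fare)
rides = [
    (1.0, False, 5.5),
    (2.0, False, 8.0),
    (3.0, False, 10.5),
    (4.0, False, 13.0),
    (17.0, True, 52.0),
    (18.5, True, 52.0),
]


def fit_simple_ols(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mean_abs_error(actual, predicted):
    """Mean absolute error between two equal-length lists."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])


# 1. Train on metered rides only; flat-fare rides are held out of the fit.
metered = [r for r in rides if not r[1]]
train_x = [r[0] for r in metered]
train_y = [r[2] for r in metered]

# 2. Standardize the feature using statistics from the training rows only.
mu, sigma = mean(train_x), pstdev(train_x)
scaled_x = [(x - mu) / sigma for x in train_x]
intercept, slope = fit_simple_ols(scaled_x, train_y)

# 3. Predict every ride with the same scaling (target stays in fare units).
actual = [r[2] for r in rides]
model_only = [intercept + slope * (r[0] - mu) / sigma for r in rides]

# 4. Bring the flat-fare rides back in by overwriting their predictions
#    with the known rate, then evaluate on all rides.
with_override = [FLAT_FARE if r[1] else p for r, p in zip(rides, model_only)]

mae_model_only = mean_abs_error(actual, model_only)
mae_with_override = mean_abs_error(actual, with_override)
print("MAE, model only:   ", mae_model_only)
print("MAE, with override:", mae_with_override)
```

In this sketch the override is applied in fare units after prediction, so scaling the features does not get in the way. If the target were scaled as well, I assume the predictions would have to be inverse-transformed back to fare units before the flat fare is written in.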
Thanks in advance.