
As far as I know, tree models (such as those trained with xgboost or lightgbm) make reasonable predictions only if the input feature vector is similar to the training data. If the feature vector looks like an outlier, the prediction is not reliable.

So my question is: how can I determine whether it is feasible to apply the model to a given feature vector?

I initially tried a one-class SVM to determine whether a given feature vector is close to the training set, but it turned out that the SVM cannot handle a large dataset (a few million samples and ~1000 features).

1 Answer


As you have indicated, you can use anomaly detection on the input data. One model that scales well is IsolationForest, which, being tree-based, is similar in nature to xgboost. You can also likely subsample your data considerably and still get a reasonable model.
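For illustration, here is a minimal sketch using scikit-learn's IsolationForest; the array names, sizes, and `max_samples` value are placeholder assumptions, not anything from the question:

```python
# Hedged sketch: flag out-of-distribution inputs with an IsolationForest.
# X_train and x_new are hypothetical stand-ins for your actual data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100_000, 50))   # stand-in for the training features

# max_samples subsamples the data per tree, which keeps fitting cheap
# even when the full dataset has millions of rows.
detector = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
detector.fit(X_train)

x_new = rng.normal(size=(1, 50))           # stand-in for an incoming vector
# predict() returns +1 for inliers and -1 for outliers; score_samples()
# gives a continuous anomaly score if you prefer to set your own threshold.
if detector.predict(x_new)[0] == -1:
    print("Input looks out-of-distribution; treat the prediction with caution.")
```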

Furthermore, you could also check each feature individually, for example by comparing it to its empirical distribution in the training set. This is a simple way of catching simple (but common) problems, and it has the advantage of making it easy to see exactly what is problematic about the input.
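A hedged sketch of such a per-feature check, comparing the incoming vector against percentile bounds estimated from the training data (the 0.1/99.9 percentile thresholds and array names are purely illustrative):

```python
# Per-feature sanity check: flag any feature of the new vector that
# falls outside the central 99.8% of the training values.
# X_train and x_new are hypothetical placeholders, as above.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100_000, 50))
x_new = rng.normal(size=50)

lo = np.percentile(X_train, 0.1, axis=0)   # illustrative lower bounds
hi = np.percentile(X_train, 99.9, axis=0)  # illustrative upper bounds

# Report which specific features are out of range, which makes the
# problem with the input easy to understand and communicate.
for idx in np.flatnonzero((x_new < lo) | (x_new > hi)):
    print(f"feature {idx}: {x_new[idx]:.3f} outside "
          f"training range [{lo[idx]:.3f}, {hi[idx]:.3f}]")
```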

It is also wise to reduce the features used to what is actually needed. A feature that is not used cannot cause a model to go out-of-distribution!
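One possible way to do this pruning, sketched here with xgboost's scikit-learn API (the model, placeholder data, and zero-importance criterion are assumptions for illustration, not the only reasonable choice):

```python
# Sketch: drop features the fitted model never uses, so they cannot
# trigger the anomaly checks above. X_train / y_train are placeholders.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 50))
y_train = rng.normal(size=10_000)

model = XGBRegressor(n_estimators=50).fit(X_train, y_train)
keep = np.flatnonzero(model.feature_importances_ > 0)  # features actually used
X_train_reduced = X_train[:, keep]
print(f"kept {keep.size} of {X_train.shape[1]} features")
```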

Jon Nordby