Despite the downvote, the question is clear, and a common one I'm sure most stumble across after doing machine learning work for some time.
The goal was to make a stronger predictive model from multiple trained models.
Quote from my question:
is it possible to aggregate these 30 fitted models into a single
model?
Answer: yes, but there's no built-in functionality in sklearn that lets you do this.
Verbose answer:
Imagine you have 30 CSV files, each containing 15,000 class-0 and 15,000 class-1 samples. In other words, an equally balanced set of binary responses (no class imbalance). I generated these files myself because 1) the data I'm working with is too big to fit into memory (point #1 of my original question) and 2) it contains 97% class 0 and 3% class 1 (a large class imbalance). My goal is to see if it's even possible to distinguish between a 0 and a 1 once the class imbalance issue is removed from the equation. If there were distinguishing features to be found, I'd want to know what they were.
To generate each batch, I grabbed 15,000 class-1 responses, randomly sampled 15,000 more from the 97% class-0 pool, joined them into one dataset (30,000 samples total), then shuffled the rows.
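A minimal sketch of that batch generation, assuming the class-1 rows fit in memory and the class-0 rows can be sampled from some larger source (df_minority, df_majority, and the batch file names here are hypothetical placeholders, not my exact code):

import pandas as pd

def make_batch(df_minority, df_majority, n=15000, seed=0):
    '''Build one balanced batch of 2*n rows and shuffle it.'''
    pos = df_minority.sample(n=n, random_state=seed)   # 15,000 class-1 rows
    neg = df_majority.sample(n=n, random_state=seed)   # 15,000 class-0 rows
    batch = pd.concat([pos, neg])
    return batch.sample(frac=1, random_state=seed).reset_index(drop=True)  # shuffle rows

# e.g. for i in range(30): make_batch(df_minority, df_majority, seed=i).to_csv(f'batch_{i}.csv', index=False)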

I then went through each batch and trained an XGBClassifier() on it, using the optimal parameters found by GridSearchCV() for each, and saved each model to disk using Python's pickle module.
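Roughly, the training loop looked something like the sketch below (the parameter grid, the y column name, and the batch file names are hypothetical stand-ins, not the exact values I used):

import pickle
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, 7], 'n_estimators': [100, 300]}  # placeholder grid

for i in range(30):
    batch = pd.read_csv(f'batch_{i}.csv')
    X, y = batch.drop(columns='y'), batch['y']

    # grid search on this batch, keep the best estimator
    search = GridSearchCV(XGBClassifier(), param_grid, cv=3)
    search.fit(X, y)

    # persist the fitted model for this batch
    with open(f'./models/model_{i}.pkl', 'wb') as f:
        pickle.dump(search.best_estimator_, f)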
At this point, you have 30 saved models.
The /models/ directory looks like this:
['model_0.pkl', 'model_1.pkl', 'model_10.pkl', 'model_11.pkl', 'model_12.pkl',
'model_13.pkl', 'model_14.pkl', 'model_15.pkl', 'model_16.pkl', 'model_17.pkl',
'model_18.pkl', 'model_19.pkl', 'model_2.pkl', 'model_20.pkl', 'model_21.pkl',
'model_22.pkl', 'model_23.pkl', 'model_24.pkl', 'model_25.pkl', 'model_26.pkl',
'model_27.pkl', 'model_28.pkl', 'model_29.pkl', 'model_3.pkl', 'model_4.pkl',
'model_5.pkl', 'model_6.pkl', 'model_7.pkl', 'model_8.pkl', 'model_9.pkl']
Again, going back to the original question - is it possible to aggregate these into a single model? I wrote a pretty basic Python function for this.
import os
import re
import pickle

import pandas as pd

def xgb_predictions(X):
    ''' returns a dataframe of predictions from the 30 saved models '''
    predictions = {}
    for pkl_file in os.listdir('./models/'):
        # model_<n>.pkl -> n, used as the column name for that model's "vote"
        file_num = int(re.search(r'\d+', pkl_file).group())
        with open(os.path.join('models', pkl_file), mode='rb') as f:
            xgb = pickle.load(f)
        predictions[file_num] = xgb.predict(X)
    new_df = pd.DataFrame(predictions)
    return new_df[sorted(new_df.columns)]  # order columns by model number
It iterates through each file in the directory of saved models and loads them one at a time, each casting its own "vote" so to speak. The end result is a new dataframe of predictions.
Since this new dataset is much smaller (number of samples x 30 columns), it will fit into memory. I iterate through each batch file, call xgb_predictions(X) to get a dataframe of predictions, add the y column to that dataframe, and append it to a new file. I then read the full dataset of predictions and fit a "level 2" model where X is the prediction data and y is still y.
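Put together, that level-2 step looks roughly like this (a sketch only: it reuses xgb_predictions() from above, assumes the same hypothetical batch files and y column, and keeps the predictions in memory rather than appending them to a file as I actually did):

import pandas as pd
from xgboost import XGBClassifier

frames = []
for i in range(30):
    batch = pd.read_csv(f'batch_{i}.csv')
    X, y = batch.drop(columns='y'), batch['y']
    preds = xgb_predictions(X)      # 30 columns, one "vote" per level-1 model
    preds['y'] = y.values
    frames.append(preds)

level2_data = pd.concat(frames, ignore_index=True)

# the "level 2" model is trained on the predictions alone
level2_model = XGBClassifier()
level2_model.fit(level2_data.drop(columns='y'), level2_data['y'])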
So to recap, the concept is: for binary classification, create equally balanced datasets, train a model on each, then run through each dataset and let every trained model cast a prediction. This collection of predictions is, in a way, a transformed version of your original dataset. You then build another model to predict y from the predictions dataset. It all trains in memory, so you end up with a single model. Any time you want to predict, you take your dataset, transform it into predictions, then use your level 2 trained model to cast the final prediction.
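So the final prediction step is just a two-stage pipeline, something like this (again reusing the hypothetical level2_model from the sketch above):

def final_predict(X_new):
    '''Transform raw features into level-1 "votes", then apply the level-2 model.'''
    level1_votes = xgb_predictions(X_new)      # one column per saved model
    return level2_model.predict(level1_votes)  # single aggregated prediction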
This technique performed much better than any single model did on its own, and even better than stacking with UNTRAINED prior models.