Model ensemble with Spark or Scikit Learn

Question

I am using Spark MLlib to make predictions, and I would like to know whether it is possible to create custom Estimators.

Here is a reproducible example of what I would like my model to do with the Spark API:


from sklearn.datasets import load_diabetes
import pandas as  pd
import pyspark 
from pyspark.ml.feature import VectorAssembler, SQLTransformer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
# Query diabetes data 
diab = load_diabetes()
df = pd.DataFrame(diab.data, columns=diab.feature_names)
df['is_male'] = df.sex > 0
df.drop('sex', inplace=True, axis=1)
df['label'] = diab.target
# Model made with Spark

spark = pyspark.sql.SparkSession.builder.master('local[10]').appName('A random spark context').getOrCreate()
def create_gender_model_male(): 
    return Pipeline(stages=[SQLTransformer(statement='SELECT * FROM __THIS__ WHERE is_male'),
                                VectorAssembler(inputCols=['age', 'bmi', 'bp', 's1'],outputCol='features'),
                                LogisticRegression(featuresCol='features', labelCol='label', maxIter=100,
                                                   elasticNetParam=1, regParam=0.)
                                ])
def create_gender_model_female(): 
    return Pipeline(stages=[SQLTransformer(statement='SELECT * FROM __THIS__ WHERE not is_male'),
                                VectorAssembler(inputCols=['age', 'bmi', 'bp', 's1'],outputCol='features'),
                                LogisticRegression(featuresCol='features', labelCol='label', maxIter=100,
                                                   elasticNetParam=1, regParam=0.)
                                ])
df = spark.createDataFrame(df)
class MixedModel():
    def __init__(self):
        self.models = {'male': create_gender_model_male(), 'female': create_gender_model_female()}
        self.fitted_models = {'male': None, 'female': None}
    def fit(self, df): 
        self.fitted_models['male'] = self.models['male'].fit(df)
        self.fitted_models['female'] = self.models['female'].fit(df)
    def predict(self, df): 
        return self.fitted_models['male'].transform(df).union(self.fitted_models['female'].transform(df))
mm = MixedModel()
mm.fit(df)
mm.predict(df)

Here, for example, I have one logistic regression per sex, but I would also like to be able to predict with a tree for males and with logistic regression for females if I want.

In a perfect world there would be a function:

ModelAggregation(('is_male is true', model_for_male), ('is_male is false', model_for_female))

which would return an object like my MixedModel above.

Pierre Nodet · Answer 1 · 2019-09-06T17:53:08.163

As you said in your question, there is no way to do that with the baseline algorithms provided in MLlib.

There are two ways you could do it:

Creating a function that generates a Pipeline
Creating a meta Estimator that takes your base learners and the forking column

The first one is what you have stated in your question and the second one has been explained by Brian Spiering.
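Since the question title also mentions scikit-learn, here is a minimal sketch of what such a meta estimator could look like there. `ForkedClassifier`, `fork_column`, `estimator_true` and `estimator_false` are names made up for this illustration, not taken from any library, and the sketch assumes the forking column is a numeric 0/1 column of a NumPy array.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


class ForkedClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical meta estimator: one base learner per value of a 0/1 column."""

    def __init__(self, fork_column, estimator_true, estimator_false):
        self.fork_column = fork_column          # index of the forking column
        self.estimator_true = estimator_true    # learner for rows where column > 0
        self.estimator_false = estimator_false  # learner for the remaining rows

    def _split(self, X):
        X = np.asarray(X, dtype=float)
        mask = X[:, self.fork_column] > 0
        # The forking column is constant inside each branch, so drop it.
        features = np.delete(X, self.fork_column, axis=1)
        return mask, features

    def fit(self, X, y):
        mask, features = self._split(X)
        y = np.asarray(y)
        # clone() keeps the estimators passed to __init__ unfitted.
        self.model_true_ = clone(self.estimator_true).fit(features[mask], y[mask])
        self.model_false_ = clone(self.estimator_false).fit(features[~mask], y[~mask])
        return self

    def predict(self, X):
        mask, features = self._split(X)
        predictions = np.empty(len(mask), dtype=self.model_true_.classes_.dtype)
        # Only call predict on non-empty branches.
        if mask.any():
            predictions[mask] = self.model_true_.predict(features[mask])
        if (~mask).any():
            predictions[~mask] = self.model_false_.predict(features[~mask])
        return predictions
```

Usage would look like `ForkedClassifier(0, DecisionTreeClassifier(), LogisticRegression()).fit(X, y)`, with column 0 playing the role of `is_male`. Because it derives from `BaseEstimator`, it should also be usable inside `GridSearchCV` or a scikit-learn `Pipeline`.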

As you said, making a custom Estimator and Transformer will make it work nicely with all the MLlib tools, such as the tuning tools.

If you want a really precise example of how to do so, there is my library, which implements meta-algorithms for ensemble learning with Spark.

The Stacking Classifier shows a fairly simple way to train multiple estimators in parallel, and each of the meta-algorithms has an example of how to take an estimator as a parameter (to choose either your logistic regression or your decision tree).

You can pick ideas out of that!

Edit: Similar question on StackOverflow

score 0 · Answer 2 · answered Apr 18 '19 at 15:17

You could just pass gender as a parameter to create a pipeline:

def create_pipeline(gender):
    # Build the row filter for the requested gender
    predicate = "is_male" if gender == "male" else "not is_male"
    select_statement = "SELECT * FROM __THIS__ WHERE {predicate}".format(predicate=predicate)

    pipeline = Pipeline(stages=[SQLTransformer(statement=select_statement),
                                VectorAssembler(inputCols=['age', 'bmi', 'bp', 's1'],outputCol='features'),
                                LogisticRegression(featuresCol='features', labelCol='label', maxIter=100,
                                                   elasticNetParam=1, regParam=0.)
                            ])
    return pipeline

Then store all the models

models = {'model_male': create_pipeline(gender='male'),
          'model_not_male': create_pipeline(gender='not_male')}

Fit each model

for model_type in models:
    pipeline = models[model_type]
    model = pipeline.fit(df)
    model.transform(df)
