I am working on the Boston challenge hosted on Kaggle and I'm still refining my features. Looking at the dataset, I realize that some columns need to be binary-encoded, some need to be encoded as integer ranks (on a scale of 1 to n), and some need to be one-hot encoded. I've collected these columns into separate lists, based on my own judgement of how each should be encoded:

categorical_columns = ['MSSubClass', 'MSZoning', 'Alley', 'LandContour', 'Neighborhood', 'Condition1', 'Condition2',
                       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating',
                       'Functional', 'GarageType', 'PavedDrive', 'SaleType', 'SaleCondition']

binary_columns = ['Street', 'CentralAir']

ranked_columns = ['LotShape', 'Utilities', 'LandSlope', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
                  'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu',
                  'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']

A fellow Stack Exchange user suggested that I use the pandas.get_dummies() function to one-hot encode categorical variables like MSZoning and assign the result to a variable like this:

OHE_MSZoning = pd.get_dummies(train['MSZoning'])

I'd like to know how I can automate this process for every column in these lists, using functions and control-flow statements such as a for loop.

Andros Adrianopolos

1 Answer

I'm the fellow Stack Exchange user, hi! I wrote a function that applies one-hot encoding to each of your categorical_columns in turn:

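import pandas as pd

def serial_OHE(df, categorical_columns):

    # iterate on each categorical column
    for col in categorical_columns:

        # take one-hot encoding; prefix the new columns with the source
        # column name so that levels shared by several columns don't collide
        OHE_sdf = pd.get_dummies(df[col], prefix=col)

        # drop the old categorical column (returns a new frame, so the
        # caller's dataframe is not modified)
        df = df.drop(col, axis=1)

        # attach one-hot encoded columns to the dataframe, keeping the
        # column labels so later iterations can still look up df[col]
        df = pd.concat([df, OHE_sdf], axis=1)

    return df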

So you can call it like this:

df = serial_OHE(df, categorical_columns)
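
As an alternative sketch, pd.get_dummies can also be given the whole dataframe together with a columns argument, which encodes all listed columns in one call and prefixes each new column with the name of its source column. The toy dataframe below is made up for illustration and is not taken from the Kaggle data:

```python
import pandas as pd

# Hypothetical toy data standing in for the real training set
train = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

categorical_columns = ["MSZoning"]

# Encode every listed column in one call; columns that are not listed
# pass through unchanged
encoded = pd.get_dummies(train, columns=categorical_columns)

print(list(encoded.columns))
```

This avoids the explicit loop, at the cost of less control over what happens to each column individually.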

Let me know if there are any problems.
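
For the binary_columns and ranked_columns from the question, a minimal sketch using Series.map with explicit dictionaries might look like this. The helper name encode_with_maps, the toy data, and the level orderings (for example Po < Fa < TA < Gd < Ex for the quality columns) are assumptions for illustration; check them against the dataset's data description before relying on them:

```python
import pandas as pd

# Hypothetical toy data standing in for the real training set
train = pd.DataFrame({
    "CentralAir": ["Y", "N", "Y"],
    "ExterQual": ["Gd", "TA", "Ex"],
    "KitchenQual": ["TA", "Fa", "Gd"],
})

# Assumed mappings: verify the levels and their order for each column
binary_maps = {"CentralAir": {"N": 0, "Y": 1}}
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
ranked_maps = {"ExterQual": quality_scale, "KitchenQual": quality_scale}

def encode_with_maps(df, column_maps):
    """Return a copy of df with each listed column replaced by its mapped codes."""
    df = df.copy()
    for col, mapping in column_maps.items():
        # Values that are missing from the mapping become NaN
        df[col] = df[col].map(mapping)
    return df

encoded = encode_with_maps(train, {**binary_maps, **ranked_maps})
print(encoded)
```

Keeping one dictionary per column makes the ordering explicit, which matters for ranked columns because the integer codes carry meaning.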

Leevo