One Hot Encoding for any kind of dataset

Question

How can I one-hot encode an unknown dataset by iterating over its columns, checking each column's dtype, and deciding whether to encode based on the column's number of unique values? Also, how can I keep track of the new one-hot encoded columns alongside the original dataset?

Carlos Mougan · Accepted Answer · 2020-07-13T07:23:35.213

I recommend the one-hot encoder from the category_encoders package, selecting the columns to encode with pandas select_dtypes.

import numpy as np
import pandas as pd
from category_encoders.one_hot import OneHotEncoder

pd.options.display.float_format = '{:.2f}'.format  # to make legible

# make some data
df = pd.DataFrame({'a': ['aa', 'bb', 'cc'] * 2,
                   'b': [True, False] * 3,
                   'c': [1.0, 2.0] * 3})

# encode only the object (string) columns
cols_encoding = df.select_dtypes(include='object').columns.tolist()
ohe = OneHotEncoder(cols=cols_encoding)
encoded = ohe.fit_transform(df)

Note that you can change how unseen categories are handled with the handle_unknown parameter:

handle_unknown: str

Options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if 'indicator' is used, an extra column is added when the transform matrix has unknown categories, which can cause unexpected changes in dimension in some cases.
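
The question also asks about checking the number of unique values and keeping track of which new columns came from which original column. A minimal sketch of one way to do that is below. It uses plain pandas (pd.get_dummies) rather than category_encoders, and the function name one_hot_low_cardinality, the max_unique threshold and the column_map dictionary are hypothetical names chosen for illustration.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def one_hot_low_cardinality(df, max_unique=10):
    """One-hot encode text-like columns that have at most `max_unique` values.

    Returns (encoded_df, column_map), where column_map maps each original
    column name to the list of indicator columns created from it.
    The threshold of 10 is an arbitrary illustrative default.
    """
    pieces = []      # frames/series to concatenate, in original column order
    column_map = {}  # original column -> new indicator columns
    for col in df.columns:
        series = df[col]
        text_like = (
            is_object_dtype(series)
            or is_string_dtype(series)
            or isinstance(series.dtype, pd.CategoricalDtype)
        )
        if text_like and series.nunique() <= max_unique:
            # Indicator columns are named "<original>_<value>"
            dummies = pd.get_dummies(series, prefix=col, dtype=int)
            column_map[col] = dummies.columns.tolist()
            pieces.append(dummies)
        else:
            pieces.append(series)  # leave other columns untouched
    return pd.concat(pieces, axis=1), column_map


df = pd.DataFrame({'a': ['aa', 'bb', 'cc'] * 2,
                   'b': [True, False] * 3,
                   'c': [1.0, 2.0] * 3})
encoded, column_map = one_hot_low_cardinality(df, max_unique=5)
print(column_map)  # {'a': ['a_aa', 'a_bb', 'a_cc']}
```

Because encoded keeps the index of df, something like df.join(encoded[column_map['a']]) places the indicator columns next to the original column they came from.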
