How can I one-hot encode an unknown dataset by iterating over its columns, checking each column's dtype, and deciding whether to encode it based on its number of unique values? Also, how can I keep track of which new one-hot encoded columns correspond to which columns of the original dataset?

1 Answer

I would recommend using the one-hot encoder from the category_encoders package and selecting the columns you want to encode with pandas select_dtypes.

import numpy as np
import pandas as pd     
from category_encoders.one_hot import OneHotEncoder

pd.options.display.float_format = '{:.2f}'.format # to make legible

# make some data
df = pd.DataFrame({'a': ['aa', 'bb', 'cc'] * 2, 'b': [True, False] * 3, 'c': [1.0, 2.0] * 3})

# one-hot encode only the object (string) columns
cols_encoding = df.select_dtypes(include='object').columns
ohe = OneHotEncoder(cols=cols_encoding)
encoded = ohe.fit_transform(df)
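The question also asks about choosing columns by their number of unique values and about tracking which encoded columns came from which original column. The following is a minimal pandas-only sketch of that idea, not the category_encoders approach above. The function name `one_hot_low_cardinality`, the cutoff parameter `max_unique`, and the `=` separator in the new column names are all assumptions made for illustration.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def one_hot_low_cardinality(df, max_unique=10):
    """One-hot encode string-like columns with at most `max_unique` distinct values.

    Returns the encoded frame and a dict mapping each encoded original
    column to the list of new columns created from it.
    (`max_unique` is a hypothetical cutoff, not a library parameter.)
    """
    encoded = df.copy()
    mapping = {}
    for col in df.columns:
        s = df[col]
        # Treat object, string, and categorical columns as candidates.
        is_categorical = (
            is_object_dtype(s)
            or is_string_dtype(s)
            or isinstance(s.dtype, pd.CategoricalDtype)
        )
        if not is_categorical or s.nunique() > max_unique:
            continue
        dummies = pd.get_dummies(s, prefix=col, prefix_sep="=")
        mapping[col] = list(dummies.columns)  # remember where the new columns came from
        encoded = pd.concat([encoded.drop(columns=col), dummies], axis=1)
    return encoded, mapping


df = pd.DataFrame({'a': ['aa', 'bb', 'cc'] * 2, 'b': [True, False] * 3, 'c': [1.0, 2.0] * 3})
encoded, mapping = one_hot_low_cardinality(df, max_unique=5)
```

Here `mapping` links each original column name to its encoded columns, so the encoded frame can be traced back to the original dataset. Columns that are skipped (wrong dtype or too many unique values) are left unchanged in `encoded`.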

Note that you can change how unseen categories are handled with the handle_unknown parameter (a str). The options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’; the default is ‘value’. Warning: if ‘indicator’ is used, an extra column is added when the transform matrix has unknown categories, which can cause unexpected changes in dimension in some cases.
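To illustrate why unseen categories matter, here is a small pandas-only sketch. It does not use category_encoders or `handle_unknown`. Instead it assumes you encode training and new data separately with `pd.get_dummies` and then align the new data to the training columns, so that an unseen category ends up as a row of zeros. The frames `train` and `new` are made-up example data.

```python
import pandas as pd

train = pd.DataFrame({"a": ["aa", "bb", "cc"]})
new = pd.DataFrame({"a": ["aa", "dd"]})  # "dd" never appeared in the training data

train_encoded = pd.get_dummies(train, columns=["a"])
new_encoded = pd.get_dummies(new, columns=["a"])

# Align the new data to the training columns: the column for the unseen
# category ("a_dd") is dropped, and columns missing from the new data are
# filled with 0, so the shape matches what was seen at fit time.
aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0).astype(int)
```

With this alignment the output width stays fixed, whereas an indicator-style approach would add a column and change the dimension.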

Carlos Mougan