
I have a regression model that I want to use to make predictions based on values that I will get from an end user.
In my dataset, I have one categorical variable, region, which I one-hot encoded. This generated 53 new columns (54 regions).
Now my data has the shape 1000x72. I then split it into training and testing sets, and my model is working fine.
But I'm confused about how my model would predict on new values. Since I will only get one value for region from the end user, one-hot encoding that single value will produce a row that no longer matches the shape the model was trained on: it will have the shape 1x18. I'm really confused about how I would fit it into the model this way. Do I just create the 53 other columns and put a dummy 0 in each one?
Sorry if this is a trivial question. I'm very much a beginner at this, and any help would be greatly appreciated!

from sklearn.preprocessing import OneHotEncoder
region_ohe = OneHotEncoder(categories="auto", handle_unknown="ignore")
X_encoded = region_ohe.fit_transform(df['region'].values.reshape(-1, 1)).toarray()
Ben Reiniger
IngridX

2 Answers


With sklearn's OneHotEncoder, the categories are baked in after fitting. You can apply the encoding to new data with region_ohe.transform(x_new). (And, as you might guess, fit_transform just calls fit then transform.)
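As a minimal sketch of that workflow: the region names and the `x_new` / `x_unknown` variables below are made up for illustration, and the encoder settings are copied from the question. The point is only that the fitted encoder, not a fresh one, is applied to the user's single value.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data; these region names are invented for the example.
df = pd.DataFrame({"region": ["north", "south", "east", "west", "south"]})

# Fit once on the training data; the learned categories are stored on the encoder.
region_ohe = OneHotEncoder(categories="auto", handle_unknown="ignore")
X_encoded = region_ohe.fit_transform(df[["region"]]).toarray()

# A single value from the end user, shaped as one row with one column.
x_new = pd.DataFrame({"region": ["south"]})
x_new_encoded = region_ohe.transform(x_new).toarray()

# Same number of columns as the training encoding, with a single 1 in it.
print(X_encoded.shape, x_new_encoded.shape)

# With handle_unknown="ignore", a region never seen in training encodes to all zeros.
x_unknown = region_ohe.transform(pd.DataFrame({"region": ["mars"]})).toarray()
print(x_unknown)
```

In practice the fitted encoder would be saved alongside the model (for example inside an sklearn `Pipeline`) so that the same categories are available at prediction time.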

Ben Reiniger

Say you have a column with numerical regions:

r
1
2
3

One-hot encoding (aka "dummies" or indicators) gives:

r1 r2 r3
1  0  0
0  1  0
0  0  1

Read the docs for Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
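For illustration, the small table above can be reproduced roughly like this. The `prefix`, `prefix_sep` and `dtype` arguments are chosen here only so that the column names and values match the table; they are not something the question uses.

```python
import pandas as pd

# The toy column of numerical regions from the example above.
df = pd.DataFrame({"r": [1, 2, 3]})

# One indicator column per distinct region value: r1, r2, r3.
dummies = pd.get_dummies(df["r"], prefix="r", prefix_sep="", dtype=int)
print(dummies)
```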

Needless to say, your trained model needs to see the same data structure (i.e. the same variables or features) in the data you want to predict on as in the data it was trained on.

If the model was trained on one-hot encoded data, set the column for the user-supplied region to 1 and all other region columns to 0 to make a prediction.
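One way to do this with pandas, sketched under the assumption that the training columns were kept around (the `train_columns` and `user_row` names are hypothetical), is to encode the user's value and then reindex it against the training columns, filling the missing ones with 0:

```python
import pandas as pd

# Toy training data using the numerical regions from the example above.
train = pd.DataFrame({"r": [1, 2, 3, 2]})
train_dummies = pd.get_dummies(train["r"], prefix="r", prefix_sep="", dtype=int)
train_columns = train_dummies.columns  # remember the column layout used in training

# A single user-supplied region produces only one dummy column (r2 here).
user_input = pd.DataFrame({"r": [2]})
user_dummies = pd.get_dummies(user_input["r"], prefix="r", prefix_sep="", dtype=int)

# Align to the training layout; regions the user did not pick are filled with 0.
user_row = user_dummies.reindex(columns=train_columns, fill_value=0)
print(user_row)
```

Note that `reindex` also drops any column that is not in `train_columns`, so a region unseen during training ends up as a row of zeros.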

Peter