I am experimenting with a simple MLPClassifier and one-hot encoding in SKLearn.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder

# Load the data and drop the binary target; the multi-class target is kept as y
data = pd.read_csv("./Synthetic_data.csv", header=0)
filtered_data = data.drop(['diag_binary'], axis=1)
X = filtered_data[['sex', 'age', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q']]
y = filtered_data[['diag_multi']]

# Split first, then fit the encoder on the training split only
datasets = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
train_data, test_data, train_labels, test_labels = datasets

enc = OneHotEncoder(handle_unknown='ignore')
train_data = enc.fit_transform(train_data)
test_data = enc.transform(test_data)

mlp = MLPClassifier(max_iter=1000, batch_size=32, random_state=42)
mlp.fit(train_data, train_labels.values.ravel())
print(mlp.score(test_data, test_labels))
The score of my MLP in this setup is higher than that of the same MLP trained on the data without one-hot encoding. I do not understand how the encoding can change the output of my model.
For context on the dataset: the features 'a' through 'q' all represent survey questions with four possible answers: 'yes', 'no', 'don't know' and 'skipped', encoded as 1, 2, 3 and 9. Since the MLP could otherwise learn an ordinal relation between these codes, I one-hot encode them. I also one-hot encode 'sex', since, like the other features, it is encoded as 1, 2 and 3. The same goes for 'age', although I am not sure whether one-hot encoding is the correct way to handle the age values. The dataset is also very imbalanced, so that may be a problem, but I'm not sure.
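For what it's worth, one variant I am considering (not what I ran above) is to one-hot encode only the survey columns and 'sex' and keep 'age' as a numeric feature via a ColumnTransformer. This is just a sketch; scaling 'age' with StandardScaler is an assumption on my part, not something from my current setup, and it reuses X and y from the code above.

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ['sex', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
                    'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q']
numeric_cols = ['age']

# One-hot encode the survey answers and 'sex'; scale 'age' instead of encoding it (assumption)
preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('scale', StandardScaler(), numeric_cols),
])

pipe = make_pipeline(preprocess,
                     MLPClassifier(max_iter=1000, batch_size=32, random_state=42))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
pipe.fit(X_train, y_train.values.ravel())
print(pipe.score(X_test, y_test))

Putting the encoder and the MLP in one pipeline would also avoid having to call fit_transform and transform by hand, but I don't know if it changes anything about the question above.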