0

I have a medium size data set (7K) of patient age, sex, and pre-existing conditions. Age of course is from 0-101, sex is 1 for male, 2 for female, and -1 for diverse. All the pre-conditions are Boolean. The outcome, death is also Boolean.

Regardless of how I scale the data (I tried lots of scalers), I always get a warning:

FitFailedWarning: Estimator fit failed. The score on this train-test partition for 
                  these parameters will be set to nan.

This traces back to: ValueError("Unknown label type: %r" % y_type) ValueError: Unknown label type: 'unknown'

If I take out the age and sex columns, the error goes away. There are definitely no text, missing, or weird values here.

If I look at my rescaled data, it looks as I would expect it to look.

df.describe of rescaled data

If I drastically simplify the data, it works.

import numpy as np
import pandas as pd
from sklearn import preprocessing

array = np.array([[42, 1, False, False, False, False, False, False, False, False, False, False, False],
[72, 1, False, False, True, False, False, False, False, False, False, True, False],
[77, 2, False, False, False, False, False, False, False, False, True, True, False],
[36, 1, False, False, False, False, False, False, False, False, False, False, False],
[42, 1, False, False, False, False, False, False, False, False, True, False, False],
[82, 1, False, False, False, True, False, False, False, False, False, True, False],
[71, 2, False, False, False, False, False, False, False, False, False, True, False],
[36, -1, False, False, False, False, False, False, False, False, True, False, False], [52, 1, False, False, False, False, False, False, False, False, False, False, False],
[52, 1, False, False, False, False, False, False, False, True, False, True, True],
[77, 2, False, False, False, False, False, False, True, False, True, True, False],
[46, 1, False, False, False, False, False, False, False, False, False, False, False],
[45, 1, False, False, False, False, False, False, False, False, False, False, False],
[88, 1, False, False, False, False, False, True, False, False, False, True, True],
[79, 2, False, True, True, False, False, False, False, False, False, True, True],
[36, -1, True, False, False, False, False, False, False, False, False, False, False]])

X = array[:,0:12] Y = array[:,12] scaler = preprocessing.MinMaxScaler().fit(X) rescaledX = scaler.transform(X)

from sklearn.model_selection import KFold from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression

kfold = KFold(n_splits=3, shuffle=True, random_state=7) # split the data into training and test sets for k-fold validation model = LogisticRegression(solver='lbfgs') # set up model of a linear regression results = cross_val_score(model, rescaledX, Y, cv=kfold) print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()100.0, results.std()100.0))

It would be awesome if someone has an idea of what might be wrong, or how to troubleshoot further.

Ethan
  • 1,657
  • 9
  • 25
  • 39
DrWhat
  • 103
  • 1
  • 4

1 Answers1

1

So, to me what you have to do is :

  • Transform all your your True/False to 1/0, so they're numerical. Keep age as it is (or use some normalisation, but not that necessary

  • Absolutely change the way Sex is handled. You have a big bias since you have 3 values : Since it's numerical, distance matters. Here, distance between "Male" and "Diverse" is 2, and distance between "Female" and "Diverse" is 3. There's no logical reason, seeing your problem, for that. This will bring bias to your model.

You should read this answer : https://datascience.stackexchange.com/a/79575/101580 In your case One Hot Encoder is good enough since you have 3 values.

Adept
  • 904
  • 6
  • 17