Sklearn predicts different results depending on the input length

Question

Here is the problem: I fitted a Random Forest Classifier and saved it to a pickle file. However, when I predict with the entire dataset I get one result, and when I run predict line by line (in a loop) I get another result. Why is this happening?

Here is one example. The probabilities ("prob_sim" and "prob_nao") look inverted between the two outputs, but this doesn't happen for every row, and the classes ("classe_d1" and "classe_d2") are different too, so I don't think I simply switched the probabilities.

index  index1  0   1  2  3   4     5       6       7       8  prob_sim  prob_nao  classe_d1  df
2764   2764    38  0  0  23  9.72  -24.35  167.21  126.31  3  0.475422  0.524578  nao        d1

index  index1  0   1  2  3   4     5       6       7       8  prob_sim  prob_nao  classe_d2  df
2764   2764    38  0  0  23  9.72  -24.35  167.21  126.31  3  0.526055  0.473945  sim        base

I've found this post, but it did not help me.

Some considerations:

The only preprocessing step for the training data is encoding;
I've tried ensuring that the numbers were all float or int on both predictions;
First I used Pandas DF, then I tried numpy arrays (that's why there are some "to_numpy") on both predictions;
I've tried setting np.random.seed(42).
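
As a baseline for the considerations above, here is a minimal sketch that checks whether a fitted random forest returns the same probabilities for a whole array and for the same rows passed one at a time. The data is synthetic (`make_classification`) and the model is smaller than mine, so both are assumptions for illustration only. If this check passes, the difference in my case more likely comes from the inputs than from the input length itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the dataset from the question).
X, y = make_classification(n_samples=200, n_features=9, random_state=42)

clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

# Predict the whole array in one call.
batch_proba = clf.predict_proba(X)

# Predict one row at a time; each row is reshaped to (1, n_features).
row_proba = np.vstack([clf.predict_proba(row.reshape(1, -1)) for row in X])

# Compare with a tolerance instead of exact equality.
same = np.allclose(batch_proba, row_proba)
print("batch and row-by-row probabilities match:", same)
```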

The model creation:

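# Imports assumed from the names used below; the samplers require imblearn's Pipeline.
import pickle
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
modelo = RandomForestClassifier(
    criterion='entropy',
    max_features='sqrt',
    min_samples_split=4,
    n_estimators=158,
    random_state=42)
over = SMOTE(random_state=42)
under = RandomUnderSampler(random_state=42)
# Resampling steps only run during fit; predict goes straight to the model.
pipe = Pipeline([
    ('o', over),
    ('u', under),
    ('m', modelo)
])
pipe.fit(x_train, y_train)
with open(nome_modelo, 'wb') as f:
    pickle.dump(pipe, f)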

The first prediction process, with the entire dataframe:

with open(nome_modelo, 'rb') as f:
    modelo = pickle.load(f)
# Every column of base is passed to the model here.
d1 = base.copy(deep=True).reset_index(drop=True).to_numpy()
y_new_chance = modelo.predict_proba(d1)
y_new_class = modelo.predict(d1)
d1 = pd.DataFrame(d1)
d1['prob_sim'] = [i[1] for i in y_new_chance]
d1['prob_nao'] = [i[0] for i in y_new_chance]
d1['classe'] = y_new_class
d1['df'] = 'd1'
d1.to_csv('d1.csv', index=False)

The second prediction process, with a loop that predicts line by line:

base = base.reset_index(drop=True)
for idx, row in base.iterrows():
    # Build a one-row array from columns '0'..'8' only.
    x_new = pd.DataFrame({
        '0': int(row['0']),
        '1': int(row['1']),
        '2': int(row['2']),
        '3': int(row['3']),
        '4': row['4'],
        '5': row['5'],
        '6': row['6'],
        '7': row['7'],
        '8': int(row['8'])
    }, index=[idx]).to_numpy()
    y_new_chance = modelo.predict_proba(x_new)
    y_new_class = modelo.predict(x_new)
    base.loc[idx, 'prob_sim'] = y_new_chance[0][1]
    base.loc[idx, 'prob_nao'] = y_new_chance[0][0]
    base.loc[idx, 'classe'] = y_new_class[0]
base['df'] = 'base'

base.to_csv('d2.csv', index=False)
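
One way to narrow this down is to compare the arrays that each process actually passes to the model, before any prediction happens. The sketch below rebuilds both inputs from a small made-up `base` (hypothetical values, apart from the first row, which copies the example above) and checks shape, dtype and values. The names `feature_cols` and `int_cols` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `base`; the real frame is not shown in full.
base = pd.DataFrame({
    '0': [38, 12], '1': [0, 1], '2': [0, 0], '3': [23, 5],
    '4': [9.72, 1.5], '5': [-24.35, 3.0], '6': [167.21, 10.0],
    '7': [126.31, 20.0], '8': [3, 4],
})
feature_cols = [str(i) for i in range(9)]
int_cols = ['0', '1', '2', '3', '8']

# Input of the first process: the selected columns as one array.
batch_input = base[feature_cols].to_numpy()

# Input of the second process: one (1, 9) array per row, built as in the loop.
rows = []
for idx, row in base.iterrows():
    x_new = pd.DataFrame(
        {c: (int(row[c]) if c in int_cols else row[c]) for c in feature_cols},
        index=[idx],
    ).to_numpy()
    rows.append(x_new)
loop_input = np.vstack(rows)

print(batch_input.shape, batch_input.dtype)
print(loop_input.shape, loop_input.dtype)

# True only if both processes feed the model the same numbers in the same order.
inputs_match = (
    batch_input.shape == loop_input.shape
    and np.array_equal(batch_input, loop_input)
)
print("inputs match:", inputs_match)
```

With the real data, a mismatch here (extra columns, a different column order, or values truncated by `int(...)`) would explain different predictions without involving the model at all.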

Then I used the following code to find the rows where the two results differ:

d1 = pd.read_csv('d1.csv')
d2 = pd.read_csv('d2.csv')
d1 = d1.rename(columns={'classe':'classe_d1'}).reset_index()
d2 = d2.rename(columns={'classe':'classe_d2'}).reset_index()
x = pd.merge(d1[['classe_d1', 'index']], d2[['classe_d2', 'index']], on='index', how='left')
# True where both processes predicted the same class.
x['igual'] = x['classe_d1'] == x['classe_d2']
x_diferente = list(x[~x['igual']].index)
for cd in x.loc[x['igual'] == False,'index']:
  print(d1[d1['index'] == cd])
  print(d2[d2['index'] == cd])
  print('\n--------------------------------------------\n')
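
The check above only compares classes. A complementary sketch, assuming the same column names as in the CSV files, compares the probabilities with a tolerance so that rows whose probabilities moved are reported too. The two small frames are hypothetical stand-ins for `d1.csv` and `d2.csv`; the second row reuses the values from the example at the top.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
d1 = pd.DataFrame({'index': [0, 2764],
                   'prob_sim': [0.80, 0.475422],
                   'classe': ['sim', 'nao']})
d2 = pd.DataFrame({'index': [0, 2764],
                   'prob_sim': [0.80, 0.526055],
                   'classe': ['sim', 'sim']})

# Join on the row identifier; overlapping columns get _d1 / _d2 suffixes.
merged = d1.merge(d2, on='index', suffixes=('_d1', '_d2'))
merged['prob_diff'] = (merged['prob_sim_d1'] - merged['prob_sim_d2']).abs()
merged['igual'] = merged['classe_d1'] == merged['classe_d2']

# Keep only rows whose probabilities differ beyond floating-point noise.
mismatches = merged[~np.isclose(merged['prob_sim_d1'], merged['prob_sim_d2'])]
print(mismatches)
```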
