Here is the problem: I fitted a RandomForestClassifier and saved it to a pickle file. However, when I call predict on the entire dataset I get one result, and when I run predict row by row (in a loop) I get a different result. Why is this happening?
Here is one example. The probabilities ("prob_sim" and "prob_nao") look inverted between the two outputs, but this doesn't happen for every row. The predicted classes ("classe_d1" and "classe_d2") also differ, so I don't think I simply mixed up the probability columns.
index  index1   0  1  2   3     4       5       6       7  8  prob_sim  prob_nao  classe_d1  df
 2764    2764  38  0  0  23  9.72  -24.35  167.21  126.31  3  0.475422  0.524578        nao  d1

index  index1   0  1  2   3     4       5       6       7  8  prob_sim  prob_nao  classe_d2    df
 2764    2764  38  0  0  23  9.72  -24.35  167.21  126.31  3  0.526055  0.473945        sim  base
I've found this post, but it did not help me.
Some considerations:
- The only preprocessing step for the training data is encoding;
- I've tried ensuring that the numbers were all float or int in both predictions;
- First I used pandas DataFrames, then I tried NumPy arrays in both predictions (that's why there are some "to_numpy" calls);
- I've tried setting np.random.seed(42).
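On the last point: as far as I understand, a fitted forest does no random sampling at prediction time, so the seed should not matter there. A minimal sketch on synthetic data that compares batch and row-by-row predict_proba (the data from make_classification and the names X_demo, y_demo and clf are made up for illustration, not my real dataset or model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical, not the real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=42)
clf = RandomForestClassifier(n_estimators=20, min_samples_split=4, random_state=42)
clf.fit(X_demo, y_demo)

# Batch prediction on the whole array
batch = clf.predict_proba(X_demo)

# Row-by-row prediction; each row is reshaped to (1, n_features)
rowwise = np.vstack([clf.predict_proba(r.reshape(1, -1)) for r in X_demo])

same = np.allclose(batch, rowwise)
print("batch and row-by-row agree:", same)
```

If the two agree on synthetic data but not on the real data, the likely suspect is that the two calls receive different input values rather than anything random in the model.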
The model creation:
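import pickle

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline (assumed), since it accepts samplers such as SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# x, y: encoded features and target, prepared earlier
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
modelo = RandomForestClassifier(
    criterion='entropy',
    max_features='sqrt',
    min_samples_split=4,
    n_estimators=158,
    random_state=42)
# Oversample with SMOTE, then undersample, then fit the forest
over = SMOTE(random_state=42)
under = RandomUnderSampler(random_state=42)
pipe = Pipeline([
    ('o', over),
    ('u', under),
    ('m', modelo)
])
pipe.fit(x_train, y_train)
with open(nome_modelo, 'wb') as f:
    pickle.dump(pipe, f)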
The first prediction process, with the entire dataframe:
import pandas as pd

with open(nome_modelo, 'rb') as f:
    modelo = pickle.load(f)
# Predict on the whole dataset in a single call
d1 = base.copy(deep=True).reset_index(drop=True).to_numpy()
y_new_chance = modelo.predict_proba(d1)
y_new_class = modelo.predict(d1)
d1 = pd.DataFrame(d1)
# predict_proba columns follow modelo.classes_ (assumed here to be ['nao', 'sim'])
d1['prob_sim'] = y_new_chance[:, 1]
d1['prob_nao'] = y_new_chance[:, 0]
d1['classe'] = y_new_class
d1['df'] = 'd1'
d1.to_csv('d1.csv', index=False)
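The hard-coded positions 1 and 0 above rely on the order of the classes. A small sketch of reading that order from classes_ instead, so that "prob_sim" and "prob_nao" cannot be mixed up (toy data; the names X_toy, y_toy and clf are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data (hypothetical): two features, labels 'sim' / 'nao'
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, 'sim', 'nao')

clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# predict_proba columns are ordered like clf.classes_ (sorted labels)
proba = pd.DataFrame(
    clf.predict_proba(X_toy),
    columns=[f'prob_{c}' for c in clf.classes_],
)
proba['classe'] = clf.predict(X_toy)
print(proba.head())
```

With string labels the classes are sorted, so 'nao' comes before 'sim' and the column names are derived from the model rather than assumed.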
The second prediction process, with a loop that predicts line by line:
base = base.reset_index(drop=True)
for idx, row in base.iterrows():
    # Build a single-row input from the nine feature columns
    x_new = pd.DataFrame({
        '0': int(row['0']),
        '1': int(row['1']),
        '2': int(row['2']),
        '3': int(row['3']),
        '4': row['4'],
        '5': row['5'],
        '6': row['6'],
        '7': row['7'],
        '8': int(row['8'])
    }, index=[idx]).to_numpy()
    y_new_chance = modelo.predict_proba(x_new)
    y_new_class = modelo.predict(x_new)
    base.loc[idx, 'prob_sim'] = y_new_chance[0][1]
    base.loc[idx, 'prob_nao'] = y_new_chance[0][0]
    base.loc[idx, 'classe'] = y_new_class[0]
base['df'] = 'base'
base.to_csv('d2.csv', index=False)
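To rule out that the two code paths feed different numbers to the model, the inputs themselves can be compared. A sketch with a made-up three-row frame (base_demo and its values are hypothetical, and only columns '0' and '4' are kept for brevity) that rebuilds each row the way the loop does and flags rows where the int() cast changed a value:

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 has a non-integer value in a column that the loop casts with int()
base_demo = pd.DataFrame({'0': [38.0, 23.7, 5.0], '4': [9.72, -24.35, 167.21]})

# Input of the batch path
batch_input = base_demo.to_numpy()

# Input of the loop path, rebuilt row by row with the same casts
loop_rows = []
for idx, row in base_demo.iterrows():
    x_new = pd.DataFrame({'0': int(row['0']), '4': row['4']}, index=[idx]).to_numpy()
    loop_rows.append(x_new[0])
loop_input = np.vstack(loop_rows)

# Rows where at least one feature differs between the two paths
changed = np.where((batch_input != loop_input).any(axis=1))[0]
print("rows with different inputs:", changed.tolist())  # rows with different inputs: [1]
```

An empty list on the real data would mean both paths pass identical values, and the cause has to be looked for elsewhere.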
Then I used the following code to check for differing results:
d1 = pd.read_csv('d1.csv')
d2 = pd.read_csv('d2.csv')
# reset_index() adds an 'index' column with the row position, used as the join key
d1 = d1.rename(columns={'classe': 'classe_d1'}).reset_index()
d2 = d2.rename(columns={'classe': 'classe_d2'}).reset_index()
x = pd.merge(d1[['classe_d1', 'index']], d2[['classe_d2', 'index']], on='index', how='left')
x['igual'] = x['classe_d1'] == x['classe_d2']
x_diferente = list(x[~x['igual']].index)
# Print both versions of every row whose predicted class differs
for cd in x.loc[~x['igual'], 'index']:
    print(d1[d1['index'] == cd])
    print(d2[d2['index'] == cd])
    print('\n--------------------------------------------\n')
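The same check can be written without the merge when both files hold the same rows in the same order. A sketch on two small made-up frames (d1_demo and d2_demo are hypothetical stand-ins for the contents of the two CSV files):

```python
import pandas as pd

d1_demo = pd.DataFrame({'prob_sim': [0.475422, 0.80], 'classe': ['nao', 'sim'], 'df': 'd1'})
d2_demo = pd.DataFrame({'prob_sim': [0.526055, 0.80], 'classe': ['sim', 'sim'], 'df': 'base'})

# Boolean mask of rows whose predicted class differs (rows are aligned by position)
mask = d1_demo['classe'].ne(d2_demo['classe'])

# Stack the differing rows from both frames so each pair sits next to each other
diff = pd.concat([d1_demo[mask], d2_demo[mask]]).sort_index(kind='mergesort')
print(diff)
```

The stable sort keeps the d1 row before the d2 row for every differing index, which makes the pairs easy to read side by side.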