
Here is the problem: I fitted a Random Forest Classifier and saved it to a pickle file. However, when I predict on the entire dataset at once I get one result, and when I run predict line by line (in a loop) I get another. Why is this happening?

Here is one example. The probabilities ("prob_sim" and "prob_nao") are inverted between the two outputs, but this doesn't happen for every row, and the classes ("classe_d1" and "classe_d2") are different too, so I don't think I simply switched the probabilities.

index index1  0 1 2  3    4      5      6      7 8 prob_sim prob_nao classe_d1 df
 2764   2764 38 0 0 23 9.72 -24.35 167.21 126.31 3 0.475422 0.524578       nao d1

index index1  0 1 2  3    4      5      6      7 8 prob_sim prob_nao classe_d2 df
 2764   2764 38 0 0 23 9.72 -24.35 167.21 126.31 3 0.526055 0.473945       sim base

I've found this post, but it did not help me.

Some considerations:

  1. The only preprocessing step for the training data is encoding;
  2. I've tried ensuring that the numbers were all float or int on both predictions;
  3. First I used Pandas DF, then I tried numpy arrays (that's why there are some "to_numpy") on both predictions;
  4. I've tried setting np.random.seed(42).


The model creation:

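import pickle

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline: sklearn's does not accept samplers
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

modelo = RandomForestClassifier(
    criterion='entropy',
    max_features='sqrt',
    min_samples_split=4,
    n_estimators=158,
    random_state=42,
)
over = SMOTE(random_state=42)
under = RandomUnderSampler(random_state=42)

# Oversample, then undersample, then fit the forest
pipe = Pipeline([('o', over), ('u', under), ('m', modelo)])

pipe.fit(x_train, y_train)
pickle.dump(pipe, open(nome_modelo, 'wb'))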

The first prediction process, with the entire dataframe:

modelo = pickle.load(open(nome_modelo,'rb'))
d1 = base.copy(deep=True).reset_index(drop=True).to_numpy()
y_new_chance = modelo.predict_proba(d1)
y_new_class = modelo.predict(d1)
d1 = pd.DataFrame(d1)

d1['prob_sim'] = [i[1] for i in y_new_chance]
d1['prob_nao'] = [i[0] for i in y_new_chance]
d1['classe'] = y_new_class
d1['df'] = 'd1'
d1.to_csv('d1.csv',index=False)
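A minimal sketch of a sanity check for the probability columns. scikit-learn orders the columns of predict_proba the same way as the fitted estimator's classes_ attribute, so looking columns up by label avoids relying on positions 0 and 1. The helper name proba_columns is hypothetical, and the labels 'nao' and 'sim' are assumed from the outputs shown above; with the real model, classes would come from modelo.classes_.

```python
import numpy as np

def proba_columns(classes, proba):
    """Map each class label to its column of predicted probabilities.

    `classes` is expected to be the fitted model's `classes_` attribute and
    `proba` the array returned by `predict_proba` (same column order).
    """
    proba = np.asarray(proba)
    return {str(label): proba[:, i] for i, label in enumerate(classes)}

# Illustrative values only, copied from the first output in the question.
example = proba_columns(['nao', 'sim'], [[0.524578, 0.475422]])
prob_sim = example['sim']
prob_nao = example['nao']
```

If the labels are the strings 'nao' and 'sim', this gives the same assignment as indexing with [0] and [1], which would suggest the columns are not swapped.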

The second prediction process, with a loop that predicts line by line:

base = base.reset_index(drop=True)
for idx, row in base.iterrows():
    x_new = pd.DataFrame({
        '0': int(row['0']),
        '1': int(row['1']),
        '2': int(row['2']),
        '3': int(row['3']),
        '4': row['4'],
        '5': row['5'],
        '6': row['6'],
        '7': row['7'],
        '8': int(row['8'])
    }, index=[idx]).to_numpy()
    y_new_chance = modelo.predict_proba(x_new)
    y_new_class = modelo.predict(x_new)
    base.loc[idx, 'prob_sim'] = y_new_chance[0][1]
    base.loc[idx, 'prob_nao'] = y_new_chance[0][0]
    base.loc[idx, 'classe'] = y_new_class[0]
base['df'] = 'base'

base.to_csv('d2.csv',index=False)
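One thing that could be checked before comparing predictions is whether the two processes feed the model the same numbers. The sketch below rebuilds each row the way the loop does and compares it with the matching row of the batch array. It is only a diagnostic under stated assumptions: the function names are hypothetical, base is assumed to hold numeric columns only, and the toy frame is made up to show how an int() cast on a non-integer value would be flagged. It does not claim this is the cause of the mismatch.

```python
import numpy as np
import pandas as pd

FEATURES = ['0', '1', '2', '3', '4', '5', '6', '7', '8']  # names used in the loop
INT_COLS = ['0', '1', '2', '3', '8']                      # columns the loop casts with int()

def build_row_input(row):
    """Rebuild a single-row input the same way the loop does."""
    values = {c: (int(row[c]) if c in INT_COLS else row[c]) for c in FEATURES}
    return pd.DataFrame(values, index=[0]).to_numpy()

def rows_with_different_input(base):
    """Return row positions where the loop input differs from the batch input."""
    base = base.reset_index(drop=True)
    batch = base.to_numpy()  # what the first process passes to the model
    different = []
    for pos, row in base.iterrows():
        x_new = build_row_input(row)
        same_width = x_new.shape[1] == batch.shape[1]
        # Assumes numeric data; a width mismatch is reported without comparing values.
        if not same_width or not np.allclose(x_new[0].astype(float),
                                             batch[pos].astype(float)):
            different.append(pos)
    return different

# Made-up toy data: the second row has a non-integer value in column '0'.
toy = pd.DataFrame(
    [[38, 0, 0, 23, 9.72, -24.35, 167.21, 126.31, 3],
     [38.7, 0, 0, 23, 9.72, -24.35, 167.21, 126.31, 3]],
    columns=FEATURES,
)
flagged = rows_with_different_input(toy)
```

Rows reported here receive different inputs in the two processes, so different predictions for them would be expected. An empty list would point the search elsewhere, for example at extra columns or column order in base.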

Then I used the following code to check for differing results:

d1 = pd.read_csv('d1.csv')
d2 = pd.read_csv('d2.csv')

d1 = d1.rename(columns={'classe':'classe_d1'}).reset_index()
d2 = d2.rename(columns={'classe':'classe_d2'}).reset_index()

x = pd.merge(d1[['classe_d1','index']], d2[['classe_d2','index']], left_on='index', right_on='index', how='left')
x['igual'] = [True if i == j else False for i, j in zip(x['classe_d1'], x['classe_d2'])]
x_diferente = list(x[x['igual'] == False].index)

for cd in x.loc[x['igual'] == False, 'index']:
    print(d1[d1['index'] == cd])
    print(d2[d2['index'] == cd])
    print('\n--------------------------------------------\n')
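The same comparison can be sketched more compactly by joining the two tables on their row index and filtering with a boolean mask. The function name mismatched_rows is hypothetical, the sample frames are made up, and both frames are assumed to share the same row order and a 'classe' column.

```python
import pandas as pd

def mismatched_rows(d1, d2, col='classe'):
    """Return the rows whose value in `col` differs between d1 and d2."""
    merged = d1[[col]].join(d2[[col]], lsuffix='_d1', rsuffix='_d2')
    return merged[merged[f'{col}_d1'] != merged[f'{col}_d2']]

# Made-up sample data for illustration.
sample_d1 = pd.DataFrame({'classe': ['sim', 'nao', 'sim']})
sample_d2 = pd.DataFrame({'classe': ['sim', 'sim', 'sim']})
diff = mismatched_rows(sample_d1, sample_d2)
```

The index of the returned frame gives the row numbers to inspect in both CSV files.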

Juarez