
I'm trying to predict some variables for MOFs (from a scientific paper) using the Random Forest model in Python, but the value of R² is negative (unlike the paper, where it was positive). I actually don't know if the problem is with my dataset (which involves assigning numerical values to strings) or with my code. The original dataset is the literature dataset and my version is the "Dados editados 3" sheet.

The Dataset link: https://docs.google.com/spreadsheets/d/17r-hxcuuzEFdfsqcAdHe_9iIp1d51IvL/edit?usp=sharing&ouid=111702212107777597741&rtpof=true&sd=true

I have reviewed the code multiple times, but I don't know what I'm doing wrong, since the RF model is supposed to be very robust. Also, I'm removing all NaN values from the dataset and using a function to detect and remove outliers.

The code:

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import MinMaxScaler

from sklearn.ensemble import RandomForestRegressor

from sklearn import preprocessing
from sklearn import metrics

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

from google.colab import drive

drive.mount('/content/drive', force_remount= True)

df = pd.read_excel(r'/content/drive/My Drive/Projeto ML/Dados ML.xlsx', sheet_name = "Dados editados 3")

print(df.shape)

df = df.drop(columns = ['Enzyme loading', 'Immobilisation yield', 'Activity retention'] )

df.replace('-', np.NaN, inplace = True)

df = df.dropna()

print(df.shape)

#Modified function

def get_outliers(l):
    #if you keep 0.1 and 0.75 then pretty much no outliers will be filtered
    #q1 is 0.25 quantile and q3 is 0.75 quantile
    q1 = l.quantile(0.25)
    q3 = l.quantile(0.75)
    iqr = q3 - q1
    fenceLow = q1 - 1.5 * iqr
    fenceHigh = q3 + 1.5 * iqr
    return [~(i >= fenceLow and i <= fenceHigh) for i in l]

outliers = df.apply(get_outliers)

df =df[~outliers.apply(lambda x:any(x), axis=1)]

X = df.iloc[:, :-1]

Y = df.iloc[:,-1]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25, random_state = 42)

print(X)

scaler = preprocessing.MinMaxScaler(feature_range = (0.1, 0.9))

x_train = scaler.fit_transform(x_train)

x_test= scaler.fit_transform(x_test)

from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
from matplotlib import pyplot as plt

rf = RandomForestRegressor()

rf.fit(x_train, y_train)

y_pred = rf.predict(x_test)

k_folds = KFold(n_splits = 6)

scores = cross_val_score(rf, x_train, y_train, cv = k_folds, scoring = "r2")

print("CV Score: ", scores) print("Average CV Score: ", scores.mean()) print("Number of CV Scores used in Average: ", len(scores))

Any suggestions about the code or the dataset itself?



1 Answer


I had a go at processing the data and fitting a random forest.

My steps were:

  • Drop columns: DOI, Amino group
  • Drop rows with no y value
  • Convert strings to numbers, and one-hot encode Ligand column
  • Put a test set aside (shuffling and stratified sampling were used)
  • Impute missing categorical and numerical values

These steps were encapsulated in a pipeline that allows for full separation of train-validation data during CV.
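A condensed sketch of that structure is below; the column lists categorical_cols, numeric_features and ligand_col, and the split X_train/y_train, are defined in the full code at the end of this answer.

#Condensed sketch of the pipeline idea: imputation and encoding are fitted
#inside each CV training fold only, so nothing leaks from the validation fold.
#categorical_cols, numeric_features, ligand_col, X_train and y_train are
#assumed to be defined as in the full code further down.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold

imputer = ColumnTransformer([
    ('cat imputer', SimpleImputer(strategy='most_frequent'), categorical_cols),
    ('num imputer', SimpleImputer(strategy='median'), numeric_features),
], verbose_feature_names_out=False)

encoder = ColumnTransformer([
    ('ligand encoder', OneHotEncoder(sparse_output=False, drop='if_binary'), [ligand_col]),
], remainder='passthrough', verbose_feature_names_out=False)

pipeline = make_pipeline(imputer, encoder, RandomForestRegressor(random_state=0))
scores = cross_val_score(pipeline, X_train, y_train,
                         cv=KFold(n_splits=6, shuffle=True, random_state=0), scoring='r2')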

I ran KFold CV and got a median $R^2$ of 0.57.


One of the folds had a negative (close to zero) $R^2$. I ran CV again, but this time recording all of the predictions and plotting them for each of the 6 folds:

[Figure: train and validation predictions for each of the 6 folds]

The model seems to learn the training set well, but generalise relatively poorly. The fold with the lowest score (first plot) had relatively few values at the lower end, and was variable at the upper end. Even though it had a low $R^2$, its RMSE of 20 was as good as most other folds. The folds generally had flat-looking test predictions apart from at y=0 - is y=0 a valid value?
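For reference, $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, where $\bar{y}$ is the mean of the fold's true values; on a fold whose true values have little spread, the denominator is small, so an RMSE of 20 that is acceptable on other folds can still push $R^2$ down to zero or below.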

I finally ran leave-one-out CV to see how well it predicts individual samples when trained on the rest of the data.

[Figure: leave-one-out predictions vs. true values]

To me it seems like the model isn't very good at making predictions from these features. I think the $R^2$ could be skewed by the y=0 entries. I then deleted the y=0 rows and re-ran the CV, and the score dropped to 0.15.

The bad predictions might be a data issue: beyond y=0 the model doesn't predict well. Tuning the RF didn't yield a significant improvement either. sklearn.ensemble.GradientBoostingRegressor did a bit better, but I think one ought to try a wider range of models before drawing a definitive conclusion as to whether it's a data problem or a model problem.
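As a sketch (not something I ran), one could screen a few model families under the same preprocessing and folds, reusing preprocessor, X_train, y_train and kfold_cv from the code below:

#Hedged sketch: compare several regressors with identical preprocessing and CV folds.
#Assumes preprocessor, X_train, y_train and kfold_cv are defined as in the code below.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

candidates = {
    'random forest': RandomForestRegressor(random_state=0),
    'gradient boosting': GradientBoostingRegressor(random_state=0),
    'ridge': Ridge(),
    'svr': SVR(),
    'knn': KNeighborsRegressor(),
}

for name, candidate in candidates.items():
    candidate_scores = cross_val_score(make_pipeline(preprocessor, candidate),
                                       X_train, y_train, cv=kfold_cv,
                                       scoring='r2', n_jobs=-1)
    print(f'{name}: median R2 = {np.median(candidate_scores):.3f}')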

Could you point to the paper that reported their attempt?

My analysis code:

import pandas as pd
import numpy as np

from sklearn import set_config
set_config(transform_output = "pandas")

np.random.seed(0) #reproducible results

df = pd.read_csv('Dados ML.xlsx - Literature dataset.csv', header=1)

#Remove the DOI column
df = df.drop(columns='DOI')

#Drop amino column
amino_col = 'Dominant amino acid group of enzymes'
df = df.drop(columns=amino_col)

#Replace commas with a dot and '-' with NaN to allow numerical conversion
df = df.apply(lambda col: col.str.replace(',', '.').replace('-', np.nan), axis=1)

#Identify categorical vs numeric columns
ligand_col = 'Ligand precursor functional group'
categorical_cols = [ligand_col,]

numeric_cols = [col for col in df.columns if col not in categorical_cols]

#Convert numbers (currently strings) to numeric types
df[numeric_cols] = df[numeric_cols].astype(np.float64)

#Drop unavailable targets
target_col = 'Reusability: Relative activity after 7 reuse cycles (%)'
df = df.drop(index=df.index[ df[target_col].isna() ])

df = df.loc[df[target_col] != 0] #Drop zeros assuming they're invalid

from matplotlib import pyplot as plt

def quick_hist(df, title=''):
    df_num = df.select_dtypes(np.float64)
    axs = df_num.hist(bins=50, grid=False, figsize=(7, 5))
    for i, name in enumerate(df_num.columns):
        col_num = df.columns.to_list().index(name)
        axs.flatten()[i].set_title(f'[{col_num}] ' + name[:min(16, len(name))], fontsize=9)
    plt.suptitle(title, fontsize=12)
    plt.tight_layout()
    plt.show()

#Features/target for preprocessing
features = df.drop(columns=target_col)
target = df[target_col]
quick_hist(features, 'features before outlier removal')

outliers_bool = (
    (features.iloc[:, 2] > 10)
    | (features.iloc[:, 3] > 50)
    | (features.iloc[:, 5] > 50)
    | (features.iloc[:, 10] > 200_000)
    | (features.iloc[:, 11] > 100_000)
)

print('Detected', outliers_bool.sum(), 'outliers')
df_inliers = df.drop(index=features.index[ outliers_bool ])
features_inliers = df_inliers.drop(columns=target_col)
target_inliers = df_inliers[target_col]
quick_hist(features_inliers, 'features after outlier removal')

drop_outliers = False
if drop_outliers:
    print('Dropping outliers'.upper())
    use_features = features_inliers
    use_target = target_inliers
else:
    print('NOT dropping outliers')
    use_features = features
    use_target = target

#Split off a test set to be used as a final unbiased measure
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    use_features, use_target, test_size=0.15,
    stratify=pd.cut(use_target, bins=5),
    shuffle=True, random_state=1000
)

#Use imputation to fill in missing data
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

cat_transformer = SimpleImputer(strategy='most_frequent')
num_transformer = SimpleImputer(strategy='median')
numeric_features = [col for col in numeric_cols if col != target_col]

imputer = ColumnTransformer(
    [
        ('Categorical imputer', cat_transformer, categorical_cols),
        ('Numeric imputer', num_transformer, numeric_features)
    ],
    verbose_feature_names_out=False
)

#Encode categorical column
from sklearn.preprocessing import OneHotEncoder

ligand_encoder = OneHotEncoder(sparse_output=False, drop='if_binary')
cat_encoder = ColumnTransformer(
    [
        ('Ligand encoder', ligand_encoder, [ligand_col]),
    ],
    remainder='passthrough',
    verbose_feature_names_out=False
)

#Compose a preprocessor pipeline
from sklearn.pipeline import make_pipeline

preprocessor = make_pipeline(imputer, cat_encoder)

#Try a random forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

#Shuffle the data per fold for kfold
kfold_cv = KFold(n_splits=6, shuffle=True, random_state=0)
loo_cv = LeaveOneOut()

model = RandomForestRegressor(random_state=np.random.RandomState(0))

#Alternative models tried; uncomment one of these to swap it in for the random forest.
#The scores reported further down are for the random forest defined above.

#Dummy regressor
from sklearn.dummy import DummyRegressor
#model = DummyRegressor(strategy='median')

#Another model I tried
from sklearn import ensemble
#model = ensemble.GradientBoostingRegressor(learning_rate=0.1/4, random_state=np.random.RandomState(0))

#Compose pipeline with model
pipeline = make_pipeline(preprocessor, model)

scores = cross_val_score(
    estimator=pipeline, X=X_train, y=y_train,
    cv=kfold_cv, scoring='r2', n_jobs=-1
)

print('median score:', np.median(scores).round(3))
print('scores:', scores.round(3))

""" median score: 0.569 scores: [-0.008 0.473 0.613 0.539 0.799 0.599] """

Let's inspect each fold:

from matplotlib import pyplot as plt
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_squared_error, r2_score

#Get complete results
cv_results = cross_validate(
    estimator=pipeline, X=X_train, y=y_train,
    cv=kfold_cv, scoring='r2', n_jobs=-1,
    return_estimator=True, return_indices=True
)

trn_predictions = []
val_predictions = []

f, axs = plt.subplots(3, 2, figsize=(8, 4), sharex=True, sharey=True)
axs = axs.flatten()

for fold in range(kfold_cv.get_n_splits()):
    fold_estimator = cv_results['estimator'][fold]
    fold_trn_indices = cv_results['indices']['train'][fold]
    fold_val_indices = cv_results['indices']['test'][fold]

    trn_predictions.append(
        fold_estimator.predict(X_train.iloc[fold_trn_indices])
    )

    val_predictions.append(
        fold_estimator.predict(X_train.iloc[fold_val_indices])
    )

    #Print fold's RMSE and R2
    fold_val_y = y_train.iloc[fold_val_indices]

    print(f'Fold {fold}\n val RMSE:',
          mean_squared_error(fold_val_y, val_predictions[fold], squared=False))

    print(' val R2:',
          r2_score(fold_val_y, val_predictions[fold]), '\n')

    #Scatter of train and val predictions for fold
    fold_trn_y = y_train.iloc[fold_trn_indices]
    axs[fold].scatter(fold_trn_y, trn_predictions[fold], marker='.', label='train predictions')
    axs[fold].scatter(fold_val_y, val_predictions[fold], marker='.', label='val predictions')
    axs[fold].axline(xy1=(0, 0), slope=1, color='black', linewidth=1, linestyle='--')

axs[-1].set_xlabel('$y$', fontsize=12)
axs[0].set_ylabel(r'$\hat{y}$', fontsize=12)
axs[0].legend(bbox_to_anchor=(0.6, 1.7))
plt.show()

#Use leave-one-out CV

from sklearn.model_selection import cross_validate

cv_results = cross_validate(
    estimator=pipeline, X=X_train, y=y_train,
    cv=loo_cv, scoring='neg_mean_absolute_error',
    return_estimator=True, return_indices=True, n_jobs=-1
)

#Get out-of-sample predictions
predictions = np.empty(y_train.shape)
for fold in range(loo_cv.get_n_splits(X_train)):
    predictions[fold] = cv_results['estimator'][fold].predict(
        X_train.iloc[cv_results['indices']['test'][fold]]
    ).item()

print('LOO scores')
print(' RMSE: ', mean_squared_error(y_train, predictions, squared=False))
print(' R2:', r2_score(y_train, predictions).round(3))

val_indices_flat = np.concatenate(cv_results['indices']['test'])

from matplotlib import pyplot as plt
plt.scatter(y_train.iloc[val_indices_flat], predictions, alpha=0.6)
plt.axline(xy1=(0, 0), slope=1, color='black', linestyle=':')
plt.xlim(-1, 101)
plt.ylim(-1, 101)
plt.xlabel('$y$', fontsize=12)
plt.ylabel(r'$\hat{y}$', fontsize=12)
plt.gcf().set_size_inches(5, 3)

"""

Final evaluation on the test set

fitted_pipeline = pipeline.fit(X_train, y_train) test_predictions = pipeline.predict(X_test) test_r2 = r2_score(y_test, test_predictions) test_rmse = mean_squared_error(y_test, test_predictions, squared=False)

print('Final test set scores') print('RMSE', round(test_rmse, 2)) print('R2:', round(test_r2, 3)) """