Given a pandas DataFrame as below:
import pandas as pd
from sklearn.metrics import mean_squared_error
df = pd.DataFrame.from_dict(
{'row': ['a','b','c','d','e','y'],
'a': [ 0, -.8,-.6,-.3, .8, .01],
'b': [-.8, 0, .5, .7,-.9, .01],
'c': [-.6, .5, 0, .3, .1, .01],
'd': [-.3, .7, .3, 0, .2, .01],
'e': [ .8,-.9, .1, .2, 0, .01],
'y': [ .01, .01, .01, .01, .01, 0],
}).set_index('row')
df.columns.names = ['col']
I want to create a new column of RMSE values (from scikit-learn) using specific columns for the arguments. Namely, the columns y_true = df['a','b','c'] vs y_pred = df['x','y','x']. This was easy to do using an iterative approach:
for tup in df.itertuples():
df.at[tup[0], 'rmse'] = mean_squared_error(tup[1:4], tup[4:7])**0.5
And that gives the desired result:
col a b c d e y rmse
row
a 0.00 -0.80 -0.60 -0.30 0.80 0.01 1.003677
b -0.80 0.00 0.50 0.70 -0.90 0.01 1.048825
c -0.60 0.50 0.00 0.30 0.10 0.01 0.568653
d -0.30 0.70 0.30 0.00 0.20 0.01 0.375988
e 0.80 -0.90 0.10 0.20 0.00 0.01 0.626658
y 0.01 0.01 0.01 0.01 0.01 0.00 0.005774
But I want a higher-performance solution, possibly using vectorization, since my dataframe has shape (180000000, 52). I also dislike indexing by tuple position rather than by column name. The attempt below:
df['rmse'] = df.apply(mean_squared_error(df[['a','b','c']], df[['d','e','y']])**0.5, axis=1)
Gets the error:
TypeError: ("'numpy.float64' object is not callable", 'occurred at index a')
So what am I doing wrong with my use of df.apply()? Does this even maximize performance over iteration?
Testing Performance
I've tested the wall times for each of the first two respondants using the below test df:
# set up test df
dim_x, dim_y = 50, 1000000
cols = ["a_"+str(i) for i in range(1,(dim_x//2)+1)]
cols_b = ["b_"+str(i) for i in range(1,(dim_x//2)+1)]
cols.extend(cols_b)
shuffle(cols)
df = pd.DataFrame(np.random.uniform(0,10,[dim_y, dim_x]), columns=cols) #, index=idx, columns=cols
a = df.values
# define column samples
def column_index(df, query_cols):
cols = df.columns.values
sidx = np.argsort(cols)
return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]
c0 = [s for s in cols if "a" in s]
c1 = [s for s in cols if "b" in s]
s0 = a[:,column_index(df, c0)]
s1 = a[:,column_index(df, c1)]
The results are as follows:
%%time
# approach 1 - divakar
rmse_out = np.sqrt(((s0 - s1)**2).mean(1))
df['rmse_out'] = rmse_out
Wall time: 393 ms
%%time
# approach 2 - divakar
diffs = s0 - s1
rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)
df['rmse_out'] = rmse_out
Wall time: 228 ms
%%time
# approach 3 - divakar
diffs = s0 - s1
rmse_out = np.sqrt((np.einsum('ij,ij->i',s0,s0) + \
np.einsum('ij,ij->i',s1,s1) - \
2*np.einsum('ij,ij->i',s0,s1))/3.0)
df['rmse_out'] = rmse_out
Wall time: 421 ms
The solution using the apply function is still running after several minutes...