Well you are doing a lot of optimizations in your answer post. I would like to add few more (mostly tweaks). I would build upon the winner from the answer post, which seems to be numexpr based on.
Tweak #1
First off, np.sum(X ** 2, axis = -1) could be optimized with np.einsum. Though this part isn't the biggest overhead, but optimization of any sort won't hurt. So, that summation could be expressed as -
X_norm = np.einsum('ij,ij->i',X,X)
Tweak #2
Secondly, we could leverage Scipy supported blas functions and if allowed use single-precision dtype for noticeable performance improvement over its double precision one. Hence, np.dot(X, X.T) could be computed with SciPy's sgemm like so -
sgemm(alpha=1.0, a=X, b=X, trans_b=True)
Few more tweaks on rearranging the negative sign with gamma lets us feed more to sgemm. Also, we would push in gamma into the alpha term.
Tweaked implementations
Thus, with these two optimizations, we would have two more variants (if I could put it that way) of the numexpr method, listed below -
from scipy.linalg.blas import sgemm
def app1(X, gamma, var):
X_norm = -np.einsum('ij,ij->i',X,X)
return ne.evaluate('v * exp(g * (A + B + 2 * C))', {\
'A' : X_norm[:,None],\
'B' : X_norm[None,:],\
'C' : np.dot(X, X.T),\
'g' : gamma,\
'v' : var\
})
def app2(X, gamma, var):
X_norm = -gamma*np.einsum('ij,ij->i',X,X)
return ne.evaluate('v * exp(A + B + C)', {\
'A' : X_norm[:,None],\
'B' : X_norm[None,:],\
'C' : sgemm(alpha=2.0*gamma, a=X, b=X, trans_b=True),\
'g' : gamma,\
'v' : var\
})
Runtime test
Numexpr based one from your answer post -
def app0(X, gamma, var):
X_norm = np.sum(X ** 2, axis = -1)
return ne.evaluate('v * exp(-g * (A + B - 2 * C))', {
'A' : X_norm[:,None],
'B' : X_norm[None,:],
'C' : np.dot(X, X.T),
'g' : gamma,
'v' : var
})
Timings and verification -
In [165]: # Setup
...: X = np.random.randn(10000, 512)
...: gamma = 0.01
...: var = 5.0
In [166]: %timeit app0(X, gamma, var)
...: %timeit app1(X, gamma, var)
...: %timeit app2(X, gamma, var)
1 loop, best of 3: 1.25 s per loop
1 loop, best of 3: 1.24 s per loop
1 loop, best of 3: 973 ms per loop
In [167]: np.allclose(app0(X, gamma, var), app1(X, gamma, var))
Out[167]: True
In [168]: np.allclose(app0(X, gamma, var), app2(X, gamma, var))
Out[168]: True