What is the fastest way to compute an RBF kernel in python?

Question

I would like to compute an RBF or "Gaussian" kernel for a data matrix X with n rows and d columns. The resulting square kernel matrix is given by:

K[i,j] = var * exp(-gamma * ||X[i] - X[j]||^2)

var and gamma are scalars.

What is the fastest way to do this in python?

You might want to offer some up some test data here if performance is the main concern - otherwise it is tricky for anyone else to offer alternative solutions and compare them appropriately. — miradulo, Nov 13 '17 at 19:22
@Divakar yes, they are, I'll add that to the question. Thank you for pointing me to tis. — Callidior, Nov 13 '17 at 20:07

Callidior · Answer 1 · 2017-11-13T20:09:12.997

I am going to present four different methods for computing such a kernel, followed by a comparison of their run-time.

Using pure numpy

Here, I use the fact that ||x-y||^2 = ||x||^2 + ||y||^2 - 2 * x^T * y.

import numpy as np

X_norm = np.sum(X ** 2, axis = -1)
K = var * np.exp(-gamma * (X_norm[:,None] + X_norm[None,:] - 2 * np.dot(X, X.T)))

Using numexpr

numexpr is a python package that allows for efficient and parallelized array operations on numpy arrays. We can use it as follows to perform the same computation as above:

import numpy as np
import numexpr as ne

X_norm = np.sum(X ** 2, axis = -1)
K = ne.evaluate('v * exp(-g * (A + B - 2 * C))', {
        'A' : X_norm[:,None],
        'B' : X_norm[None,:],
        'C' : np.dot(X, X.T),
        'g' : gamma,
        'v' : var
})

Using `scipy.spatial.distance.pdist`

We could also use scipy.spatial.distance.pdist to compute a non-redundant array of pairwise squared euclidean distances, compute the kernel on that array and then transform it to a square matrix:

import numpy as np
from scipy.spatial.distance import pdist, squareform

K = squareform(var * np.exp(-gamma * pdist(X, 'sqeuclidean')))
K[np.arange(K.shape[0]), np.arange(K.shape[1])] = var

Using `sklearn.metrics.pairwise.rbf_kernel`

sklearn provides a built-in method for direct computation of an RBF kernel:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

K = var * rbf_kernel(X, gamma = gamma)

Run-time comparison

I use 25,000 random samples of 512 dimensions for testing and perform experiments on an Intel Core i7-7700HQ (4 cores @ 2.8 GHz). More precisely:

X = np.random.randn(25000, 512)
gamma = 0.01
var = 5.0

Each method is run 7 times and the mean and standard deviation of the time per execution is reported.

|               Method                |       Time        |
|-------------------------------------|-------------------|
| numpy                               | 24.2 s ± 1.06 s   |
| numexpr                             | 8.89 s ± 314 ms   |
| scipy.spatial.distance.pdist        | 2min 59s ± 312 ms |
| sklearn.metrics.pairwise.rbf_kernel | 13.9 s ± 757 ms   |

First of all, scipy.spatial.distance.pdist is surprisingly slow.

numexpr is almost 3 times faster than the pure numpy method, but this speed-up factor will vary with the number of available CPUs.

sklearn.metrics.pairwise.rbf_kernel is not the fastest way, but only a bit slower than numexpr.

So, what was the shape of the input(s) for the test setup? Was it 25000 x 512 for `X`? — Divakar, Nov 13 '17 at 19:09
Could you add the test setup, so that we could grab those from here and test out rather than we assume things? — Divakar, Nov 13 '17 at 20:09
@Divakar I've added you I generated the data, as well as `gamma` and `var`. I simply used the `%%timeit` magic from IPython for the measurements. I'm curious whether anyone gets to different findings. — Callidior, Nov 13 '17 at 20:12

Divakar · Accepted Answer · 2017-11-13T20:46:42.733

Well you are doing a lot of optimizations in your answer post. I would like to add few more (mostly tweaks). I would build upon the winner from the answer post, which seems to be numexpr based on.

Tweak #1

First off, np.sum(X ** 2, axis = -1) could be optimized with np.einsum. Though this part isn't the biggest overhead, but optimization of any sort won't hurt. So, that summation could be expressed as -

X_norm = np.einsum('ij,ij->i',X,X)

Tweak #2

Secondly, we could leverage Scipy supported blas functions and if allowed use single-precision dtype for noticeable performance improvement over its double precision one. Hence, np.dot(X, X.T) could be computed with SciPy's sgemm like so -

sgemm(alpha=1.0, a=X, b=X, trans_b=True)

Few more tweaks on rearranging the negative sign with gamma lets us feed more to sgemm. Also, we would push in gamma into the alpha term.

Tweaked implementations

Thus, with these two optimizations, we would have two more variants (if I could put it that way) of the numexpr method, listed below -

from scipy.linalg.blas import sgemm

def app1(X, gamma, var):
    X_norm = -np.einsum('ij,ij->i',X,X)
    return ne.evaluate('v * exp(g * (A + B + 2 * C))', {\
        'A' : X_norm[:,None],\
        'B' : X_norm[None,:],\
        'C' : np.dot(X, X.T),\
        'g' : gamma,\
        'v' : var\
    })

def app2(X, gamma, var):
    X_norm = -gamma*np.einsum('ij,ij->i',X,X)
    return ne.evaluate('v * exp(A + B + C)', {\
        'A' : X_norm[:,None],\
        'B' : X_norm[None,:],\
        'C' : sgemm(alpha=2.0*gamma, a=X, b=X, trans_b=True),\
        'g' : gamma,\
        'v' : var\
    })

Runtime test

Numexpr based one from your answer post -

def app0(X, gamma, var):
    X_norm = np.sum(X ** 2, axis = -1)
    return ne.evaluate('v * exp(-g * (A + B - 2 * C))', {
            'A' : X_norm[:,None],
            'B' : X_norm[None,:],
            'C' : np.dot(X, X.T),
            'g' : gamma,
            'v' : var
    })

Timings and verification -

In [165]: # Setup
     ...: X = np.random.randn(10000, 512)
     ...: gamma = 0.01
     ...: var = 5.0

In [166]: %timeit app0(X, gamma, var)
     ...: %timeit app1(X, gamma, var)
     ...: %timeit app2(X, gamma, var)
1 loop, best of 3: 1.25 s per loop
1 loop, best of 3: 1.24 s per loop
1 loop, best of 3: 973 ms per loop

In [167]: np.allclose(app0(X, gamma, var), app1(X, gamma, var))
Out[167]: True

In [168]: np.allclose(app0(X, gamma, var), app2(X, gamma, var))
Out[168]: True

Great! With my test setup, your variant #2 computes the kernel in 4.82 seconds, which is significantly faster than my 8.89 seconds. I'll wait 2 days to see if someone can come up with an even faster solution, but otherwise this will be the correct answer. — Callidior, Nov 13 '17 at 20:43
@Callidior Slightly tweaked app#2 a bit more. If you are timing, please use the updated one. — Divakar, Nov 13 '17 at 20:52
@Callidior Also, there's a very closely related project I was involved with sometime back - https://github.com/droyed/eucl_dist/wiki/Main-Article. I got amazing speedups with GPU. So, if you are really desperate for performance, I would highly recommend porting over to GPUs. — Divakar, Nov 13 '17 at 20:56
I get 4.7 s with the new version. However, how do you justify the use of single precision? Since the other methods would also be faster with single precision, I would normally consider that kind of cheating. Anyway, even with `dgemm` instead of `sgemm` your version is slightly faster than mine (8.28 s vs. 8.89 s). — Callidior, Nov 13 '17 at 21:03
@Callidior That's why I qualified it in the post as - `if allowed use single-precision dtype`, the subtext being if the use-case allows it. As also mentioned, it was stated to be noticeably faster than double precision one `dgemm`, which as you just saw is noticeably slower. — Divakar, Nov 13 '17 at 21:05
@Callidior I haven't seen `np.dot` allowing single precision. Would had been awesome though :) Or if you could figure that out. — Divakar, Nov 13 '17 at 21:12
If the two arguments to `np.dot` have a `dtype` of `np.float32`, the result is single precision as well and the computation is also faster. It hence seems to support single precision dot products. — Callidior, Nov 13 '17 at 21:24
@Callidior Ah yeah, that's one way! I guess I was worried about memory efficiency. So, then both `app0` and `app1` move forward with better timings. — Divakar, Nov 13 '17 at 21:29

score 1 · Answer 3 · answered Jun 10 '21 at 09:14

In the case that you are evaluating X against a high number of gammas, it is useful to save the negative pairwise distances matrix using the tricks done by @Callidior and @Divakar.

from numpy import exp, matmul, power, einsum, dot
from scipy.linalg.blas import sgemm
from numexpr import evaluate

def pdist2(X):
    X_norm = - einsum('ij,ij->i', X, X)
    return evaluate('A + B + C', {
        'A' : X_norm[:,None],
        'B' : X_norm[None,:],
        'C' : sgemm(alpha=2.0, a=X, b=X, trans_b=True),
    })

pairwise_distance_matrix = pdist2(X)

Then, the best solution would be to use numexpr to compute the exponential.

def rbf_kernel2(gamma, p_matrix):
    return evaluate('exp(g * m)', {
        'm' : p_matrix,
        'g' : gamma,
    })

Example:

import numpy as np
np.random.seed(1001)
X= np.random.rand(1001, 5).astype('float32')
p_matrix_test = pdist2(X)
gamma_test_list = (10 ** np.linspace(-2, 1, 11)).astype('float32')

def app2(gamma, X):
    X_norm = - gamma * einsum('ij,ij->i', X, X)
    return evaluate('exp(A + B + C)', {\
        'A' : X_norm[:, None],\
        'B' : X_norm[None, :],\
        'C' : sgemm(alpha=2.0*gamma, a=X, b=X, trans_b=True),\
        'g' : gamma,
    })

I have the results:

%timeit y = [app2(gamma_test, x_test) for gamma_test in gamma_test_list]

70.8 ms ± 5.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit y = [rbf_kernel2(gamma_test, p_matrix_test) for gamma_test in gamma_test_list]

33.6 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that you need to add the overhead to compute the pairwise distance matrix before but it shouldn't be much if you are evaluating against a large number of gammas.

score 0 · Answer 4 · answered Oct 19 '21 at 17:07

The NumPy solution with a given X, Y and gamma would be:

import numpy as np

def rbf_kernel(X, Y, gamma):
    X_norm = np.sum(X ** 2, axis=-1)
    Y_norm = np.sum(Y ** 2, axis=-1)
    K = np.exp(-gamma * (X_norm[:, None] + Y_norm[None, :] - 2 * np.dot(X, Y.T)))
    return K

What is the fastest way to compute an RBF kernel in python?

4 Answers4

Using pure numpy

Using numexpr

Using `scipy.spatial.distance.pdist`

Using `sklearn.metrics.pairwise.rbf_kernel`

Run-time comparison

Tweak #1

Tweak #2

Tweaked implementations

Runtime test

Linked

What is the fastest way to compute an RBF kernel in python?

4 Answers4

Using pure numpy

Using numexpr

Using scipy.spatial.distance.pdist

Using sklearn.metrics.pairwise.rbf_kernel

Run-time comparison

Tweak #1

Tweak #2

Tweaked implementations

Runtime test

Linked

Using `scipy.spatial.distance.pdist`

Using `sklearn.metrics.pairwise.rbf_kernel`