I have been reading about the Rabin-Karp algorithm and I keep wondering: what's the big deal with keeping our rolling hash values bounded by some value Q?
I'd have thought that, since integers on a typical computer use two's-complement representation, letting the arithmetic overflow is effectively the same as bounding all operations on the rolling hashes by 2^32, so in other words I simply shouldn't care. Plus, the smaller the bound on our hashes, the more collisions we get, so a bigger Q should mean better performance!
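To make that concrete, here is a minimal sketch (not part of my implementation below, names chosen just for illustration) of what I mean: Java's int arithmetic is defined to wrap modulo 2^32, so dropping an explicit % Q behaves like choosing Q = 2^32.

import java.math.BigInteger;

public class OverflowAsImplicitMod {
    public static void main(String[] args) {
        final BigInteger M = BigInteger.ONE.shiftLeft(32); // 2^32

        int h = 1;                         // plain int, allowed to overflow silently
        BigInteger hMod = BigInteger.ONE;  // same value, reduced mod 2^32 explicitly
        for (int c = 0; c < 100_000; ++c) {
            h = h * 10 + c;                                  // wraps on overflow
            hMod = hMod.multiply(BigInteger.TEN)
                       .add(BigInteger.valueOf(c))
                       .mod(M);                              // explicit mod 2^32
        }
        // reinterpret the wrapped int as its unsigned value in [0, 2^32)
        System.out.println(Integer.toUnsignedLong(h) == hMod.longValueExact()); // prints true
    }
}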
I've tried coding a simple (Java) implementation:
public static int rabinKarp(String text, String pattern) {
    if (text.length() < pattern.length()) {
        return -1;
    } else {
        int patternHash = 0;
        int textHash = 0;
        int pow = 1;
        // preprocessing: hash the pattern and the first pattern.length() characters of the text
        for (int i = pattern.length() - 1; i >= 0; --i) {
            patternHash += pattern.charAt(i) * pow;
            textHash += text.charAt(i) * pow;
            pow *= 10;
        }
        pow /= 10; // pow is now 10^(pattern.length()-1), the weight of the leading character
        // actual search
        if (patternHash == textHash && areEqual(text, 0, pattern)) {
            return 0;
        } else {
            for (int i = 1; i < text.length() - pattern.length() + 1; ++i) {
                // roll the hash: drop the leading character, shift, append the new trailing one
                textHash -= text.charAt(i - 1) * pow;
                textHash *= 10;
                textHash += text.charAt(i + pattern.length() - 1);
                if (textHash == patternHash && areEqual(text, i, pattern)) {
                    return i;
                }
            }
            return -1;
        }
    }
}
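(areEqual isn't shown above; it's just the usual character-by-character verification once the hashes match, roughly like this:)

// sketch of the verification helper assumed by the code above
private static boolean areEqual(String text, int offset, String pattern) {
    for (int i = 0; i < pattern.length(); ++i) {
        if (text.charAt(offset + i) != pattern.charAt(i)) {
            return false;
        }
    }
    return true;
}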
From some preliminary tests, my hypothesis seems to hold up empirically, yet I have not seen it stated anywhere, so I am left wondering...
Am I missing something?