I have been reading about the Rabin-Karp algorithm and I keep wondering: what's the big deal with keeping our rolling hash values bounded by some value Q?
I'd have thought that, since integers on a typical computer use two's-complement representation, letting the arithmetic overflow is effectively the same as bounding all operations on the rolling hashes by 2^32, so in other words I simply shouldn't care. Plus, the smaller the bound on our hashes, the more collisions we get, so a bigger Q should mean better performance!
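To make that concrete, here is a minimal sketch (not part of my implementation below, names chosen just for illustration) of what I mean: Java's int arithmetic is defined to wrap modulo 2^32, so dropping an explicit % Q behaves like choosing Q = 2^32.

import java.math.BigInteger;

public class OverflowAsImplicitMod {
    public static void main(String[] args) {
        final BigInteger M = BigInteger.ONE.shiftLeft(32); // 2^32

        int h = 1;                         // plain int, allowed to overflow silently
        BigInteger hMod = BigInteger.ONE;  // same value, reduced mod 2^32 explicitly
        for (int c = 0; c < 100_000; ++c) {
            h = h * 10 + c;                                  // wraps on overflow
            hMod = hMod.multiply(BigInteger.TEN)
                       .add(BigInteger.valueOf(c))
                       .mod(M);                              // explicit mod 2^32
        }
        // reinterpret the wrapped int as its unsigned value in [0, 2^32)
        System.out.println(Integer.toUnsignedLong(h) == hMod.longValueExact()); // prints true
    }
}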
I've tried coding a simple (Java) implementation:
public static int rabinKarp(String text, String pattern) {
    if (text.length() < pattern.length()) {
        return -1;
    } else {
        int patternHash = 0;
        int textHash = 0;
        int pow = 1;
        // preprocessing: hash the pattern and the first pattern.length() characters of the text
        for (int i = pattern.length() - 1; i >= 0; --i) {
            patternHash += pattern.charAt(i) * pow;
            textHash += text.charAt(i) * pow;
            pow *= 10;
        }
        pow /= 10; // pow is now 10^(pattern.length()-1), the weight of the leading character
        // actual search
        if (patternHash == textHash && areEqual(text, 0, pattern)) {
            return 0;
        } else {
            for (int i = 1; i < text.length() - pattern.length() + 1; ++i) {
                // roll the hash: drop the leading character, shift, append the new trailing one
                textHash -= text.charAt(i - 1) * pow;
                textHash *= 10;
                textHash += text.charAt(i + pattern.length() - 1);
                if (textHash == patternHash && areEqual(text, i, pattern)) {
                    return i;
                }
            }
            return -1;
        }
    }
}
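(areEqual isn't shown above; it's just the usual character-by-character verification once the hashes match, roughly like this:)

// sketch of the verification helper assumed by the code above
private static boolean areEqual(String text, int offset, String pattern) {
    for (int i = 0; i < pattern.length(); ++i) {
        if (text.charAt(offset + i) != pattern.charAt(i)) {
            return false;
        }
    }
    return true;
}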
From some preliminary tests, my hypothesis seems to hold up empirically, yet I have not seen it stated anywhere, so I am left wondering...
Am I missing something?