
Let's say there are $M$ strings for which we want to create a perfect hash: hashing those $M$ items should produce outputs in $[0,M)$ with no collisions.

I know that there are algorithms that can help you achieve this by finding salt values to use with a given hash function that will cause this to happen (like this one http://blog.demofox.org/2015/12/14/o1-data-lookups-with-minimal-perfect-hashing/).

However, all of the algorithms I've seen are randomized/probabilistic: e.g., there is a probability $p$ for any particular salt value to yield a perfect hash, so keep trying until you find one.

Is there any deterministic algorithm for constructing a perfect hash?

Second, is there a deterministic algorithm to construct a perfect hash that is minimal in size (the smallest program that is a valid perfect hash on these $M$ items)?

Security isn't a concern, and while efficiency of the algorithm to find the mapping is nice, it isn't a requirement. Evaluating whatever mapping program was found should be as efficient as possible though.

Alan Wolfe

1 Answer


There is a deterministic algorithm for constructing a perfect hash, if you don't care about efficiency. For instance, you can enumerate all programs (in order of increasing size) and test each one to see which is the first that produces a valid perfect hash. This is a valid deterministic algorithm that is guaranteed to always find a valid perfect hash (and even to find the minimal such program). However, this will be extremely slow for all but the smallest values of $M$, so it's not likely to be useful in practice.
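The enumerate-and-test idea can be illustrated on a much smaller search space. The following is a minimal sketch, not the full "enumerate all programs" search: it assumes a single hypothetical family of candidate functions (SHA-256 of a salt plus the key, reduced mod $M$) and tries salts in the fixed order 0, 1, 2, ... The names `salted_hash` and `first_perfect_salt`, and the `max_salt` bound, are illustrative choices. Nothing here is random, so the same input always yields the same salt, but there is no proof that a working salt exists within the bound.

```python
import hashlib
from typing import Optional, Sequence


def salted_hash(key: str, salt: int, m: int) -> int:
    """Hash `key` under `salt` into the range [0, m) (illustrative family)."""
    data = salt.to_bytes(8, "big") + key.encode("utf-8")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") % m


def first_perfect_salt(keys: Sequence[str],
                       max_salt: int = 1_000_000) -> Optional[int]:
    """Try salts 0, 1, 2, ... in order; return the first collision-free one.

    Returns None if no salt below `max_salt` works (for example, when
    `keys` contains duplicates).
    """
    m = len(keys)
    for salt in range(max_salt):
        # A salt is perfect when the m keys land in m distinct slots.
        if len({salted_hash(k, salt, m) for k in keys}) == m:
            return salt
    return None


if __name__ == "__main__":
    keys = ["apple", "banana", "cherry", "date"]
    salt = first_perfect_salt(keys)
    print(salt, [salted_hash(k, salt, len(keys)) for k in keys])
```

Because the candidates are tried in a fixed order, the result is the smallest working salt within this family, which mirrors (on a tiny scale) the "first program in order of increasing size" argument.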

I expect that finding the smallest such program will be NP-hard, and probably $\Pi_2$-hard (loosely, "even harder than NP-hard"). I don't have a proof of this; it is just a suspicion. For example, the problem of circuit minimization is known to be $\Pi_2$-hard. In other words, given a circuit $C$, it's $\Pi_2$-hard to find the smallest circuit $C'$ that computes the same function as $C$. Your problem isn't quite the same, but it has a "similar feel", so I would expect your problem (finding the smallest deterministic program that yields a valid hash for the $M$ items) will also be hard.

If you are willing to make unproven assumptions, it is possible to come up with a deterministic algorithm that is also efficient: about as efficient as the randomized/probabilistic algorithms you know of and want to avoid. In particular, you can take a randomized algorithm and derandomize it as follows: fix a cryptographic-strength pseudorandom number generator (e.g., AES-CTR mode); fix a seed for it (e.g., the all-zeros AES key; or pick a random 128-bit value, and fix it as the AES key); feed this seed to the PRNG and use it to generate an unending stream of pseudorandom numbers; now run the randomized algorithm, except that whenever it calls for a random number, give it the next pseudorandom number from the output of the crypto PRNG. Note that this is a completely deterministic algorithm: the seed is hardcoded into the code of the algorithm and fixed for all time. Here's what we know: if the pseudorandom number generation algorithm is cryptographically secure, and if the randomized algorithm outputs the correct answer, then (with high probability over the choice of seed) the deterministic algorithm also outputs the correct answer.
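The derandomization recipe above can be sketched in a few lines. This is only an illustration under stated assumptions: the Python standard library has no AES, so SHA-256 in counter mode stands in for AES-CTR as the pseudorandom generator, and the "randomized algorithm" is assumed to be the simple keep-trying-random-salts search from the question. The names `FIXED_SEED`, `prng_stream`, `keyed_hash`, and `find_salt_derandomized` are hypothetical.

```python
import hashlib
from typing import Iterator, Optional, Sequence

# The seed is hardcoded and fixed for all time (here: 16 zero bytes).
FIXED_SEED = bytes(16)


def prng_stream(seed: bytes = FIXED_SEED) -> Iterator[int]:
    """Yield 64-bit pseudorandom integers from SHA-256 in counter mode.

    Assumption: this is a stand-in for AES-CTR, not the construction
    named in the text.
    """
    counter = 0
    while True:
        block = hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        yield int.from_bytes(block[:8], "big")
        counter += 1


def keyed_hash(key: str, salt: int, m: int) -> int:
    """Hash `key` under a 64-bit `salt` into the range [0, m)."""
    data = salt.to_bytes(8, "big") + key.encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % m


def find_salt_derandomized(keys: Sequence[str],
                           max_tries: int = 1_000_000) -> Optional[int]:
    """Run the 'try random salts' search, drawing salts from the fixed PRNG."""
    m = len(keys)
    stream = prng_stream()
    for _ in range(max_tries):
        salt = next(stream)  # the "random" draw, replaced by the PRNG output
        if len({keyed_hash(k, salt, m) for k in keys}) == m:
            return salt
    return None


if __name__ == "__main__":
    keys = ["red", "green", "blue", "cyan", "magenta"]
    salt = find_salt_derandomized(keys)
    print(salt, [keyed_hash(k, salt, len(keys)) for k in keys])
```

Running it twice on the same keys gives the same salt, since every "random" draw comes from the hardcoded seed.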

This means that if you know how to build a cryptographically secure pseudorandom number generator, you can convert any randomized algorithm into a deterministic algorithm. It may be surprising that cryptography has anything to do with this subject, since we don't require any kind of security properties from the hash, but it turns out that this construction gives a generic way to derandomize any randomized algorithm. Here's how to see the connection: the definition of what it means for a PRNG to be cryptographically secure is that no polynomial-time algorithm should be able to tell that it is receiving pseudorandomness rather than real randomness; thus, we can run the randomized algorithm, and swapping out the true-random bits for pseudorandom bits won't mess up the algorithm.

Thus, this will let you build a deterministic algorithm that is correct and efficient, if AES is secure (or one that works if factoring is hard, or based on any other cryptographic assumption). It's widely believed that AES is secure (and factoring is hard, etc.), even though we have no proof, so this provides a reasonable practical solution to your problem, even though we have no proof that it works.

D.W.