
Let's say that I wanted to build a key-to-value associative map where the only requirement is that lookups are fast.

Once built, the associative map would not allow inserts, deletes, or modifications, so it is a static data structure.

What would the best algorithm be for making and using such a thing?

I'm interested to know whether there is a provable "best solution", or whether the problem is provably NP-hard or similar.

Here are some ideas of my own:

Minimal Perfect Hash

My best idea would be to use minimal perfect hashing: find a hash function that maps the $N$ known inputs onto $[0,N)$, so that the resulting value can be used directly as an index into an array.

I would want to find the computationally cheapest (average time) hashing algorithm for my data set.

However, finding a minimal perfect hash function is already a challenging problem, even without also asking for the computationally cheapest one.
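
To make that concrete, here is a minimal sketch of the kind of lookup I'm imagining (the hash constants below are placeholders for illustration; a real minimal perfect hash generator, e.g. one based on CHD, would have to search for parameters that actually map my key set onto $[0,N)$ without collisions):

```
#include <array>
#include <cstdint>
#include <string_view>

// Sketch of a minimal-perfect-hash lookup for a fixed set of N = 4 keys.
// The multiply/shift constants are placeholders: an MPH generator would have
// to find values that map the actual key set onto 0..N-1 with no collisions.
constexpr std::size_t N = 4;

std::size_t MinimalPerfectHash(std::string_view key)
{
    std::uint32_t h = 0;
    for (char c : key)
        h = h * 31u + static_cast<std::uint32_t>(c);
    return (h * 0x9E3779B1u) >> 30;  // top 2 bits give an index in [0, N)
}

struct Entry { std::string_view key; int value; };

// Each entry sits at the slot the hash assigns to its key.
std::array<Entry, N> table;

int Lookup(std::string_view key)
{
    const Entry& e = table[MinimalPerfectHash(key)];
    // Optionally compare e.key with key to reject inputs outside the known set.
    return e.value;
}
```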

Feature-Based Indexing

Another thought would be to look at my input and find the bits that differ between items. For instance, file paths may share many of the same characters, especially if they are absolute paths into the same deep folder.

I could find which bits actually matter and build a tree of objects out of them.

The challenge here, I think, is that I would ideally want a balanced tree, and it might be hard to test all permutations of the bits that matter in order to make the tree as balanced as possible.

Ideally, my hope is that the entire tree could go away, and I could instead take the bits that matter and derive some small formula like "xor bit 2 against bit 3 and add bit 5" to come up with an index into an array.
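
For example, such a formula might end up looking something like this (the bit positions and operations are just the ones from the sentence above; for a real key set they would have to be found by searching so that every key gets a unique index):

```
#include <cstddef>
#include <cstdint>

// Hypothetical "bit formula" index, mirroring "xor bit 2 against bit 3 and
// add bit 5". This particular formula only produces the values 0..2, so it
// could only separate three keys; a larger key set would need more bits and
// operations, chosen by some search over candidate formulas.
std::size_t BitFormulaIndex(std::uint64_t key)
{
    std::uint64_t bit2 = (key >> 2) & 1;
    std::uint64_t bit3 = (key >> 3) & 1;
    std::uint64_t bit5 = (key >> 5) & 1;
    return static_cast<std::size_t>((bit2 ^ bit3) + bit5);  // index into a small array
}
```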

Alan Wolfe

2 Answers


Here's a way to think about the feature-based approach: you select a set of candidate features, then look for the smallest decision tree that assigns each of the $N$ keys a different index in $[0,N)$ using only those features. This is effectively a way of building a perfect hash, where the hash function is expressed as a decision tree applied to the feature vector constructed from the input.

Building the smallest decision tree is NP-hard in general, but there are heuristic algorithms that do a reasonable job. You might look at the ID3 algorithm, for instance; in many settings it does a pretty good job of building a small (though not necessarily minimal) decision tree. You can compare that to other ways of building perfect hash functions and see which is computationally more efficient for your particular data set. Probably the only way to know is to try it out; I'm not sure there is any theory that will give you a definitive answer, as the answer probably depends on characteristics of your dataset that will be hard to model in a theoretical setting.
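
As a rough sketch of what such a greedy construction could look like (this is only an illustration of the idea, not a particular published algorithm: it splits on whichever key bit divides the remaining keys most evenly, and assumes the keys are distinct):

```
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

struct Node {
    int bit = -1;                  // bit tested at this node (-1 marks a leaf)
    std::size_t slot = 0;          // index handed out when this is a leaf
    std::unique_ptr<Node> zero, one;
};

// Greedily pick the bit whose 0/1 split of the remaining keys is most
// balanced, recurse on both halves, and hand out array slots at the leaves.
std::unique_ptr<Node> Build(const std::vector<std::uint64_t>& keys, std::size_t& nextSlot)
{
    auto node = std::make_unique<Node>();
    if (keys.size() == 1) { node->slot = nextSlot++; return node; }

    int bestBit = -1;
    std::size_t bestImbalance = keys.size() + 1;
    for (int b = 0; b < 64; ++b) {
        std::size_t ones = 0;
        for (std::uint64_t k : keys) ones += (k >> b) & 1;
        if (ones == 0 || ones == keys.size()) continue;        // bit doesn't discriminate
        std::size_t imbalance = std::max(ones, keys.size() - ones);
        if (imbalance < bestImbalance) { bestImbalance = imbalance; bestBit = b; }
    }

    std::vector<std::uint64_t> zeroKeys, oneKeys;
    for (std::uint64_t k : keys)
        (((k >> bestBit) & 1) ? oneKeys : zeroKeys).push_back(k);
    node->bit  = bestBit;
    node->zero = Build(zeroKeys, nextSlot);
    node->one  = Build(oneKeys, nextSlot);
    return node;
}

std::size_t Lookup(const Node* n, std::uint64_t key)
{
    while (n->bit >= 0) n = ((key >> n->bit) & 1) ? n->one.get() : n->zero.get();
    return n->slot;   // index into a flat array of N values
}
```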

D.W.

A practical answer to this question is that the fastest lookup is the simplest one: make a flat array of the data items you want to look up, and make sure every key is pre-processed (before run time, possibly at compile time) from its key value into the array index it corresponds to.

Furthermore, if the data is truly static and contains duplicate values, you can make multiple keys point at the same value and so deduplicate the data.
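
A minimal sketch of that idea (the keys, values, and names here are invented; the point is that the key-to-index mapping is produced before run time, so a lookup is just an array subscript):

```
#include <string_view>

// The key -> index mapping is generated offline (or via codegen/constexpr),
// so the enum below stands in for whatever the build step would emit.
enum KeyIndex : std::size_t {
    kConfigPath   = 0,
    kCachePath    = 1,
    kLogPath      = 2,
    kDebugLogPath = kLogPath,  // two keys deduplicated onto the same value slot
    kKeyCount     = 3
};

constexpr std::string_view kValues[kKeyCount] = {
    "/etc/app/config.ini",
    "/var/cache/app",
    "/var/log/app",
};

// At run time the "lookup" is simply an array subscript.
constexpr std::string_view Lookup(KeyIndex i) { return kValues[i]; }
```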

This is the fastest pragmatic solution, but it doesn't answer the question of which data structures might help in cases where you can't preprocess the data and keys this way.

More info on this and similar techniques, including performance timings and source code, is available here: http://blog.demofox.org/2016/10/03/is-code-faster-than-data-examining-hash-tables/

Alan Wolfe