
I've been tasked with hashing arbitrary types in C++, with the caveat that A == B implies hash(A) == hash(B) even if equality of A and B is determined by a custom equality function ==.

For simplicity we can assume that == is an equivalence relation.

For example, the expected behavior of the hash function on std::vectors is as follows:

Given

#include <vector>
using namespace std;
vector<int> A = vector<int>();
vector<int> B = vector<int>();

A == B will be true because == is overloaded for std::vector to mean equality of the underlying data. Correspondingly, hash(A) == hash(B) should also be true.

I can't simply hash the addresses of A and B as integers, because A == B but hash(&A) != hash(&B) in general.

I've thought of one solution, but I wonder if it's optimal. It seems terribly inefficient. The solution is to build the hash function as new values are hashed:


#include <cstdlib>
#include <unordered_map>
using namespace std;

// Note: unordered_map<Key, int> itself needs a hash for Key;
// std::hash<Key> is assumed to exist here.
template <class Key>
class Hasher{

    public:

       unordered_map<Key, int> hashedKeys;
       int max_hash;

       Hasher(int max_hash){
          this->max_hash = max_hash;
       }

       int hash(const Key& key){

          // If key has already been hashed, use that hash_value
          auto found = hashedKeys.find(key);
          if (found != hashedKeys.end()){
               return found->second;
          }

          // For pairs of saved (Key key, int hash_value)
          for(auto it = hashedKeys.begin(); it != hashedKeys.end(); ++it){

             // If an equal key has been inserted, just use its hash_value
             if(key == it->first){
                int hash_value = it->second; // copy before inserting
                hashedKeys.insert({key, hash_value});
                return hash_value; // use hash value of equal Key
             }
          }

          // If no other Keys equal this one, randomly hash it, and save
          int hash_value = rand() % max_hash;
          hashedKeys.insert({key, hash_value});
          return hash_value;

       }
};

I could do some extra bookkeeping to ensure that inequivalent Keys are less likely to be mapped to the same hash by the random assignment, but that's largely beside the point.

Ignoring collision resolution, hashing a new value is O(hashedKeys.size()), while hashing a previously hashed value is O(1). We also require O(n) additional space to store the computed hash values, whereas most hash functions require O(1).

In a situation where a cache is large and new keys are constantly being inserted, the O(n) search is incredibly inefficient, so I'd prefer another approach if possible, or a proof that improvement is impossible.

Take the class ParityInteger:

class ParityInteger{
   public:
      int number;

      ParityInteger(int n){
         number = n;
      }

      // const so that it can be called on const objects
      bool operator==(const ParityInteger& other) const {
         return (number % 2) == (other.number % 2);
      }
};

The ideal hash for such a class is:

int hash(ParityInteger n){
   return n.number % 2;
}

which basically assigns a ParityInteger to a representative of its equivalence class.

Besides my method in the class Hasher, is there any better way to automatically find a function which assigns equivalent members of an arbitrary type to the same integer, without being trivial?

Given a computable equivalence relation == for some type, is there an algorithm to compute a nontrivial function hash such that == is a congruence relation with respect to hash?

EDIT: It seems that the naive Hasher algorithm I put forward is essentially optimal assuming you treat == as a blackbox: https://cstheory.stackexchange.com/questions/33223/on-partitioning-a-collection-into-equivalence-classes

etha7

4 Answers


The way I can think of to do this is by some sort of normalization: that is, you need to find a function $f$ such that, if $\equiv$ is your custom equality and $==$ is the normal C++ (or whatever language you use) equality, for all $x,y$, we have $x \equiv y$ if and only if $f(x)==f(y)$. We call $f(x)$ the normal form of $x$.

Then, the trick is, instead of computing hashes, you compute hashes of normal forms.

Hash functions are specifically designed to produce large changes in output for small changes in input: that's what makes them well suited to hashtables and cryptography. So there's not likely a way to make a hash function that is invariant over some custom equality, except to have it compute on normal forms.
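As a rough illustration of hashing normal forms, here is a minimal sketch. It assumes a hypothetical custom equality on strings (equal when they match ignoring ASCII case), for which the lowercased string serves as the normal form. The names normal_form, equivalent and hash_via_normal_form are made up for this example.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical normal form: the ASCII-lowercased copy of the string.
std::string normal_form(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// The custom equality: x and y are equivalent iff their normal forms
// compare equal under the ordinary std::string ==.
bool equivalent(const std::string& a, const std::string& b) {
    return normal_form(a) == normal_form(b);
}

// Hash the normal form rather than the raw value, so that equivalent
// inputs are guaranteed to produce the same hash.
std::size_t hash_via_normal_form(const std::string& s) {
    return std::hash<std::string>{}(normal_form(s));
}
```

With this, "Hello" and "hELLO" are equivalent and hash identically, while the quality of the hash is whatever std::hash gives on the normal forms.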

What you've described might work from a correctness point of view, but there are a few things to consider:

  1. It is not hashing. That is, a hash is essentially a function that takes some variable-sized data and produces a fixed size output (in your case, an int). You haven't designed a function at all, you've just defined a way to assign random identifiers to input.
  2. You lose all the advantages of hashing. One main use case of hashing is quickly comparing two things.

    If you hash a bunch of things ahead of time, then you can very quickly check that any two of those things are for sure different (their hashes differ), and if their hashes are the same, you know with high probability that they are actually the same. With your version, you get fast comparison, but to compute all your hashes you'll have already compared all $n^2$ unhashed pairs at least once, so you will never save work.

    The other thing hashes are useful for is indexing complex data in a data structure. That is, you convert your key into a hash, and each time you do a key lookup, you compute the hash and use it to find the key in a data structure, possibly a tree or hashtable. With yours, you end up doing $n$ comparisons each time you look up the hash of a key, which means you'd be better off just using an unordered list as your data structure and searching through it each time, comparing each element to the key you're looking for.

Using unique identifiers instead of hashes is a fine way to index data, but then you definitely do NOT want to generate random identifiers, since there's still a risk of a collision. Usually you'd just keep a counter and generate one plus the last identifier each time you allocate a new one.
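A counter-based version might look like the sketch below. It assumes only that Key has a usable operator==. The names Parity and ClassIdAssigner are hypothetical, and each lookup is still a linear scan over one stored representative per distinct class.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical key type with a custom equality (same parity).
struct Parity {
    int number;
    bool operator==(const Parity& other) const {
        return (number % 2) == (other.number % 2);
    }
};

// Assigns identifiers 0, 1, 2, ... in order of first appearance.
// Identifiers cannot collide because each new one is the previous one plus 1.
template <class Key>
class ClassIdAssigner {
public:
    std::size_t id_of(const Key& key) {
        // Linear scan: compare against one representative per known class.
        for (std::size_t i = 0; i < representatives_.size(); ++i) {
            if (representatives_[i] == key) {
                return i;
            }
        }
        // Unseen class: store the key as its representative; its index is the id.
        representatives_.push_back(key);
        return representatives_.size() - 1;
    }

private:
    std::vector<Key> representatives_;
};
```

Here the identifier is simply the index of the stored representative, so no separate counter variable is needed.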

Joey Eremondi

If you don’t know the equality function then you let hash(x) = 0. Seriously. All your algorithms will work, but slowly because of collisions. All the other suggestions will make your hashing slow instead so you lose nothing. Actually, if you have multiple dictionaries containing these keys, operations are quadratic in the size of each dictionary, instead of quadratic in the total number of hashed values.
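To make this concrete, here is a small sketch of a constant hash plugged into a standard container. ConstantHash is a made-up name, and ParityInteger below is a cut-down, const-correct copy of the class from the question. The container stays correct because it falls back on == inside the single bucket.

```cpp
#include <cstddef>
#include <unordered_set>

// Cut-down copy of the question's class, used only for illustration.
struct ParityInteger {
    int number;
    bool operator==(const ParityInteger& other) const {
        return (number % 2) == (other.number % 2);
    }
};

// Trivially valid hash: equal keys always get equal hashes, since every
// key gets the same hash. All keys land in one bucket.
template <class Key>
struct ConstantHash {
    std::size_t operator()(const Key&) const { return 0; }
};

// Lookups degrade to a linear scan using ==, but the results are correct.
using ParitySet = std::unordered_set<ParityInteger, ConstantHash<ParityInteger>>;
```

Inserting 1, 3 and 2 into a ParitySet leaves two elements, one per parity, and a later lookup of 5 finds the odd one.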

For a known custom equality function you just write a custom hash function. Just calculate hashes of all or some of the items that are used for the equality check and combine them.

For example, if you are comparing arrays by comparing the size and all items, you could calculate a hash of the array size, the first, last, and median element, and the element at index 1,000,000,000 modulo array size, and combine these five numbers. There is a good chance for non-equal arrays to have non-equal hashes.
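A sketch of that sampling idea for std::vector<int> compared element-wise could look like this. The function names and the multiplier in combine are arbitrary choices for illustration, and the element at the middle index stands in for the "median" mentioned above.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical combiner: multiply by an arbitrary odd constant and add.
std::size_t combine(std::size_t seed, std::size_t value) {
    return seed * 1000003u + value;
}

// Hashes only the size and a few sampled elements. Arrays that are equal
// (same size, same elements) always produce the same result.
std::size_t sampled_hash(const std::vector<int>& a) {
    std::size_t h = std::hash<std::size_t>{}(a.size());
    if (a.empty()) {
        return h;
    }
    std::hash<int> hash_int;
    h = combine(h, hash_int(a.front()));
    h = combine(h, hash_int(a.back()));
    h = combine(h, hash_int(a[a.size() / 2]));
    h = combine(h, hash_int(a[1000000000u % a.size()]));
    return h;
}
```

The cost is constant regardless of array length, at the price of collisions between arrays that differ only in unsampled positions.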

gnasher729

Suppose the objects you want to compare are strings containing terminating programs that output either 1 or 0 (and written in Visual Basic, because why not), and the equality function returns true if and only if the output of the two programs given as operands is the same. (The termination assumption is there to ensure that this equality is well-defined.)

First, notice that the only valid hash functions are those that produce at most 2 distinct output values: If, to the contrary, there are some three inputs A, B, C that yield distinct hash values, then at least one of A == B, A == C or B == C holds by the pigeonhole principle, but the RHS of the corresponding A == B $\implies$ hash(A) == hash(B) constraint fails. (This also holds for your ParityInteger example.)

So the only valid hash functions are:

  1. Functions that always return the same value (as mentioned in gnasher729's answer, these are valid and maybe even OK as a "let's just get this to compile" stopgap but give you none of the benefits hash functions are intended to have)
  2. Functions that compute what the input program will finally write out, either by simulating it or by some more complicated static analysis, and return (a function of only) this output value.

Why so strict for the second possibility above? If the hash function ever returns the "wrong" answer for some program A, and the "right" answer for some other program B that produces the same output as the first program, then A == B $\implies$ hash(A) == hash(B) is violated.

Note that simulation always works but is slow in the worst case (all we are guaranteed is that the hash function will finish in finite time; it's not even O(anything)!), and coming up with static analysis techniques that correctly determine the result of every terminating program in O(something) time seems hard, since this amounts to discovering a way of running terminating programs unboundedly faster than simulating them. (E.g. all NP-complete problems can be solved by terminating programs that return 0 or 1, so if you could find a polynomial-time static analysis approach you would have shown P=NP; probably stronger statements along these lines can be made.)

j_random_hacker

The assumption is: You have an "equality" operator. The equality operator follows the usual rules, and it is also stable over time: If x = y or x ≠ y today, then x = y or x ≠ y tomorrow as well. You need a hash function which is guaranteed to hash equal values to equal hashes. Since equality is stable over time, if hash(x) is calculated and stored today, then hash(x) must return the same value tomorrow as well. And we assume there is no way to calculate a good hash function that only looks at the value of x, so if we want a good hash function it must be based on comparing values.

We can start with hash(x) = 0. We may be able to do better. For example, you might have an equality operator comparing arrays, and although you don't know the exact rules, you might know that arrays with different numbers of elements are always different. This divides all possible values into a large number of equivalence classes.

So we can start with hash(x) = hash("number of elements in x"). This means that if we hash n values, using some method that compares values for equality, the difficulty doesn't grow with the number of values hashed, but only with the number of values in each equivalence class. If we are happy with only one hash value per equivalence class, we are done.

Otherwise, the problem is that once we have hashed k different values in some equivalence class, and we are given an arbitrary element x in the same equivalence class, we need to check whether x is one of the previously hashed values. If we know nothing more about the equality function than what we have used, then we need to compare x with the previously hashed values until we find one that is equal, or with all of them if none is equal. So hashing k values in the same equivalence class that are all unequal takes about $k^2 / 2$ comparisons.
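The two-level scheme can be sketched roughly as follows. The sketch assumes the caller supplies a coarse function (for example, the array size) that is guaranteed to agree on equal values. BucketedIdAssigner and its members are hypothetical names.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// First level: a cheap coarse key computed from the value alone.
// Second level: comparisons with ==, but only within the same coarse class.
template <class Key, class CoarseFn>
class BucketedIdAssigner {
public:
    explicit BucketedIdAssigner(CoarseFn coarse) : coarse_(coarse) {}

    std::size_t id_of(const Key& key) {
        auto& bucket = buckets_[coarse_(key)];
        // Compare only against previously seen values in the same class.
        for (const auto& entry : bucket) {
            if (entry.first == key) {
                return entry.second;
            }
        }
        // Not seen before: remember it and hand out the next identifier.
        bucket.emplace_back(key, next_id_);
        return next_id_++;
    }

private:
    CoarseFn coarse_;
    std::unordered_map<std::size_t,
                       std::vector<std::pair<Key, std::size_t>>> buckets_;
    std::size_t next_id_ = 0;
};
```

For arrays, coarse could be a lambda returning v.size(), so that a new array is only compared against stored arrays of the same length.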

We may be able to do better if you know something about the equality function. You may be able to divide all values in an equivalence class into overlapping groups, so each value would belong to one or more groups, where only values in the same group can be equal, and all groups a value x belongs to can be calculated. In that case, instead of comparing x with all values, you would calculate which groups it belongs to, keep track of hashed values in each group, and compare x with all the values in these groups.

Summary: You divide the set of values x into equivalence classes, such that you can calculate which equivalence class each x belongs to. (If you know nothing about the equality operator then you have only one equivalence class). You then hash the equivalence classes. You now have a useful hash function. If you have various hash tables, then the average number of collisions due to the bad hash function equals the average number of items from each equivalence class in each hash table.

Otherwise, you calculate hash values for the items in each equivalence class. You define overlapping groups for each equivalence class, such that you can calculate which groups any value x belongs to, and such that only values belonging to the same group can be equal (If you know nothing more about the equality operator, then you have only one group). To calculate the hash value of x, you must either find an x' = x whose hash value has already been calculated, or assign a new hash value to x and remember it. To do this, you calculate which groups x belongs to, and compare it against all values belonging to any of those groups. You now have a good hash function that doesn't cause collisions. The complexity of calculating the hash code of a previously unhashed x is proportional to the number of previously hashed values belonging to any of the groups that x belongs to. For previously hashed values, you may be able to change the order of lookups if the same values are hashed repeatedly.

gnasher729