
I have circa 1 million datapoints, each with a unique integer ranging from 6 to 24 digits:

exampleDataPointOne: 655092
exampleDataPointTwo: 333402221 
exampleDataPointThree: 332021
...
exampleDataPointN: 903232211 

I want to run each of these unique integers through a hash function, with the objective of 1) maintaining the uniqueness, 2) obscuring the underlying integer value and 3) having a hash output with a length of <= 34.

I've looked around for hash algorithms that might be a good fit, such as BLAKE, MD5 and the SHA family of hashes.

Given my requirements, what is the most advisable hash algorithm to use?

2 Answers


I want to run each of these unique integers through a hash function, with the objective of 1) maintaining the uniqueness, 2) obscuring the underlying integer value and 3) having a hash output with a length of <= 34.

You need to be very clear about what you mean by "obscuring." The basic problem here is that your values look like they're likely to be subject to brute force guessing attacks—just like passwords are. This means that if you apply a public, unkeyed hash function to your values, you subject yourself to the same sorts of attacks that are very successful against passwords—particularly attacks against unsalted passwords.
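To make the guessing attack concrete, here is a minimal sketch. It assumes the values are hashed as decimal strings with plain SHA-256 and that the attacker knows the value has six digits. The function names are made up for illustration.

```python
import hashlib


def unkeyed_hash(n: int) -> str:
    """Public, unkeyed hash of the decimal string of n (SHA-256, hex)."""
    return hashlib.sha256(str(n).encode("ascii")).hexdigest()


def brute_force(target: str, lo: int, hi: int):
    """Try every candidate in [lo, hi) and return the one whose hash matches."""
    for candidate in range(lo, hi):
        if unkeyed_hash(candidate) == target:
            return candidate
    return None


# The attacker only sees the hash, but can enumerate every six-digit value.
leaked = unkeyed_hash(655092)
recovered = brute_force(leaked, 100_000, 1_000_000)
print(recovered)  # 655092
```

Longer values take more work to enumerate, but nothing secret stands in the attacker's way.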

If by "obscure" you really mean that an attacker should not be able to figure out which number corresponds to each hash, then your solution needs to incorporate a secret key, and that key needs to be protected. This is much, much, much, much more important than whether you use SHA-2 or BLAKE2. (PS: Don't use MD5 for anything! And if you need to incorporate a secret into the computation, don't do it in an ad-hoc manner; use HMAC-SHA2 or BLAKE2's optional key support.)

Luis Casillas

Note: this answer previously recommended using the SHAKE functions standardized as part of SHA-3, but has since been edited because I misunderstood the notation used in FIPS 202.

I'd recommend using a keyed MAC, truncated to the appropriate length. The key would serve as your "domain separator" for this use case and prevent dictionary attacks against the small input space if that is a concern. You can share the key with third parties if they need to compute the hashes as well for your application.
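Truncation does not really threaten the uniqueness requirement at this scale. As a rough check, assume "length <= 34" means 34 hex characters (17 bytes, 136 bits; the question does not say which encoding is meant) and that the truncated MAC outputs behave like uniformly random values. The birthday bound then gives an upper limit on the chance of any collision among 1 million inputs:

```python
from fractions import Fraction


def collision_bound(n_items: int, bits: int) -> Fraction:
    """Birthday upper bound: (n choose 2) / 2^bits on the chance of any collision."""
    return Fraction(n_items * (n_items - 1), 2 ** (bits + 1))


# 1 million values, 136-bit outputs (assumed: 34 hex characters)
p = collision_bound(1_000_000, 136)
print(float(p))  # a few times 1e-30
```

A MAC cannot guarantee uniqueness the way a permutation would, but a bound this small is negligible in practice.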

If you're dealing with recent x64 hardware, I'd choose HMAC-SHA-256, truncated as needed, because many recent x86-64 CPUs have on-chip instructions for SHA-256.
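A minimal sketch of the truncated HMAC approach using Python's standard library. The decimal-string encoding, the 17-byte (34 hex character) output length, the function name `hmac_id`, and the randomly generated key are all assumptions for illustration; in a real deployment the key would be generated once and stored securely.

```python
import hashlib
import hmac
import secrets


def hmac_id(key: bytes, value: int, out_bytes: int = 17) -> str:
    """HMAC-SHA-256 over the decimal string of value, truncated and hex-encoded."""
    tag = hmac.new(key, str(value).encode("ascii"), hashlib.sha256).digest()
    # Keeping the leftmost bytes is the usual way to truncate a MAC.
    return tag[:out_bytes].hex()


# Hypothetical key: generated here only so that the example runs on its own.
key = secrets.token_bytes(32)
token = hmac_id(key, 655092)
print(len(token))  # 34
```

The same key and value always produce the same token, so the mapping stays stable across runs as long as the key is kept.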

Keyed BLAKE2 might be a more performant choice on platforms without hardware SHA-256 instructions (BLAKE2 can be keyed directly and used as a secure MAC without using the HMAC construction).
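For comparison, a sketch of the keyed BLAKE2 variant with Python's `hashlib.blake2b`, which accepts `key` and `digest_size` directly. As before, the decimal-string encoding, the 17-byte output, the function name `blake2_id`, and the throwaway key are assumptions for illustration.

```python
import hashlib
import secrets


def blake2_id(key: bytes, value: int, out_bytes: int = 17) -> str:
    """Keyed BLAKE2b over the decimal string of value, hex-encoded."""
    h = hashlib.blake2b(
        str(value).encode("ascii"),
        key=key,                # BLAKE2b accepts keys of up to 64 bytes
        digest_size=out_bytes,  # requested output length, no manual truncation
    )
    return h.hexdigest()


# Hypothetical key: generated here only so that the example runs on its own.
key = secrets.token_bytes(32)
token = blake2_id(key, 655092)
print(len(token))  # 34
```

Here the output length is a parameter of the hash itself, so no separate truncation step is needed.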

rmalayter