
Background

I'm building a URL shortener, and the URL to shorten may contain a SessionId.

For the URL shortener to not compromise the security of the SessionId, I must use a shortening strategy that fulfills the same requirements as the SessionId.

OWASP states that:

  • The session ID length must be at least 128 bits (16 bytes)
  • The session ID value must provide at least 64 bits of entropy

What I think about doing

  • Have a key-value store, where the value is the long (unshortened) URL, and the key is the shortened URL component.
  • When the user navigates to a shortened URL, the long URL is looked up in the key-value store, and returned to the user, who is then redirected.
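The flow above can be sketched in a few lines of Python (an in-memory dict standing in for the key-value store; all names are hypothetical, and the key scheme is a placeholder pending the question below):

```python
import secrets

# In-memory stand-in for the key-value store.
store = {}

def shorten(long_url: str) -> str:
    key = secrets.token_urlsafe(16)  # placeholder key scheme
    store[key] = long_url
    return key

def resolve(key: str) -> str:
    # Look up the long URL; the web layer would respond with a redirect.
    return store[key]

key = shorten("https://example.com/page?SessionId=abc123")
print(resolve(key))
```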

So the key needs to be at least 128 bits (16 bytes) long, and provide at least 64 bits of entropy, to not compromise the SessionId.

If the key is a hash of the value, I can guarantee the key length, and know the entropy (based on the hashing algorithm used).

But which hashing algorithm should I use?

For example, MD5's digest length is exactly 128 bits. But I don't know what its minimum entropy is.

The question

Which hashing algorithm produces a digest of at least 128 bits, with at least 64 bits of entropy?

(x-post from StackOverflow: https://stackoverflow.com/q/71224441/6517320)

1 Answer


The short answer to your hashing question is "use SHA-256." That's the answer for almost every secure-hashing problem, unless the answer is "use SHA-512." If you want a 128-bit hash, then you can truncate SHA-256 (take the first or last 128 bits). All bits in SHA-256 are independent, so you can extract any 128 of them as a hash.
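In Python, for example, that truncation is a one-liner with the standard-library hashlib module (the URL here is illustrative):

```python
import hashlib

digest = hashlib.sha256(b"https://example.com/some/long/url").digest()  # 32 bytes
truncated = digest[:16]  # keep the first 128 bits; any 128 bits of SHA-256 work

print(truncated.hex())
```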

That said, IMO you're thinking about this problem incorrectly. The point isn't about protecting SessionId specifically. The issue is that URLs may contain sensitive information, of which SessionId is just one example (if it's stored in the URL). If someone already knows the shortened URL, they can just ask your system for the full URL, so this is all about stopping attackers from finding keys by guessing. You need to make your keyspace sparse, which is to say "much, much larger than the number of keys actually stored."

You're maintaining a key/value database, so there's no need to use a hash at all. You can just generate a random key for each URL. This is better than a hash because there's absolutely no connection between the key and the value.
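Generating such a key is a one-liner with Python's secrets module, which draws from a cryptographically secure RNG (the key size here is illustrative; sizing is discussed below):

```python
import secrets

# 10 random bytes = 80 bits; token_urlsafe Base64url-encodes them.
key = secrets.token_urlsafe(10)
print(key)  # 14 URL-safe characters for 10 random bytes
```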

Given your design, attackers cannot search offline. They have to contact your server. Say you can serve 1,000 requests/second, and you scale your total keyspace to be a trillion times larger than your planned number of URLs. It would take an attacker roughly 16 years (~1/2 the space searched, so ~5×10^11 guesses) to find a single URL, even if they could consume all of your available bandwidth (which I expect you might notice...). With just a little bit of rate limiting per IP address, you could dramatically complicate this attack.
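A minimal per-IP token-bucket limiter, as a sketch (the rate, burst, and function names are all illustrative, not part of the original answer):

```python
import time
from collections import defaultdict

RATE = 5.0    # tokens refilled per second, per IP (illustrative)
BURST = 10.0  # bucket capacity

_buckets = defaultdict(lambda: {"tokens": BURST, "t": time.monotonic()})

def allow(ip: str) -> bool:
    """Return True if this IP may make a request right now."""
    b = _buckets[ip]
    now = time.monotonic()
    # Refill proportionally to elapsed time, capped at BURST.
    b["tokens"] = min(BURST, b["tokens"] + (now - b["t"]) * RATE)
    b["t"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False
```

A burst of requests beyond the bucket capacity starts getting rejected, which turns a 1,000-requests/second attack into a few requests per second per address.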

Given the above, if you wanted to store a billion URLs in your system, you'd need a keyspace of:

log2(1 billion URLs * 1 trillion multiplier) ≈ 70 bits

In Base58 (which I like for this kind of problem because it's human-friendly), that's around 12 characters. Tweaking the above values for rate limiting, the period of attack you want to protect against, and the number of stored URLs, you can choose how long your keys are.
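Checking that arithmetic (the billion-URL and trillion-multiplier figures come from the answer above):

```python
import math

urls = 10**9         # planned stored URLs
multiplier = 10**12  # sparseness factor
keyspace = urls * multiplier

bits = math.log2(keyspace)                      # ~70 bits
base58_chars = math.ceil(bits / math.log2(58))  # Base58 digits needed

print(round(bits, 1), base58_chars)
```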

You can generally compute random values on this scale without worrying about collisions (which is good for performance). For the same reason that it's extremely hard for an attacker to find a collision, it's extremely unlikely that you'll have one by accident. But if you want to double-check just how unlikely, look at the Birthday Attack. The calculations for "what's the likelihood that any values collide" are different from "how long will it take an attacker to find a collision," and in some cases will force you to use longer keys.
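For small probabilities, the birthday bound is well approximated by p ≈ n²/(2N) for n keys drawn from a keyspace of size N. Plugging in the figures from the example above:

```python
n = 10**9  # stored keys
N = 2**70  # keyspace size (~70-bit keys)

# Birthday approximation, valid when p is small.
p_collision = n * n / (2 * N)
print(p_collision)  # around 4e-4: roughly a 1-in-2,400 chance of any collision
```

So even a billion stored keys in a 70-bit space collide by accident only about 0.04% of the time; if that risk is still too high, add a few more bits.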

IMO there's no need for hashes. But if you do need one, use SHA-256, truncated to whatever number of bits you want.

Rob Napier