Hash length vs Data length

Question

I'm very (!) new to the world of cryptography, so pardon me if this question is very basic.

Related to hashing, something that hasn't become clear to me is the relation between data length and hash length.

If I understood correctly, any change in the data should also alter the hash, while still avoiding hash collisions. So my question is if it's possible to have a hash whose length is inferior to the data length and still avoid hash collisions.

My use case would be something like hashing data with less than 128 bits, but I absolutely cannot spare another 128-bit space to store the hash. I would be able to store maybe a 32-bit hash, if even that.

Are there any hashing functions that perform better in these scenarios over others?

Edit: To clear some confusion, here are some bullet points:

I have a set of data that is about 128 bits long. I can only fit another 32 bits in that same packet.
Ideally, the data should not be readable by anyone else other than the destination (there is a shared key), and it should be possible to identify 'fake' packets (modified or injected into the network)
My first idea was to create a 32-bit hash and then encrypt the whole 128+32 bit using the shared key
I'm looking for better suggestions/guidelines, since I'm a newbie in this area.

Thanks!

tylo · Answer 1 · 2017-06-02T16:51:02.893

My use case would be something like hashing data with less than 128 bit, but I absolutely cannot spare another 128 bit space to store the hash. I would be able to store maybe a 32 bit hash, if even that.

If you have any kind of security in mind, $32$ bits are definitely not enough. Even $128$ bits are not enough today if you need collision resistance: due to the birthday paradox, a collision on a $128$-bit hash can be expected after roughly $2^{64}$ hash evaluations.

With just a $32$-bit output, anyone can find a preimage on the most basic computer within a very short time (seconds maybe) by just trying out enough possible inputs.
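To make the brute-force search concrete, here is a minimal sketch. It assumes SHA-256 truncated to its leading bits as a stand-in for a short hash; the helper names `truncated_hash` and `find_preimage` are made up for illustration. The demo uses a 16-bit width so that it finishes quickly in pure Python. The same loop works for 32 bits, it just needs on the order of $2^{32}$ tries, which is slow in Python but quick in optimized native code.

```python
import hashlib
from itertools import count

def truncated_hash(data: bytes, bits: int) -> int:
    """Return the leading `bits` bits of SHA-256(data) as an integer."""
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return digest >> (256 - bits)

def find_preimage(target: int, bits: int) -> bytes:
    """Try the inputs 0, 1, 2, ... until one hashes to `target`."""
    for i in count():
        candidate = i.to_bytes(8, "big")
        if truncated_hash(candidate, bits) == target:
            return candidate

BITS = 16  # assumption: scaled down from 32 so the demo runs in well under a second
target = truncated_hash(b"secret packet payload", BITS)
forged = find_preimage(target, BITS)
print(forged.hex(), "hashes to the same", BITS, "bit value")
```

The forged input is unrelated to the original message; it merely produces the same short hash, which is all an attacker needs.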

if it's possible to have a hash whose length is inferior to the data length and still avoid hash collisions

I think there is a misunderstanding in the properties of a cryptographic hash function. My first suggestion would be to study the properties collision resistance and preimage resistance in detail (the wiki-link above is a useful starting point). As you can see, the length of the input is not mentioned at all - it does not matter for the security. What is stated is:

It should be hard to find any $m_1, m_2$ with $m_1 \neq m_2$ such that $h(m_1) = h(m_2)$ (collision resistance)
Given a hash value $y$, it should be hard to find any $m$ such that $h(m) = y$ ((first) preimage resistance)
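As an illustration of why a short output cannot provide collision resistance, the following sketch runs a birthday search against SHA-256 truncated to 32 bits. The truncation and the helper names are assumptions chosen for the demo. Because any two colliding inputs will do, the search is expected to succeed after roughly $2^{16}$ hashes rather than $2^{32}$.

```python
import hashlib
from itertools import count

def truncated_hash(data: bytes, bits: int) -> int:
    """Return the leading `bits` bits of SHA-256(data) as an integer."""
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return digest >> (256 - bits)

def find_collision(bits: int):
    """Hash 0, 1, 2, ... and remember every output until one repeats."""
    seen = {}  # maps truncated hash value -> the message that produced it
    for i in count():
        message = i.to_bytes(8, "big")
        value = truncated_hash(message, bits)
        if value in seen:
            return seen[value], message, i + 1
        seen[value] = message

m1, m2, tries = find_collision(32)
print(f"collision after {tries} hashes: {m1.hex()} and {m2.hex()}")
```

The memory cost is one table entry per hash tried, which is negligible at this size.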

It is not about:

whether collisions are possible (they always are, because the input is of arbitrary length while the output has a fixed length)
special treatment of certain lengths for the property in general (including shorter/longer than the input)
any kind of brute force: If you have just two possible messages and know $h(m)$, then it's easy to test both and see which message was the preimage.

And then, it's important to get a rough idea of what is actually feasible in today's world and what is not. A full search over $2^{32}$ values is easy. A full search over $2^{64}$ values is practically possible but far beyond what you can do on a single computer (e.g. that's roughly the number of hashes of the entire bitcoin network in $3.7$ seconds). A full search over $2^{128}$ values is practically impossible: $2^{100}$ hashes would already take the bitcoin network around $8000$ years at its current rate, and $2^{128}$ takes $2^{28}$ times longer still, which is around a quarter billion times.
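These estimates can be checked with a few lines of arithmetic. The sketch below assumes a network hash rate derived from the figure quoted above ($2^{64}$ hashes in $3.7$ seconds, a 2017 snapshot); the constant and function names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Assumption: rate implied by "2^64 hashes in 3.7 seconds" (about 5e18 hashes/s).
NETWORK_RATE = 2**64 / 3.7

def years_to_search(bits: int, rate: float = NETWORK_RATE) -> float:
    """Years needed to evaluate 2**bits hashes at `rate` hashes per second."""
    return 2**bits / rate / SECONDS_PER_YEAR

print(f"2^100 hashes: about {years_to_search(100):,.0f} years")
print(f"2^128 hashes: {2**(128 - 100):,} times longer than 2^100")
```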

SEJPM · Answer 2 · 2017-06-02T17:00:57.937

My first idea was to create a 32 bit hash and then encrypt the whole 128+32 bit using the shared key

I take two things from this:

You have a shared key (and hopefully only two parties).
You only want to spend 32 bits on authentication (due to hard external limits).

Now the approach you proposed certainly is at least sub-optimal.

Luckily for you, (some) people have foreseen such constrained deployment scenarios, and they are covered at least in part by appendices of NIST SP 800-38D (Appendix C), NIST SP 800-38C (Appendix B), and NIST SP 800-38B (Appendix A).

If you can (and performance allows), you should use AES-CCM if you also want privacy, and a standard MAC (such as HMAC or CMAC) otherwise. You can use 32-bit authentication tags; however, be sure to read the sections mentioned above before doing so, so that you actually understand the risks involved. With 32-bit authentication tags, the message expansion is only 32 bits. Note that you still need to supply a nonce to the scheme (if you use CCM), but this can be a simple packet sequence number, a synchronized counter, or something similar. It is only important that both sender and receiver know the value associated with a packet and that these values are unique under a specific key.

What you really want to do is to use ephemeral keys. That is, every 10,000 or so messages you run a pre-shared-key-based key negotiation protocol between the involved parties and use the resulting shared key for the next 10,000 messages. If you want to, you can keep the "master" key and use it for authentication and the ephemeral keys for transport. The details of such a protocol would greatly depend on the number of involved parties. If you re-key regularly (and especially if you hit many decryption errors) you can keep the probability of a successful forgery quite low.

Now to AES-CCM. You really should use the re-key strategy from above here, but also note the discussion in the relevant standard, which states:

$$Tlen\geq\lg(MaxErrs/Risk)$$

That is, the length of the authentication tag in bits should be at least the (binary) logarithm of the ratio between the number of (back-reported) "invalid" decryptions you tolerate and the maximal risk you are willing to accept for an unwanted message to slip through. So for example, if you are going to retry / notify the sender (under the same key!) for up to $2^{10}$ errors, you can still get away with a one-in-a-million chance of a bad message slipping through with 32-bit tags, since one in a million is about $2^{-20}$ and $\lg(2^{10}/2^{-20}) = 30 \leq 32$. Note that a re-key resets the "count" for MaxErrs.
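The inequality is easy to evaluate for your own parameters. Here is a small sketch; the function names are made up for illustration and the formula is the $Tlen$ bound quoted above.

```python
import math

def min_tag_bits(max_errs: float, risk: float) -> int:
    """Smallest whole number of tag bits with Tlen >= lg(MaxErrs / Risk)."""
    return math.ceil(math.log2(max_errs / risk))

def max_tolerated_errors(tag_bits: int, risk: float) -> int:
    """Largest MaxErrs that still satisfies the bound for a given tag length."""
    return math.floor(2**tag_bits * risk)

print(min_tag_bits(2**10, 1e-6))       # tag bits needed for 2^10 errors at 1e-6 risk
print(max_tolerated_errors(32, 1e-6))  # failed verifications a 32-bit tag allows per key
```

The second function gives a direct re-keying trigger: once the count of failed verifications under one key approaches that value, negotiate a new key.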

If you don't need encryption (that is, privacy for the message contents), you can also get away with using CMAC or HMAC, which require similar considerations as CCM does, because the probability of successfully forging a message is linear in the number of "yes / no" outputs the adversary gets.
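For the authentication-only case, a minimal sketch using Python's standard `hmac` module could look as follows. The choice of HMAC-SHA256, the 8-byte sequence-number prefix, and the function names are assumptions made for illustration. The code provides no privacy, and a 32-bit tag carries the forgery risk discussed above.

```python
import hashlib
import hmac

TAG_BYTES = 4  # 32-bit tag, as in the question's space budget

def make_tag(key: bytes, seq: int, payload: bytes) -> bytes:
    """HMAC-SHA256 over sequence number and payload, truncated to 32 bits."""
    message = seq.to_bytes(8, "big") + payload
    return hmac.new(key, message, hashlib.sha256).digest()[:TAG_BYTES]

def verify_tag(key: bytes, seq: int, payload: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(make_tag(key, seq, payload), tag)

key = bytes(range(16))      # hypothetical shared key, for the demo only
payload = b"0123456789abcdef"  # 128 bits of packet data
tag = make_tag(key, 1, payload)
print(tag.hex(), verify_tag(key, 1, payload, tag))
```

Including the sequence number in the authenticated data means a packet replayed under a different sequence number fails verification. The receiver should count failed verifications and re-key before the tolerated number of errors is reached.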
