Optimal compression ratio in compression function

Question

So we'll have a transform that maps any of $A=2^a$ possible inputs to $B=2^b$ possible outputs. $a$ is generally fixed as the number of bits in the input, and $b$ is determined by the transform used. We'll name the difference $d=a-b$.

There are 3 main groups:

- $d$ is large and the output is much shorter than the input. We see this in SHA-1, SHA-2, and many similar hashes, which use it to absorb message blocks.
- $d=0$, so the transform is bijective. As far as I know, SHA-3 is the only hash that does this, and it has a large hidden internal state. Side question: where there is no extra internal state, being bijective implies the transform is reversible, which is quite bad for a hash, right?
- $0<d<1$: $d$ is small and the output has the same bit length as the input. This is the compression function, the interesting case and the one I'm concerned with. What value of $d$ should I aim for? Some considerations are obvious, like keeping $d$ very small if the transform is iterated, as in a password hash, but otherwise I'm at a loss. Of course we can't just choose $d$, since it depends on the transform, but we can design the transform and pick constants to get $d$ close to the desired value. So what order of magnitude should $d$ be? Maybe $d\approx a^{-\frac{1}{2}}$?
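
For intuition, $d$ can be measured exhaustively when the transform is tiny. The following is a minimal sketch under stated assumptions: `toy_transform` is a hypothetical 16-bit stand-in (SHA-256 truncated to 16 bits), and `measure_d` simply counts distinct outputs. A random function on $n$ bits is expected to reach about $1 - 1/e \approx 63\%$ of its outputs, so a value of $d$ near $0.66$ bits would be unsurprising here.

```python
import hashlib
import math

A_BITS = 16  # toy width; real designs use hundreds of bits


def toy_transform(x: int) -> int:
    """Hypothetical 16-bit -> 16-bit transform: SHA-256 truncated to 16 bits."""
    digest = hashlib.sha256(x.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big")


def measure_d(transform, a_bits: int) -> float:
    """Return d = a - b, where 2**b is the number of distinct outputs."""
    outputs = {transform(x) for x in range(1 << a_bits)}
    return a_bits - math.log2(len(outputs))


d = measure_d(toy_transform, A_BITS)
print(f"d = {d:.3f} bits")
```

The same `measure_d` returns exactly 0 for a bijection and exactly 4 for a transform that drops 4 bits, which matches the first two groups above.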

The main purpose of this is hashing: I was going to try making my own hash.

score 2 · Accepted Answer · edited Jun 17 '20 at 08:17

You said ‘hash’, but you didn't say what properties you wanted. For example, the ChaCha core $C\colon \{0,1\}^{256} \times \{0,1\}^{128} \to \{0,1\}^{512}$ is called a hash function and used in CTR mode to make a one-time pad $C(k, 0) \mathbin\| C(k, 1) \mathbin\| \cdots$ to encrypt a message under key $k$. Poly1305 is called a universal hash function family and is used to make a one-time authenticator for a message under a single-use secret key, but is useless for encrypting messages.

Are you talking about building a variable-size collision-/preimage-/second-preimage-resistant unkeyed function $H\colon \{0,1\}^* \to \{0,1\}^n$ out of a fixed-size function $f\colon \{0,1\}^a \to \{0,1\}^b$? That's one of the outmoded definitions of ‘hash function’ in cryptography—outmoded because there are other properties like (enhanced) target collision resistance, prefix-PRF, etc., that turn out to be at least as important in protocol design, and are sometimes summarized as indifferentiability from a random oracle. And if that is what you're looking for, it's hard to imagine how $d$ could turn out to be anything other than a nonnegative integer.

But assuming that is what you're looking for, here are some comments on common values of $a$, $b$, and $d$:

The design principle of SHA-256 is to iterate a block cipher $E_k$ in Davies–Meyer form:

1. Break the padded message into 512-bit chunks $m_0, m_1, \dots, m_{\ell - 1}$.
2. Let $h_{-1} = \mathrm{IV}$ be the standard 256-bit initialization vector.
3. Compute $h_i = E_{m_i}(h_{i - 1}) \oplus h_{i - 1}$.
4. Reveal $h_{\ell - 1}$ as the hash.

Here each step takes 512 message bits plus 256 chaining bits, so $a = 512 + 256 = 768$ and $b = 256$, giving $d = 512$.
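
The Davies–Meyer iteration above can be sketched in Python. This is an illustration only and does not compute SHA-256: `toy_encrypt` is a hypothetical Feistel cipher standing in for $E_k$, and the padding rule and all-zero IV are simplifying assumptions rather than the standard's values.

```python
import hashlib

BLOCK = 64  # 512-bit message chunk m_i, used as the cipher key
STATE = 32  # 256-bit chaining value h_i


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _round(key: bytes, r: int, half: bytes) -> bytes:
    # Hypothetical round function: 16 bytes derived from key, round index, half-block.
    return hashlib.sha256(key + bytes([r]) + half).digest()[:16]


def toy_encrypt(key: bytes, block: bytes, rounds: int = 8) -> bytes:
    """Toy Feistel permutation E_k on 32-byte blocks (not a real cipher)."""
    left, right = block[:16], block[16:]
    for r in range(rounds):
        left, right = right, _xor(left, _round(key, r, right))
    return left + right


def toy_decrypt(key: bytes, block: bytes, rounds: int = 8) -> bytes:
    """Inverse of toy_encrypt, showing that E_k really is a permutation."""
    left, right = block[:16], block[16:]
    for r in reversed(range(rounds)):
        left, right = _xor(right, _round(key, r, left)), left
    return left + right


def pad(message: bytes) -> bytes:
    # Assumed padding: 0x80, zero bytes, then the 8-byte big-endian bit length.
    zeros = (-(len(message) + 9)) % BLOCK
    return message + b"\x80" + bytes(zeros) + (8 * len(message)).to_bytes(8, "big")


def davies_meyer_hash(message: bytes, iv: bytes = bytes(STATE)) -> bytes:
    h = iv  # h_{-1}; the all-zero default is an assumption
    padded = pad(message)
    for i in range(0, len(padded), BLOCK):
        m_i = padded[i:i + BLOCK]
        h = _xor(toy_encrypt(m_i, h), h)  # h_i = E_{m_i}(h_{i-1}) xor h_{i-1}
    return h


a_bits, b_bits = 8 * (BLOCK + STATE), 8 * STATE
print(a_bits, b_bits, a_bits - b_bits)  # 768 256 512
print(davies_meyer_hash(b"hello").hex())
```

Each step of the loop is invertible given $m_i$ only up to the final XOR with $h_{i-1}$; that feed-forward is what stops an attacker from simply running `toy_decrypt` backwards from the digest.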

The design principle of BLAKE2 is similar, except it uses a tweakable block cipher $E_{k,t}(m)$ instead of an ordinary block cipher, in HAIFA form, which fixes some potential pathologies of naive Davies–Meyer or Merkle–Damgård form, so that it goes beyond collision and second-preimage-resistance to simulate a random oracle in other dimensions.

The design principle of SHA3-256 is to iterate a fixed permutation $\pi$ in sponge form:

1. Break the padded message into 1088-bit chunks $m_0, m_1, \dots, m_{\ell - 1}$.
2. Let $h_{-1} = 0^{1600}$.
3. Compute $h_i = \pi(h_{i - 1} \oplus (m_i \mathbin\| 0^{512}))$.
4. Reveal the first 256 bits of $h_{\ell - 1}$ as the hash.

Here each step takes 1088 message bits plus 1600 state bits, so $a = 1088 + 1600 = 2688$ and $b = 1600$, giving $d = 1088$.
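
The sponge iteration can be checked against Python's `hashlib.sha3_256`. The sketch below assumes the standard Keccak-f[1600] permutation for $\pi$ and SHA-3's byte-level padding (a `0x06` byte first, `0x80` set in the last byte); neither detail is spelled out in the steps above, and the function names are illustrative.

```python
import hashlib

RATE_BYTES = 136   # 1088-bit chunks m_i
STATE_BYTES = 200  # 1600-bit state h_i
MASK64 = (1 << 64) - 1


def rol64(v: int, n: int) -> int:
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK64


def keccak_f1600(state: bytes) -> bytes:
    """The fixed permutation pi on 200-byte states."""
    # lanes[x][y] is the little-endian 64-bit word at byte offset 8*(x + 5*y).
    lanes = [[int.from_bytes(state[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
              for y in range(5)] for x in range(5)]
    lfsr = 1
    for _ in range(24):
        # theta
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: round constant bits generated by an 8-bit LFSR
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) & 0xFF
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    out = bytearray(STATE_BYTES)
    for x in range(5):
        for y in range(5):
            out[8 * (x + 5 * y):8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return bytes(out)


def pad(message: bytes) -> bytes:
    # Assumed SHA-3 padding: always adds between 1 and RATE_BYTES bytes.
    padding = bytearray(RATE_BYTES - len(message) % RATE_BYTES)
    padding[0] ^= 0x06
    padding[-1] ^= 0x80
    return bytes(message) + bytes(padding)


def sha3_256_sketch(message: bytes) -> bytes:
    padded = pad(message)
    h = bytes(STATE_BYTES)  # h_{-1} = 0^1600
    for i in range(0, len(padded), RATE_BYTES):
        block = padded[i:i + RATE_BYTES] + bytes(STATE_BYTES - RATE_BYTES)  # m_i || 0^512
        h = keccak_f1600(bytes(p ^ q for p, q in zip(h, block)))
    return h[:32]  # first 256 bits of the state after the last permutation


print(sha3_256_sketch(b"abc") == hashlib.sha3_256(b"abc").digest())
```

The last 512 bits of the state (the capacity) are never XORed with message bits and never revealed, which is the "large hidden internal state" that lets a bijective $\pi$ be used safely.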

Which is better—HAIFA or sponge? It depends on whom you ask! There's no evidence to suspect any weakness in BLAKE2 or SHA-3. Maybe it's easier to make something faster out of HAIFA—certainly BLAKE2 is much faster than SHA-3. Maybe it's easier to prove security reductions of compositions to primitives when the primitive is a single fixed permutation instead of a PRF made out of a PRP.

Note that in all of these compositions one of the basic components is a permutation—$E_k$, $E_{k,t}$, or $\pi$—but the fact that the component is reversible doesn't mean there's any security problem in the composition. Knowing the first 256 bits of $\pi(s)$ for a random permutation $\pi\colon \{0,1\}^{1600} \to \{0,1\}^{1600}$ doesn't help you to guess a 256-bit secret $s$: you'd have to fill in 1344 bits you don't know, which is unimaginably harder than just guessing $s$ in the first place!
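
A scaled-down count illustrates why truncation hides the input even though the permutation is reversible. This is a toy sketch under assumed parameters: a 12-bit random permutation stands in for the 1600-bit $\pi$, and a 4-bit prefix stands in for the 256 revealed bits.

```python
import random
from collections import Counter

N_BITS = 12     # toy state width standing in for 1600
SHOWN_BITS = 4  # revealed prefix standing in for 256


def random_permutation(n_bits: int, seed: int = 0) -> list:
    """A seeded random permutation of all n_bits-bit values, as a lookup table."""
    table = list(range(1 << n_bits))
    random.Random(seed).shuffle(table)
    return table


def candidates_per_prefix(perm: list, n_bits: int, shown_bits: int) -> Counter:
    """Count how many inputs map to each revealed prefix (top bits of the output)."""
    hidden = n_bits - shown_bits
    return Counter(y >> hidden for y in perm)


counts = candidates_per_prefix(random_permutation(N_BITS), N_BITS, SHOWN_BITS)
print(set(counts.values()))  # {256}: every prefix has 2**(12 - 4) candidate inputs
```

For any permutation, each revealed prefix is consistent with exactly $2^{n - t}$ inputs, so at full scale an attacker inverting $\pi$ must first guess the $1600 - 256 = 1344$ missing bits.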

score 0 · Answer 2 · answered Sep 17 '17 at 21:18

The problem with your question is that you've generalized one formula across at least two dissimilar architectures. (Which hash does bullet point 3 refer to?)

SHA-1 and SHA-2 use the Merkle–Damgård (MD) construction, whilst SHA-3 absorbs its input with a novel "sponge" design. The two cannot be sensibly compared with respect to a compression behaviour or metric; the security parameters are totally different. One type of architecture is not necessarily better than another, unless you figure that the most recent must be the best. You need to decide on a basic hash primitive, be it conventional, sponge, or something new. A new fundamental building block would be fascinating.

You will then find that $d$ follows naturally.

And now a warning: it's hard to develop a secure cryptographic hash. Just because you can't invert it doesn't mean that crypto geeks can't either; they almost certainly can. The only functional (non-educational) use for a DIY hash is as a randomness extractor, where invertibility is acceptable.
