
(I updated the title, as I think there was some confusion as to the question)

Here's the question:

For example, suppose I have a byte stream that is 64K bytes long, and there are about 16*8 = 128 bits worth of entropy randomly dispersed in that byte stream.

However, if I SHA256 that byte stream, I will now only have 32 bytes total rather than 64K bytes.

Some information has been lost of course, but perhaps all of the entropy is retained?

Another way of looking at the question: is the entropy of SHA256(10GB with 16 bytes of entropy) equal to the entropy of SHA256(16 bytes with 16 bytes of entropy), and if not, how much exactly has been lost?

I'm having a hard time finding any literature that estimates the entropy loss, just a lot of hand-waving by various crypto engineers that it's all good.

Here's an algorithmic way of looking at it:

#!/bin/sh
# Hash a secret once, then repeatedly feed the digest (plus a constant pad)
# back through the hash, printing each intermediate value.
HASHV=`echo "<random secret>" | sha256`
echo "$HASHV"
while true
do
    HASHV=`echo "$HASHV 00000000000000000000000000000000" | sha256`
    echo "$HASHV"
done

Will the entropy of HASHV decrease over time?

Anyone got anything specific? (refs to papers, books, etc are grand)


Blaze

1 Answer


A simple way to imagine the effect of the hash function is as a truncation. A "good" hash function ought to behave like a random oracle. If your source has $s$ bits of entropy, this means that the source can somehow take $2^s$ possible values. When processed with a random oracle with an $n$-bit output, you force those $2^s$ input values into $2^n$ possible outputs.
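
For concreteness, here is the balls-into-bins estimate behind that picture (a sketch, assuming the hash is modelled as a uniform random oracle): if $2^s$ distinct inputs are mapped independently and uniformly onto $2^n$ possible outputs, the expected number of distinct outputs is

$$E[\text{distinct outputs}] = 2^n\left(1 - \left(1 - 2^{-n}\right)^{2^s}\right) \approx 2^n\left(1 - e^{-2^{s-n}}\right).$$

When $s$ is much smaller than $n$ this is essentially $2^s$, which is the "all entropy preserved" regime; the interesting cases are described below.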

When $s$ is smaller than $n/2$, it is expected that the hash function will produce $2^s$ distinct values, and all your $s$ bits of entropy are preserved. When $s$ reaches $n/2$, collisions begin to appear, and each collision means a tiny fraction of entropy lost. You still preserve most of it, though. When $s$ reaches $n$ (e.g. you hash an input with 256 bits of entropy with SHA-256), then it is expected that you get about $0.6\cdot 2^s$ distinct outputs (you lose about one third of the inputs to collisions), so the resulting entropy will be a bit more than $n - 1$ bits. When $s$ is higher than $n$, the output entropy will still rise, but it can never exceed $n$ bits: you cannot have more than $2^n$ distinct outputs of $n$ bits...
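
Plugging $s = n$ into the estimate above gives those figures (a rough check; counting distinct outputs as if they were equally likely slightly overestimates the true Shannon entropy, but the conclusion is the same):

$$2^n\left(1 - \left(1 - 2^{-n}\right)^{2^n}\right) \approx \left(1 - e^{-1}\right)2^n \approx 0.63\cdot 2^n, \qquad \log_2\!\left(0.63\cdot 2^{256}\right) \approx 255.3.$$

So with $s = n = 256$ you end up with roughly 255 to 255.3 bits, i.e. "a bit more than $n-1$ bits".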

To sum up, when hashing your input, you preserve almost all of your input entropy, up to at most the output size of the hash function. To make things simple: when you hash an input of entropy $s$ bits with SHA-256, you get an output entropy at least equal to the lesser of $s-1$ and 255 bits.
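
If you want to see this effect directly, here is a scaled-down experiment in the same shell style as the question (a sketch assuming GNU coreutils' sha256sum; substitute the BSD sha256 if that is what you have). It takes $s = n = 16$: hash $2^{16}$ distinct inputs, truncate each digest to 16 bits, and count the distinct truncated values.

#!/bin/sh
# Scaled-down model of "s = n": 2^16 distinct inputs, 16-bit outputs.
# Truncating SHA-256 to 4 hex digits stands in for an ideal 16-bit hash.
seq 0 65535 |
while read i
do
    printf '%s\n' "$i" | sha256sum | cut -c1-4
done |
sort -u | wc -l
# The random-oracle estimate predicts about (1 - 1/e) * 65536 ~ 41400
# distinct values, i.e. roughly 15.3 bits of entropy out of a possible 16.

This is slow (one sha256sum process per input) but makes the count easy to verify.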

Thomas Pornin