
I know there are different statistical test suites out there (NIST, Dieharder, etc.), which all take different approaches to analyzing entropy.

What I'm having a hard time finding is any particular literature that describes how to go from those test results to actual bits of entropy in the byte stream.

How do you go from the p-value of a byte stream (say, 100 MB long) to bits of entropy?

Updated:

As mentioned below, you can't estimate the entropy of the output itself (that's not possible); you can only do it by understanding the physics of the underlying process that generates the entropy.


2 Answers


Entropy is a function of the distribution. That is, the process used to generate a byte stream is what has entropy, not the byte stream itself. If I give you the bits 1011, that could have anywhere from 0 to 4 bits of entropy; you have no way of knowing that value.

Here is the definition of Shannon entropy. Let $X$ be a random variable that takes on the values $x_1,x_2,x_3,\dots,x_n$. Then the Shannon entropy is defined as

$$H(X) = -\sum_{i=1}^{n} \operatorname{Pr}[x_i] \cdot \log_2\left(\operatorname{Pr}[x_i]\right)$$

where $\operatorname{Pr}[\cdot]$ represents probability. Note that the definition is a function of a random variable (i.e., a distribution), not a particular value!

So what is the entropy in a single flip of a coin? Let $F$ be a random variable representing such. There are two events, heads and tails, each with probability $0.5$. So, the Shannon entropy of $F$ is:

$$H(F) = -(0.5\cdot\log_2 0.5 + 0.5\cdot\log_2 0.5) = -(-0.5 + -0.5) = 1.$$

Thus, $F$ has exactly one bit of entropy, which is what we expected.
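
To make the formula concrete, here is a minimal Python sketch that evaluates $H(X)$ for a given distribution and reproduces the coin-flip result (the function name and the example distributions are purely illustrative):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H(X) in bits, given the probability of each event."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Fair coin: two events with probability 0.5 each -> 1 bit.
print(shannon_entropy([0.5, 0.5]))       # 1.0

# Heavily biased coin: far less than 1 bit, even though it produces
# the same kind of output (heads/tails).
print(shannon_entropy([0.99, 0.01]))     # ~0.08

# Uniform byte: 256 equally likely values -> 8 bits.
print(shannon_entropy([1 / 256] * 256))  # 8.0
```

Note that the function takes a distribution, not a sequence of observed bytes; that is exactly the point above.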

So, to find how much entropy is present in a byte stream, you need to know how the byte stream is generated and the entropy of any inputs (in the case of PRNGs). Recall that a deterministic algorithm cannot add entropy to an input, only take it away, so the combined entropy of the inputs to a deterministic algorithm is an upper bound on the entropy of its output.
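
As a rough illustration, with SHA-256 standing in for an arbitrary deterministic algorithm: a one-bit input still yields only two possible outputs, so the output carries at most one bit of entropy no matter how random-looking the digests are:

```python
import hashlib

# A secret with at most 1 bit of entropy (it is either b"0" or b"1").
possible_inputs = [b"0", b"1"]

# SHA-256 is deterministic: each input maps to exactly one output, so the
# set of reachable outputs can be no larger than the set of inputs.
possible_outputs = {hashlib.sha256(s).hexdigest() for s in possible_inputs}

print(len(possible_outputs))  # 2 -> still at most 1 bit of entropy
```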

If you're using a hardware RNG, then you need to know the probabilities associated with the data it gives you, else you cannot formally find the Shannon entropy (though you could give it a lower bound if you know the probabilities of some, but not all, events).

But note that in any case, you are dependent on the knowledge of the distribution associated with the byte stream. You can do statistical tests, like you mention, to verify that the output "looks random" (from a certain perspective). But you'll never be able to say any more than "it looks pretty uniformly distributed to me!". You'll never be able to look at a bitstream without knowing the distribution and say "there are X bits of entropy here."

Reid

There are some tests out there: Draft Special Publication 800-90B from the National Institute of Standards and Technology (NIST).

In particular, the min-entropy, partial collection, Markov (useful for non-IID sources), collision, and compression tests.

The issue with the Markov test is the constraint it places on the bit size of the samples.
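
As a rough sketch of the flavor of those estimators, here is a simplified most-common-value style min-entropy estimate in Python; the actual SP 800-90B procedure adds a confidence-interval correction and further checks, so treat this only as an illustration:

```python
import math
from collections import Counter

def mcv_min_entropy(samples):
    """Simplified most-common-value estimate: min-entropy per sample, based on
    the frequency of the most common value (no confidence adjustment, unlike
    the real SP 800-90B estimator)."""
    counts = Counter(samples)
    p_max = max(counts.values()) / len(samples)
    return -math.log2(p_max)

# Toy example on a perfectly uniform byte sequence (upper bound: 8 bits/byte).
data = bytes(range(256)) * 100
print(mcv_min_entropy(data))  # 8.0 for this toy input
```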

Updated:

These tests only measure the output. They don't measure the underlying entropy that was used to generate the data. You can take completely non-random data and make it look perfectly random according to these tests (or any tests).
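
For example (a rough sketch), hashing a simple counter with SHA-256 yields output whose naive per-byte entropy estimate looks nearly perfect, even though the source is completely predictable and its real entropy is essentially zero:

```python
import hashlib
import math
from collections import Counter

# Zero-entropy "source": a predictable counter run through SHA-256.
stream = b"".join(hashlib.sha256(i.to_bytes(8, "big")).digest()
                  for i in range(100_000))

# A naive per-byte Shannon estimate of the output looks close to the maximum
# of 8 bits/byte, yet anyone who knows the construction can reproduce the
# stream exactly.
counts = Counter(stream)
naive = -sum(c / len(stream) * math.log2(c / len(stream)) for c in counts.values())
print(round(naive, 4))  # close to 8.0
```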

Blaze