
This string, "Au+u1hvsvJeEXxky", has a Shannon entropy of 3.75 and a length of 16.

The binary form of the string (derived from the ASCII table) is:

01000001 01110101 00101011 01110101 00110001 01101000 01110110 01110011 01110110 01001010 01100101 01000101 01011000 01111000 01101011 01111001

This string has a Shannon entropy of 1.39 and a length of 128.

So while both strings are interchangeable, their entropy differs. Why is that? And are both strings equally strong in resisting a brute-force guessing attack?

thnx.

joop s

3 Answers


Shannon entropy is a property of a random variable. It is defined as

$$H = -\sum_{i=1}^n {p_i \log_2 p_i}$$

where $p_i$ is a non-zero probability for each possible outcome. Note how the equation doesn't quite make sense for a single possible value. When people say a string (a password, message, or file) has a certain entropy they mean the string was sampled from a distribution with that entropy. This is just informal shorthand. Strings and values do not have entropy themselves.
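For concreteness, a minimal Python sketch of this formula (the function name and the example distributions are my own, purely for illustration):

    from math import log2

    def shannon_entropy(probs):
        """Shannon entropy, in bits, of a probability distribution."""
        return sum(-p * log2(p) for p in probs if p > 0)

    # A fair six-sided die: log2(6), about 2.585 bits per roll.
    print(shannon_entropy([1/6] * 6))

    # A die heavily biased towards six: far less entropy per roll.
    print(shannon_entropy([0.02] * 5 + [0.90]))

    # A degenerate "distribution" with one certain outcome: 0 bits.
    print(shannon_entropy([1.0]))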

Two different processes with two different probability distributions can produce the same output with different probabilities. It is incorrect to say a string has a certain amount of entropy with no context. There is no way to tell if a string posted on the internet has a certain Shannon entropy or min-entropy.


Example: If I tell you I rolled 6 twenty times in a row on a six-sided die then I cannot ask you to tell me how much entropy is in a single roll of the specific die I used. It could be the case that the die is highly biased towards six, meaning it has relatively low entropy, or it could be the case I used a fair die and this outcome just happened by chance. (In which case it's not special despite the fact that it looks like I'm lying about the results. By definition this outcome is just as probable as every other list of twenty rolls I could give you.)


If you apply a lossless transformation to a random variable, you do not change the entropy of the result. If one outcome has some probability then the transformed outcome has the same probability.
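Writing $X$ for the random variable and $T$ for the lossless (injective) encoding, this is because $\Pr[T(X) = T(x)] = \Pr[X = x]$ for every outcome $x$, so

$$H(T(X)) = -\sum_x \Pr[T(X) = T(x)] \log_2 \Pr[T(X) = T(x)] = -\sum_x \Pr[X = x] \log_2 \Pr[X = x] = H(X).$$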

If I'm using shorthand and each of the following is a different encoding of the same 24-bit value, then I can say each string has equal entropy.

01000001 01110101 00101011

Zero One Zero Zero Zero Zero Zero One Zero One One One Zero One Zero One Zero Zero One Zero One Zero One One

ABAAAAAB ABBBABAB AABABABB

THTTTTTH THHHTHTH TTHTHTHH

Tails Heads Tails Tails Tails Tails Tails Heads Tails Heads Heads Heads Tails Heads Tails Heads Tails Tails Heads Tails Heads Tails Heads Heads

HTHHHHHT HTTTHTHT HHTHTHTT

Heads Tails Heads Heads Heads Heads Heads Tails Heads Tails Tails Tails Heads Tails Heads Tails Heads Heads Tails Heads Tails Heads Tails Tails

Whatever source told you that the two strings have different entropy is wrong. Either you mean they were generated the same way, only differing in the final encoding they use. (In which case they should have equal entropy.)

Or we just have these two """equivalent""" strings but we know nothing about how they were generated. Then we must say that we don't know what entropy they have. (Or more properly, that we don't know the entropy of the process used to generate those strings.)

I am guessing you used an online calculator to determine entropy. Those cannot determine the entropy of a (process that generates a) string as I already explained. The results of such calculators are not usable.

Or you may have used some formula that estimates the entropy in normal English prose or in a password as a function of the length of the string (and maybe some other details). This has the same problem as the online calculators.

Future Security

I'm going to telepathically make a wild guess.

My wild guess is that you generated the string Au+u1hvsvJeEXxky by asking a computer to choose sixteen characters independently and uniformly at random from the alphabet consisting of a-z, A-Z, 0-9, +, and /, which is, coincidentally, the base64 alphabet.

The distribution on individual characters has 6 bits of entropy per character. The distribution on strings of sixteen characters chosen independently from this distribution is sixteen times that, namely 96 bits of entropy per string.
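Spelled out: the alphabet has 64 symbols, so

$$\log_2 64 = 6 \text{ bits per character}, \qquad 16 \times 6 = 96 \text{ bits per string},$$

or equivalently, there are $64^{16} = 2^{96}$ equally likely strings.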

If I, as the adversary knowing this information about your process but not knowing the particular outcome, tried to guess your string, I would have a $1/2^{96}$ chance of getting it right. If I kept trying guesses, the expected number of guesses before I get it right (that is, the average number of guesses over all possible values of your string) is $2^{95}$. That's a lot of guesses.
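(Where $2^{95}$ comes from: enumerating the $2^{96}$ equally likely candidates in any fixed order, the correct one sits at an average position of $(2^{96} + 1)/2 \approx 2^{95}$.)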

However, as the adversary, I often have more powers than that. Often, what I have is some hash of your string $H(s)$, and not just yours but $H(s_0), H(s_1), \ldots, H(s_{9999})$ of ten thousand different users who all used the same process. My goal as the adversary is to find at least one of the strings $s_i$—chances are if I can get a foothold by compromising one user, I can use that to compromise more users in a network.

If I do this intelligently, the cost of my attack (measured in joules, or USD, or EUR) is significantly less than $2^{95}$ times the cost of testing a single guess by evaluating $H$. With the help of Oechslin's rainbow tables, parallelized $p$ ways for $p$ of at least a hundred million, I can share work between attacking many targets at once, and it will cost only about $2^{82}$ evaluations of $H$, in the time for about $2^{82}/p \leq 2^{56}$ sequential evaluations. The Bitcoin network spends this cost in about a year; $2^{56}$ nanoseconds is about two years.

That's a high cost, and a long time to wait, but it's absolutely within the budget of a major corporation or government. I would recommend making sure that the cost is around $2^{128}$ evaluations of $H$ so that it is completely out of reach of foreseeable human engineering. There are three ways to do this:

  1. Have every user choose from ${\geq}2^{256}$ possibilities uniformly at random. For example, instead of sixteen-character base64 strings, use forty-three-character base64 strings. Or use sequences of twenty words chosen independently uniformly at random from a word list of 7776 words.

  2. Store a salt unique to each user, and use $(\sigma_i, H(\sigma_i, s_i))$ where $\sigma_i$ is the $i^{\mathit{th}}$ user's salt and $s_i$ is the $i^{\mathit{th}}$ user's secret. This thwarts rainbow tables and prevents the adversary from sharing work between multiple users.

  3. Use a password hash that is costly to evaluate like scrypt or argon2id.

Method (1) is something the users can do. Alternatively, the computer can choose the user's secret for them, and ask the users to remember it. Methods (2) and (3) are things that whatever uses the secrets can do—something that the engineers of an application can put into their system to defend it against brute force attacks even if some users choose secrets poorly like human-chosen passwords.
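For illustration, here is a minimal Python sketch of what methods (1), (2) and (3) might look like; the function names and the scrypt cost parameters are my own illustrative choices, not anything mandated above:

    import hashlib
    import os
    import secrets

    # The 64-symbol alphabet discussed above.
    ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "abcdefghijklmnopqrstuvwxyz"
                "0123456789+/")

    def generate_secret(length=43):
        """Method (1): choose each character independently and uniformly.
        43 characters from 64 symbols give 43 * 6 = 258 bits of entropy."""
        return "".join(secrets.choice(ALPHABET) for _ in range(length))

    def hash_secret(secret):
        """Methods (2) and (3): a per-user random salt plus a costly,
        memory-hard password hash (scrypt).  Cost settings are illustrative."""
        salt = os.urandom(16)
        digest = hashlib.scrypt(secret.encode(), salt=salt,
                                n=2**14, r=8, p=1, dklen=32)
        return salt, digest

    def verify_secret(secret, salt, digest):
        """Recompute with the stored salt and compare in constant time."""
        candidate = hashlib.scrypt(secret.encode(), salt=salt,
                                   n=2**14, r=8, p=1, dklen=32)
        return secrets.compare_digest(candidate, digest)

    s = generate_secret()
    salt, digest = hash_secret(s)
    assert verify_secret(s, salt, digest)

The salt is what stops a rainbow table from amortizing work across many users; the costly hash is what makes each individual guess expensive.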

All of the numbers above are premised on the model I telepathically guessed. Not everyone guesses the same model. The ent utility suggested by Paul Uszak and an entropy calculator on the web suggested by conchild instead guess the following probability distribution on symbols: probability 1/8 for u and v, probability 1/16 for {+, 1, A, E, J, X, e, h, k, s, x, y}, 0 probability for any other character. They suggested this by (a) counting the number of appearances of each character in your string, and (b) dividing by the length of your string. I, instead, used knowledge of common protocols on the internet to guess that you are using the base64 alphabet. We all assumed independence between characters. But nobody here knows anything about the process you used.
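For what it's worth, here is a small Python sketch (my own) of that frequency-count estimate. It reproduces the 3.75 bits/character for the sixteen-character string, while the figure for the binary form depends on exactly which symbols (for instance the separating spaces) get counted:

    from collections import Counter
    from math import log2

    def frequency_entropy_estimate(s):
        """Per-symbol Shannon entropy estimated from character counts,
        which is what typical online calculators and ent compute."""
        n = len(s)
        counts = Counter(s)
        return sum(-(c / n) * log2(c / n) for c in counts.values())

    password = "Au+u1hvsvJeEXxky"
    bits = "".join(f"{ord(ch):08b}" for ch in password)

    print(frequency_entropy_estimate(password))  # 3.75 bits per character
    print(frequency_entropy_estimate(bits))      # 1.0 bits per bit (64 zeros, 64 ones)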

Squeamish Ossifrage

You may have seen an equation similar to:-

$$ H = -K\sum_{i=1}^n {p_i \log p_i} $$

It comes from Shannon's A Mathematical Theory of Communication, 1948 and is the classic definition of information entropy. The cryptographic security notion of entropy is different to this, but your calculations are based on Shannon's classic definition (Note 1.). Ignore the scalar K for this question, but feel free to follow up in Shannon's work.

What you will not have seen much of on this site is the definition of $i$. This is the crux of your question, and why Shannon entropy is tricky to measure accurately. $i$ relates to the alphabet used, and even that is confusing as the alphabet may be over multiple characters, such as "qu" in English. Bits, bytes and words may be tallied differently and produce different entropy measures.

To demonstrate my answer with a more substantial dataset, consider a 4 MB file of sequential 32-bit integers from 0 $\rightarrow$ 999,999. Measure its entropy using different bit widths for $i$:-

    i (bits)    Hrate (bits/byte)    Htotal (bits)
       1              7.1                28M
       8              6.3                25M
      32              3.5                14M
      ???             0.0031             12,464

where ??? is unknown, as Htotal was obtained via a proprietary compression algorithm (paq8px_v112). This example effectively changes the alphabet (read $i$) used in digesting the file, just as the question does with a swap to ASCII. Compressors exploit correlation over many bits, which is why the resulting measure is so small. But the result is as expected, in that the entropy of the first million integers is clearly low.
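A rough Python sketch of the first two tallies (my own code; it builds the file in memory, assumes little-endian packing, which does not affect these pooled counts, and makes no attempt to reproduce the 32-bit or compressor rows):

    from collections import Counter
    from math import log2
    import struct

    def entropy_per_symbol(symbols):
        """Frequency-count Shannon entropy estimate, in bits per symbol."""
        n = len(symbols)
        counts = Counter(symbols)
        return sum(-(c / n) * log2(c / n) for c in counts.values())

    # The 4 MB file: the integers 0..999,999 packed as 32-bit values.
    data = b"".join(struct.pack("<I", k) for k in range(1_000_000))

    # i = 1: tally individual bits, then express as bits per byte.
    bits = "".join(f"{byte:08b}" for byte in data)
    print(8 * entropy_per_symbol(bits))   # roughly 7.1 bits/byte

    # i = 8: tally whole bytes.
    print(entropy_per_symbol(data))       # roughly 6.3 bits/byte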

Anyway, that's why your entropy numbers come out differently.


Note 1. This subtle distinction in entropy definitions will cause us all problems. Some commentators will suggest that your strings do not have any entropy whatsoever, in contrast with accepted information theory and Shannon himself. A good example of entropy measures of fixed text is this entropy rate graph from Thomas Schürmann and Peter Grassberger, Entropy estimation of symbol sequences.

[entropy rate graph from Schürmann and Grassberger]

Paul Uszak