
If part of the password is a whole regular English word, does the entropy of that part depend on the number of English words in existence, the number of English words known by the choosing algorithm, or the number of English words assumed by the attacker?

Does the language matter? Is the average entropy per word in German, French, Italian, or Spanish significantly different from the average entropy in English?

Does a numeric digit always have an entropy of $\log_2(10) = 3.321928$?

wythagoras

4 Answers


Entropy is a measure of what the password could have been, so it does not really relate to the password itself but to the selection process.

We define the entropy as the value $S$ such that the best guessing attack will require, on average, $S/2$ guesses. "Average" here is an important word. We assume that the "best attacker" knows all about which passwords are more probable to be chosen than others, and will do his guessing attack by beginning with the most probable passwords. The model is the following: we suppose that the password is generated with a program on a computer; the program is purely deterministic and uses a cryptographically strong PRNG as its source of randomness (e.g. /dev/urandom on a Linux system, or CryptGenRandom() on Windows). The attacker has a copy of the source code of the program; what the attacker does not have is a copy of the random bits that the PRNG actually produced.
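As a quick sanity check on the "$S/2$ guesses on average" claim, here is a minimal Python sketch (the toy setup and names are mine, not part of the answer) simulating an attacker who tries uniformly chosen passwords in a fixed order:

    import random

    # Passwords are drawn uniformly from S possibilities; the attacker
    # tries candidates in a fixed order. Under a uniform distribution no
    # ordering is better than another, so about S/2 guesses are needed
    # on average.
    S = 2000
    trials = 100_000
    total_guesses = 0
    for _ in range(trials):
        password = random.randrange(S)   # index of the chosen password
        total_guesses += password + 1    # attacker tries 0, 1, 2, ... in order
    print(total_guesses / trials)        # close to S/2 (exactly (S+1)/2 in the limit)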

Entropy is easy to compute if the random parts of the selection process are uniform (e.g. with dice or a computer with a good PRNG -- as opposed to a human being making a "random" choice in his head). For instance, if you have a list of 2000 words and choose one among them (uniformly), then the entropy is $S = 2000$. Entropy is often expressed in bits: an entropy of $n$ bits is what you get out of a sequence of $n$ bits which have been selected uniformly and independently of each other (e.g. by flipping a coin for each bit); it is a simple logarithmic scale: "$n$ bits of entropy" means "entropy is $S = 2^n$" (and the attack cost is then $2^{n-1}$ on average).

If you think of a password as two halves chosen independently of each other, then the total entropy is the product of the entropies of each half; when expressed with bits, this becomes a sum, because that's what logarithms do: they transform multiplications into sums. So if you take two words, randomly and independently (i.e. never ruling out any combination, even if the two words turn out to be the same), out of a list of 2000, then the total entropy is $2000\cdot2000 = 4000000$. Expressed in bits, each word implies an entropy of about 11 bits (because $2^{11}$ is close to $2000$), and the total entropy is close to 22 bits (and, indeed, $2^{22}$ is close to $4000000$).

This answers your question about digits: a decimal digit has entropy $S = 10$, as long as it is chosen randomly and uniformly and independently from all other random parts of the password. Since $10 = 2^{3.321928...}$, each digit adds about 3.32 extra bits to the entropy.
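These calculations are easy to reproduce; a minimal Python sketch (the 2000-word list is the running example above, not a real dictionary):

    import math

    words = 2000
    one_word = math.log2(words)        # ≈ 10.97 bits per word
    two_words = math.log2(words ** 2)  # ≈ 21.93 bits for two independent words
    digit = math.log2(10)              # ≈ 3.32 bits per uniform decimal digit
    # Logarithms turn the product of possibility counts into a sum of bits:
    assert abs(two_words - 2 * one_word) < 1e-9
    print(one_word, two_words, digit)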

If a human being is involved in the selection process, then calculating the entropy becomes much more difficult. For instance, if a human chooses two digits and the first digit is '4', then the probability that the second digit is '2' is considerably higher than $\frac{1}{10}$. It could be argued that this is also difficult for the attacker: he will have more work to do to sort the potential passwords so that he begins with the most probable. But this becomes a psychological problem, where the attacker tries to model the thinking process of the user, and we try to model the thinking process of the attacker: it will be hard to quantify things with any decent precision.

Thomas Pornin

Information entropy is closely related to the "predictability" of that information.

When we talk about password entropy, we are usually concerned with how easy it is for password-cracking software to predict a password. The more passwords the software has to try before guessing the password, the larger the entropy.

You can check software like John the Ripper (http://www.openwall.com/john/). It's free, and you can also download free word lists for 20 different languages (which speaks to your question about different languages).

Using this entropy concept, it's easy to see that a digit in the middle of a word probably has more entropy than a digit at the end of a word. John will try word-plus-one-or-two-digits combinations pretty early in its attempts, so something like crypto5 has less entropy than cryp5to even though it uses the same characters (see the sketch below).
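To make the ordering effect concrete, here is a toy Python model of a wordlist attack with simple mangling rules (this illustrates the idea only; it is not John the Ripper's actual rule engine):

    from itertools import product
    import string

    words = ["crypto", "secret", "puzzle"]   # toy word list

    def candidates(words):
        # Rule 1: bare words; rule 2: word + one trailing digit;
        # rule 3: one digit inserted at each interior position.
        for w in words:
            yield w
        for w, d in product(words, string.digits):
            yield w + d
        for w in words:
            for i in range(1, len(w)):
                for d in string.digits:
                    yield w[:i] + d + w[i:]

    order = list(candidates(words))
    print(order.index("crypto5"))   # 8  -- found early (trailing digit)
    print(order.index("cryp5to"))   # 68 -- found much later (inserted digit)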


Basically, any password is a string of letters, and its entropy can be easily calculated. For example, you can use a Shannon entropy calculator, or do it by hand with a scientific calculator.

Entropy is calculated based on the frequencies of letters in the password; it does not care about the language used. Diverse passwords with many different letters are thus preferred, as the entropy will be larger. Words are treated equally if they have the same proportions of letters (e.g. English 'and' and Indonesian 'dan' have the same entropy). This means, contrary to what Paulo said earlier, that 'cryp5to' and 'crypto5' have the same entropy; entropy does not care about letter order. If you do not believe this, try it yourself by entering similar examples into http://www.shannonentropy.netmark.pl
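Here is what that claim looks like in code: a small Python function (my own sketch of the quantity this answer describes, matching the linked calculator as I understand it) that computes the character-frequency entropy of the string itself:

    from collections import Counter
    from math import log2

    def shannon_bits_per_char(s):
        # Shannon entropy of the string's letter frequencies,
        # in bits per character.
        n = len(s)
        return -sum(c / n * log2(c / n) for c in Counter(s).values())

    # Anagrams have identical letter frequencies, hence identical entropy:
    print(shannon_bits_per_char("crypto5"))  # ≈ 2.807
    print(shannon_bits_per_char("cryp5to"))  # ≈ 2.807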

Of course, if an attacker assumes that your password is a word rather than a random string (most people use words), he will use a dictionary to break your password and will break it sooner. But his knowledge that you used a word rather than a random string is itself information, which decreases the entropy: he used external information to lower the entropy needed to break it.

"Does the entropy of that part depend on the number of English words in existence, ..." NO, it depends on all the combinations which can be done based on password length and diversity.

"... the number of English words known by the choosing algorithm..." it may affect the algorithm, but not from an entropy point of view, e.g. if this algorithm will be: just try all words from dictionary in which there is no crypto5, but crypto is present, it fails, but if the algorithm is more clever, for instance take all words from dictionary and mutate them by random letter or number it will finally find crypto5.

" ... the number of English words assumed by the attacker?" it may affect the algorithm, but not from an entropy point of view, see above, and remember you do not know who and how will hack your password, so you cannot assume anything like I will use different language, because it has more words, but on the other hand you can use different language if it has more letters (and you will use them in the password).

"Does the language matter, is the average entropy per word in German, French, Italian, or Spanish significantly different from the average entropy in English?" You can calculate entropy for different languages (actually this is what Shannon did), but again it does not influence the entropy of the password.

"Does a numeric digit always have an entropy of $\log_2(10) = 3.321928$?" No, base 2 is the most common, and it has nothing to numeric digits, it can be used also to letters or any other signs, see Wikipedia [information theory entropy]


The entropy of a randomly generated password is based on the size of the character library (i.e. the range of valid characters) and the length of the password (i.e. the total number of characters in it), with no other constraints (i.e. the random process may produce a password consisting of all the same character, however unlikely that is).

In such a setup, the entropy will be $\log_2(\mathrm{library}^{\mathrm{length}})$; see below for examples and Claude Shannon's formula.

The entropy "H" of a discrete random variable "X" is defined as:

${\\H(X) = - \sum_{i=1}^{n} P(x_i) \ log_b P(x_i) }$

If the English word is a mnemonic for some underlying index value or other code value (such as ASCII or UTF-8), then I don't think there is a difference, so long as it was chosen randomly: its entropy will depend entirely on the range of words or letters it was chosen from. There is a difference, though, between the user choosing a word and randomly chosen letters that happen to spell a word when read from left to right.

Here is a simple explanation regarding password entropy, depending on what needs to be measured. Let's first assume the following two points:

  1. The password has a specific length (its number of characters, some or all of which may be identical and/or repeat consecutively).
  2. Each character in the password has been chosen from a single common library or "range" of unique characters, randomly and using a cryptographically secure process.

Formulas:

  • log2(possible combinations) = overall password entropy
  • range^length = possible combinations (equivalently, 2^(overall password entropy))
  • log2(range) = entropy per character
  • entropy per character × length = overall password entropy

Example test:

  • Range = 2048 unique character values (or 2048 unique words)
  • Length = 12 characters (or 12 words, some or all of which may repeat)
  • Possibilities = 2048^12 = 5444517870735015415413993718908291383296
  • Overall entropy = log2(possibilities) = 132 bits
  • Entropy per character (or per word, if words are used) = log2(2048) = 11 bits

Another way to double-check, roughly (depending on the precision available when dealing with decimals rather than whole-number results): 2^(log2(range) × length) == 2^entropy.

In Python 3 (after import math): 2**(int(math.log2(2048))*12) == 2**132 evaluates to True. An expanded version follows.
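Expanded into a short runnable script (the variable names are mine):

    import math

    range_size = 2048                # unique values per position
    length = 12                      # number of positions
    possibilities = range_size ** length
    entropy_per_char = math.log2(range_size)    # 11.0
    overall_entropy = math.log2(possibilities)  # 132.0
    assert possibilities == 2 ** 132
    assert 2 ** (int(entropy_per_char) * length) == 2 ** 132
    print(possibilities, overall_entropy)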


P.S. I think frequency analysis is useful here in two situations: (1) the password was chosen deterministically, without a cryptographically secure process; and/or (2) the characters in the library are not distinctly unique (one or more duplicates exist, or many characters share strong similarities), or there are other unknown leaks of information in the library set.

Steven Hatzakis