Calculating maximum plaintexts without birthday collisions given a probability, when the encryption scheme has multiple parts?

Question

I'm sorry if the answer to this is actually simpler than it seems to me.

I'm running AES-GCM to encrypt some data keys, but I don't actually know how to go about calculating the probability of collisions for my setup, or how to derive the maximum number of plaintexts I can encrypt without violating NIST standards on key/IV reuse. I understand the math behind the birthday problem specifically (at least, the bit where each successive student has an (n-1)/365 chance of a matching birthday), but I don't know how I can apply it practically to the case below:

How would I find out the maximum number of plaintexts I can encrypt while remaining below a given collision probability for a scheme such as the following?

1. For each plaintext, generate a 16-byte random value, a 32-byte random salt, and a 12-byte random IV.
2. Hash the 16-byte random value with the 32-byte salt with something like HKDF (SHA-256).
3. Encrypt a plaintext with the hash as a key and the 12-byte IV via AES-GCM.

(For clarification, I'm hoping to learn a general method I can apply to figure out the answer to any similar/modified scheme)

I know there are several methods (and even online calculators) for something like "approximate maximum keys given a 12-byte IV and maximum collision probability of 2⁻³²" (it's 2³²) and similarly for a 32-byte value (it's about 2¹¹²). But I get the feeling that even though both the hash and IV need to match to be a collision, simply multiplying numbers won't get the right number. My gut instinct says to multiply 2²⁵⁶ and 2⁹⁶ to get 2³⁵² (because you need both the IV and hash output to be the same as another set of IV+hash to get a collision) and then approximate the chance of a collision using Stirling or Taylor with 2³⁵² as the "number of days in a year" space, but that can't be right...right?

Side note: I mention the hashing/salting process as well because I feel like hashing might increase the chance of collisions due to the Pigeonhole Principle-- I'm feeding in 16 + 32 bytes of input and getting 32 bytes as output, could this possibly affect the end result?

score 0 · Accepted Answer · edited Aug 03 '24 at 07:02

Using HKDF-SHA256 on your two random values should produce a cryptographically random key. So, as I understand your question, we can just look at the probability of key / $\mathsf{IV}$ pair collisions given a random 32-byte key and a random 12-byte $\mathsf{IV}$ for each plaintext.

And I think we can look at this as the probability of collisions in a 32 + 12 = 44 byte or 352-bit key space (i.e. the key + $\mathsf{IV}$ combined).

So given $M$ plaintexts, the probability $p$ of a collision is approximately (it is an upper bound that is tight while $p$ is small):

$$p \approx \frac{M(M-1)}{2^{n+1}}$$

where $n=352$.

To calculate the number of plaintexts $M$ given a probability threshold $p$, you can use $M(M-1) \approx M^2$ for large $M$ and solve for $M$:

$$M \approx \sqrt{p \cdot 2^{n+1}}$$

This checks out for $p = 0.5, n=352$: it gives $M \approx \sqrt{2^{-1} \cdot 2^{353}} = 2^{176}$, which matches what we know from the birthday bound.
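A minimal sketch of that calculation, working in base-2 logarithms so the huge values never have to be materialized. The function name `max_messages_log2` is made up for this illustration, and the result is only as good as the $M^2 \approx M(M-1)$ approximation above.

```python
def max_messages_log2(n_bits: int, log2_p: float) -> float:
    """Return log2(M) for M ~ sqrt(p * 2^(n+1)).

    n_bits : size of the combined random space in bits (key + IV)
    log2_p : log2 of the acceptable collision probability, e.g. -32
    """
    return (log2_p + n_bits + 1) / 2


# 256-bit key + 96-bit IV, p = 1/2 (the classic birthday bound)
print(max_messages_log2(352, -1))    # 176.0
# Same space, p = 2^-32
print(max_messages_log2(352, -32))   # 160.5
# A single fixed key with a random 96-bit IV, p = 2^-32
print(max_messages_log2(96, -32))    # 32.5
```

The last call lands close to the 2³² figure quoted in the question for a 12-byte IV alone.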

aiootp · Answer 2 · 2024-08-24T11:20:10.970

(For clarification, I'm hoping to learn a general method I can apply to figure out the answer to any similar/modified scheme)

As Soatok's blog post "Blowing Out the Candles on the Birthday Bound" helpfully explains, the "optimal" bound can be found by taking the cube root of the randomness space $2^n$. The result is the total number of messages $M = 2^r = 2^{n/3}$ after which the probability $p$ of having come across a collision in the randomness space is $p \approx 2^{-r}$. The probability keeps growing with $M$ and passes $50\%$ at around $2^{n/2}$ messages.

This can be derived by summing, over all messages, the probability that each one randomly picks a context already chosen by an earlier message, out of the space of all $2^n$ possible random contexts. The first message has a $\frac{0}{2^n}$ collision probability (since no prior context has been chosen) and the final message has a $\frac{M - 1}{2^n}$ probability. The sum of these probabilities is an arithmetic progression scaled by $\frac{1}{2^n}$, and it is an upper bound that closely approximates $p$ while $p$ is small:

$$ p \approx \frac{1}{2^n} \sum_{i=1}^{M} (i-1) = \frac{1}{2^n} \left( \frac{M(M-1)}{2} \right) = \frac{M(M-1)}{2^{n+1}} $$

This is often simplified when $1 \ll M \ll 2^n$, so that $M^2 \approx M(M-1)$:

$$ p \approx \frac{M^2}{2^{n+1}} \Rightarrow M \approx \sqrt{p2^{n+1}} $$
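To get a feel for how good the estimate $\frac{M(M-1)}{2^{n+1}}$ is, it can be compared against the exact birthday probability on a space small enough to compute directly. This is only an illustrative sketch: the function names are invented here, and $n = 32$ is chosen purely so the loop finishes quickly.

```python
def exact_collision_probability(m: int, n_bits: int) -> float:
    """Exact probability that m uniform draws from 2^n_bits values collide."""
    space = 2 ** n_bits
    no_collision = 1.0
    for i in range(m):
        # The (i+1)-th draw must avoid the i values already taken.
        no_collision *= (space - i) / space
    return 1.0 - no_collision


def approx_collision_probability(m: int, n_bits: int) -> float:
    """Union-bound estimate M(M-1) / 2^(n+1)."""
    return m * (m - 1) / 2 ** (n_bits + 1)


for m in (1_000, 10_000, 65_536):
    exact = exact_collision_probability(m, 32)
    approx = approx_collision_probability(m, 32)
    print(f"M={m:>6}  exact={exact:.6g}  approx={approx:.6g}")
```

For small $p$ the two values agree closely. Near $M = 2^{n/2}$ the estimate overshoots the exact value, so it errs on the conservative side when used to pick a message limit.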

Therefore, when $M = 2^r = 2^{n/3}$:

$$ p \approx \frac{(2^{n/3})^2}{2^{n+1}} = 2^{(2n/3) - n - 1} = 2^{-(n/3) - 1} = \tfrac{1}{2} \cdot 2^{-r} \approx 2^{-r} $$

Since you have the luxury of sampling a random 256-bit key, you're very unlikely to repeat a key/nonce context. However, it's still good practice to clearly state the max $p$ value your system accepts. Because a repeat is catastrophic to the security of $\texttt{AES-GCM}$, $p$ should reflect a confidence-inspiring security margin, such as $p \le 2^{-64}$. This is safer than the maximum of $p = 2^{-32}$ that NIST specifies for a static key with random nonces.

Side note: I mention the hashing/salting process as well because I feel like hashing might increase the chance of collisions due to the Pigeonhole Principle-- I'm feeding in 16 + 32 bytes of input and getting 32 bytes as output, could this possibly affect the end result?

Indeed, the hash compresses the input randomness space, which can be a good idea. The effect to keep in mind is that the randomness space of the key can be no larger than the output space of the hash. Consequently, the total number of possible unique key/nonce contexts would be $2^{128+96}$ when choosing $\texttt{AES-128-GCM}$, and $2^{256+96}$ for $\texttt{AES-256-GCM}$.
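As a worked example, plugging these space sizes into $M \approx \sqrt{p \cdot 2^{n+1}}$ gives the following approximate message limits. The choices of $p$ are just the two thresholds mentioned above, not a recommendation for any particular system.

$$
\begin{aligned}
n = 224,\ p = 2^{-32}: &\quad M \approx \sqrt{2^{-32} \cdot 2^{225}} = 2^{96.5} \\
n = 352,\ p = 2^{-32}: &\quad M \approx \sqrt{2^{-32} \cdot 2^{353}} = 2^{160.5} \\
n = 352,\ p = 2^{-64}: &\quad M \approx \sqrt{2^{-64} \cdot 2^{353}} = 2^{144.5}
\end{aligned}
$$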

$\texttt{AES-GCM}$ uses a nonce, which is what you've called the IV (probably because you're choosing it randomly rather than counting up from zero, which is the recommended approach with GCM). In your particular scheme a counting nonce isn't strictly necessary, since the key is essentially chosen at random for each encryption. So, even though going the random nonce route increases the chance of collisions, the probability is already acceptably low.

What is concerning in this scheme, however, is that it's not clear what the actual initial key material is: is it the 32-byte salt or the 16-byte random value? Are both of those values secret and handled with care? Moreover, why derive a random key from a random salt and another random value when you can just create a 32-byte random key directly from the randomness source?

If you're going to be deriving a key from various random values, consider also safely including other important information in the derivation process, like the context, nonce, and associated data.
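One way to do that is through HKDF's `info` input. Below is a minimal standard-library sketch of HKDF-SHA256 as described in RFC 5869, with a context label bound into the derived key. The label string and variable names are hypothetical, and a maintained library implementation would be preferable in production.

```python
import hashlib
import hmac
import os


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it."""
    if length > 255 * 32:
        raise ValueError("requested output is too long for HKDF-SHA256")
    # Extract: PRK = HMAC(salt, input keying material)
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


secret = os.urandom(16)                  # the 16-byte random value
salt = os.urandom(32)                    # the 32-byte random salt
info = b"example-app/data-key-wrap/v1"   # hypothetical context label
key = hkdf_sha256(secret, salt, info)    # 32-byte key for AES-256-GCM
```

Changing the `info` label yields an unrelated key from the same secret and salt, which keeps keys derived for different purposes separate.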
