
I'm brainstorming some different ways of making deterministic encryption more secure. I want to use deterministic encryption to preserve searchability over the keywords in a document set. However, I know that frequency analysis is always a concern when preserving searchability in this way.

The strategy I came up with basically still does deterministic encryption (with SIV mode), except instead of just one IV I use ten different ones.* When I encrypt a keyword I choose a random number modulo ten and use that to index into the array of IVs. I then append the index to the ciphertext so it can be decrypted.

To search, you just encrypt the keyword with all ten possible IVs and make a big OR query with all ten ciphertexts.
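To make the mechanics concrete, here is a minimal Python sketch of the idea. It models the deterministic "ciphertext" as an HMAC over the IV and keyword, purely as a stand-in for real SIV-mode encryption (where, as noted below, the IV would be extra input to the S2V step); `NUM_IVS`, `encrypt_keyword`, and `search_tokens` are illustrative names, not from any real library.

```python
import hmac
import hashlib
import secrets

# Sketch only: a deterministic search token is modeled as
# HMAC(key, iv || keyword). A real scheme would use SIV-mode
# encryption with the IV as extra associated input.
NUM_IVS = 10
IVS = [bytes([i]) * 16 for i in range(NUM_IVS)]  # fixed, distinct IVs


def encrypt_keyword(key: bytes, keyword: str) -> tuple[int, bytes]:
    """Pick a random IV index and produce the deterministic token.

    The index is returned so it can be stored alongside the
    ciphertext for decryption, as the question describes.
    """
    idx = secrets.randbelow(NUM_IVS)
    token = hmac.new(key, IVS[idx] + keyword.encode(), hashlib.sha256).digest()
    return idx, token


def search_tokens(key: bytes, keyword: str) -> list[bytes]:
    """All tokens the keyword could have been stored under.

    A search ORs together one ciphertext per IV.
    """
    return [hmac.new(key, iv + keyword.encode(), hashlib.sha256).digest()
            for iv in IVS]
```

Whichever IV was chosen at encryption time, the stored token is always one of the ten candidates the searcher generates.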

I don't know a lot about frequency analysis, so I'm not sure how much (if any) additional security this provides compared to regular deterministic encryption. Anyone have any ideas?

My intuition is that for small document sets with relatively few keywords this does provide nontrivial protection against frequency analysis. But I also think the security might not scale to very large data sets.

*In practice the 'IV' here will be additional input to the first step of SIV.

pg1989

1 Answer


Such a scheme would have two effects against an attacker trying to analyze the frequency of words and word combinations:

  1. They would need more samples to differentiate between two tokens at the same level of confidence.
  2. For two different tokens occurring at similar frequencies, they would no longer know whether the underlying words are the same or different.

The first means the attacker faces the equivalent of breaking a data set roughly one tenth the size. Larger collections of ciphertext are still breakable. You could compensate by scaling the number of IVs with the total plaintext size to be encrypted, but that makes searches even slower.
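For instance, a scaling rule along those lines, together with its search cost, could look like the following. Everything here is hypothetical: the function name and the target of 1000 occurrences per (keyword, IV) bucket are illustrative choices, not part of the question's scheme.

```python
import math

# Hypothetical scaling rule: grow the IV count with the total number
# of indexed keyword occurrences so each (keyword, IV) bucket stays
# below a target size. The target of 1000 is an arbitrary example.
def num_ivs_for(total_occurrences: int, target_bucket: int = 1000) -> int:
    return max(1, math.ceil(total_occurrences / target_bucket))

# Every search must OR together one ciphertext per IV, so under this
# rule the query size grows linearly with the data set.
```

With these numbers, a million indexed occurrences would already require a 1000-term OR query, which illustrates why the search slowdown is the price of keeping the per-IV sample small.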

The second may be even more of a problem for the attacker. The less likely options can occur at almost the same frequency – e.g. among English letters there is a significant difference between the frequencies of 'e' and 't', but 'j' and 'x' are almost equally uncommon. That means even if you can deduce one token from context, you've only solved, say, one twentieth of the problem.

Are these problems insurmountable for the attacker? No. The tweak may patch some leaks, but the encryption remains much less secure than non-deterministic, semantically secure encryption.


Trying to extend your tweak, it might be better to use a non-uniform distribution over the different IVs. For example, illustrating with only two IVs and English letter frequencies: if the probability were $\frac{1}{3}$ for one IV and $\frac{2}{3}$ for the other, the frequency of 'e' under the first IV would become about equal to that of 'n' under the second.
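A quick back-of-the-envelope check of that example, using approximate English letter frequencies (the percentages are standard estimates I'm supplying for illustration, not numbers from the scheme):

```python
# Approximate English letter frequencies, in percent.
freq = {'e': 12.7, 't': 9.1, 'n': 6.7}

# Non-uniform probabilities for the two IVs.
p_rare, p_common = 1 / 3, 2 / 3

# 'e' encrypted under the rarely chosen IV...
e_rare = freq['e'] * p_rare      # about 4.23%
# ...lands close to 'n' under the commonly chosen IV.
n_common = freq['n'] * p_common  # about 4.47%
```

The two resulting token frequencies end up within a few tenths of a percent of each other, so an attacker sorting tokens by frequency can no longer cleanly separate them.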

Again, I emphasize that you should only encrypt information like this if you are OK with an attacker learning nontrivial facts about the plaintext.

otus