
I’m thinking about the convergent encryption system described in the top answer to this question. However, it seems like preferred modern HMAC algorithms like HMAC-SHA256 (as used in step 2 to create the “key”) are much slower than the actual encryption. Then SHA256 is invoked again on data the size of the plaintext to create the locator.

It seems like the first SHA-256 invocation that’s part of the HMAC step could be replaced by any other transformation of the plaintext which cannot be reversed and which will produce the same output given the same plaintext data, like simply encrypting the data with the “specially-selected HMAC key”. I also read about authenticated encryption and am wondering if I could use the MAC created during e.g. AES256-GCM as the key.

Would either of those approaches allow sidestepping the performance overhead of traditional HMAC methods? The only problem that I could think of is that it may be difficult to come up with an IV if that is required, since all writers would somehow need to use the same one in order for the “locator” to be consistent for all instances of the same input data.

If not, are there any known techniques for reducing the overhead of HMACs (other than reducing the input size)?

Dan

1 Answer


The answer depends on how you define ‘convergent encryption’ and what threats you're hoping to defend against. Here are two broad options that you might mean:

  1. You're a storage service and you make lofty promises about encryption to your users, but you also do deduplication between unrelated parties. (Maybe you abuse the term ‘zero-knowledge’; it's trendy among such companies!) You just want to prevent an adversary from

    • recovering a plaintext they don't know, or
    • forging a plaintext for deduplication so that some hapless user who tried to store a file would get the forgery when they later try to retrieve it.

    This is a very weak notion of security, but it roughly corresponds to, e.g., Dropbox, who presumably save a lot of money by deduplicating files unrelated people have shared many times, and who presumably earn a lot of brownie points by preemptively checking files for known child pornography and notifying the authorities (of course, exactly the same technique applied to subversive imagery would serve the interests of an authoritarian government, but we won't talk about that).

    So you use $H(m)$ as the encryption key, and maybe $H(H(m))$ as the deduplication index, like Tahoe-LAFS does (without convergence secrets). In this case, you need a collision-resistant hash function $H$—otherwise an adversary could find a collision between a benign file $m$ and a file $m'$ full of child pornography, and then persuade the victim to try uploading $m$. There are no shortcuts. Sorry. If you really need to squeeze performance out of it, some collision-resistant hash functions are cheaper than others—see eBASH for fair performance comparisons on a wide variety of hardware.

  2. You, or you and a small group of friends, have secrets that you can use to encrypt files, and you want to do deduplication among your (and your friends') files. You want to prevent an adversary who doesn't know the key from even confirming a hypothesis about what files are stored, and, of course, from fooling you into accepting forged files.

    What you need to derive the convergent encryption key and deduplication key is a pseudorandom function family: the output must be indistinguishable from uniform random to anyone who doesn't know the secret key.

    SHA-256 is expensive mainly because you pay for collision resistance, and while HMAC-SHA256 is a perfectly respectable and widely available PRF, you're still paying for that collision resistance when you use it.

    Here are a couple of alternatives:

    • A cheaper PRF like Kravatte, which is based on 6-round Keccak, in contrast to the 24-round Keccak used in SHA-3. There aren't a whole lot of widely available options here: most work in this area goes into short-input PRFs or collision-resistant hashes.

    • Pick a universal hash family $H$ with bounded collision probability, such as Poly1305, and a short-input PRF $f$, such as ChaCha or Salsa20 (or HChaCha or HSalsa20, slightly cheaper). Then build a long-input PRF as follows:

      \begin{equation*} F_{k_1,k_2}(m) := f_{k_1}(H_{k_2}(m)). \end{equation*}

      That is, loosely, use $H$ to compress the message with low collision probability, and then use $f$ to scramble it.

      Universal hash families like Poly1305 are considerably cheaper to compute than SHA-256 or any other collision-resistant hash functions, and can be vectorized effectively. Poly1305 implementations are widely available, e.g. in NaCl/libsodium as crypto_onetimeauth_poly1305.

      If the collision probability of $H$, that is $\Pr[H_{k_2}(x) = H_{k_2}(y)]$ for $x \ne y$, is at most $\varepsilon$, and if the best PRF-advantage against $f$ is $\delta$, then the best PRF-advantage against $F$ is at most $q^2 \varepsilon + \delta$ where $q$ is the number of messages you handle under a single key (proof).

      For Poly1305, $\varepsilon = 8\lceil L/16\rceil/2^{106}$ for messages up to $L$ bytes long. So with Poly1305 and ChaCha, if you store (or if the adversary tries to forge) up to a billion megabyte-long files, about a petabyte of data, then $q \approx 2^{30}$ and $\lceil L/16\rceil = 2^{16}$, and the advantage of the adversary at distinguishing any of the convergence keys from uniform random strings will be at most about $2^{60} \cdot 2^{16} \cdot 8/2^{106} + \delta = 2^{-27} + \delta$. If the adversary puts additional computational power to it, that only raises $\delta$, but unless they can break ChaCha, the additional computational power will pale in comparison to what they need to guess your 256-bit key.

      This technique is part of how the deterministic authenticated cipher AES-GCM-SIV works: the output of $F$ is used both as the authentication tag and as the synthetic IV (the starting counter block) for encrypting the message. (Of course, AES-GCM-SIV uses POLYVAL, a close relative of GHASH, rather than Poly1305, and AES rather than ChaCha, so it suits implementations with hardware AES and carry-less multiplication support much better than pure-software ones.)
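To make the second alternative concrete, here is a minimal sketch of $F_{k_1,k_2}(m) = f_{k_1}(H_{k_2}(m))$ with Poly1305 as $H$ and HChaCha20 as $f$, plus a helper that evaluates the $q^2\varepsilon$ term from the bound above. The function names (`poly1305`, `hchacha20`, `long_input_prf`, `log2_collision_term`) are my own, chosen for illustration. The pure-Python arithmetic is only meant to show the structure: it is slow and not constant-time, so a real system would presumably call a vetted library such as libsodium instead.

```python
import math
import struct

MASK32 = 0xFFFFFFFF
P1305 = (1 << 130) - 5  # the prime 2^130 - 5 used by Poly1305


def poly1305(key: bytes, msg: bytes) -> bytes:
    """Poly1305 as specified in RFC 8439; plays the role of H_{k2}."""
    if len(key) != 32:
        raise ValueError("Poly1305 key must be 32 bytes")
    # Clamp r as the specification requires; s is added at the end.
    r = int.from_bytes(key[:16], "little") & 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF
    s = int.from_bytes(key[16:], "little")
    acc = 0
    for i in range(0, len(msg), 16):
        # Every chunk (including a short final one) gets a 0x01 byte appended.
        n = int.from_bytes(msg[i:i + 16] + b"\x01", "little")
        acc = (acc + n) * r % P1305
    return ((acc + s) & ((1 << 128) - 1)).to_bytes(16, "little")


def _rotl(x: int, n: int) -> int:
    return ((x << n) & MASK32) | (x >> (32 - n))


def _quarter_round(s: list, a: int, b: int, c: int, d: int) -> None:
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = _rotl(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = _rotl(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = _rotl(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = _rotl(s[b] ^ s[c], 7)


def hchacha20(key: bytes, block: bytes) -> bytes:
    """HChaCha20: 32-byte key, 16-byte input, 32-byte output; plays f_{k1}."""
    if len(key) != 32 or len(block) != 16:
        raise ValueError("HChaCha20 needs a 32-byte key and a 16-byte input")
    s = list(struct.unpack("<4I", b"expand 32-byte k"))
    s += struct.unpack("<8I", key)
    s += struct.unpack("<4I", block)
    for _ in range(10):  # 10 double rounds = 20 rounds
        _quarter_round(s, 0, 4, 8, 12)
        _quarter_round(s, 1, 5, 9, 13)
        _quarter_round(s, 2, 6, 10, 14)
        _quarter_round(s, 3, 7, 11, 15)
        _quarter_round(s, 0, 5, 10, 15)
        _quarter_round(s, 1, 6, 11, 12)
        _quarter_round(s, 2, 7, 8, 13)
        _quarter_round(s, 3, 4, 9, 14)
    # Unlike the ChaCha20 block function, there is no final feed-forward add.
    return struct.pack("<8I", *(s[0:4] + s[12:16]))


def long_input_prf(k1: bytes, k2: bytes, msg: bytes) -> bytes:
    """F_{k1,k2}(m) = f_{k1}(H_{k2}(m)): compress with H, then scramble with f."""
    return hchacha20(k1, poly1305(k2, msg))


def log2_collision_term(q: int, max_len: int) -> float:
    """log2 of q^2 * eps, with eps = 8 * ceil(L/16) / 2^106 for Poly1305."""
    return math.log2(q * q * 8 * math.ceil(max_len / 16)) - 106
```

Calling `long_input_prf(k1, k2, plaintext)` with two independent 32-byte secrets gives a 32-byte value that depends only on the keys and the plaintext, which is the property the convergence key needs. How you derive a separate deduplication key from it (for instance with a second, independent key pair) is a design choice this sketch leaves open. `log2_collision_term(2**30, 2**20)` reproduces the $2^{-27}$ figure from the estimate above.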

Squeamish Ossifrage