2

I have a scenario similar to the one described in Wikipedia: hash list, but with a twist. I'm looking for a cryptographically secure hash function that would create the same root hash for the same file, no matter how the file is chopped up for the individual hashes in the list.

E.g. case 1: the file is divided into 3 parts; the hash list consists of the hashes of the 3 parts; the root hash is computed from those 3 hashes. Case 2: the same file is divided into 2 parts; the hash list consists of the hashes of the 2 parts; the root hash is computed from those 2 hashes. Since it is the same file, I want the root hash to be the same in both cases.

Is this doable (maybe with some restrictions on number and size of file parts)?

[Edit] Specific use case: My system stores files for users. Large files are usually sent / stored in smaller chunks (currently I don't control how the files are split into chunks). Each chunk is encrypted beforehand by the client, but is accompanied by a hash of the unencrypted content. I would now like to know whether two users have uploaded the same file (as this allows me to do some optimization) without having to know the content of the file. So if I could compute a "hash" of the whole file from the individual chunk hashes, I could easily achieve this.

David Cary
user12889

4 Answers

6

A hash tree is meant for exactly that, and a binary tree seems a good fit. I'll restrict the description to something directly derived from SHA-256 (256-bit output, 512 bits hashed per round).

  1. A parameter n>0 is selected, defining a "superblock" size of n*512 bits. Say 8192 bits (1kB, for n=16); n=1 works, but a higher value improves computing efficiency markedly.

  2. The file is padded as in SHA-256, and conceptually organized into m superblocks of that size (the last superblock might be shorter, but its size is a multiple of 512 bits; the padding might be in the last superblock, or span the last two superblocks).

  3. The file is chopped into segments of consecutive superblocks, with segment boundaries restricted to superblock boundaries. Each computation point is assigned a segment, which it receives (or generates and pads).

  4. Each computation point separately hashes each superblock in its assigned segment, using SHA-256 without the final padding step. Each superblock requires n hash rounds (except the last, which may require fewer). Most of the rounds are performed here (exactly as many as for SHA-256 of the whole file), in a distributed manner.

    All the hashes obtained at step 4 form the m leaves on top of a single overall binary tree with a 256-bit hash at each node, and a structure independent of how the file was chopped. Each of the m-1 non-leaf nodes of the tree will be the hash, obtained with one round, of the 512 bits in the two nodes linked on its top-left and top-right. The bottom node of the tree will be the final hash. It will be independent of how the file was chopped at step 3, because the computation is performed according to the same tree regardless of the chopping.

    [drawing of tree needed]

    All branches in the tree join adjacent levels, except on the right, where the (j+1)-th level from the top is skipped when the j-th lower-order bit of m-1 is 0.

  5. The computation point responsible for a segment computes the hashes for the nodes that have all their leaves within that segment, and must keep the others. This needs precise organization [and is left as an exercise to the reader; my earlier description was flawed].

  6. Each computation point returns its partial result. That will include at most $\lceil\log_2(k+1)\rceil$ 256-bit hashes for the partial hash of k superblocks. If the communication is centralized, the central point finishes the calculation, re-hashing according to the same binary tree as necessary. With decentralized communication, it is advantageous to aggregate partial results with a peer handling the segment of the file just before or after, which might allow slightly more of the work to be performed in a distributed manner.

Note: Computations can be interleaved to reduce storage requirements. There are a total of m-1 extra rounds in steps 5 and 6, an overhead of about 1/n, performed partly in a distributed manner.
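For concreteness, here is a minimal Python sketch of the core idea, not of the exact construction above: the function names, the 1 kB superblock, the lack of SHA-256-style length padding, and the rule of promoting an odd node unchanged are my own simplifications. The point it illustrates is that the root depends only on the ordered list of superblock hashes, so any chopping on superblock boundaries gives the same result:

```python
import hashlib

SUPERBLOCK = 1024  # bytes, i.e. n = 16 blocks of 512 bits (a choice made for this sketch)

def superblock_hashes(segment: bytes) -> list:
    """Hash each fixed-size superblock of a segment independently with SHA-256."""
    return [hashlib.sha256(segment[i:i + SUPERBLOCK]).digest()
            for i in range(0, len(segment), SUPERBLOCK)]

def root(leaves: list) -> bytes:
    """Combine leaf hashes pairwise, level by level, until one root remains.
    An odd node at the end of a level is promoted unchanged (a simplification;
    the tree shape described above handles this case differently)."""
    level = leaves or [hashlib.sha256(b"").digest()]
    while len(level) > 1:
        nxt = [hashlib.sha256(level[i] + level[i + 1]).digest()
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Two different choppings of the same 10-superblock file, both on
# superblock boundaries, produce identical leaves and hence the same root.
data = bytes(range(256)) * 40                      # 10240 bytes = 10 superblocks
split_a = [data[:3 * SUPERBLOCK], data[3 * SUPERBLOCK:]]
split_b = [data[:5 * SUPERBLOCK], data[5 * SUPERBLOCK:8 * SUPERBLOCK],
           data[8 * SUPERBLOCK:]]
leaves_a = [h for seg in split_a for h in superblock_hashes(seg)]
leaves_b = [h for seg in split_b for h in superblock_hashes(seg)]
assert root(leaves_a) == root(leaves_b)
```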

fgrieu
6

The property you want is inconsistent with the definition of a cryptographically secure hash function.

If $\mathcal{H}'(\mathcal{H}(\mathrm{half}_1),\mathcal{H}(\mathrm{half}_2)) = \mathcal{H}'(\mathcal{H}(\mathrm{third}_1),\mathcal{H}(\mathrm{third}_2),\mathcal{H}(\mathrm{third}_3))$, then finding second preimages is as trivial as repartitioning the original message. If you consider either $\mathcal{H}$ or $\mathcal{H'}$ (or both) as random oracles, that may assist in seeing why the scheme won't work.
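Spelled out: for any message $M = \mathrm{half}_1 \| \mathrm{half}_2 = \mathrm{third}_1 \| \mathrm{third}_2 \| \mathrm{third}_3$, the two hash lists
$$\bigl(\mathcal{H}(\mathrm{half}_1),\mathcal{H}(\mathrm{half}_2)\bigr) \neq \bigl(\mathcal{H}(\mathrm{third}_1),\mathcal{H}(\mathrm{third}_2),\mathcal{H}(\mathrm{third}_3)\bigr)$$
are distinct inputs to $\mathcal{H}'$ that the desired property forces to the same output, so anyone can exhibit a collision (and a second preimage) for $\mathcal{H}'$ with essentially no work.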

Note: in a hash tree, messages are processed identically regardless of partitioning (halves would be processed the same as quarters), so it is related, but not on target for your use case. If the depth of the tree were agreed upon in advance, then the partitioning could be as well. In your use case, you're stuck with hashes of the partitions, and so you cannot accumulate them without the hash $\mathcal{H}'$ "knowing" something about the preimages of $\mathcal{H}$, which is against the security definition of $\mathcal{H}$.

PulpSpy
2

Not an answer to your question, but a security point: revealing plain hashes of unencrypted content is a security vulnerability. (Revealing HMACs, or anything else with a secret component, does not pose the same vulnerability, even if they are calculated from the unencrypted content.)

For example, consider the hash of a configuration file whose contents are mostly known, differing only in an 8-character password. It becomes really easy to iterate through all the possible passwords and check which one produces the hash that was sent. Or take copyright-protected content stored there: if the plaintext hash is revealed, anybody can tell whether or not you have stored that content.
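A rough sketch of that confirmation attack, assuming the attacker knows everything about the file except one secret field and has seen its plaintext SHA-256; the template, names, and password space here are purely illustrative:

```python
import hashlib

# Illustrative only: the attacker's model of the file, with one unknown field.
TEMPLATE = "user=alice\npassword={pw}\nhost=example.org\n"

def recover_password(observed_hex: str, candidates):
    """Return the candidate password whose filled-in template matches
    the revealed plaintext hash, or None if no candidate matches."""
    for pw in candidates:
        guess = TEMPLATE.format(pw=pw).encode()
        if hashlib.sha256(guess).hexdigest() == observed_hex:
            return pw
    return None

# Toy demonstration: the party holding the hash confirms the secret offline.
observed = hashlib.sha256(TEMPLATE.format(pw="hunter42").encode()).hexdigest()
assert recover_password(observed, ["letmein1", "hunter42", "password"]) == "hunter42"
```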

If the unencrypted content hashes are revealed only to the service provider (you?), this means that you can mount these types of attacks against the users - as can anybody able to force your hand. This is something that the users would have to be aware of, and it somewhat undermines the benefit of the service provider not having access to the encryption key.

In general, deduplication between different users cannot be done securely: it always leaks information, and the leak may be critical. Deduplication among a single user's own files leaks information as well, but that leakage is often small enough to be ignored entirely.

Nakedible
-1

I have a way, but it's really awful. You can individually hash the combination of each byte's position in the file and the content of that byte, and then XOR all the hashes together.

If you have some control over the block arrangement, you can do a bit better. For example, if the blocks are always aligned on at least 1 KB boundaries, you can individually hash each KB using its offset into the file as an HMAC key, then XOR all the block hashes together. Because XOR is associative and commutative, you can perform the XORs in any order: XOR within each block, then XOR across blocks, so a block of any size carries a fixed-length value that is easy to combine.
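A minimal Python sketch of that variant, assuming SHA-256-based HMAC, 1 KB alignment, and the absolute offset encoded as an 8-byte big-endian key (those encoding details are my own choices, not part of the scheme as stated):

```python
import hashlib
import hmac

BLOCK = 1024  # bytes; the scheme assumes chunk boundaries fall on this alignment

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def chunk_value(chunk: bytes, file_offset: int) -> bytes:
    """XOR the HMACs of each 1 KB piece of this chunk, each keyed by the
    piece's absolute offset within the whole file."""
    acc = bytes(32)
    for i in range(0, len(chunk), BLOCK):
        key = (file_offset + i).to_bytes(8, "big")
        acc = xor(acc, hmac.new(key, chunk[i:i + BLOCK], hashlib.sha256).digest())
    return acc

def file_value(chunk_values) -> bytes:
    """Combine the fixed-length per-chunk values into a whole-file value."""
    acc = bytes(32)
    for v in chunk_values:
        acc = xor(acc, v)
    return acc

# Any chunking on 1 KB boundaries yields the same combined value.
data = bytes(range(256)) * 16                      # 4096 bytes
a = file_value([chunk_value(data[:1024], 0), chunk_value(data[1024:], 1024)])
b = file_value([chunk_value(data[:3072], 0), chunk_value(data[3072:], 3072)])
assert a == b
```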

Note that XORing hashes together weakens the security properties of some hashes. I would not recommend this technique if it has to resist a deliberate attack specifically intended to defeat this algorithm. It may take an expert to assure you that the resulting construction truly is a cryptographically secure hash.

David Schwartz