
Given a server with download resume capabilities (byte ranges), we can download a file chunk by chunk. Multiple people will download different parts of the file.
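For illustration, a minimal sketch of fetching one block with an HTTP Range request (the URL and byte range are placeholders):

```python
# Minimal sketch of downloading one block via an HTTP Range request
# (the URL and byte range are placeholders, not real values).
import urllib.request

req = urllib.request.Request("https://example.com/A.jpg",
                             headers={"Range": "bytes=0-1048575"})  # first 1 MiB
with urllib.request.urlopen(req) as resp:
    block = resp.read()
    print(resp.status, len(block))   # expect 206 Partial Content
```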

Say a file A.jpg has blocks 1, 2, 3, 4, 5, 6.

The following people obtained these blocks:

Alice - 1, 2
Bob - 3, 6
John - 4, 5

What kind of hashing algorithm can be used to calculate the hash of A.jpg given that Alice, Bob and John can only hash each block separately?

Assume that the hashes for each block are stored somewhere Alice, Bob and John can access, and that they can combine them as they wish.

What I've found: combining two SHA-512 hashes into a single hash cannot be used.

I also found that CRC32 can be used. Are there any other algorithms?

Desired outcome

Everyone should be able to generate the same final hash, which should be stronger than CRC32 at preventing collisions. In the end I want to join the blocks to get the final file, so that Alice, Bob and John have the same file.

Mike Edward Moras
JaDogg
2 Answers


I'll assume we want a cryptographic hash giving security in the Random Oracle Model; collision-resistance and preimage-resistance follow. Collision-resistance alone rules out CRC, regardless of size.

The standard technique would be to split the file into blocks, distribute them together with their indexes to the participants, each of whom hashes their blocks; then hash the block hashes concatenated in order of increasing index, or use a Merkle tree, to form the hash of the whole file. However, with blocks distributed in a haphazard manner (as in the question), most block hashes need to be exchanged, which can get sizable; and the distributed computation of the final hash is somewhat hard to organize.
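As a concrete picture of that standard technique, here is a minimal Python sketch, assuming SHA-512 as the block hash and placeholder block contents: each participant can compute its block hashes independently, but some party must collect all of them, in index order, to finish.

```python
import hashlib

def file_hash_from_block_hashes(block_hashes):
    """Hash of the whole file: the per-block SHA-512 digests, concatenated
    in order of increasing block index, hashed once more."""
    outer = hashlib.sha512()
    for digest in block_hashes:
        outer.update(digest)
    return outer.digest()

# Each participant hashes only the blocks it holds...
blocks = [b"placeholder block %d" % i for i in range(6)]
block_hashes = [hashlib.sha512(b).digest() for b in blocks]
# ...but all six digests must be gathered, in order, to form the final hash.
print(file_hash_from_block_hashes(block_hashes).hex())
```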

Rather, we can group the block hashes (made dependent on their index) using an order-independent hash, applied by each participant over all the hashes of the blocks they are responsible for, then once more to obtain the final hash. This simplifies the organization, and saves bandwidth when there are more than a few blocks per participant: about 8 in the following simple example using $\Bbb Z_p^*$, but I conjecture that the overhead can be made negligible down to 1 block per participant by using an elliptic curve group instead.


For a 256-bit hash, marginally more costly than a regular one for large files, we'll use:

  • Some 512-bit hash $H$, e.g. $H=\operatorname{SHA-512}$.
  • Some 2048-bit prime $p$ making the Discrete Logarithm Problem in $\mathbb Z_p^*$ (conjecturally) hard; see final section.
  • Some public block size $b$, a multiple of $2^{12}$ bits (512 bytes), e.g. $b=2^{23}$ for 1 MiB blocks.
  • Implicit conversion from integer to bitstring and back, per big-endian convention.

To hash a file of $s$ bits (with $s\le2^{62}b$, which is more than ample):

  1. Split the file into $\lceil s/b\rceil$ blocks $B_i$ of $b$ bits each, except for the last which may be smaller (but non-empty), with $0\le i<\lceil s/b\rceil$. Distribute the blocks $B_i$ and indexes $i$ such that each block is assigned to exactly one participant $j$.
  2. Have each participant $j$ perform:
    • $f_j\gets1$
    • For each block $B_i$ assigned to participant $j$
      • $h_i\gets H(B_i)$. That's a 512-bit bitstring characteristic of $B_i$.
      • $g_i\gets H(h_i\mathbin\|\widetilde{4i})\mathbin\|H(h_i\mathbin\|\widetilde{4i+1})\|H(h_i\mathbin\|\widetilde{4i+2})\|H(h_i\mathbin\|\widetilde{4i+3})$ where $\widetilde{\;n\;}$ is the representation of integer $n$ as a 64-bit bitstring.
        Since function $H$ returns 512 bits, the concatenation of the 4 hashes makes $g_i$ a 2048-bit bitstring, characteristic of $B_i$ and $i$.
      • $f_j\gets f_j\cdot g_i\bmod p$.
    • $f_j$ is a 2048-bit bitstring characteristic of the $B_i$ and $i$ assigned to participant $j$.
    • If $j\ne 0$, transmit that $f_j$ to participant $0$.
  3. Participant $0$ performs:
    • $f\gets f_0$
    • When receiving $f_j$ with $j\ne 0$
      • $f\gets f\cdot f_j\bmod p$.
    • $h\gets H(f)$ truncated to its first 256 bits, where $f$ is represented as a 2048-bit bitstring when applying $H$.
    • Send $h$ to all participants.

Absent message alteration or loss, $h$ is independent of how the blocks have been distributed. It is a 256-bit bitstring characteristic of the whole file, computed in a largely distributed manner. The computation of $f$ and $h$ could be distributed too, at a small extra cost in message exchange.
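A minimal Python sketch of steps 1-3 (toy block contents, much smaller than the 1 MiB blocks suggested above; $p$ is the 2048-bit safe prime given at the end of this answer):

```python
import hashlib

# The 2048-bit safe prime given at the end of this answer, floor(2^2046*pi) + 3617739.
P = int("c90fdaa22168c234c4c6628b80dc1cd129024e088a67cc74020bbea63b139b22514a08798e3404ddef9519b3cd3a431b302b0a6df25f14374fe1356d6d51c245e485b576625e7ec6f44c42e9a637ed6b0bff5cb6f406b7edee386bfb5a899fa5ae9f24117c4b1fe649286651ece45b3dc2007cb8a163bf0598da48361c55d39a69163fa8fd24cf5f83655d23dca3ad961c62f356208552bb9ed529077096966d670c354e4abc9804f1746c08ca18217c32905e462e36ce3be39e772c180e86039b2783a2ec07a28fb5c55df06f4c52c9de2bcbf6955817183995497cea956ae515d2261898fa051015728e5a8aaac42dad33170d04507a33a85521abdf53ee2f", 16)

def H(data: bytes) -> bytes:                    # the 512-bit hash H
    return hashlib.sha512(data).digest()

def g(block: bytes, i: int) -> int:
    """2048-bit group element characteristic of block B_i and its index i (step 2)."""
    h_i = H(block)
    concat = b"".join(H(h_i + (4 * i + k).to_bytes(8, "big")) for k in range(4))
    return int.from_bytes(concat, "big")

def participant_share(assigned) -> int:
    """f_j: product of g_i over the participant's (index, block) pairs, mod p."""
    f_j = 1
    for i, block in assigned:
        f_j = f_j * g(block, i) % P
    return f_j

def combine(shares) -> bytes:
    """Step 3: fold all shares together, then truncate H(f) to 256 bits."""
    f = 1
    for f_j in shares:
        f = f * f_j % P
    return H(f.to_bytes(256, "big"))[:32]

# Toy blocks, split as in the question (Alice: 1,2  Bob: 3,6  John: 4,5), 0-indexed here.
blocks = {i: bytes([i]) * 32 for i in range(6)}
alice = participant_share([(0, blocks[0]), (1, blocks[1])])
bob   = participant_share([(2, blocks[2]), (5, blocks[5])])
john  = participant_share([(3, blocks[3]), (4, blocks[4])])
assert combine([alice, bob, john]) == combine([john, bob, alice])  # order-independent
print(combine([alice, bob, john]).hex())
```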

The order-independent hash is borrowed from the multiplicative one in Dwaine Clarke, Srinivas Devadas, Marten van Dijk, Blaise Gassend, G. Edward Suh, Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking, in proceedings of Asiacrypt 2003, which is given a security reduction in appendix C. The security of the whole construction should follow.


On choice of $p$: our requirement is hardness of the DLP in $\Bbb Z_p^*$, as in classic Diffie-Hellman key exchange. We need a 2048-bit safe prime with no special form $p=2^k\pm s$ (for small $s$) that could make SNFS easier. Customarily, one uses a nothing-up-my-sleeve number based on the bits of some transcendental mathematical constant, as good-enough assurance that $p$ is of no special form.

That can be $p=\lfloor2^{2046}\pi\rfloor+3617739$. The construction uses the first 2048 bits of the binary representation of $\pi$, then increments until hitting a safe prime. Hexadecimal value:

c90fdaa22168c234c4c6628b80dc1cd129024e088a67cc74020bbea63b139b22514a08798e3404ddef9519b3cd3a431b302b0a6df25f14374fe1356d6d51c245e485b576625e7ec6f44c42e9a637ed6b0bff5cb6f406b7edee386bfb5a899fa5ae9f24117c4b1fe649286651ece45b3dc2007cb8a163bf0598da48361c55d39a69163fa8fd24cf5f83655d23dca3ad961c62f356208552bb9ed529077096966d670c354e4abc9804f1746c08ca18217c32905e462e36ce3be39e772c180e86039b2783a2ec07a28fb5c55df06f4c52c9de2bcbf6955817183995497cea956ae515d2261898fa051015728e5a8aaac42dad33170d04507a33a85521abdf53ee2f
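To double-check this construction, here is a short sketch (assuming the third-party mpmath and sympy packages are available) that recomputes $p$ from $\pi$ and tests that it is a safe prime:

```python
from mpmath import mp, mpf, floor, pi   # arbitrary-precision pi
from sympy import isprime               # probabilistic primality test (BPSW)

mp.prec = 2200                          # comfortably more than 2048 bits of precision
p = int(floor(pi * mpf(2) ** 2046)) + 3617739

print(format(p, "x"))                        # should match the hex value quoted above
print(isprime(p) and isprime((p - 1) // 2))  # safe prime: p and (p-1)/2 both prime
```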

As pointed out by Squeamish Ossifrage in a comment, we could use the 2048-bit MODP group proposed by RFC 3526: $p=2^{2048}-2^{1984}-1+2^{64}\cdot\lfloor2^{1918}\pi+124476\rfloor$. That similarly uses as many of the first bits of the binary representation of $\pi$ as possible, but by construction has the 66 high-order bits (including two from $\pi\approx 3$) and the 64 low-order bits set. The high-order bits simplify the choice of dividend limbs in Euclidean division by the classical method, while the low-order bits simplify Montgomery reduction. This is believed to be few enough forced bits not to allow a huge speedup of the DLP.

ffffffffffffffffc90fdaa22168c234c4c6628b80dc1cd129024e088a67cc74020bbea63b139b22514a08798e3404ddef9519b3cd3a431b302b0a6df25f14374fe1356d6d51c245e485b576625e7ec6f44c42e9a637ed6b0bff5cb6f406b7edee386bfb5a899fa5ae9f24117c4b1fe649286651ece45b3dc2007cb8a163bf0598da48361c55d39a69163fa8fd24cf5f83655d23dca3ad961c62f356208552bb9ed529077096966d670c354e4abc9804f1746c08ca18217c32905e462e36ce3be39e772c180e86039b2783a2ec07a28fb5c55df06f4c52c9de2bcbf6955817183995497cea956ae515d2261898fa051015728e5a8aacaa68ffffffffffffffff
fgrieu

I've split my answer into two sections: the first on making any normal hash distributed, the second on creating a parallel hash function from any normal hash.

Hack existing functions to enable distributed computation

If Alice, Bob and John can share state, they may migrate the internal hash function state and take turns. This is easy to demonstrate with any Merkle–Damgård based hash function, as each block is absorbed into the state left by the previous one.

If they cannot share state, then you must use a hash function that supports splitting the work into trees. The segmentation is then controlled by the hash function, no longer by your software.
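For instance, BLAKE2b as exposed in Python's hashlib has such a tree mode. A minimal sketch (the 6-leaf, 1 MiB layout and placeholder block contents are illustrative choices, not anything the question mandates): each participant hashes its own leaves, and whoever holds all six leaf digests computes the root.

```python
from hashlib import blake2b

LEAF_SIZE = 2**20                 # bytes per leaf (block); illustrative choice
FANOUT, DEPTH, INNER = 6, 2, 64   # depth-2 tree: 6 leaves under one root

def leaf_digest(block: bytes, i: int, last: bool) -> bytes:
    """Leaf node i, computable by whichever participant holds block i."""
    return blake2b(block, fanout=FANOUT, depth=DEPTH, leaf_size=LEAF_SIZE,
                   inner_size=INNER, node_offset=i, node_depth=0,
                   last_node=last).digest()

def root_digest(leaf_digests) -> bytes:
    """Root node: hashes the leaf digests, fed in index order."""
    root = blake2b(digest_size=32, fanout=FANOUT, depth=DEPTH, leaf_size=LEAF_SIZE,
                   inner_size=INNER, node_offset=0, node_depth=1, last_node=True)
    for d in leaf_digests:
        root.update(d)
    return root.digest()

blocks = [bytes([i]) * LEAF_SIZE for i in range(6)]    # stand-ins for the real blocks
leaves = [leaf_digest(b, i, last=(i == 5)) for i, b in enumerate(blocks)]
print(root_digest(leaves).hex())
```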

Create an independent hash function for this purpose

We can hash the XOR of the hashes of each block, assuming only that each participant has some canonical segment identifier, such as the starting byte offset together with a fixed block length, or the concatenation $\text{identifier} = \text{start} \| \text{end}$. It is critical that no participants' segments overlap and that no bytes are skipped.

$$\tilde H(m)\ =\ H\Bigl(\bigl(H(m_1\mathbin\| 1)\oplus H(m_2\mathbin\|2)\oplus\dots\oplus H(m_n\mathbin\|n)\bigr)\mathbin\|n\Bigr)$$

Alice produces $H(m_1 \| 1) \oplus H(m_2 \| 2)$, Bob produces $H(m_3 \| 3) \oplus H(m_6 \| 6)$, and so on.
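A minimal sketch of $\tilde H$, assuming SHA-512 as $H$ and an 8-byte big-endian encoding of the index (both arbitrary choices here); see the caveat in the edit below about collision resistance.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha512(data).digest()

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def partial(assigned) -> bytes:
    """XOR of H(m_i || i) over one participant's (index, block) pairs."""
    acc = bytes(64)
    for i, block in assigned:
        acc = xor(acc, H(block + i.to_bytes(8, "big")))
    return acc

def final(partials, n: int) -> bytes:
    """Combine the participants' XOR shares and bind in the block count n."""
    acc = bytes(64)
    for share in partials:
        acc = xor(acc, share)
    return H(acc + n.to_bytes(8, "big"))

blocks = {i: b"placeholder block %d" % i for i in range(1, 7)}
alice = partial([(1, blocks[1]), (2, blocks[2])])
bob   = partial([(3, blocks[3]), (6, blocks[6])])
john  = partial([(4, blocks[4]), (5, blocks[5])])
print(final([alice, bob, john], 6).hex())
```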

The above function is roughly equivalent to PMAC, which uses keyed PRPs instead of unkeyed PRFs. You may consider this an unkeyed (or static-key) variant of PMAC, and so the proofs for PMAC should apply here as well.

Edit: see https://crypto.stackexchange.com/a/56512/56625. This is not collision-resistant, unlike PMAC, where the key is unknown to the adversary. However, since our participants aren't trusted anyway, this is intended for weak integrity checks (as apparently CRC would be acceptable), right?

In any case, we must trust the participants to produce correct hashes. If we can, we should verify the result by hashing the data alone; if any participant lied, we can tell from their hashes and distrust them in later rounds.

cypherfox