I want to hash a large file in parallel, but I want the final result to be equal to a single hash of the file. What techniques would be best suited to solving this problem?
1 Answer
If you want to produce a hash compatible with an existing sequential hash, like SHA-2, you're out of luck. But if you choose a parallel mode, you can always compute the same output using a single thread.
The two most important parallel constructions are:
Interleaved
You initialize a fixed number of sequential hashes. Then you feed each of them one block in turn. At the end of the file you take the outputs of the individual hashes and hash them down into a single value.
The biggest downside of this approach is that you need to choose the maximal parallelism at design time, and changing it breaks compatibility. It also still needs to process the file from start to finish: it can take advantage of multi-core CPUs, but it can't parallelize I/O by hashing different parts of the file at the same time.
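A minimal single-threaded sketch of the interleaved construction in Python (the lane count, block size, and SHA-256 combiner are illustrative choices, not a standard; a real design would also need to encode the lane count and message length to avoid ambiguities):

```python
import hashlib

def interleaved_hash(path, lanes=4, block_size=4096):
    # One sequential hash state per lane; blocks are dealt out round-robin.
    states = [hashlib.sha256() for _ in range(lanes)]
    with open(path, "rb") as f:
        i = 0
        while chunk := f.read(block_size):
            states[i % lanes].update(chunk)
            i += 1
    # Combine the per-lane digests into a single final hash.
    final = hashlib.sha256()
    for s in states:
        final.update(s.digest())
    return final.hexdigest()
```

A parallel implementation would give each lane its own thread reading every `lanes`-th block; because the schedule is fixed, it produces the same digest as the single-threaded version above.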
A (deep) Merkle tree
For example, the Tiger Tree Hash splits the file into 1 KiB chunks and computes an unlimited-depth binary hash tree over them. (Don't forget to tag leaves and inner nodes differently, or use an equivalent mechanism, to avoid ambiguities.)
This has slightly higher CPU (a few percent) and memory (a few KB) overhead, and it can only parallelize the hashing of files larger than the leaf size.
In exchange you can hash pieces independently and in any order, and the maximal parallelism grows with the file size.
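A minimal sketch of such a tree hash in Python, using SHA-256 in place of Tiger (hashlib ships no Tiger implementation) and the THEX conventions of 1 KiB leaves, 0x00/0x01 domain-separation prefixes, and promoting an odd node unchanged to the next level:

```python
import hashlib

LEAF, NODE = b"\x00", b"\x01"  # distinct tags for leaves vs. inner nodes

def tree_hash(data, leaf_size=1024):
    # Hash each leaf chunk independently -- this loop is what parallelizes.
    leaves = [hashlib.sha256(LEAF + data[i:i + leaf_size]).digest()
              for i in range(0, max(len(data), 1), leaf_size)]
    # Reduce pairwise until a single root remains.
    while len(leaves) > 1:
        nxt = [hashlib.sha256(NODE + leaves[i] + leaves[i + 1]).digest()
               for i in range(0, len(leaves) - 1, 2)]
        if len(leaves) % 2:  # odd node is promoted unchanged (THEX rule)
            nxt.append(leaves[-1])
        leaves = nxt
    return leaves[0].hex()
```

Because every leaf digest depends only on its own 1 KiB slice, the leaf loop can be farmed out to threads or processes, and the pieces can arrive in any order.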
Personally I prefer the tree approach due to its flexibility.
If you can use a key, you can use a PRF/MAC instead of a collision-resistant hash. For example, GHASH (the MAC component of GCM) is very flexible in how it allows you to partition and parallelize the computation. It's also much cheaper to compute than collision-resistant hashes.
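To illustrate why GHASH partitions so well: it evaluates the polynomial X₁·Hⁿ ⊕ … ⊕ Xₙ·H, so a prefix and a suffix can be hashed independently and merged by multiplying the prefix result by H^m, where m is the number of suffix blocks. A slow-but-straightforward sketch using bit-by-bit GF(2^128) arithmetic per NIST SP 800-38D (nowhere near production speed):

```python
import os

R = 0xE1000000000000000000000000000000  # GCM reduction polynomial

def gf128_mul(x, y):
    # Multiply in GF(2^128) using GCM's bit-reflected convention.
    z, v = 0, x
    for i in range(128):
        if (y >> (127 - i)) & 1:
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z

def gf128_pow(h, e):
    # Square-and-multiply; 1 << 127 is the multiplicative identity here.
    acc, base = 1 << 127, h
    while e:
        if e & 1:
            acc = gf128_mul(acc, base)
        base = gf128_mul(base, base)
        e >>= 1
    return acc

def ghash(h, blocks):
    y = 0
    for x in blocks:
        y = gf128_mul(y ^ x, h)
    return y

h = int.from_bytes(os.urandom(16), "big")
blocks = [int.from_bytes(os.urandom(16), "big") for _ in range(8)]

# Hash the two halves independently, then merge.
left, right = blocks[:3], blocks[3:]
merged = gf128_mul(ghash(h, left), gf128_pow(h, len(right))) ^ ghash(h, right)
assert merged == ghash(h, blocks)
```

The same merge rule works for any number of parts, so you can hash arbitrary disjoint ranges of the file concurrently and combine them at the end.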