Perhaps someone can confirm my intuition:

A tree hash with leaves of size 1 (for example Blake2 in a tree-hash mode) would of course not be very efficient.

But it seems that it might be a suitable way (or maybe the only way?) to provide a verifier that's still useful if the underlying content is subject to intentional declared byte-range-elisions later. (That is: a form of digital redaction, where the original verifier still helps to assure that only the declared range has been altered.)

A content host performing elision would record the range, record the one or (usually) two internal tree node hashes that cover the to-be-elided range, then zero (or encrypt) the range.

When other parties later want to verify the content is unaltered (except within the declared range), the host would provide the content, the declared range, and the up-to-two internal node hashes to be used in place of the missing tree-hash values. Matching the original root tree hash would confirm that all other bytes in the file remain as when originally hashed.
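The scheme described above can be sketched roughly as follows. This is a minimal illustration, not BLAKE2's actual tree mode: it uses SHA-256 with ad-hoc domain separation, one-byte leaves, and an elided range that happens to align with a single 2-byte subtree; all names are illustrative.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf(b: int) -> bytes:
    return h(b"\x00" + bytes([b]))    # domain-separated leaf hash

def node(left: bytes, right: bytes) -> bytes:
    return h(b"\x01" + left + right)  # domain-separated internal node

def root(level: list) -> bytes:
    # Pairwise-combine one level of hashes until a single root remains.
    while len(level) > 1:
        level = [node(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

content = bytes(range(8))                        # 8-byte toy file
original_root = root([leaf(b) for b in content])

# Host elides the aligned 2-byte range [4, 6): it records the internal
# node hash covering those bytes, then zeroes them.
saved = node(leaf(content[4]), leaf(content[5]))
redacted = content[:4] + b"\x00\x00" + content[6:]

# Verifier rebuilds the tree from the surviving bytes, slotting the
# saved hash in place of the elided subtree, and checks the old root.
left_half = root([leaf(b) for b in redacted[:4]])
right_half = node(saved, node(leaf(redacted[6]), leaf(redacted[7])))
assert node(left_half, right_half) == original_root
```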

Sound right?

Are there gotchas with this approach?

Any other better way to achieve the same end-result? (That is, use a compact-authentication value from when the content was whole, to still strongly verify unchanged ranges of content after some declared edit-range is removed?)

gojomo

1 Answer

It is not true that there would always be at most two hashes required per elided range.

Suppose you have eight bytes of content and want to elide all but the last byte. You need to supply one hash for the left half (bytes 1–4), one covering the next two (bytes 5–6), and one more for the single remaining elided byte (byte 7) — three hashes in total.

Similarly, with 16 bytes of content, eliding the first 15 requires four hashes. In general, eliding $2^n-1$ bytes out of $2^n$ requires $n$ hashes, so the count is not bounded by a constant but grows linearly with tree height, i.e. logarithmically in content length.
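The hash count can be checked with a short sketch that greedily covers an elided byte range $[start, end)$ with the largest aligned power-of-two subtrees, assuming a perfect binary tree over the leaves; the function name is illustrative.

```python
def covering_subtrees(start: int, end: int) -> int:
    """Count the minimal aligned subtree hashes covering [start, end)."""
    count = 0
    while start < end:
        block = 1
        # Grow the block while it stays aligned at `start` and inside the range.
        while start % (2 * block) == 0 and start + 2 * block <= end:
            block *= 2
        count += 1
        start += block
    return count

assert covering_subtrees(0, 7) == 3    # elide 7 of 8 bytes: three hashes
assert covering_subtrees(0, 15) == 4   # elide 15 of 16 bytes: four hashes
assert covering_subtrees(2, 3) == 1    # a single aligned byte: one hash
```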


There is an additional problem with the scheme: the cost of a preimage search depends on how little content each revealed hash covers. Suppose that the redacted value is a random 128-bit string (encoded as just the raw 16 bytes). Unguessable, right?

Not if it falls poorly on the hash-tree boundaries. It is possible that the redaction algorithm reveals the hash of the first byte, $H(b_1)$, the next-level hash of the following two bytes, $H(H(b_2)||H(b_3))$, and so on — splitting the 16 bytes into pieces of 1, 2, 4, 8 and 1 bytes. The first piece can be found with $2^7$ calls to the hash function on average, the next with about $2^{15}$, and the rest with roughly $2^{31}$, $2^{63}$ and $2^{7}$. Guessing the complete value therefore requires less than $2^{64}$ work: difficult, but much less than the intended $2^{127}$, and probably doable.
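The attack on the smallest piece is trivial to demonstrate: with one-byte leaves, the published leaf hash of a redacted byte admits an exhaustive search over all 256 values. A minimal sketch, assuming SHA-256 as the leaf hash (the hash choice and names are illustrative):

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

secret = b"\x2a"            # the "redacted" byte
revealed = H(secret)        # the leaf hash the redaction scheme publishes

# An attacker tries all 256 byte values: at most 2^8 hash calls,
# 2^7 on average, to recover the redacted byte exactly.
recovered = next(bytes([b]) for b in range(256) if H(bytes([b])) == revealed)
assert recovered == secret
```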

For the redaction to be secure, a much larger block size than one byte would be needed, and each redacted block would need enough (computational) entropy. You might be able to pair each byte of the document with a 128-bit random number and redact those as well, but clearly that would be quite an inefficient encoding (>16x expansion).
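The salting idea can be sketched as follows, assuming each byte is paired with a fresh 128-bit random value that is stored alongside the content and withheld together with the byte on redaction; SHA-256 and all names are illustrative, and the ~16x storage expansion is the cost noted above.

```python
import hashlib
import os

def salted_leaf(b: int, salt: bytes) -> bytes:
    # Leaf hash over salt || byte; without the salt, an exhaustive
    # search over byte values is no longer feasible (~2^128 work).
    return hashlib.sha256(salt + bytes([b])).digest()

content = b"secret"
salts = [os.urandom(16) for _ in content]   # one 128-bit salt per byte
leaves = [salted_leaf(b, s) for b, s in zip(content, salts)]

# Redacting byte i now means withholding both content[i] and salts[i];
# the published leaf hash alone reveals nothing practically guessable.
```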

otus