Which hash functions have the mathematical properties required to prove data availability?

Question

I'm looking for a specific keyed hash function with security properties that allow it to be used for a step in an interactive proof, in which a Verifier has some message, and the Prover needs to prove that they have the entirety of this message stored.

While this looks like a simple Proof of Knowledge (trivially solved by the verifier sending a random key and the prover replying with a universal hash), the larger proof I'm using this in requires a subtly stronger condition I'm calling Proof of Data Availability, which I describe below. It's easy to find research backing up the idea that all the modern keyed hash algorithms are secure wrt Proof of Knowledge, and my gut feeling is that they're probably all secure for Proof of Data Availability -- but a gut feeling is not a security proof, and I'm struggling to find actual research that addresses the additional properties an algorithm suitable for Data Availability would have.
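For concreteness, the "trivial" challenge-response step described above can be sketched as follows. This is only an illustration: HMAC-SHA256 is used as a stand-in keyed hash, the function names `challenge`, `respond`, and `verify` are hypothetical, and nothing here claims that HMAC-SHA256 satisfies the stronger Data Availability property, which is exactly what the question asks about.

```python
import hashlib
import hmac
import secrets


def challenge() -> bytes:
    """Verifier: pick a fresh random key for this round (32 bytes is an assumption)."""
    return secrets.token_bytes(32)


def respond(message: bytes, key: bytes) -> bytes:
    """Prover: compute the keyed hash over the full stored message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(message: bytes, key: bytes, response: bytes) -> bool:
    """Verifier: recompute the keyed hash locally and compare in constant time."""
    return hmac.compare_digest(respond(message, key), response)


if __name__ == "__main__":
    stored = b"example message held by both parties"
    k = challenge()
    print(verify(stored, k, respond(stored, k)))  # honest prover is accepted
```

An honest prover needs all of `message` to call `respond`; the rest of the question is about whether a dishonest prover can get the same output from something smaller.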

Data Availability

The goal of an attack against a scheme for proving Data Availability isn't to be able to forge or replay past hashes of an unknown secret message based on snooped traffic. Rather, the attacker is a dishonest data storage service that is attempting to forge a proof to a customer that the customer's data is securely stored and retrievable, but who has instead discarded most or all of it in favor of a smaller "compressed oracle representation" that lossily discards some or all of the original while retaining the ability to calculate the message's hash for any given key.

That is, instead of storing the full message $m$ in order to calculate $\mathrm{H}_k(m)$ in the specified way during the proof, the attacker instead computes and stores some other value $\mathrm{C}(m) = \mu$, with the size of this compressed representation being strictly smaller than the size of the original message, $\|\mu\| < \|m\|$; then during the proof the attacker uses some other algorithm $\mathrm{F}_{\mu}(k) \to \mathrm{H}_k(m)$ to compute the hash from only $\mu$ and the provided key $k$.

(Note that $m$ is assumed incompressible: if there's some efficient $\mathrm{C}^{-1}(\mu) = m$ that would allow an "attacker" to complete the proof by just computing $\mathrm{F}_{\mu}(k) = \mathrm{H}_k(\mathrm{C}^{-1}(\mu))$, their claim of being able to retrieve the original message isn't dishonest.)

Relationship to Proof of Knowledge

Proof of Knowledge is Not Necessary

The goal of an attack against Proof of Knowledge doesn't always involve learning the plaintext message, but an attacker who can break Proof of Knowledge for a given hash function with only a small number of key-hash pairs can use it to break Data Availability.

In a Proof of Knowledge attack, the attacker observes exchanges between a prover and verifier and obtains some number of pairs $P = \{(k_1, \mathrm{H}_{k_1}(m)), \ldots, (k_n, \mathrm{H}_{k_n}(m))\}$, and then employs some efficient algorithm capable of computing $\mathrm{A}(P, k) \to \mathrm{H}_k(m)$ to calculate a new hash value for a verifier-chosen key not among their set of pairs.

To try to turn this into an attack on Data Availability, the attacker defines a compression function $\mathrm{C}(m) = P = \mu$ to generate and store a compressed set of key-hash pairs they've generated themselves; then at the proving step they can use that same algorithm $\mathrm{A}$ by defining $\mathrm{F}_{\mu}(k) = \mathrm{A}(\mu, k) \to \mathrm{H}_k(m)$. But this is only a break of Data Availability if the $\|\mu\| < \|m\|$ constraint is satisfied. If, even with chosen keys, the attack needs more pairs than would fit in the space required to store $m$ itself, then it's not technically a break wrt Data Availability.
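As a rough worked version of that size constraint, assume (purely for illustration) that each stored pair costs its key length plus its digest length, and that $n$ is the number of pairs the algorithm $\mathrm{A}$ needs:

$$\|\mu\| = n \cdot (\|k\| + \|\mathrm{H}_k(m)\|) < \|m\| \quad\Longleftrightarrow\quad n < \frac{\|m\|}{\|k\| + \|\mathrm{H}_k(m)\|}$$

For example, with a 1 MiB message and 32-byte keys and digests, $n \cdot 64 < 2^{20}$ gives $n \le 16383$. An attacker who derives the chosen keys from a short seed instead of storing them would pay closer to $n \cdot \|\mathrm{H}_k(m)\|$, so this bound should be read as an estimate under the stated storage assumption, not a tight limit.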

Proof of Knowledge is Not Sufficient

To demonstrate this, take this toy intentionally-poorly-constructed universal hash, VBMAC-SHA256 ("very bad MAC, SHA256"), which is like HMAC except that it appends the keys instead of prepending them:

$$\mathrm{H}_k(m) = \mathrm{SHA}(\mathrm{SHA}( m \,\|\, k_{ipad} ) \,\|\, k_{opad} )$$

Theory says this ought to still be secure for Proof of Knowledge. Even if an attacker managed to learn pairs of the inner hash, i.e.: $$\mathrm{SHA}( m \,\|\, k_{ipad} ) = \mathrm{SHA_{finalize}}(\mathrm{SHA_{extend}}(\mathrm{SHA_{IV}}, m \,\|\, k_{ipad}) ) = \mathrm{SHA_{finalize}}(\mathrm{SHA_{extend}}(\mathrm{SHA_{extend}}(\mathrm{SHA_{IV}}, m), k_{ipad}) ) $$ That's still not a full break wrt Proof of Knowledge, because the state-vector-extension and finalizer functions in SHA-256 are still (claimed) strong enough to prevent an attacker from efficiently computing their inverses to derive $\mathrm{SHA_{extend}}(\mathrm{SHA_{IV}}, m)$.

But, thanks to that internal MD structure of the SHA algorithm, an attacker who has learned $m$ can trivially compute a compressed representation $\mu$ that's more than adequate to break Data Availability:

$$\mathrm{C}(m) = \mathrm{SHA_{extend}}(\mathrm{SHA_{IV}}, m) = \mu \\ \mathrm{F}_\mu(k) = \mathrm{SHA}(\mathrm{SHA_{finalize}}(\mathrm{SHA_{extend}}(\mu, k_{ipad} ))\,\|\, k_{opad}) $$
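The attack above can be checked with a short script. This is a minimal sketch under stated assumptions: Python's `hashlib` does not expose the SHA-256 chaining state, so the compression function is reimplemented here in pure Python; the names `sha_extend`, `sha_finalize`, `compress`, and `forge` are hypothetical and mirror $\mathrm{SHA_{extend}}$, $\mathrm{SHA_{finalize}}$, $\mathrm{C}$, and $\mathrm{F}_\mu$; $k_{ipad}$ and $k_{opad}$ are assumed to be the zero-padded key XORed with `0x36` and `0x5c` as in HMAC; and keys are assumed to be at most 64 bytes. Because SHA-256 only updates its state on whole 64-byte blocks, $\mu$ here also carries the unabsorbed tail of $m$ (under 64 bytes) and the message length.

```python
import hashlib
import math
import os

MASK = 0xFFFFFFFF

def _primes(n):
    ps, c = [], 2
    while len(ps) < n:
        if all(c % p for p in ps):
            ps.append(c)
        c += 1
    return ps

def _icbrt(n):
    # Exact floor of the cube root, by binary search on integers.
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# SHA-256 constants: fractional bits of cube/square roots of the first primes.
K = [_icbrt(p << 96) & MASK for p in _primes(64)]
IV = [math.isqrt(p << 64) & MASK for p in _primes(8)]

def _rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def sha_extend(state, data):
    """Absorb whole 64-byte blocks into the chaining state."""
    assert len(data) % 64 == 0
    state = list(state)
    for off in range(0, len(data), 64):
        w = [int.from_bytes(data[off + 4 * i: off + 4 * i + 4], "big") for i in range(16)]
        for i in range(16, 64):
            s0 = _rotr(w[i - 15], 7) ^ _rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
            s1 = _rotr(w[i - 2], 17) ^ _rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
            w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
        a, b, c, d, e, f, g, h = state
        for i in range(64):
            t1 = (h + (_rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25))
                  + ((e & f) ^ (~e & g)) + K[i] + w[i]) & MASK
            t2 = ((_rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22))
                  + ((a & b) ^ (a & c) ^ (b & c))) & MASK
            a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
        state = [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]
    return state

def sha_finalize(state, tail, total_len):
    """Apply SHA-256 padding for a total_len-byte input and emit the digest."""
    pad = b"\x80" + b"\x00" * ((55 - total_len) % 64) + (8 * total_len).to_bytes(8, "big")
    return b"".join(s.to_bytes(4, "big") for s in sha_extend(state, tail + pad))

def pads(key):
    k = key.ljust(64, b"\x00")  # assumption: HMAC-style padding, key <= 64 bytes
    return bytes(x ^ 0x36 for x in k), bytes(x ^ 0x5C for x in k)

def vbmac(key, m):
    """Honest prover: needs all of m."""
    ipad, opad = pads(key)
    return hashlib.sha256(hashlib.sha256(m + ipad).digest() + opad).digest()

def compress(m):
    """C(m): 32-byte midstate, plus the tail of m (< 64 bytes) and len(m)."""
    cut = len(m) - len(m) % 64
    return sha_extend(IV, m[:cut]), m[cut:], len(m)

def forge(mu, key):
    """F_mu(k): answer a challenge from the compressed representation only."""
    state, tail, mlen = mu
    ipad, opad = pads(key)
    data = tail + ipad
    cut = len(data) - len(data) % 64
    state = sha_extend(state, data[:cut])
    inner = sha_finalize(state, data[cut:], mlen + 64)
    return hashlib.sha256(inner + opad).digest()

if __name__ == "__main__":
    m = os.urandom(10_007)   # stand-in for an incompressible message
    mu = compress(m)         # under 128 bytes, regardless of len(m)
    k = os.urandom(32)       # verifier's fresh challenge key
    assert forge(mu, k) == vbmac(k, m)
```

The stored $\mu$ stays at a fixed size of at most 32 + 63 bytes plus a length field however large $m$ is, which is what makes the construction fail for Data Availability even though inverting the state remains hard.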

Group Theory: Associative Representation Vulnerability

Of course, it's no great surprise that it's possible to show that an obviously-insecure construction like VBMAC-SHA256 is insecure. The underlying vulnerability, though, is a far more general group-theoretic result that applies to all hash functions, not just MD-construction ones:

Given a universal hash function family $\mathrm{H}$, a set of possible blocks $b_i \in B$, and messages $m=\langle b_1, b_2, \ldots, b_n \rangle$, you can define an associated block-action group $(G, \cdot)$ consisting of the set of primitive elements $g(b_i) \in G$ defined by the group action $g(x) : \mathrm{H}_k(y) \mapsto \mathrm{H}_k(x \,\|\, y) \;\; \forall y, k$, together with their closure under the group operator $\cdot$, which includes all $g(m) = g(\langle b_1, b_2, \ldots, b_n \rangle) = g(b_1) \cdot g(b_2) \cdot \ldots \cdot g(b_n)$.

Regardless of any other properties, if composite elements $g(b_1) \cdot g(b_2) \cdot \ldots \cdot g(b_n)$ of this group admit storage representations smaller than $m = \langle b_1, b_2, \ldots, b_n \rangle$, then the underlying hash function is not secure for Proof of Data Availability.

Use Case

The surrounding proof that will make use of Data Availability ultimately aims to allow a small microcontroller to prove firmware integrity, i.e. that the software being run is the specific firmware image that the verifier expects.

Under a threat model that doesn't enable an attacker to modify its hardware (only its software), and a system design that places a strict upper bound on the amount of persistent storage this microcontroller has available to it, it's possible to derive Firmware Integrity from Data Availability of an incompressible firmware image that fills the entire available space, using a pigeonhole argument. Specifically, the argument relies on the squeeze between that upper bound on available storage and the lower bound on the minimum code size for an adversarial firmware image (which, to count as adversarial, has to include some type of payload on top of the original image needed to pass the Data Availability proof).

Some margin does need to be added to this to account for the possibility of an adversarial program stashing bytes in unaccounted-for places (e.g. overwriting the calibration parameters of some temperature sensor or something). Any additional compressibility margin on the actual firmware also needs to be accounted for, but that can be minimized by shipping a firmware image that is already aggressively compressed and then random-filling all remaining space.

But the logic needed to coordinate any type of subterfuge or indirection also has a certain minimum possible code size, and if nothing else, it substantially raises the floor for the level of skill and effort required to engineer a successful exploit that fits in the handful of spare bytes an attacker manages to wring out using some marginally-better lossless compression method.
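The pigeonhole argument can be written out as a short inequality chain. The symbols here are introduced only for illustration and are not part of any established notation: $S$ is the upper bound on persistent storage, $\varepsilon_{stash}$ is the number of bytes an adversary can hide in unaccounted-for places, $\varepsilon_{comp}$ is the residual compressibility of the shipped image $m$ (with $\|m\| = S$), and $p$ is the adversarial payload plus its coordinating logic, with minimum size $p_{min}$.

$$\|\mu\| + \|p\| \le S + \varepsilon_{stash} \qquad \text{(everything the adversary keeps must fit)}$$

$$\|\mu\| \ge \|m\| - \varepsilon_{comp} = S - \varepsilon_{comp} \qquad \text{(Data Availability, up to lossless compression)}$$

$$\Longrightarrow \quad \|p\| \le \varepsilon_{stash} + \varepsilon_{comp}$$

So under these assumptions no adversarial image can pass the proof whenever $p_{min} > \varepsilon_{stash} + \varepsilon_{comp}$. The second line is the only step that depends on the hash function.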

But a universal hash that admits a compressed oracle representation of the message would be entirely broken from the perspective of this proof -- even if it's secure enough to be suitable for other tasks.

The Question

What specific algorithms exist for universal hashing that make credible security claims about the infeasibility of a compressed oracle representation, or equivalently the infeasibility of computing compact associative representations for the relevant elements of their associated group?

aiootp · Answer 1 · 2024-06-21T23:22:07.813

What hash algorithm/construction to use to prove Data Availability?

The problem of proving data availability has only auxiliary need of a specific function or construction, dependent on the interaction protocol with the remote location being probed.

Said another way, proving information $X$ exists in a physical location $L'$ requires an observation to occur at location $L$ after receiving a signal from location $L'$. This requires communication / interaction — subject to universal physical laws regarding information travel through space.

A communication protocol⁽⁰⁾ that can prove an observation of some information occurred at some time at some remote location is what's needed. The protocol may involve a hash function, a zero-knowledge proof, or something else. But, it must involve communication / interaction with something that has interacted with the information at the remote location. IPFS⁽¹⁾⁽²⁾ is one project that has attempted to solve this problem, which may provide a research direction for you.