
Some standard compression procedures, such as gzip as used by the HTTP protocol (an IANA-registered content coding), will consume CPU time anyway... so we can reuse the compressed file, $Z(x)$, in the checksum procedure. That is, we can use $H(Z(x))$ instead of $H(x)$.
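For concreteness, here is a minimal sketch of the two checksum variants in Python, assuming SHA-256 for $H$ and gzip for $Z$ (the function names below are illustrative, not part of the question):

```python
import gzip
import hashlib

def hash_raw(data: bytes) -> str:
    """H(x): hash the original bytes."""
    return hashlib.sha256(data).hexdigest()

def hash_compressed(data: bytes) -> str:
    """H(Z(x)): hash the gzip-compressed bytes.

    mtime=0 keeps the gzip header deterministic; otherwise the embedded
    timestamp would change the checksum on every run.
    """
    return hashlib.sha256(gzip.compress(data, mtime=0)).hexdigest()

x = b"<html><body>hello</body></html>"
print(hash_raw(x))
print(hash_compressed(x))
```

In the HTTP scenario the compressed bytes already exist, so the second variant reuses $Z(x)$ rather than paying for compression a second time; that is the CPU saving referred to above.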

Let $H$ be a collision-resistant hash function and $P_c[H](S)$ the collision-rate probability over a sample set $S$ of input elements (e.g. random numbers). Does it decrease when using $Z(x)$? That is, is
$$P_c[H\circ Z](S) \;\le\; P_c[H](S)\,?$$


Intuitively, we can imagine some decrease in the collision rate because hashing a file is similar to a random sampling procedure: if we reduce the redundancy in the sample set, we reduce the collision probability. HTML, XML, and many other file formats have a lot of redundancy.

If the intuition is correct, the question is also "how much better is it?"
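One way to probe the intuition is to shrink the hash width until collisions become observable and compare the counts with and without compression. The sketch below is my own illustration, not part of the question: it truncates SHA-256 to 16 bits as a stand-in for a narrow random-oracle-like $H$, uses gzip for $Z$, and feeds in synthetic, highly redundant XML-like inputs.

```python
import gzip
import hashlib

WIDTH_BYTES = 2  # truncate to 16 bits so collisions are frequent enough to count

def h(data: bytes) -> bytes:
    """Narrow hash H: first 16 bits of SHA-256."""
    return hashlib.sha256(data).digest()[:WIDTH_BYTES]

def z(data: bytes) -> bytes:
    """Compression Z: gzip with a fixed mtime for determinism."""
    return gzip.compress(data, mtime=0)

# Highly redundant, pairwise-distinct inputs (XML-like wrappers around a counter).
sample = [f"<doc><id>{i}</id><pad>{'x' * 200}</pad></doc>".encode()
          for i in range(20000)]

collisions_raw = len(sample) - len({h(s) for s in sample})
collisions_cmp = len(sample) - len({h(z(s)) for s in sample})

print("collisions hashing x   :", collisions_raw)
print("collisions hashing Z(x):", collisions_cmp)
```

In experiments of this kind the two counts tend to come out comparable rather than showing a systematic reduction for $H\circ Z$, which anticipates the answer below.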


NOTICE: I am using the unusual term collision-rate probability (with a link to this other question) to avoid confusion with the "general collision probability". Please take care not to mix the two concepts in the answer.

Peter Krauss

1 Answer


No. On the contrary, if we model $H$ as a random function chosen independently of $Z$ and $S$, then $P_c[H\circ Z](S)=P_c[H](S)$.

The argument is that $P_c[H](S)$ depends only on the output width of $H$ and the number of elements in $S$, as long as $H$ is modeled as a random function chosen independently of $S$. Define $S'$ as the set of $Z(s)$ for $s$ in $S$. It holds that $P_c[H\circ Z](S)=P_c[H](S')$, and because compression is reversible (hence $Z$ is injective), $S'$ and $S$ have the same number of elements.
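As a sketch under that random-function model (with $w$ denoting the output width of $H$ in bits and $n=|S|$, notation added here for illustration), the birthday bound makes the dependence explicit:
$$P_c[H](S)\;=\;1-\prod_{i=0}^{n-1}\Bigl(1-\frac{i}{2^{w}}\Bigr)\;\approx\;1-e^{-n(n-1)/2^{w+1}},$$
a function of $w$ and $n$ only; since $Z$ is injective, $|S'|=|S|=n$, so the value is unchanged for $H\circ Z$.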

fgrieu