As per the paper "Recursive n-gram hashing is pairwise independent, at best", I want to use the algorithm described in chapters 6 and 7 (pages 7-10). The hash works as follows:
Define a random function $h_1$ that maps elements in a set $B$ to elements in a set $I$, where $|B| = 2^8$ and $|I| = 2^{32}$.
That is, elements of $B$ are all single-byte values, and elements of $I$ are 32-bit numbers:

    h1 = array of 32-bit numbers with indices in <0; 255>
    FillArrayWithRandomValues(h1);

Interpret the values of $h_1$ as polynomials in $GF(2)[x] / p(x)$, where $\deg(p(x)) = n = 32$ and $p(x)$ is an irreducible polynomial (also chosen at random, just like $h_1$). So the degree of polynomials in $I$ is at most 31, and $GF(2)[x] / p(x)$ is a field.
The hash function $h$ is then defined as:
$$h(a_1, a_2, \ldots, a_n) = h_1(a_1) * x^{n-1} + h_1(a_2) * x^{n-2} + \ldots + h_1(a_n) * x^0$$

where $a_i$ is the $i$-th input byte (an element of $B$) to hash. Basically, we take a polynomial from $h_1$ based on $a_i$ and multiply it by $x^{n-i}$.

All coefficient arithmetic is done in $GF(2)$. Since I am reducing everything modulo a polynomial of degree 32 (a 33-bit number), the output of $h$ is a polynomial of degree at most 31 over $GF(2)$ (a 32-bit value), which the paper proves to be uniformly distributed and pairwise independent.
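To make the definition concrete, here is a minimal Python sketch of $h$. It is only an illustration under stated assumptions: the names `mul_x`, `make_h1`, `hash_horner` and `hash_direct` are mine, and the fixed constant `P_LOW` (the low 32 bits of the CRC-32 generator polynomial) merely stands in for the secret, randomly chosen $p(x)$. It evaluates $h$ both term by term, as in the formula, and with Horner's rule, which needs only one multiplication by $x$ per input byte.

```python
import random

MASK32 = 0xFFFFFFFF
# Assumption: low 32 bits of a fixed degree-32 polynomial (the CRC-32 generator),
# standing in for the secret, randomly chosen p(x). The x^32 term is implicit.
P_LOW = 0x04C11DB7


def mul_x(v, p_low=P_LOW):
    """Multiply a polynomial of degree < 32 by x, then reduce modulo p(x)."""
    carry = v >> 31                  # coefficient of x^31 before the shift
    v = (v << 1) & MASK32            # shift left; the x^32 term is handled via carry
    return v ^ p_low if carry else v


def make_h1(rng):
    """The random table h1: one 32-bit polynomial per byte value."""
    return [rng.getrandbits(32) for _ in range(256)]


def hash_horner(data, h1):
    """Evaluate h with Horner's rule: one multiplication by x per byte."""
    acc = 0
    for byte in data:
        acc = mul_x(acc) ^ h1[byte]
    return acc


def hash_direct(data, h1):
    """Evaluate h term by term, following the formula literally."""
    n = len(data)
    acc = 0
    for i, byte in enumerate(data, start=1):
        term = h1[byte]
        for _ in range(n - i):       # multiply h1(a_i) by x^(n-i)
            term = mul_x(term)
        acc ^= term                  # addition in GF(2)[x] is XOR
    return acc
```

Both functions return the same 32-bit value for any input, because multiplication by $x$ modulo $p(x)$ is linear over $GF(2)$.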
This type of hash function is also called a universal hash function. Because of the pairwise independence, $h$ is a strongly universal hash function. According to the standard counting formula for irreducible polynomials, there are 134 215 680 irreducible polynomials of degree 32 over $GF(2)$, so the choice of $p(x)$ adds about 27 bits of entropy to the 8 192 bits of entropy coming from $h_1$, so the hash family is pretty large.
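As a sanity check on these numbers, the count of monic irreducible polynomials of degree $n$ over $GF(2)$ is $\frac{1}{n} \sum_{d \mid n} \mu(d)\, 2^{n/d}$, where $\mu$ is the Möbius function. For $n = 32$ only $d = 1$ and $d = 2$ contribute, giving $(2^{32} - 2^{16}) / 32$. The short sketch below (function names are my own) evaluates the formula:

```python
import math


def mobius(d):
    """Möbius function by trial division (fine for small d)."""
    result = 1
    p = 2
    while p * p <= d:
        if d % p == 0:
            d //= p
            if d % p == 0:       # squared prime factor
                return 0
            result = -result
        p += 1
    if d > 1:                    # one remaining prime factor
        result = -result
    return result


def count_irreducible(n):
    """Number of monic irreducible polynomials of degree n over GF(2)."""
    total = sum(mobius(d) * 2 ** (n // d) for d in range(1, n + 1) if n % d == 0)
    return total // n


count = count_irreducible(32)
print(count)                 # 134215680
print(math.log2(count))      # just under 27 bits for the choice of p(x)
print(256 * 32)              # 8192 bits for the table h1
```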
I play the following game with an attacker. He can submit up to $2^{64}$ queries ($2^{64}$ strings $a_1, a_2, ..., a_n$ of his choice), and I reply $true$ to a query if the last $k$ bits of $h(a_1, a_2, ..., a_n)$ are all 0 (the coefficients of the last $k$ terms of the result polynomial are 0); otherwise I reply $false$. I choose $h_1$ and $p(x)$ at random and keep them secret, but everything else is known. I haven't decided on the value of $k$ yet, but it will be an integer with $12 \le k \le 18$.
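The game can be modelled with a small oracle object. This is only a sketch under assumptions of mine: the class name `Oracle` and its methods are hypothetical, "last $k$ bits" is taken to mean the $k$ lowest-order coefficients, and a fixed polynomial (the low 32 bits of the CRC-32 generator) stands in for the secret random $p(x)$.

```python
import random


class Oracle:
    """The defender's side of the game: secret h1 and p(x), one bit per query."""

    def __init__(self, k, seed=None):
        rng = random.Random(seed)
        self.k = k
        self.h1 = [rng.getrandbits(32) for _ in range(256)]   # secret table
        # Assumption: fixed stand-in for the secret random irreducible p(x)
        # (low 32 bits of the CRC-32 generator; the x^32 term is implicit).
        self.p_low = 0x04C11DB7
        self.queries = 0

    def _hash(self, data):
        acc = 0
        for byte in data:
            carry = acc >> 31                 # x^31 coefficient before the shift
            acc = (acc << 1) & 0xFFFFFFFF     # multiply by x
            if carry:
                acc ^= self.p_low             # reduce modulo p(x)
            acc ^= self.h1[byte]              # add h1(a_i)
        return acc

    def query(self, data, limit=2 ** 64):
        """Answer true iff the k low-order bits of h(data) are all 0."""
        if self.queries >= limit:
            raise RuntimeError("query budget exhausted")
        self.queries += 1
        # Assumption: "last k bits" means the k lowest-order coefficients.
        return (self._hash(data) & ((1 << self.k) - 1)) == 0
```

For random queries the oracle answers $true$ roughly once in $2^k$ attempts, so with $k$ between 12 and 18 most of the attacker's information comes from comparatively rare positive answers.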
Given the game described above:
- What can an attacker learn about $h_1$ and $p(x)$?
- After he has made all the queries, is he e.g. able to predict whether I will answer $true$ or $false$ for some strings that he did not submit during the game?
- What would change if I used $0 < n < 32$, such as $n = 16$ or $n = 4$?
- What would change if $p(x)$ was revealed to the attacker before he makes any queries?
- What would change if he could make up to $2^{128}$ queries?
First progress
So there is a simple way to find some information about $h_1$. Let's say we have 2 input strings:
- $a_1$ $a_2$ ... $a_{31}$ $a'_{32}$
- $a_1$ $a_2$ ... $a_{31}$ $a''_{32}$
I.e. they differ only in the last byte. If the last $k$ bits of the hashes of both strings are 0, then we know that the last $k$ bits of $h_1(a'_{32})$ and $h_1(a''_{32})$ are the same (even though we don't know what those bits are). This is easy to see when the hash function is written in code (for input strings of length 2):
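
    // Assumes 32-bit unsigned arithmetic; irreduciblePoly holds the low 32 bits of p(x) (the x^32 term is implicit)
    var hash = h1[a1];
    // Galois multiplication by x and subsequent reduction (applied only if the x^31 coefficient was set)
    hash = (hash << 1) ^ ((hash >> 31) * irreduciblePoly);
    // Addition in GF(2)[x] is XOR
    hash = hash ^ h1[a2];
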
Using this method, I can assign the members of $h_1$ to groups based on equality of their last $k$ bits (so there will be up to $2^k$ groups, and I will know exactly which members, or more precisely which indices into $h_1$, belong to which group).
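The grouping idea can be simulated as follows. This is a hedged sketch, not a real attack: the helper names are mine, a fixed polynomial stands in for the secret $p(x)$, "last $k$ bits" is taken as the low-order bits, and I use a small $k$ so that positive answers are frequent enough to see the effect quickly. The attacker part (`recover_groups`) only ever calls the oracle; for a fixed 31-byte prefix, the last bytes that give $true$ form one group.

```python
import random

MASK32 = 0xFFFFFFFF
P_LOW = 0x04C11DB7  # assumption: fixed stand-in for the secret p(x), x^32 implicit


def hash_bytes(data, h1, p_low=P_LOW):
    acc = 0
    for byte in data:
        carry = acc >> 31
        acc = (acc << 1) & MASK32     # multiply by x
        if carry:
            acc ^= p_low              # reduce modulo p(x)
        acc ^= h1[byte]
    return acc


def make_oracle(h1, k):
    """Black-box oracle: true iff the k low-order bits of the hash are 0."""
    mask = (1 << k) - 1
    return lambda data: (hash_bytes(data, h1) & mask) == 0


def recover_groups(oracle, prefix_len=31, tries=64, seed=0):
    """Attacker side: collect sets of last bytes that give true for one prefix."""
    rng = random.Random(seed)
    groups = set()
    for _ in range(tries):
        prefix = bytes(rng.randrange(256) for _ in range(prefix_len))
        hits = frozenset(b for b in range(256) if oracle(prefix + bytes([b])))
        if hits:
            groups.add(hits)
    return groups


secret_rng = random.Random(42)
secret_h1 = [secret_rng.getrandbits(32) for _ in range(256)]
groups = recover_groups(make_oracle(secret_h1, k=6))
print(len(groups), "groups recovered")
```

Each recovered set contains exactly the indices whose $h_1$ values share the same last $k$ bits, although the bits themselves stay unknown to the attacker.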
More progress
Let there be 2 input strings:
- $a_1$, $a_2$, ... $a'_i$ ... $a_{32}$
- $a_1$, $a_2$, ... $a''_i$ ... $a_{32}$
The strings only differ in the $i$-th byte. If we find out that the last $k$ bits of hashes of both strings are equal to $0$, then we know that last $k$ bits of $h_1(a'_i) * x^{n-i}\ \textrm{mod}\ p(x)$ and $h_1(a''_i) * x^{n-i}\ \textrm{mod}\ p(x)$ must be the same (based on the $(a + I) + (b + I) = (a + b) + I$ rule).
Then, if we have another set of strings:
- $b_1$, $b_2$, ... $b_{i-1}$, $a'_i$, $b_{i + 1}$, ... $b_{32}$
- $b_1$, $b_2$, ... $b_{i-1}$, $a''_i$, $b_{i + 1}$, ... $b_{32}$
where $b_t$ is an arbitrary byte, then the hashes of these two strings will always lead to the same answer: either both strings give $true$ (the last $k$ bits of the hash are $0$) or both strings give $false$.
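This transfer property can be checked empirically. The sketch below is illustrative only: the function names are mine, the position `i` is a 0-based index (the text above counts from 1), a fixed polynomial stands in for the secret $p(x)$, and $k$ is kept small so that a colliding pair is found quickly. `find_pair` looks for two bytes that both give $true$ at position `i` in one context, and `answers_agree` then tests them in many unrelated contexts.

```python
import random

MASK32 = 0xFFFFFFFF
P_LOW = 0x04C11DB7  # assumption: fixed stand-in for the secret p(x), x^32 implicit


def hash_bytes(data, h1, p_low=P_LOW):
    acc = 0
    for byte in data:
        carry = acc >> 31
        acc = (acc << 1) & MASK32     # multiply by x
        if carry:
            acc ^= p_low              # reduce modulo p(x)
        acc ^= h1[byte]
    return acc


def make_oracle(h1, k):
    mask = (1 << k) - 1
    return lambda data: (hash_bytes(data, h1) & mask) == 0


def find_pair(oracle, i, n=32, tries=200, seed=0):
    """Find two bytes that both give true at position i of the same context."""
    rng = random.Random(seed)
    for _ in range(tries):
        ctx = bytearray(rng.randrange(256) for _ in range(n))
        hits = []
        for b in range(256):
            ctx[i] = b
            if oracle(bytes(ctx)):
                hits.append(b)
        if len(hits) >= 2:
            return hits[0], hits[1]
    return None


def answers_agree(oracle, i, pair, n=32, trials=200, seed=1):
    """Check that both bytes yield the same answer in random other contexts."""
    rng = random.Random(seed)
    for _ in range(trials):
        ctx = bytearray(rng.randrange(256) for _ in range(n))
        ctx[i] = pair[0]
        first = oracle(bytes(ctx))
        ctx[i] = pair[1]
        if oracle(bytes(ctx)) != first:
            return False
    return True
```

Once such a pair is known for position $i$, one query answers two strings at once, which is one way the attacker can predict replies for strings he never submitted.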
References
- Quick intro to finite fields
- Binary carryless multiplication
- Universal hash functions based on finite field arithmetics (code)
- Bleichenbacher's Attack on PKCS 1 might provide some ideas on how to attack this hash function, as it shows a technique to decrypt RSA PKCS 1 encrypted messages given an oracle that tests whether the 16 most significant bits of the decryption of $r * C\ \textrm{mod}\ N$ are equal to $2$, for any $r$ of the attacker's choice (from RSA-survey.pdf p. 14). Obviously these problems are only loosely related, but the attack may contain some useful ideas nevertheless.
DISCLAIMER: I am not sure whether this question belongs on the Math, Security, Cryptography or Stack Overflow forums, but I think mathematicians are the most likely to be able to provide an answer. I am a programmer, so my question probably does not use the "standard" math terminology; feel free to edit it to clarify it for others.