Is it a good idea to use bitwise XOR on a set of MD5 sums?

Question

I have designed an SQL aggregate function in Oracle that bitwise XORs all MD5 sums of the values stored in a column.

For example, if my table is:

+-----+----------+---------+
| Key | Sequence |  Value  |
+-----+----------+---------+
|   1 |        1 | 'Hello' |
|   1 |        2 | 'World' |
|   2 |        1 | '1234'  |
|   3 |        0 | (empty) |
|   4 |        1 | 'Hello' |
|   4 |        3 | 'World' |
+-----+----------+---------+

I can run the following query in Oracle:

with t AS (select 1 key, 1 sequence, 'Hello' value FROM dual
           union all select 1, 2, 'World' from dual
           union all select 2, 1, '1234' from dual
           union all select 3, 0, '' from dual /* ... */
          )
   select key, md5_agg(value) from t group by key

and get (unfortunately aggregate functions in Oracle ignore NULL values and '' is considered as NULL)

+---+----------------------------------+
|key| md5_agg(value)                   |
+---+----------------------------------+
| 1 | 7EBD0B1DA67F965F802D31DF25C4B321 |
| 2 | 81DC9BDB52D04DC20036DBD8313ED055 |
| 3 | 00000000000000000000000000000000 |
| 4 | 7EBD0B1DA67F965F802D31DF25C4B321 |
+---+----------------------------------+

I would like to use this approach to compare if the contents of some columns are equal when I compare subsets of the same table (think of finding duplicates in a complex structures that spans over multiple rows in the same table). Here with this results I know that I have the same subsets for keys 1 and 4.

What are the limits of such an approach? Here are the ones I could list:

This is interesting only if my column contains distinct values. If my columns contains twice the same string, the xor operation will be a no-op.
Due to Oracle limitations, if my column contains empty values, they do not count.

With those limitations in mind, is it still possible to infer, from two equal md5_agg results computed from distinct and non-empty values, that the original values make up the same sets?

In order to reformulate, are there odds that the MD5 sums of distinct strings XOR to 0?

score 6 · Answer 1 · edited Apr 13 '17 at 12:48

The short answer is no. The scheme gives poor protection against collisions, that is inputs detected as having the same content (within order) when they have not.

As noted in the question, this can occur when entries in an input are duplicated; e.g. ("O","X","O") and ("X") collide.

This can also occur for maliciously crafted entries. For a start, MD5 is broken, in the sense that it is now easy to create different strings/entries that hash to the same MD5 result; replacing one by the other will go undetected. And even if one uses a collision-resistant hash such as SHA-256, it is very easy to construct a set of messages such that the XOR of their hashes equals a given value, by mere Gaussian elimination. See the answer to this question.

On the other hand, short of the issue of duplicated entries in an input, there is little chance that the scheme fails by accident.

A robust scheme would be:

hash each string using SHA-256
sort the hashes by ascending value
hash the concatenation of the sorted hashes using SHA-256

score 5 · Accepted Answer · edited Apr 13 '17 at 12:48

There are two points to consider here:

How likely is it that two different strings give the same MD5 hash?

This is known as a hash collision. A good hash function makes this probability as small as possible (this would be about $1/2^{128}$ for MD5). If you have a larger number of strings to hash, the probability of any collision between any of them grows ... but you need about $2^{64}$ strings to have a non-neglectible chance of collision (This is the birthday paradox).

Unfortunately, MD5 is not a good hash function - its collision-resistance is basically broken. It doesn't take much work to create two different strings with the same hash. So if you want to ensure this even in the face of an adversary feeding strings into your database, don't use MD5. Use one of the SHA-2 hash functions instead, they are still considered secure. (They also have a larger output size, which also moves the birthday bound higher.)

Assuming an ideal hash function (e.g. a random oracle), we can reduce the question to this:

Given two sets $A$, $B$ of $k_A$, $k_B$ random bit-strings of same lengths $n$, how likely is it that we have this? $$\bigoplus_{i=1}^{k_A} A_i = \bigoplus_{j=1}^{k_B} B_j$$

Actually, the XOR of random strings (or even many non-random strings with a random one) is again a random string, so this boils down to how likely is it that two random strings of size $n$ are identical?

Here we have the same answer as above: This probability is $1/2^{n}$ for two strings, and after about $2^{n/2}$ such random strings you'll get a good chance for a collision.

So, it looks like your scheme is solid, just make sure to choose a good hash function (not MD5, and even SHA-1 has bad reputation nowadays) with a big enough output size $n$ so that the number of individual strings hashed stays small compared to $2^{n/2}$.

But if the adversary can control a large enough part of your subset (including a subset of a set of around $n$ such strings is enough, and if he can inject those strings arbitrarily, even less are needed, but a bit more pre-calculation to choose them), he can choose those strings in a way that their XOR will give any value he wishes, which then translates to an arbitrary collision in the final "hash". Use the method highlighted in fgrieu's answer instead.

Is it a good idea to use bitwise XOR on a set of MD5 sums?

2 Answers2