I have designed an SQL aggregate function in Oracle that bitwise XORs all MD5 sums of the values stored in a column.
For example, if my table is:
+-----+----------+---------+
| Key | Sequence | Value |
+-----+----------+---------+
| 1 | 1 | 'Hello' |
| 1 | 2 | 'World' |
| 2 | 1 | '1234' |
| 3 | 0 | (empty) |
| 4 | 1 | 'Hello' |
| 4 | 3 | 'World' |
+-----+----------+---------+
I can run the following query in Oracle:
with t AS (select 1 key, 1 sequence, 'Hello' value FROM dual
union all select 1, 2, 'World' from dual
union all select 2, 1, '1234' from dual
union all select 3, 0, '' from dual /* ... */
)
select key, md5_agg(value) from t group by key
and get (unfortunately aggregate functions in Oracle ignore NULL values and '' is considered as NULL)
+---+----------------------------------+
|key| md5_agg(value) |
+---+----------------------------------+
| 1 | 7EBD0B1DA67F965F802D31DF25C4B321 |
| 2 | 81DC9BDB52D04DC20036DBD8313ED055 |
| 3 | 00000000000000000000000000000000 |
| 4 | 7EBD0B1DA67F965F802D31DF25C4B321 |
+---+----------------------------------+
I would like to use this approach to compare if the contents of some columns are equal when I compare subsets of the same table (think of finding duplicates in a complex structures that spans over multiple rows in the same table). Here with this results I know that I have the same subsets for keys 1 and 4.
What are the limits of such an approach? Here are the ones I could list:
- This is interesting only if my column contains distinct values. If my columns contains twice the same string, the
xoroperation will be a no-op. - Due to Oracle limitations, if my column contains empty values, they do not count.
With those limitations in mind, is it still possible to infer, from two equal md5_agg results computed from distinct and non-empty values, that the original values make up the same sets?
In order to reformulate, are there odds that the MD5 sums of distinct strings XOR to 0?