Computing set difference between two large sets

Question

I have two large sets of integers $A$ and $B$. Each set has about a million entries, and each entry is a positive integer that is at most 10 digits long.

What is the best algorithm to compute $A\setminus B$ and $B\setminus A$? In other words, how can I efficiently compute the list of entries of $A$ that are not in $B$ and vice versa? What would be the best data structure to represent these two sets, to make these operations efficient?

The best approach I can come up with is to store these two sets as sorted lists and compare every element of $A$ against every element of $B$, in a linear fashion. Can we do better?

score 9 · Answer 1 · edited Nov 14 '13 at 13:47

If you are willing to store the sets in a specialized data structure, then you can get some interesting complexities.

Let $I=\min\left(|A|,|B|,|A\Delta B|\right)$.

Then you can perform the set operations $A\cup B$, $A\cap B$, $A\setminus B$ and $A\Delta B$, each in $\mathcal O\left(I\cdot\log\frac{|A|+|B|}{I}\right)$ expected time. So essentially, the cost is governed by the size of the smaller of the two sets or by the size of the symmetric difference, whichever is less. This is better than linear if the symmetric difference is small, i.e. if the sets have a large intersection. In fact, for the two set-difference operations you want, this is practically output-sensitive, since together they make up the symmetric difference.

See Confluently Persistent Sets and Maps by Olle Liljenzin (2013) for more information.

score 6 · Answer 2 · edited Feb 23 '18 at 06:47

A linear scan is the best that I know how to do, if the sets are represented as sorted linked lists. The running time is $O(|A| + |B|)$.

Note that you don't need to compare every element of $A$ against every element of $B$, pairwise. That would lead to a runtime of $O(|A| \times |B|)$, which is much worse. Instead, to compute the symmetric difference of these two sets, you can use a technique similar to the "merge" operation in mergesort, suitably modified to omit values that are common to both sets.

In more detail, you can build a recursive algorithm like the following to compute the symmetric difference $A \Delta B = (A \setminus B) \cup (B \setminus A)$, assuming $A$ and $B$ are represented as linked lists with their values in sorted order:

def difference(A, B):
    # Symmetric difference of two sorted, duplicate-free lists.
    if len(B) == 0:
        return A  # return the leftover list
    if len(A) == 0:
        return B  # return the leftover list
    if A[0] < B[0]:
        return [A[0]] + difference(A[1:], B)
    elif A[0] == B[0]:
        return difference(A[1:], B[1:])  # omit the common element
    else:
        return [B[0]] + difference(A, B[1:])

I've represented this in pseudo-Python. If you don't read Python, A[0] is the head of the linked list A, A[1:] is the rest of the list, and + represents concatenation of lists. For efficiency reasons, if you're working in Python, you probably wouldn't want to implement it exactly as above -- for instance, it might be better to use generators, to avoid building up many temporary lists -- but I wanted to show you the ideas in the simplest possible form. The purpose of this pseudo-code is just to illustrate the algorithm, not propose a concrete implementation.
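For what it's worth, here is a minimal iterative sketch of the same merge idea that keeps the two differences apart and returns both $A \setminus B$ and $B \setminus A$ in one pass. It assumes the inputs are sorted Python lists without duplicates; the function name `two_way_difference` is just something I made up for illustration.

```python
def two_way_difference(a, b):
    """Return (only_in_a, only_in_b) for two sorted, duplicate-free lists."""
    only_a, only_b = [], []
    i = j = 0
    # Walk both lists in lockstep, as in the merge step of mergesort.
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            only_a.append(a[i])   # a[i] cannot appear later in b
            i += 1
        elif a[i] > b[j]:
            only_b.append(b[j])   # b[j] cannot appear later in a
            j += 1
        else:
            i += 1                # common element: skip it in both lists
            j += 1
    # Whatever is left over in one list is missing from the other.
    only_a.extend(a[i:])
    only_b.extend(b[j:])
    return only_a, only_b


only_a, only_b = two_way_difference([1, 3, 5, 7], [3, 4, 7, 9])
print(only_a, only_b)
```

Each element is visited once, so the running time stays $O(|A| + |B|)$, and there is no recursion depth to worry about with a million entries.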

I don't think it's possible to do any better, if your sets are represented as sorted lists and you want the output to be provided as a sorted list. You fundamentally have to look at every element of $A$ and $B$. Informal sketch of justification: If there is any element that you haven't looked at, you can't output it, so the only case where you can omit looking at an element is if you know it is present in both $A$ and $B$, but how could you know that it is present if you haven't looked at its value?

smossen · Answer 3 · 2013-11-13T19:43:49.020

If A and B are of equal size, disjoint and interleaved (e.g. odd numbers in A and even numbers in B), then pairwise comparison of items in linear time is probably optimal.

If A and B contain blocks of items that are in exactly one of A or B, or in both of them, it is possible to compute set difference, union and intersection in sublinear time. As an example, if A and B differ in exactly one item, then the difference can be computed in O(log n).

http://arxiv.org/abs/1301.3388
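To give a feel for why shared blocks can be skipped, here is a toy sketch. It is not the data structure from the linked paper, only an illustration of the general idea of structural sharing. It assumes a binary trie over fixed-width keys whose nodes are interned (hash-consed), so two equal subsets are always the same object and can be skipped with a single identity check. All names (`BITS`, `make`, `insert`, `build`, `sym_diff`) are hypothetical.

```python
BITS = 34  # assumption: 10-digit keys fit, since 10**10 < 2**34

LEAF = ("leaf",)   # marker stored at depth 0 for "this key is present"
_table = {}        # interning table, so equal subtrees are one shared object


def make(left, right):
    """Return the unique interned node with these children (None if empty)."""
    if left is None and right is None:
        return None
    key = (id(left), id(right))  # ids are stable: interned nodes stay alive in _table
    if key not in _table:
        _table[key] = (left, right)
    return _table[key]


def insert(node, key, depth=BITS):
    """Return a trie containing everything in `node` plus `key`."""
    if depth == 0:
        return LEAF
    left, right = node if node is not None else (None, None)
    if (key >> (depth - 1)) & 1:
        return make(left, insert(right, key, depth - 1))
    return make(insert(left, key, depth - 1), right)


def build(keys):
    """Build a trie by inserting the keys one at a time."""
    root = None
    for k in keys:
        root = insert(root, k)
    return root


def sym_diff(a, b, depth=BITS, prefix=0, out=None):
    """Collect (key, in_a) pairs for keys in exactly one of the tries a, b."""
    if out is None:
        out = []
    if a is b:              # same shared subtree, or both empty: skip it entirely
        return out
    if depth == 0:          # exactly one side holds this key
        out.append((prefix, a is LEAF))
        return out
    al, ar = a if a is not None else (None, None)
    bl, br = b if b is not None else (None, None)
    sym_diff(al, bl, depth - 1, prefix << 1, out)
    sym_diff(ar, br, depth - 1, (prefix << 1) | 1, out)
    return out


A = build(range(1000))
B = build([k for k in range(1000) if k != 500] + [2000])
print(sym_diff(A, B))
```

In this sketch the comparison only descends along paths where the two tries actually differ, so the work is proportional to the number of differing keys times `BITS`, not to the size of the sets. The price is paid up front when building and interning the tries.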

score 2 · Answer 4 · answered Nov 14 '13 at 03:47

One option is to use bitvectors to represent the sets (where the $n$th position represents the presence or absence of item $n$). Set operations then reduce to bitwise operations, which digital computers perform quickly and on multiple bits in parallel. In this case $A \setminus B = a \wedge \overline{b}$, where $a, b$ are the bitvectors. The relative efficiency of this technique over other techniques also depends on the sparsity: for denser sets it may be more efficient than other approaches. Also, of course, the whole operation is embarrassingly parallel, so the set operations can be split across several processors.
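As a small illustration of the bitvector idea, here is a sketch that uses Python's arbitrary-precision integers as bitvectors. The helper names `to_bitvector` and `from_bitvector` are made up for this example. Note that with values up to 10 digits a full bitvector needs about $10^{10}$ bits (roughly 1.25 GB) per set, so this is only a toy demonstration on small values.

```python
def to_bitvector(items):
    """Pack non-negative integers into one int; bit n is set iff n is in the set."""
    bits = 0
    for x in items:
        bits |= 1 << x
    return bits


def from_bitvector(bits):
    """Unpack a bitvector back into a sorted list of its members."""
    out = []
    while bits:
        low = bits & -bits                 # isolate the lowest set bit
        out.append(low.bit_length() - 1)   # its position is the member's value
        bits ^= low                        # clear that bit
    return out


a = to_bitvector([1, 3, 5, 7])
b = to_bitvector([3, 4, 7, 9])

a_minus_b = a & ~b   # bits set in a but not in b
b_minus_a = b & ~a   # bits set in b but not in a
print(from_bitvector(a_minus_b), from_bitvector(b_minus_a))
```

The two `&` operations are where the word-level parallelism comes from; converting to and from lists is the slow part and would be avoided if the sets lived as bitvectors from the start.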
