
I have two endpoints, $a$ and $b$, that can communicate through a channel. $a$ stores a set of fixed-length strings $A = \{a_1, \ldots, a_{N_A}\}$, and $b$ stores another set of fixed-length strings $B = \{b_1, \ldots, b_{N_B}\}$, where in general $A \cap B \neq \emptyset$.

Now, I would like to run a procedure that, using the communication channel between $a$ and $b$, results in both $a$ and $b$ storing $A \cup B$.

Both $a$ and $b$ can run an arbitrary amount of precomputation (e.g., sorting their sets, indexing them, organizing their elements in a tree...) but I would like the procedure to be as efficient as possible in terms of communication complexity.

The case I am most interested in is the one where $N_A \simeq N_B \simeq N$ with $N$ large and $K = |(A \setminus B) \cup (B \setminus A)| \ll N$. In that case, having the endpoints send their whole sets to each other seems highly inefficient: if each string is $\ell$ bits long, the naive exchange costs about $2N\ell$ bits, even though only the $K$ differing elements actually need to move.

Matteo Monti

1 Answer


You can do this with $O(K \log N)$ bits of communication on average, where $K = |(A\setminus B) \cup (B\setminus A)|$, assuming you are willing to use an interactive protocol and to accept randomization.

Let $H(\cdot)$ denote a randomized hash function, and suppose both endpoints keep their sets in sorted order (part of the allowed precomputation). Do binary search on $i$ to find the smallest $i$ such that $H(a_1,\dots,a_i) \ne H(b_1,\dots,b_i)$, i.e., (ignoring hash collisions) such that $a_i \ne b_i$. Then you know that $a_1=b_1$, ..., $a_{i-1}=b_{i-1}$ and $a_i \ne b_i$, so $a$ can send $a_i$ to $b$. You can use an $O(\log N)$-bit hash function, and the probability of a hash collision can be made exponentially small. The binary search takes $O(\log N)$ iterations, and each iteration sends an $O(\log N)$-bit hash, so in all you send $O((\log N)^2)$ bits. At that point one of the $K$ differing elements has been sent from $a$ to $b$. Now repeat on the remaining elements (after discarding $a_1,\dots,a_i$ from both endpoints' sets). You repeat at most $K$ times, so this gives a protocol that sends $O(K (\log N)^2)$ bits.
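To make the round structure concrete, here is a minimal single-process sketch of this first protocol. The helper names (`prefix_hash`, `find_first_mismatch`, `reconcile`) are mine, not part of the protocol; a real deployment would run the two sides over the channel and count the hashes actually exchanged. One liberty taken: instead of discarding a prefix each round, the sketch has both sides reveal their $i$-th element and re-run the search from scratch, which is simpler and still fixes at least one missing element per round.

```python
import hashlib

def prefix_hash(sorted_strings, i, salt):
    """Randomized hash of the first i elements of a sorted list; the shared
    salt plays the role of the hash function's random seed."""
    h = hashlib.sha256(salt)
    for s in sorted_strings[:i]:
        h.update(s)
    return h.digest()[:8]  # a short 64-bit hash stands in for O(log N) bits

def find_first_mismatch(xs, ys, salt):
    """Binary search for the smallest i with H(x_1..x_i) != H(y_1..y_i).
    Returns None when the prefix hashes never differ (sets look equal)."""
    n = max(len(xs), len(ys))
    if prefix_hash(xs, n, salt) == prefix_hash(ys, n, salt):
        return None
    lo, hi = 1, n  # invariant: prefixes agree before lo, hashes differ at hi
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_hash(xs, mid, salt) == prefix_hash(ys, mid, salt):
            lo = mid + 1
        else:
            hi = mid
    return lo

def reconcile(A, B, salt=b"shared-seed"):
    """Drive the protocol until both (simulated) endpoints hold A | B."""
    xs, ys = sorted(A), sorted(B)
    while True:
        i = find_first_mismatch(xs, ys, salt)
        if i is None:
            return set(xs)  # equals set(ys), up to hash collisions
        # Each side reveals its i-th element (if it has one) and the other
        # adopts it; the smaller of the two is always missing from the
        # other side, so at least one set grows every round.
        x_i = xs[i - 1] if i <= len(xs) else None
        y_i = ys[i - 1] if i <= len(ys) else None
        if x_i is not None:
            ys = sorted(set(ys) | {x_i})
        if y_i is not None:
            xs = sorted(set(xs) | {y_i})

A = {b"apple", b"grape", b"lemon", b"mango"}
B = {b"apple", b"grape", b"mango", b"peach"}
print(sorted(reconcile(A, B)))  # the five-element union, on both sides
```

Since each round repairs at least one element of the symmetric difference, the loop runs at most $K$ times, matching the $O(K (\log N)^2)$ bound above.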

You can reduce this to $O(K \log N)$ bits by using shorter hashes. First do binary search using a 1-bit hash, to find the smallest $i$ such that $H(a_1,\dots,a_i)_1 \ne H(b_1,\dots,b_i)_1$, where $H(\cdot)_\ell$ denotes the first $\ell$ bits of the hash $H(\cdot)$. This takes $O(\log N)$ iterations. Then, by backtracking, find the smallest $i$ such that $H(a_1,\dots,a_i)_2 \ne H(b_1,\dots,b_i)_2$, using on average $O(1)$ iterations: a position past the true first mismatch passes an $\ell$-bit hash comparison only with probability $2^{-\ell}$, so each stage starts only $O(1)$ positions too far to the right on average. Then find the smallest $i$ such that $H(a_1,\dots,a_i)_4 \ne H(b_1,\dots,b_i)_4$, and so on, doubling the length of the hash at each stage. Each stage uses on average $O(1)$ iterations, and summing the series of bits sent over all stages gives $O(\log N)$ bits per element found.
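Hedging again: the following is only a rough sketch of the doubling trick, reusing `prefix_hash` from the previous snippet; `truncated_hash` and `find_first_mismatch_cheap` are hypothetical names, and the 64-bit cap stands in for the $O(\log N)$-bit hash. On the wire, each `truncated_hash` comparison is one round in which the endpoints exchange `nbits`-bit hashes.

```python
def truncated_hash(sorted_strings, i, salt, nbits):
    """First nbits bits of the 64-bit prefix hash, as an integer."""
    full = int.from_bytes(prefix_hash(sorted_strings, i, salt), "big")
    return full >> (64 - nbits)

def find_first_mismatch_cheap(xs, ys, salt):
    n = max(len(xs), len(ys))
    if prefix_hash(xs, n, salt) == prefix_hash(ys, n, salt):
        return None  # sets look identical
    # Stage 1: binary search exchanging only 1-bit hashes. Positions past
    # the true first mismatch can agree by chance, so this may overshoot,
    # but it never lands left of the true first mismatch.
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if truncated_hash(xs, mid, salt, 1) == truncated_hash(ys, mid, salt, 1):
            lo = mid + 1
        else:
            hi = mid
    cand = lo
    # Later stages: double the hash length and backtrack linearly. A longer
    # hash differs wherever a shorter one does, so cand remains a valid
    # upper bound; per the analysis above, each stage moves it only O(1)
    # positions left on average.
    nbits = 2
    while nbits <= 64:
        while cand > 1 and (truncated_hash(xs, cand - 1, salt, nbits)
                            != truncated_hash(ys, cand - 1, salt, nbits)):
            cand -= 1
        nbits *= 2
    return cand
```

Because it keeps the same signature and return convention, the `reconcile` driver from the previous sketch can call `find_first_mismatch_cheap` unchanged.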

Thus, we get an improved protocol that requires sending $O(K \log N)$ bits on average. It does require interaction between $a$ and $b$ to enable the binary search, and it is randomized.

I don't know whether there is a non-interactive protocol with similar complexity, or whether the protocol can be derandomized to achieve $O(K \log N)$ bits in the worst case rather than in the average case (or rather than an exponentially small probability of sending more than $O(K \log N)$ bits).

D.W.