Consider the following problem:

Input: two arrays $A$ and $B$ of length $n$, where $B$ is in sorted order.

Query: do $A$ and $B$ contain the same items (with their multiplicity)?

What is the fastest deterministic algorithm for this problem?
Can it be solved faster than sorting them? Can this problem be solved in deterministic linear time?

Albert Hendriks

2 Answers

You haven't specified your computation model, so I will assume the comparison model.

Consider the special case in which the array $B$ is taken from the list $$ \{1,2\} \times \{3,4\} \times \cdots \times \{2n-1,2n\}. $$ In words, the $i$th element is either $2i-1$ or $2i$.

I claim that if the algorithm concludes that $A$ and $B$ contain the same elements, then it has compared each element of $B$ to its counterpart in $A$. Indeed, suppose that the algorithm concludes that $A$ and $B$ contain the same elements, but never compares the first element of $B$ to its counterpart in $A$. If we switch the first element of $B$ to the other value in $\{1,2\}$, then the algorithm would proceed in exactly the same way, even though the answer is different. This shows that the algorithm must compare the first element (and, by the same argument, every other element) to its counterpart in $A$.

This means that if $A$ and $B$ contain the same elements, then after verifying this the algorithm knows the sorted order of $A$. Hence its decision tree must have at least $n!$ different leaves, one for each possible ordering of $A$, and so it takes time $\Omega(n\log n)$ in the worst case.
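
To spell out the last step (a standard calculation, added here for completeness): a comparison-based algorithm corresponds to a binary decision tree, so a tree with at least $n!$ leaves has depth $d$ satisfying $2^d \geq n!$, and therefore $$ d \geq \log_2 n! = \sum_{k=1}^n \log_2 k \geq \sum_{k=\lceil n/2 \rceil}^{n} \log_2 k \geq \frac{n}{2} \log_2 \frac{n}{2} = \Omega(n\log n). $$ The middle inequality just drops the smaller half of the terms; each of the at least $n/2$ remaining terms is at least $\log_2 (n/2)$.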

Yuval Filmus

This answer considers a different model of computation: the unit-cost RAM model. In this model, machine words have size $O(\log n)$, and operations on them take $O(1)$ time. We also assume for simplicity that each array element fits in one machine word (and so is at most $n^{O(1)}$ in magnitude).

We will construct a linear time randomized algorithm with one-sided error (the algorithm might declare the two arrays to contain the same elements even if this is not the case) for the more difficult problem of determining whether two arrays $a_1,\ldots,a_n$ and $b_1,\ldots,b_n$ contain the same elements. (We don't require any of them to be sorted.) Our algorithm will make an error with probability at most $1/n$.

The idea is that the following identity holds iff the arrays contain the same elements: $$ \prod_{i=1}^n (x-a_i) = \prod_{i=1}^n (x-b_i). $$ Computing these polynomials exactly will take too much time. Instead, we choose a random prime $p$ and a random $x_0$ and test whether $$ \prod_{i=1}^n (x_0-a_i) \equiv \prod_{i=1}^n (x_0-b_i) \pmod{p}. $$ If the arrays are equal, the test will always pass, so let's concentrate on the cases in which the arrays are different. In particular, some coefficient of $\prod_{i=1}^n (x-a_i) - \prod_{i=1}^n (x-b_i)$ is non-zero. Since $a_i,b_i$ have magnitude $n^{O(1)}$, this coefficient has magnitude at most $2^n n^{O(n)} = n^{O(n)}$, and so it has at most $O(n)$ prime factors of size $\Omega(n)$. This means that if we choose a set of at least $n^2$ primes $p$ of size at least $n^2$ (say), then for a random prime $p$ of this set it will hold with probability at least $1-O(1/n)$ that $$ \prod_{i=1}^n (x-a_i) - \prod_{i=1}^n (x-b_i) \not\equiv 0 \pmod{p}. $$ A random $x_0$ modulo $p$ will witness this with probability at least $1-n/p \geq 1-1/n$ (since a non-zero polynomial of degree at most $n$ has at most $n$ roots modulo $p$).

In conclusion, if we choose a random $p$ of size roughly $n^2$ among a set of at least $n^2$ different primes, and a random $x_0$ modulo $p$, then when the arrays don't contain the same elements, our test will fail with probability $1-O(1/n)$. Running the test takes time $O(n)$ since $p$ fits into a constant number of machine words.
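
As an illustration, the test itself could be sketched in Python as follows. The function name `same_multiset_fingerprint` and the fixed prime in the usage lines are my own choices, not part of the argument above; the sketch assumes the caller supplies a prime $p$ and a random $x_0$ chosen as described.

```python
import random


def same_multiset_fingerprint(a, b, p, x0):
    """Compare prod (x0 - a_i) and prod (x0 - b_i) modulo p.

    Always returns True when a and b are equal as multisets.
    May (rarely) return True when they differ: one-sided error.
    The name and signature are illustrative assumptions.
    """
    if len(a) != len(b):
        return False
    fa = fb = 1
    for ai, bi in zip(a, b):
        # One multiplication and one reduction per element: O(n) in total.
        fa = fa * (x0 - ai) % p
        fb = fb * (x0 - bi) % p
    return fa == fb


# Usage sketch. The Mersenne prime 2**31 - 1 is used only for illustration;
# the argument above picks p at random from a large set of primes.
p = 2**31 - 1
x0 = random.randrange(p)
print(same_multiset_fingerprint([3, 1, 2, 2], [1, 2, 2, 3], p, x0))  # True
```

Every intermediate value stays below $p$, which is what keeps each step within a constant number of machine words in the RAM model.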

Using polynomial time primality testing and since the density of primes of size roughly $n^2$ is $\Omega(1/\log n)$, we can choose a random prime $p$ in time $(\log n)^{O(1)}$. Choosing a random $x_0$ modulo $p$ can be implemented in various ways, and is made easier since in our case we don't need a completely uniform random $x_0$.
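
One possible way to sample the prime is rejection sampling with a Miller-Rabin test, sketched below. Several things here are assumptions of mine rather than part of the answer: the helper names `is_prime` and `random_prime`, the sampling interval $[n^2, n^3]$ (chosen so that it contains more than $n^2$ primes for large $n$), and the use of the first twelve primes as Miller-Rabin bases, which is known to be deterministic for inputs well beyond 64 bits.

```python
import random

# Fixed witness set; assumed sufficient for the 64-bit-sized inputs used here.
_BASES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)


def is_prime(m):
    """Miller-Rabin primality test with a fixed set of bases."""
    if m < 2:
        return False
    for q in _BASES:
        if m % q == 0:
            return m == q
    # Write m - 1 = d * 2**s with d odd.
    d, s = m - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in _BASES:
        x = pow(a, d, m)
        if x in (1, m - 1):
            continue
        for _ in range(s - 1):
            x = x * x % m
            if x == m - 1:
                break
        else:
            return False  # a is a witness that m is composite
    return True


def random_prime(n, rng=random):
    """Return a uniformly random prime in [n**2, n**3] (hypothetical range)."""
    n = max(n, 2)  # guarantees the interval contains a prime
    lo, hi = n * n, n ** 3
    while True:
        candidate = rng.randint(lo, hi)
        if is_prime(candidate):
            return candidate
```

Since primes in this range have density $\Omega(1/\log n)$, the loop in `random_prime` is expected to run $O(\log n)$ times.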

In conclusion, our algorithm runs in time $O(n)$, always outputs YES if the arrays contain the same elements, and outputs NO with probability $1-O(1/n)$ if the arrays don't contain the same elements. We can reduce the error probability to $O(1/n^C)$ for any constant $C$.

Yuval Filmus