Can we remove duplicates faster than we can sort?

Question

The problem is (integer) duplicate removal, which can also be perceived as producing the image of an evaluated function (of integers):

Given a sequence $S_\text{in}$ of $n$ integers, produce a sequence $S_\text{out}$ of elements such that any element in $S_\text{in}$ also appears in $S_\text{out}$, and all elements in $S_\text{out}$ are distinct.

Ignoring some details regarding the elements' size in bits, this can be done by sorting in $O(n \log n)$ time; or by hashing in expected time $O(n)$.
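To make the two baselines concrete, here is a minimal sketch of both. The function names `dedup_by_sorting` and `dedup_by_hashing` are illustrative choices, not from any library, and the hashing version relies on Python's built-in `set`, whose operations are constant time only in expectation.

```python
def dedup_by_sorting(seq):
    """O(n log n) worst case: sort, then keep the first element of each run."""
    out = []
    for x in sorted(seq):
        # After sorting, equal elements are adjacent, so comparing with the
        # last emitted element is enough to detect a duplicate.
        if not out or out[-1] != x:
            out.append(x)
    return out


def dedup_by_hashing(seq):
    """Expected O(n): remember seen elements in a hash set."""
    seen = set()
    out = []
    for x in seq:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out
```

The sorting version returns the distinct elements in increasing order, while the hashing version keeps them in order of first appearance; the problem statement allows either.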

Given that we don't require $S_\text{out}$ to be sorted, is it possible to do better than sorting in the worst case?

Notes:

Some dependence on the output size (let's call it $m$) rather than $n$ is an improvement, but I'm mostly interested in getting closer to linearity in $n$.
The details I've ignored may not be so insignificant.
Algorithms need not be restricted to algebraic computation, i.e. you can tear into the bit representation if it helps.
We cannot make any assumptions regarding the input distribution.

score 6 · Answer 1 · answered Dec 01 '18 at 02:56

It is a classical result that the element distinctness problem requires $\Omega(n\log n)$ comparisons in the comparison model (the one used to analyze sorting algorithms). In fact, it also requires $\Omega(n\log n)$ time in stronger models such as algebraic decision trees, in which we are allowed to compute the sign of arbitrary bounded-degree polynomials (rather than just $x_i - x_j$).

The element distinctness problem asks whether all elements in the input are distinct. This is no harder than your problem: run duplicate removal and compare the sizes of the input and output arrays, which are equal exactly when the input had no duplicates. Hence in these computation models, you cannot do asymptotically better than $n\log n$.
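The reduction can be sketched in a few lines. Here `remove_duplicates` is only a placeholder for whatever duplicate-removal algorithm is being considered, and the sketch assumes its output contains only elements of the input.

```python
def remove_duplicates(seq):
    # Placeholder for any duplicate-removal algorithm; this one keeps the
    # first occurrence of each element via insertion-ordered dict keys.
    return list(dict.fromkeys(seq))


def all_distinct(seq, dedup=remove_duplicates):
    """Element distinctness via one dedup call plus O(1) extra work."""
    # The output is as long as the input exactly when nothing was removed,
    # i.e. when the input contained no repeated element.
    return len(dedup(seq)) == len(seq)
```

Since `all_distinct` adds only a length comparison on top of `dedup`, any lower bound for element distinctness in a given model carries over to duplicate removal in that model.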

ATOMP · Answer 2 · 2018-12-01T00:12:13.940

1

You can do $O(n \log m)$ if you use a (balanced) binary search tree as a set.
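As a rough illustration, the sketch below uses a hand-written AVL tree as the balanced search tree, since Python's standard library does not ship one; the names `Node`, `_insert`, and `dedup_bst` are made up for this example. The tree never holds more than $m$ keys, so each of the $n$ insertions costs $O(\log m)$ comparisons.

```python
class Node:
    __slots__ = ("key", "left", "right", "height")

    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1


def _h(node):
    return node.height if node else 0


def _update(node):
    node.height = 1 + max(_h(node.left), _h(node.right))


def _rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    _update(y)
    _update(x)
    return x


def _rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    _update(x)
    _update(y)
    return y


def _insert(node, key):
    """Insert key into the subtree; return (new subtree root, was_inserted)."""
    if node is None:
        return Node(key), True
    if key == node.key:
        return node, False  # duplicate: tree is unchanged
    if key < node.key:
        node.left, inserted = _insert(node.left, key)
    else:
        node.right, inserted = _insert(node.right, key)
    if not inserted:
        return node, False
    _update(node)
    balance = _h(node.left) - _h(node.right)
    if balance > 1:  # left-heavy
        if _h(node.left.left) < _h(node.left.right):
            node.left = _rotate_left(node.left)
        node = _rotate_right(node)
    elif balance < -1:  # right-heavy
        if _h(node.right.right) < _h(node.right.left):
            node.right = _rotate_right(node.right)
        node = _rotate_left(node)
    return node, True


def dedup_bst(seq):
    """Emit each element the first time it is inserted into the tree."""
    root, out = None, []
    for x in seq:
        root, inserted = _insert(root, x)
        if inserted:
            out.append(x)
    return out
```

In a language with a built-in ordered set, that container could play the same role as the hand-written tree.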

I don't know of a linear-time algorithm for this problem. Perhaps there is a reduction that proves it is impossible?

edited Dec 01 '18 at 00:12

answered Nov 30 '18 at 23:59


Mr. Sigma. · Answer 3 · 2018-12-01T04:24:02.067

-1

If every element of the array $a$ lies in the range $0$ to $k$, you can achieve $\Theta(n)$ running time when $k=O(n)$. Maintain an array $c$ of $k+1$ entries initialized to $0$. For each $a[i]$, check $c[a[i]]$: if it is $0$, append $a[i]$ to an output array $b$ and set $c[a[i]] = 1$; otherwise skip the element, since it has already been written.
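A minimal sketch of this idea, assuming all inputs are integers in $[0, k]$; the function name `dedup_bounded` and the range check are additions for illustration. The flag array has $k+1$ entries so that the value $k$ itself is a valid index.

```python
def dedup_bounded(a, k):
    """Remove duplicates from integers in [0, k] in O(n + k) time."""
    c = [0] * (k + 1)  # c[v] == 1 once value v has been written to b
    b = []
    for x in a:
        if not 0 <= x <= k:
            raise ValueError(f"element {x} is outside the range [0, {k}]")
        if c[x] == 0:
            c[x] = 1
            b.append(x)
    return b
```

For example, `dedup_bounded([2, 0, 2, 5, 0], 5)` returns `[2, 0, 5]`. The $O(k)$ cost of initializing `c` is why the bound is linear in $n$ only when $k = O(n)$.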

edited Dec 01 '18 at 04:24

answered Dec 01 '18 at 00:51

