6

The problem is (integer) duplicate removal, which can also be perceived as producing the image of an evaluated function (of integers):

Given a sequence $S_\text{in}$ of $n$ integers, produce a sequence $S_\text{out}$ of elements such that any element in $S_\text{in}$ also appears in $S_\text{out}$, and all elements in $S_\text{out}$ are distinct.

Ignoring some details regarding the elements' size in bits, this can be done by sorting in $O(n \log n)$ time; or by hashing in expected time $O(n)$.

Given that we don't require $S_\text{out}$ to be sorted - is it possible to do better than sorting in the worst case?

Notes:

  • Some dependence on the output size (let's call it $m$) rather than $n$ is an improvement, but I'm mostly interested in getting closer to linearity in $n$.
  • The details I've ignored may not be so insignificant.
  • Algorithms need not be restricted to algebraic computation, i.e. you can tear into the bit representation if it helps.
  • We cannot make any assumptions regarding the input distribution.
einpoklum
  • 1,025
  • 6
  • 19

3 Answers3

6

It is a classical result that the element distinctness problem requires $\Omega(n\log n)$ comparisons in the comparison model (the one used to analyze sorting algorithms); in fact, it also requires $\Omega(n\log n)$ time in stronger models such as algebraic decision trees, in which we are allowed to compute the sign of arbitrary bounded degree polynomials (rather than just $x_i - x_j$).

The element distinctness problem asks whether all elements in the input are distinct. This is clearly easier than your problem, since you can just compare the sizes of the input and output arrays. Hence in these computation models, you cannot do asymptotically better than $n\log n$.

Yuval Filmus
  • 280,205
  • 27
  • 317
  • 514
1

You can do $O(n \log m)$ if you use a (balanced) binary search tree as a set.

I don't know about a linear-time algorithm to solve this problem, perhaps there is a reduction to prove this is impossible?

ATOMP
  • 276
  • 1
  • 4
-1

If every element of the array ($a$) is in the range of $0$ to $k$, you can have $\theta(n)$ runtime when $k=O(n)$. All you have to do is to maintain an array ($c$) of $k$ elements initialized to $0$ and set $c[a[i]] = 1$ whenever $a[i]$ is encountered to be 0. At the same time you should write the corresponding element ($a[i]$) in an array $b$ only if the $c[a[i]] =0$.

Mr. Sigma.
  • 1,301
  • 1
  • 16
  • 38