
Context:

When trying to tame real-world datasets that contain outliers and noise, the interquartile mean is a handy tool: you sort the data, throw away the top and bottom 25% of the values, and take the mean of what's left. (Of course, you can choose partition points other than the top and bottom 25%.)
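
For concreteness, here's the naive version of this in Python; the full sort is exactly the part I'd like to avoid:

    def interquartile_mean(values):
        s = sorted(values)                  # O(N log N), the full sort
        n = len(s)
        middle = s[n // 4 : 3 * n // 4]     # drop the bottom and top quarters
        return sum(middle) / len(middle)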

Which led me to wonder: is there any efficiency to be gained by only partially sorting the array? That is, say we define three groups: A is the low quartile, B is the middle half, and C is the high quartile. We don't care whether A or C is sorted, since we're going to discard them, and we don't care whether B is sorted, since we're only going to take the mean of its values. It's sufficient that the data is partitioned into those three groups.

The question:

  • Is there a "partial sorting" algorithm, more efficient than a full sort, that will yield those three groups?
  • Are there additional savings if the array length is always a power of 2 (assume N >= 4)?
  • What if you want partition boundaries other than quartiles? Does that make it less efficient?

Update

I've added "partitioning" to the title, since (I now know) that's the correct term for what this question is about. Thanks to everyone for the good answers!

fearless_fool
  • 445
  • 5
  • 12

5 Answers


The quickselect algorithm can return the $k$-th smallest value of an unordered array in average linear time. It can be "improved" (though not by much in practice) using the median of medians to guarantee worst-case linear time.

Using that, you can quickselect the $\frac{N}{4}$-th, $\frac{N}{2}$-th and $\frac{3N}{4}$-th values. The algorithm partitions the array into the four desired parts as it goes. All of this can be done in linear time, which is optimal, since you need to examine each element at least once.

As long as you use a constant number of split points, you can use values other than quartiles (deciles, for example).
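
For example, here is a minimal sketch in Python; it assumes NumPy is available, whose np.partition is introselect-based (a quickselect variant) and accepts several indices at once:

    import numpy as np

    def interquartile_mean(values):
        a = np.asarray(values, dtype=float)
        n = a.size
        lo, hi = n // 4, 3 * n // 4          # quartile boundary indices
        a = np.partition(a, (lo, hi))        # a[lo] and a[hi] land in sorted position
        return a[lo:hi].mean()               # mean of the (unordered) middle half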

Nathaniel
  • 18,309
  • 2
  • 30
  • 58

There are a couple of algorithms which are useful for this particular problem. Although they are usually described as selection algorithms, which compute the $k^{th}$ order statistic of an unordered dataset, they can also be used for in-place partitioning of the dataset: a sequence $V_1, V_2, \ldots, V_n$ is partitioned at $k$ if $\forall i<k, V_i\le V_k$ and $\forall i>k, V_i\ge V_k$.

In other words, $V_k$ is at the position it would be if $V$ were fully sorted, and no preceding element is greater. This is sometimes called an "unordered partial sort", but I think "partition" describes it better.

Unlike sorting, this operation can be performed in $O(n)$ time and $O(1)$ space, although the most common algorithm only promises expected $O(n)$ time. (In theory, the worst case is $O(n^2)$, but superlinear times are rare with random data, and the guaranteed-$O(n)$ algorithm has a lot of overhead.) The best-known instance of such an algorithm is C.A.R. Hoare's quickselect, but there are many other possibilities.

For computing a truncated subsequence, such as the interquartile range, you can gain a bit of extra efficiency by doing both partitions at the same time. Of course, this doesn't improve the asymptotic efficiency, but the real-world execution time is less.

Quickselect can be thought of as the beginning of a quicksort (and it's not a coincidence that Hoare also developed the quicksort algorithm). It avoids the $\log n$ factor by only recursing on one side of the pivot. (Or, equivalently, looping, which demonstrates the $O(1)$ space bound.) Since the only outcome required is that the vector be partitioned at $V_k$, it's only necessary to process one side of the selected pivot. This should be reasonably intuitive: after partitioning, all the elements preceding the pivot are no greater than it, and all the elements following it are at least as great. So if the pivot ends up at position $p$, then when $p = k$ we're done; when $p > k$ the $k^{th}$ order statistic must lie before the pivot, and when $p < k$ it must lie after it. Either way, the other side cannot possibly contain the $k^{th}$ order statistic, and thus there is no need to examine it further.

As with quicksort, selecting a good pivot --or at least, not selecting a bad pivot-- is crucial. If you somehow managed to always select a pivot whose value is near the minimum or maximum of the dataset, you'd end up with quadratic time. (But since all recursive calls are tail calls, even in pathological cases, the space requirement continues to be constant.) In practice, you can avoid this problem by selecting a random pivot and using a partitioning algorithm which is tolerant of datasets with large numbers of repeated values.
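
Here's a minimal sketch of that combination in Python: a random pivot plus a three-way "fat" partition, so that runs of equal values all land in the pivot block, and a loop standing in for the tail call:

    import random

    def partition3(a, lo, hi):
        """Three-way partition of a[lo..hi] around a randomly chosen pivot.
        Returns (lt, gt) such that a[lo:lt] < pivot, a[lt:gt+1] == pivot,
        and a[gt+1:hi+1] > pivot. Equal values collect in the middle block,
        so heavily repeated values don't degrade performance."""
        pivot = a[random.randint(lo, hi)]
        lt, i, gt = lo, lo, hi
        while i <= gt:
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        return lt, gt

    def quickselect(a, k, lo=0, hi=None):
        """Leave a[k] holding the k-th smallest value of a[lo..hi], with the
        range partitioned around it. Expected O(n) time; the loop replaces
        the tail-recursive call, so space stays O(1)."""
        if hi is None:
            hi = len(a) - 1
        while True:
            lt, gt = partition3(a, lo, hi)
            if k < lt:
                hi = lt - 1        # the k-th smallest is left of the pivot block
            elif k > gt:
                lo = gt + 1        # ... or right of it
            else:
                return a[k]        # k falls inside the pivot block: done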

To compute the interquartile range, you might start by using the selection algorithm to find $V_{n/4}$ and then call the algorithm again on the upper three-quarters of $V$ to find $V_{3n/4}$. But it's evident that this involves quite a bit of duplication of work at the beginning. A better solution is to partition the array until a pivot is discovered which is inside the interquartile range. Once the array is partitioned at that point, quickselect can be called independently on the two sides of the pivot, to select the first quartile on the left and the third quartile on the right.
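
And a sketch of that two-sided strategy, reusing partition3 and quickselect from the sketch above:

    def select_range(a, k1, k2, lo=0, hi=None):
        """Partition a so that positions k1 and k2 (k1 <= k2) both hold their
        sorted values: keep partitioning until a pivot block touches [k1, k2],
        then finish each side independently."""
        if hi is None:
            hi = len(a) - 1
        while True:
            lt, gt = partition3(a, lo, hi)
            if gt < k1:
                lo = gt + 1          # pivot block entirely below the range
            elif lt > k2:
                hi = lt - 1          # pivot block entirely above the range
            else:
                if k1 < lt:
                    quickselect(a, k1, lo, lt - 1)   # first quartile, left side
                if k2 > gt:
                    quickselect(a, k2, gt + 1, hi)   # third quartile, right side
                return

After select_range(a, n // 4, 3 * n // 4), the interquartile values sit, unordered, in a[n // 4 : 3 * n // 4].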

Quickselect is probably the best solution for the interquartile range, but if the desired subsequence endpoints are closer to the ends of the array --for example, if you want to find the first and ninth decile-- then you might consider using a partial heapsort. This algorithm partitions at $V_k$ in worst-case $O(n \log k)$ (but on average, the time complexity is close to $O(n)$), as follows:

  1. Use heapify to turn the first $k$ elements of $V$ into a max-heap.
  2. For each of the remaining $n-k$ elements:
    • If the element is less than the heap's maximum, swap it with the heap's maximum and repair the heap.
    • Otherwise, continue.

If you build the heap backwards, so that the maximum is at position $k$, then you'll end up with $V$ correctly partitioned. (Otherwise, you'll have to swap $V_1$ and $V_k$.)
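
A minimal sketch using Python's standard heapq module, which provides a min-heap, so values are negated to get the max-heap described above; for simplicity this returns the $k$ smallest values rather than partitioning in place:

    import heapq

    def smallest_k(values, k):
        """Worst-case O(n log k): the k smallest elements, unordered."""
        heap = [-v for v in values[:k]]
        heapq.heapify(heap)                  # max-heap of the first k, via negation
        for v in values[k:]:
            if v < -heap[0]:                 # smaller than the heap's maximum?
                heapq.heapreplace(heap, -v)  # swap it in and repair the heap
        return [-v for v in heap]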

Repairing the heap at step 2 takes $O(\log k)$ time, but note that it doesn't always happen. In fact, assuming the dataset is randomly ordered, replacements happen less and less frequently as you proceed, since the heap's range shrinks each time an element is replaced, making it less likely that a new element will need to be inserted.

Again, this algorithm can be adapted to find a range, by building a reversed max-heap at the beginning of the array and a forward min-heap at the end of the array, and then scanning the elements in between. This doesn't help as much as it does with quickselect, but it still helps a bit.

If you're worried about the possibility of a pathologically sorted input --not so much of a worry as it would be for quicksort, since the worst case here is log-linear rather than quadratic-- you can shuffle the input as you go by swapping each element with a randomly selected element from the unprocessed segment, in effect a single step of the Fisher-Yates shuffle algorithm. Although that makes worst-case performance extremely unlikely, it does have a practical cost, because reading a large array in random order is much less cache-friendly than reading the array sequentially.
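
A sketch of that incremental shuffle in Python, one Fisher-Yates step per element consumed:

    import random

    def visit_in_random_order(a):
        """Yield the elements of a in uniformly random order, shuffling in
        place as we go: one Fisher-Yates step per element. Trades cache
        locality for protection against adversarial orderings."""
        for i in range(len(a)):
            j = random.randrange(i, len(a))   # random pick from the unprocessed tail
            a[i], a[j] = a[j], a[i]
            yield a[i]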

rici
  • 12,150
  • 22
  • 40

Is there a "partial sorting" algorithm, more efficient than a full sort, that will yield those three groups?

Since you're talking about "real-life" scenarios rather than theoretical asymptotics, here's something you could do:

  • Take a number of uniformly and independently sampled values from the array, and use them to estimate the distribution of values, or at least the values of your first and third quartiles. The exact number of samples depends on the probability-and-accuracy combination you're interested in, and on what you know about the size of the value space; but let's not go into the statistical details here. Suffice it to say that a sample lands in a given quarter with probability 1/4, independently of other samples, so you can apply large-deviation bounds. (There's a sketch of the whole scheme below.)

  • Now that you have a likely-decent estimate of the quartiles, perform your filtering. With very high probability, you now have a little more or a little less than 50% of the elements, ranging from around the 1st quartile to around the 3rd quartile.

  • You can tell how far you are from the actual quartiles by keeping count of how many elements fall into each of the 3 sets you've formed.

This is usually enough for your real-life needs, as the choice of quartiles is typically arbitrary, i.e. you usually want "the stuff that's around the middle", and the above gives you this. The overall time complexity will be linear in the length of the array.
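
Here's the sketch of the whole scheme in Python; the default sample size of 1,000 is just a placeholder for whatever your probability-and-accuracy analysis dictates:

    import random

    def approximate_iqm(values, sample_size=1000):
        """Estimate the quartiles from a random sample, then take the mean
        of everything between them in one sequential pass."""
        sample = sorted(random.sample(values, min(sample_size, len(values))))
        q1 = sample[len(sample) // 4]          # estimated 1st quartile
        q3 = sample[3 * len(sample) // 4]      # estimated 3rd quartile
        total, count = 0.0, 0
        for v in values:                       # single cache-friendly pass
            if q1 <= v <= q3:
                total += v
                count += 1
        return total / count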

... Now, you might be asking: why did I even bother suggesting this solution? You've already gotten suggestions of linear-time solutions (even if only expected linear time), right?

Well, the answer is that this is efficient in real life. Your $O(\cdot)$ constant will be extremely low in terms of abstract computational operations; and your memory access pattern (after the initial sampling) will be ideal: a clean, single, consecutive pass over your elements. Cache-friendly, CPU-prefetch-friendly, and it can even be parallelized relatively well.

Are there additional savings if the array length is always a power of 2 (assume N >= 4)?

I doubt it. You're thinking like a theoretician here - they/we always hate it when we have to soil our neat pseudo-code with some "if it's not divisible by two" corner cases.

einpoklum
  • 1,025
  • 6
  • 19

Basically, what you're asking about with partitioning into three groups is the famous Dutch national flag problem, posed by Dijkstra himself and analysed thoroughly (with proofs) in his book "A Discipline of Programming". Highly recommend. :)
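
To illustrate, here's a sketch of a flag-style pass in Python. It assumes the two boundary values, low and high, are already known (say, from quickselect or a sampled estimate), since the flag problem partitions against given values rather than finding them:

    def dutch_flag_partition(a, low, high):
        """One in-place pass: when the loop finishes, a[:i] < low,
        a[i:j] lies within [low, high], and a[j:] > high."""
        i, j, k = 0, 0, len(a) - 1
        while j <= k:
            if a[j] < low:
                a[i], a[j] = a[j], a[i]
                i += 1
                j += 1
            elif a[j] > high:
                a[j], a[k] = a[k], a[j]
                k -= 1
            else:
                j += 1
        return a[i:j]      # the middle group, e.g. for the interquartile mean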

Alex Chichigin
  • 267
  • 2
  • 5

You have an array of n items, and you want items number l to r to be where they would be in a fully sorted array; you don't care where the other elements are.

A very slight modification of quicksort does this easily. Start with the subarray from 1 to n. Using the quicksort partitioning method, you create two portions, from 1 to k and from k+1 to n. At this point, quicksort would sort both portions recursively. Instead, you only recurse into portions that contain at least one index in the range l to r.

Normal Quicksort: Call sort (1, n)
sort(l, r):
    if r > l
        partition (l, r) to a..b and c..d
        sort(a, b)
        sort(c, d)

Quicksort to get range x, y: Call sort(1, n, x, y)
sort(l, r, x, y):
    intersect l..r with x..y
    return if the intersection is empty
    partition (l, r) to a..b and c..d
    sort(a, b, x, y)
    sort(c, d, x, y)
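
For reference, a runnable version of that sketch in Python (0-based indices, random pivot, purely illustrative):

    import random

    def partial_quicksort(a, x, y, l=0, r=None):
        """Sort only the part of a[l..r] that overlaps the target range
        [x, y], using a Lomuto-style partition with a random pivot."""
        if r is None:
            r = len(a) - 1
        if r <= l or r < x or y < l:       # trivial, or no overlap with [x, y]
            return
        p = random.randint(l, r)
        a[p], a[r] = a[r], a[p]            # move the pivot to the end
        store = l
        for i in range(l, r):
            if a[i] < a[r]:
                a[store], a[i] = a[i], a[store]
                store += 1
        a[store], a[r] = a[r], a[store]    # pivot into its final position
        partial_quicksort(a, x, y, l, store - 1)
        partial_quicksort(a, x, y, store + 1, r)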

gnasher729
  • 32,238
  • 36
  • 56