
Suppose we have a list $L$ consisting of $N$ numbers (which may include repetitions).

I am curious: which is more computationally intensive to calculate, the mean or the median?

Naively, I would suppose calculating the mean involves summing up the $N$ numbers and then dividing by $N$, hence it has linear $O(N)$ complexity.
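That one-pass computation can be sketched as follows (Python chosen just for illustration; the function name is mine):

```python
def mean(values):
    """Mean in O(N): one pass to sum the values, then one division."""
    total = 0.0
    for v in values:  # N additions
        total += v
    return total / len(values)
```

For example, `mean([1, 2, 3, 4])` gives `2.5`.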

Computing the median would seem to require some sort of sorting algorithm (https://en.wikipedia.org/wiki/Sorting_algorithm); the best comparison-based sorting algorithms run in $O(N\log N)$ time.

Hence, for general $N$, is it more computationally intensive to calculate the median? Is my reasoning correct?

Thanks for any help.

yoyostein

2 Answers


You can find the median in linear time using the linear-time selection algorithm (median of medians). There are also faster randomized algorithms such as quickselect and Floyd–Rivest.

The two tasks are really incomparable, since computing the mean requires arithmetic (mainly addition) whereas computing the median requires comparisons.

Yuval Filmus

Just quickly explaining how you can find the median in linear time: say you have a million items; in sorted order they would occupy positions 0 .. 999,999, and the median is the average of the items at positions 499,999 and 500,000.

Run the quicksort algorithm, but after each partitioning step, don't recurse into both halves; only recurse into the half (or the two halves) containing elements #499,999 and #500,000.
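The partition-only-one-side idea can be sketched as a quickselect (a sketch in Python; the names are illustrative, and the pivot is chosen at random rather than by quicksort's usual scheme):

```python
import random

def kth_smallest(values, k):
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    items = list(values)
    lo, hi = 0, len(items) - 1
    while True:
        pivot = items[random.randint(lo, hi)]
        # Three-way partition of items[lo..hi] around the pivot.
        less = [x for x in items[lo:hi + 1] if x < pivot]
        equal = [x for x in items[lo:hi + 1] if x == pivot]
        greater = [x for x in items[lo:hi + 1] if x > pivot]
        items[lo:hi + 1] = less + equal + greater
        if k < lo + len(less):
            hi = lo + len(less) - 1        # recurse only into the left part
        elif k < lo + len(less) + len(equal):
            return pivot                    # k-th element equals the pivot
        else:
            lo = lo + len(less) + len(equal)  # recurse only into the right part

def median(values):
    """Median via selection: middle element for odd n, average of the two middle ones for even n."""
    n = len(values)
    if n % 2 == 1:
        return kth_smallest(values, n // 2)
    return (kth_smallest(values, n // 2 - 1) + kth_smallest(values, n // 2)) / 2
```

For even $n$ this makes two selection calls for simplicity; as described above, a single partitioning pass can track both middle positions at once.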

The average can be computed in O(n) time by just adding all the values and dividing by n. The problem is that you get rounding errors. At the extreme, you could get a result that is less than the minimum or greater than the maximum of all the values (especially if all items are equal to the same value x: due to rounding errors, it's quite unlikely that your result is exactly x).
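A small demonstration of such rounding loss (the specific numbers are chosen purely to trigger it): in double precision, adding 1.0 to 1e16 rounds back to 1e16, so a naive left-to-right sum can lose terms entirely. Python's `math.fsum`, which computes a correctly rounded sum, is shown for comparison:

```python
import math

values = [1e16, 1.0, -1e16]  # contrived values that expose rounding loss

naive = 0.0
for v in values:
    naive += v            # 1e16 + 1.0 rounds back to 1e16, so the 1.0 is lost

print(naive)              # 0.0 -- the 1.0 has vanished
print(math.fsum(values))  # 1.0 -- correctly rounded sum
```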

A reasonably precise method for large n is this: add the numbers in pairs. Say $b_0 = a_0 + a_1$, $b_1 = a_2 + a_3$, etc. Then $c_0 = b_0 + b_1$, $c_1 = b_2 + b_3$, and so on, until only one number is left. Since the intermediate results are smaller than if you added sequentially, the rounding errors are smaller, so you get a better approximation of the average.
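The pairwise scheme can be sketched recursively (a sketch; the function names are mine):

```python
def pairwise_sum(a):
    """Sum by repeated halving: rounding error grows like O(log n) instead of O(n)."""
    n = len(a)
    if n == 0:
        return 0.0
    if n == 1:
        return a[0]
    mid = n // 2
    # Each recursive call sums a smaller block, keeping intermediate results small.
    return pairwise_sum(a[:mid]) + pairwise_sum(a[mid:])

def pairwise_mean(a):
    return pairwise_sum(a) / len(a)
```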

That approximation is still not perfect. If the average you calculated is $A$, you then calculate the average of the residuals $a_i - A$. This is more precise, since the values involved are smaller (their sum should in theory be $0$ but isn't, due to rounding errors); you then add that average to $A$ to get a better result.

It's still linear time, but it's a bit slower than just adding all the numbers.

gnasher729