10

I use a variation of a 5-cross median filter on image data on a small embedded system, i.e.

    x
  x x x
    x

The algorithm is really simple: read 5 unsigned integer values, get the highest 2, do some calculations on those and write back the unsigned integer result.

What is nice is that the 5 integer input values are all in the range 0-20. The calculated integer values are also in the range 0-20!

Through profiling, I have figured out that getting the largest two numbers is the bottleneck so I want to speed this part up. What is the fastest way to perform this selection?

The current algorithm builds a 32-bit mask with a 1 bit at each position given by the 5 numbers and then uses a HW-supported CLZ (count leading zeros) function to find the highest set bits.
I should say that the CPU is a proprietary one, not available outside of my company. My compiler is GCC but tailor made for this CPU.
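For readers unfamiliar with the approach, a mask-based selection might look roughly like the Python sketch below. This is a guess at the general idea, not the actual implementation: the names `clz32` and `top_two_mask` are made up for illustration, `clz32` stands in for the hardware CLZ instruction, and tracking duplicates in a second mask is an assumption.

```python
def clz32(x):
    """Count leading zeros of a 32-bit word (stand-in for a hardware CLZ)."""
    return 32 - x.bit_length()

def top_two_mask(values):
    """Return the two largest inputs (each in 0..20), largest first."""
    seen = 0  # bit v is set if value v occurred at least once
    dup = 0   # bit v is set if value v occurred at least twice (assumption)
    for v in values:
        bit = 1 << v
        dup |= seen & bit
        seen |= bit
    first = 31 - clz32(seen)
    if (dup >> first) & 1:
        return first, first          # the maximum occurs more than once
    rest = seen & ~(1 << first)      # clear the top bit, then look again
    return first, 31 - clz32(rest)
```

Without the `dup` mask, an input such as [5,5,0,0,0] would report 5 and 0 rather than 5 and 5.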

I have tried to figure out if I can use a lookup-table but I have failed to generate a key that I can use.

I have $21^5$ combinations for the input but order isn't important, i.e. [5,0,0,0,5] is the same as [5,5,0,0,0].
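Since order does not matter, the number of distinct inputs is the number of multisets of size 5 drawn from 21 values, i.e. $\binom{25}{5}$. A quick check using only the Python standard library (variable names are illustrative):

```python
from itertools import combinations_with_replacement
from math import comb

ordered = 21 ** 5                    # all ordered 5-tuples
unordered = comb(21 + 5 - 1, 5)      # multisets of size 5 from 21 values
enumerated = sum(1 for _ in combinations_with_replacement(range(21), 5))
print(ordered, unordered, enumerated)  # 4084101 53130 53130
```

So only 53130 of the 4084101 ordered inputs are really different.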

It happens that the hash-function below produces a perfect hash without collisions!

def hash(x):
    h = 0
    for i in x:
        h = 33*h+i
    return h

But the hash is huge and there is simply not enough memory to use that.
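To get a feel for the sizes involved, the sketch below applies the same recurrence to every sorted input and reports the number of distinct keys and the largest key. The function is renamed `key33` only to avoid shadowing Python's built-in `hash`; sorting the input first is an assumption about how order independence is obtained.

```python
from itertools import combinations_with_replacement

def key33(x):
    """Same recurrence as the hash above: base-33 positional value."""
    h = 0
    for i in x:
        h = 33 * h + i
    return h

# All sorted 5-tuples with entries in 0..20.
inputs = list(combinations_with_replacement(range(21), 5))
keys = {key33(t) for t in inputs}
print(len(inputs), len(keys), max(keys))  # 53130 53130 24459620
```

The keys are distinct, but they are spread over a range of about 24 million for only 53130 inputs.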

Is there a better algorithm that I can use? Is it possible to solve my problem using a lookup-table and generating a key?

Raphael
Fredrik Pihl

4 Answers

11

In my other answer I suggest that conditional jumps might be the main impediment to efficiency. As a consequence, sorting networks come to mind: they are data agnostic, that is the same sequence of comparisons is executed no matter the input, with only the swaps being conditional.

Of course, sorting may be too much work; we only need the biggest two numbers. Lucky for us, selection networks have also been studied. Knuth tells us that finding the two smallest numbers out of five² can be done with $\hat{U}_2(5) = 6$ comparisons [1, 5.3.4 ex 19] (and at most as many swaps).

The network he gives in the solutions (rewritten to zero-based arrays) is

$\qquad\displaystyle [0:4]\,[1:4]\,[0:3]\,[1:3]\,[0:2]\,[1:2]$

which implements -- after adjusting the direction of the comparisons -- in pseudocode as

def selMax2(a : int[])
  a.swap(0,4) if a[0] < a[4]
  a.swap(1,4) if a[1] < a[4]
  a.swap(0,3) if a[0] < a[3]
  a.swap(1,3) if a[1] < a[3]
  a.swap(0,2) if a[0] < a[2]
  a.swap(1,2) if a[1] < a[2]
  return (a[0], a[1])
end
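A runnable Python transcription of this network, as a sketch: each compare-exchange is written with `max`/`min` so that the same six steps run for every input. The name `sel_max2` and the reduced-domain check are illustrative choices.

```python
from itertools import product

NETWORK = ((0, 4), (1, 4), (0, 3), (1, 3), (0, 2), (1, 2))

def sel_max2(values):
    """Return the two largest of five values (in no particular order)."""
    a = list(values)
    for i, j in NETWORK:
        # Compare-exchange: the larger value moves to the lower index.
        a[i], a[j] = max(a[i], a[j]), min(a[i], a[j])
    return a[0], a[1]

# Exhaustive check on a reduced domain (4**5 inputs, duplicates included).
for x in product(range(4), repeat=5):
    assert sorted(sel_max2(x)) == sorted(x)[-2:]
```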

Now, naive implementations still have conditional jumps (across the swap code). Depending on your machine you can circumvent them with conditional instructions, though. x86 seems to be its usual mudpit self; ARM looks more promising since apparently most operations are conditional in themselves. If I understand the instructions correctly, the first swap translates to this, assuming our array values have been loaded to registers R0 through R4 and R5 is free as a temporary:

CMP     R0, R4      ; compare a[0] with a[4]
MOVLT   R5, R0      ; if a[0] < a[4]: tmp  = a[0]
MOVLT   R0, R4      ;                 a[0] = a[4]
MOVLT   R4, R5      ;                 a[4] = tmp

Yes, yes, of course you can use XOR swapping with EOR.

I just hope your processor has this, or something similar. Of course, if you build the thing for this purpose, maybe you can get the network hard-wired on there?
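If neither conditional instructions nor a hard-wired network are available, a compare-exchange can also be written with arithmetic alone. The sketch below uses the well-known sign-mask trick; it is not taken from any particular instruction set, `cmpx` is a hypothetical name, and Python's `>>` on a negative integer plays the role of an arithmetic shift right on a 32-bit register.

```python
def cmpx(a, b):
    """Return (max(a, b), min(a, b)) without a branch."""
    d = a - b
    m = d >> 31   # -1 (all ones) if a < b, else 0; valid for |d| < 2**31
    t = d & m     # a - b if a < b, else 0
    return a - t, b + t
```

For example, `cmpx(3, 7)` and `cmpx(7, 3)` both give `(7, 3)`.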

This is probably (provably?) the best you can do in the classical realm, i.e. without making use of the limited domain and performing wicked intra-word magicks.


  1. Sorting and Searching by Donald E. Knuth; The Art of Computer Programming Vol. 3 (2nd ed, 1998)
  2. Note that this leaves the two selected elements unordered. Ordering them requires an extra comparison, that is, $\hat{W}_2(5) = 7$ comparisons in total [1, p234 Table 1].
Raphael
4

Just so that it's on the table, here's a direct algorithm:

// Sort x1, x2
if x1 < x2
  M1 = x2
  m1 = x1
else
  M1 = x1
  m1 = x2
end

// Sort x3, x4
if x3 < x4
  M2 = x4
  m2 = x3
else
  M2 = x3
  m2 = x4
end

// Pick largest two
if M1 > M2
  M3 = M1
  if m1 > M2
    m3 = m1
  else
    m3 = M2
  end
else
  M3 = M2
  if m2 > M1
    m3 = m2
  else
    m3 = M1
  end
end

// Insert x5
if x5 > M3
  m3 = M3
  M3 = x5
else if x5 > m3
  m3 = x5
end

By clever implementation of if ... else, one can get rid of some unconditional jumps a direct translation would have.

This is ugly but takes only

  • five or six comparisons (i.e. conditional jumps),
  • nine to ten assignments (with 11 variables, all in registers) and
  • no additional memory access.
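The comparison count can be checked with a small Python transcription of the steps above (with the fifth input inserted last). This is only a sketch: `top_two_direct` is an illustrative name and the counter `n` exists purely for the check.

```python
from itertools import product

def top_two_direct(x1, x2, x3, x4, x5):
    """Return (largest, second largest, number of comparisons used)."""
    M1, m1 = (x2, x1) if x1 < x2 else (x1, x2)   # sort x1, x2
    M2, m2 = (x4, x3) if x3 < x4 else (x3, x4)   # sort x3, x4
    n = 4  # the two sorts above plus the two comparisons just below
    if M1 > M2:
        M3, m3 = M1, (m1 if m1 > M2 else M2)
    else:
        M3, m3 = M2, (m2 if m2 > M1 else M1)
    n += 1                                       # insert the fifth value
    if x5 > M3:
        M3, m3 = x5, M3
    else:
        n += 1
        if x5 > m3:
            m3 = x5
    return M3, m3, n

# Reduced-domain check: correct result and five or six comparisons.
for x in product(range(4), repeat=5):
    hi, lo, n = top_two_direct(*x)
    assert [lo, hi] == sorted(x)[-2:] and n in (5, 6)
```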

In fact, six comparisons is optimal for this problem as Theorem S in section 5.3.3 of [1] shows; here we need $W_2(5)$.

This cannot be expected to be fast on machines with pipelining, though; given the high percentage of conditional jumps, most of the time would probably be spent in stalls.

Note that a simpler variant -- sort x1 and x2, then insert the other values subsequently -- takes four to seven comparisons and only five to six assignments. Since I expect jumps to be of higher cost here, I stuck with this one.
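For comparison, the simpler variant might be sketched like this in Python; `top_two_insert` is an illustrative name and `n` again only counts comparisons.

```python
def top_two_insert(values):
    """Return (largest, second largest, number of comparisons used)."""
    x1, x2, *rest = values
    M, m = (x2, x1) if x1 < x2 else (x1, x2)  # sort the first two values
    n = 1
    for x in rest:                            # insert the remaining three
        n += 1
        if x > M:
            M, m = x, M
        else:
            n += 1
            if x > m:
                m = x
    return M, m, n
```

An increasing input such as [1,2,3,4,5] takes four comparisons, a decreasing one such as [5,4,3,2,1] takes seven.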


  1. Sorting and Searching by Donald E. Knuth; The Art of Computer Programming Vol. 3 (2nd ed, 1998)
Raphael
4

This could be a great application and test case for the Souper project. Souper is a superoptimizer -- a tool that takes a short sequence of code as input, and tries to optimize it as much as possible (tries to find an equivalent sequence of code that will be faster).

Souper is open source. You might try running Souper on your code snippet to see if it can do any better.

See also John Regehr's contest on writing fast code to sort 16 4-bit values; it's possible that some of the techniques there might be useful.

D.W.
3

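You can use a $21^3$ table that takes three integers and outputs the two largest, encoded as a single number below $21^2$ so that the result can be fed back into the table. The lookups can then be chained, three in total: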

T[T[T[441*a+21*b+c]*21+d]*21+e]
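A minimal sketch of how such a table could be built and used. The encoding of the output pair as `21*hi + lo` is an assumption made here so that the chained indexing works; `build_table` and `top_two_lookup` are illustrative names.

```python
N = 21

def build_table():
    """T[441*a + 21*b + c] = 21*hi + lo, with hi >= lo the two largest of a, b, c."""
    table = []
    for a in range(N):
        for b in range(N):
            for c in range(N):
                lo, hi = sorted((a, b, c))[1:]
                table.append(N * hi + lo)
    return table

T = build_table()  # 21**3 = 9261 entries, each below 441

def top_two_lookup(a, b, c, d, e):
    k = T[T[T[441 * a + 21 * b + c] * 21 + d] * 21 + e]
    return divmod(k, N)  # decode to (largest, second largest)
```

Multiplying an encoded pair by 21 and adding the next input yields `441*hi + 21*lo + d`, which is exactly the index for the triple (hi, lo, d).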

Similarly, using a $21^4$ table, you can reduce it to two table lookups, though it's not clear that this would be faster.

If you really want a small table, you can use two $21^2$ tables to "sort" two numbers, and then use a sorting network. According to Wikipedia, this requires at most 18 table lookups (9 comparators); you might be able to do with less since (1) you only want to know the two largest elements, and (2) for some comparator gates, you might only be interested in the maximum.

You can also use a single $21^2$ table. Implementing a sorting network then uses fewer memory accesses but more arithmetic. This way you get at most 9 table lookups.
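As a sketch of the single-table idea: each entry stores both outputs of a comparator as `21*max + min`, and decoding costs a division and a remainder. The example pairs it with the six-comparator selection network [0:4] [1:4] [0:3] [1:3] [0:2] [1:2] instead of a full sorting network, which is an assumption here and brings the count down to six lookups; `PAIR` and `top_two_one_table` are illustrative names.

```python
N = 21
# PAIR[21*a + b] encodes (max(a, b), min(a, b)) as 21*max + min.
PAIR = [N * max(a, b) + min(a, b) for a in range(N) for b in range(N)]

NETWORK = ((0, 4), (1, 4), (0, 3), (1, 3), (0, 2), (1, 2))

def top_two_one_table(values):
    """Return the two largest of five values in 0..20 (in no particular order)."""
    a = list(values)
    for i, j in NETWORK:
        # One lookup per comparator; divmod splits the pair again.
        a[i], a[j] = divmod(PAIR[N * a[i] + a[j]], N)
    return a[0], a[1]
```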

Yuval Filmus