15

I just read Can this algorithm still be considered a Binary Search algorithm? and recalled that a few years back I wrote an indexer/search for log files to find log entries in large plain text files by date/time window.

Whilst doing this, I decided to try interpolation search (I didn't know that was what it was called, I kind of stumbled across the idea by myself). Then for some reason I continued to the idea of alternating interpolation steps with binary split steps: On step 0 I would interpolate to decide test point, then step 1 I would take the exact midpoint etc.

I then benchmarked the system using pure interpolation search, pure binary search and my combination attempt. The alternating approach was a clear winner, both in time and number of tests required before finding a set of randomly chosen times.

Inspired by the linked question, I just made a quick search for "alternating interpolation search and binary search" and found nothing. I also tried "hedged interpolation search" as suggested on my comment on one of the answers.

Have I stumbled across a known thing? Is there any theoretical justification for it being faster for certain types of data? The log files were typically large for the time (e.g. 1-2 GB of text with maybe 10 million rows to search), and the spread of dates/times in them was complex with heavy bursts of activity, general peak times and quiet times. My benchmark tests sampled from an even distribution of target times to find.

Neil Slater
  • 253
  • 1
  • 8

3 Answers3

6

Interleaving two algorithms to get the best of both worlds is a known technique, though it is usually stated as running them in "parallel" and returning an answer as soon as either terminates.

Though theoretically faster, interpolation search has two disadvantages compared to binary search:

  • It has terrible (linear) worst case performance

  • The overhead of computing the midpoint is rather large; a binary search iteration is hundreds of times faster than an interpolation search one

I would expect that an approach where you do interpolation search while the range is large and switch to binary search when the range becomes small is the most efficient. It would be nice if you could try this experiment.

As your dataset becomes small, the difference between $\log n$ and $\log \log n$ becomes insignificant; $\log n$ is already really small, and $\log \log n$ couldn't possibly be much smaller. At this point, the overhead of doing interpolation search is not worth it compared to the iterations you might save.

I think that your results can be explained by two phenomena:

  • Combining with binary search allows you to avoid the worst-case behavior

  • The positive effect of switching to binary search on a small dataset

Tom van der Zanden
  • 13,493
  • 1
  • 39
  • 56
5

Have I stumbled across a known thing?

There are various methods, based on a mix of interpolation-search and binary search, with a $O(log\ log\ n)$ average case access time (uniform distribution) and $O(log\ n)$ worst case time (values unevenly distributed):

  • Introspective search is your method (iterating between an interpolation search and a binary search). I haven't further details.
  • Interpolation-binary search (IBS) by N. Santoro, J. B. Sidney (1985).

    The general idea is that interpolation search is useful only when the array searched is larger than a given threshold. When the considered search segment is smaller than a user-defined threshold, binary search is applied unconditionally. Vice-versa, over that threshold, an interpolation search step is applied, followed eventually by a binary search step.

    This has many common points with your approach.

  • Adaptive search (AS) by Biagio Bonasera, Emilio Ferrara, Giacomo Fiumara, Francesco Pagano, Alessandro Provetti

    Using the authors' words:

    [Interpolation-binary search] devised a similar solution that combines (but does not blend) together interpolation and binary search. Although the asymptotic complexity is the same, there are some marked differences.

    [CUT]

    Hence, it is possible to show that for any input AS will not take more elementary operations than IBS.

    The algorithm may spend up to double number of operations than "simple" interpolation search in carefully finding out the best halving of the search segment, which in turn will mean that less iterations shall be needed to complete (but you have an even greater overhead).

manlio
  • 2,092
  • 20
  • 30
2

Last study published here and compared with Adaptive search experimentally. Interpolated binary search: An efficient hybrid search algorithm on ordered datasets