
To test whether an algorithm for some problem is correct, the usual starting point is to run the algorithm by hand on a number of simple test cases -- try it on a few example problem instances, including a few simple "corner cases". This is a great heuristic: it quickly weeds out many incorrect attempts at an algorithm, and helps you understand why the algorithm doesn't work.

However, when learning algorithms, some students are tempted to stop there: if their algorithm works correctly on a handful of examples, including all of the corner cases they can think to try, then they conclude that the algorithm must be correct. There's always a student who asks: "Why do I need to prove my algorithm correct, if I can just try it on a few test cases?"

So, how do you fool the "try a bunch of test cases" heuristic? I'm looking for some good examples to show that this heuristic is not enough. In other words, I am looking for one or more examples of an algorithm that superficially looks like it might be correct, and that outputs the right answer on all of the small inputs that anyone is likely to come up with, but where the algorithm actually doesn't work. Maybe the algorithm just happens to work correctly on all small inputs and only fails for large inputs, or only fails for inputs with an unusual pattern.

Specifically, I am looking for:

  1. An algorithm. The flaw has to be at the algorithmic level. I am not looking for implementation bugs. (For instance, at a bare minimum, the example should be language-agnostic, and the flaw should relate to algorithmic concerns rather than software engineering or implementation issues.)

  2. An algorithm that someone might plausibly come up with. The pseudocode should look at least plausibly correct (e.g., code that is obfuscated or obviously dubious is not a good example). Bonus points if it is an algorithm that some student actually came up with when trying to solve a homework or exam problem.

  3. An algorithm that would pass a reasonable manual test strategy with high probability. Someone who tries a few small test cases by hand should be unlikely to discover the flaw. For instance, "simulate QuickCheck by hand on a dozen small test cases" should be unlikely to reveal that the algorithm is incorrect.

  4. Preferably, a deterministic algorithm. I've seen many students think that "try some test cases by hand" is a reasonable way to check whether a deterministic algorithm is correct, but I suspect most students would not assume that trying a few test cases is a good way to verify probabilistic algorithms. For probabilistic algorithms, there's often no way to tell whether any particular output is correct; and you can't hand-crank enough examples to do any useful statistical test on the output distribution. So, I'd prefer to focus on deterministic algorithms, as they get more cleanly to the heart of student misconceptions.

I'd like to teach the importance of proving your algorithm correct, and I'm hoping to use a few examples like this to help motivate proofs of correctness. I would prefer examples that are relatively simple and accessible to undergraduates; examples that require heavy machinery or a ton of mathematical/algorithmic background are less useful. Also, I don't want algorithms that are "unnatural"; while it might be easy to construct some weird artificial algorithm to fool the heuristic, if it looks highly unnatural or has an obvious backdoor constructed just to fool this heuristic, it probably won't be convincing to students. Any good examples?

D.W.

15 Answers


A common error is to use a greedy algorithm, which is not always the correct approach but may work on most test cases.

Example: Coin denominations $d_1,\dots,d_k$ and a number $n$; express $n$ as a sum of the $d_i$ with as few coins as possible.

A naive approach is to use the largest possible coin first, and greedily produce such a sum.

For instance, the coins with values $6$, $5$ and $1$ give the correct greedy answer for every number between $1$ and $14$ except $10$: greedy produces $10 = 6+1+1+1+1$ (five coins), while the optimum is $10 = 5+5$ (two coins).
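
To see the failure concretely, here is a minimal Python sketch (function names are mine) pitting the greedy choice against an exhaustive dynamic program:

    def greedy_coins(n, coins):
        # Always take the largest coin that still fits.
        count = 0
        for c in sorted(coins, reverse=True):
            count += n // c
            n %= c
        return count

    def fewest_coins(n, coins):
        # Dynamic program: provably minimal number of coins.
        best = [0] + [float("inf")] * n
        for amount in range(1, n + 1):
            for c in coins:
                if c <= amount:
                    best[amount] = min(best[amount], best[amount - c] + 1)
        return best[n]

    coins = [6, 5, 1]
    for n in range(1, 15):
        assert greedy_coins(n, coins) == fewest_coins(n, coins) or n == 10
    print(greedy_coins(10, coins), fewest_coins(10, coins))   # 5 vs. 2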

Per Alexandersson

I immediately recalled an example from R. Backhouse (this might have been in one of his books). Apparently, he had assigned a programming assignment where the students had to write a Pascal program to test equality of two strings. One of the programs turned in by a student was the following:

issame := (string1.length = string2.length);

if issame then
  for i := 1 to string1.length do
    issame := string1.char[i] = string2.char[i];

write(issame);

We can now test the program with the following inputs:

"university" "university" $\Rightarrow$ True; OK

"course" "course" $\Rightarrow$ True; OK

"" "" $\Rightarrow$ True; OK

"university" "course" $\Rightarrow$ False; OK

"lecture" "course" $\Rightarrow$ False; OK

"precision" "exactness" $\Rightarrow$ False, OK

All of this seems very promising: maybe the program does indeed work. But more careful testing with, say, "pure" and "true" reveals faulty output. In fact, the program says "True" whenever the strings have the same length and the same last character!

However, the testing had been pretty thorough: there were strings of different lengths, strings of equal length but different content, and even equal strings. Furthermore, the student had tested and executed every branch. You can't really argue the testing was careless here -- given that the program is indeed very simple, it might be hard to find the motivation and energy to test it thoroughly enough.


Another cute example is binary search. In TAOCP, Knuth says that "although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky". Apparently, a bug in the binary search implementation of Java went unnoticed for a decade. It was an integer overflow bug, and only manifested with large enough input. Tricky details of binary search implementations are also covered by Bentley in the book Programming Pearls.
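
For reference, here is a sketch of the relevant detail in Python (whose unbounded integers happen to hide the overflow; in Java or C, the commented line is where the decade-old bug lived):

    def binary_search(a, key):
        lo, hi = 0, len(a) - 1
        while lo <= hi:
            # Java's bug was `mid = (lo + hi) / 2`: once the array has
            # more than about 2^30 elements, lo + hi overflows a 32-bit int.
            mid = lo + (hi - lo) // 2   # the safe formulation
            if a[mid] < key:
                lo = mid + 1
            elif a[mid] > key:
                hi = mid - 1
            else:
                return mid
        return -1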

Bottom line: it can be surprisingly hard to be certain a binary search algorithm is correct by just testing it.

Juho

The best example I ever came across is primality testing:

input: natural number p, p != 2
output: is p a prime or not?
algorithm: compute 2**(p-1) mod p. If result = 1 then p is prime else p is not.

This works for (almost) every number, except for a very few counterexamples, and one actually needs a machine to find a counterexample in a realistic period of time. The first counterexample is $341 = 11 \times 31$, and the density of counterexamples actually decreases with increasing $p$, though only about logarithmically.
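
A direct transcription in Python (the function name is mine), together with that first counterexample:

    def fermat_prime(p):
        # The test above: declare p "prime" iff 2^(p-1) = 1 (mod p).
        return pow(2, p - 1, p) == 1

    print(fermat_prime(341))   # True, yet 341 = 11 * 31 is composite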

Instead of just using 2 as the base of the power, one may improve the algorithm by also trying additional small primes as the base whenever the previous base returned 1. And still there are counterexamples to this scheme, namely the Carmichael numbers; they are pretty rare, though.

Franki

Here's one that was thrown at me by Google reps at a convention I went to. It was shown in C++-style code (the trick needs references), but the same issue arises in other languages with references. Sorry for having to post code on [cs.se], but it's the only way to illustrate it.

swap(int& X, int& Y){
    X := X ^ Y
    Y := X ^ Y
    X := X ^ Y
}

This algorithm will work for any values given to x and y, even if they hold equal values. It will not work, however, if it's called as swap(x, x): in that situation x and y alias the same variable, so the first XOR sets it to 0 and it stays 0. This might not satisfy you, since you can somehow prove this operation to be correct mathematically, but still forget about this edge case.

ZeroUltimax

There is a whole class of algorithms that is inherently hard to test: pseudo-random number generators. You cannot test a single output; you have to investigate (many) series of outputs with statistical means. Depending on what and how you test, you may well miss non-random characteristics.

One famous case where things went horribly wrong is RANDU. It passed the scrutiny available at the time -- which failed to consider the behaviour of tuples of subsequent outputs. Already triples of consecutive outputs show lots of structure: plotted as points in 3D, they all fall on just 15 planes.
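
The structure is easy to exhibit. RANDU is $x_{n+1} = 65539\,x_n \bmod 2^{31}$ with an odd seed, and since $65539 = 2^{16}+3$, squaring the multiplier modulo $2^{31}$ shows that each output is a fixed linear combination of the two preceding ones. A few lines of Python (names mine) confirm it:

    def randu(seed, count):
        # RANDU: x_{n+1} = 65539 * x_n mod 2^31, seed odd.
        xs, x = [], seed
        for _ in range(count):
            x = (65539 * x) % 2**31
            xs.append(x)
        return xs

    xs = randu(1, 1000)
    # Every consecutive triple satisfies x[k+2] = 6*x[k+1] - 9*x[k] (mod 2^31),
    # which is why 3D points made of consecutive outputs fall on few planes.
    assert all((c - 6 * b + 9 * a) % 2**31 == 0
               for a, b, c in zip(xs, xs[1:], xs[2:]))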

Basically, the tests did not cover all use cases: while single-dimensional use of RANDU was (probably mostly) fine, it did not support using it to sample three-dimensional points (in this way).

Proper pseudo-random sampling is a tricky business. Luckily, there are powerful test suites these days, e.g. dieharder, that specialise in throwing all the statistics we know at a proposed generator. Is it enough?

To be fair, I have no idea what you can feasibly prove for PRNGs.

Raphael

2D local maximum

input: 2-dimensional $n \times n$ array $A$

output: a local maximum -- a pair $(i,j)$ such that $A[i,j]$ has no neighboring cell in the array that contains a strictly larger value.

(The neighboring cells are those among $A[i, j+1], A[i, j-1], A[i-1, j], A[i+1, j]$ that are present in the array.) So, for example, if $A$ is

$$\begin{array}{cccc} 0&1&3&\mathbf{4}\\ \mathbf{3}&2&\mathbf{3}&1\\ 2&\mathbf{5}&0&1\\ \mathbf{4}&0&1&\mathbf{3}\end{array}$$

then each bolded cell is a local maximum. Every non-empty array has at least one local maximum.

Algorithm. There is an $O(n^2)$-time algorithm: just check each cell. Here's an idea for a faster, recursive algorithm.

Given $A$, define the cross $X$ to consist of the cells in the middle column and the cells in the middle row. First check each cell in $X$ to see if the cell is a local maximum in $A$. If so, return such a cell. Otherwise, let $(i, j)$ be a cell in $X$ with maximum value. Since $(i, j)$ is not a local maximum, it must have a neighboring cell $(i', j')$ with a larger value.

Partition $A \setminus X$ (the array $A$, minus the cells in $X$) into four quadrants -- the upper left, upper right, lower left, and lower right quadrants -- in the natural way. The neighboring cell $(i', j')$ with larger value must be in one of those quadrants. Call that quadrant $A'$.

Lemma. Quadrant $A'$ contains a local maximum of $A$.

Proof. Consider starting at the cell $(i', j')$. If it is not a local maximum, move to a neighbor with a larger value. This can be repeated until arriving at a cell that is a local maximum. That final cell has to be in $A'$, because $A'$ is bounded on all sides by cells whose values are smaller than the value of cell $(i', j')$. This proves the lemma. $\diamond$

The algorithm calls itself recursively on the $\frac{n}{2}\times\frac{n}{2}$ sub-array $A'$ to find a local maximum $(i, j)$ there, then returns that cell.

The running time $T(n)$ for an $n\times n$ matrix satisfies $T(n) = T(n/2) + O(n)$, so $T(n) = O(n)$.

Thus, we have proven the following theorem:

Theorem. There is an $O(n)$-time algorithm for finding a local-maximum in an $n\times n$ array.

Or have we?
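
For readers who want to investigate empirically, here is one Python transcription of the algorithm as described (names are mine; the recursive call treats the quadrant as a standalone array, which is how the recursion above reads). Comparing its output against the definition on random arrays is instructive:

    import random

    def is_local_max(A, i, j, r0, r1, c0, c1):
        # Is (i, j) a local maximum within rows r0..r1-1, cols c0..c1-1?
        return all(not (r0 <= x < r1 and c0 <= y < c1 and A[x][y] > A[i][j])
                   for x, y in ((i-1, j), (i+1, j), (i, j-1), (i, j+1)))

    def find_local_max(A, r0, r1, c0, c1):
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        cross = ([(rm, j) for j in range(c0, c1)] +
                 [(i, cm) for i in range(r0, r1) if i != rm])
        for p in cross:                      # step 1: any cross cell a local max?
            if is_local_max(A, p[0], p[1], r0, r1, c0, c1):
                return p
        i, j = max(cross, key=lambda q: A[q[0]][q[1]])
        for x, y in ((i-1, j), (i+1, j), (i, j-1), (i, j+1)):
            if r0 <= x < r1 and c0 <= y < c1 and A[x][y] > A[i][j]:
                # Recurse into the quadrant containing the larger neighbour.
                nr0, nr1 = (r0, rm) if x < rm else (rm + 1, r1)
                nc0, nc1 = (c0, cm) if y < cm else (cm + 1, c1)
                return find_local_max(A, nr0, nr1, nc0, nc1)

    n = 8
    for trial in range(100000):
        A = [[random.randrange(100) for _ in range(n)] for _ in range(n)]
        i, j = find_local_max(A, 0, n, 0, n)
        if not is_local_max(A, i, j, 0, n, 0, n):
            print("non-local-max returned on trial", trial)
            break
    else:
        print("passed 100000 random tests -- proof, anyone?")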

Neal Young

The Fisher-Yates-Knuth shuffling algorithm is a (practical) example, and one on which one of the authors of this site has commented.

The algorithm generates a random permutation of a given array as:

 // To shuffle an array a of n elements (indices 0..n-1):
  for i from n − 1 downto 1 do
       j ← random integer with 0 ≤ j ≤ i
       exchange a[j] and a[i]

One sees that in the loop the elements are swapped between positions $i$ and $j$ with $0 \le j \le i$. This produces unbiased sampling of the permutations (no permutation is over-represented and none under-represented).

A "naive" algorithm could be:

 // To shuffle an array a of n elements (indices 0..n-1):
  for i from n − 1 downto 1 do
       j ← random integer with 0 ≤ j ≤ n-1
       exchange a[j] and a[i]

Here the element to be swapped is chosen from among all $n$ elements. However, this produces biased sampling of the permutations (some are over-represented, others under-represented).

Actually, one can come up with the Fisher-Yates-Knuth shuffle via a simple (or naive) counting analysis.

The number of permutations of $n$ elements is $n! = n \times (n-1) \times (n-2) \times \cdots \times 1$: the 1st element can be placed in any of $n$ positions, the 2nd element in the remaining $n-1$ positions, and so on. This is exactly what the Fisher-Yates shuffle does, which is why it produces unbiased (uniformly random) permutations, unlike the "naive" algorithm.

The main problem with verifying whether a shuffling algorithm is correct (biased or not) is that, due to the statistics involved, a large number of samples is needed. The Coding Horror article linked above explains exactly that (and with actual tests).
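
That said, for very small arrays the bias is visible with a modest experiment. A sketch in Python (names mine), counting how often each of the $3! = 6$ permutations of three elements appears under each algorithm:

    import random
    from collections import Counter

    def fisher_yates(a):
        a = list(a)
        for i in range(len(a) - 1, 0, -1):
            j = random.randint(0, i)           # 0 <= j <= i
            a[i], a[j] = a[j], a[i]
        return tuple(a)

    def naive_shuffle(a):
        a = list(a)
        for i in range(len(a) - 1, 0, -1):
            j = random.randint(0, len(a) - 1)  # 0 <= j <= n-1 : biased
            a[i], a[j] = a[j], a[i]
        return tuple(a)

    trials = 600000
    for shuffle in (fisher_yates, naive_shuffle):
        counts = Counter(shuffle((0, 1, 2)) for _ in range(trials))
        print(shuffle.__name__, [counts[p] / trials for p in sorted(counts)])
    # fisher_yates gives ~1/6 for each permutation; naive_shuffle cannot:
    # its 9 equally likely execution paths do not map evenly onto 6 outcomes.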

Nikos M.

Here are some primality examples, since such bugs are common.

(1) Primality in SymPy. Issue 1789. There was an incorrect test put on a well-known web site that didn't fail until after 10^14. While the fix was correct, it was just patching holes rather than rethinking the issue.

(2) Primality in Perl 6. Perl6 has added is-prime which uses a number of M-R tests with fixed bases. There are known counterexamples, but they're quite large since the default number of tests is huge (basically hiding the real problem by degrading performance). This will be addressed soon.

(3) Primality in FLINT. n_isprime() returning true for composites, since fixed. Basically the same issue as SymPy. Using the Feitsma/Galway database of SPRP-2 pseudoprimes to 2^64 we can now test these.

(4) Perl's Math::Primality. is_aks_prime broken. This seems similar to lots of AKS implementations -- lots of code that either worked by accident (e.g. got lost in step 1 and ended up doing the entire thing by trial division) or didn't work for larger examples. Unfortunately AKS is so slow that it is difficult to test.

(5) Pari's pre-2.2 is_prime. Math::Pari ticket. It used 10 random bases for M-R tests (with fixed seed on startup, rather than GMP's fixed seed every call). It will tell you 9 is prime about 1 out of every 1M calls. If you pick the right number you can get it to fail relatively often, but the numbers become sparser, so it doesn't show up much in practice. They have since changed the algorithm and API.

This isn't wrong but it's a classic of probabilistic tests: How many rounds do you give, say, mpz_probab_prime_p? If we give it 5 rounds, it sure looks like it works well -- numbers have to pass a base-210 Fermat test and then 5 pre-selected bases Miller-Rabin tests. You won't find a counterexample until 3892757297131 (with GMP 5.0.1 or 6.0.0a), so you'd have to do a lot of testing to find it. But there are thousands of counterexamples under 2^64. So you keep raising the number. How far? Is there an adversary? How important is a correct answer? Are you confusing random bases with fixed bases? Do you know what input sizes you'll be given?
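
To make the fixed-base trap concrete, here is a hedged Python sketch of Miller-Rabin with fixed bases (names mine); the counterexample is the well-known smallest strong pseudoprime to bases 2, 3, 5 and 7:

    def miller_rabin(n, bases):
        # Miller-Rabin with a fixed set of bases (n odd, n > 2).
        d, s = n - 1, 0
        while d % 2 == 0:
            d //= 2
            s += 1
        for a in bases:
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(s - 1):
                x = x * x % n
                if x == n - 1:
                    break
            else:
                return False
        return True

    n = 3215031751                          # = 151 * 751 * 28351, composite
    print(miller_rabin(n, (2, 3, 5, 7)))    # True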

There is a related point: what is a big number? Many students seem to think 10,000 is a huge number. To many programmers, $10^{16}$ is a big number. To programmers working on cryptography, these are small, and big is, say, 4096 bits. To programmers working on computational number theory, these are all small, and big might be 10 to 100 thousand decimal digits. To some mathematicians, all of these may be considered "not big", given that there are many more positive integers larger than any of these examples than there are smaller ones. This is something a lot of people don't think about, but it makes a difference when thinking about correctness and performance.

These are quite difficult to test correctly. My strategy includes obvious unit tests, plus edge cases, plus examples of failures seen before or in other packages, test vs. known databases where possible (e.g. if you do a single base-2 M-R test, then you've reduced the computationally infeasible task of testing 2^64 numbers to testing about 32 million numbers), and finally, lots of randomized tests using another package as a standard. The last point works for functions like primality where there is a fairly simple input and a known output, but quite a few tasks are like this. I have used this to find defects in both my own development code as well as occasional problems in the comparison packages. But given the infinite input space, we can't test everything.

As for proving correctness, here is another primality example. The BLS75 methods and ECPP have the concept of a primality certificate. Basically after they churn away doing searches to find values that work for their proofs, they can output them in a known format. One can then write a verifier or have someone else write it. These run very fast compared to the creation, and now either (1) both pieces of code are incorrect (hence why you'd prefer other programmers for the verifiers), or (2) the math behind the proof idea is wrong. #2 is always possible, but these have typically been published and reviewed by multiple people (and in some cases are easy enough for you to walk through yourself).

In comparison, methods like AKS, APR-CL, trial division, or the deterministic Rabin test, all produce no output other than "prime" or "composite." In the latter case we may have a factor hence can verify, but in the former case we're left with nothing other than this one bit of output. Did the program work correctly? Dunno.

It's important to test the software on more than just a few toy examples, and also to walk through some examples at each step of the algorithm and ask: "given this input, does it make sense that I am here with this state?"

DanaJ

The best example (read: the thing I am most butthurt about) I have ever seen has to do with the Collatz conjecture. I was in a programming competition (with a 500 dollar prize on the line for first place) in which one of the problems was to find the minimum number of steps it takes for two numbers to reach the same number. The solution, of course, is to alternately step each one until they both reach something that has been seen before. We were given a range of numbers (I think it was between 1 and 1000000) and told that the Collatz conjecture had been verified up to 2^64, so all of the numbers we were given would eventually converge to 1. However, I used 32-bit integers to do the steps. It turns out that there is one obscure number between 1 and 1000000 (170 thousand something) that will cause a 32-bit integer to overflow in due time. In fact, such numbers are extremely rare below 2^31. We tested our system on HUGE numbers far greater than 1000000 to "ensure" that overflow was not occurring; it turns out a much smaller number that we just didn't test caused the overflow. Because I used "int" instead of "long", I only got a 300 dollar prize rather than a 500 dollar prize.
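
In a language with unbounded integers, finding the culprit takes only a few lines; this sketch (names mine) prints the smallest starting value below 10^6 whose trajectory climbs past the signed 32-bit limit:

    def trajectory_peak(n):
        # Largest value reached by the Collatz trajectory of n.
        peak = n
        while n != 1:
            n = n // 2 if n % 2 == 0 else 3 * n + 1
            peak = max(peak, n)
        return peak

    LIMIT = 2**31 - 1   # maximum signed 32-bit int
    print(next(n for n in range(1, 10**6) if trajectory_peak(n) > LIMIT))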

Jake

The 0/1 knapsack problem is one that almost all students think is solvable by a greedy algorithm. That happens even more often if you have previously shown some greedy solutions, such as the version of the knapsack problem where a greedy algorithm does work (the fractional knapsack).

For these problems, in class, I show the proof of correctness for 0/1 knapsack (dynamic programming), to remove any doubt, and for the greedy (fractional) version too. Actually, both proofs are non-trivial, and students probably find them very helpful. In addition, there is a discussion of this in CLRS 3ed, Chapter 16, pages 425-427.

Problem: a thief robbing a store can carry a maximal weight of $W$ in his knapsack. There are $n$ items, and the $i$-th item weighs $w_i$ and is worth $v_i$ dollars. Which items should the thief take, to maximize his gain?

0/1 knapsack problem: the setup is the same, but the items may not be broken into smaller pieces, so the thief may decide either to take an item or to leave it (a binary choice); he may not take a fraction of an item.

And you can get from students some ideas or algorithms that follow the same idea as the greedy version of the problem, such as (see the sketch after this list):

  • Repeatedly put in the most valuable object that still fits, until the bag is full or no remaining object is light enough to go in.
  • Another wrong approach: insert the lighter items first, ordered from highest to lowest price.
  • ...
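
For instance, here is a sketch in Python (the item data is hypothetical) pitting the first greedy rule against the standard dynamic program:

    def greedy_by_value(capacity, items):
        # Greedy rule 1: repeatedly take the most valuable item that fits.
        total = 0
        for w, v in sorted(items, key=lambda wv: -wv[1]):
            if w <= capacity:
                capacity -= w
                total += v
        return total

    def knapsack_dp(capacity, items):
        # Standard 0/1 knapsack dynamic program (provably optimal).
        best = [0] * (capacity + 1)
        for w, v in items:
            for c in range(capacity, w - 1, -1):
                best[c] = max(best[c], best[c - w] + v)
        return best[capacity]

    items = [(5, 8), (3, 5), (3, 5)]     # (weight, value) pairs
    print(greedy_by_value(6, items))     # 8: grabs the big item, nothing else fits
    print(knapsack_dp(6, items))         # 10: the two small items are better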

Is this helpful for you? Actually, we know the coin problem is a version of the knapsack problem. But there are more examples in the forest of knapsack problems; for example, what about 2D knapsack (really useful when you want to cut wood to make furniture -- I saw it at a shop in my city)? It's very common to think that greedy works there too, but it does not.

Jonathan Prieto-Cubides

Python's PEP 450, which introduced statistics functions into the standard library, might be of interest. As part of the justification for having a function that calculates the variance in Python's standard library, the author Steven D'Aprano writes:

def variance(data):
    # Use the Computational Formula for Variance.
    n = len(data)
    ss = sum(x**2 for x in data) - (sum(data)**2)/n
    return ss/(n-1)

The above appears to be correct with a casual test:

>>> data = [1, 2, 4, 5, 8]
>>> variance(data)
  7.5

But adding a constant to every data point should not change the variance:

>>> data = [x+1e12 for x in data]
>>> variance(data)
  0.0

And variance should never be negative:

>>> variance(data*100)
  -1239429440.1282566

The issue is numerical: precision gets lost. If you want maximum precision, you have to order your operations carefully. A naive implementation leads to incorrect results because the imprecision is too large. That was one of the issues my numerics course at university was about.
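
One standard remedy (a sketch only; not the implementation the statistics module actually adopted) is the two-pass formula: subtract the mean first, so the squared quantities stay small:

    def variance_two_pass(data):
        # Shift by the mean before squaring; numerically far better behaved.
        n = len(data)
        mean = sum(data) / n
        ss = sum((x - mean) ** 2 for x in data)
        return ss / (n - 1)

    data = [x + 1e12 for x in [1, 2, 4, 5, 8]]
    print(variance_two_pass(data))   # 7.5, as it should be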

Christian

A common mistake is to implement a shuffling algorithm incorrectly. See the discussion on Wikipedia.

The trouble is that the bias is usually not easy to detect; one needs to prove that the algorithm makes exactly $n!$ equally likely "choices", and not $n^n$ or $(n-1)^n$, which are common for wrong implementations.

Per Alexandersson

For almost 40 years it was thought that an intuitive two-pointer algorithm for finding a maximum-area triangle inside a convex polygon was correct. It was proved incorrect in https://arxiv.org/abs/1705.11035.

Laakeri

While this is likely not quite what you're after, it's certainly easy to understand and testing some small cases without any other thinking will lead to an incorrect algorithm.

Problem: Write a function that takes a nonnegative integer $n$ and returns the number of proper divisors of $n^2+n+41$, namely the number of integers $d$ with $0 < d < n^2+n+41$ such that $d$ divides $n^2+n+41$.

Proposed solution:

int f(int n) {
   return 1;
}

This happens to be correct for $n = 0, 1, 2, \dotsc, 39$ but fails at $n = 40$: $40^2+40+41 = 41^2$, whose proper divisors are $1$ and $41$, so the correct answer is $2$.

This "try some small cases and infer an algorithm from the result" approach crops up frequently (though not as extremely as here) in programming competitions where the pressure is to come up with an algorithm that (a) is quick to implement and (b) has a fast run time.

Rick Decker

Though this question was asked a long time ago, let me also contribute two -- in my opinion -- demonstrative examples (both matching all the criteria you listed); maybe they will come in handy if you would like to update your slides someday.

  1. The famous activity selection problem. An intuitive (greedy) approach many people may come up with is to choose the shortest interval/activity first. Consider an edge case where a short activity conflicts with 2 other longer, non-conflicting activities (overlapping the end of the first and the beginning of the other): it is then easy to see that the algorithm is incorrect, since we could have picked those 2 instead of the single short one. Interestingly, the correct solution is also greedy (pick the activity that finishes earliest); see the sketch at the end of this answer.

  2. This one may be a less popular problem: instead of copying over all the details from Leetcode, please read them there before proceeding. When I first came across this problem, I already had a solid understanding of backtracking, so I thought: let's solve it with a 2-phase (greedy) backtracking approach! That is: try to collect as many cherries as possible on the way down to the bottom-right corner, then try to do the exact same thing on the way back to the upper-left corner (for the remaining, so-far-unpicked cherries). I solved 50 of the 56 test cases correctly. An edge case that I didn't consider was something like:

    [image: example grid, not preserved]

    My approach would have solved it like:

    [image: the route my 2-phase greedy would take, not preserved]

    Note that on the way back (2nd phase), due to the moving rules defined by the problem, you can pick up only one of the remaining 2 cherries. However, you can actually pick up all of them while still playing by the rules:

    [image: a route collecting all the cherries, not preserved]

Sorry for not posting pseudo-code snippets here: pseudocode for the first problem can be found on the Wikipedia page, and the translation of my C++ solution for the 2nd problem would have been way too long/complex.
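
That said, for the activity selection problem a short sketch is easy. Here it is in Python (interval representation assumed, names mine), comparing the shortest-first rule with the correct earliest-finish-time greedy on the edge case described above:

    def shortest_first(acts):
        # Plausible but wrong greedy: pick the shortest activity first.
        chosen = []
        for s, e in sorted(acts, key=lambda a: a[1] - a[0]):
            if all(e <= cs or s >= ce for cs, ce in chosen):
                chosen.append((s, e))
        return chosen

    def earliest_finish(acts):
        # The correct greedy: scan by finish time, keep compatible activities.
        chosen, last_end = [], float("-inf")
        for s, e in sorted(acts, key=lambda a: a[1]):
            if s >= last_end:
                chosen.append((s, e))
                last_end = e
        return chosen

    acts = [(0, 4), (3, 5), (4, 8)]
    print(len(shortest_first(acts)))    # 1: the short middle activity blocks both
    print(len(earliest_finish(acts)))   # 2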

mindthegap