Wilson Confidence Interval for binomial proportion, 99.9% level yields strange results for p=0.5

Question

Fellow Stackers,

I have a formal customer requirement for a mechanism to have a failure rate of 0.1% or less. After some study, I found various methods for estimating the confidence interval of a binomial proportion and settled on the Wilson score interval (Wilson, 1927) [1,2]. I got a seemingly reasonable answer: if you run n=1637 trials and observe 0 failures, the one-sided 90% Wilson lower limit on the mechanism's success probability exceeds p=0.999, so you can conclude with 90% confidence that the failure rate is at most 0.1%. Sounds plausible, but I'm not sure about the details (z vs. t) or the conclusion.
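For the zero-failure column, the Wilson lower limit has a closed form that gives a quick check on the 1637 figure. With X = n we have p_hat = 1, the square-root term reduces to z^2/(2n), and the lower limit simplifies to n/(n + z^2). Requiring that to exceed p gives n > p*z^2/(1 - p). A minimal Perl sketch, assuming this one-sided, zero-failure setting (the sub names are illustrative, not part of the original script):

```perl
use strict;
use warnings;

# Zero-failure Wilson lower limit: with X = n, p_hat = 1 and
# C*(D - E) simplifies to n / (n + z^2).
sub wilson_lower_zero_fail {
    my ($n, $z) = @_;
    return $n / ($n + $z * $z);
}

# Smallest integer n with n / (n + z^2) > p, i.e. n > p*z^2/(1 - p).
# int() truncates, which equals floor() here because the bound is positive.
sub min_trials_zero_fail {
    my ($p, $z) = @_;
    return int($p * $z * $z / (1 - $p)) + 1;
}

for my $z (1.28, 1.65, 2.33) {
    printf "z=%.2f  n=%d\n", $z, min_trials_zero_fail(0.999, $z);
}
```

These reproduce the 0-failure entries of the 0.1% rows in the tables (1637, 2720, 5424). Because n is roughly z^2/(1 - p) for p near 1, the required effort grows with the square of z as well as with 1/(1 - p).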

I made a table of results for several confidence levels and several values of p. I included p=0.5, for grins. At the 90% confidence level (one-sided), n=2 trials without a failure (2 heads or 2 tails) are sufficient to conclude that the coin has p=0.5. At the 95% level, 3 in a row (3 heads or 3 tails) are sufficient to conclude that the coin is fair. At the 99% level (z=2.33), you need to observe 6 in a row (all heads or all tails) to conclude that p=0.5. I've flipped a lot of coins in my day, and a run of 6 is fairly unusual. Based on this, the k-in-a-row calculation doesn't really tell you how much coin-flipping work to expect, right? It does tell you what kind of sample would prove p, if you happen to observe such a sample. Is that correct?
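To put numbers on how unusual those runs are: the zero-failure trial counts for p=0.5 in the tables are n = 2, 3 and 6, and for a fair coin the chance that a single block of n flips comes up all heads or all tails is 2 * 0.5^n. A minimal Perl sketch, assuming independent flips of a fair coin (prob_all_same is a hypothetical helper name):

```perl
use strict;
use warnings;

# Probability that n independent flips of a fair coin all land the
# same way (all heads or all tails): 2 * 0.5**n.
sub prob_all_same {
    my ($n) = @_;
    return 2 * 0.5**$n;
}

for my $n (2, 3, 6) {
    printf "n=%d  P(all same)=%.5f\n", $n, prob_all_same($n);
}
```

This prints 0.50000, 0.25000 and 0.03125, so a block of 6 matching flips turns up about once in 32 blocks. Note that this is the chance for one block of exactly n flips, not the chance of seeing such a run somewhere inside a longer sequence.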

Going back to the original issue, a 0.1% failure rate for the mechanism: I want to tell my manager what to expect in terms of labor. Do I tell him to plan for a tech to run 1637 trials, probably several times over? If we get lucky and the first sample of 1637 trials doesn't contain a failure, I can quit, right? But there's no way to know when a 0-failure sample will occur.
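The chance that a single block passes can be computed once a true failure rate is assumed. With independent trials that each fail with probability q, a block of n trials has zero failures with probability (1 - q)^n. The q values in this sketch are hypothetical, since the true rate is exactly what is unknown; the sketch only evaluates the formula for the 1637-trial block:

```perl
use strict;
use warnings;

# Probability of observing zero failures in n independent trials when
# each trial fails with probability q.
sub prob_zero_failures {
    my ($n, $q) = @_;
    return (1 - $q)**$n;
}

# Assumed true failure rates (illustrative only).
for my $q (0.0001, 0.0005, 0.001) {
    printf "q=%.4f  P(0 failures in 1637)=%.3f\n", $q, prob_zero_failures(1637, $q);
}
```

If the true failure rate sat exactly at the 0.1% limit, a 1637-trial block would come through clean only about 19% of the time; the lower the true rate, the better the odds.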

So my questions are:

Am I calculating the minimum number of trials correctly?
Can I calculate how likely I am to observe a sample that proves p? In other words, how likely am I to get a sufficient number of successes to prove p (at a given confidence level) on the first try?
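For the columns that allow j failures, the corresponding first-try probability is the binomial CDF P(X <= j) for X ~ Binomial(n, q), again under an assumed true failure rate q. A minimal Perl sketch (binom_cdf is a hypothetical helper, not from the original script), summing the pmf terms with the recurrence t_{k+1} = t_k * (n - k)/(k + 1) * q/(1 - q):

```perl
use strict;
use warnings;

# P(X <= j) for X ~ Binomial(n, q), summing the pmf terms with the
# recurrence t_{k+1} = t_k * (n - k)/(k + 1) * q/(1 - q).
# Assumes 0 < q < 1. The first term (1-q)^n can underflow to 0 for very
# large n*q, which is not an issue for the ranges in the tables.
sub binom_cdf {
    my ($n, $q, $j) = @_;
    my $term = (1 - $q)**$n;    # P(X = 0)
    my $sum  = $term;
    for my $k (0 .. $j - 1) {
        $term *= ($n - $k) / ($k + 1) * $q / (1 - $q);
        $sum  += $term;
    }
    return $sum;
}

# Example: the 90% table's 0.1% row allows 1 failure in 3338 trials.
printf "P(X <= 1) = %.4f\n", binom_cdf(3338, 0.001, 1);
```

With j = 0 this reduces to (1 - q)^n. In the example, if the true failure rate were exactly 0.1%, the 3338-trial, 1-failure plan would pass on the first try only about 15% of the time.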

Thanks for your attention. I am looking forward to some expert commentary!

KLN

References


       Minimum Trials to Prove Failure Rate

             90% Confidence, one-sided (z=1.28)

           Minimum Trials, by Number of Failures Observed
 Failure   --------------------------------------------
 Rate %      0        1         2       3        4
-------------------------------------------------------
  0.01     16383    33388    48060    61826    75069
  0.05      3276     6677     9611    12364    15012
  0.1       1637     3338     4805     6181     7505
  0.2        818     1668     2402     3090     3752
  0.3        545     1112     1601     2059     2501
  0.5        327      666      960     1235     1500
  1          163      333      479      617      749
  2           81      166      239      307      374
  5           32       65       95      122      148
 10           15       32       47       60       73
 50            2        5        8       11       13
-------------------------------------------------------


             95% Confidence, one-sided (z=1.65)

           Minimum Trials, by Number of Failures Observed
 Failure   --------------------------------------------
 Rate %      0        1         2       3        4
-------------------------------------------------------
  0.01     27223    45001    60625    75265    89307
  0.05      5443     8998    12123    15051    17859
  0.1       2720     4498     6060     7524     8928
  0.2       1359     2248     3029     3761     4463
  0.3        905     1498     2018     2506     2974
  0.5        542      898     1210     1503     1783
  1          270      448      604      750      890
  2          134      223      301      374      444
  5           52       88      119      148      176
 10           25       43       58       73       86
 50            3        7        9       12       15
-------------------------------------------------------


             99% Confidence, one-sided (z=2.33)

           Minimum Trials, by Number of Failures Observed
 Failure   --------------------------------------------
 Rate %      0        1         2       3        4
-------------------------------------------------------
  0.01     54284    72913    89831   105775   121068
  0.05     10853    14578    17962    21151    24209
  0.1       5424     7287     8978    10573    12102
  0.2       2710     3641     4487     5284     6048
  0.3       1805     2426     2989     3521     4030
  0.5       1081     1453     1792     2110     2416
  1          538      724      893     1052     1205
  2          267      360      444      523      600
  5          104      141      174      206      237
 10           49       68       85      100      115
 50            6        9       13       16       18
-------------------------------------------------------

Here's the code I used to generate the tables:

#!/usr/bin/perl -w

use strict;

# Wilson score interval for a binomial proportion, based on
# www.itl.nist.gov/div898/handbook/prc/section2/prc241.htm
#
#   X   - number of successes
#   n   - number of trials
#   z   - positive standard normal quantile: z_{1-alpha/2} for a
#         two-sided interval, z_{1-alpha} for a one-sided bound
#   end - 0 returns the lower limit, 1 returns the upper limit
sub ws
{
  my ($X, $n, $z, $end) = @_;

  my $p_hat = $X / $n;
  my $z2    = $z * $z;
  my $C = 1 / ( 1 + $z2/$n );      # scaling factor 1 / (1 + z^2/n)
  my $D = $p_hat + $z2/(2*$n);     # centre before scaling
  my $E = $z * sqrt(  $p_hat*(1-$p_hat)/$n + $z2/(4*$n*$n)  );   # half-width before scaling

  return ( $end == 0 ) ? $C*($D-$E) : $C*($D+$E);
}


# Search for the minimum number of trials n such that, with j failures
# among the n trials, the Wilson lower limit on the mechanism's success
# probability exceeds p.
#
#  p  required probability of the mechanism succeeding
#  z  z-score for a given confidence level (1.65 for one-sided 95%)
#  j  number of failures allowed.  Usually just use 0.
#
# Returns 1000001 if no n up to 1000000 qualifies.
sub ws_search
{
  my ($p, $z, $j) = @_;

  my $n;
  # Start at n = j+1 so that n > 0 and there is at least one success.
  for ( $n = $j + 1;  $n <= 1_000_000;  $n++ )
  {
    last if ws( $n-$j, $n, $z, 0 ) > $p;
  }
  return $n;
}


# Minimum success rates (failure rates from 0.01% to 50%)
my @ps = ( 0.9999, 0.9995, 0.999, 0.998, 0.997, 0.995, 0.99, 0.98, 0.95, 0.90, 0.5 );

# One-sided standard normal quantiles, rounded
# (more precisely 1.2816, 1.6449, 2.3263)
#          90%   95%   99%
my @zs = (1.28, 1.65, 2.33);

# Number of failures within a given sample
my @fails = ( 0, 1, 2, 3, 4 );



foreach my $z (@zs)   {
  print( "z=$z\n" );

  foreach my $p (@ps)   {
    printf( "p=%8.6f  ", $p );
    foreach my $fail (@fails)   {
      printf(  "%8i ", ws_search( $p, $z, $fail )  );
    }
    print( "\n" );
  }
  print( "\n" );
}

Found someone else at the office who had worked through similar problems before. He pointed out that the confidence level is one of the driving factors in determining the amount of effort to test for the underlying probability that a mechanism will fail. I think the problem, for me, is that I was only considering the failure rate p as driving the minimum number of trials. — K Neff, Oct 04 '16 at 18:10
