
I can't understand why people seek small $p$-values. I mean, the $p$-value is the smallest level at which $H_0$ can be rejected. By making the level small we minimize the probability of a type I error, which, as I see it, means we are getting more evidence against $H_0$, since the probability of accidentally rejecting $H_0$ when it is true decreases $\Rightarrow$ the probability of accepting $H_0$ when it is true becomes bigger, because $P_{H_0}(\text{test} = 1)=1-P_{H_0}(\text{test} = 0)$.

When the $p$-value is small we reject $H_0$, but it seems to me it should be the other way around, because a small $p$-value corresponds to a higher probability of accepting $H_0$ when it is true.

I saw the answers that were suggested... but none of them helps. I know the 'interpretation' of the $p$-value, and I can picture it as an area under a curve, but I can't see how that relates to what I wrote above.

user13

3 Answers


If you actually understood the explanation and definition of the $p$-value, you would not be asking this question, and you would not be making nonsensical statements such as

the small $p$-value corresponds to the higher probability of accepting $H_0$ when it is true.

This single statement reflects a fundamental misconception about hypothesis testing, one that persists among many students of statistics, as well as actual statisticians who should know better!

Let me be as clear as absolutely possible.

The frequentist model of hypothesis testing does not regard the conclusion of a test as a decision between the null and alternative hypotheses. The inference we draw from the data that is observed is never "$H_0$ is accepted."

The real decision to be contemplated whenever conducting a hypothesis test always pertains to the question, "does the data furnish sufficient evidence to suggest that the null hypothesis is not true?" Note that the question is not simply "is the null hypothesis not true?" This distinction is important, because the answer to the first question, "yes" or "no," doesn't actually mean the same thing as the answer to the second. Specifically, if the answer to the first question is "no," that does not mean we have found evidence to support that the null hypothesis is true; it only means that we lack evidence to show it is not. To repeat a familiar phrase:

Absence of evidence is not evidence of absence.

Think of it a bit like the concept of "innocent until proven guilty." The presumption of innocence is like the null hypothesis. If, even under that presumption, the evidence brought to trial suggests otherwise, then guilt can be established--but a high standard of proof must be met. If not, then guilt cannot be established. The defendant may well still be guilty of the crime, but the burden of proof was not met to convict the defendant of said crime.

Now you should be able to understand why your original statement is nonsensical. The $p$-value is a conditional probability where the condition that is assumed is the truth of $H_0$. It is calculated under this assumption; therefore, it makes no sense to ask for $$\Pr[\text{accept } H_0 \mid H_0 \text{ is true}]$$ because if you presume the truth of $H_0$, then the truth of $H_0$ is not a random variable--it is fixed, because you fixed it in your assumption.
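To make the conditioning explicit, one standard way to write the definition (for a right-tailed test, with test statistic $T$ and observed value $t_{\text{obs}}$; the symbols are mine, chosen just for illustration) is
$$p = \Pr[\,T \ge t_{\text{obs}} \mid H_0 \text{ is true}\,].$$
The assumption of $H_0$ sits to the right of the conditioning bar: the $p$-value measures how surprising data at least as extreme as yours would be under that assumption.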

The only decisions available to us are "Reject $H_0$," or "Fail to reject $H_0$." The second inference simply means we lack evidence to reject the null hypothesis, just like a prosecutor may lack evidence to convict the defendant, but it doesn't mean the defendant is truly innocent of the crime.

The $p$-value is the chance of observing a result at least as extreme as your data, assuming the null is true. For example, if I give you a fair coin--we know it is fair--and you flip it ten times and it comes up heads every time, the chance of getting this exact result is $1/1024$. Not likely, but not impossible either. The small but nonzero probability reflects the notion that even a fair coin could produce such a result.

The two-sided $p$-value is $1/512$, because you would be just as "surprised" if the fair coin had given you ten tails out of ten tries.
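If it helps to see the arithmetic, here is a minimal sketch in Python (my own illustration, not essential to the argument) that reproduces the $1/1024$ and $1/512$ figures:

```python
from math import comb

n = 10          # number of flips of a (presumed) fair coin
p_heads = 0.5

# One-sided: probability of the single most extreme outcome, ten heads out of ten
p_all_heads = comb(n, n) * p_heads**n                  # = 1/1024 ≈ 0.00098

# Two-sided: ten tails out of ten is exactly as "surprising", so include it too
p_two_sided = p_all_heads + comb(n, 0) * p_heads**n    # = 1/512 ≈ 0.00195

print(p_all_heads, p_two_sided)
```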

If your significance level for the test that the coin is not fair is set very small--say, $\alpha = 0.001$--then there is no way to reject the fairness of the coin with only $10$ trials, because there are no outcomes for which you could conclude with such a high degree of certainty that the coin is not fair. The significance level is the maximum tolerance you have for concluding the coin is biased when in fact it is not.

If that seems unreasonable--the possibility of no outcomes being in the rejection region--suppose you only flipped the coin once. How, then, could you with any confidence at all draw an inference about the coin's fairness? There is data, but it is not sufficient to say anything about the coin's fairness.
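To see why no outcome of ten flips can clear $\alpha = 0.001$, you can simply enumerate all eleven possible head counts and compute each two-sided $p$-value (again a sketch of my own, with the same conventions as above):

```python
from math import comb

n = 10
pmf = [comb(n, k) * 0.5**n for k in range(n + 1)]   # P(k heads) under fairness

def two_sided_p(k):
    # Probability of a head count at least as far from n/2 as the observed k
    return sum(pmf[j] for j in range(n + 1) if abs(j - n / 2) >= abs(k - n / 2))

alpha = 0.001
smallest = min(two_sided_p(k) for k in range(n + 1))
print(smallest)            # 0.001953125: even all heads (or all tails) exceeds alpha
print(smallest <= alpha)   # False -- the rejection region at alpha = 0.001 is empty
```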

heropup

A $P$-value is almost a rhetorical tactic. It works by granting your "opponent" (i.e. $H_0$) all the argumentative ground they ask for, and then showing that granting their premise leads to a contradiction. (Technically it's not a contradiction per se, but rather a conclusion that something that occurred was extraordinarily unlikely.)

As an example: let's say I claim I have a fair coin. You agree to place bets with me on it. If it comes up heads, you win a dollar from me, but if it comes up tails, I win a dollar from you. After 100 throws, it comes up heads just 30 times. The ensuing conversation would go something like this:

YOU: You've cheated!

ME: No I haven't.

YOU: Yes you have. That coin is obviously unfair.

ME: No it's not. This is a fair coin, and frankly I don't appreciate your tone.

YOU: All right, let's say you're telling the truth.*

ME: Thank you! Because I am.

YOU: Sure, whatever. If you're telling the truth, then we just saw something truly extraordinary.

ME: What do you mean? And remember, this is a fair coin.

YOU: Right, I got that. If your coin is actually fair, then if we played a game with 100 flips many times, you'd only do this well about $0.004\%$** of the time.

ME: Hmm. Well. I suppose unlikely things happen, right?

YOU: Actually, they don't. That's why we call them unlikely. ::calls police::

(*) Note the null hypothesis here.

(**) Note the $P$-value here (a quick numerical check appears at the end of this answer).

It's in this sense that low $P$-values give you strong evidence; it's because you've assumed the opposite of the thing you're trying to prove all along.
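For anyone who wants to verify the figure marked (**), here is a short check (variable names are mine, just for illustration): the probability that a fair coin gives at most $30$ heads in $100$ flips.

```python
from math import comb

n = 100
# P(at most 30 heads in 100 fair flips): the chance the "fair" coin does
# at least this well for its owner, who wins on tails
p_at_most_30 = sum(comb(n, k) for k in range(31)) * 0.5**n
print(p_at_most_30)        # ≈ 3.9e-05, i.e. about 0.004% of the time
```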


While browsing the posts flaired Mathematics on r/askscience, I came across the same question! I quote u/dampew.


I've been doing statistical methods research lately and this comes up a lot.

First, it's important to understand the definition of the p-value. The p-value tells you the odds that a result might occur by random chance, according to some model. It does NOT tell you the likelihood that a result is true or valid. It is NOT necessarily a good representation of the quality of work. An example of a p-value might be, "I flipped a coin 5 times and it landed heads 5 times. The odds of that occurring by random chance is 1 in 32. Therefore I believe my coin is loaded heads with p = 0.03." But this doesn't mean that the coin has a 97% chance of being loaded; it still could have happened by chance, you know? Especially if you try to do the test 20 times and don't tell anyone about the other 19! This is the first problem I'll talk about below.

P-values can be misused in several ways:

  1. The biggest problem in fields like sociology is that this has contributed to a (so-called) reproducibility crisis. There are thousands of people doing studies and publishing results with p-values of, say, p<0.05. Now imagine that a thousand studies have been published with p < 0.05; the odds are good that many of these results (at least 1 in 20) just happened to have occurred by random chance. It's actually worse than it seems, because the studies that aren't significant don't get published. For example, if every researcher performs several studies and publishes only 1 in 4 of the studies that they attempt, then 5 out of 20 of their studies will eventually get published. If they're using 1 in 20 (p<0.05) as their threshold for significance, then we would expect 1 in 20 of their attempted results to be a false positive; if they publish 5 out of 20, and 1 in 20 is false, then 1 in 5 of their publications is false. This is a general problem in the fields of sociology and psychology; many of the studies are false because of low significance thresholds.

  2. There's something called p-value hacking. Say you're looking for a result at the p=0.05 level (1 in 20), and you study or measure your data in 20 different ways. Some of those methods will appear to give you more significant results than others due to random chance, and if you try 20 different methods then you might expect one of them to be significant at the 1 in 20 level. This seems like it should be obvious, but it can come up innocently if you get a bunch of data and try to analyze various aspects of it.

  3. Very small P-values are often very difficult to calculate accurately. If you create a statistical model based on some approximations that works for the majority of cases, there's no guarantee that the model will hold true for the extreme tails of the distribution -- which, ironically, are the most important regions for statistical studies. For instance, if you're looking for something in the human genome, you might find a p-value of 1e-12; but there are a few billion basepairs in the human genome, so if it's actually a p-value of only 1e-9 instead of 1e-12, that would dramatically change the genome-wide significance of your finding.

Edit: Of course, there are other issues. Like people just using a totally wrong model for their calculations. And it's common for people to confuse p-values with effect sizes. I see this in talks all the time and it's really annoying.
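As a small supplement to the quoted points (this sketch is mine, not u/dampew's): the 1-in-32 figure from the five-flip example, and the way trying $20$ analyses inflates the false-positive rate in item 2, can both be checked directly. The simulation relies on the standard fact that, under a true null hypothesis with a continuous test statistic, $p$-values are uniform on $[0,1]$.

```python
import random
from math import comb

# Five heads in five flips of a fair coin: 1/32 ≈ 0.03, as in the quote
print(comb(5, 5) * 0.5**5)                 # 0.03125

# If an analyst tries 20 independent looks at pure-noise data, how often does
# at least one of them come out "significant" at the 0.05 level?
print(1 - 0.95**20)                        # ≈ 0.64 analytically

random.seed(0)
n_runs, n_looks, alpha = 100_000, 20, 0.05
hits = sum(
    any(random.random() < alpha for _ in range(n_looks))   # null p-values ~ Uniform(0, 1)
    for _ in range(n_runs)
)
print(hits / n_runs)                       # ≈ 0.64 by simulation
```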