The value of p
What we lose if we abandon p values
I have spent a fair amount of time thinking and writing on both sides of the frequentist/Bayesian divide. In recent years, thanks to reading (and, importantly, re-reading) the perspectives of Mayo, Cox, Spanos, and Wasserman, I’ve become convinced of the importance of a frequentist perspective on statistical evidence, but I’ve not articulated precisely why. This post represents my first written draft doing so.
This topic is controversial, but for no good reason. Why? I’m convinced that most experienced scientists and statisticians have internalized statistical insights that frequentist statistics attempts to formalize: how you can be fooled by randomness; how what we see can be the result of biasing mechanisms; the importance of understanding sampling distributions. In typical scientific practice, the “null hypothesis significance test” (NHST) has taken the place of these insights. NHST takes the form of frequentist significance testing, but not its function, so experienced scientists and statisticians rightly shun it. But they have so internalized its function that they can call for the general abolition of significance testing. At this point, we risk being sidetracked by useless labels (what is the difference between significance testing and NHST?). To prevent this, I will simply describe what I think is useful about p values and what we would lose in their abolishment.
Here is my basic point: it is wrong to consider a p value as yielding an inference. It is better to think of it as affording critique of potential inferences.
Suppose you were discussing the ongoing COVID-19 pandemic with a friend, who you now learn is skeptical of getting the vaccine. You probe their reasons why; as it happens, they read something “on the internet”. Your reply is simple: “Do you believe everything you read on the internet?” You start to show your friend other claims — ones that you both agree are false — made in various places on the internet.
Your point is simple: A person who believed everything they read on the internet could be misled. Put another way, their standard of evidence (“it was on the internet”) is too weak. This is not an unfamiliar line of reasoning.
Don’t believe everything you see.
This sort of skepticism is baked into statistical thinking: confounding; Simpson’s paradox; Berkson’s paradox; survivor bias; immortal time bias; other biases in sampling; critiques of naïve or simplistic visualizations; demand characteristics; the list goes on.
Don’t believe everything you see.
What all of these biases and paradoxes have in common is that what is apparent can be caused by something trivial: aggregation over an unknown third variable, systematic elimination of a part of a sample, etc. We learn about these biases through examples and toy problems (“statistical models”) and we learn to be skeptical of what we see.
We learn to consider how what we see might arise under trivial conditions, or where our preferred explanation is wrong. As a critical aspect of scientific thinking, we learn to adjust our evidential bar for these biases and paradoxes.
An important aspect of this skepticism is that it is often one-way: it does not need to involve a plausible alternative. Ideally, we learn to be humble in the face of our own limited imagination: e.g., our reasoning can be flawed even if we (or a reviewer) cannot put a finger on, say, a particular third variable explanation. Sometimes we may need to act in the face of limited, flawed evidence, but that does not mean that the evidence was not limited or flawed.
This one-way skepticism isn’t terribly well accounted for by Bayesian reasoning, which is vested in weighing possibilities that we can put concrete probability to (see also Gelman and Shalizi, 2013). This is not anything against Bayes; it is merely to say that it is not complete as an account of scientific reasoning. Bayesian reasoning is particularly susceptible to limited imagination; a rational (Bayesian) reasoner who can only imagine one possibility will remain convinced of that possibility, even if data contradicts it: the data can always be discounted.
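To make the limited-imagination point concrete, here is a minimal sketch of Bayes' rule over an explicitly listed set of hypotheses. The function name, the hypothesis labels, and the likelihood values are all made up for illustration; the only thing the sketch is meant to show is that a reasoner whose hypothesis space contains a single entry cannot be moved by data, however improbable those data are under that entry.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over an explicitly enumerated set of hypotheses.

    priors: dict mapping hypothesis name -> prior probability.
    likelihoods: dict mapping hypothesis name -> P(data | hypothesis).
    Only hypotheses listed in `priors` take part in the update.
    """
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}


# Illustrative numbers only: the data are very improbable under "effect".
likelihoods = {"effect": 1e-6, "no_effect": 0.2}

# A reasoner who can imagine only one hypothesis...
closed = posterior({"effect": 1.0}, likelihoods)

# ...versus one who also entertains an alternative.
open_minded = posterior({"effect": 0.5, "no_effect": 0.5}, likelihoods)

print(closed)       # the single hypothesis keeps all of the probability
print(open_minded)  # almost all of the probability moves to "no_effect"
```

The first reasoner is perfectly coherent, which is the problem: nothing inside the calculation signals that the data fit badly.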
Let us relate the logic of a p value, then, to our statistical biases. When we raise the spectre of Simpson’s paradox, we may be critiquing claimed evidence for a positive (causal) relationship: the causal relationship may be negative, we say, but there is an uncontrolled variable making the apparent relationship positive. Or, put another way: we appear to find evidence for one thing, when the opposite is true.
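A toy version of this reversal, using made-up numbers: within each of two hypothetical groups the relationship between x and y is negative, but one group is higher on both variables, so pooling the groups produces a positive slope. The `slope` helper is an ordinary least-squares slope written out by hand for the sake of the sketch.

```python
def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up data: (x values, y values) for two hypothetical groups.
group_a = ([1, 2, 3], [5, 4, 3])
group_b = ([6, 7, 8], [12, 11, 10])

slope_a = slope(*group_a)   # negative within group A
slope_b = slope(*group_b)   # negative within group B

# Ignore group membership (the uncontrolled third variable) and pool.
pooled_x = group_a[0] + group_b[0]
pooled_y = group_a[1] + group_b[1]
slope_pooled = slope(pooled_x, pooled_y)   # positive

print(slope_a, slope_b, slope_pooled)
```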
Don’t believe everything you see.
Consider the p value. We assume the opposite of what we’d like to show, then we compute the probability we’d see evidence at least as strong as what we obtained. How often¹ might we be misled, if we thought our evidence was “strong enough”?
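As a minimal sketch of that computation, assume the simplest possible toy model: observations are normal with a known standard deviation, and we would like to show that the mean is positive, so we assume it is 0. The function name, the known-sigma assumption, and the data are mine, chosen only to keep the example short.

```python
import math


def one_sided_p_value(sample, sigma=1.0):
    """P(sample mean at least this large | true mean is 0).

    Toy model: independent normal observations with known sigma.
    """
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    # Upper tail of the standard normal distribution, 1 - Phi(z).
    return 0.5 * math.erfc(z / math.sqrt(2))


# Made-up observed differences; their mean is 0.375, so z = 0.75.
differences = [0.3, -0.1, 0.8, 0.5]
p = one_sided_p_value(differences)
print(round(p, 3))
```

Here the p value is roughly 0.23: if the true mean were 0, evidence at least this strong would turn up in nearly a quarter of such samples.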
The mistake many statistical commentators make is to interpret the p value as an attempt at a quantification of evidence, or as a posterior probability. It is none of these things, nor is it meant to be. It should not even, really, be thought of as a means to make an inference (although it is, in the most simplistic interpretation of the Neyman-Pearson paradigm). It is, instead, a means to critique a potential inference.
Suppose we obtain a high p value (say, p=0.4) testing the null hypothesis that some difference was at most 0. If we’d like to infer that the difference is positive, we have to do so in the face of the knowledge that this level of evidence could arise 40% of the time even if that inference were wrong (in a simple statistical model we set up). The p value should not be thought of as licensing an inference; instead, it stands as a critique of a potential inference.
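The "40% of the time" reading can be checked by simulation. The sketch below assumes a toy model that is not specified in the text: a hypothetical sample of 25 standard-normal observations, with the true difference set to exactly 0 (the boundary of the null). It counts how often a simulated study yields a test statistic at least as large as the one that corresponds to p=0.4.

```python
import math
import random
from statistics import NormalDist

random.seed(1)

n = 25              # hypothetical sample size
observed_p = 0.4
# z statistic that corresponds to a one-sided p value of 0.4.
observed_z = NormalDist().inv_cdf(1 - observed_p)

n_sims = 20_000
at_least_as_strong = 0
for _ in range(n_sims):
    # One simulated study in which the true difference is exactly 0.
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = (sum(sample) / n) * math.sqrt(n)
    if z >= observed_z:
        at_least_as_strong += 1

fraction = at_least_as_strong / n_sims
print(round(fraction, 2))   # should land close to 0.4
```

Someone who treated this level of evidence as "strong enough" would be misled in about four of every ten such studies.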
This view shows us why an inference from a low p value can be problematic as well. Suppose that we obtained p=0.001; does this license believing the difference is positive? No. It means that we have passed a particular bar (the evidence was inconsistent with the assumption of a particular null) but there are other ways to be wrong that could threaten that inference (bad model assumptions, biases, paradoxes, etc.). We must consider those threats as well, and then (critically) remember that our statistical model is only a toy; the desired scientific inference is based on much more.
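One such threat can be sketched in a few lines. In this hypothetical setup the true difference is exactly 0, but observations below -0.5 are silently lost before analysis (a crude stand-in for survivor bias). The toy model knows nothing about the loss, so its p value comes out tiny. The cutoff, the sample size, and the known-sigma test are assumptions made for the illustration.

```python
import math
import random

random.seed(2)


def one_sided_p_value(sample, sigma=1.0):
    """One-sided p value for 'mean is 0' under a known-sigma normal model."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2))


# The true mean is 0, but values below -0.5 never reach the analyst.
kept = []
while len(kept) < 200:
    x = random.gauss(0.0, 1.0)
    if x > -0.5:
        kept.append(x)

p = one_sided_p_value(kept)
print(p < 0.001)
```

The bar set by the p value has been passed, yet the inference to a positive difference would be wrong; the bias has to be critiqued separately.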
The p value is not an inference; it affords critique of an inference.
A nonsignificant p value also does not license the inference to the null hypothesis, obviously. It doesn’t license any inferences at all, but it does tell us why such an inference can be hasty. If our design is weak, we might obtain large (“nonsignificant”) p values even when the effect size is quite large. Interestingly, p values have been critiqued for not distinguishing between evidence of absence and absence of evidence. This critique only applies to p values shorn of their purpose.
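The weak-design point can be put in numbers with a short power calculation. This is a sketch for a one-sided z test with known standard deviation; the function name, the 0.05 threshold, and the sample sizes are my choices for illustration, and 0.8 is used because a standardized effect of that size is conventionally described as large.

```python
import math
from statistics import NormalDist


def power_one_sided(effect_size, n, alpha=0.05):
    """P(p < alpha) for a one-sided z test, given the true standardized effect."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)
    # Under the true effect, the z statistic is normal with mean
    # effect_size * sqrt(n) and standard deviation 1.
    return 1 - std_normal.cdf(critical_z - effect_size * math.sqrt(n))


weak = power_one_sided(0.8, n=4)     # tiny study
strong = power_one_sided(0.8, n=50)  # larger study

print(round(weak, 2), round(strong, 2))
```

With four observations, a large true effect produces a "nonsignificant" result more often than not, which is exactly why treating such a result as evidence of absence would be hasty.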
When the logic of a p value is connected to the other aspects of statistical reasoning we take for granted, like biases and paradoxes, it seems obvious to me that it plays a necessary role in good statistical thinking. This, of course, presupposes that the p value is based on a “good” statistic, etc. (as I have argued with respect to confidence intervals, how they are constructed matters a great deal).
If we adopt the view that p values are not inferential statistics per se, but stand as important critiques of potential inferences, certain things are obvious:
- NHST is flawed, because it considers the p value an inferential statistic, missing the point. NHST, rather, offers a method to avoid critiquing one’s inference in its automaticity.
- Confidence intervals won’t solve the central problem, because the way most people interpret them has the same issue as NHST: a lack of robust self-critique. Thinking that the “plausible” values are within the interval (rather than the frequentist “values outside the CI are ruled out”-type logic) is problematic in the same way as simple rules about p values.
- p values can live alongside other ways of thinking about statistics. Frequentist critiques of Bayesian methods are important to keep Bayesian methods “open” (and to avoid being fooled) and Bayesian (or fiducial) critiques of frequentism are important to understanding statistical evidence (what is a “good” test statistic?).
- Even in situations where p values are difficult or impossible to compute, the basic logic of self-skepticism applies. p values are a method to facilitate skepticism, but as I have pointed out, not the only one.
I want to stress that none of these thoughts are new — Mayo’s new book lays out a similar, and more detailed, account. I did want to start to articulate the ideas from my own perspective.
With recent calls to ban p values (including my own, in the past), I now worry that we will lose something important. Statisticians won’t lose it, of course; they will continue to apply the basic logic (through simulation or thought experiments). What abolishing the p value will do is make the reasoning inaccessible to everyone else. This would be a Brexit-like disaster, and one that would be hard to overcome: it is the active researcher who most needs the self-critique that p values, properly interpreted, provide.
¹ There are some who deny the usefulness of the frequentist conception of probability altogether, and would deny the usefulness of the idea of sampling. I know of no working scientist who does; I take it for granted that sampling is a useful model for understanding variability and its consequences. Even those who deny the usefulness of frequentist inference typically grant frequentist sampling properties.